# Phase 1: Enhanced BERT Training with Input Augmentation

This notebook fine-tunes a BERT-style classifier (`hfl/chinese-roberta-wwm-ext`) on the enhanced dataset, in which each text carries a market-context prefix.

**Key improvements:**
1. Input Augmentation: Add market state prefixes (e.g., `[Strong Rally]`, `[Sharp Decline]`)
2. Class Weighting: Handle severe class imbalance with automatic weights
3. Optimized hyperparameters: Lower LR, longer sequences, early stopping
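To make the first item concrete, the sketch below shows one way a market-state prefix could be prepended to a text. Only the `[Strong Rally]` and `[Sharp Decline]` tags come from this notebook. The function names, the percentage thresholds, and the three intermediate tags are hypothetical, and the enhanced CSVs used here may have been built differently.

```python
def market_state(pct_change: float) -> str:
    """Map a recent price change (in percent) to a coarse market-state tag.

    The thresholds and the three middle tags are illustrative assumptions.
    """
    if pct_change >= 1.0:
        return "Strong Rally"
    if pct_change >= 0.3:
        return "Mild Rally"
    if pct_change > -0.3:
        return "Sideways"
    if pct_change > -1.0:
        return "Mild Decline"
    return "Sharp Decline"


def add_prefix(text: str, pct_change: float) -> str:
    """Prepend the bracketed market-state tag to the raw text."""
    return f"[{market_state(pct_change)}] {text}"


example = add_prefix("Fed holds rates steady", 1.4)
print(example)
```

Keeping the tag in plain text means the tokenizer needs no new special tokens, at the cost of spending a few positions of the sequence budget on the prefix.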

**Expected results:**
- Macro F1: 0.35-0.45 (vs baseline 0.16)
- Class 3/4/5 F1: > 0.10 (vs baseline 0.00)
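Macro F1 is the unweighted mean of the per-class F1 scores, so every class that is never predicted correctly contributes a zero to the average. The small example below illustrates the effect on a six-class problem. The per-class scores are made up for illustration and are not results from this project.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Hypothetical per-class F1 scores for classes 0-5.
collapsed = [0.55, 0.25, 0.20, 0.00, 0.00, 0.00]  # rare classes never hit
recovered = [0.55, 0.25, 0.20, 0.10, 0.10, 0.10]  # rare classes reach 0.10

print(f"Rare classes at 0.00: macro F1 = {macro_f1(collapsed):.3f}")
print(f"Rare classes at 0.10: macro F1 = {macro_f1(recovered):.3f}")
```

Lifting three classes from 0.00 to 0.10 raises the macro average by 0.05 without any change to the majority classes, which is why the rare-class target matters as much as the headline number.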

## Step 1: Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Install/upgrade dependencies
!pip install -U transformers datasets evaluate accelerate huggingface_hub -q

## Step 2: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Step 3: Clone Repository (First Time Only)

In [None]:
# Clone repo (skip if already cloned)
# An absolute path keeps this cell safe to re-run after the %cd below.
import os

REPO_DIR = '/content/Graduation_Project'
if not os.path.exists(REPO_DIR):
    !git clone https://github.com/Caria-Tarnished/Graduation_Project.git {REPO_DIR}
else:
    print("Repository already exists, pulling latest changes...")
    !git -C {REPO_DIR} pull

%cd {REPO_DIR}

## Step 4: Verify Enhanced Data in Drive

**IMPORTANT**: Make sure you have uploaded these files to your Google Drive:
- `/content/drive/MyDrive/Graduation_Project/data/processed/train_enhanced.csv`
- `/content/drive/MyDrive/Graduation_Project/data/processed/val_enhanced.csv`
- `/content/drive/MyDrive/Graduation_Project/data/processed/test_enhanced.csv`

In [None]:
# Check if enhanced data exists
import os

DATA_DIR = '/content/drive/MyDrive/Graduation_Project/data/processed'
files_to_check = ['train_enhanced.csv', 'val_enhanced.csv', 'test_enhanced.csv']

print("Checking for enhanced data files...")
all_exist = True
for filename in files_to_check:
    filepath = os.path.join(DATA_DIR, filename)
    exists = os.path.exists(filepath)
    status = "[OK]" if exists else "[MISSING]"
    print(f"{status} {filepath}")
    if not exists:
        all_exist = False

if all_exist:
    print("\n[OK] All enhanced data files found!")
else:
    print("\n[WARNING] Some files are missing. Please upload them to Google Drive first.")
    print("\nLocal files location: E:\\Projects\\Graduation_Project\\data\\processed\\*_enhanced.csv")

## Step 5: Preview Enhanced Data

In [None]:
import pandas as pd

# Load a sample
train_path = '/content/drive/MyDrive/Graduation_Project/data/processed/train_enhanced.csv'
df = pd.read_csv(train_path)

print(f"Training samples: {len(df)}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label_multi_cls'].value_counts().sort_index())

print(f"\n{'='*80}")
print("Sample enhanced texts:")
print(f"{'='*80}")
for i in range(3):
    print(f"\n{i+1}. Label: {df.iloc[i]['label_multi_cls']}")
    print(f"   Original: {df.iloc[i]['text'][:80]}...")
    print(f"   Enhanced: {df.iloc[i]['text_enhanced'][:120]}...")

## Step 6: Train Enhanced Model

**Training configuration:**
- Model: `hfl/chinese-roberta-wwm-ext`
- Epochs: 5
- Learning rate: 1e-5
- Max length: 384 (increased for longer context)
- Batch size: 16 (effective 32 with gradient accumulation)
- Class weighting: Auto (handles imbalance)
- Early stopping: patience of 3 evaluations
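The `auto` class-weighting mode is implemented inside `scripts/modeling/bert_finetune_cls.py`, which is not shown in this notebook. One common scheme, and the one assumed in this sketch, is inverse-frequency ("balanced") weighting, where class `c` gets weight `N / (K * n_c)` for `N` samples, `K` classes, and `n_c` samples in class `c`. The label counts below are invented for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}


# Hypothetical, heavily imbalanced label list (not the real distribution).
labels = [0] * 700 + [1] * 150 + [2] * 100 + [3] * 10 + [4] * 10 + [5] * 30
weights = balanced_class_weights(labels)
for cls, w in weights.items():
    print(f"class {cls}: weight {w:.2f}")
```

Under this scheme a class with 10 samples out of 1000 is weighted roughly 70 times more heavily than a class with 700, so a single rare-class error costs far more than a majority-class one. Whether the script uses exactly this formula should be checked against its source.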

In [None]:
# Set output directory
OUTPUT_DIR = '/content/drive/MyDrive/Graduation_Project/experiments/bert_enhanced_v1'

# Training command
!python scripts/modeling/bert_finetune_cls.py \
  --train_csv /content/drive/MyDrive/Graduation_Project/data/processed/train_enhanced.csv \
  --val_csv /content/drive/MyDrive/Graduation_Project/data/processed/val_enhanced.csv \
  --test_csv /content/drive/MyDrive/Graduation_Project/data/processed/test_enhanced.csv \
  --output_dir {OUTPUT_DIR} \
  --label_col label_multi_cls \
  --model_name hfl/chinese-roberta-wwm-ext \
  --class_weight auto \
  --epochs 5 \
  --lr 1e-5 \
  --max_length 384 \
  --train_bs 16 \
  --eval_bs 32 \
  --gradient_accumulation_steps 2 \
  --warmup_ratio 0.06 \
  --weight_decay 0.01 \
  --eval_steps 100 \
  --save_steps 100 \
  --early_stopping_patience 3
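
The flags above imply an effective batch size and a warmup length that are easy to work out by hand. The helper below is a rough sketch of that arithmetic. `schedule_summary` is a hypothetical name, `n_train=10_000` is a placeholder rather than the real dataset size, and the exact step counts depend on how the training script and its trainer round partial batches.

```python
import math


def schedule_summary(n_train, train_bs=16, grad_accum=2, epochs=5, warmup_ratio=0.06):
    """Derive approximate optimizer-step counts from the training flags."""
    effective_bs = train_bs * grad_accum
    steps_per_epoch = math.ceil(n_train / effective_bs)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_bs, steps_per_epoch, total_steps, warmup_steps


# Placeholder size; substitute len(df) from Step 5 for the real value.
effective_bs, steps_per_epoch, total_steps, warmup_steps = schedule_summary(n_train=10_000)
print(f"Effective batch size: {effective_bs}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup steps: {warmup_steps}")
```

If the script counts `--eval_steps` in optimizer steps and patience in evaluations (an assumption), training stops after about 300 steps without improvement on the monitored metric.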

## Step 7: Check Results

In [None]:
import json

OUTPUT_DIR = '/content/drive/MyDrive/Graduation_Project/experiments/bert_enhanced_v1'

# Load metrics
with open(f'{OUTPUT_DIR}/metrics_val.json', 'r') as f:
    val_metrics = json.load(f)

with open(f'{OUTPUT_DIR}/metrics_test.json', 'r') as f:
    test_metrics = json.load(f)

print("="*80)
print("Validation Metrics")
print("="*80)
print(f"Accuracy: {val_metrics['eval_accuracy']:.4f}")
print(f"Macro F1: {val_metrics['eval_macro_f1']:.4f}")
print(f"Loss: {val_metrics['eval_loss']:.4f}")

print("\n" + "="*80)
print("Test Metrics")
print("="*80)
print(f"Accuracy: {test_metrics['eval_accuracy']:.4f}")
print(f"Macro F1: {test_metrics['eval_macro_f1']:.4f}")
print(f"Loss: {test_metrics['eval_loss']:.4f}")

# Compare with baseline
baseline_f1 = 0.163
improvement = (test_metrics['eval_macro_f1'] - baseline_f1) / baseline_f1 * 100
print("\n" + "="*80)
print("Comparison with Baseline")
print("="*80)
print(f"Baseline Macro F1: {baseline_f1:.4f}")
print(f"Enhanced Macro F1: {test_metrics['eval_macro_f1']:.4f}")
print(f"Improvement: {improvement:+.1f}%")

In [None]:
# Show classification report
with open(f'{OUTPUT_DIR}/report_test.txt', 'r') as f:
    report = f.read()

print("="*80)
print("Classification Report (Test Set)")
print("="*80)
print(report)

## Step 8: Analyze Predictions

In [None]:
# Load predictions
import pandas as pd
from sklearn.metrics import confusion_matrix

OUTPUT_DIR = '/content/drive/MyDrive/Graduation_Project/experiments/bert_enhanced_v1'
pred_df = pd.read_csv(f'{OUTPUT_DIR}/pred_test.csv')

print("Prediction distribution:")
print(pred_df['pred'].value_counts().sort_index())

# Rows are true labels, columns are predicted labels
print("\nConfusion matrix (rows = true, columns = predicted):")
cm = confusion_matrix(pred_df['label'], pred_df['pred'])
print(cm)

# Check if Class 3/4/5 are being predicted
rare_classes = [3, 4, 5]
for cls in rare_classes:
    count = (pred_df['pred'] == cls).sum()
    print(f"\nClass {cls} predictions: {count}")
    if count > 0:
        print("[OK] Model is predicting this class!")
    else:
        print("[WARNING] Model never predicts this class")
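
A non-zero prediction count does not by itself mean a rare class has a useful F1 score. As a cross-check, per-class F1 can be derived directly from the confusion matrix. The sketch below uses plain Python on a toy 3-class matrix with made-up numbers; `per_class_f1` is a hypothetical helper, and the matrix from the cell above could be passed in as `cm.tolist()`.

```python
def per_class_f1(cm):
    """Compute per-class F1 from a square confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j,
    the same layout sklearn.metrics.confusion_matrix returns.
    """
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


# Toy matrix (made-up numbers); class 2 is never predicted.
toy_cm = [
    [8, 2, 0],
    [3, 7, 0],
    [4, 1, 0],
]
scores = per_class_f1(toy_cm)
print([round(s, 3) for s in scores])
```

A class whose column is all zeros gets precision, recall, and F1 of 0, which is the failure mode the baseline showed for Class 3/4/5.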

## Step 9: Download Results (Optional)

If you want to download the results to your local machine:

In [None]:
# Create a zip file of results
import shutil

OUTPUT_DIR = '/content/drive/MyDrive/Graduation_Project/experiments/bert_enhanced_v1'
zip_path = '/content/bert_enhanced_v1_results'

# Copy only small files (metrics, reports, predictions)
!mkdir -p {zip_path}
!cp {OUTPUT_DIR}/metrics_*.json {zip_path}/
!cp {OUTPUT_DIR}/report_*.txt {zip_path}/
!cp {OUTPUT_DIR}/pred_*.csv {zip_path}/
!cp {OUTPUT_DIR}/eval_results.json {zip_path}/ 2>/dev/null || true

# Create zip
shutil.make_archive('/content/bert_enhanced_v1_results', 'zip', zip_path)

print("[OK] Results packaged!")
print("Download: /content/bert_enhanced_v1_results.zip")

# Download using Colab's file browser or:
from google.colab import files
files.download('/content/bert_enhanced_v1_results.zip')

## Next Steps

Based on the results:

### If Macro F1 > 0.35
- Proceed to Phase 2: Try a finance-domain pre-trained model (`mengzi-bert-base-fin`)
- Implement Focal Loss for better minority class handling
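
Focal loss scales the cross-entropy of each sample by `(1 - p_t) ** gamma`, where `p_t` is the predicted probability of the true class, so confidently correct samples contribute little and hard samples dominate the gradient. The sketch below is a pure-Python numeric illustration for a single sample, not a drop-in training loss. The function names are hypothetical, `gamma=2.0` is a commonly used value, and a real implementation would be written against the training framework's tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    weight = alpha[target] if alpha is not None else 1.0
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([4.0, 0.0, 0.0], target=0)                 # confident, correct
hard = focal_loss([0.0, 0.0, 4.0], target=0)                 # confident, wrong
plain_ce = focal_loss([4.0, 0.0, 0.0], target=0, gamma=0.0)  # gamma=0 is cross-entropy

print(f"easy sample, focal loss:    {easy:.5f}")
print(f"easy sample, cross-entropy: {plain_ce:.5f}")
print(f"hard sample, focal loss:    {hard:.5f}")
```

The optional `alpha` list plays the same role as class weights, so focal loss can be combined with the weighting already used in this phase.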

### If Macro F1 < 0.30
- Revisit data quality (check label correctness)
- Simplify to 3-class problem (bearish/neutral/bullish only)
- Try different labeling thresholds

### If Class 3/4/5 still have F1 = 0
- Try manual class weights: `{0:0.5, 1:2.0, 2:2.0, 3:20.0, 4:20.0, 5:3.0}`
- Consider merging Class 3/4 into a single "priced-in" class
- Generate synthetic samples using LLM
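
Merging Class 3/4 amounts to relabeling the data before training. The sketch below shows one way to collapse two label ids and reindex the result so that ids stay contiguous. `merge_labels` is a hypothetical helper, and the new id assignment is derived from the labels actually present in the input, so it should be computed once on the training set and reused for validation and test.

```python
def merge_labels(labels, merge=(3, 4)):
    """Collapse the classes in `merge` into one and reindex to contiguous ids."""
    keep = min(merge)
    collapsed = [keep if y in merge else y for y in labels]
    # Reindex so the remaining ids run 0..K-1 without gaps.
    remap = {old: new for new, old in enumerate(sorted(set(collapsed)))}
    return [remap[y] for y in collapsed], remap


new_labels, remap = merge_labels([0, 1, 2, 3, 4, 5, 3, 4])
print(new_labels)
print(remap)
```

After merging, the former Class 5 becomes id 4, so any manual class-weight dictionary has to be rewritten for the five remaining classes.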