# FinCompress — Notebook 01: Teacher Training

**RUN ON: Colab/GPU**

## Project Overview

FinCompress is a practitioner's study in compressing a domain-pretrained language model for production inference. We start with **ProsusAI/FinBERT** — a BERT-base model further pre-trained on financial text — as our teacher, then compress it three ways:

1. **Knowledge Distillation** (Notebooks 01–02): Train a 4-layer student to mimic the teacher
2. **INT8 Quantization** (Notebook 03): Reduce weight precision from FP32 to INT8
3. **Structured Pruning** (Notebook 04): Remove entire attention heads and FFN neurons

All variants are benchmarked identically in Notebook 05.

## Why Start with FinBERT?

BERT-base was pre-trained on Wikipedia and BookCorpus — general English text. Financial text has a very different distribution: domain-specific vocabulary ("EPS beat", "EBITDA", "margin compression"), numerical reasoning, and hedged language ("may adversely affect", "subject to regulatory approval"). FinBERT was further pre-trained on financial news, earnings calls, and analyst reports, giving it:

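- Better representations of financial terms: FinBERT keeps BERT's WordPiece vocabulary, so rare terms such as "EBITDA" can still split into subwords, but further pre-training adapts those subword embeddings to financial usage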
- Contextualized representations that already encode financial sentiment signals
- Higher baseline accuracy on financial NLP tasks *before* any task-specific fine-tuning

Starting with a domain-aware teacher gives the student a richer target to distill — the teacher's soft labels encode more domain-relevant uncertainty structure than BERT-base's would.
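
To make "soft labels" concrete, here is a minimal sketch using made-up logits (not real FinBERT outputs). The hard label keeps only the most likely class, while the softmax distribution also records how close the runner-up was, which is the extra signal a student can learn from.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one sentence, ordered (negative, neutral, positive).
logits = [1.2, 2.0, -1.5]

soft_label = softmax(logits)                     # full distribution over the three classes
hard_label = soft_label.index(max(soft_label))   # index of the most likely class only

print("soft label:", [round(p, 3) for p in soft_label])
print("hard label:", hard_label)
```

In this example the hard label says only "neutral", whereas the soft label also shows that "negative" was a plausible alternative and "positive" was not.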

## Setup

**Run the Setup cell below first** (mounts Drive, syncs the repo, installs dependencies). All three steps are combined so the pip install always uses the correct path.

## Google Drive Setup

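The Setup cell mounts your Google Drive and clones the FinCompress repository into it, or pulls the latest changes if a clone already exists. It expects the following layout: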

```
# Recommended Drive structure:
# MyDrive/
# └── fincompress/
#     ├── fincompress/          ← the Python package
#     │   ├── data/
#     │   ├── checkpoints/
#     │   └── ...
#     └── requirements_colab.txt
```

After mounting, the cell changes into the project root so that relative paths such as `fincompress/data/train.csv` resolve correctly.

In [None]:
# ── Setup: mount Drive, sync repo, install deps ──────────────────────────────
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import sys, os

PROJECT_PATH = '/content/drive/MyDrive/fincompress'

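if os.path.exists(os.path.join(PROJECT_PATH, '.git')):
    print("Repo already exists, pulling latest...")
    status = os.system(f'git -C "{PROJECT_PATH}" pull')
else:
    print("Cloning fresh...")
    # Clone straight into PROJECT_PATH; git creates the directory itself
    # (cloning into an existing non-empty directory would fail).
    status = os.system(
        f'git clone https://github.com/Rohanjain2312/FinCompress.git "{PROJECT_PATH}"'
    )
# os.system does not raise on failure, so check the exit status explicitly.
if status != 0:
    raise RuntimeError(f'git exited with status {status}; fix this before continuing')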

os.chdir(PROJECT_PATH)
sys.path.insert(0, PROJECT_PATH)
print('Working directory:', os.getcwd())

# Install dependencies using absolute path (drive must be mounted first).
# numpy/pandas/scipy/matplotlib/seaborn/scikit-learn are intentionally NOT
# pinned — Colab's pre-installed versions are mutually compatible; pinning
# numpy downgrades it and breaks pre-compiled C extensions (numpy.dtype error).
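status = os.system(f'{sys.executable} -m pip install -q -r "{PROJECT_PATH}/requirements_colab.txt"')
# Only report success if pip actually succeeded.
if status != 0:
    raise RuntimeError(f'pip install failed with status {status}')
print("Dependencies installed.")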
print()
print("If this is your FIRST run on this runtime:")
print("  Runtime -> Restart session -> then re-run ALL cells from the top.")
print("  (Required so Colab reloads torch/transformers into the fresh process.)")

## Dataset: FinancialPhraseBank + FiQA-2018

We combine two complementary datasets:

| Dataset | Source | Size | Labels |
|---------|--------|------|--------|
| FinancialPhraseBank (allagree) | Expert-annotated financial news | ~2,264 | {negative, neutral, positive} |
| FiQA-2018 Sentiment | Financial microblog posts and news headlines | ~1,174 | Continuous [-1, 1] → discretized |

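**Why combine?** FinancialPhraseBank has high-quality expert labels but is small (~2K samples). FiQA provides additional diversity but uses continuous scores, which we discretize with thresholds at ±0.1: scores below -0.1 become negative, scores above +0.1 become positive, and the rest become neutral. Combining and de-duplicating the two sources gives a richer training set.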
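
A minimal sketch of the thresholding step, assuming the label ids used elsewhere in this notebook (0 = negative, 1 = neutral, 2 = positive). The function name and the treatment of scores exactly at ±0.1 are assumptions; `prepare_dataset` may handle the boundaries differently.

```python
def discretize_sentiment(score: float, threshold: float = 0.1) -> int:
    """Map a continuous score in [-1, 1] to a class id (hypothetical helper)."""
    if score < -threshold:
        return 0  # negative
    if score > threshold:
        return 2  # positive
    return 1      # neutral, including scores exactly at the threshold (assumption)

# Illustrative scores, not taken from FiQA.
examples = [-0.62, -0.05, 0.0, 0.1, 0.48]
labels = [discretize_sentiment(s) for s in examples]
print(labels)
```

A wider threshold moves more borderline examples into the neutral class, which shifts the class balance that the stratified splits then preserve.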

**Class distribution rationale:** Financial news skews neutral (most corporate announcements are factual). We use stratified splits to preserve this distribution across train/val/test. If the splits had different class mixes, validation and test F1 would not be comparable, and early stopping would be tuned to a distribution the test set does not share.
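
The idea behind a stratified split can be sketched in plain Python: split each class separately, then merge the pieces. The 80/10/10 fractions, the seed, and the function name are assumptions for illustration, not necessarily what `prepare_dataset` uses.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists that keep each class's share."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # shuffle within the class before slicing
        n_train = round(len(idxs) * fractions[0])
        n_val = round(len(idxs) * fractions[1])
        train.extend(idxs[:n_train])
        val.extend(idxs[n_train:n_train + n_val])
        test.extend(idxs[n_train + n_val:])  # remainder goes to test
    return train, val, test

# Synthetic, neutral-heavy labels (0 = negative, 1 = neutral, 2 = positive).
labels = [0] * 100 + [1] * 600 + [2] * 300
train_idx, val_idx, test_idx = stratified_split(labels)

for name, idxs in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    counts = Counter(labels[i] for i in idxs)
    print(name, {c: round(counts[c] / len(idxs), 2) for c in sorted(counts)})
```

With scikit-learn available, `train_test_split(..., stratify=labels)` applied twice achieves the same effect.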

In [None]:
# Prepare dataset (downloads from HuggingFace Hub)
!python -m fincompress.data.prepare_dataset

import pandas as pd
df_train = pd.read_csv('fincompress/data/train.csv')
df_val   = pd.read_csv('fincompress/data/val.csv')
df_test  = pd.read_csv('fincompress/data/test.csv')

label_map = {0: 'negative', 1: 'neutral', 2: 'positive'}
for split_name, df in [('train', df_train), ('val', df_val), ('test', df_test)]:
    print(f'\n{split_name} ({len(df)} samples):')
    print(df['label'].map(label_map).value_counts(normalize=True).round(3).to_string())

## Teacher Fine-Tuning

We fine-tune FinBERT on our combined dataset using a manual PyTorch training loop (no HuggingFace Trainer). Key choices:

- **AdamW** with weight decay 0.01 (bias/LayerNorm excluded from decay)
- **Linear warmup** for 10% of total steps, then linear decay — prevents large gradient updates before representations stabilize
- **Early stopping** after 3 consecutive non-improving epochs — saves compute and prevents overfitting
- **Target: val Macro F1 ≥ 0.87**. This is the project's quality bar for the teacher: below it, the soft labels pass too many of the teacher's own errors on to the student for distillation to be worthwhile
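
The first three choices can be sketched without any framework code. This is an illustration under stated assumptions, not the contents of `train_teacher`: the helper names are hypothetical, and the substrings used to detect bias and LayerNorm parameters assume Hugging Face BERT parameter naming.

```python
def split_decay_groups(named_params, weight_decay=0.01,
                       no_decay=("bias", "LayerNorm.weight")):
    """Build two optimizer groups: decayed weights, and undecayed bias/LayerNorm."""
    decayed, exempt = [], []
    for name, param in named_params:
        target = exempt if any(key in name for key in no_decay) else decayed
        target.append(param)
    return [
        {"params": decayed, "weight_decay": weight_decay},
        {"params": exempt, "weight_decay": 0.0},
    ]

def lr_multiplier(step, total_steps, warmup_frac=0.10):
    """Factor applied to the base LR: linear warmup to 1.0, then linear decay to 0.0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without a new best score."""
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def update(self, score):
        if score > self.best:
            self.best, self.bad_epochs = score, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Strings stand in for parameter tensors so the sketch runs without PyTorch.
named = [
    ("bert.encoder.layer.0.output.dense.weight", "W"),
    ("bert.encoder.layer.0.output.dense.bias", "b"),
    ("bert.encoder.layer.0.output.LayerNorm.weight", "g"),
    ("classifier.weight", "C"),
]
groups = split_decay_groups(named)
print([len(g["params"]) for g in groups])

print([lr_multiplier(s, total_steps=1000) for s in (0, 50, 100, 550, 1000)])

stopper = EarlyStopping(patience=3)
val_f1_history = [0.80, 0.85, 0.84, 0.85, 0.83]  # made-up scores
stop_flags = [stopper.update(f1) for f1 in val_f1_history]
print(stop_flags)
```

With PyTorch, the two groups can be passed directly to `torch.optim.AdamW`, and `lr_multiplier` (wrapped in a lambda that fixes `total_steps`) has the shape `torch.optim.lr_scheduler.LambdaLR` expects. `transformers.get_linear_schedule_with_warmup` provides the same schedule ready-made.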

Why Macro F1 (not accuracy)? The dataset has class imbalance (neutral >> negative). Macro F1 weights each class equally, penalizing a model that ignores the minority class.
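
A small worked example shows the gap between the two metrics. The labels below are synthetic, and the model is a deliberately degenerate one that always predicts neutral.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Illustrative split: 10 negative, 80 neutral, 10 positive.
y_true = [0] * 10 + [1] * 80 + [2] * 10
y_pred = [1] * 100  # always answers "neutral"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, classes=[0, 1, 2]):.3f}")
```

Accuracy is 0.80 here even though the model never detects a negative or positive sentence, while macro F1 is about 0.30 because two of the three per-class F1 scores are zero. `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.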

In [None]:
# Train teacher (logs to checkpoints/teacher/ and logs/teacher_training.csv)
!python -m fincompress.teacher.train_teacher

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

log_df = pd.read_csv('fincompress/logs/teacher_training.csv')

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(log_df['epoch'], log_df['train_loss'], 'o-', label='Train Loss', color='steelblue')
axes[0].plot(log_df['epoch'], log_df['val_loss'], 's-', label='Val Loss', color='coral')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Cross-Entropy Loss')
axes[0].set_title('Teacher Training and Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(log_df['epoch'], log_df['val_f1'], 'o-', color='mediumseagreen')
axes[1].axhline(y=0.87, color='red', linestyle='--', label='Target F1 = 0.87')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Val Macro F1')
axes[1].set_title('Teacher Val Macro F1 by Epoch')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print(f"Best val F1: {log_df['val_f1'].max():.4f}")

In [None]:
import json
import numpy as np
import pandas as pd  # used for read_csv below; imported here so the cell runs on its own
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.utils.data import DataLoader, Dataset

class SimpleDS(Dataset):
    """Tokenizes one row at a time and pads every example to `maxlen` tokens."""
    def __init__(self, df, tok, maxlen):
        self.texts = df['text'].tolist()
        self.labels = df['label'].tolist()
        self.tok = tok
        self.maxlen = maxlen
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, i):
        enc = self.tok(self.texts[i], max_length=self.maxlen, padding='max_length',
                       truncation=True, return_tensors='pt')
        # return_tensors='pt' adds a batch dimension of 1; squeeze it so the
        # DataLoader can stack examples into (batch, maxlen) tensors.
        return {
            'input_ids': enc['input_ids'].squeeze(0),
            'attention_mask': enc['attention_mask'].squeeze(0),
            'label': torch.tensor(self.labels[i]),
        }

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('fincompress/checkpoints/teacher/tokenizer')
teacher = AutoModelForSequenceClassification.from_pretrained('fincompress/checkpoints/teacher').to(device)
teacher.eval()

df_val = pd.read_csv('fincompress/data/val.csv')
loader = DataLoader(SimpleDS(df_val, tokenizer, 128), batch_size=64)

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in loader:
        out = teacher(batch['input_ids'].to(device), attention_mask=batch['attention_mask'].to(device))
        all_preds.extend(out.logits.argmax(dim=-1).cpu().tolist())
        all_labels.extend(batch['label'].tolist())

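# Fix the class order so rows/columns always line up with label_names,
# even if some class never appears in this split.
cm = confusion_matrix(all_labels, all_preds, labels=[0, 1, 2])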
label_names = ['negative', 'neutral', 'positive']
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=label_names, yticklabels=label_names, cmap='Blues', ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('Teacher Confusion Matrix (Val Set)')
plt.tight_layout()
plt.show()

## Confusion Matrix Interpretation

The confusion matrix shows which sentiment classes are hardest to distinguish. On financial sentiment data the errors usually fall into this pattern:

- **Negative ↔ Neutral**: Typically the most common error. Many negative financial statements use hedged or understated language ("results were below expectations") that sits close to the neutral boundary in representation space.
- **Positive ↔ Neutral**: Less common. Positive announcements tend to use more explicit language ("record revenues", "exceeds estimates") that is more distinctly positive.
- **Negative ↔ Positive**: Rare. Direct sentiment opposition is usually linguistically unambiguous.

This pattern tells us what to watch for in the student: if distillation degrades accuracy, the loss is most likely to show up first in negative recall, since negative is the minority class with the softest boundary against neutral. Check the heatmap above against this expectation before moving on.
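
Per-class recall can be read straight off a confusion matrix whose rows are true labels, which is the layout `sklearn.metrics.confusion_matrix` returns. The sketch below uses invented counts for a teacher and a student purely to show the comparison; they are not results from this project, and the helper name is hypothetical.

```python
def per_class_recall(cm, names):
    """Recall per class: diagonal count divided by the row (true-label) total."""
    recalls = {}
    for i, name in enumerate(names):
        row_total = sum(cm[i])
        recalls[name] = cm[i][i] / row_total if row_total else 0.0
    return recalls

label_names = ["negative", "neutral", "positive"]

# Illustrative counts only (rows = true class, columns = predicted class).
teacher_cm = [[40, 9, 1], [6, 180, 4], [1, 7, 82]]
student_cm = [[33, 15, 2], [7, 178, 5], [1, 9, 80]]

teacher_recall = per_class_recall(teacher_cm, label_names)
student_recall = per_class_recall(student_cm, label_names)

# Recall lost per class when moving from teacher to student.
drops = {name: teacher_recall[name] - student_recall[name] for name in label_names}
worst = max(drops, key=drops.get)

for name in label_names:
    print(f"{name:>8}: teacher {teacher_recall[name]:.3f}  student {student_recall[name]:.3f}")
print("largest recall drop:", worst)
```

The same function accepts the NumPy array stored in `cm` above, since it only relies on indexing and row sums.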