
# Sentence Classification (Sentiment) — Linear Probe vs Full Fine-Tuning on **TweetEval: Sentiment**

**Audience:** 4th-year Computer Science students  
**Task:** Multi-class sentiment classification (**negative / neutral / positive**) on the **TweetEval** benchmark (subset: `sentiment`).

You'll build and compare two approaches using a Hugging Face encoder:
1. **Linear Probe (Frozen Encoder):** Freeze the transformer encoder and train only a small classification head.
2. **Full Fine-Tuning:** Unfreeze the encoder and fine-tune end-to-end.

We'll evaluate both on the same held-out test set and plot their accuracy and macro-F1 side by side.



## 0) Setup & Reproducibility

Run this cell to import the libraries and set the random seed; installing dependencies is optional.  
If you are running in a managed environment (e.g., Colab), uncomment the `%pip` line.


In [None]:

# If needed, uncomment to install:
# %pip install -U transformers datasets accelerate evaluate scikit-learn matplotlib

import os, random, time, json
import numpy as np

import evaluate
import torch
from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

SEED = 42
def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()
print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())



## 1) Configuration

Tweak hyperparameters here. For a quick run on CPU, set `subset_fraction` to a value such as `0.3`; set it to `None` to use the full dataset.


In [None]:

CONFIG = {
    "dataset_name": "tweet_eval",
    "dataset_subset": "sentiment",   # 3-way: negative(0), neutral(1), positive(2)
    "text_col": "text",
    "label_col": "label",
    "labels": ["negative", "neutral", "positive"],
    "model_name": "distilbert-base-uncased",
    "max_length": 128,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "epochs_probe": 2,           # linear-probe training epochs
    "epochs_finetune": 3,        # full finetune epochs
    "learning_rate_probe": 5e-4, # higher since only head trains
    "learning_rate_finetune": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "subset_fraction": 0.3,      # None for full data; use fraction like 0.3 for speed
    "output_dir": "checkpoints_tweeteval_sentiment"
}
print(json.dumps(CONFIG, indent=2))
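To get a feel for what these settings imply, the sketch below works out the number of optimizer steps and the warmup length, then evaluates a linear warmup/decay schedule at a few steps. It assumes a single device, no gradient accumulation, and a linear scheduler; the training-set size of 10,000 is a made-up figure for illustration, not the size of TweetEval, and both function names are introduced here.

```python
import math

def schedule_summary(n_train, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps) for one device."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Hypothetical training-set size; the other values mirror the fine-tuning CONFIG entries.
per_epoch, total, warmup = schedule_summary(
    n_train=10_000, batch_size=16, epochs=3, warmup_ratio=0.06
)
print(per_epoch, total, warmup)
for s in (0, warmup, total // 2, total):
    print(s, f"{linear_lr(s, 2e-5, warmup, total):.2e}")
```

With only a few hundred warmup steps, the learning rate reaches its peak early in the first epoch and spends the rest of training decaying.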



## 2) Load the **TweetEval: Sentiment** Dataset

We use the **TweetEval** benchmark (not GLUE). The `sentiment` subset has labels: 0=negative, 1=neutral, 2=positive.  
Splits: `train`, `validation`, `test`.


In [None]:

raw = load_dataset(CONFIG["dataset_name"], CONFIG["dataset_subset"])

# Optionally downsample for a quick demo run
subset_fraction = CONFIG["subset_fraction"]
if subset_fraction is not None and 0 < subset_fraction < 1:
    def take_fraction(dset, frac):
        n = min(len(dset), max(30, int(len(dset) * frac)))  # at least 30 examples, capped at the split size
        return dset.shuffle(seed=SEED).select(range(n))
    raw = DatasetDict({
        "train": take_fraction(raw["train"], subset_fraction),
        "validation": take_fraction(raw["validation"], subset_fraction),
        "test": raw["test"]  # keep full test for better generalization measurement
    })

raw
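Class balance matters when interpreting accuracy, so it can help to look at the label distribution of each split before training. A minimal sketch, using a short made-up label list in place of `raw["train"]["label"]`; `label_distribution` is a helper name introduced here:

```python
from collections import Counter

def label_distribution(labels, names):
    """Map each class name to its share of the examples."""
    counts = Counter(labels)
    n = len(labels)
    return {name: counts.get(i, 0) / n for i, name in enumerate(names)}

# Toy labels for illustration only; in the notebook, pass raw["train"]["label"].
toy_labels = [0, 1, 1, 2, 1, 2, 0, 1, 2, 1]
dist = label_distribution(toy_labels, ["negative", "neutral", "positive"])
for name, share in dist.items():
    print(f"{name:>8}: {share:.0%}")
```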



## 3) Tokenization

We use the tokenizer associated with the chosen encoder. Tweets are short, so we truncate at `max_length` tokens and let the data collator pad each batch only up to its longest sequence.


In [None]:

tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"], use_fast=True)

def tokenize_fn(batch):
    return tokenizer(batch[CONFIG["text_col"]], truncation=True, max_length=CONFIG["max_length"])

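# Drop any columns other than text and label. The raw text column stays in the dataset;
# by default the Trainer removes columns the model does not accept before collation.
remove_cols = [c for c in raw["train"].column_names if c not in (CONFIG["text_col"], CONFIG["label_col"])]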
tokenized = raw.map(tokenize_fn, batched=True, remove_columns=remove_cols)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

num_labels = len(CONFIG["labels"])
label_names = CONFIG["labels"]
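`DataCollatorWithPadding` pads each batch to the length of its longest sequence rather than to `max_length`. The plain-Python sketch below mimics that behaviour on made-up token ids to show how much padding it saves; `pad_batch` is an illustrative helper, not part of the `transformers` API, and a padding id of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0, pad_to=None):
    """Right-pad token-id lists to a common length and build attention masks."""
    target = pad_to if pad_to is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return input_ids, attention_mask

# Made-up token ids standing in for three tokenized tweets.
batch = [[101, 2023, 102], [101, 2307, 2154, 999, 102], [101, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)      # pad to the longest sequence in the batch
fixed_ids, _ = pad_batch(batch, pad_to=128)       # pad to a fixed max_length

print(len(dynamic_ids[0]), len(fixed_ids[0]))
print(dynamic_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the encoder ignores the padded positions.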



## 4) Metrics

We report **accuracy** and **macro-F1**, the unweighted mean of the per-class F1 scores, which counts all three classes equally regardless of how frequent they are.


In [None]:

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "macro_f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }
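To make the difference between the two metrics concrete, here is a from-scratch computation on a small made-up example in which a classifier always predicts the majority class. It is a sketch of the standard definitions, not the code path used by `evaluate`, and both function names are introduced here.

```python
def simple_accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy example: 8 neutral tweets, 1 negative, 1 positive; the model always says "neutral".
y_true = [1] * 8 + [0, 2]
y_pred = [1] * 10
acc = simple_accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, num_classes=3)
print(f"accuracy: {acc:.3f} | macro-F1: {mf1:.3f}")
```

Accuracy rewards the majority-class guess, while macro-F1 is pulled down by the two classes the model never predicts.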



### Helper: Confusion Matrix


In [None]:

def plot_confusion_matrix(y_true, y_pred, title="Confusion Matrix", labels=None):
    if labels is None:
        labels = [str(i) for i in sorted(np.unique(y_true))]
    # Rows are true classes, columns are predicted classes; class ids are assumed to be 0..K-1.
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(labels))))
    fig, ax = plt.subplots()
    im = ax.imshow(cm)  # default colormap
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    ax.set_title(title)

    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, cm[i, j], ha="center", va="center")

    plt.show()
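Raw counts are hard to compare across classes of different sizes. Dividing each row by its total turns the diagonal into per-class recall, as in this sketch on a made-up 3×3 matrix (`row_normalize` is a helper introduced here):

```python
def row_normalize(cm):
    """Divide each row by its sum so a row shows how one true class was predicted."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

# Made-up counts: rows are true classes, columns are predicted classes.
toy_cm = [
    [30, 15, 5],
    [10, 80, 10],
    [4, 16, 20],
]
norm = row_normalize(toy_cm)
recalls = [norm[i][i] for i in range(len(norm))]
print([round(r, 2) for r in recalls])
```

With a NumPy array whose rows are all non-empty, the same normalization is `cm / cm.sum(axis=1, keepdims=True)`.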



## 5) Model Builder


In [None]:

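def build_model(freeze_encoder=True):
    model = AutoModelForSequenceClassification.from_pretrained(
        CONFIG["model_name"], num_labels=num_labels
    )
    if freeze_encoder:
        # `base_model` resolves to the encoder submodule (e.g. model.distilbert, model.bert, model.roberta)
        encoder = model.base_model
        if encoder is model:
            raise ValueError("Could not locate the encoder; freeze it manually for this architecture.")
        for p in encoder.parameters():
            p.requires_grad = False
    # Report how many parameters will actually be updated
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} / {total:,}")
    return model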
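For `distilbert-base-uncased`, the classification head that stays trainable when the encoder is frozen consists of a `pre_classifier` layer followed by the `classifier` layer, so the "linear probe" is strictly a small MLP rather than a single linear map. The arithmetic below counts its parameters, assuming the model's hidden size of 768 and the three labels used here; `linear_params` is a helper introduced for illustration.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of a dense layer."""
    return d_in * d_out + d_out

hidden, n_labels = 768, 3
pre_classifier = linear_params(hidden, hidden)    # 768 -> 768
classifier = linear_params(hidden, n_labels)      # 768 -> 3
head_total = pre_classifier + classifier

print(f"pre_classifier: {pre_classifier:,}")
print(f"classifier:     {classifier:,}")
print(f"head total:     {head_total:,}")
```

To train a strictly linear probe, you could additionally freeze `model.pre_classifier` so that only `classifier` is updated.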



## 6) Baseline: **Linear Probe** (Frozen Encoder)


In [None]:

probe_model = build_model(freeze_encoder=True)

args_probe = TrainingArguments(
    output_dir=os.path.join(CONFIG["output_dir"], "probe"),
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],
    learning_rate=CONFIG["learning_rate_probe"],
    num_train_epochs=CONFIG["epochs_probe"],
    weight_decay=CONFIG["weight_decay"],
    warmup_ratio=CONFIG["warmup_ratio"],
    logging_steps=50,
    eval_strategy="epoch",  # older transformers releases call this `evaluation_strategy`
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED,
    report_to="none"
)

trainer_probe = Trainer(
    model=probe_model,
    args=args_probe,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

t0 = time.time()
trainer_probe.train()
probe_train_time = time.time() - t0

probe_val = trainer_probe.evaluate(tokenized["validation"])
probe_test = trainer_probe.evaluate(tokenized["test"])

print("Probe Validation:", probe_val)
print("Probe Test:", probe_test, "| Train time (s):", round(probe_train_time, 2))

probe_logits, _, _ = trainer_probe.predict(tokenized["test"])
probe_preds = np.argmax(probe_logits, axis=-1)
y_test = np.array(tokenized["test"][CONFIG["label_col"]])

plot_confusion_matrix(y_test, probe_preds, title="Frozen Encoder (Linear Probe) — Test", labels=label_names)
print(classification_report(y_test, probe_preds, target_names=label_names))



## 7) **Full Fine-Tuning** (Encoder + Head)


In [None]:

ft_model = build_model(freeze_encoder=False)

args_ft = TrainingArguments(
    output_dir=os.path.join(CONFIG["output_dir"], "finetune"),
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],
    learning_rate=CONFIG["learning_rate_finetune"],
    num_train_epochs=CONFIG["epochs_finetune"],
    weight_decay=CONFIG["weight_decay"],
    warmup_ratio=CONFIG["warmup_ratio"],
    logging_steps=50,
    eval_strategy="epoch",  # older transformers releases call this `evaluation_strategy`
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED,
    report_to="none"
)

trainer_ft = Trainer(
    model=ft_model,
    args=args_ft,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

t0 = time.time()
trainer_ft.train()
ft_train_time = time.time() - t0

ft_val = trainer_ft.evaluate(tokenized["validation"])
ft_test = trainer_ft.evaluate(tokenized["test"])

print("Finetune Validation:", ft_val)
print("Finetune Test:", ft_test, "| Train time (s):", round(ft_train_time, 2))

ft_logits, _, _ = trainer_ft.predict(tokenized["test"])
ft_preds = np.argmax(ft_logits, axis=-1)

plot_confusion_matrix(y_test, ft_preds, title="Full Fine-Tuned — Test", labels=label_names)
print(classification_report(y_test, ft_preds, target_names=label_names))



## 8) Compare Results


In [None]:

def metric(d, key):
    # Trainer.evaluate() prefixes metric names with "eval_"; missing keys become NaN
    return float(d.get(key, float("nan")))

probe_acc = metric(probe_test, "eval_accuracy")
probe_f1m = metric(probe_test, "eval_macro_f1")
ft_acc = metric(ft_test, "eval_accuracy")
ft_f1m = metric(ft_test, "eval_macro_f1")

print(f"Probe — Test Accuracy: {probe_acc:.4f} | Macro F1: {probe_f1m:.4f}")
print(f"FT    — Test Accuracy: {ft_acc:.4f} | Macro F1: {ft_f1m:.4f}")
print(f"Δ Accuracy: {ft_acc - probe_acc:+.4f}")
print(f"Δ Macro F1: {ft_f1m - probe_f1m:+.4f}")

labels_disp = ["Probe (Frozen)", "Finetuned"]
accs = [probe_acc, ft_acc]
f1s = [probe_f1m, ft_f1m]

plt.figure()
plt.bar(labels_disp, accs)
plt.title("Test Accuracy")
plt.ylabel("Accuracy")
plt.ylim(0, 1.0)
plt.show()

plt.figure()
plt.bar(labels_disp, f1s)
plt.title("Test Macro-F1")
plt.ylabel("Macro-F1")
plt.ylim(0, 1.0)
plt.show()
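A gap between two bars says little about whether the difference would survive a different test sample. One way to gauge this is a paired bootstrap over the test examples, sketched below in plain Python. The function name, the 1,000 resamples, and the toy predictions are illustrative choices; in the notebook you would pass `y_test`, `probe_preds`, and `ft_preds`.

```python
import random

def paired_bootstrap_acc_diff(y_true, preds_a, preds_b, n_resamples=1000, seed=42):
    """Point estimate and 95% percentile interval for accuracy(b) - accuracy(a)."""
    rng = random.Random(seed)
    n = len(y_true)
    # Per-example gain: +1 if only b is correct, -1 if only a is correct, else 0
    gains = [int(pb == t) - int(pa == t) for t, pa, pb in zip(y_true, preds_a, preds_b)]
    diffs = []
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping both models' predictions paired
        diffs.append(sum(gains[rng.randrange(n)] for _ in range(n)) / n)
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return sum(gains) / n, (lo, hi)

# Toy data for illustration only.
y = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2] * 20
a = [0, 1, 1, 1, 1, 2, 1, 0, 0, 1] * 20   # weaker model
b = [0, 1, 2, 1, 1, 2, 1, 1, 0, 1] * 20   # stronger model
point, (lo, hi) = paired_bootstrap_acc_diff(y, a, b)
print(f"accuracy gain: {point:+.3f}  (95% interval: {lo:+.3f} to {hi:+.3f})")
```

If the interval excludes zero, the gain is unlikely to be an artifact of which test examples happened to be sampled. Resampling the full test set in pure Python can take a while; a vectorized NumPy version would be faster.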



## 9) Discussion & Extensions

- **Try other datasets:** `imdb`, `amazon_polarity`, `yelp_polarity`, or other `tweet_eval` tasks.
- **Try other encoders:** `bert-base-uncased`, `roberta-base`, `google/electra-small-discriminator`.
- **Compute budget:** Adjust `subset_fraction` for CPU demos vs. full GPU runs.
- **PEFT:** Explore LoRA/adapters to approach full-FT accuracy with less compute.
- **Error analysis:** Inspect misclassifications; per-class precision/recall; calibration.
- **Robustness:** Evaluate on different time slices or domains.
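To put the PEFT bullet in perspective, the arithmetic below estimates how many parameters LoRA would train if rank-8 adapters were attached to the query and value projections of every DistilBERT layer. The rank and the choice of target matrices are assumptions made for illustration; the layer count (6) and hidden size (768) are those of `distilbert-base-uncased`.

```python
def lora_params(d_in, d_out, rank):
    """LoRA adds two low-rank factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

hidden, n_layers, rank = 768, 6, 8
targets_per_layer = 2                          # assumed: query and value projections
per_matrix = lora_params(hidden, hidden, rank)
lora_total = per_matrix * targets_per_layer * n_layers
full_per_matrix = hidden * hidden + hidden     # dense weight plus bias, for comparison

print(f"LoRA params per matrix: {per_matrix:,} (vs {full_per_matrix:,} in the dense layer)")
print(f"LoRA params in total:   {lora_total:,}")
```

The adapter count is a small fraction of the encoder's weights, which is where the compute and memory savings come from.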

> ✍️ **Short write-up prompt:** Explain why full fine-tuning typically outperforms a frozen encoder on this task. Relate your answer to representation learning and task/domain adaptation, and comment on the size of the gap you observed.
