
# Sentence Classification (Sentiment) — Linear Probe vs Full Fine-Tuning on **TweetEval: Sentiment**

**Task:** Multi-class sentiment classification (**negative / neutral / positive**) on the **TweetEval** benchmark (subset: `sentiment`).

You'll build and compare two approaches using a Hugging Face encoder:
1. **Linear Probe (Frozen Encoder):** Freeze the transformer encoder and train only a small classification head.
2. **Full Fine-Tuning:** Unfreeze the encoder and fine-tune end-to-end.

We'll evaluate both on the same test set and visualize improvements.



## 0) Setup & Reproducibility

Run this cell to (optionally) install dependencies and set the random seed.  
If you are running in a managed environment (e.g., Colab), uncomment the `%pip` line.


In [4]:

# If needed, uncomment to install:
# %pip install -U transformers datasets accelerate evaluate scikit-learn matplotlib

import os, random, time, json
import numpy as np

# Expose only the first GPU to this process (set before importing torch)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import evaluate
import torch
from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

SEED = 42
def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()
print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())




Torch: 2.8.0+cu128 | CUDA available: True



## 1) Configuration

Tweak hyperparameters here. For a quick run on CPU, set `subset_fraction` to a value such as `0.3`; set it to `None` to use the full dataset.


In [5]:

CONFIG = {
    "dataset_name": "tweet_eval",
    "dataset_subset": "sentiment",   # 3-way: negative(0), neutral(1), positive(2)
    "text_col": "text",
    "label_col": "label",
    "labels": ["negative", "neutral", "positive"],
    "model_name": "distilbert-base-uncased",
    "max_length": 128,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "epochs_probe": 2,           # linear-probe training epochs
    "epochs_finetune": 3,        # full finetune epochs
    "learning_rate_probe": 5e-4, # higher since only head trains
    "learning_rate_finetune": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "subset_fraction": 0.3,      # None for full data; use fraction like 0.3 for speed
    "output_dir": "checkpoints_tweeteval_sentiment"
}
print(json.dumps(CONFIG, indent=2))


{
  "dataset_name": "tweet_eval",
  "dataset_subset": "sentiment",
  "text_col": "text",
  "label_col": "label",
  "labels": [
    "negative",
    "neutral",
    "positive"
  ],
  "model_name": "distilbert-base-uncased",
  "max_length": 128,
  "per_device_train_batch_size": 16,
  "per_device_eval_batch_size": 32,
  "epochs_probe": 2,
  "epochs_finetune": 3,
  "learning_rate_probe": 0.0005,
  "learning_rate_finetune": 2e-05,
  "weight_decay": 0.01,
  "warmup_ratio": 0.06,
  "subset_fraction": 0.3,
  "output_dir": "checkpoints_tweeteval_sentiment"
}



## 2) Load the **TweetEval: Sentiment** Dataset

We use the **TweetEval** benchmark (not GLUE). The `sentiment` subset has labels: 0=negative, 1=neutral, 2=positive.  
Splits: `train`, `validation`, `test`.


In [6]:

raw = load_dataset(CONFIG["dataset_name"], CONFIG["dataset_subset"])

# Optionally downsample for a quick demo run
subset_fraction = CONFIG["subset_fraction"]
if subset_fraction is not None and 0 < subset_fraction < 1:
    def take_fraction(dset, frac):
        n = min(len(dset), max(30, int(len(dset) * frac)))  # at least 30 rows, never more than available
        return dset.shuffle(seed=SEED).select(range(n))
    raw = DatasetDict({
        "train": take_fraction(raw["train"], subset_fraction),
        "validation": take_fraction(raw["validation"], subset_fraction),
        "test": raw["test"]  # keep full test for better generalization measurement
    }) # type: ignore

raw


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13684
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 600
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
})
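To connect these split sizes to the training schedule, the sketch below estimates how many optimizer steps each run takes and how many of them fall in the warmup phase. It assumes a single device, no gradient accumulation, and a warmup length equal to the ceiling of `warmup_ratio` times the total step count; `schedule_steps` is a hypothetical helper written for this illustration, not a `transformers` function.

```python
import math

def schedule_steps(num_rows, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps) for one run.

    Assumes one device and no gradient accumulation (illustrative only).
    """
    steps_per_epoch = math.ceil(num_rows / batch_size)  # last partial batch counts
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

# 13684 training rows, batch size 16, warmup_ratio 0.06 (values from CONFIG and the output above)
probe_steps = schedule_steps(13684, 16, epochs=2, warmup_ratio=0.06)
finetune_steps = schedule_steps(13684, 16, epochs=3, warmup_ratio=0.06)

print("probe   :", probe_steps)
print("finetune:", finetune_steps)
```

Under these assumptions each epoch is 856 steps, so the learning rate ramps up over roughly the first 100 to 150 steps of either run.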


## 3) Tokenization

We use the tokenizer associated with the chosen encoder. Tweets are short, so we truncate at `max_length` tokens to keep batches small.


In [7]:

# Load the tokenizer for our chosen model (DistilBERT)
tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"], use_fast=True)

# Define a function to tokenize text batches
# This converts raw text into token IDs that the model can process
def tokenize_fn(batch):
    return tokenizer(batch[CONFIG["text_col"]], truncation=True, max_length=CONFIG["max_length"])

# Remove columns we don't need for training (keep only text and label)
remove_cols = [c for c in raw["train"].column_names if c not in (CONFIG["text_col"], CONFIG["label_col"])]

# Apply tokenization to all splits (train, validation, test)
# batched=True processes multiple examples at once for efficiency
tokenized = raw.map(tokenize_fn, batched=True, remove_columns=remove_cols)

# Data collator handles padding sequences to the same length within each batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Store number of classes and their names for later use
num_labels = len(CONFIG["labels"])  # 3 classes: negative, neutral, positive
label_names = CONFIG["labels"]  # Human-readable names for confusion matrix and reports


Map: 100%|██████████| 600/600 [00:00<00:00, 12331.59 examples/s]
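The tokenizer call above truncates but does not pad; padding is left to the data collator, which pads each batch only to the length of its longest sequence. The toy sketch below illustrates that idea in plain Python. `pad_batch` is a hypothetical stand-in, the token IDs are made up, and a pad ID of `0` is an assumption; it is not how `DataCollatorWithPadding` is implemented.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (dynamic padding)."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token-ID sequences of different lengths
toy_sequences = [[101, 2023, 102], [101, 2307, 2154, 999, 102]]
toy_batch = pad_batch(toy_sequences)

for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids, mask)
```

Padding per batch rather than to a fixed `max_length` of 128 keeps most batches of short tweets much narrower than 128 tokens.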



## 4) Metrics

We report **accuracy** and **macro-F1** (the unweighted mean of the per-class F1 scores).


In [8]:

# Load evaluation metrics from the evaluate library
# accuracy: measures the proportion of correct predictions
# f1: measures the harmonic mean of precision and recall
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for model predictions.
    
    This function is called by the Trainer during evaluation to calculate
    how well the model is performing.
    
    Args:
        eval_pred: A tuple containing (logits, labels)
            - logits: raw model outputs (shape: [num_samples, num_classes])
            - labels: true labels (shape: [num_samples])
    
    Returns:
        Dictionary with accuracy and macro-averaged F1 score
    """
    logits, labels = eval_pred
    
    # Convert logits to predicted class labels
    # argmax finds the index of the highest score for each sample
    # e.g., [0.1, 0.7, 0.2] -> 1 (neutral)
    preds = np.argmax(logits, axis=-1)
    
    return {
        # Accuracy: percentage of correct predictions
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        
        # Macro F1: average F1 score across all classes (treats each class equally)
        # Good for imbalanced datasets where we care about all classes
        "macro_f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }


## We are using accuracy, a fine metric as long as the dataset is not imbalanced.

<mark>Rule of thumb: a ratio of 10:1 or higher between the largest and smallest class means the dataset is imbalanced.</mark>

In [9]:

# Check class distribution across all splits
for split_name in ["train", "validation", "test"]:
    labels = tokenized[split_name][CONFIG["label_col"]]
    unique, counts = np.unique(labels, return_counts=True)
    
    print(f"\n{split_name.upper()} split:")
    print(f"  Total samples: {len(labels)}")
    for label_id, count in zip(unique, counts):
        percentage = (count / len(labels)) * 100
        print(f"  {label_names[label_id]:>8}: {count:>5} ({percentage:>5.2f}%)")


TRAIN split:
  Total samples: 13684
  negative:  2129 (15.56%)
   neutral:  6261 (45.75%)
  positive:  5294 (38.69%)

VALIDATION split:
  Total samples: 600
  negative:    90 (15.00%)
   neutral:   249 (41.50%)
  positive:   261 (43.50%)

TEST split:
  Total samples: 12284
  negative:  3972 (32.33%)
   neutral:  5937 (48.33%)
  positive:  2375 (19.33%)
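To check the rule of thumb against the counts printed above, the sketch below computes the largest-to-smallest class ratio for each split. The counts are copied from that output; `imbalance_ratio` is a hypothetical helper name used only for this illustration.

```python
# Class counts copied from the distribution printed above
class_counts = {
    "train": {"negative": 2129, "neutral": 6261, "positive": 5294},
    "validation": {"negative": 90, "neutral": 249, "positive": 261},
    "test": {"negative": 3972, "neutral": 5937, "positive": 2375},
}

def imbalance_ratio(counts):
    """Ratio between the most and least frequent class."""
    return max(counts.values()) / min(counts.values())

ratios = {split: imbalance_ratio(counts) for split, counts in class_counts.items()}

for split, ratio in ratios.items():
    print(f"{split:>10}: largest/smallest = {ratio:.2f}")
```

Every split stays below 3:1, well under the 10:1 threshold. The class mix does shift between splits, though (negative is 15.56% of train but 32.33% of test), which is one reason to read macro-F1 alongside accuracy.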



## 5) Model Builder


In [10]:

def build_model(model=None, encoder_requires_grad=False):
    """
    Build or reuse a sequence classification model and (un)freeze its encoder.

    Args:
        model: An existing Hugging Face sequence classification model. If None,
               a model is loaded from CONFIG["model_name"] with `num_labels`.
        encoder_requires_grad (bool): If False, freeze the encoder (linear probe).
                                      If True, unfreeze the encoder (full finetune).

    Returns:
        The model with its encoder parameters' requires_grad set accordingly
        (only applied when the backbone is DistilBERT and accessible via
        `model.distilbert`).

    Notes:
        - This function assumes a DistilBERT-based classifier where the encoder
          module is exposed as `model.distilbert`.
        - If the provided model does not have a `distilbert` attribute, no
          parameters are modified.
    """
    # Create a fresh model if none is provided
    if model is None:
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG["model_name"],
            num_labels=num_labels
        )

    # For DistilBERT, the encoder lives under model.distilbert; (un)freeze it
    if hasattr(model, "distilbert"):
        for p in model.distilbert.parameters():
            p.requires_grad = encoder_requires_grad

    return model
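To get a feel for how little is trained when the encoder is frozen, the sketch below counts the parameters of the classification head by hand. It assumes a hidden size of 768 for `distilbert-base-uncased` and a head made of two linear layers, `pre_classifier` (768 to 768) and `classifier` (768 to 3); `linear_params` is a hypothetical helper for this illustration.

```python
def linear_params(in_features, out_features):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

hidden_size = 768  # assumption: DistilBERT base hidden dimension
n_classes = 3      # negative / neutral / positive

head_params = (
    linear_params(hidden_size, hidden_size)    # pre_classifier
    + linear_params(hidden_size, n_classes)    # classifier
)
print(f"Trainable head parameters: {head_params:,}")
```

With a loaded model, `sum(p.numel() for p in model.parameters() if p.requires_grad)` reports the actual trainable count, which is a quick way to confirm that freezing took effect.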


## 6) Baseline: **Linear Probe** (Frozen Encoder)


In [11]:

# TrainingArguments for the linear probe (frozen encoder) experiment.
args_probe = TrainingArguments(
    output_dir=os.path.join(CONFIG["output_dir"], "probe"),              # Directory to save model checkpoints
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],   # Batch size for training
    per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],     # Batch size for evaluation
    learning_rate=CONFIG["learning_rate_probe"],                         # Learning rate (higher for frozen encoder)
    num_train_epochs=CONFIG["epochs_probe"],                             # Number of training epochs
    weight_decay=CONFIG["weight_decay"],                                 # Weight decay for regularization
    warmup_ratio=CONFIG["warmup_ratio"],                                 # Fraction of steps for learning rate warmup
    logging_steps=50,                                                    # Log metrics every N steps
    eval_strategy="epoch",                                               # Evaluate at the end of each epoch
    save_strategy="epoch",                                               # Save checkpoint at the end of each epoch
    load_best_model_at_end=True,                                         # Reload the best checkpoint at the end (lowest validation loss by default)
    seed=SEED,                                                           # Random seed for reproducibility
    report_to="none"                                                     # Disable reporting to external services
)

# Create the frozen-encoder model
model = build_model(encoder_requires_grad=False)

trainer_probe = Trainer(
    model=model,                                    # The frozen encoder model
    args=args_probe,                                # Training arguments for linear probe
    train_dataset=tokenized["train"],               # Training data
    eval_dataset=tokenized["validation"],           # Validation data
    tokenizer=tokenizer,                            # Tokenizer for text processing
    data_collator=data_collator,                    # Handles padding in batches
    compute_metrics=compute_metrics                 # Function to compute accuracy and F1
)

# Train the linear probe (frozen encoder)
trainer_probe.train()

# Test set evaluation
probe_test = trainer_probe.evaluate(tokenized["test"])
print("Probe Test Metrics:", probe_test)

# Save predictions for analysis
probe_logits, _, _ = trainer_probe.predict(tokenized["test"])
probe_preds = np.argmax(probe_logits, axis=-1)
y_test = np.array(tokenized["test"][CONFIG["label_col"]])


print("Confusion matrix:")
print(confusion_matrix(y_test, probe_preds))

print("Classification report:")
print(classification_report(y_test, probe_preds, target_names=label_names))


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 |
|---|---|---|---|---|
| 1 | 0.7897 | 0.788938 | 0.63 | 0.556562 |
| 2 | 0.7454 | 0.782529 | 0.616667 | 0.587171 |


Probe Test Metrics: {'eval_loss': 0.7459889650344849, 'eval_accuracy': 0.6471019211983068, 'eval_macro_f1': 0.640696252816535, 'eval_runtime': 4.4274, 'eval_samples_per_second': 2774.569, 'eval_steps_per_second': 86.734, 'epoch': 2.0}
Confusion matrix:
[[2633 1143  196]
 [1219 3809  909]
 [ 113  755 1507]]
Classification report:
              precision    recall  f1-score   support

    negative       0.66      0.66      0.66      3972
     neutral       0.67      0.64      0.65      5937
    positive       0.58      0.63      0.60      2375

    accuracy                           0.65     12284
   macro avg       0.64      0.65      0.64     12284
weighted avg       0.65      0.65      0.65     12284
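The reported test accuracy and macro-F1 can be recomputed from the confusion matrix above, which makes the definitions concrete. The sketch assumes the scikit-learn layout (rows are true labels, columns are predictions) and uses variable names chosen for this illustration.

```python
# Probe confusion matrix copied from the output above
# (rows = true class, columns = predicted class; order: negative, neutral, positive)
cm = [
    [2633, 1143, 196],
    [1219, 3809, 909],
    [113, 755, 1507],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all samples
acc_from_cm = sum(cm[i][i] for i in range(n_classes)) / total

# Per-class F1 = 2*TP / (actual count + predicted count), then unweighted mean
f1_per_class = []
for i in range(n_classes):
    tp = cm[i][i]
    actual = sum(cm[i])                                   # row sum: true members of class i
    predicted = sum(cm[r][i] for r in range(n_classes))   # column sum: predicted as class i
    f1_per_class.append(2 * tp / (actual + predicted))
macro_f1_from_cm = sum(f1_per_class) / n_classes

print(f"accuracy = {acc_from_cm:.4f}")
print(f"macro F1 = {macro_f1_from_cm:.4f}")
```

Both values agree with `eval_accuracy` and `eval_macro_f1` in the probe test metrics.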




## 7) **Full Fine-Tuning** (Encoder + Head)


In [13]:

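# Reuse the probe-trained model and unfreeze its encoder, so fine-tuning
# starts from the head learned in section 6 rather than from a fresh one.
# Note: `model` and `ft_model` refer to the same object.
ft_model = build_model(model=model, encoder_requires_grad=True)

# Reuse the probe TrainingArguments, overriding only what differs for fine-tuning
args_probe.output_dir = os.path.join(CONFIG["output_dir"], "finetune")
args_probe.learning_rate = CONFIG["learning_rate_finetune"]
args_probe.num_train_epochs = CONFIG["epochs_finetune"]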

trainer_ft = Trainer(
    model=ft_model,
    args=args_probe,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


trainer_ft.train()

# Test set evaluation
ft_test = trainer_ft.evaluate(tokenized["test"])
print("Finetune Test:", ft_test)

# Save predictions for analysis
ft_logits, _, _ = trainer_ft.predict(tokenized["test"])
ft_preds = np.argmax(ft_logits, axis=-1)

print("Confusion matrix:")
print(confusion_matrix(y_test, ft_preds))

print("Classification report:")
print(classification_report(y_test, ft_preds, target_names=label_names))


| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 |
|---|---|---|---|---|
| 1 | 0.4119 | 0.866622 | 0.68 | 0.664017 |
| 2 | 0.3291 | 0.975004 | 0.686667 | 0.661984 |
| 3 | 0.2695 | 1.304437 | 0.7 | 0.676646 |


Finetune Test: {'eval_loss': 0.8187649250030518, 'eval_accuracy': 0.6767339628785412, 'eval_macro_f1': 0.6737022877236408, 'eval_runtime': 4.4235, 'eval_samples_per_second': 2776.979, 'eval_steps_per_second': 86.809, 'epoch': 3.0}
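To put the two test runs side by side, the sketch below computes the change in accuracy and macro-F1 from the values in the `Probe Test Metrics` and `Finetune Test` lines above. The numbers are copied from those outputs; keep in mind that in this notebook the fine-tuned model starts from the already-trained probe weights, so the comparison is probe versus probe followed by fine-tuning.

```python
# Test metrics copied from the two evaluation outputs above
probe_metrics = {"accuracy": 0.6471019211983068, "macro_f1": 0.640696252816535}
finetune_metrics = {"accuracy": 0.6767339628785412, "macro_f1": 0.6737022877236408}

# Absolute improvement of fine-tuning over the frozen-encoder probe
deltas = {name: finetune_metrics[name] - probe_metrics[name] for name in probe_metrics}

for name, delta in deltas.items():
    print(
        f"{name:>9}: probe={probe_metrics[name]:.4f}  "
        f"finetune={finetune_metrics[name]:.4f}  delta={delta:+.4f}"
    )
```

On this 30% training subset the gain is about three points on each metric.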