
# Sentence Classification (Sentiment): Linear Probe vs Full Fine-Tuning on **TweetEval: Sentiment**

**Task:** Multi-class sentiment classification (**negative / neutral / positive**) on the **TweetEval** benchmark (subset: `sentiment`).

You'll build and compare two approaches using a Hugging Face encoder:
1. **Linear Probe (Frozen Encoder):** Freeze the transformer encoder and train only a small classification head.
2. **Full Fine-Tuning:** Unfreeze the encoder and fine-tune end-to-end.

We'll evaluate both on the same test set and visualize improvements.



## 0) Setup & Reproducibility

Run this cell to (optionally) install dependencies and set the random seed.  
If running on a managed environment (e.g., Colab) uncomment the `pip` line.


In [1]:

# If needed, uncomment to install:
# %pip install -U transformers datasets accelerate evaluate scikit-learn matplotlib

import os, random, time, json
import numpy as np

# Set the environment variable to use the first GPU (otherwise trainer uses them all)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import evaluate
import torch
from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

SEED = 42
def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()
print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())


Torch: 2.8.0+cu128 | CUDA available: True



## 1) Configuration

Tweak hyperparameters here. To make a quick run on CPU, use a **subset_fraction** like `0.3`. Set to `None` for the full dataset.


In [2]:

CONFIG = {
    "dataset_name": "tweet_eval",
    "dataset_subset": "sentiment",   # 3-way: negative(0), neutral(1), positive(2)
    "text_col": "text",
    "label_col": "label",
    "labels": ["negative", "neutral", "positive"],
    "model_name": "distilbert-base-uncased",
    "max_length": 128,
    "per_device_train_batch_size": 512,
    "per_device_eval_batch_size": 1024,
    "epochs_probe": 2,           # linear-probe training epochs
    "epochs_finetune": 3,        # full finetune epochs
    "learning_rate_probe": 5e-4, # higher since only head trains
    "learning_rate_finetune": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "subset_fraction": .3,      # None for full data; use fraction like 0.3 for speed
    "output_dir": "checkpoints_tweeteval_sentiment"
}
print(json.dumps(CONFIG, indent=2))


{
  "dataset_name": "tweet_eval",
  "dataset_subset": "sentiment",
  "text_col": "text",
  "label_col": "label",
  "labels": [
    "negative",
    "neutral",
    "positive"
  ],
  "model_name": "distilbert-base-uncased",
  "max_length": 128,
  "per_device_train_batch_size": 512,
  "per_device_eval_batch_size": 1024,
  "epochs_probe": 2,
  "epochs_finetune": 3,
  "learning_rate_probe": 0.0005,
  "learning_rate_finetune": 2e-05,
  "weight_decay": 0.01,
  "warmup_ratio": 0.06,
  "subset_fraction": 0.3,
  "output_dir": "checkpoints_tweeteval_sentiment"
}



## 2) Load the **TweetEval: Sentiment** Dataset

We use the **TweetEval** benchmark (not GLUE). The `sentiment` subset has labels: 0=negative, 1=neutral, 2=positive.  
Splits: `train`, `validation`, `test`.


In [3]:

raw = load_dataset(CONFIG["dataset_name"], CONFIG["dataset_subset"])

# Optionally downsample for a quick demo run
subset_fraction = CONFIG["subset_fraction"]
if subset_fraction is not None and 0 < subset_fraction < 1:
    def take_fraction(dset, frac):
        n = max(30, int(len(dset) * frac))  # keep a minimum
        return dset.shuffle(seed=SEED).select(range(n))
    raw = DatasetDict({
        "train": take_fraction(raw["train"], subset_fraction),
        "validation": take_fraction(raw["validation"], subset_fraction),
        "test": raw["test"]  # keep full test for better generalization measurement
    }) # type: ignore

raw


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 13684
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 600
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12284
    })
})
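The `train` and `validation` row counts above follow from the size rule in `take_fraction`, which keeps `max(30, int(len(dset) * frac))` rows. Here is a minimal sketch of that arithmetic. It assumes the full `train` and `validation` splits hold 45,615 and 2,000 examples; those figures are not printed in this notebook, and `subset_size` is a hypothetical helper.

```python
def subset_size(n_rows, frac, minimum=30):
    """Mirror the size rule used by take_fraction: a fraction with a floor."""
    return max(minimum, int(n_rows * frac))

# Assumed full split sizes (not printed anywhere in this notebook).
FULL_SIZES = {"train": 45615, "validation": 2000}

sizes = {name: subset_size(n, 0.3) for name, n in FULL_SIZES.items()}
print(sizes)

# The floor only matters for tiny splits, e.g. 50 rows at 30% would give 15.
print(subset_size(50, 0.3))
```

The `test` split is left at its full 12,284 rows, so test metrics stay comparable across runs with different `subset_fraction` values.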

In [4]:
raw['train'][0:5]

{'text': ['Few more hours to iPhone 6s launch and im still using the 4th generation ^_^',
  "Last night we were named NZ's 27th fastest growing co. in the Deloitte Fast 50. Our 2nd year making the list and we are totally thrilled!",
  'All the hoes will be out this Saturday at the Chris brown concert.',
  'BUENOS AIRES--Argentina late Wednesday approved a law to lower the legal voting age to 16 in a move that could s ...',
  '"Every time I see a runner slide into 1st, I see Kenny Lofton laying on the ground, arm limp like a dead fish."'],
 'label': [2, 2, 0, 1, 0]}


## 3) Tokenization

We use the tokenizer associated with the chosen encoder. Tweets are short, so we cap `max_length` at 128 tokens to keep tokenization and training efficient.


In [5]:
# Load the tokenizer for our chosen model (DistilBERT)
tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"], use_fast=True)

# Define a function to tokenize text batches
# This converts raw text into token IDs that the model can process
def tokenize_fn(batch):
    return tokenizer(batch[CONFIG["text_col"]], truncation=True, max_length=CONFIG["max_length"])

# No columns are dropped here: the Trainer ignores dataset columns that the
# model's forward() does not accept (such as "text"), and keeping "text"
# makes later inspection of individual examples easier.

# Apply tokenization to all splits (train, validation, test)
# batched=True processes multiple examples at once for efficiency
tokenized = raw.map(tokenize_fn, batched=True)

# Data collator handles padding sequences to the same length within each batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Store number of classes and their names for later use
num_labels = len(CONFIG["labels"])  # 3 classes: negative, neutral, positive
label_names = CONFIG["labels"]  # Human-readable names for confusion matrix and reports
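`tokenize_fn` truncates but does not pad, so every example keeps its own length and padding is applied later, one batch at a time, by the data collator. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, and the pad id of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (hypothetical helper)."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy token-id sequences of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2307, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than always to `max_length`, avoids spending compute on padding tokens when most tweets are far shorter than 128 tokens.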




## 4) Metrics

We report **accuracy** and **macro-F1** (averages F1 across classes).


In [6]:

# Load evaluation metrics from the evaluate library
# accuracy: measures the proportion of correct predictions
# f1: measures the harmonic mean of precision and recall
accuracy = evaluate.load("accuracy")        # evaluate `accuracy` in a separate cell to see its options
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    """
    Compute evaluation metrics for model predictions.
    
    This function is called by the Trainer during evaluation to calculate
    how well the model is performing.
    
    Args:
        eval_pred: A tuple containing (logits, labels)
            - logits: raw model outputs (shape: [num_samples, num_classes])
            - labels: true labels (shape: [num_samples])
    
    Returns:
        Dictionary with accuracy and macro-averaged F1 score
    """
    logits, labels = eval_pred
    
    # Convert logits to predicted class labels
    # argmax finds the index of the highest score for each sample
    # e.g., [0.1, 0.7, 0.2] -> 1 (neutral)
    preds = np.argmax(logits, axis=-1)
    
    return {
        # Accuracy: percentage of correct predictions
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        
        # Macro F1: average F1 score across all classes (treats each class equally)
        # Good for imbalanced datasets where we care about all classes
        "macro_f1": f1.compute(predictions=preds, references=labels, average="macro")["f1"]
    }


## We are using accuracy, a fine metric if the dataset is not unbalanced

<mark>Rule of thumb: a ratio of 10 to 1 or higher between the largest and smallest class means the dataset is unbalanced.</mark>

In [7]:
# Check class distribution across all splits
for split_name in ["train", "validation", "test"]:
    labels = tokenized[split_name][CONFIG["label_col"]]
    unique, counts = np.unique(labels, return_counts=True)
    
    print(f"\n{split_name.upper()} split:")
    print(f"  Total samples: {len(labels)}")
    for label_id, count in zip(unique, counts):
        percentage = (count / len(labels)) * 100
        print(f"  {label_names[label_id]:>8}: {count:>5} ({percentage:>5.2f}%)")


TRAIN split:
  Total samples: 13684
  negative:  2129 (15.56%)
   neutral:  6261 (45.75%)
  positive:  5294 (38.69%)

VALIDATION split:
  Total samples: 600
  negative:    90 (15.00%)
   neutral:   249 (41.50%)
  positive:   261 (43.50%)

TEST split:
  Total samples: 12284
  negative:  3972 (32.33%)
   neutral:  5937 (48.33%)
  positive:  2375 (19.33%)
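The 10-to-1 rule of thumb can be checked directly against the counts printed above. This is a small sketch with the counts copied by hand from that output; `imbalance_ratio` is a hypothetical helper.

```python
# Class counts copied from the output above: (negative, neutral, positive).
COUNTS = {
    "train": (2129, 6261, 5294),
    "validation": (90, 249, 261),
    "test": (3972, 5937, 2375),
}

def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class."""
    return max(counts) / min(counts)

ratios = {split: imbalance_ratio(c) for split, c in COUNTS.items()}
for split, ratio in ratios.items():
    verdict = "unbalanced" if ratio >= 10 else "below the 10:1 threshold"
    print(f"{split:>10}: {ratio:.2f} ({verdict})")
```

Every split stays well under 10 to 1, so accuracy is usable. The class mix does shift between splits, though: negative is the smallest class in train (15.56%) while positive is the smallest in test (19.33%), which is a good reason to read macro-F1 alongside accuracy.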



## 5) Model Builder


In [8]:

def build_model(model=None, encoder_requires_grad=False):
    """
    Build or reuse a sequence classification model and (un)freeze its encoder.

    Args:
        model: An existing Hugging Face sequence classification model. If None,
               a model is loaded from CONFIG["model_name"] with `num_labels`.
        encoder_requires_grad (bool): If False, freeze the encoder (linear probe).
                                      If True, unfreeze the encoder (full finetune).

    Returns:
        The model with its encoder parameters' requires_grad set accordingly
        (only applied when the backbone is DistilBERT and accessible via
        `model.distilbert`).

    Notes:
        - This function assumes a DistilBERT-based classifier where the encoder
          module is exposed as `model.distilbert`.
        - If the provided model does not have a `distilbert` attribute, no
          parameters are modified.
    """
    # Create a fresh model if none is provided
    if model is None:
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG["model_name"],
            num_labels=num_labels
        )

    # For DistilBERT, the encoder lives under model.distilbert; (un)freeze it
    if hasattr(model, "distilbert"):
        for p in model.distilbert.parameters():
            p.requires_grad = encoder_requires_grad

    return model

In [9]:
build_model(model=None, encoder_requires_grad=False)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
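To see how small the trainable part is when the encoder is frozen, the parameter counts can be worked out by hand from the layer shapes. This is a sketch under assumptions: the printout above is truncated, so the FFN width (3072), the second LayerNorm in each block, and the head shapes (`pre_classifier`: 768 to 768, `classifier`: 768 to 3) are taken from the standard DistilBERT base configuration rather than from this output.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

HIDDEN, FFN_DIM, VOCAB, MAX_POS, LAYERS, NUM_LABELS = 768, 3072, 30522, 512, 6, 3
layer_norm = 2 * HIDDEN  # one scale vector and one shift vector

embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + layer_norm
block = (
    4 * linear_params(HIDDEN, HIDDEN)   # q_lin, k_lin, v_lin, out_lin
    + linear_params(HIDDEN, FFN_DIM)    # FFN up-projection (assumed width)
    + linear_params(FFN_DIM, HIDDEN)    # FFN down-projection (assumed width)
    + 2 * layer_norm                    # two LayerNorms per block (second one assumed)
)
encoder_params = embeddings + LAYERS * block

# Head = pre_classifier + classifier (shapes assumed, see the note above).
head_params = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)

share = head_params / (encoder_params + head_params)
print(f"encoder: {encoder_params:,} | head: {head_params:,} | trainable share: {share:.2%}")
```

Under these assumptions the frozen-encoder run trains under 1% of the parameters. Because the head has two layers (`pre_classifier`, then `classifier`), the "linear probe" here is strictly a small MLP on top of frozen features.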



## 6) Baseline: **Linear Probe** (Frozen Encoder)


In [10]:

# TrainingArguments for the linear probe (frozen encoder) experiment.
args_probe = TrainingArguments(
    output_dir=os.path.join(CONFIG["output_dir"], "probe"),              # Directory to save model checkpoints
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],   # Batch size for training
    per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],     # Batch size for evaluation
    learning_rate=CONFIG["learning_rate_probe"],                         # Learning rate (higher for frozen encoder)
    num_train_epochs=CONFIG["epochs_probe"],                             # Number of training epochs
    weight_decay=CONFIG["weight_decay"],                                 # Weight decay for regularization
    warmup_ratio=CONFIG["warmup_ratio"],                                 # Fraction of steps for learning rate warmup
    logging_steps=50,                                                    # Log metrics every N steps
    eval_strategy="epoch",                                               # Evaluate at the end of each epoch
    save_strategy="epoch",                                               # Save checkpoint at the end of each epoch
    load_best_model_at_end=True,                                         # Load the best model based on validation metrics
    seed=SEED,                                                           # Random seed for reproducibility
    report_to="none"                                                     # Disable reporting to external services
)

#create frozen encoder model
model = build_model(encoder_requires_grad=False)

trainer_probe = Trainer(
    model=model,                                    # The frozen encoder model
    args=args_probe,                                # Training arguments for linear probe
    train_dataset=tokenized["train"],               # Training data
    eval_dataset=tokenized["validation"],           # Validation data
    tokenizer=tokenizer,                            # Tokenizer for text processing
    data_collator=data_collator,                    # Handles padding in batches
    compute_metrics=compute_metrics                 # Function to compute accuracy and F1
)

# Train the linear probe (frozen encoder)
trainer_probe.train()

# Test set evaluation
probe_test = trainer_probe.evaluate(tokenized["test"])
print("Probe Test Metrics:", probe_test)

# Save predictions for analysis
probe_logits, _, _ = trainer_probe.predict(tokenized["test"])
probe_preds = np.argmax(probe_logits, axis=-1)

# Ground-truth test labels, used for the confusion matrix and report
y_test = np.array(tokenized["test"][CONFIG["label_col"]])

print("Confusion matrix:")
print(confusion_matrix(y_test, probe_preds))

print("Classification report:")
print(classification_report(y_test, probe_preds, target_names=label_names))


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 |
|---|---|---|---|---|
| 1 | No log | 0.918468 | 0.541667 | 0.390629 |
| 2 | 0.929000 | 0.881011 | 0.565 | 0.408466 |
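The `No log` entry for epoch 1 follows from the step arithmetic. Here is a rough sketch, assuming a single device (as set through `CUDA_VISIBLE_DEVICES`), no gradient accumulation, and that the final partial batch is kept; `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Optimizer steps per epoch when the final partial batch is kept."""
    return math.ceil(n_examples / batch_size)

TRAIN_ROWS, BATCH_SIZE, LOGGING_STEPS = 13684, 512, 50

per_epoch = steps_per_epoch(TRAIN_ROWS, BATCH_SIZE)
for name, epochs in [("probe", 2), ("finetune", 3)]:
    total = per_epoch * epochs
    # Training loss is reported every LOGGING_STEPS optimizer steps.
    logged_at = list(range(LOGGING_STEPS, total + 1, LOGGING_STEPS))
    print(f"{name}: {per_epoch} steps/epoch, {total} total, loss logged at steps {logged_at}")
```

With so few steps per epoch, the first training-loss log at step 50 only arrives during epoch 2, so epoch 1 has nothing to report. Lowering `logging_steps` (for example to 10) would give each epoch its own training-loss value.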


Probe Test Metrics: {'eval_loss': 0.9080833792686462, 'eval_accuracy': 0.5267828069032888, 'eval_macro_f1': 0.4110568904970793, 'eval_runtime': 4.1561, 'eval_samples_per_second': 2955.679, 'eval_steps_per_second': 2.887, 'epoch': 2.0}
Confusion matrix:
[[  50 3514  408]
 [   2 4951  984]
 [   1  904 1470]]
Classification report:
              precision    recall  f1-score   support

    negative       0.94      0.01      0.02      3972
     neutral       0.53      0.83      0.65      5937
    positive       0.51      0.62      0.56      2375

    accuracy                           0.53     12284
   macro avg       0.66      0.49      0.41     12284
weighted avg       0.66      0.53      0.43     12284
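The gap between accuracy (0.53) and macro-F1 (0.41) can be reproduced from the confusion matrix alone. The sketch below recomputes both in plain Python, assuming the scikit-learn layout of rows as true labels and columns as predictions; `per_class_f1` is a hypothetical helper and the matrix is copied from the output above.

```python
def per_class_f1(cm):
    """F1 per class from a confusion matrix with rows = true, columns = predicted."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Probe confusion matrix copied from the output above.
PROBE_CM = [[50, 3514, 408], [2, 4951, 984], [1, 904, 1470]]

probe_f1 = per_class_f1(PROBE_CM)
probe_macro_f1 = sum(probe_f1) / len(probe_f1)
probe_accuracy = sum(PROBE_CM[k][k] for k in range(3)) / sum(map(sum, PROBE_CM))

print("per-class F1:", [round(s, 3) for s in probe_f1])
print(f"accuracy: {probe_accuracy:.4f} | macro-F1: {probe_macro_f1:.4f}")
```

Macro-F1 weights the three classes equally, so the near-zero F1 on `negative` (only 50 of 3,972 negatives are recovered) pulls the average down much more than it pulls accuracy down.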



## 7) **Full Fine-Tuning** (Encoder + Head)


In [11]:

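# Unfreeze the encoder. This reuses the probe model, so fine-tuning starts
# from the head trained in section 6 rather than from a freshly initialized head.
ft_model = build_model(model=model, encoder_requires_grad=True)

# Reuse the probe's TrainingArguments, overriding only what differs.
# args_probe is modified in place, so it no longer describes the probe run.
args_probe.output_dir = os.path.join(CONFIG["output_dir"], "finetune")
args_probe.learning_rate = CONFIG["learning_rate_finetune"]   # lower learning rate for full fine-tuning
args_probe.num_train_epochs = CONFIG["epochs_finetune"]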

trainer_ft = Trainer(
    model=ft_model,
    args=args_probe,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# Train the full finetune model
trainer_ft.train()

# Test set evaluation
ft_test = trainer_ft.evaluate(tokenized["test"])
print("Finetune Test:", ft_test)

# Save predictions for analysis
ft_logits, _, _ = trainer_ft.predict(tokenized["test"])
ft_preds = np.argmax(ft_logits, axis=-1)

print("Confusion matrix:")
print(confusion_matrix(y_test, ft_preds))

print("Classification report:")
print(classification_report(y_test, ft_preds, target_names=label_names))




| Epoch | Training Loss | Validation Loss | Accuracy | Macro F1 |
|---|---|---|---|---|
| 1 | No log | 0.744152 | 0.655 | 0.599401 |
| 2 | 0.732800 | 0.711859 | 0.671667 | 0.634934 |
| 3 | 0.732800 | 0.706729 | 0.675 | 0.63572 |


Finetune Test: {'eval_loss': 0.7125346064567566, 'eval_accuracy': 0.6802344513187887, 'eval_macro_f1': 0.6759480259513638, 'eval_runtime': 4.1456, 'eval_samples_per_second': 2963.133, 'eval_steps_per_second': 2.895, 'epoch': 3.0}
Confusion matrix:
[[2685 1089  198]
 [1030 3985  922]
 [  89  600 1686]]
Classification report:
              precision    recall  f1-score   support

    negative       0.71      0.68      0.69      3972
     neutral       0.70      0.67      0.69      5937
    positive       0.60      0.71      0.65      2375

    accuracy                           0.68     12284
   macro avg       0.67      0.69      0.68     12284
weighted avg       0.68      0.68      0.68     12284
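To make the improvement concrete, per-class recall can be compared across the two runs straight from their confusion matrices. This is a minimal sketch with both matrices copied from the outputs in sections 6 and 7; `recalls` is a hypothetical helper.

```python
LABELS = ["negative", "neutral", "positive"]
PROBE_CM = [[50, 3514, 408], [2, 4951, 984], [1, 904, 1470]]
FT_CM = [[2685, 1089, 198], [1030, 3985, 922], [89, 600, 1686]]

def recalls(cm):
    """Per-class recall: diagonal entry divided by its row total (true count)."""
    return [row[k] / sum(row) for k, row in enumerate(cm)]

probe_recall, ft_recall = recalls(PROBE_CM), recalls(FT_CM)
for name, before, after in zip(LABELS, probe_recall, ft_recall):
    print(f"{name:>8}: {before:.2f} -> {after:.2f} ({after - before:+.2f})")
```

Most of the gain comes from the `negative` class, which the frozen-encoder model almost never predicted. Recall on `neutral` drops, since the fine-tuned model no longer defaults to the majority class.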



## 8) **Save and Reload Model, then test it** 


In [12]:
import os

# Save and reload the fine-tuned model + tokenizer, then sanity-check predictions


save_dir = os.path.join(CONFIG["output_dir"], "finetune_saved")
os.makedirs(save_dir, exist_ok=True)

# Save via the trainer (saves model + config) and tokenizer
trainer_ft.save_model(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload model and tokenizer
model_reloaded = AutoModelForSequenceClassification.from_pretrained(save_dir)
tokenizer_reloaded = AutoTokenizer.from_pretrained(save_dir)

# Move to device and eval mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_reloaded.to(device)
model_reloaded.eval()

print(f"Saved model directory: {save_dir}")


Saved model directory: checkpoints_tweeteval_sentiment/finetune_saved


In [13]:
# Select a single example from the test set
single_example = tokenized["test"].select([2])
print(single_example[0])

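# Predict with the in-memory fine-tuned model via its trainer.
# Note: this does not exercise model_reloaded. To check the save/load round trip,
# run the same example through model_reloaded and compare the logits.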
predictions = trainer_ft.predict(single_example)
logits = predictions.predictions
predicted_class = np.argmax(logits, axis=-1)[0]

# Get the actual label
actual_label = single_example["label"][0]

# Display results
print(f"Text: {raw['test'][2]['text']}")
print(f"\nPredicted class: {predicted_class} ({label_names[predicted_class]})")
print(f"Actual label: {actual_label} ({label_names[actual_label]})")
print(f"\nLogits: {logits[0]}")
print(f"Probabilities: {np.exp(logits[0]) / np.sum(np.exp(logits[0]))}")

{'text': "@user @user That's coming, but I think the victims are going to be Medicaid recipients.", 'label': 1, 'input_ids': [101, 1030, 5310, 1030, 5310, 2008, 1005, 1055, 2746, 1010, 2021, 1045, 2228, 1996, 5694, 2024, 2183, 2000, 2022, 19960, 5555, 3593, 15991, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Text: @user @user That's coming, but I think the victims are going to be Medicaid recipients.

Predicted class: 1 (neutral)
Actual label: 1 (neutral)

Logits: [ 0.42532378  0.65774786 -0.90948796]
Probabilities: [0.39606118 0.4996924  0.10424636]
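The probabilities above are a softmax over the logits. The inline `np.exp(...) / np.sum(np.exp(...))` form works for small logits like these but can overflow for large ones. Here is a sketch of the usual stable variant, written with the standard library only and fed the logits printed above; `softmax` is a hypothetical helper.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above (negative, neutral, positive).
probs = softmax([0.42532378, 0.65774786, -0.90948796])
print([round(p, 4) for p in probs])

# Large logits that would overflow a naive exp() are handled without error.
print(softmax([1000.0, 1000.0]))
```

Subtracting the maximum leaves the result unchanged mathematically, because the common factor cancels in the ratio. Note how close the top two probabilities are for this tweet: the model picks `neutral` at roughly 0.50 against roughly 0.40 for `negative`.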
