
# Sentiment Analysis: Frozen Encoder vs Full Fine-Tuning (SST-2)

**Objective:** Compare a **frozen encoder with a fine-tuned classification head** to a **fully fine-tuned encoder** for sentiment classification on **GLUE/SST-2**.

You'll:
- Load the **SST-2** dataset from 🤗 `datasets`
- Tokenize with a Hugging Face encoder (`distilbert-base-uncased` by default)
- Train a **head-only model** (encoder frozen, only the classification head trained)
- Train a **fully fine-tuned model** (encoder + head fine-tuned)
- Evaluate both models and quantify the improvement

> 💡 *Why this design?*  
> Testing a **frozen encoder** quantifies how much task signal a general-purpose language model already encodes. **Full fine-tuning** measures the benefit of adapting the encoder to the task.

<mark>**Note: Possible 25-point penalty here!**<br>
Please remove all repetitive code (this applies especially to `TrainingArguments` and `Trainer`); if you are typing the same code twice, refactor it into a function.<br>
Please do not use any code that you do not understand.</mark><br><br>



## 0) Setup

Run the cell below to install dependencies (if needed) and set a reproducible environment.


In [1]:

# Installs the dependencies; comment this line out if they are already present.
%pip install -U transformers datasets accelerate evaluate scikit-learn matplotlib

import os, random, math, time, json
import numpy as np

# Set the environment variable to use the first GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import evaluate
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt

SEED = 42
def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()
print("Torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())


Note: you may need to restart the kernel to use updated packages.
Torch: 2.5.1+cu121 | CUDA available: True



## 1) Configuration

You can tweak model and training settings here. When short on compute, set `subset_fraction` (e.g., 0.3) to run a quick demo on part of the data.


In [2]:

CONFIG = {
    "model_name": "distilbert-base-uncased",  # small, fast baseline
    "num_labels": 2,
    "max_length": 128,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "epochs_headonly_finetune": 2,      # linear-probe epochs
    "epochs_full_finetune": 2,   # full finetune epochs
    "learning_rate_headonly_finetune": 5e-4,     # a bit higher since only head trains
    "learning_rate_full_finetune": 2e-5,  # standard for full finetune
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "subset_fraction": None,   # set to None for full dataset; e.g., 0.3 uses 30% for quicker runs
    "output_dir": "checkpoints_sst2"
}
print(json.dumps(CONFIG, indent=2))


{
  "model_name": "distilbert-base-uncased",
  "num_labels": 2,
  "max_length": 128,
  "per_device_train_batch_size": 32,
  "per_device_eval_batch_size": 64,
  "epochs_headonly_finetune": 2,
  "epochs_full_finetune": 2,
  "learning_rate_headonly_finetune": 0.0005,
  "learning_rate_full_finetune": 2e-05,
  "weight_decay": 0.01,
  "warmup_ratio": 0.06,
  "subset_fraction": null,
  "output_dir": "checkpoints_sst2"
}
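
None of the cells in this notebook section reads `subset_fraction`, so setting it has no effect as written. A minimal sketch of one way to apply it is below; the helper name `subset_indices` is hypothetical, and the idea is simply to draw a reproducible sample of row indices.

```python
import random

def subset_indices(num_rows, fraction, seed=42):
    """Return a reproducible, sorted sample of row indices.

    fraction=None keeps every row, mirroring CONFIG["subset_fraction"] = None.
    """
    if fraction is None:
        return list(range(num_rows))
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, round(num_rows * fraction))
    rng = random.Random(seed)  # local RNG, leaves the global seed untouched
    return sorted(rng.sample(range(num_rows), keep))

idx = subset_indices(num_rows=1000, fraction=0.3)
print(len(idx))  # 300
```

With a 🤗 `datasets.Dataset`, the resulting list could be passed to `Dataset.select(idx)` on the training split before tokenization.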



## 2) (20 pts) Load the SST-2 Dataset

SST-2 (from GLUE) is a **binary** sentiment task (positive/negative). It comes with three splits: train, validation, and test.<br>
<mark>Ignore the test set, as its labels are hidden. Generate a new test set from part of the train set.</mark>


In [3]:
glue_sst2 = load_dataset("glue", "sst2")

#your code here
# The GLUE 'test' split has hidden labels, so we re-split the 'train' set:
# 90% stays for training and 10% becomes a labeled test set.
raw_datasets = glue_sst2["train"].train_test_split(test_size=0.1, seed=SEED)

# Keep the validation split as provided by GLUE (for dev evaluation);
# it is reused from the first load instead of downloading the dataset again.
val_dataset = glue_sst2["validation"]

print(raw_datasets)
print(val_dataset)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 60614
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 6735
    })
})
Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 872
})
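
The split sizes above can be checked by hand. The sketch below assumes the test fraction is rounded up to a whole number of rows and the remainder goes to training, which is consistent with the printed counts; `split_sizes` is a hypothetical helper, and 67,349 is the size of the original SST-2 train split.

```python
import math

def split_sizes(num_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(num_rows * test_size)
    return num_rows - n_test, n_test

print(split_sizes(67349, 0.1))  # (60614, 6735)
```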



## 3) (10 points) Tokenization

Use the DistilBERT tokenizer and truncate to the `max_length` defined in `CONFIG` above; padding is applied per batch by the data collator.


In [4]:
# Load the tokenizer for the model specified in CONFIG
# Tokenize all splits of the dataset using the tokenize_fn function
# - Truncate sequences to max_length (128 tokens)
# - Use DataCollatorWithPadding for dynamic padding during batching
# Define label names and extract num_labels for the classification task
tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"], use_fast=True)

#your code here
def preprocess_function(examples):
    # Tokenize with truncation only; padding is applied per batch by the data collator
    return tokenizer(examples["sentence"], truncation=True, max_length=CONFIG["max_length"])

# Apply tokenizer to all splits
tokenized_datasets = {}
for split_name, dataset in raw_datasets.items():
    tokenized_datasets[split_name] = dataset.map(preprocess_function, batched=True)

tokenized_val = val_dataset.map(preprocess_function, batched=True)
# Pads each batch to its longest sequence (dynamic padding)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

tokenized_datasets["train"].set_format("torch", columns=["input_ids", "attention_mask", "label"])
tokenized_datasets["test"].set_format("torch", columns=["input_ids", "attention_mask", "label"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "label"])
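
To make dynamic padding concrete, here is a small pure-Python sketch of what a padding collator does conceptually: each batch is padded only to its own longest sequence, not to `max_length`. The function `pad_batch`, the pad id of 0, and the token ids are illustrative assumptions, not the library's implementation.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one in this batch (dynamic padding)."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids below are made up for illustration.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2003, 2307, 102]])
print(batch["input_ids"][0])       # [101, 2023, 102, 0, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0, 0]
```

Since SST-2 sentences are short, padding to the batch maximum usually means far fewer pad tokens than padding everything to 128.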



## 4) (10 pts) Metrics

Track accuracy and F1.


In [5]:
# Define accuracy and F1 metrics using the evaluate library, and create a compute_metrics function
# that takes model predictions and labels (a tuple (logits, labels) from the Trainer evaluation loop),
# computes predicted classes, and returns a dictionary
# with 'accuracy' and 'f1' scores for use in Hugging Face Trainer evaluation.

# your code here
# Load metrics from the evaluate library
metric_accuracy = evaluate.load("accuracy")
metric_f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    """
    Compute accuracy and F1 from model predictions.
    eval_pred is a tuple (logits, labels) automatically provided by Trainer.
    """
    logits, labels = eval_pred
    # Convert model logits to predicted class indices
    preds = np.argmax(logits, axis=-1)
    
    # Compute metrics
    acc = metric_accuracy.compute(predictions=preds, references=labels)
    f1 = metric_f1.compute(predictions=preds, references=labels)
    
    # Return a combined dictionary (Hugging Face Trainer expects this format)
    return {"accuracy": acc["accuracy"], "f1": f1["f1"]}
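
For intuition about what `compute_metrics` returns, the same two numbers can be computed without any library. This is a sketch on toy logits, assuming binary F1 with label 1 as the positive class; `argmax` and `accuracy_and_f1` are hypothetical names.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Toy logits for four examples; the last positive example is misclassified.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 1, 1]
print(accuracy_and_f1(logits, labels))  # {'accuracy': 0.75, 'f1': 0.8}
```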


## 5) (10 pts) Is accuracy a good metric for this task?  Why or why not?

Back up your answer with data.

In [6]:
# your code here
from collections import Counter

# Count how many positive (1) and negative (0) samples are in the training data
label_counts = Counter(raw_datasets["train"]["label"])
total = sum(label_counts.values())
for label, count in label_counts.items():
    print(f"Label {label}: {count} samples ({count/total:.2%})")

# Compute class balance ratio
balance_ratio = min(label_counts.values()) / max(label_counts.values())
print(f"\nClass balance ratio: {balance_ratio:.3f}")


Label 1: 33878 samples (55.89%)
Label 0: 26736 samples (44.11%)

Class balance ratio: 0.789
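
One way to read these counts is to compare against a constant predictor that always outputs the majority label. The sketch below reuses the counts printed above; treating this baseline as the reference point for judging accuracy is an assumption of the sketch, not the only valid argument.

```python
# Label counts copied from the output above.
label_counts = {1: 33878, 0: 26736}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
majority_acc = label_counts[majority_label] / total

print(f"Always predicting {majority_label} gives accuracy {majority_acc:.4f}")
# Always predicting 1 gives accuracy 0.5589
```

Because the trivial baseline sits near 56%, well below the scores a trained model reaches, accuracy remains informative here; F1 is still worth tracking since the classes are not perfectly balanced.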



## 6) (20 pts) First Stage: **Head-Only Fine-Tuning with a Frozen Encoder**

- Load `AutoModelForSequenceClassification` with `CONFIG["num_labels"]` labels.
- **Freeze** all encoder layers so only the classification head trains.
- Train for `CONFIG["epochs_headonly_finetune"]` epochs and evaluate.


In [17]:
def build_model(model=None, encoder_requires_grad=False):
    """
    Build or reuse a sequence classification model and (un)freeze its encoder.

    Args:
        model: An existing Hugging Face sequence classification model. If None,
               a model is loaded from CONFIG["model_name"] with `num_labels`.
        encoder_requires_grad (bool): If False, freeze the encoder (linear probe).
                                      If True, unfreeze the encoder (full finetune).

    Returns:
        The model with its encoder parameters' requires_grad set accordingly
        (only applied when the backbone is DistilBERT and accessible via
        `model.distilbert`).

    Notes:
        - This function assumes a DistilBERT-based classifier where the encoder
          module is exposed as `model.distilbert`.
        - If the provided model does not have a `distilbert` attribute, no
          parameters are modified.
    """
    # your code here
    if model is None:
        model = AutoModelForSequenceClassification.from_pretrained(
            CONFIG["model_name"], num_labels=CONFIG["num_labels"]
        )

    if hasattr(model, "distilbert"):
        for param in model.distilbert.parameters():
            param.requires_grad = encoder_requires_grad

    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    print(f"Trainable parameters: {trainable_params:,} | Frozen parameters: {frozen_params:,}")

    return model
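
As a sanity check on what stays trainable, the head size can be worked out by hand. The sketch assumes DistilBERT-base's hidden size of 768 and a head made of a 768-to-768 `pre_classifier` followed by a 768-to-2 `classifier`; the result matches the 592,130 trainable parameters reported by the head-only run.

```python
HIDDEN = 768      # DistilBERT-base hidden size (assumed)
NUM_LABELS = 2    # CONFIG["num_labels"]

def linear_params(in_features, out_features):
    """Weights plus biases of a dense layer."""
    return in_features * out_features + out_features

pre_classifier = linear_params(HIDDEN, HIDDEN)     # 590,592
classifier = linear_params(HIDDEN, NUM_LABELS)     # 1,538
head_params = pre_classifier + classifier

print(f"{head_params:,}")  # 592,130
```

That is under 1% of the model's roughly 67 million parameters, which is why head-only training is cheap.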





In [18]:
# Linear probe training (frozen encoder, train head only)
args = TrainingArguments(
    output_dir=os.path.join(CONFIG["output_dir"], "probe"),  # where to save checkpoints and logs
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],  # train batch size per device (GPU/CPU)
    per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],    # eval batch size per device (GPU/CPU)
    learning_rate=CONFIG["learning_rate_headonly_finetune"],            # optimizer learning rate for linear probe
    num_train_epochs=CONFIG["epochs_headonly_finetune"],                 # number of training epochs
    weight_decay=CONFIG["weight_decay"],                                 # L2 regularization strength
    warmup_ratio=CONFIG["warmup_ratio"],                                 # fraction of total steps for LR warmup
    logging_steps=50,                                                    # log metrics every N training steps
    eval_strategy="epoch",                                         # run evaluation at the end of each epoch
    save_strategy="epoch",                                               # save a checkpoint at the end of each epoch
    load_best_model_at_end=True,                                         # restore best checkpoint after training
    seed=SEED,                                                           # RNG seed for reproducibility
    report_to="none",                                                     # disable external logging integrations
)

# create frozen encoder model
frozen_model = build_model(encoder_requires_grad=False)
# create trainer
trainer_probe = Trainer(
    model=frozen_model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
# Train the model
print("\n🧠 Training Head-Only (Frozen Encoder)...")
trainer_probe.train()

# Test set evaluation: get evaluation metrics on the test set via trainer.evaluate(your test set),
# get raw test set logits via trainer.predict(your test set),
# then take the argmax of these logits (the predicted class)
print("\n📊 Evaluating Frozen Encoder Model...")
probe_metrics = trainer_probe.evaluate(tokenized_datasets["test"])
print(probe_metrics)

# Get predictions for confusion matrix and classification report
probs = trainer_probe.predict(tokenized_datasets["test"])
preds = np.argmax(probs.predictions, axis=-1)
labels = probs.label_ids
#print a confusion matrix and a Classification report (both imported above)
cm = confusion_matrix(labels, preds)
print("\nConfusion Matrix:\n", cm)
print("\nClassification Report:\n", classification_report(labels, preds, digits=3))



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Trainable parameters: 592,130 | Frozen parameters: 66,362,880

🧠 Training Head-Only (Frozen Encoder)...


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.3524 | 0.356524 | 0.844037 | 0.842593 |
| 2 | 0.3423 | 0.356477 | 0.83945 | 0.838337 |



📊 Evaluating Frozen Encoder Model...


{'eval_loss': 0.3091977536678314, 'eval_accuracy': 0.8617668893838158, 'eval_f1': 0.8714621013392241, 'eval_runtime': 12.8979, 'eval_samples_per_second': 522.18, 'eval_steps_per_second': 8.218, 'epoch': 2.0}

Confusion Matrix:
 [[2648  396]
 [ 535 3156]]

Classification Report:
               precision    recall  f1-score   support

           0      0.832     0.870     0.850      3044
           1      0.889     0.855     0.871      3691

    accuracy                          0.862      6735
   macro avg      0.860     0.862     0.861      6735
weighted avg      0.863     0.862     0.862      6735
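
The reported `eval_accuracy` and `eval_f1` can be re-derived from the confusion matrix above, which helps confirm how the numbers relate. The sketch assumes rows are true labels, columns are predictions, and F1 refers to the positive class (label 1).

```python
# Confusion matrix copied from the output above: rows = true label, cols = predicted.
cm = [[2648, 396],
      [535, 3156]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.3f} recall={recall:.3f} f1={f1:.4f}")
# accuracy=0.8618 precision=0.889 recall=0.855 f1=0.8715
```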




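## 7) (20 pts) **Full Fine-Tuning** (Encoder + Head)

Now unfreeze the encoder and fine-tune end-to-end. Full fine-tuning typically uses a smaller learning rate than head-only training (here 2e-5 vs 5e-4), so that updates do not overwrite the pretrained weights too aggressively. Because `build_model` receives `frozen_model`, this stage starts from the head trained in section 6 rather than from a fresh checkpoint.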


In [20]:
#modify args as appropriate for full finetuning
args_full = TrainingArguments(
    output_dir=os.path.join(CONFIG["output_dir"], "full"),
    per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
    per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],
    learning_rate=CONFIG["learning_rate_full_finetune"],
    num_train_epochs=CONFIG["epochs_full_finetune"],
    weight_decay=CONFIG["weight_decay"],
    warmup_ratio=CONFIG["warmup_ratio"],
    logging_steps=50,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    seed=SEED,
    report_to="none",
)
# unfreeze all model layers (model is from previous step) for full finetuning
full_model = build_model(frozen_model, encoder_requires_grad=True)
# create trainer
trainer_full = Trainer(
    model=full_model,
    args=args_full,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)
# Train the model
print("\n🔥 Training Full Fine-Tuned Model...")
trainer_full.train()
# Test set evaluation: get evaluation metrics on the test set via trainer.evaluate(your test set),
# get raw test set logits via trainer.predict(your test set),
# then take the argmax of these logits (the predicted class)
print("\n📊 Evaluating Full Fine-Tuned Model...")
full_metrics = trainer_full.evaluate(tokenized_datasets["test"])
print(full_metrics)

# Predictions and reports
probs_full = trainer_full.predict(tokenized_datasets["test"])
preds_full = np.argmax(probs_full.predictions, axis=-1)
labels_full = probs_full.label_ids
#print a confusion matrix and a Classification report (both imported above)
cm_full = confusion_matrix(labels_full, preds_full)
print("\nConfusion Matrix:\n", cm_full)
print("\nClassification Report:\n", classification_report(labels_full, preds_full, digits=3))

Trainable parameters: 66,955,010 | Frozen parameters: 0

🔥 Training Full Fine-Tuned Model...




| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.1845 | 0.236355 | 0.911697 | 0.9124 |
| 2 | 0.0916 | 0.30933 | 0.90367 | 0.906459 |



📊 Evaluating Full Fine-Tuned Model...


{'eval_loss': 0.15776830911636353, 'eval_accuracy': 0.9446176688938381, 'eval_f1': 0.9488831026449226, 'eval_runtime': 12.9235, 'eval_samples_per_second': 521.145, 'eval_steps_per_second': 8.202, 'epoch': 2.0}

Confusion Matrix:
 [[2900  144]
 [ 229 3462]]

Classification Report:
               precision    recall  f1-score   support

           0      0.927     0.953     0.940      3044
           1      0.960     0.938     0.949      3691

    accuracy                          0.945      6735
   macro avg      0.943     0.945     0.944      6735
weighted avg      0.945     0.945     0.945      6735
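
The two `TrainingArguments` blocks in sections 6 and 7 differ only in output folder, learning rate, and epoch count, which is the kind of repetition the note at the top asks to remove. A minimal sketch of one possible refactor follows; `training_kwargs` is a hypothetical helper that builds the shared keyword arguments, and `CONFIG` is repeated in trimmed form only so the sketch runs on its own.

```python
import os

# Trimmed copy of the notebook's CONFIG, repeated so the sketch is self-contained.
CONFIG = {
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 64,
    "epochs_headonly_finetune": 2,
    "epochs_full_finetune": 2,
    "learning_rate_headonly_finetune": 5e-4,
    "learning_rate_full_finetune": 2e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
    "output_dir": "checkpoints_sst2",
}
SEED = 42

def training_kwargs(stage, subdir):
    """Build keyword arguments for TrainingArguments.

    stage:  "headonly_finetune" or "full_finetune" (the CONFIG key suffixes)
    subdir: checkpoint folder name, e.g. "probe" or "full"
    """
    return dict(
        output_dir=os.path.join(CONFIG["output_dir"], subdir),
        per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
        per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],
        learning_rate=CONFIG[f"learning_rate_{stage}"],
        num_train_epochs=CONFIG[f"epochs_{stage}"],
        weight_decay=CONFIG["weight_decay"],
        warmup_ratio=CONFIG["warmup_ratio"],
        logging_steps=50,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        seed=SEED,
        report_to="none",
    )

probe_kwargs = training_kwargs("headonly_finetune", "probe")
full_kwargs = training_kwargs("full_finetune", "full")
# In the notebook: args = TrainingArguments(**probe_kwargs)
```

The same idea extends to the `Trainer` construction and the evaluate/predict/report sequence, which could live in one function that takes the model and the arguments.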




## 8) (10 pts) Compare Results

Let's quantify the improvement from full fine-tuning over the frozen-encoder baseline.
Report the accuracy delta between the first model and the second.


In [21]:
#your code here
# Extract accuracy values from the two evaluation results
acc_probe = probe_metrics["eval_accuracy"]
acc_full  = full_metrics["eval_accuracy"]

# Compute absolute and relative improvements
delta_acc = acc_full - acc_probe
rel_improvement = (delta_acc / acc_probe) * 100

print(f"Baseline (Frozen Encoder) Accuracy: {acc_probe:.4f}")
print(f"Full Fine-Tuned Accuracy:           {acc_full:.4f}")
print(f"Δ Accuracy: {delta_acc:.4f}  ({rel_improvement:.2f}% relative improvement)")


Baseline (Frozen Encoder) Accuracy: 0.8618
Full Fine-Tuned Accuracy:           0.9446
Δ Accuracy: 0.0829  (9.61% relative improvement)

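
The same comparison can also be expressed as a reduction in errors, which is sometimes easier to interpret when both accuracies are high. The sketch below uses the off-diagonal confusion-matrix counts from sections 6 and 7; framing the gain as relative error reduction is an additional view, not part of the original assignment.

```python
# Misclassified test examples, from the off-diagonal confusion-matrix entries.
errors_probe = 396 + 535   # frozen encoder
errors_full = 144 + 229    # full fine-tuning
test_size = 6735

delta_acc = (errors_probe - errors_full) / test_size
reduction = (errors_probe - errors_full) / errors_probe

print(f"errors: {errors_probe} -> {errors_full}")      # errors: 931 -> 373
print(f"accuracy delta: {delta_acc:.4f}")              # accuracy delta: 0.0829
print(f"relative error reduction: {reduction:.1%}")    # relative error reduction: 59.9%
```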