# Exercise 4.1 — Fine-Tuning a Transformer for Text Classification

In this exercise you will:
- Load and explore a text classification dataset.
- Run an off-the-shelf sentiment model as a zero-shot baseline (it has not been trained on IMDb).
- Fine-tune a model on your dataset using the Hugging Face `Trainer` API.
- Evaluate your fine-tuned model and compare it with the baseline.

> **Learning goals:**  
> • Practise using pre-trained models for downstream tasks.  
> • Learn the tokenisation and training workflow in `transformers`.  
> • Observe accuracy improvements after fine-tuning.


# 1. Environment Setup
### Install the Hugging Face and evaluation libraries we need.

In [1]:
!pip install -q transformers datasets evaluate accelerate

import numpy as np
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, pipeline)
import evaluate
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

Using device: cpu


# 2. Load the IMDb Dataset
### We use a random subset (5,000 training and 2,000 test reviews, out of 25,000 in each split) for faster training in Colab.
### Each sample is a movie review labelled positive (1) or negative (0).
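Shuffling before `select` matters whenever a split is stored in label order: taking the first N rows of such a split would give you a single class. The sketch below illustrates the idea with NumPy on a hypothetical, label-sorted toy array. `random_subset_indices` is a made-up helper, and it does not reproduce the exact rows that `datasets`' own `shuffle(seed=42)` picks.

```python
import numpy as np

def random_subset_indices(n_total, n_keep, seed=42):
    """Hypothetical helper: n_keep distinct row indices in a reproducible random order."""
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(n_total)  # shuffle all row indices once
    return permutation[:n_keep]             # keep the first n_keep shuffled rows

# Toy labels standing in for a split stored in label order (not real IMDb data)
toy_labels = np.array([0] * 500 + [1] * 500)

naive = toy_labels[:100]                                              # first 100 rows: one class only
shuffled = toy_labels[random_subset_indices(len(toy_labels), 100)]   # mixed classes

print("positives in first 100 rows:", naive.sum())
print("positives in shuffled 100 rows:", shuffled.sum())
```

Fixing the seed keeps the subset identical between runs, so results stay comparable when you change hyperparameters.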

In [2]:
dataset = load_dataset("imdb")
dataset["train"] = dataset["train"].shuffle(seed=42).select(range(5000))   # random 5k reviews for training
dataset["test"] = dataset["test"].shuffle(seed=42).select(range(2000))     # random 2k reviews for testing

print(dataset)
print("\nExample review:", dataset["train"][0])

# 3. Quick Baseline: Zero-Shot Sentiment Analysis
### Use a pre-trained model already fine-tuned on SST-2 (sentences from movie reviews) to get baseline predictions without any training on IMDb.

In [None]:
zero_shot_classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if device == "cuda" else -1
)

examples = [
    "This movie was surprisingly good!",
    "I found the plot boring and predictable."
]
zero_shot_classifier(examples)
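The pipeline returns one dictionary per input with a `label` string and a `score`. To compare this baseline with the fine-tuned model, its outputs need to be mapped onto IMDb's integer labels (0 = negative, 1 = positive). A minimal sketch, assuming the SST-2 model's label names are `NEGATIVE` and `POSITIVE`; the outputs and gold labels below are made-up placeholders, and `pipeline_outputs_to_ids` and `simple_accuracy` are hypothetical helpers:

```python
def pipeline_outputs_to_ids(outputs, label2id=None):
    """Map dicts like {"label": "POSITIVE", "score": 0.99} to integer labels."""
    label2id = label2id or {"NEGATIVE": 0, "POSITIVE": 1}  # assumed SST-2 label names
    return [label2id[o["label"]] for o in outputs]

def simple_accuracy(pred_ids, ref_ids):
    """Fraction of predictions that match the reference labels."""
    assert len(pred_ids) == len(ref_ids)
    correct = sum(p == r for p, r in zip(pred_ids, ref_ids))
    return correct / len(ref_ids)

# Placeholder outputs in the shape the pipeline returns (scores are invented)
outputs = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.97},
    {"label": "POSITIVE", "score": 0.62},
]
references = [1, 0, 0]  # invented gold labels, IMDb convention (1 = positive)

preds = pipeline_outputs_to_ids(outputs)
print(preds, simple_accuracy(preds, references))
```

On real IMDb reviews, many of which are longer than the model's 512-token limit, you will probably need to pass `truncation=True` when calling the pipeline.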

# 4. Tokenisation
### Transformers take tokenised inputs: text split into word pieces, each mapped to an integer ID.
### We also truncate long reviews and pad short ones to a fixed length (256 tokens) so they can be batched together.

In [None]:
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenised_ds = dataset.map(tokenize, batched=True)
tokenised_ds = tokenised_ds.rename_column("label", "labels")
tokenised_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
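What `truncation=True` and `padding="max_length"` produce for a single review can be sketched in plain Python. This is a simplified model, not the tokenizer's actual code: it assumes the BERT-style special-token IDs 101 (`[CLS]`), 102 (`[SEP]`) and 0 (`[PAD]`) used by `distilbert-base-uncased` (check `tokenizer.cls_token_id`, `tokenizer.sep_token_id` and `tokenizer.pad_token_id`), and the word-piece IDs are made up.

```python
# Assumed BERT-uncased special-token IDs (verify with tokenizer.cls_token_id etc.)
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_and_truncate(token_ids, max_length, cls_id=CLS_ID, sep_id=SEP_ID, pad_id=PAD_ID):
    """Simplified model of truncation=True + padding="max_length" for one text."""
    body = list(token_ids[: max_length - 2])   # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [pad_id] * n_pad              # fill up to max_length
    attention_mask += [0] * n_pad              # 0 = padding, masked out in attention
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up word-piece IDs: one short input, one that is too long
padded = pad_and_truncate([7, 8], max_length=6)
truncated = pad_and_truncate(list(range(1000, 1010)), max_length=6)
print(padded)
print(truncated)
```

Every example ends up exactly `max_length` long, which is what lets the batches be stacked into tensors; the attention mask tells the model which positions are padding.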

# 5. Load Pre-trained Model for Sequence Classification
### We add a newly initialised classification head (2 labels) on top of DistilBERT; its weights are learned during fine-tuning.

In [None]:
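id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # IMDb convention; gives readable pipeline labels instead of LABEL_0/LABEL_1
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=2, id2label=id2label, label2id={v: k for k, v in id2label.items()}).to(device)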
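The new head is small compared with the encoder. As we understand DistilBERT's sequence-classification head, it takes the hidden state of the first (`[CLS]`) token, passes it through a dense layer with ReLU and dropout, and then through a linear layer producing `num_labels` logits. The NumPy sketch below mirrors that shape flow with random weights and a random stand-in for the encoder output; dropout is omitted, and the sizes (768 hidden units, a batch of 4) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels, batch, seq_len = 768, 2, 4, 256  # illustrative sizes

# Random stand-in for the encoder output: (batch, seq_len, hidden_size)
hidden_states = rng.normal(size=(batch, seq_len, hidden_size))

# Small random weights, playing the role of the newly initialised head
W_pre = rng.normal(scale=0.02, size=(hidden_size, hidden_size))
b_pre = np.zeros(hidden_size)
W_cls = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b_cls = np.zeros(num_labels)

def classification_head(h):
    """First-token vector -> dense + ReLU -> linear layer to num_labels logits."""
    cls_vec = h[:, 0]                             # (batch, hidden_size)
    x = np.maximum(0.0, cls_vec @ W_pre + b_pre)  # dense layer + ReLU (dropout omitted)
    return x @ W_cls + b_cls                      # (batch, num_labels)

head_logits = classification_head(hidden_states)
print(head_logits.shape)
```

Because these head weights start random, the "some weights were newly initialized" warning shown when loading the model is expected.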


# 6. Define Metrics
### We'll use accuracy and F1 score to evaluate the model.

In [None]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels)["f1"]
    }
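To see what `compute_metrics` reports, here is the same calculation done by hand in NumPy on a toy batch. It assumes, as we understand `evaluate`'s `f1` metric, that the default is binary F1 with label 1 (positive) as the positive class. `manual_metrics` and the toy logits are hypothetical.

```python
import numpy as np

def manual_metrics(logits, labels, pos_label=1):
    """Accuracy and binary F1 computed from raw logits, for illustration."""
    preds = np.argmax(logits, axis=-1)               # same step as compute_metrics
    tp = np.sum((preds == pos_label) & (labels == pos_label))
    fp = np.sum((preds == pos_label) & (labels != pos_label))
    fn = np.sum((preds != pos_label) & (labels == pos_label))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": float(np.mean(preds == labels)), "f1": float(f1)}

# Toy batch: argmax gives predictions [0, 1, 1, 0]
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
print(manual_metrics(toy_logits, toy_labels))
```

Here three of four predictions are right (accuracy 0.75), while F1 is 0.8: precision is 1.0 (no false positives) but recall is 2/3 (one positive review was missed).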

# 7. Training Configuration
### Define hyperparameters and training behaviour.

In [None]:
batch_size = 16
args = TrainingArguments(
    output_dir="sentiment-model",          # where to save model
    eval_strategy="epoch",                 # evaluate at the end of each epoch (evaluation_strategy in older releases)
    save_strategy="no",                    # don't save checkpoints for this demo
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=2,                    # keep low for faster runs
    weight_decay=0.01,
    logging_steps=100,
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenised_ds["train"],
    eval_dataset=tokenised_ds["test"],
    processing_class=tokenizer,            # older releases use tokenizer=tokenizer
    compute_metrics=compute_metrics
)
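With these settings you can work out how many optimisation steps the run will take and how the learning rate changes. The sketch assumes the `Trainer` defaults as we understand them: a single device, no gradient accumulation, the last partial batch kept, and a linear decay from `learning_rate` to zero with no warmup. The helper names are hypothetical.

```python
import math

def total_training_steps(n_examples, batch_size, num_epochs):
    """Optimiser steps for a run, keeping the last partial batch of each epoch."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * num_epochs

def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup (if any), then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

steps = total_training_steps(5000, 16, 2)
print(steps)
print(linear_lr(0, steps, 2e-5), linear_lr(steps // 2, steps, 2e-5), linear_lr(steps, steps, 2e-5))
```

For 5,000 training reviews at batch size 16 this gives 313 steps per epoch and 626 in total, so `logging_steps=100` produces about six training-loss logs, and the learning rate has halved by the end of the first epoch.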

# 8. Fine-Tune the Model
### On a free Colab GPU this takes roughly 8–12 minutes; on a CPU it will be much slower.

In [None]:
trainer.train()

# 9. Evaluate on Test Set
### Compute the loss, accuracy and F1 of the fine-tuned model on the 2,000-review test subset.

In [None]:
metrics = trainer.evaluate()
print(metrics)
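Accuracy and F1 summarise performance in two numbers; a confusion matrix shows which kind of mistake the model makes. `trainer.predict(tokenised_ds["test"])` returns the raw logits in `.predictions` and the gold labels in `.label_ids`, which could be passed to a helper like the hypothetical `confusion_matrix_2x2` below (shown here on toy arrays, not real results):

```python
import numpy as np

def confusion_matrix_2x2(logits, labels):
    """Rows = true label, columns = predicted label (0 = negative, 1 = positive)."""
    preds = np.argmax(logits, axis=-1)
    cm = np.zeros((2, 2), dtype=int)
    for true, pred in zip(labels, preds):
        cm[true, pred] += 1
    return cm

# Toy arrays: predictions are [0, 1, 1, 0]
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
print(confusion_matrix_2x2(toy_logits, toy_labels))
```

The bottom-left cell counts positive reviews predicted as negative; comparing it with the top-right cell tells you whether the model leans towards one class.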

# 10. Run Inference with the Fine-Tuned Model
### Use your trained model to predict sentiment on new sentences.

In [None]:
sentiment_model = pipeline(
    "sentiment-analysis",
    model=trainer.model,
    tokenizer=tokenizer,
    device=0 if device == "cuda" else -1
)

sentiment_model([
    "I absolutely loved this film!",
    "The acting was terrible and I would not recommend it."
])
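For a two-label classifier, the pipeline turns the model's logits into probabilities with a softmax and reports the top label with its probability as `score`. The sketch below reproduces that step in NumPy under that assumption. It also shows why the label names matter: if the model was loaded without `id2label`, the config falls back to the generic names `LABEL_0` and `LABEL_1`. `logits_to_prediction` is a hypothetical helper.

```python
import numpy as np

def logits_to_prediction(logits, id2label):
    """Softmax over classes, then report the top label and its probability."""
    shifted = logits - np.max(logits)             # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    top = int(np.argmax(probs))
    return {"label": id2label[top], "score": float(probs[top])}

example_logits = np.array([-1.0, 1.0])            # made-up logits for one sentence
print(logits_to_prediction(example_logits, {0: "NEGATIVE", 1: "POSITIVE"}))
print(logits_to_prediction(example_logits, {0: "LABEL_0", 1: "LABEL_1"}))
```

Only the names differ between the two calls; the predicted class and its probability are the same.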

## ✍️ Reflection

When you have finished:
- Summarise the steps you followed.
- Describe any adjustments you made (epochs, batch size, learning rate).
- Report your final **accuracy** and **F1 score**.
- Reflect on:
  - How the fine-tuned model compared to the zero-shot baseline.
  - Any challenges (runtime, token length, memory limits).