# NLP Sentiment Analysis - Step 3: Transformer Model: Fine-Tuning DistilBERT

In the previous notebooks, we explored the IMDb dataset (`01_data_eda.ipynb`) and built a baseline Logistic Regression model using TF-IDF features (`02_baseline_model.ipynb`). While the baseline performed reasonably well, it is limited in its ability to capture the deeper semantic meaning of text.
In this notebook, we move beyond the baseline and fine-tune a modern Transformer model (DistilBERT) for sentiment classification.

The goals of this step are:
- Load a pre-trained DistilBERT model from HuggingFace
- Tokenise the IMDb dataset using the model's tokeniser
- Fine-tune DistilBERT on the training dataset
- Evaluate the model on a held-out subset of the test split
- Compare results with the baseline Logistic Regression model


## 1. Imports
Import the required libraries and load the IMDb dataset.

In [2]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

# Load the IMDb dataset (train, test, unsupervised)
dataset = load_dataset("imdb")

## 2. Tokenisation
Transformers cannot work directly with raw text.
<br>Instead, they require token IDs that map to subword units.

- Load the pre-trained tokeniser for `distilbert-base-uncased`
- Apply tokenisation across the dataset
- Ensure each sequence has a fixed maximum length (e.g. 256 tokens)
- Use padding and truncation to handle reviews of different lengths

This step transforms each review into the numeric format required by the model.
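
To make padding and truncation concrete, here is a minimal sketch using a toy whitespace tokeniser. The `toy_encode` and `pad_or_truncate` helpers, the on-the-fly vocabulary, and `max_length=6` are illustrative assumptions, not the HuggingFace implementation. The real tokeniser also splits words into subword units and adds special tokens such as `[CLS]` and `[SEP]`.

```python
# Toy illustration only: a hypothetical whitespace tokeniser, not DistilBERT's WordPiece.
PAD_ID = 0

def toy_encode(text, vocab):
    """Map each lowercase word to an integer ID, adding new words to the vocabulary."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]

def pad_or_truncate(ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = ids[:max_length]                 # truncation: drop tokens past the limit
    mask = [1] * len(ids)                  # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 marks padding

vocab = {}
short_ids, short_mask = pad_or_truncate(toy_encode("great film", vocab), max_length=6)
long_ids, long_mask = pad_or_truncate(
    toy_encode("a slow start but a truly wonderful ending", vocab), max_length=6
)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short review is padded out to six positions and the long one is cut to six, so both can be stacked into the same batch. The attention mask tells the model which positions hold real tokens.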

In [3]:
tokeniser = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenise_function(examples):
    return tokeniser(examples["text"], truncation=True, padding="max_length", max_length=256)

tokenised_datasets = dataset.map(tokenise_function, batched=True)


## 3. Model Definition and Training Setup

Now the model and training configuration need to be defined.

- **Model**:
DistilBERT is a smaller, faster variant of BERT that retains roughly 97% of its language understanding capabilities. We add a classification head for **binary sentiment classification**.

- **Training Arguments**:
These control the training process (batch size, number of epochs, evaluation strategy, etc.).

- **Metrics**:
We use accuracy as our main evaluation metric. The HuggingFace `evaluate` library makes this straightforward.
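
As a sketch of what the accuracy metric computes, the snippet below takes a small array of made-up logits (one row per review, one column per class), picks the higher-scoring class with `argmax`, and compares the result against the labels. The values are illustrative, not model output.

```python
import numpy as np

# Made-up logits for four reviews; columns are [negative, positive].
logits = np.array([
    [ 2.1, -1.3],   # predicted negative
    [-0.4,  0.9],   # predicted positive
    [ 0.2,  0.1],   # predicted negative (narrowly)
    [-1.5,  2.2],   # predicted positive
])
labels = np.array([0, 1, 1, 1])   # 0 = negative, 1 = positive

predictions = logits.argmax(axis=-1)                     # index of the largest logit per row
accuracy_value = float((predictions == labels).mean())   # fraction of correct predictions
print(predictions, accuracy_value)
```

Three of the four predictions match their labels, so the accuracy here is 0.75.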

In [4]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs"
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 4. Training the Model

We use HuggingFace's high-level `Trainer` API, which abstracts away most of the boilerplate code needed to train deep learning models.

- Define the `Trainer` object with our model, datasets, training arguments and evaluation metric
- Fine-tune DistilBERT on a subset of the training dataset (for quicker experimentation)
- Monitor training and evaluation loss at the end of each epoch

In [5]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_datasets["train"].shuffle(seed=42).select(range(2000)),  # 2,000-review training subset for speed
    eval_dataset=tokenised_datasets["test"].shuffle(seed=42).select(range(1000)),  # 1,000-review validation subset drawn from the test split
    compute_metrics=compute_metrics
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.368461 | 0.838 |
| 2 | No log | 0.342227 | 0.859 |




TrainOutput(global_step=250, training_loss=0.36060275268554687, metrics={'train_runtime': 2453.4945, 'train_samples_per_second': 1.63, 'train_steps_per_second': 0.102, 'total_flos': 264934797312000.0, 'train_loss': 0.36060275268554687, 'epoch': 2.0})
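
The `global_step=250` in the output follows from the subset size and batch size. The short calculation below reproduces it, assuming a single device and no gradient accumulation (neither is set in the training arguments above). The `No log` entries in the Training Loss column are consistent with the `Trainer` reporting the training loss only every `logging_steps` steps. The value of 500 used here is an assumed library default that should be checked against the installed version.

```python
import math

train_examples = 2000      # size of the shuffled training subset
batch_size = 16            # per_device_train_batch_size
epochs = 2                 # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)

# Assumed default for TrainingArguments.logging_steps; verify for your version.
assumed_logging_steps = 500
training_loss_logged = total_steps >= assumed_logging_steps
print(training_loss_logged)
```

Under these assumptions the run never reaches the first logging step. Setting a smaller `logging_steps` (for example 25) in `TrainingArguments` should make the training loss appear in the per-epoch table.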

## 5. Model Evaluation and Comparison

### Logistic Regression (Baseline Model 2)
- Test Accuracy: **89.2%**
- Strengths: Lightweight, fast to train, surprisingly competitive performance on bag-of-words features
- Weaknesses: Limited ability to capture semantic nuance (e.g. sarcasm, context beyond word frequency)

### DistilBERT (Transformer Fine-tuning)
- Validation Accuracy after 2 epochs: **85.9%** (on a 1,000-review subset of the test split)
- Strengths: Captures context, word order, and nuanced semantics; generalises better on complex NLP tasks
- Weaknesses: Training is slower and requires more compute; with our current setup, performance did not surpass logistic regression

### Analysis
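Interestingly, the baseline **Logistic Regression slightly outperformed DistilBERT in this setup (89% vs 86%)**, although DistilBERT was fine-tuned on only 2,000 of the 25,000 training reviews and evaluated on a 1,000-review subset of the test split.
The result highlights that:
- Classical models can remain strong baselines, especially on well-structured datasets like IMDb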
- Pretrained transformers require careful fine-tuning (learning rate, batch size, number of epochs) to reach their full potential
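
Because DistilBERT was evaluated on only 1,000 reviews, its accuracy estimate carries sampling noise. A rough way to gauge this is the normal-approximation confidence interval for a proportion, applied below to the epoch-2 accuracy of 0.859. This is a back-of-the-envelope check rather than part of the original experiment, and it ignores the uncertainty in the baseline's own figure.

```python
import math

accuracy = 0.859   # epoch-2 accuracy from the training table
n = 1000           # number of evaluation reviews

standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
margin = 1.96 * standard_error            # approximate 95% interval half-width
lower, upper = accuracy - margin, accuracy + margin
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

The interval spans roughly two percentage points either side of the estimate, so small differences between runs on this subset should be read with caution.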

### Future Work
- Train DistilBERT for longer (3–5 epochs) and adjust learning rate schedule
- Try larger models like **BERT-base** or **RoBERTa**, which are reported to exceed 90% accuracy on IMDb
- Use regularisation (dropout, weight decay) to reduce overfitting
- Explore data augmentation (e.g. back translation) to improve robustness
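
As a starting point for the first item, the settings below sketch a longer run with a warm-up and a cosine learning-rate schedule. The specific values are untested assumptions rather than tuned recommendations, and the dictionary name `longer_run_args` is hypothetical. It is meant to be unpacked into `TrainingArguments` as shown in the trailing comment. Argument names can vary between `transformers` versions (for instance, older releases use `evaluation_strategy` in place of `eval_strategy`).

```python
# Hypothetical settings for a longer fine-tuning run; values are illustrative.
longer_run_args = {
    "output_dir": "./results_longer",
    "eval_strategy": "epoch",
    "save_strategy": "epoch",            # must match eval_strategy to reload the best checkpoint
    "learning_rate": 2e-5,
    "lr_scheduler_type": "cosine",       # decay the learning rate along a cosine curve
    "warmup_steps": 50,                  # ramp up the learning rate over the first 50 steps
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
    "load_best_model_at_end": True,      # restore the best checkpoint after training
    "metric_for_best_model": "accuracy",
}

# With transformers installed:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**longer_run_args)
```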

Even though DistilBERT did not outperform the baseline in this experiment, the comparison provides valuable insights into the trade-offs between **classical ML** and **transformer-based models** in NLP.