# NLP Sentiment Analysis - Step 3: Transformer Model: Fine-tuning DistilBERT

In the previous notebooks, we explored the IMDb dataset (`01_data_eda.ipynb`) and built a baseline Logistic Regression model using TF-IDF features (`02_baseline_model.ipynb`). While the baseline performed reasonably well, it is limited in its ability to capture the deeper semantic meaning of text, because TF-IDF features ignore word order and context.
In this notebook, we move beyond that baseline and fine-tune a modern Transformer model (DistilBERT) for sentiment classification.

The goals of this step are:
- Load a pre-trained DistilBERT model from HuggingFace
- Tokenise the IMDb dataset using the model's tokeniser
- Fine-tune DistilBERT on the training dataset
- Evaluate the model on the validation and test sets
- Compare results with the baseline Logistic Regression model


## 1. Imports
Import the required libraries and load the IMDb dataset.

In [1]:
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate

# Load the IMDb dataset (train, test, unsupervised)
dataset = load_dataset("imdb")

## 2. Tokenisation
Transformers cannot work directly with raw text.
<br>Instead, they require token IDs that map to subword units.

In this step we:
- Load the pre-trained tokeniser for `distilbert-base-uncased`
- Apply tokenisation across the dataset
- Ensure each sequence has a fixed maximum length (256 tokens here)
- Use padding and truncation to handle reviews of different lengths

This step transforms each review into the numeric format required by the model.
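
To make padding and truncation concrete, here is a toy sketch in plain Python. It is not the HuggingFace implementation: the helper `pad_or_truncate`, the token IDs, and the padding ID of 0 are illustrative assumptions, and the real tokeniser also keeps special tokens such as `[CLS]` and `[SEP]` in place when it truncates. The sketch only shows how every sequence ends up with the same length, together with an attention mask that marks which positions hold real tokens.

```python
# Toy illustration only: a real tokeniser maps text to subword IDs from its vocabulary.
PAD_ID = 0  # assumed padding ID for this sketch


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]        # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)       # number of padding positions needed
    input_ids = kept + [pad_id] * n_pad  # padding: fill up to the limit
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short "review" (4 made-up token IDs) is padded up to 6 positions.
short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_length=6)
# A long "review" (10 made-up token IDs) is cut down to 6 positions.
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=256`, as used in this notebook, the same idea applies: shorter reviews are padded and longer reviews lose everything past the limit.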

In [2]:
tokeniser = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenise_function(examples):
    return tokeniser(examples["text"], truncation=True, padding="max_length", max_length=256)

tokenised_datasets = dataset.map(tokenise_function, batched=True)


## 3. Model Definition and Training Setup

Next, we define the model and the training configuration.

- **Model**: DistilBERT is a smaller, faster variant of BERT that still retains roughly 97% of its language understanding capabilities. We add a classification head for **binary sentiment classification**.

- **Training Arguments**: These control the training process (batch size, number of epochs, evaluation strategy, etc.).

- **Metrics**: We use accuracy as our main evaluation metric. The HuggingFace `evaluate` library makes this straightforward.
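
As a rough illustration of the metric, the sketch below computes accuracy from raw model outputs in plain Python. The logits and labels are made-up toy values and the helper names are hypothetical. It assumes the usual IMDb label convention (0 = negative, 1 = positive) and that the predicted class is the index of the largest logit in each row.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four toy examples, each with two logits: [negative score, positive score].
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 3 of 4 predictions match -> 0.75
```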

In [5]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs"
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
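
For a sense of scale, the training arguments above translate into a number of optimisation steps. The back-of-the-envelope sketch below assumes the full 25,000-review training split is used (no separate hold-out), a single device, and no gradient accumulation. None of these is confirmed by the notebook so far, so treat the result as an estimate.

```python
import math

num_train_examples = 25_000       # assumption: full IMDb train split
per_device_train_batch_size = 16  # from the training arguments
num_train_epochs = 2              # from the training arguments
num_devices = 1                   # assumption: single GPU or CPU
gradient_accumulation_steps = 1   # assumption: no accumulation

# Examples consumed per optimiser update.
effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 1563 3126
```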
