In [46]:
!jupyter nbextension disable --py widgetsnbextension

Disabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## IMDB Sentiment Classification with BERT Fine-Tuning

In this experiment, we explore how fine-tuning improves the performance of a pre-trained transformer model (BERT) on a downstream sentiment analysis task.

We use the IMDB movie reviews dataset, which consists of 50,000 labeled reviews (positive or negative). The goal is to classify the sentiment of each review correctly.

We start by evaluating the pre-trained BERT model (bert-base-uncased) on the test set without fine-tuning; this serves as our baseline. Because the classification head is newly initialized at this stage, the baseline is expected to sit near chance. We then fine-tune the model on the IMDB training data and evaluate it again to measure the improvement.

### Experiment Steps

1. Load and preprocess the IMDB dataset
2. Tokenize text using BERT tokenizer
3. Evaluate the baseline, i.e. the pre-trained model with no fine-tuning
4. Fine-tune BERT on the IMDB training data
5. Evaluate and compare model performance before and after fine-tuning

In [32]:
# Import the Libraries
from datasets import load_dataset
from collections import Counter
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
from transformers import TrainingArguments
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
import torch
import os
os.environ["WANDB_DISABLED"] = "true"

In [33]:
# Load IMDB dataset
dataset = load_dataset("imdb")
train_dataset = dataset["train"]
test_dataset = dataset["test"]


In [34]:
# A little bit of EDA
print(train_dataset.column_names)
print(train_dataset.features)
# Count labels in training dataset
train_counts = Counter(train_dataset['label'])
print("Train set label counts:", train_counts)

# Count labels in test dataset
test_counts = Counter(test_dataset['label'])
print("Test set label counts:", test_counts)

['text', 'label']
{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
Train set label counts: Counter({0: 12500, 1: 12500})
Test set label counts: Counter({0: 12500, 1: 12500})


The positive and negative classes are evenly distributed in both the train and test sets (12,500 reviews each), so no rebalancing is needed. Let's continue with the data as is.

In [35]:
# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [36]:
def tokenize_fn(batch):
  """
  Tokenizes each batch of text samples so they can be fed into the BERT model
  """
  return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)
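
To make the `padding="max_length"` and `truncation=True` arguments concrete, here is a minimal toy sketch of what they do to a sequence of token ids. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the ids are made up for illustration, and a pad id of 0 is assumed. The real BERT tokenizer also keeps the special `[CLS]` and `[SEP]` tokens when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" combined with truncation=True."""
    # Truncation: keep at most max_length ids.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    # Padding: extend both lists so every example has the same length.
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }


# A short sequence gets padded, a long one gets cut off.
short_example = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long_example = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_example)
print(long_example)
```

In the notebook, `max_length=256` plays the role of `max_length=6` here: every review ends up as exactly 256 ids, and the attention mask tells the model which positions to ignore.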


In [37]:
# Apply tokenize_fn to the whole dataset; batched=True passes the examples in batches
encoded_train = train_dataset.map(tokenize_fn, batched=True)
encoded_test = test_dataset.map(tokenize_fn, batched=True)

In [38]:
# Convert tokenized dataset columns to PyTorch tensors so they can be directly fed into the BERT model
encoded_train.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
encoded_test.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])


In [39]:
# Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
# Evaluation metric
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
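
As a small worked example of what `compute_metrics` calculates, the sketch below reproduces the same two steps in plain Python: take the argmax over each row of logits, then compute accuracy and the F1 score of the positive class (label 1, which is also the default for scikit-learn's `f1_score` on binary labels). The helper names and the toy logits are assumptions made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and positive-class F1 from raw logits."""
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Four toy examples: the last one is a positive review predicted as negative.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.2], [1.5, 0.2]]
toy_labels = [0, 1, 1, 1]
toy_metrics = accuracy_and_f1(toy_logits, toy_labels)
print(toy_metrics)
```

Here three of four predictions are correct (accuracy 0.75), and with two true positives, no false positives and one false negative, F1 is 2·2 / (2·2 + 0 + 1) = 0.8.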


In [41]:
# Evaluate before fine-tuning (baseline: pre-trained encoder, randomly initialized classifier head)
trainer = Trainer(model=model, tokenizer=tokenizer)
preds = trainer.predict(encoded_test)
baseline_metrics = compute_metrics(preds)
print("Before fine-tuning:", baseline_metrics)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


Before fine-tuning: {'accuracy': 0.5126, 'f1': 0.31224247897499574}
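
The baseline accuracy is close to chance, but the F1 score is much lower. A rough way to see why is to reconstruct the confusion matrix from the two reported numbers. The sketch below assumes the balanced 12,500/12,500 test split shown earlier and that F1 refers to the positive class; `confusion_from_metrics` is a hypothetical helper written for this illustration.

```python
def confusion_from_metrics(accuracy, f1, n_pos, n_neg):
    """Recover (TP, FP, FN, TN) from accuracy and positive-class F1."""
    n = n_pos + n_neg
    correct = round(accuracy * n)  # TP + TN
    # With FN = n_pos - TP and FP = TP + n_neg - correct,
    # F1 = 2TP / (2TP + FP + FN) simplifies to 2TP / (2TP + n - correct).
    tp = round(f1 * (n - correct) / (2 * (1 - f1)))
    fp = tp + n_neg - correct
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_from_metrics(
    accuracy=0.5126, f1=0.31224247897499574, n_pos=12_500, n_neg=12_500
)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("recall:", tp / (tp + fn))
print("share predicted positive:", (tp + fp) / (tp + fp + fn + tn))
```

Under these assumptions the untrained head labels only about a fifth of the reviews as positive, so recall on the positive class is roughly 0.22. That skew toward the negative class is what pulls F1 down to about 0.31 even though accuracy stays near 0.5.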


In [42]:
# Configure the fine-tuning run (TrainingArguments was imported at the top of the notebook)
training_args = TrainingArguments(
    output_dir="./results", # Directory where model checkpoints and final models will be saved
    eval_strategy="epoch", # Run evaluation at the end of each training epoch
    save_strategy="epoch", # Save model checkpoints at the end of each epoch
    learning_rate=2e-5, # Learning rate for the optimizer
    per_device_train_batch_size=8, # Batch size per GPU/CPU for training
    per_device_eval_batch_size=8, # Batch size per GPU/CPU for evaluation
    num_train_epochs=5,  # Total number of passes through the training dataset
    weight_decay=0.01, # Weight decay applied by the AdamW optimizer to reduce overfitting
    logging_dir="./logs",  # Directory where training logs (for TensorBoard) will be saved
    logging_steps=100, # Log training metrics every 100 steps
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
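
The batch size and epoch count above determine how many optimizer steps the run will take. A quick back-of-the-envelope sketch, assuming a single device and no gradient accumulation (neither is stated explicitly in this notebook):

```python
import math

num_train_examples = 25_000  # size of the IMDB training split
per_device_batch_size = 8    # per_device_train_batch_size
num_epochs = 5               # num_train_epochs
num_devices = 1              # assumption: one GPU/CPU
grad_accum_steps = 1         # assumption: no gradient accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print("steps per epoch:", steps_per_epoch)
print("total optimizer steps:", total_steps)
```

This gives 3,125 steps per epoch and 15,625 steps in total, which agrees with the `global_step` reported by `trainer.train()` further down.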


In [43]:
trainer = Trainer(
    model=model, # The pre-trained BERT model to fine-tune
    args=training_args, # Training configuration defined in TrainingArguments
    train_dataset=encoded_train.shuffle(seed=42),  # Full training dataset, shuffled
    eval_dataset=encoded_test, # Full test dataset for evaluation
    tokenizer=tokenizer, # The BERT tokenizer used for encoding text inputs
    compute_metrics=compute_metrics, # Function to calculate evaluation metrics (e.g., accuracy, F1)
)


In [44]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 0.2345 | 0.330994 | 0.91012 | 0.9134 |
| 2 | 0.2555 | 0.310378 | 0.90976 | 0.914011 |
| 3 | 0.1214 | 0.414992 | 0.91964 | 0.919301 |
| 4 | 0.026 | 0.558844 | 0.91984 | 0.919213 |
| 5 | 0.0225 | 0.577345 | 0.92088 | 0.921183 |


TrainOutput(global_step=15625, training_loss=0.13500612687200308, metrics={'train_runtime': 7716.3001, 'train_samples_per_second': 16.199, 'train_steps_per_second': 2.025, 'total_flos': 1.644444096e+16, 'train_loss': 0.13500612687200308, 'epoch': 5.0})
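
The per-epoch log shows two trends that pull in different directions: validation loss is lowest early and then climbs while training loss drops to about 0.02, which usually points to overfitting, yet accuracy keeps inching upward. The sketch below copies the logged values into a list (the `history` structure is just an illustrative container, not something the `Trainer` returns in this form) and picks the best epoch under each criterion.

```python
# Values copied from the training log above.
history = [
    {"epoch": 1, "val_loss": 0.330994, "accuracy": 0.91012},
    {"epoch": 2, "val_loss": 0.310378, "accuracy": 0.90976},
    {"epoch": 3, "val_loss": 0.414992, "accuracy": 0.91964},
    {"epoch": 4, "val_loss": 0.558844, "accuracy": 0.91984},
    {"epoch": 5, "val_loss": 0.577345, "accuracy": 0.92088},
]

# The "best" checkpoint depends on which metric is used for selection.
best_by_loss = min(history, key=lambda row: row["val_loss"])
best_by_accuracy = max(history, key=lambda row: row["accuracy"])

print("lowest validation loss: epoch", best_by_loss["epoch"])
print("highest accuracy: epoch", best_by_accuracy["epoch"])
```

Since the `TrainingArguments` above do not set `load_best_model_at_end`, the model used for the final evaluation is the one from the end of training, i.e. epoch 5. If calibrated probabilities mattered more than raw accuracy, the epoch 2 checkpoint might be the better choice.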

In [45]:
# Evaluate after fine-tuning
preds_after = trainer.predict(encoded_test)
finetuned_metrics = compute_metrics(preds_after)
print("After fine-tuning:", finetuned_metrics)

After fine-tuning: {'accuracy': 0.92088, 'f1': 0.9211826585910106}
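
To close the last step of the experiment, the before and after numbers can be put side by side. The values below are copied from the two outputs above so the snippet runs on its own; in the notebook itself, `baseline_metrics` and `finetuned_metrics` already hold them.

```python
# Metrics copied from the notebook outputs above.
baseline_metrics = {"accuracy": 0.5126, "f1": 0.31224247897499574}
finetuned_metrics = {"accuracy": 0.92088, "f1": 0.9211826585910106}

# Absolute improvement per metric.
improvement = {
    name: finetuned_metrics[name] - baseline_metrics[name]
    for name in baseline_metrics
}

for name, delta in improvement.items():
    print(
        f"{name}: {baseline_metrics[name]:.4f} -> "
        f"{finetuned_metrics[name]:.4f} (+{delta:.4f})"
    )
```

Fine-tuning lifts accuracy by about 41 percentage points and F1 by about 61 points relative to the untrained classification head.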
