# Fine-tuning BERT on IMDB Sentiment Data

This notebook demonstrates how to fine-tune a pretrained BERT model for sentiment classification using an IMDB dataset. It includes steps for loading the dataset, preprocessing the text, training the model, evaluating it, and generating predictions along with confidence scores.


## 🔍 What the Notebook Does

This notebook performs the **full fine-tuning pipeline** on the IMDB dataset:

-  Loads the **pre-cleaned IMDB reviews and sentiment labels**.
-  Encodes the sentiment labels (`positive` → 1, `negative` → 0).
-  Splits the dataset into **training and validation sets**.
-  Loads the **BERT tokenizer and base model** (`bert-base-uncased`).
-  Wraps the text and labels in a `torch.utils.data.Dataset` object.
-  Defines **training hyperparameters** and **evaluation strategy**.
-  **Fine-tunes** the model on your sentiment classification task using Hugging Face's `Trainer` API.
-  Saves the **fine-tuned model and tokenizer** locally.
-  Reloads the fine-tuned model and runs **inference on the full dataset**, generating:
  - **Predicted sentiment** (`positive` / `negative`)
  - **Confidence score** (from softmax probabilities)


## 1. Imports and Setup

We begin by importing necessary libraries for data handling, model loading, training, and evaluation.


In [1]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
import pandas as pd
from torch.utils.data import Dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split



## 2. Load and Encode Dataset

Load the cleaned IMDB dataset and encode sentiment labels to numerical values using `LabelEncoder`.


In [2]:
df = pd.read_csv("../data/clean_imdb_dataset.csv")
label_encoder = LabelEncoder()
df["label"] = label_encoder.fit_transform(df["sentiment"])
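`LabelEncoder` assigns integer codes in sorted order of the class names, so the mapping is worth checking rather than assuming. A minimal sketch, using a small made-up list of sentiment strings (the toy values are illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the "sentiment" column; values are illustrative only.
toy_sentiments = ["positive", "negative", "negative", "positive"]

encoder = LabelEncoder()
toy_labels = encoder.fit_transform(toy_sentiments)

# classes_ is sorted alphabetically, so "negative" gets code 0 and "positive" gets code 1.
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)
print(toy_labels.tolist())
```

Inspecting `classes_` this way confirms which integer corresponds to which sentiment before the labels reach the model.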

## 3. Train-Test Split

Split the dataset into training and validation sets (90% train, 10% validation).

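Note: the 90/10 ratio is a judgment call, not a tuned value; it keeps most of the data for fine-tuning while reserving a small held-out set for evaluation.

In [3]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    # test_size is the validation fraction: 0.1 yields 90% train, 10% validation
    df["review"].tolist(), df["label"].tolist(), test_size=0.1, random_state=42
)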
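If class balance matters, `train_test_split` can also stratify on the labels so both subsets keep the same positive/negative ratio. The sketch below uses synthetic placeholder data (the `texts` and `labels` lists are invented for illustration) and shows an optional variation, not what the cell above does:

```python
from sklearn.model_selection import train_test_split

# Synthetic, perfectly balanced placeholder data (not the IMDB reviews).
texts = [f"review {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# stratify=labels keeps the class ratio identical in both subsets.
tr_texts, va_texts, tr_labels, va_labels = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)

print(len(tr_texts), len(va_texts))
print(sum(va_labels), "positive examples in the validation subset")
```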

## 4. Load Tokenizer

We load the `bert-base-uncased` tokenizer from Hugging Face, which converts text into tokens the model understands.


In [4]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

## 5. Create Dataset Wrapper

Define a custom dataset class to prepare tokenized inputs and labels.

In [6]:
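class IMDBDataset(Dataset):
    """Tokenizes all texts up front and serves one example per index."""

    def __init__(self, texts, labels, tokenizer, max_len=512):
        # 512 tokens is the longest input bert-base-uncased accepts; longer reviews are truncated.
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=max_len, return_tensors="pt")
        self.labels = labels

    def __getitem__(self, idx):
        # Slice every encoding tensor (input_ids, attention_mask, ...) at this index.
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)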
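
To see what a single item looks like without downloading a tokenizer, the indexing logic can be exercised with hand-built tensors. This is only a sketch: `ToyEncodedDataset` and the token ids are hypothetical stand-ins for the output of a real tokenizer call.

```python
import torch
from torch.utils.data import Dataset


class ToyEncodedDataset(Dataset):
    """Same indexing pattern as IMDBDataset, but takes ready-made encodings."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick row idx from every tensor, then attach the label.
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


# Made-up ids for two padded sequences; 0 marks padding in this toy example.
encodings = {
    "input_ids": torch.tensor([[101, 2023, 102, 0], [101, 2919, 3185, 102]]),
    "attention_mask": torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]]),
}
toy_dataset = ToyEncodedDataset(encodings, labels=[1, 0])
first = toy_dataset[0]
print(sorted(first.keys()))
print(first["input_ids"].shape)
```

Each item is a dictionary of per-example tensors plus a `labels` entry, which is the shape of input `Trainer` expects from a dataset.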

## 6. Prepare Dataset Objects

Convert training and validation data into instances of `IMDBDataset`.


In [7]:
train_dataset = IMDBDataset(train_texts, train_labels, tokenizer)
val_dataset = IMDBDataset(val_texts, val_labels, tokenizer)

## 7. Load Pretrained BERT Model

Load the pretrained BERT model with a classification head (2 output labels: positive and negative).

In [8]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 8. Training Arguments

Define how the model should be trained, including learning rate, batch size, number of epochs, and evaluation strategy.


In [9]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    save_total_limit=1,
)
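
These settings determine how many optimizer steps the run takes. A rough back-of-the-envelope calculation, assuming a 50,000-review dataset (the notebook does not state the row count), a single device, and no gradient accumulation:

```python
import math

num_reviews = 50_000  # assumption: the size of the cleaned CSV is not given in the notebook
batch_size = 16       # per_device_train_batch_size
epochs = 3            # num_train_epochs

val_size = num_reviews // 10          # 10% held out for validation
train_size = num_reviews - val_size   # remaining 90% used for training

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(train_size / batch_size)
total_steps = steps_per_epoch * epochs
print(train_size, steps_per_epoch, total_steps)
```

With a different row count, device count, or gradient accumulation setting, the step totals change accordingly.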



## 9. Define Evaluation Metric

Use the `evaluate` library to calculate accuracy on the validation set.

In [10]:
from evaluate import load
import numpy as np

accuracy = load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return accuracy.compute(predictions=preds, references=labels)
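
`compute_metrics` receives raw logits, so the predicted class is the index of the larger logit in each row. The toy example below (made-up logits and labels) reproduces that step with plain NumPy instead of the `evaluate` library, purely for illustration:

```python
import numpy as np

# Made-up logits for four reviews; columns are (negative, positive).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]])
labels = np.array([0, 1, 0, 0])

# Same argmax step as compute_metrics: pick the higher-scoring class per row.
preds = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the reference labels.
toy_accuracy = float((preds == labels).mean())
print(preds.tolist(), toy_accuracy)
```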

## 10. Train the Model

Initialize the `Trainer` class and train the model using the training dataset.

In [11]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()

Output: the run was stopped with a `KeyboardInterrupt` before any epoch metrics (training loss, validation loss) were logged.


## 11. Save Fine-tuned Model

Save the trained model and tokenizer locally for later inference.

In [None]:
model.save_pretrained("./fine_tuned_bert_imdb")
tokenizer.save_pretrained("./fine_tuned_bert_imdb")

## 12. Use Fine-tuned Model for Predictions on Full Dataset

Reload the model and tokenizer, and apply them to the entire dataset to predict sentiment labels and confidence scores.


In [None]:
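# Reload model and tokenizer
model = BertForSequenceClassification.from_pretrained("./fine_tuned_bert_imdb")
tokenizer = BertTokenizer.from_pretrained("./fine_tuned_bert_imdb")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

texts = df["review"].tolist()
batch_size = 32  # assumed value; lower it if memory runs out
predictions, confidences = [], []

# Run inference in batches; encoding and scoring every review at once would exhaust memory
with torch.no_grad():
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        encodings = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
        encodings = {k: v.to(device) for k, v in encodings.items()}
        outputs = model(**encodings)
        probs = torch.nn.functional.softmax(outputs.logits, dim=1)
        # Confidence is the probability of the predicted (highest-scoring) class
        batch_conf, batch_pred = torch.max(probs, dim=1)
        predictions.extend(batch_pred.tolist())
        confidences.extend(batch_conf.tolist())

# Decode predictions
predicted_labels = label_encoder.inverse_transform(predictions)

# Save results to DataFrame
df["predicted_sentiment"] = predicted_labels
df["confidence"] = confidences

# Export results
df.to_csv("../data/imdb_predictions_with_confidence.csv", index=False)
print("Saved predictions to ../data/imdb_predictions_with_confidence.csv")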
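
The confidence score is the softmax probability of the predicted class, so for two classes it always falls between 0.5 and 1.0. A small NumPy sketch with made-up logits (a stand-in for `outputs.logits`, not real model output) shows the calculation:

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits for three reviews; columns are (negative, positive).
logits = np.array([[2.0, 0.0], [-1.0, 1.0], [0.3, 0.3]])

probs = softmax(logits)
toy_predictions = probs.argmax(axis=1)   # index of the most probable class
toy_confidences = probs.max(axis=1)      # probability assigned to that class
print(toy_predictions.tolist())
print(np.round(toy_confidences, 3).tolist())
```

A confidence near 0.5, as in the last row, means the model is effectively undecided between the two labels.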
