# Binary Classification with IMDb Dataset

## Introduction
In this notebook, I will be working with the **IMDb movie review dataset** from **Hugging Face**. This dataset consists of **two classes**: **positive** and **negative** reviews, making it a **binary classification task**.

## Objective
The goal is to **fine-tune a transformer-based pre-trained model** to classify movie reviews as either **positive or negative**. This approach leverages **transfer learning** to improve performance with minimal training time.

## Workflow
1. **Load the IMDb dataset** from Hugging Face.
2. **Preprocess the text** (tokenization using a transformer tokenizer).
3. **Fine-tune a transformer model** for sentiment classification.
4. **Evaluate performance** using accuracy, F1-score, and other metrics.
5. **Deploy with Gradio** so that users can input text and see classification predictions.


### Let's get started! 🚀


In [None]:
!pip install -q datasets

In [None]:
from datasets import load_dataset

ds = load_dataset("stanfordnlp/imdb")

In [None]:
ds

In [None]:
import matplotlib.pyplot as plt

# Compute word counts for each example in the training set
def compute_word_count(example):
    example["word_count"] = len(example["text"].split())
    return example

ds_train_word_counts = ds["train"].map(compute_word_count)
word_counts = ds_train_word_counts["word_count"]

# Plot the distribution
plt.hist(word_counts, bins=50)
plt.xlabel("Word Count")
plt.ylabel("Frequency")
plt.title("Word Count Distribution in Train Set")
plt.show()

In [None]:
def filter_under_300_words(example):
    return len(example["text"].split()) < 300

# Apply filtering to the full dataset
filtered_dataset = ds.filter(filter_under_300_words)

# Split into train and validation
filtered_train, filtered_val = filtered_dataset["train"].train_test_split(test_size=0.2, seed=42).values()

# Shuffle & select a smaller subset for training & validation
small_train_dataset = filtered_train.shuffle(seed=42).select(range(3000))
small_val_dataset = filtered_val.shuffle(seed=42).select(range(3000))

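We limit reviews to fewer than 300 words because BERT accepts at most 512 tokens per input. Word count is only a rough proxy for token count, so the longest reviews in the filtered set are still truncated when we tokenize with `max_length=256`.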
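
A review usually produces more tokens than words because BERT's WordPiece tokenizer splits words that are missing from its vocabulary into several subword pieces. The following is a minimal sketch of that greedy longest-match splitting. The vocabulary `toy_vocab` and the helper `wordpiece_split` are made up for illustration; the real tokenizer also handles punctuation, lowercasing, and special tokens.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real one.
toy_vocab = {"the", "movie", "was", "un", "##believ", "##ably", "good"}

sentence = "the movie was unbelievably good"
words = sentence.split()
tokens = [piece for word in words for piece in wordpiece_split(word, toy_vocab)]

print(len(words), "words ->", len(tokens), "tokens")
print(tokens)
```

In this toy case five words become seven tokens, which is why a word-count filter cannot guarantee that a review fits inside a fixed token budget.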

In [None]:
small_train_dataset[5]

In [None]:
small_val_dataset[5]

In [None]:
train_labels = small_train_dataset["label"]
val_labels = small_val_dataset["label"]

# Using Counter to count occurrences of each class
from collections import Counter

train_class_distribution = Counter(train_labels)
val_class_distribution = Counter(val_labels)

print("Train Class Distribution:", train_class_distribution)
print("Validation Class Distribution:", val_class_distribution)

The counts show that the classes are fairly balanced in both the training and validation sets, so we do not need to correct for class imbalance.
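
To judge balance with a number rather than by eye, the counts can be turned into proportions. This is a small sketch on made-up labels (`toy_labels`), with a hypothetical helper `class_proportions`; it is not run on the notebook's actual subsets.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of examples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration: 0 = negative, 1 = positive.
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]

proportions = class_proportions(toy_labels)
# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = max(proportions.values()) / min(proportions.values())

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

The same helper could be applied to `train_labels` and `val_labels` from the cell above.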

## Preprocess the text

In [None]:
!pip install -q transformers

We will use BERT, a transformer model pretrained on a large corpus of English text in a self-supervised fashion.

In [None]:
from transformers import AutoTokenizer

# Define model name
model_name = "bert-base-uncased"

# "bert-base-uncased" is the 12-layer base BERT model (about 110M parameters) with a lowercased vocabulary.

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Test the tokenizer with a sample sentence
sample_text = "This movie was absolutely fantastic!"
tokens = tokenizer(sample_text)

print(tokens)

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

# Apply tokenization efficiently
tokenized_small_train_dataset = small_train_dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

tokenized_small_val_dataset = small_val_dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)
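
Conceptually, `padding="max_length"` and `truncation=True` bring every example to the same length and record which positions hold real tokens. The sketch below shows that idea on plain lists. The helper `pad_or_truncate`, the token ids, and `pad_id=0` are assumptions for illustration, and the sketch ignores how the real tokenizer preserves special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it with pad_id up to max_length."""
    kept = token_ids[:max_length]
    num_padding = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * num_padding,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * num_padding,
    }


# Made-up token ids for illustration.
short_example = pad_or_truncate([7, 8, 9], max_length=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=5)

print(short_example)
print(long_example)
```

Padding to a fixed length keeps every batch the same shape, at the cost of computing over padded positions for short reviews.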

In [None]:
# Convert dataset to PyTorch format
columns = ["input_ids", "attention_mask", "token_type_ids", "label"]

tokenized_small_train_dataset.set_format(type="torch", columns=columns)
tokenized_small_val_dataset.set_format(type="torch", columns=columns)

## Dataloader

In [None]:
import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler
from tqdm.auto import tqdm
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Use GPU if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

In [None]:
# Create DataLoaders
train_dataloader = DataLoader(
    tokenized_small_train_dataset,
    shuffle=True,
    batch_size=8,
    drop_last=True,  # Prevents last batch size mismatches
    pin_memory=torch.cuda.is_available()  # Page-locked host memory speeds up CPU-to-GPU copies
)

val_dataloader = DataLoader(
    tokenized_small_val_dataset,
    batch_size=8,
    drop_last=False  # Keep all validation samples
)
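
The `drop_last` flag changes how many batches a loader yields whenever the dataset size is not a multiple of the batch size. Here is a short sketch of that arithmetic, using a hypothetical helper `num_batches` rather than the `DataLoader` itself.

```python
import math


def num_batches(num_examples, batch_size, drop_last):
    """Number of batches per epoch for a given dataset size."""
    if drop_last:
        return num_examples // batch_size  # the incomplete final batch is discarded
    return math.ceil(num_examples / batch_size)  # the incomplete final batch is kept


# 3000 examples with batch size 8 divide evenly, so the flag makes no difference.
print(num_batches(3000, 8, drop_last=True), num_batches(3000, 8, drop_last=False))

# With one extra example, the flag decides whether a batch of size 1 is produced.
print(num_batches(3001, 8, drop_last=True), num_batches(3001, 8, drop_last=False))
```

With the 3,000-example subsets used here, both loaders yield 375 full batches per epoch.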

In [None]:
# Load the pre-trained model

model_name = "bert-base-uncased"

# AutoModelForSequenceClassification puts a new, randomly initialized classification head
# (num_labels outputs) on top of the pretrained BERT encoder.

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.to(device)

The output above shows that the model ends in a classifier layer with two output features, which confirms that it is set up for binary classification.

To keep full control over the training process and gain a deeper understanding of how it works, we write the training loop manually. The same result could be achieved with the Trainer API, which automates these steps.

In [None]:
# Weight decay helps regularize the model and reduce overfitting.
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

num_epochs = 5
num_training_steps = num_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

This setup regularizes training through weight decay, fixes the total number of optimizer steps, and decays the learning rate linearly from 5e-5 to zero over those steps, which helps stabilize training.
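
To make the shape of the schedule concrete, the sketch below computes the learning rate by hand for a few steps. The function `linear_schedule_lr` is a hypothetical stand-in that follows the usual linear warmup-then-decay rule; it is not the library's code. The step count assumes 5 epochs of 375 batches (3,000 examples, batch size 8, `drop_last=True`) with no early stopping.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: go from base_lr down to 0 at the final step.
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


base_lr = 5e-5
total_steps = 5 * 375  # epochs * batches per epoch (assumed, see above)

for step in (0, 375, 750, 1125, 1500, 1875):
    lr = linear_schedule_lr(step, base_lr, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

Because the decay is spread over all planned steps, a run that stops early never reaches the smallest learning rates.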


## Full Training Loop for Fine-Tuning BERT

In [None]:
# Early stopping parameters to prevent overfitting
best_val_loss = float("inf")
patience = 2  # Number of epochs with no improvement to wait before stopping
counter = 0

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}", leave=False)

    # Training phase
    for batch in progress_bar:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["label"]
        )
        loss = outputs.loss
        total_train_loss += loss.item()

        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        progress_bar.set_postfix(loss=loss.item())

    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f"Epoch {epoch+1} average training loss: {avg_train_loss:.4f}")

    # Evaluation phase
    model.eval()
    total_val_loss = 0
    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for batch in val_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )
            loss = outputs.loss
            total_val_loss += loss.item()

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(batch["label"].cpu().numpy())

    avg_val_loss = total_val_loss / len(val_dataloader)
    print(f"Epoch {epoch+1} average validation loss: {avg_val_loss:.4f}")

    # Calculate evaluation metrics
    accuracy = accuracy_score(all_labels, all_predictions)
    f1 = f1_score(all_labels, all_predictions, average="binary")
    precision = precision_score(all_labels, all_predictions, average="binary")
    recall = recall_score(all_labels, all_predictions, average="binary")
    print("Validation Metrics:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")

    # Early stopping check: if validation loss doesn't improve, increment counter.
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        counter = 0  # Reset counter if improvement
        torch.save(model.state_dict(), "best_model.pth")  # Save best model when it improves
        print("New best model saved!")
    else:
        counter += 1
        if counter >= patience:
            print("Early stopping triggered")
            break
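
The early-stopping rule in the loop above can be checked in isolation by replaying a list of validation losses. This is a minimal sketch with made-up loss values (`toy_losses`) and a hypothetical helper `run_early_stopping`; it mirrors the counter logic but trains nothing.

```python
def run_early_stopping(val_losses, patience=2):
    """Return (best_epoch, last_epoch_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch  # stop after `patience` epochs without improvement
    return best_epoch, len(val_losses)


# Made-up validation losses for five epochs.
toy_losses = [0.40, 0.35, 0.37, 0.39, 0.30]
best_epoch, stopped_at = run_early_stopping(toy_losses, patience=2)

print(f"Best epoch: {best_epoch}, training stopped after epoch {stopped_at}")
```

In this toy run the loop stops after epoch 4 and never sees the lower loss at epoch 5, which is the trade-off of a small `patience`.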

## Evaluation on the test set

In [None]:
# Filter original test split
filtered_test = ds["test"].map(compute_word_count).filter(filter_under_300_words)

# Shuffle & subset
small_test_dataset = filtered_test.shuffle(seed=42).select(range(3000))

# Tokenize & remove 'text'
tokenized_small_test_dataset = small_test_dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["text"]
)

# Set format
tokenized_small_test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "token_type_ids", "label"])

# Create test dataloader
test_dataloader = DataLoader(
    tokenized_small_test_dataset,
    batch_size=8
)

In [None]:
# Load the best saved model before testing
model.load_state_dict(torch.load("best_model.pth", map_location=device))
model.eval()

# Initialize lists for test predictions and labels
test_predictions = []
test_labels = []

# Run model on the test dataset
with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["label"]
        )
        preds = torch.argmax(outputs.logits, dim=-1)
        test_predictions.extend(preds.cpu().numpy())
        test_labels.extend(batch["label"].cpu().numpy())

# Compute final metrics on test set
test_accuracy = accuracy_score(test_labels, test_predictions)
test_f1 = f1_score(test_labels, test_predictions, average="binary")

print(f"Final Test Accuracy: {test_accuracy:.4f}")
print(f"Final Test F1 Score: {test_f1:.4f}")

For binary sentiment classification, accuracy above roughly 85% is generally a strong result, especially for a model fine-tuned on only 3,000 reviews.

The F1 score of 0.8803 is high, which indicates that the model balances precision and recall well.
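
To see how F1 relates to precision and recall, the metrics can be computed by hand from the four confusion-matrix counts. The sketch below uses made-up labels and predictions and a hypothetical helper `binary_metrics`; the values are for illustration only and are not the notebook's results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with class 1 treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Because F1 is a harmonic mean, it drops sharply when either precision or recall is low, so a high F1 implies that both are reasonably high.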

### Quick test on unseen reviews

In [None]:
# Load the best saved model
model.load_state_dict(torch.load("best_model.pth", map_location=device))
model.eval()

# Sample unseen reviews
new_reviews = [
    "I absolutely loved this movie!",
    "The film was boring and too long",
    "An average experience, nothing spectacular"
]

# Tokenize input
inputs = tokenizer(new_reviews, padding=True, truncation=True, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1).cpu().numpy()

# Map labels
labels = ["Negative", "Positive"]
results = [labels[pred] for pred in predictions]

# Print results
for review, result in zip(new_reviews, results):
    print(f"Review: {review}\nPrediction: {result}\n")

The model outputs label 0 for negative reviews and label 1 for positive reviews.

## Uploading to the Hugging Face Hub

In [None]:
# !pip install huggingface_hub

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

In [None]:
# model.push_to_hub("Lpiziks2/imdb-bert-finetuned")
# tokenizer.push_to_hub("Lpiziks2/imdb-bert-finetuned")

## Deployment using Gradio

In [None]:
!pip install -q gradio

In [None]:
import gradio as gr
import torch.nn.functional as F

# Load the best saved model
model.load_state_dict(torch.load("best_model.pth", map_location=device))
model.eval()

def classify_text(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = F.softmax(outputs.logits, dim=-1)  # Convert logits to probabilities
        prediction = torch.argmax(probabilities, dim=-1).item()
        confidence = probabilities[0][prediction].item()  # Get confidence score

    # Map prediction to a label
    label = "Positive" if prediction == 1 else "Negative"
    return f"{label} ({confidence:.2f} confidence)"

# Create the Gradio interface with probability output
interface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="Enter a movie review here..."),
    outputs=gr.Textbox(label="Prediction"),
    title="Movie Review Sentiment Classifier",
    description="Enter a movie review and the model will predict whether the sentiment is Positive or Negative, along with confidence."
)

interface.launch(debug=True)
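
The confidence returned by `classify_text` is the softmax probability of the predicted class. The sketch below reproduces that calculation with the standard library only. The logits are made up for illustration, and `softmax` here is a hypothetical helper, not the PyTorch function.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one review: [negative score, positive score].
toy_logits = [-1.2, 2.3]

probabilities = softmax(toy_logits)
prediction = probabilities.index(max(probabilities))
label = "Positive" if prediction == 1 else "Negative"

print(f"{label} ({probabilities[prediction]:.2f} confidence)")
```

Softmax probabilities from a fine-tuned classifier are often overconfident, so the displayed value is best read as a relative score rather than a calibrated probability.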