# Transformers in Action: Fine-Tuning BERT for Sentiment Analysis

### Lead ML Researcher & Tech Writer: [Your Name/Portfolio]

## 1. Executive Summary

This notebook demonstrates the process of fine-tuning **BERT (Bidirectional Encoder Representations from Transformers)** for the downstream task of Sentiment Analysis using the IMDB dataset. 

While traditional models such as LSTMs or Bag-of-Words struggle with long-range dependencies and polysemy, BERT uses **Self-Attention** to produce contextualized representations, in which each token's vector depends on the rest of the input. We explore the architectural role of the `[CLS]` token, the mechanics of transfer learning, and the hyper-parameter choices needed to avoid **Catastrophic Forgetting**.

**Key Technical Highlights:**
- **Architecture**: BERT-base (110M parameters).
- **Technique**: Supervised Fine-tuning with a Low Learning Rate.
- **Concepts**: Contextual Embeddings, [CLS] Token Head, Self-Attention Mechanism.
- **Dataset**: IMDB Movie Reviews (Binary Classification).

## 2. Environment & Boilerplate Setup

In [None]:
%%capture
!pip install -q transformers datasets evaluate accelerate torch numpy pandas
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification, 
    TrainingArguments, 
    Trainer, 
    DataCollatorWithPadding,
    pipeline
)
import evaluate

# Set seeds for reproducibility (PyTorch and NumPy)
torch.manual_seed(42)
np.random.seed(42)

In [None]:
# 2. Dataset Preparation (IMDB)
dataset = load_dataset('imdb')

# Stratified Sampling to speed up training for demonstration
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

# groupby(...).sample draws the same number of rows per label and keeps the 'label' column;
# IMDB has 12,500 reviews per label in each split, so n never exceeds the group size
train_sample = train_df.groupby('label').sample(n=1250, random_state=42)
test_sample = test_df.groupby('label').sample(n=250, random_state=42)

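# preserve_index=False avoids carrying the pandas index along as an extra column
imdb_dataset = {
    'train': Dataset.from_pandas(train_sample, preserve_index=False),
    'test': Dataset.from_pandas(test_sample, preserve_index=False)
}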

print(f"Training set size: {len(imdb_dataset['train'])}")
print(f"Test set size: {len(imdb_dataset['test'])}")

## 3. Architectural Deep-Dive: The [CLS] Token and Self-Attention

### 3.1 The Role of the `[CLS]` Token
In BERT's architecture, the `[CLS]` (Classification) token is always the first token of every input sequence. Unlike the other tokens, it does not stand for a word; its final hidden state is used as the **pooled representation** of the entire sequence.

- **Aggregation**: In every Self-Attention layer, the `[CLS]` position attends to all other tokens in the input, so its vector accumulates information from the whole sequence.
- **Sentiment Bottleneck**: After fine-tuning, the final-layer `[CLS]` embedding summarizes the input well enough to serve as the "bottleneck" that feeds a simple Linear classifier for tasks like Sentiment Analysis.
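
The classification head described above can be sketched with plain Python and made-up numbers in place of real BERT activations. The function name `classify_from_cls` and the toy `weight` and `bias` values are illustrative assumptions, not part of the notebook. The Hugging Face implementation also passes the `[CLS]` state through a pooler (dense layer plus tanh) and dropout before the classifier; those steps are left out here for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify_from_cls(hidden_states, weight, bias):
    """Apply a linear layer + softmax to the first ([CLS]) position only."""
    cls_vector = hidden_states[0]  # position 0 holds the [CLS] vector
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


# Toy final-layer hidden states for 4 positions (hidden size 3, made-up values)
hidden_states = [
    [0.5, -1.0, 2.0],   # [CLS]
    [0.1, 0.3, -0.2],   # token 1
    [0.0, 0.7, 0.4],    # token 2
    [-0.3, 0.2, 0.1],   # [SEP]
]
weight = [[0.2, 0.1, -0.5], [-0.2, -0.1, 0.5]]  # 2 labels x hidden size
bias = [0.0, 0.0]

probs = classify_from_cls(hidden_states, weight, bias)
print(f"P(negative)={probs[0]:.3f}, P(positive)={probs[1]:.3f}")
```

Only row 0 of `hidden_states` reaches the classifier. The other tokens influence the prediction indirectly, through the attention layers that produced the `[CLS]` vector.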

### 3.2 Handling Negation and Sarcasm
Models that treat words independently often fail on phrases like *"Not bad"* or *"I expected it to be good, but..."*.
BERT is better equipped for these cases because it produces **Contextualized Embeddings**:
- In *"not bad"*, the representation of "bad" is computed while attending to "not", which can shift its vector towards a neutral/positive direction.
- **Self-Attention** connects every pair of tokens directly, so the model can relate words regardless of their distance in the sentence.
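
To make the idea of a context-dependent vector concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It rests on strong simplifying assumptions: the 2-D embeddings for "not" and "bad" are invented, and the query, key, and value projections are the identity. BERT-base uses learned projections, 12 heads, and 12 layers. The sketch only shows that the output vector for "bad" changes once "not" is in the sequence.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d = len(embeddings[0])
    outputs, weights = [], []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        attn = softmax(scores)
        weights.append(attn)
        outputs.append(
            [sum(a * value[j] for a, value in zip(attn, embeddings)) for j in range(d)]
        )
    return outputs, weights


# Made-up static embeddings (assumption, for illustration only)
NOT = [2.0, 1.0]
BAD = [-1.0, 1.0]

alone_out, _ = self_attention([BAD])
context_out, context_weights = self_attention([NOT, BAD])

bad_alone = alone_out[0]
bad_in_context = context_out[1]
print("bad (alone):     ", bad_alone)
print("bad (after not): ", [round(x, 3) for x in bad_in_context])
```

With a single token the output equals the input embedding. With "not" present, the output for "bad" becomes a weighted mix of both vectors, which is the mechanism the bullets above describe.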

In [None]:
# 4. Tokenization and Preprocessing
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

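def tokenize_function(examples):
    # Truncate to BERT's 512-token limit; padding is left to DataCollatorWithPadding,
    # which pads each batch only to its longest sequence
    return tokenizer(examples['text'], truncation=True, max_length=512)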

tokenized_datasets = {
    'train': imdb_dataset['train'].map(tokenize_function, batched=True),
    'test': imdb_dataset['test'].map(tokenize_function, batched=True)
}

# Clean datasets for training
for split in ['train', 'test']:
    tokenized_datasets[split] = tokenized_datasets[split].remove_columns(['text'])
    tokenized_datasets[split] = tokenized_datasets[split].rename_column('label', 'labels')
    tokenized_datasets[split].set_format('torch')

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## 5. The Fine-Tuning Strategy: Avoiding Catastrophic Forgetting

Fine-tuning is a delicate balance. We are taking a model that has already learned the "rules of language" and adapting it to a specific task.

### 5.1 Catastrophic Forgetting
If we use a high learning rate (e.g., $10^{-3}$), the model's weights undergo massive updates. This can lead to **Catastrophic Forgetting**, where the model "overwrites" its general linguistic knowledge with task-specific noise, destroying its ability to generalize.
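
The effect can be illustrated with a deliberately tiny toy: a single weight, a "pretraining" optimum at 1.0, and a "task" optimum at 3.0. Everything here is an assumption made for illustration (the quadratic losses, the two optima, and the toy learning rates, which are not comparable to $10^{-3}$ or $2 \times 10^{-5}$ on a real network). It shows only the qualitative point: a few large steps abandon the pretrained solution, while the same number of small steps stay close to it.

```python
def finetune(w, lr, steps, task_optimum=3.0):
    """Plain gradient descent on the toy task loss (w - task_optimum)^2."""
    for _ in range(steps):
        grad = 2.0 * (w - task_optimum)
        w -= lr * grad
    return w


def general_loss(w, pretrain_optimum=1.0):
    """Toy stand-in for how well the weight still fits the pretraining objective."""
    return (w - pretrain_optimum) ** 2


w_pretrained = 1.0  # starts exactly at the pretraining optimum

w_small_lr = finetune(w_pretrained, lr=0.01, steps=5)
w_large_lr = finetune(w_pretrained, lr=0.45, steps=5)

print(f"small lr: w={w_small_lr:.3f}, general loss={general_loss(w_small_lr):.3f}")
print(f"large lr: w={w_large_lr:.3f}, general loss={general_loss(w_large_lr):.3f}")
```

In a real model the task data is small and noisy relative to the pretraining corpus, so drifting this far from the pretrained weights tends to hurt generalization rather than help it.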

### 5.2 Optimization Choices
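- **Learning Rate**: We use $2 \times 10^{-5}$, one of the values recommended for fine-tuning in the original BERT paper ($5 \times 10^{-5}$, $3 \times 10^{-5}$, $2 \times 10^{-5}$), so that the pretrained weights shift only gently.
- **Warmup Steps**: The learning rate increases gradually from zero over the first steps, which stabilizes the early updates while the new classification head is still randomly initialized.
- **Weight Decay**: A small penalty on weight magnitude regularizes the model and helps prevent overfitting to the training samples.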
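
The warmup behaviour can be sketched as a small function. This assumes the linear warmup followed by linear decay that `Trainer` uses by default, a single device, and no gradient accumulation. The function name `linear_schedule_lr` is hypothetical. The example also shows why the warmup length has to be chosen relative to the total number of optimizer steps: for 2,500 training examples, batch size 16, and 3 epochs there are only 471 steps, so a warmup longer than that never reaches the peak learning rate.

```python
import math


def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


train_examples = 2500   # 1,250 sampled reviews per label
batch_size = 16
epochs = 3
steps_per_epoch = math.ceil(train_examples / batch_size)  # 157
total_steps = steps_per_epoch * epochs                    # 471

peak_lr = 2e-5

# Warmup shorter than the run: the peak is reached at step 50
short_warmup = [linear_schedule_lr(s, peak_lr, 50, total_steps) for s in range(total_steps + 1)]

# Warmup longer than the run: training ends while still warming up
long_warmup = [linear_schedule_lr(s, peak_lr, 500, total_steps) for s in range(total_steps + 1)]

print(f"total optimizer steps: {total_steps}")
print(f"max lr with 50 warmup steps:  {max(short_warmup):.2e}")
print(f"max lr with 500 warmup steps: {max(long_warmup):.2e}")
```

A common rule of thumb is to spend roughly the first 5 to 10 percent of the steps on warmup. Treat that as a starting point rather than a fixed rule.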

In [None]:
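# 6. Model Initialization and Trainer Setup
# IMDB labels: 0 = negative, 1 = positive. Naming them makes pipeline output readable.
id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
label2id = {name: idx for idx, name in id2label.items()}
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2, id2label=id2label, label2id=label2id
)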

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_metric.compute(predictions=predictions, references=labels)['accuracy']
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='binary')['f1']
    return {"accuracy": acc, "f1": f1}

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=50,  # ~10% of the 471 optimizer steps (2,500 examples / batch 16 x 3 epochs, single device)
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy='no',  # Faster demonstration; this argument was named evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    report_to='none'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

print("Trainer initialized. Ready for fine-tuning.")
# trainer.train()  # Uncomment to fine-tune (e.g. in Colab); without it the classification head stays randomly initialized

In [None]:
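# 7. Final Quantitative Evaluation
# Metrics preserved from original training session.
# Note: ~0.50 accuracy on a balanced binary test set is chance level, and an F1 near 0
# means almost no correct positive predictions. These values are consistent with a model
# whose classification head has not been trained; run trainer.train() followed by
# trainer.evaluate() to obtain fresh metrics.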
results = {
    'eval_loss': 0.7210,
    'eval_accuracy': 0.5020, 
    'eval_f1': 0.0079
}
print(f"Final Evaluation Metrics: {results}")

## 7. Qualitative Analysis & Semantic Interpretation

In this final section, we move beyond aggregate metrics to evaluate the model's behavior on specific, nuanced samples. 

### Discussion:
1.  **Clear Samples**: A fine-tuned model typically excels on sentences with high-intensity sentiment words (*"fantastic"*, *"hated"*).
2.  **Contextual Nuance**: Note how BERT processes the phrase *"not bad"*. While a Bag-of-Words model might see "bad" and predict negative sentiment, BERT's attention mechanism can link "not" to "bad" and interpret the neutral/positive shift. This only holds once the model has actually been fine-tuned.
3.  **Ambiguity**: Sarcastic or highly complex sentences test the limits of even the most advanced transformers.
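
For contrast, the Bag-of-Words failure mentioned in point 2 can be reproduced with a few lines of plain Python. The lexicon and its scores are hypothetical, chosen only to illustrate the point; this is not a baseline that was evaluated in this notebook.

```python
import re

# Hypothetical sentiment lexicon (assumption, for illustration only)
LEXICON = {
    "fantastic": 2,
    "loved": 2,
    "great": 1,
    "bad": -1,
    "hated": -2,
    "waste": -2,
}


def bow_sentiment(text):
    """Sum per-word scores, ignoring word order and context entirely."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "POSITIVE"
    if score < 0:
        return "NEGATIVE"
    return "NEUTRAL"


for sentence in [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "The film was bad.",
    "The film was not bad.",
]:
    print(f"{bow_sentiment(sentence):8s} <- {sentence}")
```

Because "not" carries no score of its own and word order is discarded, "not bad" and "bad" receive the same label. This is the gap that contextual embeddings are meant to close.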

In [None]:
# 8. Inference Pipeline & Qualitative Test
sentiment_analyzer = pipeline(
    'sentiment-analysis',
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

sentences = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "I absolutely hated this film. It was a complete waste of time and money.",
    "This film was not bad, but it wasn't great either. Pretty mediocre, actually."
]

print("Running qualitative tests...\n")
outputs = sentiment_analyzer(sentences)

for text, out in zip(sentences, outputs):
    print(f"Sentence: '{text}'")
    print(f"Predicted: {out['label']} | Score: {out['score']:.4f}\n")