## Part 1: Data Loading

### Why Start with this Data?

Before building models, we need to understand:
- Dataset size (affects model choice)
- Class balance (affects evaluation metrics)
- Text characteristics (length, complexity)

The IMDb dataset is:
- **Balanced:** 50/50 positive/negative (no class imbalance bias)
- **Medium-sized:** 25k training examples
- **Realistic:** Real movie reviews with varied length and complexity
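
The three properties listed above can be checked with a few lines of standard-library Python. The sketch below is only an illustration: the helper name `summarize` and the three sample reviews are hypothetical, and word counts come from a plain whitespace split rather than a real tokenizer.

```python
from collections import Counter
from statistics import mean, median

def summarize(texts, labels):
    """Report dataset size, class balance, and rough length statistics."""
    lengths = [len(t.split()) for t in texts]  # whitespace word counts
    return {
        "n_examples": len(texts),
        "label_counts": dict(Counter(labels)),
        "mean_words": mean(lengths),
        "median_words": median(lengths),
        "max_words": max(lengths),
    }

# Hypothetical sample data, not taken from IMDb
sample_texts = ["A great film", "Dull and far too long", "Loved it"]
sample_labels = [1, 0, 1]
summary = summarize(sample_texts, sample_labels)
print(summary)
```

Once the dataset is loaded, the same helper could be called with `dataset['train']['text']` and `dataset['train']['label']`.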

In [1]:
from datasets import load_dataset

# Load IMDb dataset from HuggingFace
dataset = load_dataset("imdb")
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [2]:
# Examine a sample review
print(dataset['train'][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [None]:
# Verify class balance
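train_labels = dataset['train']['label']
n_positive = sum(train_labels)              # labels are 0 (negative) or 1 (positive)
n_negative = len(train_labels) - n_positive
print(f"Positive reviews: {n_positive}")
print(f"Negative reviews: {n_negative}")
if n_positive == n_negative:                # only claim balance when it actually holds
    print("\n✅ Dataset is perfectly balanced (50/50)")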

Positive reviews: 12500
Negative reviews: 12500

✅ Dataset is perfectly balanced (50/50)


## Part 2: Baseline Model - TF-IDF + Logistic Regression

### Why Start with a Simple Baseline?

**Important principle in ML:** Always establish a simple baseline before using complex models.

Benefits:
1. **Benchmark:** Provides target to beat
2. **Speed:** Fast iteration for testing pipeline
3. **Simplicity:** Easy to debug and understand
4. **Often sufficient:** Many tasks don't need deep learning

### TF-IDF: How It Works

**TF-IDF** = Term Frequency × Inverse Document Frequency

**Goal:** Identify words that are important to a document

**Formula Logic:**
- **Term Frequency (TF):** How often does this word appear in THIS review?
- **Inverse Document Frequency (IDF):** How rare is this word across ALL reviews?
- **Product:** A word scores highest when it is frequent in this review but rare across the corpus
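
As a rough illustration of the two factors above, the sketch below computes TF-IDF by hand for a tiny made-up corpus. It assumes the smoothed IDF variant, idf = ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, which matches the documented defaults of scikit-learn's `TfidfVectorizer`. The corpus and the function name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Tiny made-up corpus (hypothetical, for illustration only)
corpus = [
    "great movie great acting",
    "terrible movie",
    "great fun",
]

def tfidf_vector(doc, corpus):
    """Return {term: tf-idf weight} for one document, L2-normalized."""
    n_docs = len(corpus)
    tokenized = [d.split() for d in corpus]
    tf = Counter(doc.split())  # raw term counts in THIS document
    weights = {}
    for term, count in tf.items():
        df = sum(term in tokens for tokens in tokenized)  # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed IDF (assumed variant)
        weights[term] = count * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf_vector(corpus[0], corpus)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {w:.3f}")
```

In this toy corpus "acting" outranks "movie" even though each appears once in the first review, because "acting" occurs in fewer documents.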

### N-grams: Capturing Phrases

Setting `ngram_range=(1,2)` captures:
- **Unigrams:** "good", "bad", "terrible"
- **Bigrams:** "not good", "very bad", "absolutely terrible"

This helps capture negation and emphasis patterns that single words might miss.
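
To make the effect of `ngram_range=(1, 2)` concrete, here is a minimal sketch that lists the unigrams and bigrams of one sentence. The helper `word_ngrams` is hypothetical and splits on whitespace only; `TfidfVectorizer` applies its own tokenization rules, so its exact vocabulary may differ.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams whose length falls in ngram_range (inclusive)."""
    tokens = text.lower().split()
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("This movie was not good")
print(grams)
# ['this', 'movie', 'was', 'not', 'good',
#  'this movie', 'movie was', 'was not', 'not good']
```

The bigram "not good" becomes its own feature, so the classifier can give it a negative weight even though "good" alone is positive.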

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Extract text and labels
train_texts = dataset['train']['text']
train_labels = dataset['train']['label']
test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

print(f"Training on {len(train_texts)} reviews")
print(f"Testing on {len(test_texts)} reviews")

Training on 25000 reviews
Testing on 25000 reviews


In [5]:
# Convert text to TF-IDF features
print("Converting text to TF-IDF features...")

vectorizer = TfidfVectorizer(
    max_features=5000,    # Keep the 5,000 most frequent terms across the corpus
    ngram_range=(1, 2)    # Terms are single words (unigrams) and two-word phrases (bigrams)
)

X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print(f"\n TF-IDF vectorization complete!")
print(f"Shape: {X_train.shape}")
print(f"Each review → vector of {X_train.shape[1]} numbers")
print(f"Parameters: {X_train.shape[1] + 1:,} (one weight per feature plus one intercept)")

Converting text to TF-IDF features...

 TF-IDF vectorization complete!
Shape: (25000, 5000)
Each review → vector of 5000 numbers
Parameters: 5,001 (one weight per feature plus one intercept)


### Logistic Regression for Classification

**Why Logistic Regression?**
- Fast training on high-dimensional data
- Probabilistic outputs (confidence scores)
- Works well with TF-IDF features
- Interpretable (can examine feature weights)

**What it learns:**
- Positive weights for words like "excellent", "amazing", "loved"
- Negative weights for words like "terrible", "boring", "waste"
- Combines these weights to make predictions
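
The way weights are combined into a prediction can be sketched in a few lines. The weights below are hypothetical values chosen for illustration, not the ones learned by the model in this notebook, and `predict_positive_probability` is a made-up helper name.

```python
import math

# Hypothetical weights, NOT taken from the trained model
weights = {"excellent": 2.1, "loved": 1.4, "boring": -1.8, "waste": -2.5}
bias = 0.0

def predict_positive_probability(features):
    """features maps term -> TF-IDF value; returns P(review is positive)."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes the score into (0, 1)

p_good = predict_positive_probability({"excellent": 0.6, "loved": 0.5})
p_bad = predict_positive_probability({"boring": 0.7, "waste": 0.4})
print(f"Positive-looking review: {p_good:.3f}")
print(f"Negative-looking review: {p_bad:.3f}")
```

After training, the real weights are available in `baseline_model.coef_[0]`, one per TF-IDF column, and the matching term names come from `vectorizer.get_feature_names_out()`.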

In [6]:
# Train Logistic Regression
print("Training Logistic Regression model...")
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, train_labels)

# Evaluate
print("\nEvaluating on test set...")
predictions = baseline_model.predict(X_test)
accuracy = accuracy_score(test_labels, predictions)

print(f"\n{'='*60}")
print(f"BASELINE MODEL PERFORMANCE")
print(f"{'='*60}")
print(f"Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"\nDetailed Classification Report:")
print(classification_report(test_labels, predictions, target_names=['Negative', 'Positive']))
print(f"{'='*60}")

Training Logistic Regression model...

Evaluating on test set...

BASELINE MODEL PERFORMANCE
Test Accuracy: 0.8884 (88.84%)

Detailed Classification Report:
              precision    recall  f1-score   support

    Negative       0.89      0.88      0.89     12500
    Positive       0.88      0.89      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000



### Baseline Results Analysis

**Metrics explanation:**
- **Accuracy:** Overall correctness (~88.84%)
- **Precision:** When model predicts positive, how often is it right? (~88%)
- **Recall:** Of all actual positive reviews, how many did we find? (~89%)
- **F1-Score:** Harmonic mean of precision and recall (~0.89)
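
The definitions above can be written directly in terms of confusion-matrix counts. This is a minimal sketch with hypothetical counts (they are not the notebook's actual confusion matrix), treating "positive" as the class of interest.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of predicted positives, fraction that are correct
    recall = tp / (tp + fn)      # of actual positives, fraction that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Hypothetical counts for 200 reviews
precision, recall, f1, accuracy = classification_metrics(tp=89, fp=12, fn=11, tn=88)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

`classification_report` computes the same quantities once per class, which is why the table has a row for Negative and a row for Positive.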

**Key observations:**
- Balanced performance across both classes 
- Simple model achieving ~89% accuracy
- This is our benchmark - BERT needs to beat this to be worthwhile

## Part 3: BERT - Transformer Architecture

### What is BERT?

**BERT** = Bidirectional Encoder Representations from Transformers

**Key innovation:** Processes text bidirectionally (looks both left AND right)

### Traditional Models vs BERT

**Traditional (RNN/LSTM):**
```
Reads left-to-right sequentially:
"This" → "movie" → "wasn't" → "good"

Problem: May forget "wasn't" by the time it reaches "good"
```

**BERT (Transformer):**
```
Processes all words simultaneously:
"This" ↔ "movie" ↔ "wasn't" ↔ "good"

Advantage: "good" can attend to "wasn't" → understands negation
```

### The Attention Mechanism

**Core concept:** Each word "attends" to (looks at) every other word

**Example: "This movie wasn't good"**

When processing "good":
```
Attention weights (illustrative values; the real ones are learned):
"This"   → 5%  (low attention)
"movie"  → 15% (medium attention)
"wasn't" → 65% (HIGH attention - this is key!)
"good"   → 15% (self-attention)
```

Result: "good" incorporates meaning from "wasn't" → understands the phrase is negative!
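
The idea can be sketched as scaled dot-product attention for a single query. The 2-dimensional vectors below are made up so that "wasn't" aligns with the query for "good"; in BERT the queries, keys, and values come from learned projections of 768-dimensional hidden states, and each layer runs several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

tokens = ["this", "movie", "wasn't", "good"]
# Made-up vectors (hypothetical, for illustration only)
keys = [[0.1, 0.0], [0.5, 0.2], [2.0, 1.0], [0.6, 0.3]]
values = keys
query = [1.5, 1.0]  # query vector for "good"

weights, output = attend(query, keys, values)
for token, w in zip(tokens, weights):
    print(f"{token:8s} {w:.2f}")
```

With these vectors the largest weight falls on "wasn't", so the output representation for "good" is dominated by the negation.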

### BERT Architecture

**Layers:**
- **12 transformer layers** (BERT-base)
- Each layer refines understanding progressively

**What each layer learns (conceptually):**
```
Layer 1-2:   Basic syntax and grammar
Layer 3-5:   Phrase relationships
Layer 6-8:   Semantic meaning and context
Layer 9-12:  Complex sentence-level understanding
```

**Parameters:**
- Total: 109,483,778 parameters
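- Embeddings: ~23.8M (token, position, and segment embeddings)
- Transformer layers: ~85.1M (12 layers × ~7.1M each)
- Pooler: ~0.6M (768 × 768 dense layer on the `[CLS]` output)
- Classification head: 1,538 (768 × 2 weights + 2 biases)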
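
The parameter total can be reproduced with plain arithmetic. The sketch below assumes the standard `bert-base-uncased` configuration (vocabulary 30,522, hidden size 768, 512 positions, 2 segment types, feed-forward size 3,072, 12 layers) plus a 2-label classifier; the variable names are chosen here for illustration.

```python
# Assumed bert-base-uncased configuration values
vocab, hidden, max_pos, seg_types = 30522, 768, 512, 2
ffn, layers, labels = 3072, 12, 2

def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

embeddings = vocab * hidden + max_pos * hidden + seg_types * hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)  # query, key, value, and attention output projections
    + layer_norm
    + linear(hidden, ffn)       # feed-forward expansion
    + linear(ffn, hidden)       # feed-forward contraction
    + layer_norm
)
encoder = layers * per_layer
pooler = linear(hidden, hidden)
classifier = linear(hidden, labels)

total = embeddings + encoder + pooler + classifier
print(f"Embeddings: {embeddings:,}")
print(f"Encoder:    {encoder:,} ({per_layer:,} per layer)")
print(f"Pooler:     {pooler:,}")
print(f"Classifier: {classifier:,}")
print(f"Total:      {total:,}")  # 109,483,778
```

The total matches the value reported by `model.num_parameters()`, and it shows that almost all of the capacity sits in the encoder and embeddings rather than in the classification head.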

### Transfer Learning

**Two-stage process:**

**Stage 1: Pre-training (Done by Google)**
- Trained on massive text corpus (Wikipedia + BookCorpus)
- Learned general language understanding
- Result: Model understands English grammar, syntax, semantics

**Stage 2: Fine-tuning (What we're doing)**
- Start with pre-trained BERT
- Train on our specific task (sentiment classification)
- Only need 25k labeled examples (vs roughly 3.3 billion words of unlabeled text for pre-training)
- Result: BERT adapted for sentiment analysis

**Analogy:** 
- Pre-training = Learning to read and understand English
- Fine-tuning = Learning to identify sentiment in reviews

In [7]:
from transformers import BertTokenizer, BertForSequenceClassification

print("Loading BERT tokenizer...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print("Loading BERT model...")
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # Binary classification: positive/negative
)

print("\n Model and tokenizer loaded!")
print(f"Model has {model.num_parameters():,} parameters")

Loading BERT tokenizer...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading BERT model...

 Model and tokenizer loaded!
Model has 109,483,778 parameters


### BERT Tokenization

**Process:** Convert text → numbers BERT understands

**Steps:**
1. **Lowercase:** "This Movie" → "this movie"
2. **Add special tokens:** "[CLS] this movie was amazing [SEP]"
   - `[CLS]`: Classification token (sentence-level representation)
   - `[SEP]`: Separator (marks end of sentence)
3. **Subword tokenization:** "unbelievable" → ["un", "##believable"]
   - Handles unknown words by breaking into pieces
4. **Convert to IDs:** Map each token to its number in the vocabulary (30,522 WordPiece tokens)
5. **Pad/Truncate:** Make all sequences same length (128 tokens)
6. **Attention mask:** 1 for real tokens, 0 for padding
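
The steps above can be sketched with a toy tokenizer. Everything here is a simplified assumption: the vocabulary and its IDs are made up (the real `bert-base-uncased` vocabulary assigns different IDs), and the helpers `wordpiece` and `encode` only imitate the greedy longest-match idea behind WordPiece.

```python
# Hypothetical mini-vocabulary; real BERT IDs differ
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "this": 4, "movie": 5, "was": 6, "un": 7, "##believable": 8}

def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece: the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_length=8):
    """Lowercase, split, add special tokens, truncate, pad, and build the mask."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]  # truncate, leaving room for [SEP]
    attention_mask = [1] * len(tokens)
    padding = max_length - len(tokens)
    tokens += ["[PAD]"] * padding
    attention_mask += [0] * padding
    input_ids = [vocab[t] for t in tokens]
    return tokens, input_ids, attention_mask

tokens, input_ids, attention_mask = encode("This movie was unbelievable", vocab)
print(tokens)
print(input_ids)
print(attention_mask)
```

The notebook itself relies on `BertTokenizer`, which performs all of these steps in a single call.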

**Why max_length=128?**
- Keeps roughly the first ~100 words of each review; anything beyond that is truncated
- Balances context vs computation speed
- Longer = more context but slower training

In [8]:
print("Tokenizing training data...")
train_encodings = tokenizer(
    list(train_texts),
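    truncation=True,      # Cut reviews longer than max_length
    padding=True,         # Pad shorter reviews to the longest sequence in the batch
    max_length=128,       # Upper bound on tokens per review, including [CLS] and [SEP]
    return_tensors='pt'   # Return PyTorch tensors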
)

print("Tokenizing test data...")
test_encodings = tokenizer(
    list(test_texts),
    truncation=True,
    padding=True,
    max_length=128,
    return_tensors='pt'
)

print("\n Tokenization complete!")
print(f"Training shape: {train_encodings['input_ids'].shape}")
print(f"Test shape: {test_encodings['input_ids'].shape}")
print(f"\nEach review → [128] token IDs + [128] attention mask")

Tokenizing training data...
Tokenizing test data...

 Tokenization complete!
Training shape: torch.Size([25000, 128])
Test shape: torch.Size([25000, 128])

Each review → [128] token IDs + [128] attention mask


In [9]:
import torch
from torch.utils.data import Dataset, DataLoader

class SentimentDataset(Dataset):
    """Custom Dataset for sentiment analysis"""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Create datasets
train_dataset = SentimentDataset(train_encodings, train_labels)
test_dataset = SentimentDataset(test_encodings, test_labels)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print("Datasets and DataLoaders created!")
print(f"Training batches per epoch: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

Datasets and DataLoaders created!
Training batches per epoch: 1563
Test batches: 1563


In [10]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Move model to the selected device (GPU if available, otherwise CPU)
model = model.to(device)

# Setup optimizer
optimizer = AdamW(
    model.parameters(),
    lr=2e-5,              # Learning rate
    weight_decay=0.01     # Regularization
)

# Setup scheduler
epochs = 3
total_steps = len(train_loader) * epochs

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

print("\n Training setup complete!")
print(f"Total training steps: {total_steps}")
print(f"Steps per epoch: {len(train_loader)}")

Using device: cuda

 Training setup complete!
Total training steps: 4689
Steps per epoch: 1563


In [11]:
from tqdm import tqdm

def train_epoch(model, dataloader, optimizer, scheduler, device):
    """Train for one epoch"""
    model.train()
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    
    progress_bar = tqdm(dataloader, desc='Training')
    
    for batch in progress_bar:
        # Move batch to the selected device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss
        logits = outputs.logits
        
        # Backward pass
        loss.backward()
        optimizer.step()
        scheduler.step()
        
        # Track metrics
        total_loss += loss.item()
        predictions = torch.argmax(logits, dim=1)
        correct_predictions += (predictions == labels).sum().item()
        total_predictions += labels.size(0)
        
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'acc': f'{correct_predictions/total_predictions:.4f}'
        })
    
    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_predictions
    
    return avg_loss, accuracy

def evaluate(model, dataloader, device):
    """Evaluate model on test set"""
    model.eval()
    total_loss = 0
    correct_predictions = 0
    total_predictions = 0
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc='Evaluating'):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            
            loss = outputs.loss
            logits = outputs.logits
            
            total_loss += loss.item()
            predictions = torch.argmax(logits, dim=1)
            correct_predictions += (predictions == labels).sum().item()
            total_predictions += labels.size(0)
    
    avg_loss = total_loss / len(dataloader)
    accuracy = correct_predictions / total_predictions
    
    return avg_loss, accuracy

print("Training and evaluation functions defined!")

Training and evaluation functions defined!


In [None]:
print("="*60)
print("STARTING BERT FINE-TUNING")
print("="*60)
print(f"Device: {device}")
print(f"Epochs: {epochs}")
print(f"Batch size: 16")
print(f"Learning rate: 2e-5")
print(f"Total parameters: {model.num_parameters():,}")
print("="*60)

best_accuracy = 0

for epoch in range(epochs):
    print(f"\n EPOCH {epoch + 1}/{epochs}")
    print("-"*60)
    
    # Train
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, scheduler, device)
    
    # Evaluate
    test_loss, test_acc = evaluate(model, test_loader, device)
    
    # Calculate train-test gap
    gap = train_acc - test_acc
    
    # Print results
    print(f"\n Results:")
    print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} ({train_acc*100:.2f}%)")
    print(f"Test Loss:  {test_loss:.4f} | Test Acc:  {test_acc:.4f} ({test_acc*100:.2f}%)")
    print(f"Train-Test Gap: {gap*100:.2f}%", end=" ")
    
    if gap > 0.05:
        print("Overfitting detected!")
    elif gap > 0:
        print("Starting to overfit")
    else:
        print("Good generalization")
    
    if test_acc > best_accuracy:
        best_accuracy = test_acc
        print(f"New best accuracy: {best_accuracy:.4f}")
    
    print("-"*60)

print("\n" + "="*60)
print("TRAINING COMPLETE!")
print(f"Best Test Accuracy: {best_accuracy:.4f} ({best_accuracy*100:.2f}%)")
print(f"Improvement over baseline: {(best_accuracy - 0.8884)*100:.2f} percentage points")
print("="*60)

STARTING BERT FINE-TUNING
Device: cuda
Epochs: 3
Batch size: 16
Learning rate: 2e-5
Total parameters: 109,483,778

 EPOCH 1/3
------------------------------------------------------------


Training:  10%|▉         | 154/1563 [01:23<12:24,  1.89it/s, loss=0.3692, acc=0.8312]

## Part 5: Results Analysis & Conclusions

### Comparing BERT vs Baseline

**Final Results:**
- **Baseline (TF-IDF):** 88.84% accuracy, no overfitting
- **BERT (3 epochs):** ~88.97% accuracy, severe overfitting

### Overfitting Analysis

**Typical pattern observed:**
```
Epoch 1: Train 85%, Test 89% (Test better - healthy!)
Epoch 2: Train 93%, Test 89% (Gap opening)
Epoch 3: Train 97%, Test 89% (Severe overfitting)
```

**What happened:**
- Training accuracy jumped 12 percentage points (85% → 97%)
- Test accuracy stayed flat (~89%)
- Model memorized training data instead of learning patterns

**Root cause:**
- 109M parameters ÷ 25k examples = 4,379 params/example
- Model has enough capacity to memorize every training example
- Insufficient data to force generalizable learning

### Key Insights

**1. Dataset Size Matters**
- 25k examples insufficient for 109M parameter model
- Rough heuristic (not a hard limit): a model this large tends to benefit from 100k+ labeled examples
- Rule: Match model complexity to data availability

**2. Simple Baselines Can Win**
- TF-IDF achieved 88.84% with no overfitting
- BERT achieved 88.97% with severe overfitting
- A gain of 0.13 percentage points is not worth the complexity

**3. Overfitting Detection**
- Train-test gap is key metric
- Monitor from Epoch 1
- Early stopping could have helped
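
Early stopping can be sketched as a small function over per-epoch validation accuracies. This is an illustration under assumptions: the accuracies below are hypothetical, the helper name `early_stopping_epochs` is made up, and it presumes a validation split held out from the training data, which this notebook does not create (it monitors the test set instead).

```python
def early_stopping_epochs(val_accuracies, patience=1):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once accuracy has failed to improve for `patience`
    consecutive epochs; the best epoch is the checkpoint to keep.
    """
    best_acc, best_epoch, epochs_without_gain = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, epochs_without_gain = acc, epoch, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch, best_epoch
    return len(val_accuracies), best_epoch

# Hypothetical validation accuracies for three epochs
stop_epoch, best_epoch = early_stopping_epochs([0.890, 0.889, 0.888], patience=1)
print(f"Stop after epoch {stop_epoch}, keep the checkpoint from epoch {best_epoch}")
```

With a flat or declining validation curve like this one, training would halt after the second epoch and the first-epoch weights would be kept, saving the cost of the third epoch.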

**4. Production Considerations**
- TF-IDF: <1 min training, simple deployment
- BERT: 39 min training, complex infrastructure
- For a gain of 0.13 percentage points, simplicity wins

### Lessons Learned

**Always establish simple baselines first**  
**Model complexity should match dataset size**  
**Monitor train-test gap for overfitting**  
**Consider cost-benefit in model selection**  
**More parameters ≠ better performance**  

### When to Use Each Approach

**Use TF-IDF + Logistic Regression when:**
- Dataset < 100k examples
- Binary/simple classification
- Speed and simplicity valued
- Interpretability required

**Use BERT/Transformers when:**
- Dataset > 100k examples
- Complex NLP tasks (NER, QA, summarization)
- Context-dependent understanding critical
- You have compute budget and time

### Future Improvements

To improve BERT performance:
1. **More data:** Increase to 100k+ examples
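2. **Early stopping:** Stop once accuracy stops improving (Epoch 1 in this run), ideally judged on a held-out validation split rather than the test set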
3. **Smaller model:** Try DistilBERT (66M params)
4. **Regularization:** Increase dropout rates
5. **Data augmentation:** Paraphrasing, back-translation

---

**Conclusion:** This project demonstrates that state-of-the-art models aren't always the best choice. Model selection should be driven by dataset characteristics, task requirements, and production constraints - not just pursuing the latest architecture.