# 🎬 IMDB Movie Reviews — Learning Journey Overview

## Your NLP Learning Adventure

Welcome to a **hands-on journey** from basic text processing to transformer fine-tuning. This repository teaches you NLP by building every component yourself, then comparing your baseline with TinyBERT.

**Learning Philosophy:** Explain first, code later. Each notebook starts with concepts, then guides you through implementation with hints—no complete solutions provided.


## 🧠 Concept Primer: The Complete NLP Pipeline

### What We're Building
A complete text classification pipeline that transforms raw movie reviews into aspect predictions, comparing two approaches:
1. **Baseline**: Hand-crafted neural network with embeddings
2. **TinyBERT**: Pre-trained transformer fine-tuned for the task

### Why This Approach?
Understanding each step prevents you from treating NLP as a "black box." When something breaks (and it will!), you'll know exactly where to look.

### The 9-Step Pipeline

**Phase 1: Data Preparation (Notebooks 01-04)**
1. **Import & Inspect** → Load CSVs, understand the data structure
2. **Tokenization** → Split text into processable word units  
3. **Vocabulary Building** → Create word→integer mappings
4. **Padding & Tensors** → Convert to fixed-size tensors for neural networks

**Phase 2: Baseline Model (Notebooks 05-06)**
5. **Neural Network** → Embedding layer + mean pooling + classifier
6. **Training & Evaluation** → Train loop + metrics computation

**Phase 3: Transformer Comparison (Notebooks 07-09)**
7. **TinyBERT Setup** → Load pre-trained model, freeze/unfreeze layers
8. **Fine-tuning** → Adapt transformer to our specific task
9. **Comparison** → Side-by-side performance analysis
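
To make Phase 1 concrete, here is a minimal sketch of steps 2-4 in plain Python. It is one possible shape, not the solution the notebooks expect: the function names (`tokenize`, `build_vocab`, `encode_and_pad`), the regex, and the `min_freq` parameter are all illustrative assumptions. The special token ids match the pipeline diagram further down (`<unk>` is 0, `<pad>` is 1).

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(token_lists, min_freq=1):
    """Map each token to an integer id; ids 0 and 1 are reserved."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {UNK: 0, PAD: 1}
    # Most frequent tokens get the smallest ids.
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode_and_pad(tokens, vocab, max_len=128):
    """Look up ids (unknown words map to <unk>), truncate, then right-pad."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

# Hypothetical toy reviews, not taken from the dataset.
reviews = ["This movie was amazing!", "This plot was thin."]
token_lists = [tokenize(r) for r in reviews]
vocab = build_vocab(token_lists)
encoded = [encode_and_pad(toks, vocab, max_len=8) for toks in token_lists]
```

Every row in `encoded` has the same length, which is what lets step 4 stack the rows into a single tensor.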

### How It Maps to Math
- **Text → Tokens**: String splitting with regex
- **Tokens → Integers**: Dictionary lookup (vocab mapping)
- **Integers → Embeddings**: Lookup table → dense vectors
- **Embeddings → Features**: Mean pooling → fixed-size representation
- **Features → Predictions**: Linear layers → class logits; softmax turns the logits into class probabilities
- **Predictions → Loss**: Cross-entropy → optimization signal
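
The last three mappings can be worked through by hand. The sketch below uses plain Python and made-up numbers (the embeddings, the weight matrix `W`, and the bias `b` are hypothetical toy values) to show mean pooling, a linear layer, softmax, and cross-entropy for a single review.

```python
import math

# Hypothetical toy input: 3 token embeddings of dimension 2 for one review.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Mean pooling: average over tokens to get one fixed-size vector.
dim = len(embeddings[0])
pooled = [sum(vec[d] for vec in embeddings) / len(embeddings) for d in range(dim)]

# Linear layer: logits = W @ pooled + b, one row of W per class (2 classes).
W = [[0.5, -0.5], [-0.5, 0.5]]
b = [0.0, 0.0]
logits = [sum(w * x for w, x in zip(row, pooled)) + bias for row, bias in zip(W, b)]

def softmax(zs):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Cross-entropy for one example: negative log-probability of the true class.
target = 1
loss = -math.log(probs[target])
```

Here `pooled` is `[3.0, 4.0]` and `logits` is `[-0.5, 0.5]`, so the model favours class 1 and the loss is small. A lower probability for the true class would produce a larger loss, and that is the signal the optimizer follows.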

### Where It Maps to PyTorch Objects
- `torch.nn.Embedding` → lookup table from word IDs to dense vectors
- `torch.nn.Linear` → weight matrices for classification
- `torch.nn.CrossEntropyLoss` → loss computation (takes raw logits and applies log-softmax internally)
- `torch.optim.Adam` → adaptive gradient-based optimizer that updates the weights
- `torch.utils.data.DataLoader` → batching and shuffling
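
As a rough illustration of how these objects fit together, the sketch below wires them into one training pass over random stand-in data. The layer sizes (50 and 100) follow the pipeline diagram; everything else is an assumption for illustration, including the class name `BaselineClassifier`, the ReLU between the two linear layers, the padding-aware mean, and the learning rate. Notebooks 05-06 guide you toward your own version.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class BaselineClassifier(nn.Module):
    """Embedding -> mean pool over real tokens -> Linear -> Linear (logits)."""

    def __init__(self, vocab_size, n_classes, embed_dim=50, hidden_dim=100, pad_id=1):
        super().__init__()
        self.pad_id = pad_id
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.hidden = nn.Linear(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, ids):
        mask = (ids != self.pad_id).unsqueeze(-1).float()   # (batch, seq, 1)
        vectors = self.embedding(ids) * mask                 # zero out padding
        pooled = vectors.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.out(torch.relu(self.hidden(pooled)))     # raw logits

# Random stand-in data: 8 "reviews" of length 16, last 4 positions padded.
torch.manual_seed(0)
ids = torch.randint(2, 20, (8, 16))
ids[:, 12:] = 1
labels = torch.randint(0, 3, (8,))
loader = DataLoader(TensorDataset(ids, labels), batch_size=4, shuffle=True)

model = BaselineClassifier(vocab_size=20, n_classes=3)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch_ids, batch_labels in loader:
    optimizer.zero_grad()                 # clear gradients from the last step
    logits = model(batch_ids)
    loss = loss_fn(logits, batch_labels)  # expects logits, not probabilities
    loss.backward()                       # compute gradients
    optimizer.step()                      # update the weights
```

The model returns logits rather than probabilities because `CrossEntropyLoss` applies the softmax step itself.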


## 📋 Checklist Objectives

By the end of this overview, you should be able to:

- [ ] **Narrate the pipeline** in 60-90 seconds without looking at notes
- [ ] **Explain why each step exists** and what breaks without it
- [ ] **Identify the two main approaches** (baseline vs transformer)
- [ ] **Understand the learning philosophy** (explain-first, code-later)

## ✅ Acceptance Criteria

**You've mastered this notebook when you can:**
- Recite the 9-step pipeline from memory
- Explain the difference between baseline and TinyBERT approaches
- Articulate why we build from scratch before using pre-trained models
- Feel confident about the learning journey ahead


## 🔄 Pipeline Visual Diagram

```
Raw Text: "This movie was amazing!"
    ↓
[01] Load Data → DataFrame with columns: review, aspect, aspect_encoded
    ↓
[02] Tokenize → ["this", "movie", "was", "amazing"]
    ↓
[03] Build Vocab → {"this": 2, "movie": 3, "was": 4, "amazing": 5, "<unk>": 0, "<pad>": 1}
    ↓
[04] Encode & Pad → [2, 3, 4, 5, 1, 1, 1, ...] (length 128)
    ↓
[05] Neural Network → Embedding(50) → Mean Pool → Linear(100) → Linear(n_classes)
    ↓
[06] Train & Evaluate → Loss optimization → Accuracy/F1 metrics
    ↓
[07] TinyBERT Setup → Load pre-trained → Freeze most layers → Unfreeze classifier
    ↓
[08] Fine-tune → High LR (2.5e-3) → Adapt to task → Track loss
    ↓
[09] Compare → Baseline vs TinyBERT → Performance analysis → Insights
```

**Key Insight:** The diagram is drawn as one sequence, but steps [05]-[06] (baseline) and [07]-[08] (TinyBERT) are two separate paths from the same data. They take very different routes, yet both are judged by the same evaluation metrics in step [09]!
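
Step [07] of the diagram, freezing most layers and unfreezing the classifier, comes down to toggling `requires_grad` on parameters. The sketch below shows the mechanics on a small stand-in model rather than on TinyBERT itself, so the class `TinyStandIn` and its attribute names `encoder` and `classifier` are hypothetical; a real pre-trained model may name its parts differently. The learning rate is the one listed in step [08].

```python
import torch
from torch import nn

class TinyStandIn(nn.Module):
    """Stand-in for a pre-trained model: an encoder body plus a classifier head."""

    def __init__(self, hidden=32, n_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):
        return self.classifier(self.encoder(x))

model = TinyStandIn()

# Freeze everything, then unfreeze only the classifier head.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=2.5e-3)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable} / {n_total}")
```

Frozen parameters keep whatever they learned during pre-training, so only the small head has to adapt to the new task.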


## 📝 Reflection Prompts

Take 5-10 minutes to reflect on these questions. Write your thoughts in the cell below:

### 🤔 Understanding Check
1. **In your own words, why do we need a vocabulary?** What would happen if we tried to feed raw text directly to a neural network?

2. **What's the fundamental difference between the baseline encoder and TinyBERT?** Think about what each approach "knows" before training starts.

3. **Why do we build from scratch first, then use pre-trained models?** What would you miss if you jumped straight to TinyBERT?

4. **Looking at the pipeline diagram above, which step seems most mysterious to you right now?** What would you like to understand better?

### 🎯 Personal Learning Goals
- What aspect of NLP are you most excited to learn about?
- What's your biggest concern about this learning journey?
- How do you plan to use these notebooks to build understanding (not just copy code)?

---

**Write your reflections here:**


---

## 🚀 Ready to Begin?

You're now equipped with the mental model for the entire NLP pipeline. When you're ready to start building, move on to **Notebook 01: Import and Inspect** where you'll load your first dataset and begin the hands-on journey.

**Remember:** Each notebook builds on the previous one. Don't skip ahead—the learning is in the journey, not just the destination.

**Next up:** Data familiarization and the first glimpse of your movie reviews dataset! 🎬
