# 🤖 Notebook 07: TinyBERT Setup and Freezing

## Enter the Transformer Era

This notebook introduces you to transfer learning with TinyBERT, a smaller but powerful transformer model. You'll learn how to load pre-trained models, freeze/unfreeze layers strategically, and prepare data for transformer processing with attention masks.


## 🧠 Concept Primer: Transfer Learning with Transformers

### What We're Doing
Loading a pre-trained TinyBERT model and adapting it for our specific text classification task through strategic layer freezing and fine-tuning.

### Why This Approach Works
**Pre-trained models have learned general language representations.** Instead of training from scratch, we leverage this knowledge and adapt it to our specific task.

### TinyBERT Architecture
- **4 encoder layers** (versus 12 in BERT-Base)
- **312 hidden dimensions** (versus 768 in BERT-Base)
- **Pre-trained on large corpora** (distilled from BERT, so it inherits general language understanding)
- **Classification head** (added by `BertForSequenceClassification` and newly initialized; sized by `num_labels`)

### Freezing Strategy
1. **Freeze all parameters** initially (preserve pre-trained knowledge)
2. **Unfreeze classifier** (adapt to our task)
3. **Unfreeze last encoder layer** (fine-tune high-level features)

### Attention Masks vs Padding
- **Attention masks**: Tell the model which tokens to pay attention to
- **Padding tokens**: Filler tokens to maintain fixed sequence length
- **Key difference**: Attention masks prevent the model from "seeing" padding
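
To make the relationship concrete, here is a minimal sketch in plain Python of what `padding='max_length'` produces for one sequence. The helper name `pad_with_mask` is hypothetical, and the pad id of 0 is an assumption that matches the `[PAD]` id in the standard BERT uncased vocabulary. The real tokenizer also keeps the closing `[SEP]` when truncating, which this sketch does not.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Pad (or naively truncate) a list of token ids and build the matching attention mask."""
    kept = token_ids[:max_length]                    # naive truncation, for illustration only
    n_pad = max_length - len(kept)                   # number of filler positions needed
    input_ids = kept + [pad_id] * n_pad              # real tokens first, then padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# "[CLS] this movie was great [SEP]" as token ids, padded to 9 positions
ids, mask = pad_with_mask([101, 2023, 3185, 2001, 2307, 102], max_length=9)
print(ids)   # [101, 2023, 3185, 2001, 2307, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The padding lives in `input_ids`; the mask is a separate tensor that tells the model which of those positions to ignore.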

### Expected Output Example
```python
# Batch shapes
input_ids.shape = torch.Size([16, 128])
attention_mask.shape = torch.Size([16, 128])
labels.shape = torch.Size([16])
```


## 🔧 TODO #1: Load TinyBERT Model and Tokenizer

**Task:** Load the pre-trained TinyBERT model and tokenizer from HuggingFace.

**Hint:** Use `from transformers import BertTokenizer, BertForSequenceClassification` and `BertTokenizer.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')`

**Expected Variables:**
- `tokenizer` → TinyBERT tokenizer
- `model` → TinyBERT model with `num_labels=n_aspects`

**Model Setup:** Set `num_labels=n_aspects` when loading the model for classification


```python
# TODO #1: Load TinyBERT model and tokenizer
from transformers import BertTokenizer, BertForSequenceClassification

# Your code here
tokenizer = BertTokenizer.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')
n_aspects = 3
model_bert_tokenizer = BertForSequenceClassification.from_pretrained('huawei-noah/TinyBERT_General_4L_312D', num_labels=n_aspects)
```

```
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at huawei-noah/TinyBERT_General_4L_312D and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```


## 🔧 TODO #2: Freeze and Unfreeze Layers

**Task:** Freeze all parameters, then unfreeze classifier and last encoder layer.

**Hint:** 
- Freeze all: `for param in model.parameters(): param.requires_grad = False`
- Unfreeze classifier: `for param in model.classifier.parameters(): param.requires_grad = True`
- Unfreeze layer 3: `for param in model.bert.encoder.layer[3].parameters(): param.requires_grad = True`

**Strategy:** Preserve most pre-trained knowledge while allowing task-specific adaptation


```python
# TODO #2: Freeze and unfreeze layers
# Your code here
for param in model_bert_tokenizer.parameters():
    param.requires_grad = False
for param in model_bert_tokenizer.classifier.parameters():
    param.requires_grad = True
for param in model_bert_tokenizer.bert.encoder.layer[3].parameters():
    param.requires_grad = True
```
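
A quick sanity check after freezing is to count which parameters still require gradients. The helper below is a sketch, and the name `summarize_trainable` is hypothetical. It relies only on `named_parameters()`, `numel()`, and `requires_grad`, which any PyTorch `nn.Module` provides.

```python
def summarize_trainable(model):
    """Return (trainable_count, total_count, names_of_trainable_parameters)."""
    trainable, total = 0, 0
    trainable_names = []
    for name, param in model.named_parameters():
        n = param.numel()              # number of scalar weights in this tensor
        total += n
        if param.requires_grad:        # still receives gradient updates
            trainable += n
            trainable_names.append(name)
    return trainable, total, trainable_names


# Usage in this notebook (assumes the model from TODO #1 is loaded):
# trainable, total, names = summarize_trainable(model_bert_tokenizer)
# print(f"{trainable:,} / {total:,} trainable ({trainable / total:.1%})")
# print(names)  # every name should start with 'classifier.' or 'bert.encoder.layer.3.'
```

If any name outside the classifier or layer 3 shows up in the list, the freezing loop did not do what was intended.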

## 🔧 TODO #3: Tokenize Texts with Attention Masks

**Task:** Tokenize training texts using TinyBERT tokenizer with attention masks.

**Hint:** Use `tokenizer(train_texts, padding='max_length', truncation=True, max_length=128, return_tensors='pt')`

**Expected Variables:**
- `encodings` → Dictionary with 'input_ids' and 'attention_mask'
- `train_dataloader` → DataLoader with tokenized data

**Key Parameters:**
- `padding='max_length'` → Pad to 128 tokens
- `truncation=True` → Cut longer sequences
- `return_tensors='pt'` → Return PyTorch tensors


```python
# TODO #3: Tokenize texts with attention masks
# Your code here
import pandas as pd

train_reviews_df = pd.read_csv('../data/imdb_movie_reviews_train.csv')
test_reviews_df = pd.read_csv('../data/imdb_movie_reviews_test.csv')

# Extract text as list of strings
train_texts = train_reviews_df['review'].tolist()  # ✅ List of 369 strings
test_texts = test_reviews_df['review'].tolist()    # ✅ List of ~132 strings

# Get labels too
train_labels = train_reviews_df['aspect_encoded'].tolist()
test_labels = test_reviews_df['aspect_encoded'].tolist()

encoded_train = tokenizer(
    train_texts,  # ✅ List of strings (raw text)
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors='pt'
)
```
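
TODO #3 also lists a `train_dataloader`, which the cell above does not build. One possible way to do it is sketched below with `TensorDataset`. The stand-in tensors are assumptions that mimic the shapes the tokenizer returns, so the snippet runs on its own. In the notebook they would be replaced by `encoded_train['input_ids']`, `encoded_train['attention_mask']`, and `torch.tensor(train_labels)`. The batch size of 16 follows the expected shapes in the concept primer.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors (assumption): 40 fake reviews, each padded to 128 tokens, 3 classes
input_ids = torch.zeros((40, 128), dtype=torch.long)
attention_mask = torch.ones((40, 128), dtype=torch.long)
labels = torch.randint(0, 3, (40,))

# Each dataset item is an (input_ids, attention_mask, label) triple
train_dataset = TensorDataset(input_ids, attention_mask, labels)
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

batch_ids, batch_mask, batch_labels = next(iter(train_dataloader))
print(batch_ids.shape, batch_mask.shape, batch_labels.shape)
# torch.Size([16, 128]) torch.Size([16, 128]) torch.Size([16])
```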


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why freeze most layers?** What would happen if you unfroze all layers immediately?

2. **What does attention_mask do differently than padding?** How does it help the model?

3. **Why use TinyBERT instead of full BERT?** Consider computational efficiency vs performance.

4. **How does the pre-trained vocabulary differ from your custom vocabulary?** What advantages does this provide?

### 🎯 Transfer Learning Strategy
- Why is the classifier layer always unfrozen?
- What's the benefit of unfreezing only the last encoder layer?
- How does this approach prevent catastrophic forgetting?

---

## 📝 My Reflections

### 🤔 Understanding Check Answers

**1. Why freeze most layers?**

We freeze most layers because they come pre-trained with valuable semantic learnings from TinyBERT. These layers already understand:
- General language patterns and grammar
- Word relationships and context
- Semantic meaning and linguistic structures

These pre-trained patterns are valuable and can be used across many NLP projects. By freezing them, we preserve this hard-won knowledge while adapting only what's necessary for our specific task.

**What would happen if we unfroze all layers?**
- **Catastrophic forgetting**: Model would "forget" pre-trained knowledge
- **Overfitting**: With only 369 samples, we'd overfit dramatically
- **Inefficiency**: Much slower training with marginal benefit
- **Instability**: Training would be unstable on small datasets

**2. What does attention_mask do differently than padding?**

**Attention masks tell BERT which tokens are REAL vs PADDING.**

**The Problem**: Without attention masks, BERT would treat padding tokens like real words, letting filler positions leak into the representations of real tokens and hurting performance.

**How It Works:**
```
Review: "This movie was great" → input_ids: [101, 2023, 3185, 2001, 2307, 102, 0, 0, 0]
                                  attention_mask: [1, 1, 1, 1, 1, 1, 0, 0, 0]
                                                   ↑ real tokens    ↑ padding (ignore)
```

- `1` = "Pay attention to this token"
- `0` = "Ignore this token completely"

**Inside BERT's attention mechanism:**
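- Attention scores for masked positions get a large negative value added (-10000 in older `transformers` releases; newer ones use the most negative value of the tensor's dtype)
- After softmax, these positions get ~0 attention weight
- Real tokens effectively don't "see" the padding, although the padded positions are still computed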
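
The effect of that additive penalty can be checked with a small sketch in plain Python. The scores below are made-up values for one query attending to five key positions, the function name `masked_softmax` is hypothetical, and the penalty of -10000 is an assumption taken from the bullet above.

```python
import math


def masked_softmax(scores, attention_mask, penalty=-10000.0):
    """Softmax over attention scores after penalizing positions whose mask is 0."""
    shifted = [s + (0.0 if m == 1 else penalty) for s, m in zip(scores, attention_mask)]
    peak = max(shifted)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.0, 0.0]   # made-up raw attention scores
mask = [1, 1, 1, 0, 0]               # last two positions are padding
weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

The three real positions share all of the attention weight, and the two padded positions end up with a weight that is zero at floating-point precision.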

**Difference from baseline masking:**
- **Baseline (Notebook 05)**: Masks AFTER embedding for mean pooling
- **TinyBERT**: Masks DURING attention computation in every layer
- **Much more sophisticated**: Every attention head in every layer uses the mask

**3. Why use TinyBERT instead of full BERT?**

**TinyBERT (4 layers, 312 hidden) - Perfect for our case:**
- ✅ Small to medium datasets (369 samples)
- ✅ Simple classification tasks (3 aspects)
- ✅ Limited compute resources (CPU-friendly)
- ✅ Fast inference needed
- ✅ Good balance of performance vs efficiency

**BERT-Base (12 layers, 768 hidden) - Overkill for our case:**
- Usually paired with much larger datasets (tens of thousands of samples or more, as a rough rule of thumb)
- Complex tasks (Question Answering, NER)
- More compute required
- Would be "killing flies with a bazooka"

**BERT-Large (24 layers, 1024 hidden) - Way overkill:**
- Usually paired with very large datasets (hundreds of thousands of samples or more, as a rough rule of thumb)
- State-of-the-art benchmarks
- Massive compute requirements
- Research-level projects

**For our 369-sample aspect classification**: TinyBERT is the Goldilocks choice! 🎯

**4. How does the pre-trained vocabulary differ from your custom vocabulary?**

**My Custom Vocabulary (Baseline):**
- 1000 most frequent words from MY training data
- Film-specific but limited coverage
- Many unknown words in test data (~50%+)
- Simple tokenization (word-level)

**TinyBERT's Pre-trained Vocabulary:**
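- About 30,000 subword tokens (WordPiece)
- General language coverage
- Can represent most unseen words by breaking them into known subwords
- Handles rare words and contractions better; characters missing from the vocabulary (many emojis, for example) still map to `[UNK]`
- Example (illustrative split; the actual pieces depend on the vocabulary): "cinematography" → ["cinem", "##ato", "##graphy"]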

**Advantages:**
- Far fewer unknown tokens (most words can be built from subword pieces)
- Better handling of rare and out-of-vocabulary words
- Richer semantic representations
- Pre-trained embeddings capture meaning
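
How subword splitting avoids unknown words can be sketched with the greedy longest-match-first procedure that WordPiece tokenization uses at inference time. This is a simplified illustration, not the tokenizer's actual code. The function name `wordpiece_tokenize` and the six-entry `toy_vocab` are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:                      # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate    # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1                            # shrink the candidate by one character
        if match is None:                       # no piece fits: the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"film", "##ing", "##s", "un", "##watch", "##able"}
print(wordpiece_tokenize("filming", toy_vocab))      # ['film', '##ing']
print(wordpiece_tokenize("unwatchable", toy_vocab))  # ['un', '##watch', '##able']
print(wordpiece_tokenize("zzz", toy_vocab))          # ['[UNK]']
```

A word-level vocabulary would mark "unwatchable" as unknown unless that exact word had been seen, whereas the subword version reuses pieces it already knows. The last call shows that `[UNK]` can still occur when no piece matches.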

### 🎯 Transfer Learning Strategy Analysis

**Why is the classifier layer always unfrozen?**

The classifier layer is BRAND NEW and task-specific:
- Pre-trained model doesn't know about our 3 aspects
- Needs to learn from scratch: Cinematography, Characters, Story
- Random initialization requires training
- Maps BERT's representations to our specific classes

**What's the benefit of unfreezing only the last encoder layer (Layer 3)?**

**The Feature Hierarchy** (a simplified intuition; real layers do not divide the work this cleanly):
```
Layer 0: Basic word patterns ("the", "movie", "was")
         ↓ FROZEN - preserve general knowledge
Layer 1: Word combinations, grammar ("the movie")
         ↓ FROZEN - preserve linguistic patterns
Layer 2: Sentence structure ("the movie was great")
         ↓ FROZEN - preserve semantic understanding
Layer 3: Abstract concepts (positive sentiment about films)
         ↓ UNFROZEN - adapt to aspect classification
Classifier: Cinematography/Characters/Story (task-specific)
         ↓ UNFROZEN - learn from scratch
```

**Why Layer 3 is special:**
- Learns the MOST ABSTRACT, task-relevant features
- Closest to the classifier, so adaptation helps most
- High-level semantic patterns that need task-specific tuning
- Can adapt without forgetting core language knowledge

**Benefits:**
- Balances adaptation with preservation
- Computationally efficient (fewer parameters to update)
- Reduces overfitting risk on small datasets
- Faster training than full fine-tuning

**How does this approach prevent catastrophic forgetting?**

**Catastrophic forgetting** happens when a model "forgets" pre-trained knowledge while learning new tasks.

**Why our approach is safe:**

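1. **Layers 0-2 stay frozen** → Core language knowledge preserved (0% forgetting), and the embeddings and pooler are frozen as well
2. **Only Layer 3 + Classifier adapt** → Updates are confined to roughly 8% of the model's parameters
3. **Learning rate (2.5e-3) touches only unfrozen parameters** → Frozen weights cannot move at all; note that 2.5e-3 is well above the 2e-5 to 5e-5 range usually recommended for full BERT fine-tuning, so it is worth watching validation loss for instability
4. **Few epochs** → Not enough time to drastically overwrite pre-trained weights
5. **Small dataset (369 samples)** → Limited examples to "overwrite" pre-trained patterns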

**The Math:**
```
Frozen (embeddings, layers 0-2, pooler): 0% forgetting (no gradient updates)
Layer 3: trainable, ~1.14M parameters (gentle adaptation)
Classifier: 100% new (learns from scratch)

Overall: ~92% of model parameters unchanged (assuming the standard 4L/312D config)
```
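
The share of frozen parameters can be estimated with simple arithmetic. The sketch below assumes the usual configuration for this checkpoint (vocabulary of 30,522, 512 positions, intermediate size 1,200, 3 labels). These values are assumptions and are worth confirming against `model.config` before relying on the numbers.

```python
# Assumed config values for TinyBERT_General_4L_312D (confirm with model.config)
vocab_size, max_positions, type_vocab = 30522, 512, 2
hidden, intermediate, n_layers, n_labels = 312, 1200, 4, 3


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)    # feed-forward expansion
    + linear(intermediate, hidden)    # feed-forward projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, n_labels)

total = embeddings + n_layers * encoder_layer + pooler + classifier
trainable = encoder_layer + classifier   # layer 3 plus the classification head
print(f"total={total:,} trainable={trainable:,} frozen={1 - trainable / total:.1%}")
# total=14,351,187 trainable=1,143,123 frozen=92.0%
```

Most of the frozen weight sits in the embedding table, which is why unfreezing one of four encoder layers still leaves about 92% of the model untouched.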

**Analogy:**
Think of learning a new accent:
- **Layers 0-2** (frozen): Your core language skills (grammar, vocabulary) - never forget
- **Layer 3** (unfrozen): Your pronunciation style - adapts to new accent
- **Classifier** (new): Your specific phrases for this conversation

You adapt your pronunciation without forgetting your native language!

**Key insight:** By freezing most layers and only fine-tuning the top layer + classifier, we get the best of both worlds: adaptation to our specific task while preserving general language understanding. This is the power of transfer learning!
