# 🤖 Notebook 07: TinyBERT Setup and Freezing

## Enter the Transformer Era

This notebook introduces transfer learning with TinyBERT, a compact transformer distilled from BERT. You'll learn how to load a pre-trained model, selectively freeze and unfreeze its layers, and prepare text for the model as token IDs with attention masks.


## 🧠 Concept Primer: Transfer Learning with Transformers

### What We're Doing
Loading a pre-trained TinyBERT model and adapting it for our specific text classification task through strategic layer freezing and fine-tuning.

### Why This Approach Works
**Pre-trained models have learned general language representations.** Instead of training from scratch, we leverage this knowledge and adapt it to our specific task.

### TinyBERT Architecture
- **4 encoder layers** (BERT-base has 12)
- **Hidden size of 312** (BERT-base uses 768)
- **Pre-trained weights** (distilled from BERT, so they carry general language understanding)
- **Classification head** (a new, randomly initialized linear layer added by `BertForSequenceClassification` and sized by `num_labels`)
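
To get a feel for where the parameters live, the arithmetic below estimates the size of the base model from its dimensions. The vocabulary size, position count, and feed-forward size are assumptions taken from the published TinyBERT 4L-312D configuration, not from this notebook, so treat the totals as a rough sketch and compare them with `model.config` and `sum(p.numel() for p in model.parameters())` once the model is loaded.

```python
# Assumed configuration values for TinyBERT 4L-312D (not stated in this notebook);
# verify against model.config before relying on them.
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2
HIDDEN = 312
INTERMEDIATE = 1200
NUM_LAYERS = 4


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # one scale and one shift vector

# Word, position, and segment embeddings, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + LAYER_NORM

encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)       # query, key, value, attention output
    + LAYER_NORM
    + linear(HIDDEN, INTERMEDIATE)   # feed-forward expansion
    + linear(INTERMEDIATE, HIDDEN)   # feed-forward projection
    + LAYER_NORM
)

pooler = linear(HIDDEN, HIDDEN)
base_model_total = embeddings + NUM_LAYERS * encoder_layer + pooler

print(f"{embeddings:,}")        # 9,683,856
print(f"{encoder_layer:,}")     # 1,142,184
print(f"{base_model_total:,}")  # 14,350,248
```

Under these assumptions, roughly two thirds of the parameters sit in the embedding tables, and each encoder layer holds only about 1.1 million.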

### Freezing Strategy
1. **Freeze all parameters** initially (preserve pre-trained knowledge)
2. **Unfreeze classifier** (adapt to our task)
3. **Unfreeze last encoder layer** (fine-tune high-level features)
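
The three steps above can be tried on a small stand-in network before touching TinyBERT. `ToyClassifier` is a hypothetical model invented for this illustration (four linear "encoder" layers plus a classifier); only the `requires_grad` pattern carries over to the real model.

```python
import torch
import torch.nn as nn


class ToyClassifier(nn.Module):
    """Hypothetical stand-in: four 'encoder' layers plus a classifier head."""

    def __init__(self, hidden=8, n_classes=3):
        super().__init__()
        self.encoder = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(4)])
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):
        for layer in self.encoder:
            x = torch.relu(layer(x))
        return self.classifier(x)


toy_model = ToyClassifier()

# Step 1: freeze everything.
for param in toy_model.parameters():
    param.requires_grad = False

# Step 2: unfreeze the classifier head.
for param in toy_model.classifier.parameters():
    param.requires_grad = True

# Step 3: unfreeze the last encoder layer.
for param in toy_model.encoder[-1].parameters():
    param.requires_grad = True

trainable = [name for name, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)
# ['encoder.3.weight', 'encoder.3.bias', 'classifier.weight', 'classifier.bias']
```

Order matters here: freezing everything first and then re-enabling selected modules avoids having to list every layer that should stay frozen.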

### Attention Masks vs Padding
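- **Attention masks**: A tensor of 1s and 0s with the same shape as `input_ids`, marking real tokens (1) and padding (0)
- **Padding tokens**: Filler `[PAD]` tokens appended so that every sequence in a batch has the same length
- **Key difference**: Padding changes the input; the attention mask tells the model to ignore the padded positions when computing attention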
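
The distinction can be seen in a few lines of plain Python. This is a conceptual sketch, not the tokenizer's or the model's actual implementation: the token IDs are made up, the pad ID of 0 is an assumption, and `pad_and_mask` and `masked_softmax` are hypothetical helpers. Real implementations usually add a large negative number to masked attention scores before the softmax, which has the same effect.

```python
import math

PAD_ID = 0  # assumption: the pad token has ID 0


def pad_and_mask(token_ids, max_length, pad_id=PAD_ID):
    """Truncate or pad a list of token IDs and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


def masked_softmax(scores, mask):
    """Softmax over positions where mask == 1; masked positions get weight 0."""
    top = max(s for s, m in zip(scores, mask) if m == 1)
    exps = [math.exp(s - top) if m == 1 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


ids, mask = pad_and_mask([101, 2307, 2326, 102], max_length=6)
print(ids)   # [101, 2307, 2326, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]

# Even if the padded positions have the highest raw scores, they receive no attention.
weights = masked_softmax([2.0, 1.0, 0.5, 1.5, 3.0, 3.0], mask)
print(weights[4], weights[5])  # 0.0 0.0
```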

### Expected Output Example
```python
# Shapes of one batch (batch size 16, max_length 128)
print(input_ids.shape)       # torch.Size([16, 128])
print(attention_mask.shape)  # torch.Size([16, 128])
print(labels.shape)          # torch.Size([16])
```


## 🔧 TODO #1: Load TinyBERT Model and Tokenizer

**Task:** Load the pre-trained TinyBERT model and tokenizer from HuggingFace.

**Hint:** Use `from transformers import BertTokenizer, BertForSequenceClassification` and `BertTokenizer.from_pretrained('huawei-noah/TinyBERT_General_4L_312D')`

**Expected Variables:**
- `tokenizer` → TinyBERT tokenizer
- `model` → TinyBERT model with `num_labels=n_aspects`

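**Model Setup:** Pass `num_labels=n_aspects` to `BertForSequenceClassification.from_pretrained` so the classification head has one output per class. Expect a warning that the classifier weights are newly initialized; that is normal, because the checkpoint contains no task-specific head.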


In [None]:
# TODO #1: Load TinyBERT model and tokenizer
from transformers import BertTokenizer, BertForSequenceClassification

# Your code here


## 🔧 TODO #2: Freeze and Unfreeze Layers

**Task:** Freeze all parameters, then unfreeze the classifier and the last encoder layer.

**Hint:**
- Freeze all: `for param in model.parameters(): param.requires_grad = False`
- Unfreeze classifier: `for param in model.classifier.parameters(): param.requires_grad = True`
- Unfreeze the last encoder layer (index 3, because the 4 layers are zero-indexed): `for param in model.bert.encoder.layer[3].parameters(): param.requires_grad = True`

**Strategy:** Preserve most pre-trained knowledge while allowing task-specific adaptation
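
One way to check that the freezing worked is to count trainable versus total parameters. `count_parameters` is a hypothetical helper, not part of PyTorch or `transformers`; it is demonstrated on a tiny two-layer network so it runs without downloading anything.

```python
import torch.nn as nn


def count_parameters(module):
    """Return (trainable, total) parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in module.parameters())
    trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
    return trainable, total


# Tiny demonstration network: freeze the first layer, leave the second trainable.
demo = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in demo[0].parameters():
    p.requires_grad = False

print(count_parameters(demo))  # (10, 30)
```

After completing TODO #2, calling `count_parameters(model)` should report a trainable count well below the total; if the two numbers match, nothing was frozen.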


In [None]:
# TODO #2: Freeze and unfreeze layers
# Your code here


## 🔧 TODO #3: Tokenize Texts with Attention Masks

**Task:** Tokenize the training texts with the TinyBERT tokenizer, producing both input IDs and attention masks.

**Hint:** Use `tokenizer(train_texts, padding='max_length', truncation=True, max_length=128, return_tensors='pt')`

**Expected Variables:**
- `encodings` → Dict-like `BatchEncoding` containing `input_ids` and `attention_mask` (plus `token_type_ids`)
- `train_dataloader` → DataLoader yielding batches of input IDs, attention masks, and labels

**Key Parameters:**
- `padding='max_length'` → Pad every sequence to `max_length` (128 tokens)
- `truncation=True` → Truncate sequences longer than `max_length`
- `return_tensors='pt'` → Return PyTorch tensors instead of Python lists
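
Once the texts are tokenized, the tensors still have to be grouped into batches. The sketch below shows one possible way to do that with `TensorDataset`, using random stand-in tensors in place of real tokenizer output so it runs on its own; the example count, class count, and batch size of 16 are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors shaped like tokenizer output (hypothetical sizes and values).
n_examples, max_length, n_classes = 40, 128, 3
input_ids = torch.randint(1, 1000, (n_examples, max_length))
attention_mask = torch.ones(n_examples, max_length, dtype=torch.long)
labels = torch.randint(0, n_classes, (n_examples,))

# Each dataset item is an (input_ids, attention_mask, label) triple.
dataset = TensorDataset(input_ids, attention_mask, labels)
train_dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

batch_input_ids, batch_mask, batch_labels = next(iter(train_dataloader))
print(batch_input_ids.shape)  # torch.Size([16, 128])
print(batch_mask.shape)       # torch.Size([16, 128])
print(batch_labels.shape)     # torch.Size([16])
```

In the notebook itself, `encodings['input_ids']`, `encodings['attention_mask']`, and a tensor of your training labels would take the place of the stand-ins.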


In [None]:
# TODO #3: Tokenize texts with attention masks
# Your code here


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Why freeze most layers?** What would happen if you unfroze all layers immediately?

2. **What does `attention_mask` do that padding alone does not?** How does it help the model?

3. **Why use TinyBERT instead of full BERT?** Consider computational efficiency vs performance.

4. **How does the pre-trained vocabulary differ from your custom vocabulary?** What advantages does this provide?

### 🎯 Transfer Learning Strategy
- Why is the classifier layer always unfrozen?
- What's the benefit of unfreezing only the last encoder layer?
- How does this approach prevent catastrophic forgetting?

---

**Write your reflections here:**
