# 🚀 Notebook 08: TinyBERT Fine-Tuning Training Loop

## Adapting Pre-Trained Knowledge

This notebook teaches you how to fine-tune TinyBERT using a high learning rate (2.5e-3) to observe the effects of aggressive learning on pre-trained models. You'll implement the training loop with attention masks and track the model's adaptation to your specific task.


## 🧠 Concept Primer: Fine-Tuning with High Learning Rate

### What We're Doing
We fine-tune TinyBERT on your task with a deliberately high learning rate (2.5e-3) and watch how the pre-trained weights respond.

### Why This Learning Rate is Experimental
**Standard transformer fine-tuning uses ~2e-5.** Our 2.5e-3 is roughly 125x higher! This will help you observe:
- **Loss oscillation** vs smooth decrease
- **Overfitting risk** with high LR
- **Pre-trained knowledge retention** under aggressive updates
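
To build intuition for what an oversized step does, here is a toy sketch in plain Python. It is not TinyBERT and it uses plain gradient descent rather than AdamW, so treat it only as an illustration: the function `gradient_descent_losses` and the quadratic loss are assumptions made up for this example.

```python
def gradient_descent_losses(lr, steps=8, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and record the loss."""
    x = x0
    losses = [x * x]
    for _ in range(steps):
        grad = 2 * x        # derivative of x**2
        x = x - lr * grad   # one update step
        losses.append(x * x)
    return losses


smooth = gradient_descent_losses(lr=0.1)     # each step multiplies x by 0.8
diverging = gradient_descent_losses(lr=1.1)  # each step multiplies x by -1.2

# How much larger this notebook's learning rate is than the usual ~2e-5.
lr_ratio = 2.5e-3 / 2e-5

print("small step, last loss:", smooth[-1])
print("large step, last loss:", diverging[-1])
print("learning-rate ratio:", round(lr_ratio))
```

With the small step the loss shrinks at every update; with the large step each update overshoots the minimum and the loss grows. Real fine-tuning is far less clean than this, but the same overshoot mechanism is one reason a high learning rate can make the loss jump around.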

### Training Loop Differences
- **Attention masks** must be passed to the model
- **Labels** can be passed directly to the model (it computes loss internally)
- **Only unfrozen parameters** are updated by the optimizer
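
The last point can be checked on a small stand-in model. This is a minimal sketch, assuming a hypothetical two-layer network (`encoder` plus `classifier`) in place of TinyBERT; the `toy_` names are made up here so they don't clash with the notebook's own `model` and `optimizer`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for TinyBERT: a frozen "encoder" and a trainable "classifier" head.
encoder = nn.Linear(8, 8)
classifier = nn.Linear(8, 2)
toy_model = nn.Sequential(encoder, nn.ReLU(), classifier)

# Freeze the encoder, as you would freeze pre-trained layers.
for p in encoder.parameters():
    p.requires_grad = False

# Give the optimizer only the parameters that still require gradients.
trainable = [p for p in toy_model.parameters() if p.requires_grad]
toy_optimizer = torch.optim.AdamW(trainable, lr=2.5e-3, weight_decay=0.01)

encoder_before = encoder.weight.clone()
classifier_before = classifier.weight.clone()

# One training step on random data.
inputs = torch.randn(4, 8)
targets = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(toy_model(inputs), targets)
toy_optimizer.zero_grad()
loss.backward()
toy_optimizer.step()

num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in toy_model.parameters())
print(f"trainable parameters: {num_trainable} / {num_total}")
print("encoder changed:", not torch.equal(encoder.weight, encoder_before))
print("classifier changed:", not torch.equal(classifier.weight, classifier_before))
```

Printing the trainable-versus-total count before training is a cheap sanity check that the freezing step did what you intended.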

### Expected Behavior
- **Smooth decrease**: Model adapts well to task
- **Oscillation**: Learning rate too high, causing instability
- **Plateau**: Loss stops improving; the model may have settled in a local minimum or a flat region

### Common Pitfalls
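- **Computing the loss separately** instead of passing `labels` inside the `model()` call and reading `outputs.loss`
- **Forgetting attention masks** lets the model attend to padding tokens as if they were real input
- **Not filtering for unfrozen parameters** hands frozen layers to the optimizer; filter on `requires_grad` so only the layers you intend to train are updated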
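
To see why the attention mask matters, here is a conceptual sketch of masked softmax over attention scores. The score values are made up, and real BERT-style implementations typically add a large negative number to masked positions rather than calling `masked_fill`, so read this as an illustration of the idea rather than the library's code.

```python
import torch

# Raw attention scores from one query to 5 key positions (made-up values).
scores = torch.tensor([[2.0, 1.0, 0.5, 3.0, 3.0]])
# 1 = real token, 0 = padding; the last two positions are padding.
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

# Without a mask, padding positions compete for attention like any other token.
unmasked = torch.softmax(scores, dim=-1)

# With a mask, padding scores are pushed to -inf so they receive zero weight.
masked_scores = scores.masked_fill(attention_mask == 0, float("-inf"))
masked = torch.softmax(masked_scores, dim=-1)

print("attention on padding without mask:", unmasked[0, 3:].sum().item())
print("attention on padding with mask:   ", masked[0, 3:].sum().item())
```

In this toy case most of the attention weight lands on padding when the mask is missing, which is why a forgotten mask can quietly degrade results without raising an error.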


## 🔧 TODO #1: Create Optimizer for Unfrozen Parameters

**Task:** Build an AdamW optimizer that only updates unfrozen parameters.

**Hint:** Use `optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=2.5e-3, weight_decay=0.01)`

**Expected Variables:**
- `optimizer` → AdamW optimizer with high learning rate and weight decay

**Key Parameters:**
- `lr=2.5e-3` → High learning rate (experimental)
- `weight_decay=0.01` → Decoupled weight decay (AdamW shrinks the weights directly instead of adding an L2 term to the loss) to reduce overfitting
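
To isolate what `weight_decay` does in AdamW, the sketch below takes one optimizer step with a zero gradient, so any change in the weight comes from the decay alone. The single parameter `w` and the `decay_` names are hypothetical, chosen only for this demonstration.

```python
import torch

# A single parameter with value 1.0, so the decay factor is easy to read off.
w = torch.nn.Parameter(torch.ones(1))
decay_optimizer = torch.optim.AdamW([w], lr=2.5e-3, weight_decay=0.01)

# Zero gradient: the Adam part of the update contributes nothing,
# leaving only the weight-decay shrinkage.
w.grad = torch.zeros_like(w)
decay_optimizer.step()

expected = 1.0 - 2.5e-3 * 0.01  # w * (1 - lr * weight_decay)
print("weight after one step:", w.item())
print("expected:             ", expected)
```

Note that the shrinkage per step is `lr * weight_decay`, so raising the learning rate also strengthens the effective decay.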


In [None]:
# TODO #1: Create optimizer for unfrozen parameters
import torch

# Your code here


## 🔧 TODO #2: Implement Fine-Tuning Training Loop

**Task:** Create a training loop that fine-tunes TinyBERT with attention masks.

**Hint:** Use `model.train()`, unpack batch as `input_ids, attention_mask, labels = batch`, forward with `outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)`, then `loss = outputs.loss`

**Expected Function:**
```python
def train_tinybert(model, train_loader, optimizer, num_epochs=10):
    for epoch in range(num_epochs):
        total_loss = 0
        # TODO: Training loop here
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")
```

**Track:** Loss per epoch to observe training behavior
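
If you want to check your loop's structure before running it on TinyBERT, here is one possible sketch on a stand-in model. `ToyClassifier`, `train_sketch`, and the random data are all hypothetical: the toy model only mimics the `model(input_ids=..., attention_mask=..., labels=...)` call and the `.loss` attribute described in the hint, and it is not how TinyBERT works internally.

```python
from types import SimpleNamespace

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class ToyClassifier(nn.Module):
    """Hypothetical stand-in that mimics the keyword-argument call and `.loss` output."""

    def __init__(self, vocab_size=50, hidden=16, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool embeddings over real (unmasked) tokens only.
        pooled = (self.embed(input_ids) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        logits = self.head(pooled)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return SimpleNamespace(loss=loss, logits=logits)


def train_sketch(model, train_loader, optimizer, num_epochs=3):
    """Run a basic training loop and return the average loss of each epoch."""
    history = []
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        history.append(avg_loss)
        print(f"Epoch {epoch + 1}, Loss: {avg_loss:.4f}")
    return history


# Random toy data: 32 sequences of length 10, last 3 positions treated as padding.
torch.manual_seed(0)
toy_ids = torch.randint(0, 50, (32, 10))
toy_mask = torch.ones(32, 10, dtype=torch.long)
toy_mask[:, 7:] = 0
toy_labels = torch.randint(0, 2, (32,))
toy_loader = DataLoader(TensorDataset(toy_ids, toy_mask, toy_labels), batch_size=8)

sketch_model = ToyClassifier()
sketch_optimizer = torch.optim.AdamW(
    [p for p in sketch_model.parameters() if p.requires_grad], lr=2.5e-3, weight_decay=0.01
)
loss_history = train_sketch(sketch_model, toy_loader, sketch_optimizer, num_epochs=3)
```

Returning the per-epoch losses as a list, rather than only printing them, makes it easier to plot the curve later and compare runs with different learning rates.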


In [None]:
# TODO #2: Implement fine-tuning training loop
# Your code here


## 📝 Reflection Prompts

### 🤔 Understanding Check
1. **Did loss decrease smoothly or jump around?** What does this tell you about the learning rate?

2. **Compare this LR to typical transformer LRs (~2e-5). What's the effect?** How does this impact training stability?

3. **Why use AdamW instead of Adam?** What does weight decay accomplish?

4. **How does passing labels to the model differ from computing loss separately?** What are the benefits?

### 🎯 Training Behavior Analysis
- Was the high learning rate beneficial or harmful?
- How did the loss curve compare to your baseline model?
- What would you expect with a lower learning rate?

---

**Write your reflections here:**
