# Lesson 3: Transfer Learning

**Module 4: Model Development & Optimization**  
**Estimated Time**: 1-2 hours  
**Difficulty**: Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand *why* Transfer Learning works (Feature Reuse)  
✅ Distinguish between Feature Extraction and Fine-Tuning  
✅ Implement Transfer Learning with PyTorch ResNet  
✅ Answer interview questions on small-data strategies  

---

## 📚 Table of Contents

1. [The Logic: Don't Reinvent the Wheel](#1-concept)
2. [Strategy 1: Feature Extraction (Frozen Body)](#2-extraction)
3. [Strategy 2: Fine-Tuning (Unfrozen Body)](#3-finetuning)
4. [Hands-On: ResNet-18 Example](#4-hands-on)
5. [Interview Preparation](#5-interview-questions)

---

## 1. The Logic: Don't Reinvent the Wheel

Training a deep network (ResNet, BERT) from scratch requires:
- Very large datasets: millions of labeled images (ImageNet) or billions of web pages (Common Crawl).
- Days to weeks of GPU time.
- Massive compute cost.

**Insight**: Early layers learn generic features (edges and textures in vision, grammar in language) that are useful for a wide range of tasks in the same modality.

**Transfer Learning**: Take a model pre-trained on generic data (ImageNet) and adapt it to your specific data (Medical X-Rays).

## 2. Strategy 1: Feature Extraction (Frozen Body)

1. Take a pre-trained model.
2. **Freeze** all layers (their weights receive no gradient updates).
3. Replace the final **Classification Head** (FC layer) with a new one matching your classes.
4. Train **only** the new head.

**Pros**: Very fast to train, and lowers the risk of overfitting on small data.  
**Cons**: Limited adaptability if the domains are very different (Cars -> X-Rays).
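
The four steps above can be sketched as a single training step. This is a minimal illustration, not a full training loop: `body` is a hypothetical stand-in for a pre-trained network, the batch is random data, and the learning rate is an arbitrary choice. It shows that when only the head's parameters are trainable, an optimizer step leaves the frozen body untouched.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained body (in practice, e.g. a ResNet).
body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 2)  # new classification head for 2 classes
model = nn.Sequential(body, head)

# Freeze the body so its weights receive no gradients.
for p in body.parameters():
    p.requires_grad = False

# Hand the optimizer only the parameters that are still trainable (the head).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
criterion = nn.CrossEntropyLoss()

# Random dummy batch: 4 samples, 8 features, 2 classes.
x = torch.randn(4, 8)
y = torch.tensor([0, 1, 0, 1])

# Snapshot the weights so we can compare them after the update.
body_before = [p.clone() for p in body.parameters()]
head_before = [p.clone() for p in head.parameters()]

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()

body_unchanged = all(torch.equal(a, b) for a, b in zip(body_before, body.parameters()))
head_changed = any(not torch.equal(a, b) for a, b in zip(head_before, head.parameters()))
print(f"Body unchanged: {body_unchanged}, head changed: {head_changed}")
```

Filtering on `requires_grad` when building the optimizer is optional in PyTorch (frozen parameters have no gradient and are skipped), but it makes the intent explicit.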

## 3. Strategy 2: Fine-Tuning (Unfrozen Body)

1. Start with Feature Extraction (train head).
2. **Unfreeze** some (or all) of the body layers.
3. Train with a very **low learning rate** (e.g., 1e-5).

**Pros**: Often higher accuracy, because the model adapts to the new domain.  
**Cons**: Risk of **Catastrophic Forgetting** (erasing generic knowledge) and overfitting.
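
One common way to apply the low learning rate from step 3 is to give the body and the head separate optimizer parameter groups. The sketch below assumes a hypothetical `body`/`head` split and an illustrative head learning rate of `1e-3`; only the `1e-5` value comes from the steps above.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for an unfrozen pre-trained body and a new head.
body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
head = nn.Linear(16, 2)

# Per-group learning rates: small steps for the pre-trained body,
# larger steps for the head (1e-3 is an assumed, illustrative value).
optimizer = torch.optim.Adam([
    {"params": body.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

Keeping the body's learning rate small limits how far the pre-trained weights can drift, which is the main guard against catastrophic forgetting mentioned above.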

## 4. Hands-On: ResNet-18 Example

Using PyTorch `torchvision`.

```python
import torch
import torch.nn as nn
from torchvision import models

print("Loading pre-trained ResNet...")
# `pretrained=True` is deprecated since torchvision 0.13; use the `weights` argument.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# 1. Feature Extraction: Freeze everything
for param in model.parameters():
    param.requires_grad = False

# 2. Replace the Head (fc layer)
# ResNet-18's fc layer takes 512 input features. We assume 2 classes (Cat vs Dog).
# A newly created layer has requires_grad=True by default, so only the head trains.
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 2)

# Verify: Only the head should be trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())

print(f"Total Params: {all_params:,}")
print(f"Trainable Params: {trainable_params:,} (Should be small)")

# 3. Fine-Tuning Setup
print("\n...Assume head training is done...")
print("Unfreezing Layer 4 for Fine-Tuning...")

# layer4 is the last residual stage, closest to the head
for param in model.layer4.parameters():
    param.requires_grad = True

trainable_params_ft = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable Params after unfreezing: {trainable_params_ft:,}")
```

## 5. Interview Preparation

### Common Questions

#### Q1: "When should you Fine-Tune vs Feature Extract?"
**Answer**: 
- **Feature Extract** if: The dataset is small (roughly under 1k examples) and similar to the pre-training data.
- **Fine-Tune** if: The dataset is large (roughly over 10k examples) or the domain is very different (e.g., Medical/Satellite).

#### Q2: "What is Gradual Unfreezing?"
**Answer**: "A technique where you unfreeze the top layer, train it, then unfreeze the next layer down, train, and so on. This keeps the large gradients from the randomly initialized head from destroying the well-learned features in the lower layers (Catastrophic Forgetting)."
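
The schedule described in this answer can be sketched as follows. The three-block model and the `train_one_stage` placeholder are hypothetical; a real version would run a training loop at each stage and make sure the optimizer covers the newly unfrozen parameters.

```python
import torch.nn as nn

# Hypothetical stand-in: three "body" blocks (bottom to top) plus a new head.
blocks = nn.ModuleList([nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)])
head = nn.Linear(8, 2)

# Start with the whole body frozen; only the head is trainable.
for p in blocks.parameters():
    p.requires_grad = False


def count_trainable(*modules):
    """Count parameters that currently receive gradients."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)


def train_one_stage():
    """Placeholder for a real training loop over your data."""


trainable_per_stage = []

# Stage 0: train the head alone.
train_one_stage()
trainable_per_stage.append(count_trainable(blocks, head))

# Later stages: unfreeze one block at a time, starting nearest the head.
for block in reversed(list(blocks)):
    for p in block.parameters():
        p.requires_grad = True
    train_one_stage()
    trainable_per_stage.append(count_trainable(blocks, head))

print(trainable_per_stage)  # [18, 90, 162, 234]
```

Each stage adds one block's 72 parameters (8x8 weights plus 8 biases) to the 18 in the head, so the trainable set grows from the top of the network downward.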