# 📕 PYTORCH FILE 3-B: TRANSFER LEARNING & MIXED PRECISION

**Part:** ADVANCED & PROFESSIONAL

**Objectives:**
- ✅ Understand and apply Transfer Learning
- ✅ Fine-tuning strategies
- ✅ Use pretrained models (ResNet, MobileNet, EfficientNet)
- ✅ Mixed Precision Training with PyTorch
- ✅ Performance optimization

**Duration:** 2-3 weeks

---

## 📚 Table of Contents

### PART 1: TRANSFER LEARNING
1. What is Transfer Learning?
2. Pretrained Models in PyTorch
3. Feature Extraction
4. Fine-tuning
5. Best Practices

### PART 2: PRETRAINED MODELS
1. torchvision.models
2. ResNet
3. MobileNet
4. EfficientNet
5. Model Zoo

### PART 3: MIXED PRECISION TRAINING
1. What is Mixed Precision?
2. Automatic Mixed Precision (AMP)
3. GradScaler
4. Performance Comparison
5. Best Practices

---

In [None]:
# Import the required libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torchvision import models
from torch.utils.data import DataLoader, Dataset
# Note: recent PyTorch releases prefer torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda");
# torch.cuda.amp still works there but may emit a deprecation warning.
from torch.cuda.amp import autocast, GradScaler
import numpy as np
import matplotlib.pyplot as plt
import time
from tqdm import tqdm

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ Torchvision version: {torchvision.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   Device: {torch.cuda.get_device_name(0)}")
    print(f"   CUDA version: {torch.version.cuda}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")

---

# PART 1: TRANSFER LEARNING

## 1.1 What is Transfer Learning?

### Definition

**Transfer Learning** = reusing the knowledge of a model already trained on a large dataset (such as ImageNet) for a new task

### Why use Transfer Learning?

| Problem | Solution |
|---------|----------|
| 📊 Little data | Pretrained weights from ImageNet (~1.28M training images) |
| ⏱️ Long training time | No need to train from scratch |
| 💰 High compute cost | Fine-tune only part of the network |
| 🎯 High effectiveness | Reach high accuracy with little data |

### Two main strategies

#### 1. Feature Extraction
```
Pretrained Model (FROZEN) → New Classifier (TRAINABLE)
```
- Freeze all pretrained layers
- Train only the new classifier
- Fast, works with little data

#### 2. Fine-tuning
```
Pretrained Model (PARTIALLY FROZEN) → New Classifier (TRAINABLE)
```
- Unfreeze some of the last layers
- Train with a small learning rate
- Slower, but usually more accurate

## 1.2 Pretrained Models in PyTorch

### torchvision.models

PyTorch provides many pretrained models through `torchvision.models`:

| Model | Parameters | Top-1 Acc (ImageNet) | When to use |
|-------|-----------|-----------|---------------|
| **ResNet-50** | 25.6M | 76.1% | Balanced accuracy/speed |
| **ResNet-101** | 44.5M | 77.4% | Higher accuracy needed |
| **MobileNet-V2** | 3.5M | 71.9% | Mobile, edge devices |
| **EfficientNet-B0** | 5.3M | 77.7% | Best accuracy/size ratio |
| **VGG-16** | 138M | 71.6% | Simple, easy to understand |

### Recommendations

- 🚀 **Production/Mobile**: MobileNetV2, EfficientNet
- 🎯 **High Accuracy**: ResNet-50/101, EfficientNet
- 📚 **Learning**: ResNet-18, MobileNetV2

In [None]:
# Load pretrained models
# Note: `pretrained=True` is deprecated since torchvision 0.13;
# the `weights=` enums below select the same ImageNet weights.

print("📦 Loading pretrained models...\n")

# ResNet-18 (small, fast)
resnet18 = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
print("ResNet-18:")
print(f"  Parameters: {sum(p.numel() for p in resnet18.parameters()):,}")
print()

# ResNet-50 (popular)
resnet50 = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
print("ResNet-50:")
print(f"  Parameters: {sum(p.numel() for p in resnet50.parameters()):,}")
print()

# MobileNetV2 (mobile)
mobilenet = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
print("MobileNetV2:")
print(f"  Parameters: {sum(p.numel() for p in mobilenet.parameters()):,}")
print()

print("✅ Models loaded with ImageNet weights!")
print("\n💡 Note: the weights are downloaded from the internet on first use, then cached")

## 1.3 Feature Extraction - Strategy 1

### Workflow

1. Load pretrained model
2. Freeze all layers (`requires_grad=False`)
3. Replace final classifier
4. Train only new classifier

In [None]:
# Example: Feature Extraction with ResNet-18

# Load pretrained ResNet-18
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Step 1: Freeze ALL layers
for param in model.parameters():
    param.requires_grad = False

print("❄️  Frozen all layers")

# Step 2: Replace final classifier
# ResNet-18 original: fc (512 → 1000 classes)
# Our task: Binary classification (2 classes)

num_features = model.fc.in_features
print(f"\n📊 Original classifier: {num_features} → 1000 (ImageNet)")

# New classifier (freshly created layers are trainable by default)
model.fc = nn.Sequential(
    nn.Linear(num_features, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 2)  # 2 classes
)

print(f"📊 New classifier: {num_features} → 256 → 2 (our task)")

# Move to device
model = model.to(device)

# Check trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())

print("\n📊 PARAMETERS:")
print(f"   Total: {total_params:,}")
print(f"   Trainable: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")
print(f"   Frozen: {total_params - trainable_params:,}")

print("\n✅ Feature Extraction model ready!")
print("   - Only the new classifier is trained (~1.2% of parameters)")
print("   - Training is very fast!")

In [None]:
# Training function for feature extraction

def train_feature_extraction(model, train_loader, epochs=10):
    """
    Train only the classifier (feature extraction)
    """
    # Optimizer only over the trainable parameters
    optimizer = optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()),
        lr=0.001
    )
    criterion = nn.CrossEntropyLoss()
    
    # Note: train() mode also lets frozen BatchNorm layers keep updating their running statistics
    model.train()
    history = {'loss': [], 'accuracy': []}
    
    for epoch in range(epochs):
        running_loss = 0.0
        correct = 0
        total = 0
        
        pbar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')
        # enumerate gives an exact batch count; pbar.n is only refreshed periodically and may lag
        for step, (inputs, labels) in enumerate(pbar, start=1):
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

            pbar.set_postfix({
                'loss': f'{running_loss/step:.4f}',
                'acc': f'{100.*correct/total:.2f}%'
            })
        
        epoch_loss = running_loss / len(train_loader)
        epoch_acc = 100. * correct / total
        history['loss'].append(epoch_loss)
        history['accuracy'].append(epoch_acc)
    
    return history

print("✅ Training function defined!")
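
One subtlety with a frozen backbone: `model.train()` switches BatchNorm layers to training mode, so their running mean and variance keep changing even when `requires_grad=False`. Below is a minimal sketch of one common workaround. The helper name `set_frozen_bn_eval` is hypothetical, and a tiny stand-in model replaces ResNet-18 to keep the example small.

```python
import torch
import torch.nn as nn

def set_frozen_bn_eval(model: nn.Module) -> None:
    """Put BatchNorm layers whose parameters are all frozen into eval mode."""
    for module in model.modules():
        if isinstance(module, nn.modules.batchnorm._BatchNorm):
            if all(not p.requires_grad for p in module.parameters()):
                module.eval()

# Tiny stand-in for a frozen backbone (assumption: the real model would be ResNet-18).
backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
for p in backbone.parameters():
    p.requires_grad = False

backbone.train()               # BatchNorm would now update its running stats
set_frozen_bn_eval(backbone)   # ...unless it is switched back to eval mode

before = backbone[1].running_mean.clone()
backbone(torch.randn(4, 3, 16, 16))
stats_unchanged = torch.equal(before, backbone[1].running_mean)
print(stats_unchanged)
```

Because `model.train()` resets every submodule, the helper would need to be called again after each `model.train()` call.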

## 1.4 Fine-tuning - Strategy 2

### Workflow

1. Start with feature extraction
2. Train classifier first
3. Unfreeze some layers
4. Fine-tune with SMALL learning rate

### Best Practices

- ⚠️ **CRITICAL**: The learning rate for pretrained layers must be VERY SMALL (e.g. 1e-5)
- ✅ Unfreeze gradually (layer by layer)
- ✅ Train the classifier first
- ✅ Use different LRs for different layers

In [None]:
# Fine-tuning example

# Load model (assume the classifier has already been trained)
model_ft = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze all first
for param in model_ft.parameters():
    param.requires_grad = False

# Replace classifier
num_features = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_features, 2)
model_ft = model_ft.to(device)

print("Step 1: Feature extraction (train classifier)")
print("   ... (assume done) ...\n")

# Step 2: Unfreeze last few layers for fine-tuning
print("Step 2: Unfreeze layers for fine-tuning")

# Unfreeze layer4 (last conv block) and fc
for param in model_ft.layer4.parameters():
    param.requires_grad = True

for param in model_ft.fc.parameters():
    param.requires_grad = True

# Count trainable params
trainable_params = sum(p.numel() for p in model_ft.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model_ft.parameters())

print(f"   Trainable: {trainable_params:,} ({trainable_params/total_params*100:.2f}%)")
print()

# Step 3: Different learning rates for different layers
print("Step 3: Setup optimizer with different LRs")

optimizer_ft = optim.Adam([
    {'params': model_ft.layer4.parameters(), 'lr': 1e-5},  # Very small LR for pretrained
    {'params': model_ft.fc.parameters(), 'lr': 1e-4}       # Larger LR for new layers
])

print("   layer4 (pretrained): LR = 1e-5 (SMALL!)")
print("   fc (new): LR = 1e-4")
print()

print("✅ Fine-tuning setup complete!")
print("\n⚠️  CRITICAL:")
print("   - LR for pretrained layers MUST BE VERY SMALL (1e-5)")
print("   - If it is too large → it destroys the pretrained weights!")

## 1.5 Comparison: Feature Extraction vs Fine-tuning

### Feature Extraction

**Pros:**
- ✅ Fast (trains only about 1% of the parameters or fewer)
- ✅ Works even with little data
- ✅ Less prone to overfitting
- ✅ Simple

**Cons:**
- ❌ Accuracy may not be optimal
- ❌ Cannot adapt to datasets that differ a lot from ImageNet

### Fine-tuning

**Pros:**
- ✅ Higher accuracy
- ✅ Adapts well to the new dataset
- ✅ Flexible

**Cons:**
- ❌ Slower
- ❌ Needs more data (>5k images)
- ❌ Overfits easily
- ❌ Hard to tune (learning rate is critical!)

### Recommendation

```
if data < 1000:
    use Feature Extraction
elif data < 5000:
    try Feature Extraction first, then Fine-tuning if needed
else:
    use Fine-tuning
```
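
The same rule of thumb as a runnable helper. The function name `choose_strategy` and the returned labels are hypothetical; the thresholds are the heuristics stated above, not hard limits.

```python
def choose_strategy(num_images: int) -> str:
    """Map dataset size to the rule of thumb above."""
    if num_images < 1000:
        return "feature_extraction"
    if num_images < 5000:
        return "feature_extraction_then_fine_tuning"
    return "fine_tuning"

for n in (500, 3000, 20000):
    print(n, "->", choose_strategy(n))
```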

---

# PART 2: PRETRAINED MODELS DEEP DIVE

## 2.1 ResNet Architecture

### What is ResNet?

**ResNet (Residual Network)** = deep network with skip connections

Key innovation:
```
out = F(x) + x  # Skip connection!
```
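
A minimal sketch of the skip connection as a PyTorch module. `TinyResidualBlock` is a simplified, hypothetical block for illustration; torchvision's real blocks also handle stride and channel changes through a downsample path.

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Simplified residual block: out = relu(F(x) + x), with F = conv-bn-relu-conv-bn."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # skip connection

block = TinyResidualBlock(16)
x = torch.randn(2, 16, 32, 32)
y = block(x)
print(y.shape)
```

Because the input is added back, the block only has to learn a correction `F(x)`; if `F` outputs zero, the block passes `x` through (up to the final ReLU), which is what makes very deep stacks trainable.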

### Variants

- **ResNet-18**: 18 layers
- **ResNet-34**: 34 layers
- **ResNet-50**: 50 layers (popular)
- **ResNet-101**: 101 layers
- **ResNet-152**: 152 layers

In [None]:
# Inspect ResNet architecture

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

print("📦 ResNet-50 Architecture:\n")
print(resnet)
print("\n" + "="*60)

# Key components
print("\n📚 Key Components:")
print("   conv1: Initial 7x7 conv")
print("   layer1-4: Residual blocks")
print("   avgpool: Global average pooling")
print("   fc: Final classifier (2048 → 1000)")

# Modify for custom task
print("\n🔧 Modify for custom task (10 classes):")
num_features = resnet.fc.in_features
resnet.fc = nn.Linear(num_features, 10)
print(f"   Changed fc: {num_features} → 10")

## 2.2 MobileNet - Efficient Architecture

### What is MobileNet?

**MobileNet** = lightweight architecture for mobile/edge devices

Key innovation:
- **Depthwise Separable Convolutions**
- Much smaller than ResNet
- Faster inference
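
To see where the savings come from, the sketch below compares the parameter count of a standard 3x3 convolution with a depthwise separable one. The channel sizes are arbitrary illustration values, and real MobileNet blocks also add BatchNorm and activations (MobileNetV2 additionally uses inverted residuals).

```python
import torch
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch = 64, 128

# Standard 3x3 convolution: every output channel looks at every input channel.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise separable: per-channel 3x3 conv (groups=in_ch), then a 1x1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
)

x = torch.randn(1, in_ch, 56, 56)
assert standard(x).shape == separable(x).shape  # same output shape

print(count_params(standard))   # 64 * 128 * 9 = 73,728
print(count_params(separable))  # 64 * 9 + 64 * 128 = 8,768
```

For these sizes the separable version needs roughly 8x fewer weights while producing an output of the same shape.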

### When to use?

- ✅ Mobile deployment
- ✅ Real-time applications
- ✅ Limited compute resources
- ✅ Edge devices

In [None]:
# MobileNetV2 example

mobilenet = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

print("📦 MobileNetV2 Architecture:\n")

# Check size
total_params = sum(p.numel() for p in mobilenet.parameters())
print(f"Total parameters: {total_params:,}")
print(f"Size: ~{total_params * 4 / (1024**2):.1f} MB (FP32)")

# Modify classifier
print("\n🔧 Modify for custom task:")
# MobileNetV2 keeps its head in `classifier` (Dropout, Linear) instead of `fc`
mobilenet.classifier[1] = nn.Linear(mobilenet.classifier[1].in_features, 10)
print("   Changed classifier for 10 classes")

print("\n✅ MobileNetV2 ready!")
print("   3.5M params vs 25.6M (ResNet-50)")
print("   ~7x smaller!")

---

# PART 3: MIXED PRECISION TRAINING

## 3.1 What is Mixed Precision?

### Definition

**Mixed Precision** = training with a mix of FP16 and FP32

```
Forward pass:   FP16 for eligible ops such as matmul/conv (faster)
     ↓
Loss:           FP32 (autocast keeps precision-sensitive ops in FP32)
     ↓
Backward:       each op runs in the dtype of its forward pass; gradients of FP32 weights are FP32
     ↓
Update weights: FP32 (precise)
```

### Benefits

| Benefit | Explanation |
|---------|-------------|
| 🚀 **Up to 2-3x faster** | FP16 operations run faster on GPUs with Tensor Cores; the actual gain depends on model and hardware |
| 💾 **Up to ~50% less activation memory** | FP16 is half the size of FP32; the weights themselves stay in FP32 |
| 📊 **Larger batch** | More free memory → larger batch size |
| ✅ **Same accuracy** | With proper loss scaling |

### When to use?

✅ **USE IT when:**
- The GPU has Tensor Cores (V100, A100, RTX 20xx+)
- The model is large and memory-hungry
- Training needs to be faster

❌ **NOT NEEDED when:**
- Only a CPU is available
- The GPU is old (poor FP16 support)
- The model is small and already trains fast
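
A quick way to check the hardware condition in code. The helper name `has_tensor_cores` is hypothetical, and the rule "compute capability 7.0 or higher" (Volta and newer) is used as a heuristic for NVIDIA GPUs.

```python
import torch

def has_tensor_cores() -> bool:
    """Heuristic: NVIDIA GPUs with compute capability >= 7.0 have Tensor Cores."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(0)
    return major >= 7

use_amp = has_tensor_cores()
print(f"Enable AMP: {use_amp}")
```

The resulting flag could be passed as `use_amp` to a training function so the same notebook runs on CPU-only machines.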

## 3.2 Automatic Mixed Precision (AMP) in PyTorch

### torch.cuda.amp

PyTorch provides AMP through `torch.cuda.amp`:

- **autocast**: automatically runs eligible operations in FP16
- **GradScaler**: scales the loss (and therefore the gradients) to avoid FP16 underflow
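
Why scaling is needed can be shown on the CPU with a single number. The value `1e-8` is an arbitrary illustration of a very small gradient; `65536` (2**16) is `GradScaler`'s default initial scale.

```python
import torch

tiny_grad = torch.tensor(1e-8, dtype=torch.float32)

# The smallest positive FP16 value is about 6e-8, so 1e-8 rounds to zero.
print(tiny_grad.half().item())  # 0.0

# Scaling first keeps the value representable in FP16.
scale = 65536.0
scaled = (tiny_grad * scale).half()
print(scaled.item() > 0)  # True

# Unscaling in FP32 recovers approximately the original value.
recovered = scaled.float() / scale
print(recovered.item())
```

`GradScaler` applies the same idea to the loss before `backward()` and divides the gradients by the scale before the optimizer step.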

### Usage

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad()
    
    # Forward with autocast
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    
    # Backward with scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

In [None]:
# Example: Training with Mixed Precision

def train_with_amp(model, train_loader, epochs=5, use_amp=True):
    """
    Training with Automatic Mixed Precision
    
    Args:
        model: PyTorch model
        train_loader: DataLoader
        epochs: Number of epochs
        use_amp: Enable AMP or not
    """
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    
    # GradScaler for AMP
    scaler = GradScaler() if use_amp else None
    
    model.train()
    start_time = time.time()
    
    for epoch in range(epochs):
        running_loss = 0.0
        
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            
            optimizer.zero_grad()
            
            if use_amp:
                # Mixed Precision
                with autocast():
                    outputs = model(inputs)
                    loss = criterion(outputs, targets)
                
                # Scaled backward
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            else:
                # Normal FP32
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
            
            running_loss += loss.item()
        
        epoch_loss = running_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss:.4f}")
    
    elapsed_time = time.time() - start_time
    return elapsed_time

print("✅ Training function with AMP defined!")
print("\n💡 Key points:")
print("   1. autocast(): runs eligible forward-pass ops in FP16 automatically")
print("   2. GradScaler: scales the loss so gradients do not underflow")
print("   3. scaler.step(): unscales gradients, then updates the weights (skips the step on inf/NaN)")
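
The `if use_amp / else` branch can also be avoided: both `autocast` and `GradScaler` accept an `enabled` flag and turn into no-ops when it is `False`. This is a minimal sketch with a tiny stand-in model and random data (both are assumptions for illustration); it falls back to plain FP32 on CPU.

```python
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_amp = device.type == 'cuda'  # in this sketch AMP is only enabled on CUDA

model = nn.Linear(16, 4).to(device)  # tiny stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    # With enabled=False both autocast and the scaler do nothing (plain FP32 training).
    with torch.autocast(device_type=device.type, enabled=use_amp):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)
print(train_step(inputs, targets))
```

A single code path makes it harder for the FP32 and AMP variants to drift apart.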

In [None]:
# Benchmark: FP32 vs Mixed Precision
# Note: the first run may also pay one-time CUDA/cuDNN warm-up costs, so treat the numbers as rough.

if torch.cuda.is_available():
    print("🔄 Benchmarking FP32 vs Mixed Precision...\n")

    # Create fake data
    fake_data = torch.randn(1000, 3, 224, 224)
    fake_labels = torch.randint(0, 10, (1000,))
    fake_dataset = torch.utils.data.TensorDataset(fake_data, fake_labels)
    fake_loader = DataLoader(fake_dataset, batch_size=32, shuffle=True)

    # Models (random initialization, no pretrained weights)
    model_fp32 = models.resnet18(weights=None, num_classes=10).to(device)
    model_amp = models.resnet18(weights=None, num_classes=10).to(device)

    # Train FP32
    print("Training with FP32...")
    time_fp32 = train_with_amp(model_fp32, fake_loader, epochs=3, use_amp=False)

    print("\nTraining with Mixed Precision...")
    time_amp = train_with_amp(model_amp, fake_loader, epochs=3, use_amp=True)

    # Results
    speedup = time_fp32 / time_amp

    print("\n📊 RESULTS:")
    print("=" * 50)
    print(f"FP32 time:            {time_fp32:.2f}s")
    print(f"Mixed Precision time: {time_amp:.2f}s")
    print(f"Speedup:              {speedup:.2f}x")
    print("=" * 50)

    if speedup > 1.3:
        print(f"\n✅ Mixed Precision is {speedup:.2f}x faster!")
    else:
        print("\n⚠️  Speedup not significant (GPU may not support FP16 well)")

else:
    print("⚠️  CUDA not available. Mixed Precision requires GPU.")
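
For finer-grained measurements, CUDA's asynchronous execution has to be taken into account. The helper below, with the hypothetical name `time_gpu_aware`, is a sketch of one way to time a callable with warm-up iterations and explicit synchronization; it also runs on CPU.

```python
import time
import torch

def time_gpu_aware(fn, warmup: int = 2, iters: int = 5) -> float:
    """Average seconds per call of fn(), excluding warm-up and waiting for queued GPU work."""
    for _ in range(warmup):
        fn()  # absorbs one-time setup costs
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # CUDA kernels launch asynchronously
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
a = torch.randn(512, 512, device=device)
avg_seconds = time_gpu_aware(lambda: a @ a)
print(f"{avg_seconds * 1e3:.3f} ms per matmul")
```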

## 3.3 Best Practices for Mixed Precision

### ✅ DO

#### 1. Use the autocast context
```python
# ✅ GOOD
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, targets)
```

#### 2. Use GradScaler
```python
# ✅ GOOD
scaler = GradScaler()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

#### 3. Gradient clipping with the scaler
```python
# ✅ GOOD
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # Unscale before clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
```

### ❌ DON'T

#### 1. Forgetting GradScaler
```python
# ❌ WRONG: autocast without scaler
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, targets)
loss.backward()  # Small FP16 gradients can underflow to zero!
```

#### 2. Clipping gradients the wrong way
```python
# ❌ WRONG: clipping before unscale
scaler.scale(loss).backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # Wrong: gradients are still scaled here!
scaler.step(optimizer)
scaler.update()
```

### 🎯 Checklist

- [ ] GPU supports fast FP16 (Tensor Cores need compute capability 7.0 or higher)
- [ ] Use `autocast()` for the forward pass
- [ ] Use `GradScaler` for the backward pass
- [ ] Unscale before gradient clipping
- [ ] Verify that accuracy does not drop

---

# 🎓 Summary of FILE 3-B

## ✅ What we covered

### 1. Transfer Learning
- **Concepts**: Reuse pretrained knowledge
- **Feature Extraction**: Freeze all, train classifier
- **Fine-tuning**: Unfreeze some layers, small LR
- **When to use**: Feature extraction (< 5k images), Fine-tuning (> 5k images)

### 2. Pretrained Models
- **torchvision.models**: ResNet, MobileNet, EfficientNet
- **ResNet**: Deep with skip connections
- **MobileNet**: Lightweight for mobile
- **Modify classifier**: Easy adaptation

### 3. Mixed Precision Training
- **Concepts**: FP16 + FP32 for speed
- **AMP**: autocast + GradScaler
- **Benefits**: up to 2-3x faster, up to ~50% less activation memory
- **Requirements**: Modern GPU with Tensor Cores

## 🚀 Key Takeaways

1. **Transfer Learning** = a must-have for Computer Vision
2. **Feature Extraction** → fast, often good enough
3. **Fine-tuning** → slower, higher accuracy
4. **Small LR** is critical for fine-tuning (1e-5)
5. **Mixed Precision** = up to 2-3x speedup on Tensor Core GPUs for very little code
6. **MobileNet** is a strong choice for production/mobile

## 📝 Next: FILE 3-C

- Clean ML Pipeline
- Reproducibility
- Model Evaluation
- Save & Load Models
- Production Best Practices

---

**Congratulations on completing FILE 3-B! 🎉**