# Module 04: Training & Optimization Techniques

Learn advanced techniques to train better, faster CNNs.

## Topics Covered
- Optimizers (SGD, Adam, RMSprop)
- Learning rate scheduling
- Batch normalization
- Dropout for regularization
- Data augmentation
- Detecting and fixing overfitting

## Time: 45 minutes

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import matplotlib.pyplot as plt

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Part 1: Optimizers

### Common Optimizers

**SGD (Stochastic Gradient Descent)**
- Classic, simple, reliable
- Requires careful learning rate tuning
- With momentum: smoother updates

**Adam (Adaptive Moment Estimation)**
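- A widely used default
- Adapts the step size per parameter using running estimates of the gradient's first and second moments
- Usually works well with its default settings (`lr=0.001`)

**RMSprop**
- Often used for recurrent networks
- Scales each parameter's step by a running average of squared gradients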

In [None]:
# Compare optimizers (all three wrap the same parameters for illustration;
# a real training run uses only one)
model = nn.Linear(10, 1)

# SGD
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam
optimizer_adam = optim.Adam(model.parameters(), lr=0.001)

# RMSprop
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=0.001)

print("Optimizers created!")
print("\nWhen to use:")
print("- Adam: Default choice, works well for most tasks")
print("- SGD with momentum: Often generalizes better, but needs more tuning")
print("- RMSprop: Recurrent networks, online learning")
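
The cell above only constructs the optimizers. As a minimal sketch of how they behave in a training loop, the example below fits a small linear model to synthetic data with each one. The data, the `train` helper, and the epoch count are assumptions made for illustration, and the resulting losses say little about how these optimizers compare on a real CNN.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression data (hypothetical): y = X @ w_true + small noise
X = torch.randn(256, 10)
w_true = torch.randn(10, 1)
y = X @ w_true + 0.01 * torch.randn(256, 1)


def train(make_optimizer, epochs=200):
    """Train a fresh linear model and return (initial_loss, final_loss)."""
    torch.manual_seed(1)  # same initial weights for every optimizer
    model = nn.Linear(10, 1)
    optimizer = make_optimizer(model.parameters())
    criterion = nn.MSELoss()
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = criterion(model(X), y)  # forward pass
        loss.backward()                # compute gradients
        optimizer.step()               # update parameters
        losses.append(loss.item())
    return losses[0], losses[-1]


results = {
    "SGD+momentum": train(lambda p: optim.SGD(p, lr=0.01, momentum=0.9)),
    "Adam": train(lambda p: optim.Adam(p, lr=0.001)),
    "RMSprop": train(lambda p: optim.RMSprop(p, lr=0.001)),
}

for name, (first, last) in results.items():
    print(f"{name:>12}: loss {first:.4f} -> {last:.4f}")
```

Every optimizer follows the same four-step pattern (`zero_grad`, forward, `backward`, `step`), so swapping one for another only changes the constructor call.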

## Part 2: Learning Rate Scheduling

Start with a higher learning rate, then gradually decrease it.

**Benefits:**
- Fast initial learning
- Fine-tuning at the end
- Better final accuracy

In [None]:
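# One optimizer is shared by all three schedulers below for illustration only;
# in a real training run, attach a single scheduler.
optimizer = optim.Adam(model.parameters(), lr=0.001)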

# Step LR: Reduce by factor every N epochs
scheduler_step = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Exponential: Smooth decay
scheduler_exp = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# Reduce on plateau: Reduce when the monitored loss stops improving
# (pass the metric when stepping: scheduler_plateau.step(val_loss))
scheduler_plateau = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=5)

print("Learning rate schedulers created!")
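
A scheduler only takes effect when it is stepped inside the training loop. The sketch below, under assumed settings (an SGD optimizer with `lr=0.1`, 30 epochs, and a placeholder validation loss that never improves), records the learning rate that `StepLR` produces and shows that `ReduceLROnPlateau` needs the monitored metric passed to `step()`.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)

# --- StepLR: multiply the LR by gamma every step_size epochs ---
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

step_lrs = []
for epoch in range(30):
    step_lrs.append(optimizer.param_groups[0]["lr"])  # LR used this epoch
    # ... forward/backward passes for one epoch would go here ...
    optimizer.step()   # step the optimizer first,
    scheduler.step()   # then the scheduler, once per epoch

print("StepLR at epochs 0/10/20:", step_lrs[0], step_lrs[10], step_lrs[20])

# --- ReduceLROnPlateau: step() takes the monitored metric ---
optimizer = optim.SGD(model.parameters(), lr=0.1)
plateau = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)

fake_val_loss = 1.0  # placeholder: a validation loss that never improves
for epoch in range(10):
    optimizer.step()
    plateau.step(fake_val_loss)

plateau_lr = optimizer.param_groups[0]["lr"]
print("ReduceLROnPlateau LR after a flat loss:", plateau_lr)
```

With `StepLR`, the learning rate drops by a factor of 10 at epochs 10 and 20. With `ReduceLROnPlateau`, the drop happens only after the loss has failed to improve for more than `patience` epochs.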

## Part 3: Batch Normalization

Normalizes each channel's activations over the mini-batch, then applies a learnable scale and shift, for more stable, faster training.

**Benefits:**
- Faster convergence
- Higher learning rates possible
- Acts as a mild regularizer

In [None]:
class CNNWithBatchNorm(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)  # Batch norm after conv
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 10)  # 28x28 input -> 7x7 after two 2x2 pools

    def forward(self, x):
        x = self.pool(torch.relu(self.bn1(self.conv1(x))))
        x = self.pool(torch.relu(self.bn2(self.conv2(x))))
        x = x.view(-1, 64 * 7 * 7)
        x = self.fc1(x)
        return x


model_bn = CNNWithBatchNorm()
print(model_bn)
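
To make the normalization concrete, the sketch below passes a deliberately shifted and scaled random tensor (a hypothetical stand-in for convolutional activations) through a standalone `nn.BatchNorm2d` layer. It checks the per-channel statistics in training mode, then switches to evaluation mode, where the layer uses its running averages instead of the batch statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical activations: batch of 8, 3 channels, 4x4 feature maps,
# shifted and scaled so they are far from zero mean and unit variance
x = torch.randn(8, 3, 4, 4) * 5 + 2

bn = nn.BatchNorm2d(3)

# Training mode: normalize with the statistics of the current batch
bn.train()
y_train = bn(x)
train_mean = y_train.mean(dim=(0, 2, 3))
train_var = y_train.var(dim=(0, 2, 3), unbiased=False)
print("train-mode per-channel mean:", train_mean)
print("train-mode per-channel var: ", train_var)

# Evaluation mode: use the running averages accumulated during training
bn.eval()
y_eval = bn(x)
print("outputs differ between modes:", not torch.allclose(y_train, y_eval))
```

Because the two modes behave differently, call `model.train()` before training and `model.eval()` before validation or inference.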

## Part 4: Dropout for Regularization

Randomly zeroes ("drops") a fraction of activations during training to reduce overfitting. Dropout is disabled in evaluation mode (`model.eval()`).

In [None]:
class CNNWithDropout(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout1 = nn.Dropout(0.25)  # Zero 25% of activations during training
        self.fc1 = nn.Linear(64 * 7 * 7, 128)  # 28x28 input -> 7x7 after two 2x2 pools
        self.dropout2 = nn.Dropout(0.5)  # Zero 50% of activations during training
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = self.dropout1(x)
        x = x.view(-1, 64 * 7 * 7)
        x = torch.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return x


print("Dropout helps prevent overfitting!")
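
The sketch below shows what a dropout layer does to its input, using a standalone `nn.Dropout(p=0.5)` and a tensor of ones (both chosen only for illustration). In training mode, the surviving values are scaled by `1 / (1 - p)` so the expected activation stays the same. In evaluation mode, the input passes through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.0
drop.train()
y_train = drop(x)
kept_fraction = (y_train != 0).float().mean().item()
print("values seen in train mode:", y_train.unique().tolist())
print("fraction kept:", kept_fraction)

# Evaluation mode: dropout is a no-op
drop.eval()
y_eval = drop(x)
print("eval output unchanged:", torch.equal(y_eval, x))
```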

## Part 5: Data Augmentation

Artificially increase the variety of the training data by applying random, label-preserving transformations.

In [None]:
# Data augmentation for training
train_transform = transforms.Compose(
    [
        transforms.RandomRotation(10),  # Rotate ±10 degrees
        transforms.RandomAffine(0, translate=(0.1, 0.1)),  # Shift up to 10% in x and y
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),  # MNIST mean and std
    ]
)

# No augmentation for testing
test_transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)

print("Data augmentation increases effective dataset size!")

## Summary

### Key Techniques:
1. **Adam optimizer** - Great default choice
2. **Learning rate scheduling** - Improve final accuracy
3. **Batch normalization** - Faster, more stable training
4. **Dropout** - Prevent overfitting
5. **Data augmentation** - More varied training data without collecting new samples

### Next: Module 05 - CNN Architectures
Learn famous CNN architectures used in production!