# Lab 2.2.1: CNN Architecture Study

**Module:** 2.2 - Computer Vision  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how convolution operations extract features from images
- [ ] Implement LeNet-5, AlexNet, VGG-11, and ResNet-18 from scratch
- [ ] Compare architecture performance on CIFAR-10
- [ ] Understand the evolution of CNN design principles

---

## 📚 Prerequisites

- Completed: Module 6 (PyTorch Deep Learning)
- Knowledge of: Neural networks, backpropagation, PyTorch basics

---

## 🌍 Real-World Context

**CNNs power the visual AI all around us:**

- 📸 **Your phone's camera** uses CNNs for face detection, portrait mode, and scene recognition
- 🚗 **Self-driving cars** use CNNs to identify pedestrians, signs, and lane markings
- 🏥 **Medical imaging** systems use CNNs to flag possible tumors in X-rays, in some studies matching specialist-level accuracy
- 🛒 **Amazon Go** stores use CNNs to track what you pick up

The architectures you'll learn today form the foundation of all these systems!

---

## 🧒 ELI5: What is a Convolutional Neural Network?

> **Imagine you're playing "I Spy" with a friend...** 🔍
>
> When you look for something specific (like a red ball), you don't examine every single pixel of your vision at once. Instead, your eyes scan across the scene, looking for specific patterns:
> - First, you look for anything red
> - Then, you look for round shapes
> - Finally, you check if the red round thing is the right size
>
> **A CNN works exactly the same way!**
>
> - **Convolutional layers** are like pattern detectors that slide across the image
> - **Early layers** detect simple patterns (edges, colors)
> - **Deeper layers** combine simple patterns into complex ones (eyes → faces)
> - **Pooling layers** are like squinting - you lose detail but see the big picture
>
> **In AI terms:** A CNN applies learnable filters across an image to detect hierarchical features, building from edges to textures to object parts to whole objects.

---

## Part 1: Understanding Convolution Operations

### The Core Operation

A convolution is like a magnifying glass that looks at small patches of an image and produces a single number representing "how much does this patch match my pattern?"

```
Input Image          Filter (Kernel)       Output Feature Map
┌─────────────┐      ┌───────┐            ┌────────────┐
│ 1  2  3  4  │      │ 1  0  │            │  7   9  11 │
│ 5  6  7  8  │  ×   │ 0  1  │    =       │ 15  17  19 │
│ 9  10 11 12 │      └───────┘            │ 23  25  27 │
│ 13 14 15 16 │                           └────────────┘
└─────────────┘
(stride 1, no padding: each output = top-left + bottom-right of a 2×2 patch)
```
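The sliding-window arithmetic is easy to check by hand. Here is a minimal pure-Python sketch (the helper name `conv2d_valid` is made up for illustration) that assumes a square input, stride 1, and no padding, so a 4×4 input and a 2×2 filter give a 3×3 feature map. Like `F.conv2d` in PyTorch, it does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    k = len(kernel)
    out_size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, 1]]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [7, 9, 11]
# [15, 17, 19]
# [23, 25, 27]
```

The top-left output is 1×1 + 2×0 + 5×0 + 6×1 = 7; every other entry follows the same pattern one step to the right or down.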

Let's see this in action!

In [None]:
# Setup - Run this first!
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from typing import Tuple, List, Dict
import time
from tqdm.auto import tqdm

# DGX Spark optimizations
torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision('high')

# Check our hardware
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Device: {device}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Visualize what convolution does
def visualize_convolution():
    """
    Show how different filters detect different features.
    """
    # Load a sample image
    transform = transforms.Compose([
        transforms.ToTensor()
    ])
    dataset = torchvision.datasets.CIFAR10(root='../data', train=True, 
                                            download=True, transform=transform)
    
    # Get a nice image (let's find a car)
    for img, label in dataset:
        if label == 1:  # Car class
            break
    
    # Convert to grayscale for clearer visualization
    gray_img = img.mean(dim=0, keepdim=True).unsqueeze(0)  # [1, 1, 32, 32]
    
    # Define edge detection filters
    filters = {
        'Horizontal Edges': torch.tensor([[[[-1., -1., -1.],
                                            [ 0.,  0.,  0.],
                                            [ 1.,  1.,  1.]]]]),
        'Vertical Edges': torch.tensor([[[[-1., 0., 1.],
                                          [-1., 0., 1.],
                                          [-1., 0., 1.]]]]),
        'Corners': torch.tensor([[[[ 0., -1., 0.],
                                   [-1.,  4., -1.],
                                   [ 0., -1., 0.]]]]),
        'Blur': torch.ones(1, 1, 3, 3) / 9
    }
    
    fig, axes = plt.subplots(2, 3, figsize=(12, 8))
    
    # Original image
    axes[0, 0].imshow(img.permute(1, 2, 0).numpy())
    axes[0, 0].set_title('Original (Color)', fontsize=12)
    axes[0, 0].axis('off')
    
    axes[0, 1].imshow(gray_img.squeeze().numpy(), cmap='gray')
    axes[0, 1].set_title('Grayscale', fontsize=12)
    axes[0, 1].axis('off')
    
    # Apply each filter
    for idx, (name, kernel) in enumerate(filters.items()):
        output = F.conv2d(gray_img, kernel, padding=1)
        row, col = divmod(idx + 2, 3)
        axes[row, col].imshow(output.squeeze().detach().numpy(), cmap='gray')
        axes[row, col].set_title(f'{name}', fontsize=12)
        axes[row, col].axis('off')
    
    plt.suptitle('🔍 How Convolution Filters Detect Features', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

visualize_convolution()

### 🔍 What Just Happened?

Each filter detected different features:
- **Horizontal edges**: Bright where there are horizontal lines (like car rooflines)
- **Vertical edges**: Bright where there are vertical lines (like car sides)
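- **Corners**: This kernel is a discrete Laplacian, so it responds to rapid intensity changes in any direction: corners, but also edges and fine detail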
- **Blur**: Smoothed out the image (useful for noise reduction)

**Key insight**: In a CNN, the network *learns* what filters to use! It discovers the most useful patterns automatically.

---

## Part 2: The CNN Architecture Timeline

Let's trace the evolution of CNNs:

```
1998: LeNet-5         2012: AlexNet        2014: VGG           2015: ResNet
  │                      │                   │                    │
  ▼                      ▼                   ▼                    ▼
60K params           62M params          138M params         11M params (ResNet-18)
7 layers             8 layers            16-19 layers        18-152 layers!
Digit recognition    ImageNet winner     Deeper = Better?    Skip connections!

Each architecture solved a critical problem:
- **LeNet**: Proved CNNs work for visual recognition
- **AlexNet**: Made them work at scale with GPUs
- **VGG**: Showed that depth matters (using 3×3 filters)
- **ResNet**: Made very deep networks trainable by adding skip connections that keep gradients flowing

---

## Part 3: LeNet-5 (1998) - The Pioneer 🏛️

### 🧒 ELI5: LeNet

> **Imagine you're sorting letters by their zip codes...**
>
> Before LeNet, computers couldn't reliably read handwritten numbers. Yann LeCun created a network that looks at digits the way you might:
> 1. First, spot the curves and straight lines
> 2. Then, combine them into recognizable patterns (loops, crosses)
> 3. Finally, decide which digit it most looks like
>
> **LeNet was one of the first CNNs to be used commercially** - it processed millions of checks at banks!

### Architecture

```
Input (32×32×1)
    │
    ▼
Conv1 (5×5, 6 filters) → 28×28×6
    │
    ▼
AvgPool (2×2) → 14×14×6
    │
    ▼
Conv2 (5×5, 16 filters) → 10×10×16
    │
    ▼
AvgPool (2×2) → 5×5×16
    │
    ▼
Flatten → 400
    │
    ▼
FC1 → 120 → FC2 → 84 → Output → 10
```

In [None]:
class LeNet5(nn.Module):
    """
    LeNet-5 implementation (adapted for CIFAR-10's 3 channels).
    
    Original paper: "Gradient-Based Learning Applied to Document Recognition"
    by Yann LeCun et al., 1998
    
    Key innovations:
    - Convolutional layers for spatial feature extraction
    - Subsampling (pooling) for translation invariance
    - Tanh activation (original), we use ReLU here
    """
    
    def __init__(self, num_classes: int = 10):
        super(LeNet5, self).__init__()
        
        # Feature extraction layers
        self.conv1 = nn.Conv2d(3, 6, kernel_size=5)      # 32→28
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2) # 28→14
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)     # 14→10
        
        # Classification layers
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, num_classes)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv block 1
        x = self.pool(F.relu(self.conv1(x)))  # [B, 6, 14, 14]
        
        # Conv block 2
        x = self.pool(F.relu(self.conv2(x)))  # [B, 16, 5, 5]
        
        # Flatten and classify
        x = x.view(x.size(0), -1)  # [B, 400]
        x = F.relu(self.fc1(x))    # [B, 120]
        x = F.relu(self.fc2(x))    # [B, 84]
        x = self.fc3(x)            # [B, 10]
        
        return x

# Test it!
model = LeNet5()
dummy_input = torch.randn(1, 3, 32, 32)
output = model(dummy_input)

print(f"📊 LeNet-5 Architecture")
print(f"   Input shape:  {dummy_input.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Parameters:   {sum(p.numel() for p in model.parameters()):,}")
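The parameter count printed by the cell above can be reproduced by hand, which is a useful sanity check when adapting an architecture. This is a small sketch with illustrative helper names (`conv_params`, `linear_params`, `lenet5_params` are not part of PyTorch); it assumes every layer has a bias, as the `LeNet5` class above does.

```python
def conv_params(in_ch, out_ch, k):
    """k*k weights per input/output channel pair, plus one bias per output channel."""
    return in_ch * out_ch * k * k + out_ch

def linear_params(in_features, out_features):
    """One weight per input/output pair, plus one bias per output."""
    return in_features * out_features + out_features

def lenet5_params(in_channels=3, num_classes=10):
    """Sum the parameters of the five learnable layers (pooling has none)."""
    return (conv_params(in_channels, 6, 5)
            + conv_params(6, 16, 5)
            + linear_params(16 * 5 * 5, 120)
            + linear_params(120, 84)
            + linear_params(84, num_classes))

print(f"CIFAR-10 version (3 channels): {lenet5_params(3):,}")  # 62,006
print(f"Grayscale version (1 channel): {lenet5_params(1):,}")  # 61,706
```

Most of the parameters (48,120) sit in the first fully connected layer, not in the convolutions.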

### ✋ Try It Yourself #1

Modify LeNet to use **MaxPooling** instead of **AvgPooling**. Which do you think will work better for CIFAR-10 and why?

<details>
<summary>💡 Hint</summary>

- MaxPool keeps the strongest activation in each region ("Was there an edge here?")
- AvgPool smooths activations ("How much edge on average?")
- For detecting objects, strong features often matter more than average features

</details>
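To see what the two pooling operations do to the same numbers, here is a small pure-Python sketch. The helper `pool2x2` and the activation values are made up for illustration, and the sketch assumes a square map with even side length.

```python
def pool2x2(feature_map, reduce):
    """Apply `reduce` to each non-overlapping 2x2 window (stride 2)."""
    size = len(feature_map)
    return [
        [
            reduce([feature_map[r][c], feature_map[r][c + 1],
                    feature_map[r + 1][c], feature_map[r + 1][c + 1]])
            for c in range(0, size, 2)
        ]
        for r in range(0, size, 2)
    ]

def mean(values):
    return sum(values) / len(values)

# A made-up activation map: one strong response (9.0) among weak ones
activations = [[0.0, 9.0, 1.0, 1.0],
               [0.0, 0.0, 1.0, 1.0],
               [2.0, 2.0, 0.0, 0.0],
               [2.0, 2.0, 0.0, 4.0]]

print("max:", pool2x2(activations, max))   # [[9.0, 1.0], [2.0, 4.0]]
print("avg:", pool2x2(activations, mean))  # [[2.25, 1.0], [2.0, 1.0]]
```

The isolated 9.0 survives max pooling intact but is diluted to 2.25 by average pooling. Whether that helps on CIFAR-10 is what the exercise asks you to test.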

In [None]:
# YOUR CODE HERE: Create LeNet5_MaxPool
# Tip: Just change nn.AvgPool2d to nn.MaxPool2d



---

## Part 4: AlexNet (2012) - The GPU Revolution 🚀

### 🧒 ELI5: AlexNet

> **Imagine trying to spot Waldo in a tiny phone screen vs. a huge wall mural...**
>
> Before AlexNet, neural networks looked at tiny images because bigger ones took forever to process. Alex Krizhevsky's breakthrough was using GPUs (designed for video games!) to crunch through millions of large images.
>
> Key innovations:
> - **ReLU activation**: Faster than tanh (the old standard)
> - **Dropout**: Prevents memorization
> - **Data augmentation**: More training variety
> - **GPU training**: Large speedups over CPUs, which made training on ImageNet-scale data practical
>
> **AlexNet won ImageNet 2012 by a HUGE margin** - shocking the computer vision world and starting the deep learning revolution!

### Architecture (adapted for CIFAR-10)

In [None]:
class AlexNet(nn.Module):
    """
    AlexNet implementation (adapted for 32×32 CIFAR-10 images).
    
    Original paper: "ImageNet Classification with Deep Convolutional Neural Networks"
    by Alex Krizhevsky et al., 2012
    
    Key innovations:
    - ReLU activation (faster training than sigmoid/tanh)
    - Dropout regularization
    - Local Response Normalization (we use BatchNorm instead)
    - Overlapping max pooling
    """
    
    def __init__(self, num_classes: int = 10):
        super(AlexNet, self).__init__()
        
        # Feature extraction (adapted kernel sizes for 32×32 input)
        self.features = nn.Sequential(
            # Conv1: 3×3 kernel here; the original used 11×11 with stride 4 on large ImageNet crops
            nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 32→16
            
            # Conv2: More filters for richer features
            nn.Conv2d(64, 192, kernel_size=3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 16→8
            
            # Conv3-5: Deeper feature extraction
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.BatchNorm2d(384),
            nn.ReLU(inplace=True),
            
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # 8→4
        )
        
        # Classifier with dropout
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 4 * 4, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.classifier(x)
        return x

# Test it!
model = AlexNet()
output = model(dummy_input)

print(f"📊 AlexNet Architecture (CIFAR-10 adapted)")
print(f"   Input shape:  {dummy_input.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Parameters:   {sum(p.numel() for p in model.parameters()):,}")

### Key Insight: Dropout as Regularization

```
Training (Dropout ON):          Inference (Dropout OFF):
┌─────────────────┐            ┌─────────────────┐
│ ●─○─●─●─○─●     │            │ ●─●─●─●─●─●     │
│ │   │ │   │     │     →      │ │ │ │ │ │ │     │
│ ●─●─○─●─●─○     │            │ ●─●─●─●─●─●     │
└─────────────────┘            └─────────────────┘
○ = dropped (random)           All neurons active
```

**Why it works**: Forces the network to not rely on any single neuron. Like training for a group project where members randomly miss meetings - everyone learns to contribute!
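The train/inference difference can be sketched in a few lines of pure Python. This is an illustrative re-implementation (the function `inverted_dropout` is hypothetical, not PyTorch code) of the "inverted dropout" scheme: survivors are scaled by `1 / (1 - p)` during training so that the expected activation stays the same and inference needs no correction. `nn.Dropout` applies the same rescaling in training mode. The sketch assumes `p < 1`.

```python
import random

def inverted_dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p while training; rescale survivors by 1/(1-p)."""
    if not training:
        return list(values)  # inference: every unit is kept, no rescaling needed
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 8

train_out = inverted_dropout(activations, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(activations, p=0.5, training=False)

print("train:", train_out)  # a random mix of 0.0 and 2.0
print("eval: ", eval_out)   # unchanged: all 1.0
```

This is also why forgetting `model.eval()` before evaluation hurts accuracy: dropout keeps zeroing activations at random.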

---

## Part 5: VGG (2014) - The Power of Depth 📏

### 🧒 ELI5: VGG

> **Imagine building with LEGO blocks...**
>
> Previous CNNs used different sized bricks (5×5, 7×7, 11×11 filters). VGG's insight was: **just use tiny 3×3 bricks everywhere!**
>
> Why? Two 3×3 layers see the same area as one 5×5 layer, but:
> - **More layers** = more ReLUs = more non-linearity = richer features
> - **Fewer parameters** = less memory
>
> **VGG's motto**: "Keep it simple, make it deep!"

### Why 3×3 Works Better

```
One 5×5 filter:                 Two 3×3 filters stacked:
┌───────────┐                   ┌───────┐
│ ● ● ● ● ● │                   │ ● ● ● │   ┌───────┐
│ ● ● ● ● ● │   Same            │ ● ● ● │ → │ ● ● ● │
│ ● ● ● ● ● │   receptive   =   │ ● ● ● │   │ ● ● ● │ → Output
│ ● ● ● ● ● │   field!          └───────┘   │ ● ● ● │
│ ● ● ● ● ● │                               └───────┘
└───────────┘
25 parameters                   9 + 9 = 18 parameters!
1 ReLU                          2 ReLUs (more expressive!)
```
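Both claims in the diagram can be checked with simple arithmetic. The sketch below uses illustrative helper names and assumes stride-1 convolutions with the same number of input and output channels (64 is an arbitrary example width); biases are ignored.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: each k×k layer adds k - 1."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

def conv_weights(kernel_size, channels):
    """Weight count of one conv layer with `channels` inputs and outputs (no biases)."""
    return kernel_size * kernel_size * channels * channels

channels = 64  # illustrative width
one_5x5 = conv_weights(5, channels)
two_3x3 = 2 * conv_weights(3, channels)

print(stacked_receptive_field([5]))        # 5
print(stacked_receptive_field([3, 3]))     # 5
print(stacked_receptive_field([3, 3, 3]))  # 7
print(f"{one_5x5:,} vs {two_3x3:,} weights")  # 102,400 vs 73,728
```

The 25-versus-18 ratio in the diagram holds at any channel width, because both counts scale with the square of the channel count.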

In [None]:
class VGG11(nn.Module):
    """
    VGG-11 implementation (Configuration A from the paper).
    
    Original paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition"
    by Karen Simonyan and Andrew Zisserman, 2014
    
    Key innovations:
    - Uniform 3×3 convolutions throughout
    - Doubling filters after each pooling
    - Very deep networks (11-19 layers)
    """
    
    def __init__(self, num_classes: int = 10):
        super(VGG11, self).__init__()
        
        # VGG block helper
        def vgg_block(in_channels: int, out_channels: int, num_convs: int):
            layers = []
            for i in range(num_convs):
                layers.append(nn.Conv2d(
                    in_channels if i == 0 else out_channels,
                    out_channels,
                    kernel_size=3, padding=1
                ))
                layers.append(nn.BatchNorm2d(out_channels))
                layers.append(nn.ReLU(inplace=True))
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            return nn.Sequential(*layers)
        
        # VGG-11 configuration: [64, M, 128, M, 256, 256, M, 512, 512, M, 512, 512, M]
        self.features = nn.Sequential(
            vgg_block(3, 64, 1),     # 32→16
            vgg_block(64, 128, 1),   # 16→8
            vgg_block(128, 256, 2),  # 8→4
            vgg_block(256, 512, 2),  # 4→2
            vgg_block(512, 512, 2),  # 2→1
        )
        
        self.classifier = nn.Sequential(
            nn.Linear(512, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

# Test it!
model = VGG11()
output = model(dummy_input)

print(f"📊 VGG-11 Architecture")
print(f"   Input shape:  {dummy_input.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Parameters:   {sum(p.numel() for p in model.parameters()):,}")

---

## Part 6: ResNet (2015) - Skip Connections to the Rescue 🦸

### 🧒 ELI5: ResNet

> **Imagine playing the telephone game with 150 people...**
>
> In a regular deep network, information passes through many layers. Like the telephone game, the message gets distorted with each step. By the time gradients reach the first layers during training, they've essentially vanished!
>
> **ResNet's solution: Skip connections (shortcuts!)**
>
> Instead of just passing the message along, you also whisper the original message directly to people further down the line. Now even if the telephone game distorts things, the original message still gets through!
>
> ```
> Regular Network:     ResNet:
> A → B → C → D        A → B → C → D
>                       ↘─────────↗
>                       (shortcut!)
> ```

### The Residual Block

```
Input (x)
    │
    ├────────────────┐
    ▼                │
┌────────┐           │
│ Conv   │           │  (identity shortcut)
│ BN     │           │
│ ReLU   │           │
│ Conv   │           │
│ BN     │           │
└────────┘           │
    │                │
    ▼                │
   (+)───────────────┘  ← Add the input directly!
    │
    ▼
  ReLU
    │
    ▼
  Output = F(x) + x      ← "Learn the residual"
```

**Why it works**: Instead of learning `H(x)`, the network learns `H(x) - x` (the residual). If the optimal transformation is close to identity, learning a near-zero residual is easier than learning identity!
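The same structure explains why gradients survive the trip back through many blocks. Here is a short derivation, writing the block as $y = F(x) + x$ and, as a simplifying assumption, ignoring the final ReLU:

$$
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial y}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
$$

The identity term $I$ passes the upstream gradient to $x$ unchanged. Even if $\partial F / \partial x$ is tiny, the gradient reaching earlier layers does not shrink to zero, whereas in a plain stack the gradient is a product of the per-layer terms alone.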

In [None]:
class BasicBlock(nn.Module):
    """
    Basic residual block for ResNet-18/34.
    
    Two 3×3 convolutions with a skip connection.
    """
    expansion = 1
    
    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super(BasicBlock, self).__init__()
        
        # Main path
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        
        # Shortcut path (identity or projection)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            # Need to match dimensions with 1×1 conv
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels)
            )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Main path
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        
        # Add shortcut (THIS IS THE MAGIC!)
        out += self.shortcut(x)
        out = F.relu(out)
        
        return out


class ResNet18(nn.Module):
    """
    ResNet-18 implementation.
    
    Original paper: "Deep Residual Learning for Image Recognition"
    by Kaiming He et al., 2015
    
    Key innovations:
    - Skip connections (identity shortcuts)
    - Batch normalization throughout
    - Global average pooling instead of large FC layers
    """
    
    def __init__(self, num_classes: int = 10):
        super(ResNet18, self).__init__()
        
        self.in_channels = 64
        
        # Initial convolution (adapted for 32×32 input)
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        
        # Residual layers
        self.layer1 = self._make_layer(64, 2, stride=1)   # 32×32
        self.layer2 = self._make_layer(128, 2, stride=2)  # 16×16
        self.layer3 = self._make_layer(256, 2, stride=2)  # 8×8
        self.layer4 = self._make_layer(512, 2, stride=2)  # 4×4
        
        # Global average pooling + classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)
    
    def _make_layer(self, out_channels: int, num_blocks: int, stride: int):
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            layers.append(BasicBlock(self.in_channels, out_channels, s))
            self.in_channels = out_channels
        return nn.Sequential(*layers)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Initial conv
        x = F.relu(self.bn1(self.conv1(x)))
        
        # Residual blocks
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        # Classifier
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        
        return x

# Test it!
model = ResNet18()
output = model(dummy_input)

print(f"📊 ResNet-18 Architecture")
print(f"   Input shape:  {dummy_input.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Parameters:   {sum(p.numel() for p in model.parameters()):,}")

### ✋ Try It Yourself #2

Visualize the gradient flow through a ResNet block vs a plain block. Create a simple experiment:

1. Create a plain conv block (2 convs, no skip)
2. Create a residual block (same, but with skip)
3. Pass random input through each
4. Compute gradient of output w.r.t. input
5. Compare gradient magnitudes

<details>
<summary>💡 Hint</summary>

```python
# Compute gradients
x.requires_grad = True
output = block(x)
output.sum().backward()
grad_magnitude = x.grad.abs().mean()
```

</details>

In [None]:
# YOUR CODE HERE: Compare gradient flow



---

## Part 7: Architecture Comparison on CIFAR-10

Now let's train all architectures and compare them! We'll use a consistent training setup.

In [None]:
# ⚠️ DGX SPARK NOTE: When using Docker with num_workers > 0, ensure the --ipc=host flag is set
# Example: docker run --gpus all --ipc=host -it nvcr.io/nvidia/pytorch:25.11-py3
# Without this flag, DataLoader may hang or crash due to shared memory issues.

def get_cifar10_loaders(batch_size: int = 128) -> Tuple[DataLoader, DataLoader]:
    """
    Create CIFAR-10 data loaders with standard augmentation.
    
    Args:
        batch_size: Batch size for training and testing.
                   DGX Spark can handle larger batches (256-512) due to 128GB memory.
    
    Note:
        When running in Docker, use --ipc=host flag for num_workers > 0.
    """
    # Training transforms (with augmentation)
    train_transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), 
                           (0.2023, 0.1994, 0.2010))
    ])
    
    # Test transforms (no augmentation)
    test_transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), 
                           (0.2023, 0.1994, 0.2010))
    ])
    
    train_dataset = torchvision.datasets.CIFAR10(
        root='../data', train=True, download=True, transform=train_transform
    )
    test_dataset = torchvision.datasets.CIFAR10(
        root='../data', train=False, download=True, transform=test_transform
    )
    
    # num_workers=4 requires --ipc=host when running in Docker
    train_loader = DataLoader(train_dataset, batch_size=batch_size,
                              shuffle=True, num_workers=4, pin_memory=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size,
                             shuffle=False, num_workers=4, pin_memory=True)
    
    return train_loader, test_loader

train_loader, test_loader = get_cifar10_loaders()
print(f"📊 Dataset loaded:")
print(f"   Training samples: {len(train_loader.dataset):,}")
print(f"   Test samples:     {len(test_loader.dataset):,}")

In [None]:
def train_model(
    model: nn.Module,
    train_loader: DataLoader,
    test_loader: DataLoader,
    epochs: int = 10,
    lr: float = 0.01
) -> Dict[str, List[float]]:
    """
    Train a model and track metrics.
    
    Returns:
        Dictionary with train_loss, train_acc, test_loss, test_acc histories
    """
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    
    history = {
        'train_loss': [], 'train_acc': [],
        'test_loss': [], 'test_acc': []
    }
    
    for epoch in range(epochs):
        # Training
        model.train()
        train_loss, correct, total = 0, 0, 0
        
        pbar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}')
        for inputs, targets in pbar:
            inputs, targets = inputs.to(device), targets.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            
            # loss.item() is the batch mean, so weight it by the batch size
            train_loss += loss.item() * targets.size(0)
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()
            
            pbar.set_postfix({'loss': f'{train_loss/total:.4f}', 
                            'acc': f'{100.*correct/total:.1f}%'})
        
        history['train_loss'].append(train_loss / total)
        history['train_acc'].append(100. * correct / total)
        
        # Evaluation
        model.eval()
        test_loss, correct, total = 0, 0, 0
        
        with torch.no_grad():
            for inputs, targets in test_loader:
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                
                test_loss += loss.item()
                _, predicted = outputs.max(1)
                total += targets.size(0)
                correct += predicted.eq(targets).sum().item()
        
        history['test_loss'].append(test_loss / len(test_loader))
        history['test_acc'].append(100. * correct / total)
        
        scheduler.step()
        
        print(f"   Test: Loss={history['test_loss'][-1]:.4f}, Acc={history['test_acc'][-1]:.1f}%")
    
    return history
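The learning rates that `CosineAnnealingLR` produces in `train_model` can be previewed with the closed-form cosine schedule. This is a sketch, not the scheduler's own code: the function name `cosine_annealing_lr` is made up, and it assumes `eta_min=0` (the PyTorch default) and one `scheduler.step()` per epoch, as in the training loop above.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.01, t_max=10, eta_min=0.0):
    """Closed-form cosine schedule: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in range(10):
    print(f"epoch {epoch + 1:2d}: lr = {cosine_annealing_lr(epoch):.5f}")
```

The rate starts at `lr`, falls slowly at first, drops fastest around the midpoint (half of `lr` after `t_max / 2` epochs), and flattens out again as it approaches zero.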

In [None]:
# Train all architectures (this takes ~15-20 minutes)
architectures = {
    'LeNet-5': LeNet5(),
    'AlexNet': AlexNet(),
    'VGG-11': VGG11(),
    'ResNet-18': ResNet18()
}

results = {}
epochs = 10  # Quick comparison; use 50+ for best results

for name, model in architectures.items():
    print(f"\n{'='*50}")
    print(f"🏋️ Training {name}...")
    print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"{'='*50}")
    
    start_time = time.time()
    history = train_model(model, train_loader, test_loader, epochs=epochs)
    train_time = time.time() - start_time
    
    results[name] = {
        'history': history,
        'params': sum(p.numel() for p in model.parameters()),
        'train_time': train_time,
        'final_acc': history['test_acc'][-1]
    }
    
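    # Move the weights back to the CPU before clearing the cache: `del model` alone
    # frees nothing because the `architectures` dict still references the model
    model.cpu()
    torch.cuda.empty_cache()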

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Training curves
for name, data in results.items():
    axes[0].plot(data['history']['test_acc'], label=name, linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Test Accuracy (%)')
axes[0].set_title('📈 Test Accuracy Over Training')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Parameter count vs accuracy
names = list(results.keys())
params = [results[n]['params'] / 1e6 for n in names]
accs = [results[n]['final_acc'] for n in names]
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

axes[1].bar(names, params, color=colors)
axes[1].set_ylabel('Parameters (Millions)')
axes[1].set_title('📊 Model Size Comparison')
for i, (n, p, a) in enumerate(zip(names, params, accs)):
    axes[1].annotate(f'{a:.1f}%', (i, p), ha='center', va='bottom')

# Efficiency: Accuracy per million parameters
efficiency = [a / p for a, p in zip(accs, params)]
axes[2].bar(names, efficiency, color=colors)
axes[2].set_ylabel('Accuracy / Million Params')
axes[2].set_title('⚡ Parameter Efficiency')

plt.tight_layout()
plt.show()

# Summary table
print("\n" + "="*70)
print("📋 FINAL COMPARISON SUMMARY")
print("="*70)
print(f"{'Model':<15} {'Parameters':>12} {'Final Acc':>12} {'Train Time':>12}")
print("-"*70)
for name, data in results.items():
    print(f"{name:<15} {data['params']:>12,} {data['final_acc']:>11.1f}% {data['train_time']:>10.1f}s")

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting to adjust for input size

```python
# ❌ Wrong: Using ImageNet-sized kernels on CIFAR-10
self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2)  # 32→13 (not nice!)

# ✅ Right: Adapt kernel size for small images
self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)  # 32→32
```
**Why:** CIFAR-10 images (32×32) are much smaller than ImageNet (224×224). Large kernels and strides destroy too much information.
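The output sizes quoted in the comments follow from the standard convolution size formula, which is quick to check before building a model. A minimal sketch (the helper name `conv_output_size` is illustrative):

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(32, kernel_size=7, stride=2))              # 13
print(conv_output_size(32, kernel_size=3, stride=1))              # 30
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))   # 32
print(conv_output_size(224, kernel_size=7, stride=2, padding=3))  # 112
```

Note that a 3×3 kernel only preserves the 32×32 size when `padding=1` is set; without padding each side shrinks by 2.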

### Mistake 2: Missing normalization

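```python
# ❌ Wrong: Raw transforms
transform = transforms.ToTensor()  # Values 0-1

# ✅ Right: Normalize with the dataset's per-channel mean and std
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                        std=(0.2023, 0.1994, 0.2010))  # CIFAR-10 values used in this lab
])
```
**Why:** Inputs with roughly zero mean and unit variance per channel make training faster and more stable. Use the statistics of the data the model is trained on; when fine-tuning a pre-trained model, use the statistics it was originally trained with (for ImageNet models, mean `[0.485, 0.456, 0.406]` and std `[0.229, 0.224, 0.225]`).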
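What `transforms.Normalize` does to a single pixel is plain per-channel arithmetic. The sketch below re-implements it by hand for illustration (`normalize_pixel` is a hypothetical helper), using the CIFAR-10 statistics from the data loader earlier in this notebook.

```python
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2023, 0.1994, 0.2010)

def normalize_pixel(rgb, mean=CIFAR10_MEAN, std=CIFAR10_STD):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))

for name, pixel in [("black", (0.0, 0.0, 0.0)),
                    ("mean color", CIFAR10_MEAN),
                    ("white", (1.0, 1.0, 1.0))]:
    print(name, [round(v, 2) for v in normalize_pixel(pixel)])
```

After normalization the dataset's average color maps to zero, and black and white land roughly 2.5 standard deviations either side of it, instead of everything sitting in `[0, 1]`.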

### Mistake 3: Wrong residual connection dimensions

```python
# ❌ Wrong: Adding tensors of different sizes
out = self.conv_block(x) + x  # Fails if shapes differ!

# ✅ Right: Use projection shortcut when dimensions change
# In __init__:
if stride != 1 or in_channels != out_channels:
    self.shortcut = nn.Conv2d(in_channels, out_channels, 
                               kernel_size=1, stride=stride)
else:
    self.shortcut = nn.Identity()  # shapes already match
# In forward:
out = self.conv_block(x) + self.shortcut(x)
```
**Why:** Skip connections must have matching dimensions to add.

---

## 🎉 Checkpoint

You've learned:
- ✅ How convolution operations detect features in images
- ✅ The evolution from LeNet (1998) to ResNet (2015)
- ✅ Why 3×3 convolutions are preferred (VGG's insight)
- ✅ How skip connections solve vanishing gradients (ResNet's breakthrough)
- ✅ How to compare architectures fairly on a benchmark dataset

---

## 🚀 Challenge (Optional)

Implement **ResNet with Squeeze-and-Excitation (SE) blocks** - a 2017 improvement that adds channel attention:

```
Input → Conv → SE Block → Output
              ↓
         [Global Pool]
              ↓
         [FC → ReLU → FC → Sigmoid]
              ↓
         [Scale each channel]
```

The SE block learns to weight channels by their importance. Can you beat vanilla ResNet?

<details>
<summary>💡 Starting Code</summary>

```python
class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.squeeze(x).view(b, c)
        y = self.excite(y).view(b, c, 1, 1)
        return x * y
```

</details>

In [None]:
# YOUR CHALLENGE CODE HERE



---

## 📖 Further Reading

- [Original LeNet Paper](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) - Yann LeCun's classic
- [AlexNet Paper](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) - The GPU revolution
- [VGG Paper](https://arxiv.org/abs/1409.1556) - The depth experiments
- [ResNet Paper](https://arxiv.org/abs/1512.03385) - Skip connections explained
- [CS231n CNN Notes](http://cs231n.github.io/convolutional-networks/) - Excellent visualizations

---

## 🧹 Cleanup

In [None]:
# Clear GPU memory
import gc
gc.collect()              # drop unreachable Python objects first
torch.cuda.empty_cache()  # then return cached blocks to the GPU driver

print("✅ Cleanup complete!")
if torch.cuda.is_available():
    print(f"💾 GPU Memory Free: {torch.cuda.mem_get_info()[0] / 1e9:.1f} GB")