# Convolutional Neural Networks (CNNs): The Solution to Image Data

**Deep Learning - University of Vermont**

---

## Learning Objectives

1.  **Understand the Convolution Operation**: Learn how convolution, padding, and stride work to extract features.
2.  **Implement Classic Architectures**: Build and train LeNet-5 and AlexNet using PyTorch.
3.  **Visualize Learned Features**: See what the network actually "sees" by visualizing filters and feature maps.
4.  **Compare Architectures**: Benchmark MLP, LeNet-5, and AlexNet on the Fashion-MNIST dataset to understand the trade-offs between model size, complexity, and performance.

---

## 1. Recap: Why MLPs Struggle with Images

In the previous tutorial, we saw that **Multi-Layer Perceptrons (MLPs)** scale poorly to image data, for two reasons:

*   **Parameter Explosion**: A 2000x2000 grayscale image flattens to 4 million inputs, so a single fully connected layer with, for example, 1,000 hidden units already needs 4 billion weights.
*   **Loss of Spatial Information**: Flattening an image into a vector destroys the 2D spatial relationships between pixels.

**Convolutional Neural Networks (CNNs)** solve this by using **local connectivity** (looking at small patches of the image) and **parameter sharing** (using the same filter across the entire image).
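
To make the contrast concrete, the sketch below counts parameters for one dense layer and one convolutional layer on a 2000x2000 single-channel image. Only the image size comes from the text; the hidden width, kernel size, and filter count are hypothetical values chosen for illustration.

```python
# Only the 2000x2000 image size comes from the text; the rest are assumed values.
height, width = 2000, 2000
hidden_units = 1000   # assumed width of the first fully connected layer
kernel_size = 3       # assumed 3x3 convolution kernel
out_channels = 16     # assumed number of filters

# Fully connected: every input pixel connects to every hidden unit.
mlp_weights = height * width * hidden_units

# Convolution: each filter has kernel_size x kernel_size weights plus one bias,
# and the same filter is reused at every image position (single input channel).
conv_params = out_channels * (kernel_size * kernel_size + 1)

print(f"Dense layer weights: {mlp_weights:,}")  # 4,000,000,000
print(f"Conv layer params:   {conv_params:,}")  # 160
```

The convolutional count does not depend on the image size at all, which is the practical consequence of parameter sharing.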

## 2. The Convolution Operation

The core of a CNN is the **convolution** (technically cross-correlation) operation. A small matrix called a **kernel** or **filter** slides over the input image, computing a dot product at each position.

In [None]:
import torch
from torch import nn
import matplotlib.pyplot as plt

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

# Example: Edge Detection
X = torch.ones((6, 8))
X[:, 2:6] = 0

# A kernel that detects vertical edges
K = torch.tensor([[1.0, -1.0]])

Y = corr2d(X, K)

print("Input Image X:\n", X)
print("\nKernel K:\n", K)
print("\nOutput Y (Edges Detected):\n", Y)
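
PyTorch's built-in `F.conv2d` computes the same cross-correlation as `corr2d`. The sketch below rebuilds the edge-detection input so it runs on its own, checks the built-in result, and shows that a true convolution (kernel flipped) only swaps the sign of the detected edges in this example.

```python
import torch
import torch.nn.functional as F

# Same edge-detection setup as in the cell above.
X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

# F.conv2d expects (batch, channels, height, width) inputs and
# (out_channels, in_channels, kernel_h, kernel_w) weights.
Y_builtin = F.conv2d(X.reshape(1, 1, 6, 8), K.reshape(1, 1, 1, 2)).reshape(6, 7)
print(Y_builtin[0])  # 1 where values drop from 1 to 0, -1 where they rise from 0 to 1

# A true convolution flips the kernel before sliding it over the input.
K_flipped = torch.flip(K, dims=[0, 1])
Y_true_conv = F.conv2d(X.reshape(1, 1, 6, 8), K_flipped.reshape(1, 1, 1, 2)).reshape(6, 7)
print(Y_true_conv[0])  # same edge locations, opposite signs
```

Because the kernel weights are learned, the network can absorb the flip, which is why deep learning libraries implement cross-correlation and still call it convolution.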

### Padding and Stride

*   **Padding**: Adding extra pixels (usually zeros) around the border of the image to control the output size.
*   **Stride**: The number of pixels the kernel moves at each step. Larger stride reduces output size.

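**Output Size Formula** (input $n_h \times n_w$, kernel $k_h \times k_w$, strides $s_h, s_w$; here $p_h$ and $p_w$ are the *total* rows and columns of padding, i.e. twice the per-side `padding` argument of `nn.Conv2d`):
$$\lfloor (n_h - k_h + p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + p_w + s_w)/s_w \rfloor$$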
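
As a quick check of the formula, here is a minimal sketch that evaluates it along one dimension. It assumes $p_h$ and $p_w$ denote the total padding (both sides combined), so a per-side padding of 1 contributes 2. The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, k, padding_per_side=0, stride=1):
    """Output length along one dimension of a convolution or pooling layer."""
    p = 2 * padding_per_side  # the formula uses the total padding on both sides
    return (n - k + p + stride) // stride

# 8x8 input, 3x3 kernel, padding 1 per side, stride 2
print(conv_output_size(8, 3, padding_per_side=1, stride=2))  # 4

# The 6x8 edge-detection input with a 1x2 kernel, no padding, stride 1
print(conv_output_size(6, 1), conv_output_size(8, 2))  # 6 7
```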

In [None]:
# nn.Conv2d takes per-side padding and stride as arguments (8x8 input -> 4x4 output here)
conv_layer = nn.Conv2d(1, 1, kernel_size=3, padding=1, stride=2)
input_data = torch.rand(1, 1, 8, 8)
output = conv_layer(input_data)

print(f"Input shape: {input_data.shape}")
print(f"Output shape (p=1, s=2): {output.shape}")

## 3. LeNet-5 (1998)

LeNet-5, proposed by Yann LeCun and colleagues, was one of the first successful CNNs, designed for handwritten digit recognition (MNIST). The version implemented here is a simplified variant adapted to 28x28 inputs.

**Architecture:**
1.  **Conv1**: 6 filters, 5x5 kernel, padding 2, Sigmoid activation
2.  **AvgPool1**: 2x2, stride 2
3.  **Conv2**: 16 filters, 5x5 kernel, Sigmoid activation
4.  **AvgPool2**: 2x2, stride 2
5.  **Flatten**
6.  **FC1**: 120 units, Sigmoid activation
7.  **FC2**: 84 units, Sigmoid activation
8.  **Output**: 10 units (one per class)

In [None]:
class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
            nn.Linear(120, 84), nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, x):
        return self.net(x)
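
To see where the `16 * 5 * 5` input size of the first linear layer comes from, the sketch below pushes a dummy 28x28 image through the same layer stack and prints the shape after each layer. The stack is rebuilt here as a plain `nn.Sequential` so the snippet runs on its own.

```python
import torch
from torch import nn

# Same layer stack as the LeNet class above, rebuilt so the snippet is self-contained.
layers = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

x = torch.rand(1, 1, 28, 28)  # one dummy 28x28 grayscale image
shapes = []
for layer in layers:
    x = layer(x)
    shapes.append((layer.__class__.__name__, tuple(x.shape)))
    print(f"{layer.__class__.__name__:<10} output shape: {tuple(x.shape)}")
```

The spatial size goes 28 -> 28 (padded convolution) -> 14 -> 10 -> 5, so flattening 16 channels of 5x5 maps yields 400 features.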

## 4. AlexNet (2012)

AlexNet sparked the deep learning revolution by winning the 2012 ImageNet challenge (ILSVRC). It is much deeper and wider than LeNet.

**Key Improvements:**
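*   **ReLU Activation**: Mitigates the vanishing gradient problem that saturating activations such as sigmoid suffer from.
*   **Dropout**: Reduces overfitting in the large fully connected layers.
*   **MaxPooling**: Keeps the strongest activation in each window instead of averaging, which preserves sharper features.
*   **Data Augmentation**: Artificially expands the training data (e.g., random crops and horizontal flips).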
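
The ReLU point can be illustrated with a small calculation. This is a simplified sketch that looks only at the activation-derivative factors in the chain rule and ignores the weights. The depth and pre-activation value are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # at most 0.25, reached at z = 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 5  # hypothetical number of stacked activations
z = 2.0    # hypothetical pre-activation value at every layer

# Backpropagation multiplies one activation derivative per layer.
sigmoid_factor = sigmoid_grad(z) ** depth
relu_factor = relu_grad(z) ** depth

print(f"Sigmoid factor after {depth} layers: {sigmoid_factor:.2e}")
print(f"ReLU factor after {depth} layers:    {relu_factor:.2e}")
```

With sigmoid the factor shrinks geometrically with depth, while an active ReLU unit passes the gradient through unchanged.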

In [None]:
class AlexNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Use a larger 11x11 kernel for larger images (224x224).
            # in_channels=1 because Fashion-MNIST is grayscale (the original AlexNet took RGB input).
            nn.Conv2d(1, 96, kernel_size=11, stride=4, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Flatten(),
            nn.Linear(6400, 4096), nn.ReLU(),  # 6400 = 256 channels x 5 x 5 for 224x224 inputs
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 10)
        )

    def forward(self, x):
        return self.net(x)
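
The input size of 6400 for the first linear layer follows from the spatial size left after the last pooling layer. Here is a minimal sketch of that arithmetic for a 224x224 input. The helper `out_size` is hypothetical and uses per-side padding, as PyTorch does.

```python
def out_size(n, k, padding=0, stride=1):
    """Spatial output size of a conv or pooling layer (padding is per side)."""
    return (n + 2 * padding - k) // stride + 1

size = 224
size = out_size(size, 11, padding=1, stride=4)  # Conv1   -> 54
size = out_size(size, 3, stride=2)              # MaxPool -> 26
size = out_size(size, 5, padding=2)             # Conv2   -> 26
size = out_size(size, 3, stride=2)              # MaxPool -> 12
size = out_size(size, 3, padding=1)             # Conv3   -> 12
size = out_size(size, 3, padding=1)             # Conv4   -> 12
size = out_size(size, 3, padding=1)             # Conv5   -> 12
size = out_size(size, 3, stride=2)              # MaxPool -> 5

flattened = 256 * size * size  # 256 channels after Conv5
print(size, flattened)  # 5 6400
```

If the input resolution changes, this number changes too, so the first `nn.Linear` layer would need to be resized accordingly.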

## 5. Training and Comparison on Fashion-MNIST

We will now train three models (**MLP**, **LeNet**, and **AlexNet**) on the Fashion-MNIST dataset.

In [None]:
import time
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Data Preparation
def get_dataloader(resize=None, batch_size=256):
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    
    train_ds = torchvision.datasets.FashionMNIST(root="./data", train=True, transform=trans, download=True)
    test_ds = torchvision.datasets.FashionMNIST(root="./data", train=False, transform=trans, download=True)
    
    return (DataLoader(train_ds, batch_size, shuffle=True), 
            DataLoader(test_ds, batch_size, shuffle=False))

# Training Function
def train_model(net, train_iter, test_iter, num_epochs=5, lr=0.1):
    net = net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    
    print(f"Training {net.__class__.__name__}...")
    start_time = time.time()
    
    for epoch in range(num_epochs):
        net.train()
        train_l_sum, train_acc_sum, n = 0.0, 0.0, 0
        
        for X, y in train_iter:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = net(X)
            l = loss_fn(y_hat, y)
            l.backward()
            optimizer.step()
            
            train_l_sum += l.item() * y.shape[0]  # loss_fn returns the batch mean; weight it by batch size
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().item()
            n += y.shape[0]
            
        # Evaluate on test set
        net.eval()
        test_acc_sum, n_test = 0.0, 0
        with torch.no_grad():
            for X, y in test_iter:
                X, y = X.to(device), y.to(device)
                test_acc_sum += (net(X).argmax(dim=1) == y).sum().item()
                n_test += y.shape[0]
                
        print(f'Epoch {epoch + 1}, Loss: {train_l_sum/n:.4f}, Train Acc: {train_acc_sum/n:.3f}, Test Acc: {test_acc_sum/n_test:.3f}')
        
    total_time = time.time() - start_time
    print(f"Total Training Time: {total_time:.1f} sec\n")
    return test_acc_sum/n_test, total_time

In [None]:
# 1. Train MLP
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x)

train_iter, test_iter = get_dataloader(batch_size=256)
mlp_acc, mlp_time = train_model(MLP(), train_iter, test_iter, num_epochs=5)

In [None]:
# 2. Train LeNet
train_iter, test_iter = get_dataloader(batch_size=256)
# Sigmoid activations produce small gradients, so LeNet is trained with a higher learning rate here
lenet_acc, lenet_time = train_model(LeNet(), train_iter, test_iter, num_epochs=5, lr=0.5)

In [None]:
# 3. Train AlexNet (Resizing to 224x224)
# Note: AlexNet is computationally expensive. We use a smaller batch size and fewer epochs for demonstration.
train_iter_alex, test_iter_alex = get_dataloader(resize=224, batch_size=128)
alex_acc, alex_time = train_model(AlexNet(), train_iter_alex, test_iter_alex, num_epochs=3, lr=0.01)

## 6. Visualizing Feature Maps

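Let's look at what LeNet's first convolutional layer produces for a sample image. Each of the six output channels is one feature map. The maps reflect learned features only if the network passed in has been trained.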

In [None]:
def visualize_feature_maps(net, image):
    net.eval()
    with torch.no_grad():
        # Get Conv1 output
        # LeNet structure: Conv1 is net.net[0]
        conv1_out = net.net[0](image.unsqueeze(0).to(device))
        
        fig, axes = plt.subplots(1, 6, figsize=(12, 3))
        for i in range(6):
            axes[i].imshow(conv1_out[0, i].cpu(), cmap='gray')
            axes[i].axis('off')
        plt.suptitle("Feature Maps from Conv1")
        plt.show()

# Get a sample image from the first test batch
_, test_loader = get_dataloader()
images, labels = next(iter(test_loader))
sample_img = images[0]

print("Original Image:")
plt.imshow(sample_img[0], cmap='gray')
plt.axis('off')
plt.show()

# Visualize
# Note: this creates a fresh, untrained LeNet, so the maps come from randomly
# initialized filters. To inspect learned features, pass in a trained model instead.
net = LeNet().to(device)
visualize_feature_maps(net, sample_img)
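
Calling `net.net[0]` directly only works for the first layer. For deeper layers, one option is a forward hook, which records a layer's output during a normal forward pass. The sketch below uses a stand-in for the convolutional part of LeNet and a random dummy image so that it runs on its own. The names `features`, `captured`, and `save_output` are hypothetical.

```python
import torch
from torch import nn

# Stand-in for the convolutional part of LeNet so the snippet is self-contained.
features = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
)

captured = {}

def save_output(name):
    """Return a hook that stores the layer's output under the given name."""
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Index 3 is the second convolution in this stack.
handle = features[3].register_forward_hook(save_output("conv2"))

with torch.no_grad():
    features(torch.rand(1, 1, 28, 28))  # dummy image; use a real sample in practice
handle.remove()  # detach the hook once the activations are captured

print(captured["conv2"].shape)  # torch.Size([1, 16, 10, 10])
```

Each of the 16 channels in `captured["conv2"]` can then be plotted with `imshow` in the same way as the Conv1 maps.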

## 7. Conclusion & Comparison

| Model | Test Accuracy | Training Time | Parameters (as defined above) |
|-------|---------------|---------------|-------------------------------|
| MLP | High | Fast | High (~204K) |
| LeNet | Moderate | Moderate | Low (~62K) |
| AlexNet | High | Slow | Very High (~46.8M) |
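
The parameter counts can be checked directly by summing the sizes of all parameter tensors. The sketch below does this for the MLP and LeNet layer stacks, rebuilt as plain `nn.Sequential` models so the snippet runs on its own. AlexNet is left out only to keep the example light. The helper name `count_parameters` is hypothetical.

```python
from torch import nn

def count_parameters(model):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Sigmoid(),
    nn.Linear(120, 84), nn.Sigmoid(),
    nn.Linear(84, 10),
)

print(f"MLP:   {count_parameters(mlp):,}")    # 203,530
print(f"LeNet: {count_parameters(lenet):,}")  # 61,706
```

Even this small MLP has more than three times as many parameters as LeNet, and nearly all of them sit in its first linear layer.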

**Key Takeaways:**
1.  **MLPs** train quickly but use parameters inefficiently and ignore spatial structure.
2.  **LeNet** introduces convolution, drastically reducing parameters while maintaining reasonable accuracy.
3.  **AlexNet** scales up CNNs with depth and modern techniques (ReLU, Dropout). It achieved state-of-the-art ImageNet results in 2012 but requires significant compute.