# Convolutional Neural Networks (CNNs)
## What is a CNN?
A convolutional neural network (CNN) replaces most of the fully connected layers of a multilayer perceptron (MLP) with layers designed for grid-like data such as images, which makes the network far more parameter-efficient.

You can think of a CNN as an MLP that:
1. Doesn't flatten the input before the first layers, preserving its spatial structure.
2. Replaces fully connected layers with convolutional layers, which connect each output only to a small local region of the input and reuse the same weights (the kernel) at every position in the image.
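
To make the effect of weight sharing concrete, the sketch below compares parameter counts for a fully connected layer and a convolutional layer that produce an output of the same size. The input size (one 32x32 RGB image) and the layer size (six 5x5 filters) are assumptions chosen for illustration.

```python
# Compare parameter counts: fully connected vs. convolutional (illustrative sizes).
in_channels, height, width = 3, 32, 32   # assumed input: one 32x32 RGB image
out_channels, kernel = 6, 5              # assumed layer: six 5x5 filters

# A 5x5 kernel with no padding and stride 1 yields a 28x28 output per filter.
out_h, out_w = height - kernel + 1, width - kernel + 1

# Fully connected: every output unit has its own weight for every input value.
n_inputs = in_channels * height * width
n_outputs = out_channels * out_h * out_w
dense_params = n_inputs * n_outputs + n_outputs   # weights + biases

# Convolutional: one small kernel per (output channel, input channel) pair,
# shared across all spatial positions, plus one bias per output channel.
conv_params = out_channels * in_channels * kernel * kernel + out_channels

print(f"fully connected: {dense_params:,} parameters")  # 14,455,392
print(f"convolutional:   {conv_params:,} parameters")   # 456
```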

## What is a Kernel?
A kernel (also called a filter) in a CNN is a small matrix of learnable weights that acts as a specialized feature detector.
A convolutional layer slides the kernel over every position of the input image and measures how strongly each local patch matches the kernel's pattern. The result is a new grid of values, called a feature map, that highlights where the specific feature was found.
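
The sliding operation can be sketched in plain Python. This is a minimal illustration, not how PyTorch implements it: the helper name `correlate2d`, the toy image, and the hand-picked edge-detecting kernel are all assumptions. In a real CNN the kernel values are learned. Note that the "convolution" used in deep learning libraries is technically a cross-correlation, meaning the kernel is not flipped.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel element-wise with the patch under it and sum.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy 4x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = correlate2d(image, kernel)
for row in feature_map:
    print(row)  # each row prints [0, 2, 0, 0]
```

The only non-zero column in the feature map is the one where the kernel straddles the edge between the dark and bright regions.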

## What is a Pooling Layer?
A pooling layer in a CNN performs downsampling, systematically reducing the spatial size of a feature map. Its main job is to summarize the features present in each region of the feature map; max pooling, for example, keeps only the largest value in each window.
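
As a rough illustration, max pooling with a 2x2 window and a stride of 2 can be written in a few lines of plain Python. The helper name `max_pool2d` and the example values are made up for this sketch.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each `size` x `size` window of a 2D feature map."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Made-up 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 7]]
```

The 4x4 input becomes a 2x2 output: each spatial dimension is halved while the strongest response in every region is kept.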


# CNN Architecture
The full network can be divided into two main parts:
## Feature Extraction Base
This part of the network holds the convolutional and pooling layers. Its job is to automatically learn and extract meaningful features from the input images. Each convolutional layer is followed by an activation function (e.g., ReLU) before pooling. The activation introduces non-linearity, allowing the model to learn more complex patterns.

## Classifier Head
This part of the network takes the high-level features learned by the feature extraction base and uses them to make a final prediction. It contains a flatten layer, which converts each stack of 2D feature maps into a single 1D vector, followed by fully connected layers and an output layer.
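
The flatten step can be sketched with `torch.flatten`. The batch size and feature-map shape below are assumptions chosen for illustration.

```python
import torch

# Hypothetical batch of 8 samples, each a stack of 16 feature maps of size 5x5.
features = torch.zeros(8, 16, 5, 5)

# Keep the batch dimension (dim 0) and collapse channels, height, and width.
flat = torch.flatten(features, start_dim=1)
print(flat.shape)  # torch.Size([8, 400])
```

Each sample becomes a vector of 16 * 5 * 5 = 400 values, which is the input size the first fully connected layer must expect.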

# Implementation of a CNN
In this example, we will use the CIFAR-10 dataset, which contains 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. Information on the dataset can be found [here](https://www.cs.toronto.edu/~kriz/cifar.html).

NOTE: In real implementations, you will want to hold out a validation set to check your model's performance during training, so that the test set is used only for the final evaluation.
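
One possible way to carve out a validation set is `torch.utils.data.random_split`. The sketch below uses a small synthetic dataset as a stand-in so that it runs without downloading anything. The 90/10 ratio, the seed, and the `_demo` variable names are assumptions, not part of this tutorial's pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Synthetic stand-in for a training set: 100 fake 32x32 RGB images (not real data).
images = torch.zeros(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
full_train = TensorDataset(images, labels)

# Hold out 10% for validation; the ratio is an assumed choice.
val_size = len(full_train) // 10
train_size = len(full_train) - val_size
generator = torch.Generator().manual_seed(0)  # fixed seed for a reproducible split
train_subset, val_subset = random_split(
    full_train, [train_size, val_size], generator=generator
)

train_loader_demo = DataLoader(train_subset, batch_size=64, shuffle=True)
val_loader_demo = DataLoader(val_subset, batch_size=64, shuffle=False)
print(len(train_subset), len(val_subset))  # 90 10
```

After each epoch, the validation loader can be evaluated the same way as a test loader, which makes overfitting visible while training is still running.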

## Data Preprocessing
Just as in the SimpleNN, we will use `torchvision` to convert these color images into tensors. Since each image has three color channels, we also normalize every channel using a mean of 0.5 and a standard deviation of 0.5. Because `ToTensor` scales pixel values to the range [0, 1], this maps them to [-1, 1]. [docs](https://docs.pytorch.org/vision/main/generated/torchvision.transforms.Normalize.html)
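
The arithmetic behind this normalization is simple enough to check by hand. The helper below is a hypothetical plain-Python version of what `Normalize` computes for a single value in one channel.

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel normalization: subtract the mean, then divide by the std."""
    return (value - mean) / std

# Pixel values after ToTensor lie in [0, 1].
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
# 0.0 -> -1.0
# 0.5 -> 0.0
# 1.0 -> 1.0
```

Note that 0.5 is a convenient constant here, not the measured mean or standard deviation of the dataset, so the result is a rescaling to [-1, 1] rather than a true zero-mean, unit-variance standardization.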

In [6]:
import torch

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [7]:
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Convert images to tensors, then rescale each channel to [-1, 1], which helps with training
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Download and load the training data
train_data = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Download and load the test data
test_data = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)

Now, we can build our CNN model. We will use `nn.Conv2d` for the convolutional layers and `nn.MaxPool2d` for the pooling layers.
1. `nn.Conv2d`: This takes three main arguments: `in_channels` (the number of channels in the layer's input, which is 3 for the first layer because our images are RGB), `out_channels` (the number of filters we want to apply), and `kernel_size` (the size of each filter, e.g., 5x5). [docs](https://docs.pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)

2. `nn.MaxPool2d`: This is our pooling layer. The key argument is `kernel_size`, which defines the window size for pooling. A 2x2 kernel with a `stride` of 2 is common, as it halves the dimensions of the feature map. [docs](https://docs.pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html)
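
These arguments determine the spatial size of each layer's output, which in turn fixes the input size of the first fully connected layer. With dilation left at 1, both layer types follow the formula `floor((size + 2 * padding - kernel_size) / stride) + 1`. The sketch below traces a 32x32 input through two 5x5 convolutions, each followed by 2x2 pooling, and assumes 16 filters in the last convolution. The helper name `conv_output_size` is made up for this example.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output height/width of a conv or pooling layer (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

trace = [32]                                                     # input height/width
trace.append(conv_output_size(trace[-1], kernel_size=5))            # 5x5 conv -> 28
trace.append(conv_output_size(trace[-1], kernel_size=2, stride=2))  # 2x2 pool -> 14
trace.append(conv_output_size(trace[-1], kernel_size=5))            # 5x5 conv -> 10
trace.append(conv_output_size(trace[-1], kernel_size=2, stride=2))  # 2x2 pool -> 5

out_channels = 16  # assumed number of filters in the last convolution
flattened = out_channels * trace[-1] * trace[-1]
print(trace)      # [32, 28, 14, 10, 5]
print(flattened)  # 400
```

With a 3x3 kernel and `padding=1`, the same formula gives an output the same size as the input, so only the pooling layers shrink the feature maps.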

In [8]:
import torch.nn as nn
import torch.nn.functional as F # Can be used to apply activation functions

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 input channels (for RGB), 6 output channels, 5x5 kernel
        self.conv1 = nn.Conv2d(3, 6, 5)

        # 2x2 max pooling
        self.pool = nn.MaxPool2d(2, 2)

        # 6 input channels, 16 output channels, 5x5 kernel
        self.conv2 = nn.Conv2d(6, 16, 5)

        # Fully connected layers for classification
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # First Convolutional and Pooling layer
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool(x)

        # Second Convolutional and Pooling layer
        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool(x)

        # Flatten the feature maps for the fully connected layers
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

## Training Loop

In [9]:
import torch.optim as optim

model = SimpleCNN()
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train for a fixed number of epochs
num_epochs = 10
for epoch in range(1, num_epochs + 1):
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        # Move data to the GPU
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad() # Zero the parameter gradients

        outputs = model(images) # Forward pass

        loss_value = criterion(outputs, labels) # Compute loss

        loss_value.backward() # Backward pass

        optimizer.step() # Update weights

        running_loss += loss_value.item()

    print(f'Epoch [{epoch}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}')


Epoch [1/10], Loss: 1.6373
Epoch [2/10], Loss: 1.3495
Epoch [3/10], Loss: 1.2232
Epoch [4/10], Loss: 1.1402
Epoch [5/10], Loss: 1.0829
Epoch [6/10], Loss: 1.0246
Epoch [7/10], Loss: 0.9847
Epoch [8/10], Loss: 0.9444
Epoch [9/10], Loss: 0.9085
Epoch [10/10], Loss: 0.8776


## Evaluation

In [10]:
model.eval() # Set the model to evaluation mode

# Count correct predictions over the whole test set
correct = 0
total = 0
with torch.no_grad(): # Disable gradient calculation
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        predicted = outputs.argmax(dim=1) # Index of the highest logit is the predicted class
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

accuracy = 100 * correct / total
print(f'Accuracy of the network on the 10,000 test images: {accuracy:.2f}%')

Accuracy of the network on the 10,000 test images: 63.35%


An accuracy of 63.35% leaves a lot of room for improvement. Let's assume that the model is underfitting because it is too simple, and try a VGG-style network instead. The core idea behind a VGG-style network is to use a deeper stack of small 3x3 convolutional kernels instead of fewer, larger kernels. The network gets its power mainly from its depth.
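
A quick back-of-the-envelope calculation shows why small stacked kernels are attractive: two stacked 3x3 convolutions see the same 5x5 region of the input as one 5x5 convolution, but use fewer weights and apply an extra non-linearity in between. The helper names and the channel count of 64 are assumptions for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` input and output channels (no biases)."""
    return kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # assumed channel count

two_3x3 = 2 * conv_weights(3, channels)
one_5x5 = conv_weights(5, channels)

print(stacked_receptive_field(3, num_layers=2))  # 5
print(f"two 3x3 layers: {two_3x3:,} weights")    # 73,728
print(f"one 5x5 layer:  {one_5x5:,} weights")    # 102,400
```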

### Dropout
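In this new network, we will apply dropout between the fully connected layers of the classifier head to reduce overfitting. During training, dropout randomly zeroes each activation with probability `p`; in evaluation mode it is disabled. [docs](https://docs.pytorch.org/docs/stable/generated/torch.nn.Dropout.html)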
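
The difference between training and evaluation mode is easy to see on a tensor of ones. This is a small standalone sketch; the seed and tensor shape are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed so repeated runs drop the same elements
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

# Training mode: each element is zeroed with probability p, and the
# survivors are scaled by 1 / (1 - p) = 2 to keep the expected value the same.
dropout.train()
train_out = dropout(x)
print(train_out)

# Evaluation mode: dropout is a no-op and the input passes through unchanged.
dropout.eval()
eval_out = dropout(x)
print(eval_out)
```

This is why calling `model.eval()` before measuring accuracy matters for any network that contains dropout layers.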

In [11]:
class VGGishCNN(nn.Module):
    def __init__(self):
        super().__init__()

        # VGG-style feature extraction layers
        self.features = nn.Sequential(
            # Block 1: 3 input channels (RGB) -> 64 output channels
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 2: 64 input channels -> 128 output channels
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 3: 128 input channels -> 256 output channels
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Classifier head
        self.classifier = nn.Sequential(
            nn.Linear(256 * 4 * 4, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5), # p is a hyperparameter
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

In [12]:
## Training Loop
VGG_model = VGGishCNN()
VGG_model.to(device)
optimizer = optim.Adam(VGG_model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

num_epochs = 10
for epoch in range(1, num_epochs + 1):
    running_loss = 0.0
    for i, (images, labels) in enumerate(train_loader):
        # Move the data to the GPU
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad() # Zero the parameter gradients

        outputs = VGG_model(images) # Forward pass

        loss_value = criterion(outputs, labels) # Compute loss

        loss_value.backward() # Backward pass

        optimizer.step() # Update weights

        running_loss += loss_value.item()

    print(f'Epoch [{epoch}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}')

Epoch [1/10], Loss: 1.7319
Epoch [2/10], Loss: 1.2856
Epoch [3/10], Loss: 1.0517
Epoch [4/10], Loss: 0.9135
Epoch [5/10], Loss: 0.8232
Epoch [6/10], Loss: 0.7474
Epoch [7/10], Loss: 0.6840
Epoch [8/10], Loss: 0.6324
Epoch [9/10], Loss: 0.6030
Epoch [10/10], Loss: 0.5668


In [13]:
## Evaluation
VGG_model.eval() # Set the model to evaluation mode

correct = 0
total = 0
with torch.no_grad(): # Disable gradient calculation
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = VGG_model(images)
        predicted = outputs.argmax(dim=1) # Index of the highest logit is the predicted class
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

accuracy = 100 * correct / total
print(f'Accuracy of the network on the test images: {accuracy:.2f}%')

Accuracy of the network on the test images: 74.71%


In [15]:
train_correct = 0
train_total = 0
with torch.no_grad(): # Disable gradient calculation
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = VGG_model(images)
        predicted = outputs.argmax(dim=1) # Index of the highest logit is the predicted class
        train_correct += (predicted == labels).sum().item()
        train_total += labels.size(0)

accuracy = 100 * train_correct / train_total
print(f'Accuracy of the network on the train images: {accuracy:.2f}%')

Accuracy of the network on the train images: 86.02%


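This shows us that the model is overfitting: its training accuracy (86.02%) is well above its test accuracy (74.71%). To narrow the gap, we could apply L2 regularization by adding weight decay to the optimizer. We could also add batch normalization after the convolutional layers.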
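
Both suggestions can be sketched briefly. The block below is a hypothetical single convolutional block with `nn.BatchNorm2d` placed between the convolution and the activation, and an Adam optimizer with `weight_decay` set. The value `5e-4` is an assumed starting point that would need tuning, and none of this has been run against CIFAR-10 here.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Hypothetical conv block with batch normalization before the activation.
block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),                     # normalizes each of the 64 channels
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# weight_decay adds an L2 penalty on the parameters; 5e-4 is an assumed value.
optimizer_demo = optim.Adam(block.parameters(), lr=0.001, weight_decay=5e-4)

# Sanity check on a dummy batch of four 32x32 RGB images.
out = block(torch.zeros(4, 3, 32, 32))
print(out.shape)  # torch.Size([4, 64, 16, 16])
```

Like dropout, batch normalization behaves differently in training and evaluation mode, so `model.train()` and `model.eval()` need to be called at the right points.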