# Deep Learning

# Tutorial 14: AlexNet and VGG architectures

In this tutorial, we will cover:

- Architectures for deep neural networks: AlexNet (2012) and VGG (2014)

Prerequisites:

- Python, Tensor basics, PyTorch

My contact:

- Niklas Beuter (niklas.beuter@th-luebeck.de)

Course:

- Slides and notebooks will be available at https://lernraum.th-luebeck.de/course/view.php?id=5383

## Expected Outcomes
* Understand the basic components of neural networks: layers, neurons, weights, biases, activations, and loss functions.
* Gain hands-on experience with the computational aspects of setting up neural networks, including training and usage.
* Learn how to add layers with correct sizes to a deep neural network.

# AlexNet

[AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) is a pioneering convolutional neural network (CNN) that significantly influenced the field of deep learning, especially in computer vision. Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, AlexNet was the winning entry in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.

## Key Features of AlexNet

- **Architecture**:
  - AlexNet contains eight learned layers: the first five are convolutional, and the last three are fully connected. It uses ReLU (Rectified Linear Unit) as its non-linearity instead of a sigmoid or tanh function, which helps to alleviate the vanishing gradient problem.
  - The network uses overlapping max pooling (3x3 windows with stride 2), which downsamples the feature maps and, according to the paper, gives slightly lower error rates than the traditional non-overlapping scheme.
  - AlexNet uses an early form of normalization, Local Response Normalization (LRN). A description and a comparison to BatchNorm follow further down.

- **Training**:
  - The network was trained on two GTX 580 GPUs for about five to six days.
  - It uses data augmentation techniques such as random crops, flips, and color alterations to expand the training dataset and improve generalization.
  - Dropout layers are applied before the first and second fully connected layers to prevent overfitting.

- **Impact**:
  - AlexNet won the ILSVRC-2012 competition by a wide margin: the paper reports a top-5 test error of 15.3%, compared to 26.2% for the second-best entry.
  - Its success demonstrated the potential of deep neural networks, particularly CNNs, leading to widespread adoption in the field of image recognition and beyond.

## Local Response Normalization (LRN) vs Batch Normalization (BatchNorm)

### Local Response Normalization (LRN)

**Concept**: LRN is inspired by a biological process known as lateral inhibition, a neurobiology concept where activated neurons inhibit the activity of neighboring neurons in the same layer, creating competitive dynamics among the neurons of a layer.

**How LRN Works**:
- LRN applies a normalization over local input regions in convolutional neural networks. For a given pixel in a feature map, normalization is applied across the channels.
- The formula for LRN is:

  $$b_{x,y}^i = \frac{a_{x,y}^i}{\left( k + \alpha \sum_{j=\max(0, i-n/2)}^{\min(N-1, i+n/2)} (a_{x,y}^j)^2 \right)^\beta}$$
  
  Where:
  - $b_{x,y}^i$ is the normalized output.
  - $a_{x,y}^i$ is the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity.
  - $n$ is the size of the local neighborhood to normalize over (across channels).
  - $k, \alpha, \beta$ are hyperparameters.
  - $N$ is the total number of channels in the layer.
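
The formula above can be tried out on a handful of numbers. The following is a minimal pure-Python sketch, not the implementation used in AlexNet or PyTorch: it normalizes the channel activations at a single pixel position. The function name `local_response_norm` and the sample activations are made up for illustration, $n/2$ is assumed to mean integer division, and the default hyperparameters ($k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$) are the values reported in the AlexNet paper.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Apply the LRN formula to the channel activations at one pixel (x, y).

    a: list with one activation per channel, a[0] ... a[N-1].
    """
    N = len(a)
    half = n // 2  # assumption: n/2 in the formula is integer division
    b = []
    for i in range(N):
        lo = max(0, i - half)          # first neighbouring channel
        hi = min(N - 1, i + half)      # last neighbouring channel
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        b.append(a[i] / (k + alpha * sq_sum) ** beta)
    return b


# Hypothetical activations of six channels at the same pixel position
activations = [1.0, 2.0, 50.0, 2.0, 1.0, 0.5]
normalized = local_response_norm(activations)
for a_i, b_i in zip(activations, normalized):
    print(f"{a_i:6.2f} -> {b_i:.4f}")
```

Channels that sit next to a strongly activated channel are divided by a larger denominator than channels whose neighbours are all weak, which is the "competition" effect described above.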

### Batch Normalization (BatchNorm)

**Concept**: Batch Normalization normalizes the activations of a neuron across the mini-batch, addressing internal covariate shift by standardizing the inputs to a layer for each mini-batch.

**How BatchNorm Works**:
- BatchNorm normalizes the input for each neuron to zero mean and unit variance, then scales and shifts it using two learnable parameters per neuron. The formula is:
  
  $$\hat{x}^{(k)} = \frac{x^{(k)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

  Followed by:

  $$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$$

  Where:
  - $x^{(k)}$ is the $k$-th input dimension (feature or channel) of the layer.
  - $\mu_B$ and $\sigma_B^2$ are the mean and variance of $x^{(k)}$ computed over the mini-batch.
  - $\gamma$ and $\beta$ are learnable parameters.
  - $\epsilon$ is a small constant added for numerical stability.
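
As a quick check of the two formulas, the sketch below computes them by hand for a small random mini-batch and compares the result with PyTorch's `nn.BatchNorm1d` in training mode. The batch size, feature count, and input scaling are arbitrary choices for illustration; $\gamma$ and $\beta$ are set to their initial values of 1 and 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4) * 3.0 + 5.0  # mini-batch: 8 samples, 4 features

eps = 1e-5
mu_B = x.mean(dim=0)                   # per-feature mean over the batch
var_B = x.var(dim=0, unbiased=False)   # per-feature (biased) variance over the batch
x_hat = (x - mu_B) / torch.sqrt(var_B + eps)

gamma = torch.ones(4)   # learnable scale, here at its initial value
beta = torch.zeros(4)   # learnable shift, here at its initial value
y_manual = gamma * x_hat + beta

bn = nn.BatchNorm1d(4, eps=eps)
bn.train()                     # training mode: normalize with batch statistics
y_layer = bn(x).detach()

print("matches nn.BatchNorm1d:", torch.allclose(y_manual, y_layer, atol=1e-5))
print("per-feature mean:", y_manual.mean(dim=0))
print("per-feature variance:", y_manual.var(dim=0, unbiased=False))
```

After normalization each feature has approximately zero mean and unit variance across the batch, regardless of the scale of the input.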

## Key Differences
- **Scope**: LRN works across channels enhancing contrast locally, whereas BatchNorm normalizes activations across the batch for each neuron.
- **Impact**: BatchNorm reduces the amount by which the hidden unit values shift around (covariate shift), while LRN's impact is more on local neuron competition.
- **Performance and Usage**: BatchNorm tends to outperform LRN and has largely replaced LRN in modern network architectures due to its effectiveness in accelerating training and reducing the sensitivity to network initialization.


## Challenges and Limitations

- Requires significant computational resources, notably GPUs, for training and inference.
- Deep architecture necessitates a large number of parameters, making it prone to overfitting on smaller datasets without proper regularization.

## Legacy

AlexNet has been a foundational model that spurred further research and developments in deep learning. It opened the door to more complex architectures like VGG, ResNet, and many others that have since driven progress in computer vision tasks.


## Step 1: Import necessary dependencies

In [None]:
!pip install torch torchvision matplotlib

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

## Step 2: Define the network

Recall the formula for the output size of a convolution or pooling layer:

$$ W_{new} = \left\lfloor \frac{W_{old} - K + 2P}{S} \right\rfloor + 1 $$

with kernel size (filter size) $K$, padding $P$ (border added on each side of the image), and stride $S$ (step by which the kernel is shifted across the image).
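
The formula can be applied layer by layer to track the feature map size through the network. The sketch below does this for a 227x227 input. The helper name `conv_output_size` is made up for this example, and the kernel, padding, and stride values are an assumption based on the original AlexNet paper and common re-implementations, so adjust them if you choose different layers.

```python
def conv_output_size(w_old, kernel, padding=0, stride=1):
    """Output width of a convolution or pooling layer (floor division)."""
    return (w_old - kernel + 2 * padding) // stride + 1


# (name, kernel size K, padding P, stride S); values are assumptions, see text
layers = [
    ("conv1", 11, 0, 4),
    ("pool1", 3, 0, 2),
    ("conv2", 5, 2, 1),
    ("pool2", 3, 0, 2),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 0, 2),
]

size = 227  # input width and height after resizing
for name, k, p, s in layers:
    size = conv_output_size(size, k, p, s)
    print(f"{name}: {size}x{size}")
```

The final spatial size, multiplied by the number of output channels of the last convolution, gives the number of input features of the first fully connected layer.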


In [None]:
class AlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super(AlexNet, self).__init__()
        # First convolutional layer
        
        # Second convolutional layer
        
        # Third convolutional layer
        
        # Fourth convolutional layer
        
        # Fifth convolutional layer
        
        # Max pooling layer
        
        # Fully connected layers

        # Dropout layer


    def forward(self, x):
        # First block (conv1, relu, pooling)
        
        # Second block (conv2, relu, pooling)
        
        # Third block (conv3, relu)
        
        # Fourth block (conv4, relu)
        
        # Fifth block (conv5, relu, pooling)
        
        # Flatten the output for the fully connected layers
        x = x.view(x.size(0), -1)  # Flatten to (batch_size, features); -1 infers the feature count
        # Fully connected layers fc1, fc2 with relu, then dropout

        # Output layer (fc3)
        
        return x
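
If you get stuck, the following is one possible way to fill in the skeleton, given as a sketch rather than a reference solution. The class is named `AlexNetSketch` so that it does not overwrite your own `AlexNet`. The channel counts (96, 256, 384, 384, 256) and kernel sizes follow the original paper; the padding values, the single shared pooling and dropout modules, and the omission of Local Response Normalization are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlexNetSketch(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers (sizes in comments assume a 227x227 input)
        self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)      # 227 -> 55
        self.conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)    # 27 -> 27
        self.conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1)   # 13 -> 13
        self.conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1)   # 13 -> 13
        self.conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1)   # 13 -> 13
        # Overlapping max pooling, reused after conv1, conv2 and conv5
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2)
        # Fully connected layers; 256 channels of size 6x6 after the last pooling
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # 55 -> 27
        x = self.pool(F.relu(self.conv2(x)))   # 27 -> 13
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = self.pool(F.relu(self.conv5(x)))   # 13 -> 6
        x = x.view(x.size(0), -1)              # flatten to (batch, 9216)
        x = self.dropout(F.relu(self.fc1(x)))
        x = self.dropout(F.relu(self.fc2(x)))
        return self.fc3(x)                     # raw class scores (logits)


# Shape check with a dummy batch of two 227x227 RGB images
model_sketch = AlexNetSketch(num_classes=10)
dummy = torch.randn(2, 3, 227, 227)
print(model_sketch(dummy).shape)
```

The shape check should report one row per image and one column per class. The output layer returns logits without a softmax because `nn.CrossEntropyLoss` applies the log-softmax internally.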

## Step 3: Set up data loaders

Here, we use the [Cifar10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) with 10 classes.

In [None]:
# Transformations
transform = transforms.Compose([
    transforms.Resize((227, 227)),  # Resizing the images to fit AlexNet input dimensions
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Load CIFAR-10
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=4)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

### Visualize the dataset

In [None]:
import matplotlib.pyplot as plt

def denormalize(image, means=[0.485, 0.456, 0.406], stds=[0.229, 0.224, 0.225]):
    """ Reverses normalization on an image given the mean and std """
    means = torch.tensor(means).view(3, 1, 1)
    stds = torch.tensor(stds).view(3, 1, 1)
    return image * stds + means

def visualize_one_image_per_class(dataset):
    num_classes = 10  # There are 10 classes in CIFAR-10
    class_images = {}
    labels = range(num_classes)

    # Loop over the dataset to find one image per class
    for image, label in dataset:
        if label in labels and label not in class_images:
            class_images[label] = image
        if len(class_images) == num_classes:  # Stop once we have one image per class
            break

    # Plotting the images
    fig, axes = plt.subplots(1, num_classes, figsize=(15, 15))
    for label, image in class_images.items():
        ax = axes[label]
        # Denormalize and permute image to fit matplotlib expectation
        image = denormalize(image)  # Denormalize the image
        image = image.permute(1, 2, 0)  # Permute the tensor to (H, W, C)
        image = image.clip(0, 1)  # Ensure the image values are between 0 and 1
        ax.imshow(image)
        ax.axis('off')
        ax.set_title(f'Class {label}')

    plt.show()

# Visualize one image per class from the dataset
visualize_one_image_per_class(train_dataset)

## Step 4: Initialize Network and Define Loss Function and Optimizer

In [None]:
# create the model
model = AlexNet()
model.to(device)  # Move the model to the appropriate device

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

In [None]:
# print the created model
print('The model:')
print(model)

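print('\n\nModel params:')
# Print the name and shape of each parameter instead of the full tensors,
# which would produce millions of numbers for a network of this size
for name, param in model.named_parameters():
    print(name, tuple(param.shape))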

## Step 5: Train the Network

In [None]:
# Training loop (simplified)
def train(model, device, train_loader, criterion, optimizer, epochs):
    model.train() # set model in training mode
    for epoch in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

## Step 6: Evaluate the Network

Calling `model.eval()` before evaluation is crucial for accurate test results. It switches off training-only behaviour such as dropout, which randomly zeroes activations, and makes batch normalization layers use their running statistics instead of per-batch statistics. The model then behaves deterministically, which allows a fair comparison between its performance during training and testing.

In [None]:
def evaluate_model(model, test_loader):
    model.eval()  # Set model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f'Accuracy: {accuracy:.2f}%')
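
To see the effect of `model.eval()` in isolation, the small sketch below passes the same input through a single `nn.Dropout` layer in both modes. The input of ones and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1, 10)

dropout.train()           # training mode: randomly zero entries
out_train = dropout(x)    # kept entries are scaled by 1 / (1 - p) = 2

dropout.eval()            # evaluation mode: dropout is a no-op
out_eval = dropout(x)

print("train:", out_train)
print("eval: ", out_eval)
```

In training mode some entries are zeroed and the rest are scaled up, so repeated calls give different outputs. In evaluation mode the input passes through unchanged.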

## Step 7: Run the Training and Evaluation

In [None]:
train(model, device, train_loader, criterion, optimizer, epochs=10)
evaluate_model(model, test_loader)

## Step 8: Predict classes

In [None]:
def predict_and_visualize_single_image(model, image, label):
    # `image` is taken from test_loader, so the dataset transform has already
    # resized it to 227x227 and normalized it. Applying Resize/Normalize a
    # second time would distort the model input, so no transform is applied here.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    image = image.to(device)  # Move image to the device
    model.eval()  # Set the model to evaluation mode

    with torch.no_grad():
        output = model(image.unsqueeze(0))  # Add batch dimension
        _, predicted = torch.max(output, 1)

    # Move the image back to CPU for visualization and undo normalization
    image = image.cpu().permute(1, 2, 0)  # Change from CxHxW to HxWxC for visualization
    image = image * torch.tensor([0.229, 0.224, 0.225]) + torch.tensor([0.485, 0.456, 0.406])  # Undo normalization
    image = image.clip(0, 1)  # Ensure the image values are between 0 and 1

    # Plot the image and its predicted class
    plt.imshow(image)
    plt.title(f'Actual: {label.item()} Predicted: {predicted.item()}')
    plt.axis('off')
    plt.show()

In [None]:
# Fetch one batch from the test_loader
test_images, test_labels = next(iter(test_loader))

print("Number of images in batch: ", test_images.shape[0])

# Select one image and label from the batch
image, label = test_images[50], test_labels[50]

# Predict the class 
predict_and_visualize_single_image(model, image, label)

### Visualize the weights of the first layer as images

To interpret the features of the first layer, we can plot its weights as images. This shows which patterns the filters respond to. Typically, convolutional layers learn features or combinations of features that describe the data.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_weights(model):
    # Assuming the model's first convolutional layer weights are of the shape [out_channels, in_channels, height, width]
    weights = model.conv1.weight.data.cpu().numpy()

    # Determine the number of filters in the first convolutional layer
    num_filters = weights.shape[0]
    
    # Number of columns for subplot
    num_cols = min(10, num_filters)
    num_rows = (num_filters + num_cols - 1) // num_cols  # Ceiling division

    fig, axes = plt.subplots(num_rows, num_cols, figsize=(num_cols * 1.5, num_rows * 1.5))
    for i, ax in enumerate(axes.flatten()):
        if i < num_filters:
            # Assuming the input channel is RGB (3 channels), you can average over the channels, or just take one channel
            img = weights[i][0]  # Taking the first channel (for RGB images)
            # Normalize weight for better visualization
            img = (img - np.min(img)) / (np.max(img) - np.min(img))
            ax.imshow(img, cmap='gray')
            ax.set_title(f'Filter {i+1}')
            ax.axis('off')
        else:
            ax.axis('off')
    plt.tight_layout()
    plt.show()

# plot weights
plot_weights(model)

# VGG

[VGG (Visual Geometry Group)](https://arxiv.org/abs/1409.1556v6) is a convolutional neural network architecture proposed by the Visual Geometry Group at the University of Oxford.

## Key characteristics

1. Depth: It consists of very deep networks (up to 19 layers) with small (3x3) convolution filters.
2. Configurations: VGG networks are defined by their architecture configurations, denoted by VGG followed by a number (e.g., VGG16, VGG19), indicating the number of weight layers.
3. Simple Structure: VGG networks have a simple and uniform architecture, making them easy to understand and implement.
4. Transfer Learning: Pre-trained VGG models on large-scale image datasets (e.g., ImageNet) are commonly used for transfer learning tasks due to their effectiveness in feature extraction.

## Basic building blocks

1. Convolutional Layers: Stacks of convolutional layers with 3x3 filters, each followed by a ReLU activation function.
2. Max Pooling Layers: Spatial pooling layers with 2x2 filters and stride 2, used to reduce spatial dimensions.
3. Fully Connected Layers: One or more fully connected layers with ReLU activation functions, followed by a softmax layer for classification.

The VGG network achieves strong performance on image classification tasks, and its simplicity and effectiveness have made it a popular choice in deep learning research and applications.
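
One reason for preferring small filters can be shown with simple arithmetic: a stack of stride-1 3x3 convolutions covers the same receptive field as one larger filter but needs fewer weights. The sketch below compares three stacked 3x3 layers with a single 7x7 layer. The helper names and the channel count of 256 are arbitrary choices for illustration, and biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels):
    """Weights of one conv layer with `channels` inputs and outputs (no bias)."""
    return kernel * kernel * channels * channels


C = 256  # hypothetical number of input and output channels
stack_params = 3 * conv_weights(3, C)   # three 3x3 layers
single_params = conv_weights(7, C)      # one 7x7 layer

print("receptive field of three 3x3 layers:", stacked_receptive_field(3))
print("weights, three 3x3 layers:", stack_params)
print("weights, one 7x7 layer:   ", single_params)
```

The stack also applies three ReLU non-linearities instead of one, which makes the learned function more expressive for the same receptive field.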

## Step 1: Define the model

In [None]:
class VGG16_ForCIFAR10(nn.Module):
    def __init__(self, num_classes=10):
        super(VGG16_ForCIFAR10, self).__init__()
        self.width = 
        self.height = 
        self.features = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(3, 64, kernel_size=3, padding=1),  # 3 input channels (RGB)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv Block 2
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv Block 3
            nn.Conv2d(128, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv Block 4
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            
            # Conv Block 5
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        # The output of the last pooling layer is 512x1x1 for a 32x32 input;
        # with the 227x227 resize used in the data loaders above it is 512x7x7
        self.classifier = nn.Sequential(
            nn.Linear(512 * self.height * self.width, 4096),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(4096, num_classes)
        )
        
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten the feature maps into a vector
        x = self.classifier(x)
        return x
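
The values for `self.width` and `self.height` depend on the input size. In the network above, every 3x3 convolution with padding 1 keeps the spatial size and every 2x2 max pooling with stride 2 halves it (rounding down). The sketch below computes the resulting feature map size for a few input sizes. The helper name `vgg_feature_map_size` is made up for this example.

```python
def vgg_feature_map_size(input_size, num_pools=5):
    """Spatial size after the five conv blocks of the VGG16 model above."""
    size = input_size
    for _ in range(num_pools):
        # 3x3 convs with padding 1 keep the size; 2x2 pooling with stride 2 halves it
        size = (size - 2) // 2 + 1
    return size


for input_size in (32, 224, 227):
    side = vgg_feature_map_size(input_size)
    print(f"input {input_size}x{input_size} -> features 512x{side}x{side} "
          f"-> {512 * side * side} inputs to the first linear layer")
```

Pick the width and height that match the input size produced by your data loader transform; a mismatch leads to a shape error in the first `nn.Linear` layer.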


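## Step 2-3:

Reuse the corresponding cells from the AlexNet section above (imports, device selection, and data loaders); run them first if you have not already.

## Step 4: Initialize Network and Define Loss Function and Optimizer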

In [None]:
# create the model
model = VGG16_ForCIFAR10(num_classes=10)
model.to(device)  # Move the model to the selected device ('cuda' or 'cpu')

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.0001)

## Step 5-6

Reuse the `train` and `evaluate_model` functions defined in the AlexNet section above; run those cells first if you have not already.

## Step 7: Train and evaluate the network

Training is time-intensive and may take around 40 minutes.

In [None]:
train(model, device, train_loader, criterion, optimizer, epochs=10)
evaluate_model(model, test_loader)

The model tends to overfit the data, but a test accuracy of 80-90% should be achievable. I got 83.49% on a first run without fine-tuning.