# I) Summary

**DISCLAIMER**: We will use the weights of a pretrained model.

AlexNet architecture:

- **5 Convolutional layers**.
- **3 Fully connected layers**.
- **3 Overlapping Max pooling layers**.
- **ReLU** as activation function for the hidden layers.

    - Avoids vanishing gradients for positive inputs (the gradient there is exactly 1).
    - Cheaper to compute than sigmoid and tanh.
    - Converges faster than sigmoid and tanh (the paper reports reaching 25% training error on CIFAR-10 about six times faster than with tanh).

- **Softmax** as activation function for output layer.
- **About 60 million trainable parameters**.
- **Cross-entropy** as cost function.
- **Mini-batch gradient descent with Momentum optimizer**.
    - Batch size : 128.
    - Momentum = 0.9.
    - Weight decay = 0.0005.
    - Learning rate: 0.01, shared by all layers and divided by 10 whenever the validation error stops improving.
- **Local Response Normalization**.
    - Helps with generalization (the paper reports top-1 and top-5 error reductions of 1.4% and 1.2%).
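
The optimizer settings listed above map fairly directly onto PyTorch. The sketch below is one possible setup, not the authors' training code: `model` is a small stand-in network, and the `ReduceLROnPlateau` scheduler with `patience=5` is an assumption that automates the "divide by 10 when the validation error stops improving" rule, which the paper applied by hand.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in model so the snippet runs on its own; swap in the real network.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

# CrossEntropyLoss applies log-softmax internally, so the model outputs raw logits.
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005)

# Assumption: patience=5 epochs is an arbitrary choice, not a value from the paper.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5
)

# Once per epoch, after measuring the validation error:
#     scheduler.step(validation_error)
```

The batch size of 128 would be set on the `DataLoader` (`batch_size=128`) rather than on the optimizer.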

--- 

AlexNet details:

- Trained on the ILSVRC-2010 dataset (1.2 million training images, 50,000 validation images, and 150,000 testing images).
- Trained for roughly **90 epochs**.
- **Weight initialization**: zero-mean Gaussian distribution with a standard deviation of 0.01.
- **Bias initialization**: 1 for the 2nd/4th/5th conv layers and the fully-connected hidden layers, 0 for the remaining layers.
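
As a rough illustration of this initialization scheme, the helper below is a sketch under a few assumptions: `init_paper_style` is a hypothetical name, the layers are passed in network order, and the output layer is treated as one of the "remaining" layers (bias 0). It only matters when training from scratch, since loading pretrained weights overwrites these values.

```python
import torch
import torch.nn as nn

def init_paper_style(conv_layers, fc_layers):
    """Initialize layers in place.

    conv_layers: the convolutional layers in network order.
    fc_layers:   the fully connected layers in network order (last one = output).
    """
    for index, layer in enumerate(conv_layers, start=1):
        nn.init.normal_(layer.weight, mean=0.0, std=0.01)
        # Bias 1 for the 2nd, 4th and 5th convolutions, 0 for the others.
        nn.init.constant_(layer.bias, 1.0 if index in (2, 4, 5) else 0.0)
    for index, layer in enumerate(fc_layers, start=1):
        nn.init.normal_(layer.weight, mean=0.0, std=0.01)
        # Assumption: the output layer counts as a "remaining" layer (bias 0).
        is_hidden = index < len(fc_layers)
        nn.init.constant_(layer.bias, 1.0 if is_hidden else 0.0)

# Tiny hypothetical layers, just to exercise the function.
convs = [nn.Conv2d(3, 8, 3)] + [nn.Conv2d(8, 8, 3) for _ in range(4)]
fcs = [nn.Linear(32, 32), nn.Linear(32, 32), nn.Linear(32, 10)]
init_paper_style(convs, fcs)
```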

--- 

AlexNet inputs:

- **RGB image of size 256 x 256**. Training/test images of any other size need to be resized first.
   - Example: image_size = 1024 x 500 => the smaller dimension is resized to 256 (giving about 524 x 256) and the central 256 x 256 patch is cropped out.
 
- The 256 x 256 RGB image is then **cropped to 227 x 227** (cf. Data Augmentation part). The paper states 224 x 224, but only 227 x 227 gives the 55 x 55 output of the first convolution: (227 − 11) / 4 + 1 = 55.

--- 

AlexNet is prone to overfitting. To prevent that:

- **Dropout**.
    - 50% dropout rate.
- **Data Augmentation**.
    - **Translations and horizontal reflections (mirroring)**: extract random 227 x 227 crops from the 256 x 256 images.
        - Translation on 1 image: (256−227)∗(256−227) = 841 possible images.
        - Mirroring: x2 the training set size.
        - New training set size = 1.2 million * 2 * 841 = 1.2 million * 1682 images.
    - **Altering the intensities of RGB channels**: perform PCA on the set of RGB pixel values throughout the ImageNet training set, then add to every pixel of an image the principal components scaled by their eigenvalues times a random Gaussian factor (mean 0, standard deviation 0.1). This approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination.
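
Both augmentations can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the paper's pipeline: the function names are made up, the PCA is computed on one synthetic image instead of the whole training set, and `sigma=0.1` is the standard deviation given in the paper.

```python
import numpy as np

def random_crop_and_flip(image, crop=227, rng=None):
    """Return a random crop x crop patch of an H x W x 3 array, mirrored half the time."""
    rng = np.random.default_rng() if rng is None else rng
    height, width, _ = image.shape
    top = rng.integers(0, height - crop + 1)   # offsets 0 .. height - crop inclusive
    left = rng.integers(0, width - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]                 # horizontal reflection
    return patch

def rgb_principal_components(pixels):
    """PCA of an N x 3 array of RGB values: returns (eigenvalues, eigenvectors)."""
    covariance = np.cov(pixels, rowvar=False)  # 3 x 3 covariance matrix
    return np.linalg.eigh(covariance)          # eigenvectors are the columns

def pca_color_shift(image, eigenvalues, eigenvectors, sigma=0.1, rng=None):
    """Add the same random RGB offset to every pixel of a float H x W x 3 image."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = rng.normal(0.0, sigma, size=3)    # one random factor per component
    offset = eigenvectors @ (alphas * eigenvalues)
    return image + offset

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))              # synthetic image with values in [0, 1)
eigenvalues, eigenvectors = rgb_principal_components(image.reshape(-1, 3))
patch = random_crop_and_flip(image, rng=rng)
augmented = pca_color_shift(patch, eigenvalues, eigenvectors, rng=rng)
print(augmented.shape)  # (227, 227, 3)
```

In a real pipeline the eigenvalues and eigenvectors would be computed once over the training set and reused, while the crop, flip and color offset are redrawn every time an image is loaded.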

**Remark**:

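- The augmented images are never stored or enumerated in full. According to the paper, they are generated on the fly on the CPU while the GPU trains on the previous batch, so each epoch still covers only the 1.2 million training images. Iterating over the whole augmented set would not be realistic:
    - Suppose one forward/backward pass on one image takes 0.001s. Training would take (0.001 * 1,200,000 * 1682 * 90) / (60 * 60 * 24 * 365) ~= **5.8 years**.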
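
The estimate can be reproduced with a few lines of arithmetic. The 0.001s cost per image is the assumption made in the remark, not a measured value.

```python
seconds_per_image = 0.001          # assumed cost of one forward/backward pass
training_images = 1_200_000
augmentation_factor = 2 * 841      # mirroring x translations = 1682
epochs = 90

total_seconds = seconds_per_image * training_images * augmentation_factor * epochs
years = total_seconds / (60 * 60 * 24 * 365)
print(f"{years:.2f} years")        # 5.76 years

# Same assumption, but one pass over the 1.2 million images per epoch.
hours_plain = seconds_per_image * training_images * epochs / 3600
print(f"{hours_plain:.1f} hours")  # 30.0 hours
```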


![legend](../../img/legend.png)

![AlexNet model](../../img/alexnet-model.png)

# II) Implementation

In [4]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import transforms, datasets
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## c) Architecture build

In [5]:
class AlexNet(nn.Module):
    """AlexNet for 3 x 227 x 227 inputs; returns raw logits for 1000 classes."""

    def __init__(self):
        super().__init__()

        # Feature extractor. Shape comments are (channels x height x width).
        # LocalResponseNorm keeps PyTorch's default k=1 (the paper uses k=2).
        self.convolution = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),  # 96 x 55 x 55
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 96 x 27 x 27
            nn.Conv2d(in_channels=96, out_channels=256, kernel_size=5, padding=2),  # 256 x 27 x 27
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 256 x 13 x 13
            nn.Conv2d(in_channels=256, out_channels=384, kernel_size=3, padding=1),  # 384 x 13 x 13
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=384, out_channels=384, kernel_size=3, padding=1),  # 384 x 13 x 13
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=384, out_channels=256, kernel_size=3, padding=1),  # 256 x 13 x 13
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 256 x 6 x 6
        )
        # Classifier. No softmax here: nn.CrossEntropyLoss expects raw logits.
        self.fully_connected = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 1000),
        )

    def forward(self, x):
        x = self.convolution(x)
        # Flatten (batch, 256, 6, 6) to (batch, 9216) for the linear layers.
        x = x.view(x.shape[0], 256 * 6 * 6)
        x = self.fully_connected(x)
        return x
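
The `256 * 6 * 6` input size of the first linear layer follows from the kernel, stride and padding of each layer, assuming a 227 x 227 input. The sketch below retraces that arithmetic and counts the parameters of the architecture as written above; `conv_out` is a helper introduced only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

size = 227
sizes = []
# (kernel, stride, padding) for conv1, pool1, conv2, pool2, conv3, conv4, conv5, pool3
for kernel, stride, padding in [(11, 4, 0), (3, 2, 0), (5, 1, 2), (3, 2, 0),
                                (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 0)]:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)
print(sizes)  # [55, 27, 27, 13, 13, 13, 13, 6]

# Weights + biases per layer: (in_channels, out_channels, kernel) for the convolutions.
conv_params = sum(c_in * c_out * k * k + c_out
                  for c_in, c_out, k in [(3, 96, 11), (96, 256, 5), (256, 384, 3),
                                         (384, 384, 3), (384, 256, 3)])
# (in_features, out_features) for the three linear layers.
fc_params = sum(n_in * n_out + n_out
                for n_in, n_out in [(256 * 6 * 6, 4096), (4096, 4096), (4096, 1000)])
total_params = conv_params + fc_params
print(total_params)  # 62378344
```

The total comes out slightly above the roughly 60 million usually quoted, most likely because this single-device version does not split the 2nd, 4th and 5th convolutions across two GPUs as the original network did.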