This is an implementation of AlexNet, introduced by Alex Krizhevsky et al. in the paper **"ImageNet Classification with Deep Convolutional Neural Networks"**. This notebook is meant for me to take notes on dansuh17's implementation of the network, found [here](https://github.com/dansuh17/alexnet-pytorch/blob/d0c1b1c52296ffcbecfbf5b17e1d1685b4ca6744/README.md).

## Preparing the Project

Let's start by installing the libraries listed in `requirements.txt`.

In [None]:
%pip install numpy
%pip install Pillow
%pip install protobuf
%pip install six
%pip install torch
%pip install torchvision

Great, now let's import the required libraries into our notebook!

In [None]:
import os
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils import data

# Torchvision will sometimes give you errors. Just reinstall it
import torchvision.datasets as datasets
import torchvision.transforms as transforms
# from tensorboardX import SummaryWriter

PyTorch lets us run our code on either the CPU or the GPU, so let's set that up.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Now we need to define some parameters for our model.

In [None]:
NUM_EPOCHS = 90
BATCH_SIZE = 128
MOMENTUM = 0.9
LR_DECAY = 0.0005
LR_INIT = 0.01
IMAGE_DIM = 227
NUM_CLASSES = 1000
DEVICE_IDS = [0, 1, 2, 3]
INPUT_ROOT_DIR = 'alexnet_data_in'
TRAIN_IMG_DIR = 'alexnet_data_in/imagenet'
OUTPUT_DIR = 'alexnet_data_out'
CHECKPOINT_DIR = OUTPUT_DIR + '/models'

Let's review what each parameter defines.

- `NUM_EPOCHS` : number of times the full dataset passes through the model (1 epoch = 1 complete pass over the data)
- `BATCH_SIZE` : number of samples used in one forward and backward pass through the network
- `MOMENTUM` : additional parameter that accelerates gradient descent in a consistent direction (defined here, but not used by the Adam optimizer set up later in this notebook)
- `LR_DECAY` : decay coefficient (0.0005 is the weight decay value used in the AlexNet paper); defined here, but not used by the optimizer or scheduler set up later in this notebook
- `LR_INIT` : initial learning rate (the optimizer later in this notebook hardcodes its own value instead)
- `IMAGE_DIM` : height and width, in pixels, of the input image
- `NUM_CLASSES` : number of classes the model can predict
- `DEVICE_IDS` : IDs of the GPUs to train on
- `INPUT_ROOT_DIR` : root directory where input data is stored
- `TRAIN_IMG_DIR` : directory where the training images are stored
- `OUTPUT_DIR` : where the model outputs will be saved
- `CHECKPOINT_DIR` : place to store model checkpoints (includes model parameters at various stages)


In [None]:
# Initialize the folder that will contain the checkpoint data
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

## Building AlexNet



Here's a quick overview of AlexNet.

The overall **MAIN** architecture of AlexNet contains eight layers with weights.

- 5 convolutional
- 3 fully connected (used for classifying input images into a label)

Neurons in each fully connected layer are connected to all neurons in the previous layer.

- The final fully connected layer is fed to a 1000-way softmax that produces a distribution over 1000 class labels
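
To make the softmax step concrete, here is a minimal sketch in plain Python. The function name `softmax` and the three example scores are my own illustration (the real network produces 1000 scores), and in the notebook itself the softmax is folded into `F.cross_entropy` rather than called directly.

```python
import math


def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for 3 classes instead of 1000
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score gets the largest probability, and the ordering of the scores is preserved.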

Each convolutional layer has its own kernels (n x n matrices of numbers used to filter information from the image).

- **2nd**, **4th**, and **5th** layers have their kernels connected only to the kernel maps in the previous layer that are on the same GPU
- **3rd** layer has its kernels connected to all kernel maps in the 2nd layer

Throughout the main architecture, a mix of response-normalization, max-pooling, and ReLU non-linearity layers is placed in between.

- Response-normalization layer : normalizes the activations of the previous layer
    - each activation is divided by a term that grows with the squared activations of neighboring channels at the same spatial position
    - this makes neighboring kernel maps compete with each other, so strong responses stand out
    - useful right after ReLU layers, since those have unbounded activations
- Max-pooling layer : takes the maximum value over each patch of the feature map
- ReLU non-linearity layer : non-linear activation function that computes f(x) = max(0, x)
    - mitigates the vanishing gradient problem, since the gradient is 1 for every positive input and does not shrink the way it does with saturating functions such as sigmoid or tanh
    - has unbounded activations (which normalization helps keep in check)
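
To see what these three layers do to a tensor, here is a small sketch that runs them in the same order as the first block of the network. The random input of shape `(1, 96, 55, 55)` is an assumption standing in for the output of the first convolution.

```python
import torch
import torch.nn as nn

# Stand-in for the output of the first convolution: (batch, channels, height, width)
x = torch.randn(1, 96, 55, 55)

relu = nn.ReLU()
lrn = nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2)
pool = nn.MaxPool2d(kernel_size=3, stride=2)

activated = relu(x)           # negatives become 0, shape unchanged
normalized = lrn(activated)   # each value is scaled down using neighboring channels
pooled = pool(normalized)     # spatial size shrinks from 55 x 55 to 27 x 27

print(activated.min().item())
print(normalized.shape)
print(pooled.shape)
```

ReLU and response normalization keep the shape intact; only max-pooling changes the spatial size.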

With that in mind, let's code it out.

In [None]:
class AlexNet(nn.Module):
    """
    Neural Network model with layers proposed by AlexNet paper
    """

    def __init__(self, num_classes=1000):
        super().__init__()

        # Input image should be (b x 3 x 227 x 227)
        self.net = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),    # (b x 96 x 55 x 55)
            nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),  # (b x 96 x 27 x 27)
            nn.Conv2d(96, 256, 5, padding=2),   # (b x 256 x 27 x 27)
            nn.ReLU(),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2), 
            nn.MaxPool2d(kernel_size=3, stride=2),  # (b x 256 x 13 x 13)
            nn.Conv2d(256, 384, 3, padding=1),  # (b x 384 x 13 x 13)
            nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1),  # (b x 384 x 13 x 13)
            nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1),  # (b x 256 x 13 x 13)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2)   # (b x 256 x 6 x 6)
        )

        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=(256 * 6 * 6), out_features=4096),
            nn.ReLU(),
            nn.Dropout(p=0.5, inplace=True),
            nn.Linear(in_features=4096, out_features=4096),
            nn.ReLU(),
            nn.Linear(in_features=4096, out_features=num_classes)
        )

        # Apply the weight and bias initialization defined in init_bias below
        self.init_bias()

    def init_bias(self):
        for layer in self.net:
            if isinstance(layer, nn.Conv2d):
                nn.init.normal_(layer.weight, mean=0, std=0.01)
                nn.init.constant_(layer.bias, 0)
        nn.init.constant_(self.net[4].bias, 1)
        nn.init.constant_(self.net[10].bias, 1)
        nn.init.constant_(self.net[12].bias, 1)
    
    def forward(self, x):
        x = self.net(x)
        x = x.view(-1, 256 * 6 * 6)
        return self.classifier(x)
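
The shape comments in the model can be checked with a bit of arithmetic. The sketch below assumes a 227 x 227 input (the value of `IMAGE_DIM`) and uses the standard output-size formula for convolution and pooling layers; the helper name `out_size` is my own.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output height/width of a conv or pooling layer for a square input."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for every conv and pooling layer in self.net, in order
layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 227
sizes = []
for name, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name}: {size} x {size}")

# 256 channels of 6 x 6 feature maps feed the first fully connected layer
flat_features = 256 * size * size
print(flat_features)
```

This is why the first `nn.Linear` layer takes `256 * 6 * 6` input features.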


## Preparing to Train



With our AlexNet, let's see how we can train it and have it classify images.

First, let's do some initial setup.

In [None]:
seed = torch.initial_seed()
print(f"Current seed : {seed}.")

# Create the model
alexnet = AlexNet(num_classes=NUM_CLASSES).to(device)
# Train the model on multiple GPUs
alexnet = torch.nn.parallel.DataParallel(alexnet, device_ids=DEVICE_IDS)
print(f"AlexNet created : {alexnet}.")

dataset = datasets.ImageFolder(TRAIN_IMG_DIR, transforms.Compose([
    transforms.CenterCrop(IMAGE_DIM),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406], 
        std=[0.229, 0.224, 0.225]
    )
]))
print("Dataset created.")

dataloader = data.DataLoader(
    dataset,
    shuffle=True,
    pin_memory=True,
    num_workers=8,
    drop_last=True,
    batch_size=BATCH_SIZE
)
print("Dataloader created.")

optimizer = optim.Adam(params=alexnet.parameters(), lr=0.0001)
print("Optimizer created.")

lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
print("LR Scheduler created.")

There are 4 points of interest in the code above. Let's take a look at each one.

```python
dataset = datasets.ImageFolder(TRAIN_IMG_DIR, transforms.Compose([
    transforms.CenterCrop(IMAGE_DIM),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406], 
        std=[0.229, 0.224, 0.225]
    )
]))
```

`datasets.ImageFolder` is a `torchvision.datasets` class designed to handle images in a directory organized by class label (each class has its own subdirectory). It takes the path to the data and a composed list of transform operations to apply to each image in the dataset.

- `CenterCrop` : crops the image to `IMAGE_DIM` x `IMAGE_DIM` around its center
- `ToTensor` : converts the image to a PyTorch tensor with values scaled to [0, 1] (makes it so PyTorch can work with the data)
- `Normalize` : subtracts the given per-channel mean and divides by the per-channel standard deviation, so the data has roughly zero mean and unit variance
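
As a rough illustration of the normalization step, the sketch below reproduces the per-channel arithmetic by hand with plain `torch` operations. The random tensor `img` is an assumption standing in for what `ToTensor` would return; no real image or `torchvision` call is involved.

```python
import torch

# Per-channel statistics from the transform above, shaped to broadcast over (C, H, W)
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# Stand-in for a ToTensor output: 3 channels, values in [0, 1]
img = torch.rand(3, 227, 227)

# Same arithmetic as the normalization step: (value - mean) / std, channel by channel
normalized = (img - mean) / std

# A pixel that equals the channel mean maps to exactly 0
mean_pixel = (mean.expand(3, 1, 1) - mean) / std

print(normalized.shape)
print(mean_pixel.flatten())
```

Values above the channel mean end up positive, values below it end up negative.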

```python
dataloader = data.DataLoader(
    dataset,
    shuffle=True,
    pin_memory=True,
    num_workers=8,
    drop_last=True,
    batch_size=BATCH_SIZE
)
```

`data.DataLoader` sets up a PyTorch object that efficiently loads and iterates over batches of data during the training process.

- `dataset` : the dataset to load from
- `shuffle` : data will be shuffled on every epoch
- `pin_memory` : whether to use pinned memory for faster data transfer to the GPU
    - **pinned memory** : page-locked memory on the host (CPU) side, which the GPU can copy from more quickly
- `num_workers` : number of subprocesses used to load the data in parallel
- `drop_last` : will drop the last incomplete batch if the dataset size is not divisible by the batch size
- `batch_size` : specifies the number of samples per batch
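
Here is a small sketch of how `batch_size` and `drop_last` interact. It uses a toy `TensorDataset` of 10 random samples (an assumption, in place of the ImageNet folder) so it runs without any image files.

```python
import torch
from torch.utils import data

# Toy dataset: 10 samples with 3 features each, labels 0..9
toy_dataset = data.TensorDataset(torch.randn(10, 3), torch.arange(10))

# 10 samples with batch_size=4 gives batches of 4, 4, and 2; drop_last discards the 2
loader = data.DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)
batch_sizes = [len(labels) for _, labels in loader]
print(batch_sizes)

# Without drop_last, the incomplete final batch is kept
keep_last = data.DataLoader(toy_dataset, batch_size=4, drop_last=False)
print(len(keep_last))
```

Dropping the last batch keeps every training step the same size, at the cost of skipping a few samples each epoch (different ones each time, since the data is reshuffled).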

```python
optimizer = optim.Adam(params=alexnet.parameters(), lr=0.0001)
```

This implementation uses the **Adam optimizer**, a very popular option for tuning parameters during training (the original AlexNet paper trained with stochastic gradient descent with momentum instead). Adam combines ideas from 2 extensions of stochastic gradient descent:

- Adaptive Gradient Algorithm (AdaGrad) : maintains a per-parameter learning rate
- Root Mean Square Propagation (RMSProp) : maintains per-parameter learning rates that are adapted based on a moving average of recent gradient magnitudes

In our AlexNet implementation, the Adam optimizer adjusts the parameters in `alexnet.parameters()` with a learning rate of 0.0001.
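
To give a feel for what Adam does on each step, here is a minimal sketch of its update rule for a single scalar parameter, written in plain Python. The function name `adam_minimize`, the toy objective (w - 3)^2, and the learning rate of 0.1 are assumptions chosen for illustration; `optim.Adam` applies the same idea to every parameter tensor.

```python
import math


def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a function of one scalar parameter with the Adam update rule."""
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum-like average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # RMSProp-like average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective learning rate.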

```python
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```

This scheduler gives us a dynamic learning rate. With the parameters given, the learning rate of our Adam optimizer is multiplied by 0.1 every 30 epochs.
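
The resulting schedule can be inspected without training anything. The sketch below attaches `StepLR` to a single dummy parameter (an assumption, in place of the real model) and records the learning rate at the start of each of 90 epochs.

```python
import torch
import torch.optim as optim

# A single dummy parameter stands in for the model's parameters
param = torch.nn.Parameter(torch.zeros(1))
param.grad = torch.ones_like(param)  # fake gradient so optimizer.step() has work to do

optimizer = optim.Adam([param], lr=0.0001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

lrs = []
for epoch in range(90):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used during this epoch
    optimizer.step()    # stands in for a full epoch of parameter updates
    scheduler.step()    # advance the schedule once per epoch, after the updates

print(lrs[0], lrs[29], lrs[30], lrs[60])
```

Epochs 0 to 29 run at 0.0001, epochs 30 to 59 at one tenth of that, and epochs 60 to 89 at one hundredth.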

## Training the Model

In [None]:
print("Training start...")

total_steps = 1
for epoch in range(NUM_EPOCHS):
    for imgs, classes in dataloader:
        imgs, classes = imgs.to(device), classes.to(device)

        # Forward pass: compute the predictions and the loss
        output = alexnet(imgs)
        loss = F.cross_entropy(output, classes)

        # Backward pass: clear old gradients, backpropagate, update the parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Every 100 steps, print out gradient values and parameter average values
        if total_steps % 100 == 0:
            with torch.no_grad():
                print("*" * 10)
                for name, parameter in alexnet.named_parameters():
                    if parameter.grad is not None:
                        avg_grad = torch.mean(parameter.grad)
                        print(f"\t{name} - grad_avg: {avg_grad}")
                    avg_weight = torch.mean(parameter.data)
                    print(f"\t{name} - param_avg: {avg_weight}")
        total_steps += 1

    # Advance the learning rate schedule once per epoch, after the optimizer updates
    lr_scheduler.step()

    # Save checkpoints (state dicts rather than whole objects, so they are easy to reload)
    checkpoint_path = os.path.join(CHECKPOINT_DIR, f"alexnet_states_e{epoch + 1}.pkl")
    state = {
        "epoch": epoch,
        "total_steps": total_steps,
        "optimizer": optimizer.state_dict(),
        "model": alexnet.state_dict(),
        "seed": seed
    }
    torch.save(state, checkpoint_path)

Let's break down the training process.

1. `for` each batch of images in our dataset (and their corresponding classes)
    - Move them to the GPU if available
    - Run the forward pass to get predictions for the batch
    - Get the loss by comparing the predictions with the desired outputs
    - Zero out the gradients on the optimizer (makes sure we are not retaining previous information)
    - Compute the gradients of the loss with back propagation
    - Update the parameters based on those gradients
2. Advance the learning rate schedule with `lr_scheduler.step()`. The learning rate only changes when the epoch count reaches a multiple of `step_size`, and the call comes after the epoch's `optimizer.step()` calls, which is the order PyTorch expects.
3. Save the progress for the epoch.
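
A saved checkpoint is only useful if it can be loaded again. The sketch below shows one way a save and restore round trip could look, assuming the checkpoint stores `state_dict()` values for both the model and the optimizer. The tiny `nn.Linear` model, the temporary directory, and the file name `toy_states_e1.pkl` are stand-ins for illustration, not part of the notebook.

```python
import os
import tempfile

import torch
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in model and optimizer (the notebook would use alexnet and its optimizer)
model = nn.Linear(4, 2)
optimizer = optim.Adam(model.parameters(), lr=0.0001)

state = {
    "epoch": 0,
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_states_e1.pkl")
    torch.save(state, path)
    restored = torch.load(path)

# Rebuild fresh objects, then load the saved state into them
new_model = nn.Linear(4, 2)
new_model.load_state_dict(restored["model"])
new_optimizer = optim.Adam(new_model.parameters(), lr=0.0001)
new_optimizer.load_state_dict(restored["optimizer"])

resume_epoch = restored["epoch"] + 1  # continue with the epoch after the saved one
print(resume_epoch)
```

Note that a model wrapped in `DataParallel` saves its parameter names with a `module.` prefix, so it should be loaded back into a model wrapped the same way.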