# AlexNet Architecture

### Introduction

AlexNet is a groundbreaking convolutional neural network (CNN) architecture designed by Alex Krizhevsky together with Ilya Sutskever and their advisor Geoffrey Hinton. It gained fame as the winner of the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where it significantly outperformed traditional machine learning approaches.

<br>

After AlexNet's success, the field of deep learning saw a surge in the development of deeper and more complex neural network architectures, including influential models like VGGNet and GoogLeNet.

<br>

In the ILSVRC 2012 competition, AlexNet achieved a top-5 error rate of 15.3%, meaning the correct label was among the model's five highest-ranked predictions for 84.7% of the test images. The second-best entry, built on traditional hand-engineered features, reached 26.2%. This gap marked a significant milestone in image classification and demonstrated the effectiveness of deep learning for visual recognition tasks.
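
To make the top-5 metric concrete, here is a minimal sketch in plain Python. The function name `top_k_error` and the score values are made up for illustration; they are not taken from the competition or from any library.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to the best k
        ranked = sorted(range(len(sample_scores)), key=lambda c: sample_scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Hypothetical scores for 3 samples over 6 classes (illustrative numbers only)
scores = [
    [0.05, 0.08, 0.40, 0.20, 0.15, 0.12],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
    [0.02, 0.03, 0.05, 0.10, 0.30, 0.50],
]
labels = [2, 5, 3]

print("top-1 error:", top_k_error(scores, labels, k=1))
print("top-5 error:", top_k_error(scores, labels, k=5))
```

The third sample is wrong under top-1 but counts as correct under top-5, which is why top-5 error is always less than or equal to top-1 error.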



![title](https://raw.githubusercontent.com/blurred-machine/Data-Science/master/Deep%20Learning%20SOTA/img/alexnet2.png)

#### The following table explains the network structure of AlexNet:

<table>
<thead>
	<tr>
		<th>Size / Operation</th>
		<th>Filter</th>
		<th>Depth</th>
		<th>Stride</th>
		<th>Padding</th>
		<th>Number of Parameters</th>
		<th>Forward Computation</th>
	</tr>
</thead>
<tbody>
	<tr>
		<td>Input Image<br>(3 * 227 * 227)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv1 + Relu<br>(96 * 55 * 55)</td>
		<td>11 * 11</td>
		<td>96</td>
		<td>4</td>
		<td></td>
		<td>(11 * 11 * 3 + 1) * 96 = 34,944</td>
		<td>(11 * 11 * 3 + 1) * 96 * 55 * 55 = 105,705,600</td>
	</tr>
	<tr>
		<td>Max Pooling<br>(96 * 27 * 27)</td>
		<td>3 * 3</td>
		<td></td>
		<td>2</td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Norm</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv2 + Relu<br>(256 * 27 * 27)</td>
		<td>5 * 5</td>
		<td>256</td>
		<td>1</td>
		<td>2</td>
		<td>(5 * 5 * 96 + 1) * 256 = 614,656</td>
		<td>(5 * 5 * 96 + 1) * 256 * 27 * 27 = 448,084,224</td>
	</tr>
	<tr>
		<td>Max Pooling<br>(256 * 13 * 13)</td>
		<td>3 * 3</td>
		<td></td>
		<td>2</td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Norm</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Conv3 + Relu<br>(384 * 13 * 13)</td>
		<td>3 * 3</td>
		<td>384</td>
		<td>1</td>
		<td>1</td>
		<td>(3 * 3 * 256 + 1) * 384 = 885,120</td>
		<td>(3 * 3 * 256 + 1) * 384 * 13 * 13 = 149,585,280</td>
	</tr>
	<tr>
		<td>Conv4 + Relu<br>(384 * 13 * 13)</td>
		<td>3 * 3</td>
		<td>384</td>
		<td>1</td>
		<td>1</td>
		<td>(3 * 3 * 384 + 1) * 384 = 1,327,488</td>
		<td>(3 * 3 * 384 + 1) * 384 * 13 * 13 = 224,345,472</td>
	</tr>
	<tr>
		<td>Conv5 + Relu<br>(256 * 13 * 13)</td>
		<td>3 * 3</td>
		<td>256</td>
		<td>1</td>
		<td>1</td>
		<td>(3 * 3 * 384 + 1) * 256 = 884,992</td>
		<td>(3 * 3 * 384 + 1) * 256 * 13 * 13 = 149,563,648</td>
	</tr>
	<tr>
		<td>Max Pooling<br>(256 * 6 * 6)</td>
		<td>3 * 3</td>
		<td></td>
		<td>2</td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>Dropout (rate 0.5)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>FC6 + Relu<br>(4096)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>256 * 6 * 6 * 4096 = 37,748,736</td>
		<td>256 * 6 * 6 * 4096 = 37,748,736</td>
	</tr>
	<tr>
		<td>Dropout (rate 0.5)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
	</tr>
	<tr>
		<td>FC7 + Relu<br>(4096)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>4096 * 4096 = 16,777,216</td>
		<td>4096 * 4096 = 16,777,216</td>
	</tr>
	<tr>
		<td>FC8 + Softmax<br>(1000 classes)</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>4096 * 1000 = 4,096,000</td>
		<td>4096 * 1000 = 4,096,000</td>
	</tr>
	<tr>
		<td>Overall</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>62,369,152 = 62.3 million</td>
		<td>1,135,906,176 = 1.1 billion</td>
	</tr>
	<tr>
		<td>Conv VS FC</td>
		<td></td>
		<td></td>
		<td></td>
		<td></td>
		<td>Conv: 3.7 million (6%), FC: 58.6 million (94%)</td>
		<td>Conv: 1.08 billion (95%), FC: 58.6 million (5%)</td>
	</tr>
</tbody>
</table>
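
The sizes and parameter counts in the table can be reproduced with a few lines of arithmetic. The sketch below assumes the standard output-size formula `(size + 2 * padding - kernel) // stride + 1`; the helper names `conv_out` and `conv_params` are illustrative. As in the table, biases are counted for the convolutional layers but omitted for the fully connected ones.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per output channel."""
    return (kernel * kernel * in_channels + 1) * out_channels


# (name, kernel, stride, padding) in the order used by the table
layers = [
    ("conv1", 11, 4, 0), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

size = 227
sizes = {}
for name, kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes[name] = size
print(sizes)

# (kernel, in_channels, out_channels) for the five convolutional layers
conv_total = sum(
    conv_params(k, c_in, c_out)
    for k, c_in, c_out in [(11, 3, 96), (5, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)]
)
fc_total = 256 * 6 * 6 * 4096 + 4096 * 4096 + 4096 * 1000  # biases omitted, as in the table
print(conv_total, fc_total, conv_total + fc_total)
```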


## Why Does AlexNet Achieve Better Results?



In this section, we will briefly highlight four key techniques that contribute to the improved performance of AlexNet:


### **ReLU Activation Function**

- **ReLU Function**: The Rectified Linear Unit (ReLU) activation function is defined as \( f(x) = \max(0, x) \). This simple yet effective function allows for faster training of deep convolutional networks compared to traditional activation functions like sigmoid and hyperbolic tangent (tanh).

- **Training Speed**: ReLU-based networks can be trained several times faster than those using tanh or sigmoid. The speed-up comes from ReLU being non-saturating: its gradient is 1 for every positive input, whereas the gradients of tanh and sigmoid shrink towards zero as |x| grows, which is the vanishing gradient problem. The figure below, from the AlexNet paper, shows the number of iterations a four-layer convolutional network needs on the CIFAR-10 dataset to reach 25% training error; the ReLU network gets there about six times faster than the tanh network.






![alex1](https://raw.githubusercontent.com/blurred-machine/Data-Science/master/Deep%20Learning%20SOTA/img/alex512.png)
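
The saturation behaviour behind this speed difference can be checked numerically. The following sketch uses only the standard library and compares the derivatives of ReLU, tanh, and sigmoid at a few positive inputs; the helper names are illustrative.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative of max(0, x); the value at exactly 0 is set to 0 by convention
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.5f}, sigmoid'={sigmoid_grad(x):.5f}")
```

For large positive inputs the tanh and sigmoid derivatives become tiny, so very little gradient flows back through those units, while the ReLU derivative stays at 1.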




### **Local Response Normalization (LRN)**

- **Normalization After ReLU**: Unlike the outputs of tanh or sigmoid, ReLU outputs are unbounded above. AlexNet therefore applies Local Response Normalization (LRN) after the ReLU of its first two convolutional layers: each activation is divided by a term that grows with the squared activations of neighbouring channels at the same spatial position. The AlexNet paper reports that this reduced the top-1 and top-5 error rates by 1.4% and 1.2%, respectively.



- **Biological Inspiration**: LRN is inspired by a phenomenon in neuroscience known as "lateral inhibition," where active neurons suppress the activity of neighboring neurons. This concept helps to emphasize the most salient features detected by the network.


![alex2](https://iq.opengenus.org/content/images/2022/05/1-QspNGlKrJ5dAW9VCrP4T4Q.png)
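
As a rough illustration, the sketch below applies the normalization from the AlexNet paper, b_i = a_i / (k + alpha * sum_j a_j^2)^beta with the sum running over the n channels around channel i, to the channel activations at a single spatial position. It is written in plain Python for readability, and the activation values are made up. Note that PyTorch's `nn.LocalResponseNorm` additionally divides `alpha` by the window size, so its outputs differ slightly from this formula.

```python
def local_response_norm(activations, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its n neighbouring channels.

    `activations` holds the ReLU outputs of all channels at one spatial position.
    """
    num_channels = len(activations)
    half = n // 2
    normalized = []
    for i, a_i in enumerate(activations):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        squared_sum = sum(a * a for a in activations[lo:hi + 1])
        normalized.append(a_i / (k + alpha * squared_sum) ** beta)
    return normalized


# Channel 2 fires strongly; channels 1 and 7 have the same raw activation
acts = [0.0, 1.0, 50.0, 1.0, 0.0, 0.0, 0.0, 1.0]
out = local_response_norm(acts)
print(out[1], out[7])
```

Channel 1 sits next to the strongly active channel and is suppressed more than channel 7, which is the lateral-inhibition effect described above.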



### **Dropout**

- **Preventing Overfitting**: Dropout is a regularization technique that effectively mitigates overfitting in neural networks. Unlike traditional linear models that use explicit regularization methods, dropout modifies the neural network architecture itself.

- **Implementation**: During training, dropout randomly "drops out" a subset of neurons (with a defined probability) from a layer, while keeping the input and output layer neurons unchanged. This prevents the network from becoming overly reliant on specific neurons and encourages redundancy in feature representation. In subsequent iterations, different neurons are randomly selected for dropout, ensuring varied learning pathways until training concludes.
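
A minimal sketch of the mechanism, using only the standard library. It implements "inverted" dropout, the variant used by modern frameworks such as PyTorch, where surviving activations are scaled by 1 / (1 - p) during training. The original AlexNet paper instead left training activations unscaled and multiplied outputs by 0.5 at test time. The function name and inputs are illustrative.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # seeded so the example is repeatable
activations = [1.0] * 10

print(dropout(activations, p=0.5, rng=rng))   # one random mask
print(dropout(activations, p=0.5, rng=rng))   # a different mask on the next call
print(dropout(activations, training=False))   # unchanged at evaluation time
```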




### **Data Augmentation**

- **Enhancing Data Size**: In deep learning, especially when the training dataset is small, data augmentation is crucial. It artificially expands the training set by generating "new" data from existing samples through several techniques:



  - **Translation**: This technique involves shifting the image along the x or y axis. By moving the image left, right, up, or down, the model learns to recognize objects in different positions within the frame. This helps improve its robustness to variations in object location.



  - **Rotation**: Rotation involves turning the image by a certain angle (e.g., 90°, 180°, or any degree). This technique helps the model become invariant to the orientation of the objects, ensuring that it can recognize them regardless of how they are presented in the image.



  - **Flipping**: Flipping refers to mirroring the image horizontally or vertically. Horizontal flips suit most natural images, because a mirrored object usually keeps its class; vertical flips are only appropriate when upside-down inputs are plausible. AlexNet itself used horizontal reflections only.



  - **Adding Noise**: Adding noise involves introducing random variations to the images, such as Gaussian noise, which can simulate real-world conditions like lighting changes or sensor imperfections. This technique helps the model become more robust to noisy inputs and improves its generalization to new, unseen data.



- **Other Techniques to Combat Overfitting**:

  - **Regularization**: In cases of limited data, regularization techniques can be applied to prevent overfitting. This involves adding a regularization term to the loss function, which helps to balance the training and test errors. However, this approach requires manual tuning of hyperparameters.

  

  - **Unsupervised Pre-training**: Another method is unsupervised pre-training, where autoencoders or restricted Boltzmann machines (RBMs) are used to initialize the network layers. This layer-by-layer approach prepares the network for supervised fine-tuning with a classification layer added at the end.
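
The regularization term mentioned above can be sketched as an L2 penalty added to the data loss. The names `l2_penalty`, `regularized_loss`, and `weight_decay` are illustrative, and the weights and loss values are made up; `weight_decay` is the hyperparameter that has to be tuned by hand.

```python
def l2_penalty(weights, weight_decay):
    """L2 regularization term: weight_decay times the sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, weight_decay):
    """Total objective = data loss + L2 penalty."""
    return data_loss + l2_penalty(weights, weight_decay)


weights = [0.5, -1.5, 2.0]   # hypothetical model weights
data_loss = 0.80             # hypothetical loss on the training batch

print(regularized_loss(data_loss, weights, weight_decay=0.0))
print(regularized_loss(data_loss, weights, weight_decay=0.01))
```

A larger `weight_decay` penalizes large weights more strongly, trading some training accuracy for a simpler model.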


#### **Augmentation visualisation**

In [None]:
import requests

from io import BytesIO

import matplotlib.pyplot as plt

import torchvision.transforms as transforms

from PIL import Image

import numpy as np



IMAGE_DIM = 227  # AlexNet input size (3 x 227 x 227, as in the table above)



augmentations = [

    ("Original", transforms.Compose([

        transforms.CenterCrop(IMAGE_DIM),

        transforms.ToTensor()

    ])),

    ("RandomResizedCrop", transforms.Compose([

        transforms.RandomResizedCrop(IMAGE_DIM, scale=(0.9, 1.0)),

        transforms.ToTensor()

    ])),

    ("HorizontalFlip", transforms.Compose([

        transforms.CenterCrop(IMAGE_DIM),

        transforms.RandomHorizontalFlip(1),

        transforms.ToTensor()

    ])),

    ("ColorJitter", transforms.Compose([

        transforms.CenterCrop(IMAGE_DIM),

        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),

        transforms.ToTensor()

    ])),

    ("RandomRotation", transforms.Compose([

        transforms.CenterCrop(IMAGE_DIM),

        transforms.RandomRotation(degrees=15),

        transforms.ToTensor()

    ])),

    ("All Combined", transforms.Compose([

        transforms.RandomResizedCrop(IMAGE_DIM, scale=(0.9, 1.0)),

        transforms.RandomHorizontalFlip(),

        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),

        transforms.RandomRotation(degrees=15),

        transforms.ToTensor()

    ]))

]



image_url = 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQFuAFw9qqY0WdxJ8OWNkrVKpbCZi0rXaslxg&s'

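response = requests.get(image_url, timeout=10)

response.raise_for_status()  # fail early if the download did not succeed

image = Image.open(BytesIO(response.content)).convert("RGB")  # ensure 3 colour channels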



fig, axs = plt.subplots(2, 3, figsize=(15, 10))

axs = axs.ravel()



for i, (title, aug) in enumerate(augmentations):

    augmented_img = aug(image)  # Apply the augmentation

    augmented_img_np = augmented_img.permute(1, 2, 0).numpy()

    axs[i].imshow(np.clip(augmented_img_np, 0, 1))

    axs[i].axis('off')

    axs[i].set_title(title)



for j in range(i+1, len(axs)):

    axs[j].axis('off')



plt.tight_layout()

plt.show()


# AlexNet Implementation

## Library Imports & Constant Definitions

In this section, we will import the necessary libraries and define the constants required for implementing the AlexNet model.

In [None]:
import os

import torch

import torch.nn as nn

import torch.optim as optim

import torch.nn.functional as F

import torchvision.datasets as datasets

from torch.utils.data import DataLoader, Subset, ConcatDataset

from sklearn.metrics import confusion_matrix, f1_score, recall_score

from tqdm import tqdm


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
# define model parameters

IMAGE_DIM = 28  # pixels of Fashion-MNIST

NUM_CLASSES = 10  # 10 classes for Fashion-MNIST

DEVICE_IDS = [0]  # GPUs to use

## Architecture Overview

In this section, we will define the architecture of the AlexNet model.

In [None]:
class AlexNet(nn.Module):

    def __init__(self, num_classes):

        super().__init__()

        self.net = nn.Sequential(

            nn.Conv2d(in_channels=3, out_channels=96, kernel_size=11, stride=4),  # (b x 96 x 55 x 55)

            nn.ReLU(),

            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2), # Local response normalization

            nn.MaxPool2d(kernel_size=3, stride=2),  # (b x 96 x 27 x 27)

            nn.Conv2d(96, 256, 5, padding=2),  # (b x 256 x 27 x 27)

            nn.ReLU(),

            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2), # Local response normalization

            nn.MaxPool2d(kernel_size=3, stride=2),  # (b x 256 x 13 x 13)

            nn.Conv2d(256, 384, 3, padding=1),  # (b x 384 x 13 x 13)

            nn.ReLU(),

            nn.Conv2d(384, 384, 3, padding=1),  # (b x 384 x 13 x 13)

            nn.ReLU(),

            nn.Conv2d(384, 256, 3, padding=1),  # (b x 256 x 13 x 13)

            nn.ReLU(),

            nn.MaxPool2d(kernel_size=3, stride=2),  # (b x 256 x 6 x 6)

        )

        self.classifier = nn.Sequential(

            nn.Dropout(p=0.5, inplace=False), # Dropout layer for regularization

            nn.Linear(in_features=(256 * 6 * 6), out_features=4096),

            nn.ReLU(inplace=False),

            nn.Dropout(p=0.5, inplace=False), # Dropout layer for regularization

            nn.Linear(in_features=4096, out_features=4096),

            nn.ReLU(inplace=False),

            nn.Linear(in_features=4096, out_features=num_classes), # Output layer for classification

        )

        self.init_bias() # Initializes the biases of the model layers



    def init_bias(self):

        for layer in self.net:

            if isinstance(layer, nn.Conv2d):

                nn.init.normal_(layer.weight, mean=0, std=0.01)

                nn.init.constant_(layer.bias, 0)

        # As in the paper, the 2nd, 4th and 5th conv layers (net[4], net[10], net[12]) start with bias 1

        nn.init.constant_(self.net[4].bias, 1)

        nn.init.constant_(self.net[10].bias, 1)

        nn.init.constant_(self.net[12].bias, 1)



    def forward(self, x):

        x = self.net(x)

        x = x.view(-1, 256 * 6 * 6)

        return self.classifier(x)


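The architecture shown above follows the one described in the paper, written as a single stream rather than split across two GPUs. It expects 3-channel 227x227 inputs, so it cannot be applied directly to 28x28 grayscale Fashion-MNIST images: the 11x11 stride-4 convolution and the 3x3 stride-2 pooling layers would shrink the feature maps to nothing. The modified architecture below takes a single input channel, uses 3x3 convolutions with stride 1 and 2x2 pooling, drops the LRN layers and the last pooling layer, and reduces the second fully connected layer to 1024 units.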

In [None]:
class AlexNetFashionMNIST(nn.Module):

    def __init__(self, num_classes=10):  

        super(AlexNetFashionMNIST, self).__init__()

        self.features = nn.Sequential(

            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),  # 28x28 -> 28x28

            nn.ReLU(inplace=True),

            nn.MaxPool2d(kernel_size=2, stride=2),  # 28x28 -> 14x14

            nn.Conv2d(64, 192, kernel_size=3, padding=1),  # 14x14 -> 14x14

            nn.ReLU(inplace=True),

            nn.MaxPool2d(kernel_size=2, stride=2),  # 14x14 -> 7x7

            nn.Conv2d(192, 384, kernel_size=3, padding=1),  # 7x7 -> 7x7

            nn.ReLU(inplace=True),

            nn.Conv2d(384, 256, kernel_size=3, padding=1),  # 7x7 -> 7x7

            nn.ReLU(inplace=True),

            nn.Conv2d(256, 256, kernel_size=3, padding=1),  # 7x7 -> 7x7

            nn.ReLU(inplace=True),

            # No further pooling to maintain feature map size at 7x7

        )

        self.classifier = nn.Sequential(

            nn.Dropout(p=0.5),  # Dropout for regularization

            nn.Linear(256 * 7 * 7, 4096),  # Adjusted for 7x7 input

            nn.ReLU(inplace=True),

            nn.Dropout(p=0.5),  # Dropout for regularization

            nn.Linear(4096, 1024),  # Reduced for computational efficiency

            nn.ReLU(inplace=True),

            nn.Linear(1024, num_classes),  # Output for 10 classes in Fashion-MNIST

        )



    def forward(self, x):

        x = self.features(x)

        x = torch.flatten(x, 1)

        return self.classifier(x)
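
The need for these changes can be seen by tracing the feature-map size through both networks for a 28x28 input. This is a sketch in plain Python using the usual output-size formula; the `(kernel, stride, padding)` tuples are read off the two class definitions above, and the helper names are illustrative.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(size, layers):
    """Return the feature-map size after each layer, stopping once it collapses."""
    sizes = []
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
        sizes.append(size)
        if size < 1:
            break
    return sizes


# (kernel, stride, padding) for the conv/pool layers of the original AlexNet
original_layers = [(11, 4, 0), (3, 2, 0), (5, 1, 2), (3, 2, 0),
                   (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 2, 0)]
# ... and for self.features in AlexNetFashionMNIST
fashion_layers = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0),
                  (3, 1, 1), (3, 1, 1), (3, 1, 1)]

print(trace(28, original_layers))   # collapses before the network ends
print(trace(28, fashion_layers))    # stays at a usable resolution
print(256 * trace(28, fashion_layers)[-1] ** 2)  # input features of the first Linear layer
```

The last value matches the `256 * 7 * 7` input size of the first `nn.Linear` layer in the classifier.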


### Helper Functions

In this section, we will define several helper functions to streamline the training process of the AlexNet model. These functions will handle tasks such as loading the dataset, sampling, and applying data augmentation when necessary, while also integrating the required classes for model training.

In [None]:
# create model

def create_model():

    return AlexNetFashionMNIST(num_classes=NUM_CLASSES).to(device)

In [None]:
# create optimizer and criterion

def create_optimizer(model):

    return optim.Adam(model.parameters(), lr=0.001)



def create_criterion():

    return nn.CrossEntropyLoss()

In [None]:
def get_smaller_dataset(dataset, fraction=0.1):

    dataset_size = len(dataset)

    indices = np.random.choice(dataset_size, int(dataset_size * fraction), replace=False)

    return Subset(dataset, indices)

In [None]:
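# Function to augment the dataset by adding horizontally and vertically flipped images
def augment_dataset(dataset):
    """Expand the dataset by adding horizontally and vertically flipped versions of the images."""
    # Use the same normalization as load_data so all three copies share one input scale
    normalize = transforms.Normalize((0.1307,), (0.3081,))
    horizontal_flip_transform = transforms.Compose([transforms.RandomHorizontalFlip(p=1.0), transforms.ToTensor(), normalize])
    vertical_flip_transform = transforms.Compose([transforms.RandomVerticalFlip(p=1.0), transforms.ToTensor(), normalize])

    # Flipped views of the training set
    horizontal_flipped_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=horizontal_flip_transform)
    vertical_flipped_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=vertical_flip_transform)

    # If the input was subsampled, flip only those same samples; otherwise the
    # augmented run would also train on flipped copies of all 60,000 images
    if isinstance(dataset, Subset):
        horizontal_flipped_dataset = Subset(horizontal_flipped_dataset, dataset.indices)
        vertical_flipped_dataset = Subset(vertical_flipped_dataset, dataset.indices)

    # Combine the original, horizontally flipped, and vertically flipped datasets
    augmented_dataset = ConcatDataset([dataset, horizontal_flipped_dataset, vertical_flipped_dataset])
    return augmented_dataset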

In [None]:
# Download and augment Fashion-MNIST dataset

def load_data(augment=False, batch_size=64, fraction=0.1):

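    # Note: (0.1307, 0.3081) are the mean/std usually quoted for MNIST digits;

    # Fashion-MNIST's own statistics are roughly (0.2860, 0.3530).

    transform_train = transforms.Compose([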

        transforms.ToTensor(),

        transforms.Normalize((0.1307,), (0.3081,)),

    ])



    transform_test = transforms.Compose([

        transforms.ToTensor(),

        transforms.Normalize((0.1307,), (0.3081,)),

    ])



    # Load full Fashion-MNIST dataset and subsample them based on the given fraction

    trainset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform_train)

    testset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform_test)

    small_trainset = get_smaller_dataset(trainset, fraction)



    # If augment=True, augment the dataset with horizontal and vertical flips

    if augment:

        small_trainset = augment_dataset(small_trainset)



    trainloader = DataLoader(small_trainset, batch_size=batch_size, shuffle=True, num_workers=2)

    testloader = DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=2)



    return trainloader, testloader

## Train & Test Functions

In this section, we will define the training & testing functions.

In [None]:
# train function 

def train_model(model, trainloader, criterion, optimizer, device, epochs=10):

    model.train()

    for epoch in range(epochs):

        running_loss = 0.0

        progress_bar = tqdm(enumerate(trainloader), total=len(trainloader), desc=f"Epoch {epoch+1}/{epochs}")

        for i, (inputs, labels) in progress_bar:

            inputs, labels = inputs.to(device), labels.to(device)

            

            # calculate the loss

            optimizer.zero_grad()

            outputs = model(inputs)

            loss = criterion(outputs, labels)



            # update the parameters

            loss.backward()

            optimizer.step()



            running_loss += loss.item()

            progress_bar.set_postfix(loss=running_loss / (i + 1))

        print(f"Finished Epoch {epoch+1}")

In [None]:
# Test function

def evaluate_model(model, testloader, criterion, device):

    model.eval()  # Set model to evaluation mode

    

    correct = 0

    total = 0

    test_loss = 0.0

    all_labels = []

    all_preds = []

    

    progress_bar = tqdm(enumerate(testloader), total=len(testloader), desc="Evaluating")

    with torch.no_grad():  # Disable gradient computation for testing

        for i, (inputs, labels) in progress_bar:

            inputs, labels = inputs.to(device), labels.to(device)



            # Forward pass

            outputs = model(inputs)

            loss = criterion(outputs, labels)

            test_loss += loss.item()



            # Predictions

            _, predicted = torch.max(outputs, 1)

            total += labels.size(0)

            correct += (predicted == labels).sum().item()

            

            all_labels.extend(labels.cpu().numpy())

            all_preds.extend(predicted.cpu().numpy())

            progress_bar.set_postfix(loss=test_loss / (i + 1), accuracy=100 * correct / total)

            

    # Calculate average test loss, accuracy, F1 score, and recall 

    accuracy = 100 * correct / total
    f1 = f1_score(all_labels, all_preds, average='weighted')
    recall = recall_score(all_labels, all_preds, average='weighted')
    
    print(f"Test Loss: {test_loss / len(testloader):.4f}, Accuracy: {accuracy:.2f}%, F1 Score: {f1:.4f}, Recall Score: {recall:.4f}")

    # Compute confusion matrix

    conf_matrix = confusion_matrix(all_labels, all_preds)
    print(f"Confusion Matrix:\n{conf_matrix}")
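
To clarify what `average='weighted'` reports, the sketch below recomputes weighted recall and weighted F1 by hand in plain Python: each class's score is weighted by its share of the true labels. The helper name and the toy labels are made up for illustration. One consequence worth knowing is that weighted recall is always equal to plain accuracy, so the two numbers printed by `evaluate_model` carry the same information.

```python
def weighted_recall_and_f1(labels, preds, num_classes):
    """Support-weighted averages of per-class recall and F1."""
    total = len(labels)
    recall_w = 0.0
    f1_w = 0.0
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        support = tp + fn
        recall = tp / support if support else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        recall_w += support / total * recall
        f1_w += support / total * f1
    return recall_w, f1_w


# Toy example with 3 classes (illustrative values only)
labels = [0, 0, 0, 1, 1, 2]
preds = [0, 0, 1, 1, 2, 2]

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
recall_w, f1_w = weighted_recall_and_f1(labels, preds, num_classes=3)
print(accuracy, recall_w, f1_w)
```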

### Observation

In this section, we will create datasets both with and without data augmentation. We will train and test our models on these datasets to evaluate the impact of augmentation on the performance of the AlexNet model. The model will be trained for 4 epochs using only 10% of the training set, allowing us to observe whether this reduced data is sufficient to achieve good performance.

In [None]:


def run_experiment(augment=False, epochs=10, fraction=0.1):

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



    # Initialize model, criterion, and optimizer

    model = create_model()

    criterion = create_criterion()

    optimizer = create_optimizer(model)



    # Load data with the specified fraction

    trainloader, testloader = load_data(augment=augment, fraction=fraction)



    # Train model

    print(f"Training with {'augmentation' if augment else 'no augmentation'} on a {fraction * 100}% subset of Fashion-MNIST")

    train_model(model, trainloader, criterion, optimizer, device, epochs=epochs)



    # Evaluate model

    print("Evaluating model...")

    evaluate_model(model, testloader, criterion, device)



# without augmentation

print("Running training on Fashion-MNIST without augmentation:")

run_experiment(augment=False, epochs=4, fraction=0.1)  # Using 10% of the dataset



# with augmentation

print("\n\nRunning training on Fashion-MNIST with augmentation:")

run_experiment(augment=True, epochs=4, fraction=0.1)  # Using 10% of the dataset
