## Notes

In [None]:
"""
Notes from the paper:

The AlexNet paper used convolutional neural networks to win the ImageNet competition (ILSVRC) in 2012.

Goal:
Image Classification

Dataset Used:
ImageNet-1000
ImageNet is a dataset of over 15 million labelled high-resolution images in roughly 22,000 categories. ("High-resolution" is relative: MNIST is 28 x 28, while these are downsampled to 256 x 256.)
The 1000-category subset was used for this paper, which is still over 1.2 million training images lol. That's a lot. This might be the biggest dataset I have ever worked with.

Method Used:
Convolution layers, occasionally followed by max-pooling layers. The final layers are fully connected layers, with Dropout layers in between.
Ends with a 1000-way softmax layer.

Convolution dimension calculation:
https://madebyollin.github.io/convnet-calculator/

Architecture:
Images are first resized to 256 x 256; random 224 x 224 crops are then fed into the network (data augmentation).
Input (3 x 224 x 224)
DONE
-------------------------------------------------------------------------------------------------------------------------------------------
- Convolutional Layer 1 (96 filters in total, 48 per GPU)
    GPU1 - (48 filters, 11 x 11, stride 4, padding 2) -> (output dim: (48, 55, 55))
    GPU2 - (48 filters, 11 x 11, stride 4, padding 2) -> (output dim: (48, 55, 55))
    # Padding 0 only gives 55 x 55 for a 227 x 227 input; a 224 x 224 input needs padding 2.
- Max Pooling Layer 1
    GPU1 - (3 x 3, stride 2) -> (output dim: (48, 27, 27))
    GPU2 - (3 x 3, stride 2) -> (output dim: (48, 27, 27))
- Convolutional Layer 2 (256 filters in total, 128 per GPU)
    GPU1 - (128 filters, 5 x 5, stride 1, padding 2) -> (output dim: (128, 27, 27))
    GPU2 - (128 filters, 5 x 5, stride 1, padding 2) -> (output dim: (128, 27, 27))
- Max Pooling Layer 2
    GPU1 - (3 x 3, stride 2) -> (output dim: (128, 13, 13))
    GPU2 - (3 x 3, stride 2) -> (output dim: (128, 13, 13))
- Convolutional Layer 3 (384 filters in total, 192 per GPU; connected to the maps on both GPUs)
    GPU1 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 13, 13))
    GPU2 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 13, 13))
- Convolutional Layer 4 (384 filters in total, 192 per GPU)
    GPU1 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 13, 13))
    GPU2 - (192 filters, 3 x 3, stride 1, padding 1) -> (output dim: (192, 13, 13))
- Convolutional Layer 5 (256 filters in total, 128 per GPU)
    GPU1 - (128 filters, 3 x 3, stride 1, padding 1) -> (output dim: (128, 13, 13))
    GPU2 - (128 filters, 3 x 3, stride 1, padding 1) -> (output dim: (128, 13, 13))
- Max Pooling Layer 3
    GPU1 - (3 x 3, stride 2) -> (output dim: (128, 6, 6))
    GPU2 - (3 x 3, stride 2) -> (output dim: (128, 6, 6)) # Both GPUs together: 256 x 6 x 6, flattened this becomes 9216 values
- Dropout Layer 1
    Dropout 0.5 (the paper applies dropout in the first two fully connected layers)
- Fully Connected Layer 1 (4096 neurons in total, 2048 per GPU)
    GPU1 - (9216 inputs) -> (output dim: (2048))
    GPU2 - (9216 inputs) -> (output dim: (2048))
- Fully Connected Layer 2 (4096 neurons in total, 2048 per GPU)
    GPU1 - (4096 inputs) -> (output dim: (2048))
    GPU2 - (4096 inputs) -> (output dim: (2048))
- Fully Connected Layer 3
    4096 inputs -> (output dim: (1000))
- Softmax Layer
    1000 inputs -> (output dim: (1000))

Keep in mind this:

```
Now we are ready to describe the overall architecture of our CNN. As depicted in Figure 2, the net
contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces
a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression
objective, which is equivalent to maximizing the average across training cases of the log-probability
of the correct label under the prediction distribution.
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel
maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third
convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Response-normalization layers
follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section
3.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU
non-linearity is applied to the output of every convolutional and fully-connected layer.
```

Training Parameters / Hyperparameters:
- Data Augmentation: Random 224x224 patches are cropped from the 256x256 images and randomly mirrored horizontally.
    - At test time, the image is resized to 256x256, five 224x224 patches (the four corners and the center) are cropped from it, each patch is also mirrored, and the network is run on all ten. The final prediction is the average of the 10 predictions.
- They wrote a CUDA ConvNet from scratch to train the network. BASED
- SGD with momentum 0.9 and weight decay 0.0005
- Batch Size: 128

Metrics Defined:
Error Rate
- Number of misclassified test samples / Total number of test samples

Top 1 vs top 5 error rate
- Top 1 error rate is the fraction of test samples for which the correct label is not the top predicted label
- Top 5 error rate is the fraction of test samples for which the correct label is not among the top 5 predicted labels

Results (ILSVRC-2010 test set):
- Top-1 error rate: 37.5%
- Top-5 error rate: 17.0%
"""

In [None]:
!wget "http://aisdatasets.informatik.uni-freiburg.de/freiburg_groceries_dataset/freiburg_groceries_dataset.tar.gz"

In [None]:
!tar -xvf "freiburg_groceries_dataset.tar.gz"

In [None]:
# Import image paths in a way where we have a dataframe with the structure --> Path|Folder_name (aka label)
import glob
import pandas as pd

In [None]:
glob.glob("images/*/*.png")

In [None]:
images = dict()
# Dictionary where structure is {Folder name: [Image1, Image2,...]}

folder_names = glob.glob("images/*")

In [None]:
image_paths = glob.glob("images/*/*")
labels = pd.Series(image_paths)

In [None]:
labels = labels.str.split(pat="/", expand=True)

In [None]:
labels

In [None]:
image_paths = pd.DataFrame(image_paths)
image_paths

In [None]:
images_dict = dict()
images_dict["image_paths"] = image_paths[0].values
images_dict["labels"] = labels[1].values
images_dict

In [None]:
dataset = pd.DataFrame(images_dict)

In [None]:
dataset

In [None]:
len(dataset)
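
The `labels` column holds folder names as strings, while a classification loss needs integer class indices. A minimal sketch of one way to build that mapping is below; `build_class_index` and the example paths are hypothetical names for illustration, not part of the dataset or of any library.

```python
from pathlib import PurePosixPath

def build_class_index(paths):
    """Map each folder name (class) to an integer index, in sorted order."""
    class_names = sorted({PurePosixPath(p).parent.name for p in paths})
    return {name: i for i, name in enumerate(class_names)}

# Hypothetical example paths in the same images/<label>/<file> layout
example_paths = [
    "images/MILK/MILK0001.png",
    "images/BEANS/BEANS0003.png",
    "images/MILK/MILK0002.png",
]
class_to_idx = build_class_index(example_paths)
targets = [class_to_idx[PurePosixPath(p).parent.name] for p in example_paths]
print(class_to_idx)  # {'BEANS': 0, 'MILK': 1}
print(targets)       # [1, 0, 1]
```

With the dataframe from this notebook, something like `dataset["labels"].map(class_to_idx)` should give an integer column that the `Dataset` class can return instead of the string.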

In [56]:
# Torch imports
import torch
import torchvision
from torchvision.transforms import ToTensor, Lambda, Compose, v2
from torch.nn.functional import one_hot
from torch.utils.data import DataLoader, Dataset

In [None]:
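class groceries_dataset_class(Dataset):
  """Serves (image, label) pairs from a dataframe with "image_paths" and "labels" columns."""
  def __init__(self, dataframe, transform = None):
    self.dataframe = dataframe
    self.transform = transform

  def __len__(self):
    return len(self.dataframe)

  def __getitem__(self, idx):
    # Positional lookup, so this also works if the dataframe index is not 0..N-1
    row = self.dataframe.iloc[idx]
    # read_image returns a uint8 tensor of shape (C, H, W); force 3 channels (RGB)
    image = torchvision.io.read_image(row["image_paths"], mode=torchvision.io.ImageReadMode.RGB)
    label = row["labels"]  # folder name (string), not yet an integer class index
    if self.transform:
      image = self.transform(image)
    return image, label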


In [60]:
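batch_size = 256  # the paper used 128
transforms = Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),  # random crop, resized to 224 x 224
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0, 255] -> float32 [0, 1]
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # ImageNet channel statistics
])
groceries_dataset = groceries_dataset_class(dataset, transform=transforms)
train_loader = DataLoader(groceries_dataset, batch_size=batch_size, shuffle=True)
# NOTE: this loader covers the same images (with the same random augmentations) as train_loader; there is no held-out split yet
test_loader = DataLoader(groceries_dataset, batch_size=batch_size, shuffle=False)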
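
Both loaders here iterate over the full dataset, so test metrics would be measured on training images. A minimal sketch of a reproducible index split follows; `split_indices`, the 20% test fraction, and the seed are assumptions for illustration only.

```python
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle 0..n-1 with a fixed seed and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 8 2
```

The two index lists can be passed to `torch.utils.data.Subset` to build separate train and test datasets. For evaluation it would probably also make sense to use a second dataset instance with a deterministic transform (resize plus center crop) instead of the random augmentations.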

In [61]:
# Test dataset
image, label = groceries_dataset[3]

In [62]:
image.shape

torch.Size([3, 224, 224])
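
The test-time procedure from the notes (five 224 x 224 crops plus their mirrors, predictions averaged) can be sketched with plain tensor slicing. `ten_crop` is a hypothetical helper written for this notebook, not the paper's code.

```python
import torch

def ten_crop(image, size=224):
    """Four corner crops, the center crop, and their horizontal mirrors.

    image: tensor of shape (C, H, W). Returns a tensor of shape (10, C, size, size).
    """
    _, h, w = image.shape
    top_c, left_c = (h - size) // 2, (w - size) // 2
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size), (top_c, left_c)]
    crops = [image[:, t:t + size, l:l + size] for t, l in offsets]
    crops += [c.flip(-1) for c in crops]  # mirror along the width axis
    return torch.stack(crops)

image_256 = torch.rand(3, 256, 256)  # stand-in for a resized test image
batch = ten_crop(image_256)
print(batch.shape)  # torch.Size([10, 3, 224, 224])
```

Averaging would then be something like `model(batch).mean(dim=0)`, assuming the model returns one row of class probabilities per crop.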

## NN Model, Hyperparameters

In [63]:
import torch.nn as nn
import torch.nn.functional as F
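
Before defining the model, the layer sizes from the notes can be checked with the usual formula floor((n + 2p - k) / s) + 1. The sketch below traces the spatial size through the five convolutions and three pooling layers; the helper names are my own. It suggests that the padding of the first convolution matters for a 224 x 224 input.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

def alexnet_feature_size(input_size, conv1_padding):
    """Trace the spatial size through the conv and pooling layers of the notes."""
    s = conv_out(input_size, kernel=11, stride=4, padding=conv1_padding)  # conv1
    s = conv_out(s, kernel=3, stride=2)       # pool1
    s = conv_out(s, kernel=5, padding=2)      # conv2
    s = conv_out(s, kernel=3, stride=2)       # pool2
    for _ in range(3):                        # conv3, conv4, conv5
        s = conv_out(s, kernel=3, padding=1)
    s = conv_out(s, kernel=3, stride=2)       # pool3
    return s

for size, pad in [(224, 0), (224, 2), (227, 0)]:
    final = alexnet_feature_size(size, pad)
    print(f"input {size}, conv1 padding {pad}: {final} x {final} x 256 = {final * final * 256}")
# input 224, conv1 padding 0: 5 x 5 x 256 = 6400
# input 224, conv1 padding 2: 6 x 6 x 256 = 9216
# input 227, conv1 padding 0: 6 x 6 x 256 = 9216
```

So the 9216-value flattened vector is only reached with padding 2 on a 224 x 224 input, or padding 0 on a 227 x 227 input.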

In [None]:
class AlexNet(nn.Module):
  # Single-device version; the paper's local response normalization is omitted here
  def __init__(self, num_classes=1000):
    super().__init__()
    # padding=2 so that a 224 x 224 input gives 55 x 55 maps (padding=0 would give 54 x 54)
    self.conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2) # 96, 55, 55
    self.conv2 = nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2) # 256, 27, 27
    self.conv3 = nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1) # 384, 13, 13
    self.conv4 = nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1) # 384, 13, 13
    self.conv5 = nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1) # 256, 13, 13
    self.drop = nn.Dropout(p=0.5)
    self.fc6 = nn.Linear(6*6*256, 4096)
    self.fc7 = nn.Linear(4096, 4096)
    self.fc8 = nn.Linear(4096, num_classes)
    self.sm = nn.Softmax(dim=1)

  def forward(self, x):
    x = F.relu(self.conv1(x))
    x = F.max_pool2d(x, kernel_size=3, stride=2) # 96, 27, 27
    x = F.relu(self.conv2(x))
    x = F.max_pool2d(x, kernel_size=3, stride=2) # 256, 13, 13
    x = F.relu(self.conv3(x))
    x = F.relu(self.conv4(x))
    x = F.relu(self.conv5(x))
    x = F.max_pool2d(x, kernel_size=3, stride=2) # 256, 6, 6
    x = torch.flatten(x, start_dim=1) # 9216 values per sample
    x = F.relu(self.fc6(x))
    x = self.drop(x)
    x = F.relu(self.fc7(x))
    x = self.drop(x)
    x = self.fc8(x) # no ReLU here: it would zero out negative logits before the softmax
    x = self.sm(x) # probabilities; skip this when training with nn.CrossEntropyLoss, which expects raw logits
    return x
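
The training settings from the notes (SGD, momentum 0.9, weight decay 0.0005) map directly onto `torch.optim.SGD`. The sketch below uses a small stand-in linear model and random data so that it runs on its own; the initial learning rate of 0.01 is the value reported in the paper, and everything else about the loop is an assumption.

```python
import torch
import torch.nn as nn

# Stand-in model so the snippet is self-contained; in the notebook this would be the AlexNet instance
stand_in_model = nn.Linear(10, 2)

optimizer = torch.optim.SGD(
    stand_in_model.parameters(),
    lr=0.01,              # initial learning rate reported in the paper
    momentum=0.9,
    weight_decay=0.0005,
)
criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class targets

# One illustrative update step on random data
inputs = torch.randn(4, 10)
targets = torch.tensor([0, 1, 1, 0])
optimizer.zero_grad()
loss = criterion(stand_in_model(inputs), targets)
loss.backward()
optimizer.step()
print(loss.item())
```

`nn.CrossEntropyLoss` applies log-softmax internally, so it is meant to be fed the output of the last linear layer rather than softmax probabilities.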

In [None]:
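# initialize model and test it (falls back to CPU when no GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AlexNet().to(device)
image = torch.randn(1, 3, 224, 224)
output = model(image.to(device)).cpu()
output.size()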
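
The top-1 and top-5 error rates defined in the notes can be computed from a batch of model outputs roughly as follows. `topk_error` is a hypothetical helper, and the scores are made-up numbers for a three-class toy case.

```python
import torch

def topk_error(scores, targets, k=1):
    """Fraction of samples whose true label is not among the k highest scores."""
    topk = scores.topk(k, dim=1).indices             # shape (N, k)
    hit = (topk == targets.unsqueeze(1)).any(dim=1)  # shape (N,)
    return 1.0 - hit.float().mean().item()

# Toy scores for 3 samples and 3 classes
scores = torch.tensor([[0.1, 0.7, 0.2],
                       [0.5, 0.3, 0.2],
                       [0.2, 0.3, 0.5]])
true_labels = torch.tensor([1, 1, 0])
print(topk_error(scores, true_labels, k=1))  # about 0.667: only the first sample is right
print(topk_error(scores, true_labels, k=2))  # about 0.333: the second sample is recovered
```

Since only the ranking of the scores matters, the same function works on raw logits and on softmax probabilities.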