This is an implementation of GoogLeNet, the architecture introduced in the paper **"Going Deeper with Convolutions"** that won the classification task of the **ImageNet Large Scale Visual Recognition Challenge (ILSVRC)** in 2014.

### Preparing the Project

Let's start by installing the necessary libraries from `requirements.txt`.

In [None]:
%pip install -r requirements.txt

Great, now let's import the required libraries into our notebook!

In [None]:
from collections import namedtuple
from typing import Optional, Tuple, Any

import torch
from torch import Tensor
from torch import nn

Since training will live in a different file and use the modules from this one, let's state which names this file exports. Setting `__all__` tells Python which names (functions, classes, etc.) are imported when another file runs `from <module> import *`. Names outside the list are still accessible through an explicit import, but `__all__` documents the intended public interface of the file.

In [None]:
__all__ = ["GoogLeNetOutputs", "GoogLeNet", "BasicConv2d", "Inception", "InceptionAux", "googlenet"]
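To make the effect of `__all__` concrete, here is a small self-contained sketch. The module name `demo_module` and its two functions are hypothetical and only exist for this illustration; the module is built in memory so the cell can run on its own.

```python
import sys
import types

# Build a throwaway module in memory (hypothetical name `demo_module`).
source = '''
__all__ = ["public_fn"]

def public_fn():
    return "public"

def helper_fn():
    return "helper"
'''
demo_module = types.ModuleType("demo_module")
exec(source, demo_module.__dict__)
sys.modules["demo_module"] = demo_module

# A star import only pulls in the names listed in __all__ ...
namespace = {}
exec("from demo_module import *", namespace)
star_names = sorted(name for name in namespace if not name.startswith("__"))
print(star_names)  # ['public_fn']

# ... but other names are still reachable through the module itself.
print(demo_module.helper_fn())  # helper
```

In other words, `__all__` shapes what a star import brings in; it does not hide anything from an explicit import.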

To keep the model's outputs organized, we explicitly define what we want our model output to be.

In [None]:
GoogLeNetOutputs = namedtuple("GoogLeNetOutputs", ["logits", "aux_logits2", "aux_logits1"])
GoogLeNetOutputs.__annotations__ = {
    "logits": Tensor,
    "aux_logits2": Optional[Tensor],
    "aux_logits1": Optional[Tensor]
}

With the code above, we define `GoogLeNetOutputs` as a named tuple type holding 3 values: `logits` is always a `Tensor`, while `aux_logits2` and `aux_logits1` can each be either a `Tensor` or `None`.
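As a quick illustration of how such a named tuple behaves, the sketch below fills it with random stand-in tensors. The batch size of 4 and the 10 classes are arbitrary choices for the example, not values used by the model.

```python
from collections import namedtuple
from typing import Optional

import torch
from torch import Tensor

GoogLeNetOutputs = namedtuple("GoogLeNetOutputs", ["logits", "aux_logits2", "aux_logits1"])
GoogLeNetOutputs.__annotations__ = {
    "logits": Tensor,
    "aux_logits2": Optional[Tensor],
    "aux_logits1": Optional[Tensor],
}

# Random stand-ins: 4 samples, 10 classes; one auxiliary output left empty
outputs = GoogLeNetOutputs(torch.randn(4, 10), torch.randn(4, 10), None)

# Fields can be read by name ...
print(outputs.logits.shape)         # torch.Size([4, 10])
print(outputs.aux_logits1 is None)  # True

# ... or unpacked like a regular tuple
logits, aux2, aux1 = outputs
```

Access by name keeps downstream code readable, while tuple unpacking stays available for code that expects a plain tuple.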

With that, we can begin building the GoogLeNet model!

### Building the Model

First, there are a few components we need to build before getting to the network itself, namely the following classes:

- `BasicConv2d`
- `Inception`
- `InceptionAux`

Let's explore each one in detail.

In [None]:
class BasicConv2d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, **kwargs: Any) -> None:
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)
        self.relu = nn.ReLU(True)

    def forward(self, x: Tensor) -> Tensor:
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)

        return out

For the convolutional layers in GoogLeNet, we will be using this "custom" layer instead of the plain `nn.Conv2d`. It bundles a convolution with batch normalization and a ReLU activation, which helps stabilize training. Now let's break down the code.

Our new convolutional layer takes in 3 parameters:

- `in_channels` : the depth of the input feature map
- `out_channels` : the depth of the output feature map
- `**kwargs` : any additional parameters to be passed on to `nn.Conv2d`

Let's explore what happens when data is passed through the `forward` function.

1. Assume an input `x` representing a batch of images or feature maps of size (N, C, H, W) where:
    - N : batch size (number of images)
    - C : depth of the input (input channels)
    - H : height of each image / feature map
    - W : width of each image / feature map
2. The input tensor `x` is passed to an `nn.Conv2d` layer where kernel operations are performed on it, resulting in a new tensor of shape (N, C_out, H_out, W_out) where:
    - N : batch size (unchanged)
    - C_out : depth of the output (determined by the number of kernels, i.e. `out_channels`)
    - H_out / W_out : new spatial dimensions of the feature map (determined by kernel size, stride, and padding)
3. `nn.BatchNorm2d` normalizes the tensor channel by channel, standardizing each channel to a mean of 0 and a standard deviation of 1 across the batch, then applies a learnable scale and shift
4. Non-linearity is introduced through ReLU activation

_Note: when a convolutional layer returns a tensor of (N, C, H, W), we can think of it like this: **the batch contains N samples, and EACH SAMPLE has C learned feature maps, each laid out on a spatial grid of H x W**_
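How the height and width change in step 2 follows the output-size formula from the `nn.Conv2d` documentation. The sketch below implements it for the common case of dilation 1. The helper name `conv_output_size` and the 224 and 28 pixel input sizes are assumptions made for this example.

```python
def conv_output_size(size: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# 7x7 kernel, stride 2, padding 3: halves an assumed 224-pixel input
stem = conv_output_size(224, kernel_size=7, stride=2, padding=3)

# 3x3 kernel with padding 1 and 1x1 kernel with no padding keep the size
same_3x3 = conv_output_size(28, kernel_size=3, stride=1, padding=1)
same_1x1 = conv_output_size(28, kernel_size=1)

print(stem, same_3x3, same_1x1)  # 112 28 28
```

This is why the 1x1 and padded 3x3 convolutions inside a block can be combined freely: they leave H and W untouched and only change the channel count.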

In [None]:
class Inception(nn.Module):
    def __init__(
        self,
        in_channels: int,
        ch1x1: int,
        ch3x3red: int,
        ch3x3: int,
        ch5x5red: int,
        ch5x5: int,
        pool_proj: int
    ) -> None:
        super(Inception, self).__init__()
        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=(1,1), stride=(1,1), padding=(0,0))
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=(1,1), stride=(1,1), padding=(0,0)),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        )
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=(1,1), stride=(1,1), padding=(0,0)),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=(3,3), stride=(1,1), padding=(1,1), ceil_mode=True),
            BasicConv2d(in_channels, pool_proj, kernel_size=(1,1), stride=(1,1), padding=(0,0))
        )

    def forward(self, x: Tensor) -> Tensor:
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)
        out = [branch1, branch2, branch3, branch4]
        out = torch.cat(out, 1)
        return out

The Inception module is the key to GoogLeNet. Let's see how it works.

The module contains 4 branches, each of which processes the input in a different way.

1. Branch 1 : 1x1 convolutional layer used to capture features at a small spatial scale
2. Branch 2 : 1x1 convolutional layer that outputs directly to a 3x3 convolutional layer
    - 1x1 reduces the number of input channels
    - 3x3 captures the spatial features
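3. Branch 3 : same structure as branch 2, but meant to go from a 1x1 to a 5x5 convolution as in the paper
    - Note that the code above uses a 3x3 kernel with padding 1 for this branch; to follow the paper exactly, use `kernel_size=(5,5)` with `padding=(2,2)`
4. Branch 4 : 3x3 max-pooling (stride 1, so the spatial size is preserved) followed by a 1x1 convolution that reduces the number of channels

During the forward pass, each branch does its own computation on the same input, and the four outputs are then concatenated along the channel dimension using `torch.cat`.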

The last part may be a little confusing so let me explain.

With 4 branches, we will have 4 tensors of size (N, C, H, W) where:
- N : batch size
- C : number of output channels of that branch
- H/W : dimensions of the feature maps (identical across branches, since every branch preserves the spatial size)

Each tensor has a different C depending on the branch that produced it. By concatenating, the resulting tensor will be (N, C1 + C2 + C3 + C4, H, W), which can be interpreted as having **N samples where each sample has Ctotal feature maps, each represented on an H x W grid**.
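The concatenation can be tried out in isolation. In this sketch the four tensors are random stand-ins for the branch outputs; the channel counts (128, 192, 96, 64), the batch size, and the 28x28 grid are illustrative assumptions.

```python
import torch

# Stand-ins for the four branch outputs of one Inception block.
n, h, w = 2, 28, 28
branch_outputs = [torch.randn(n, c, h, w) for c in (128, 192, 96, 64)]

# dim=1 is the channel dimension in the (N, C, H, W) layout.
merged = torch.cat(branch_outputs, dim=1)

print(merged.shape)  # torch.Size([2, 480, 28, 28])
```

Only the channel dimension grows (128 + 192 + 96 + 64 = 480). `torch.cat` requires all other dimensions to match, which is why every branch has to preserve H and W.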

In [None]:
class InceptionAux(nn.Module):
    def __init__(
            self,
            in_channels: int,
            num_classes: int,
            dropout: float = 0.7
    ) -> None:
        super().__init__()
        self.avgpool = nn.AdaptiveAvgPool2d((4,4))
        self.conv = BasicConv2d(in_channels, 128, kernel_size=(1,1), stride=(1,1), padding=(0,0))
        self.relu = nn.ReLU(True)
        self.fc1 = nn.Linear(2048, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
        self.dropout = nn.Dropout(dropout, True)

    def forward(self, x: Tensor) -> Tensor:
        out = self.avgpool(x)
        out = self.conv(out)
        out = torch.flatten(out, 1)
        out = self.fc1(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)

        return out

This is the final component of our GoogLeNet. First let's figure out what the difference is between `Inception` and `InceptionAux`.

**Auxiliary classifiers** are classifiers that are used to improve the convergence of very deep networks by pushing useful gradients to the lower layers, **combatting the vanishing gradient problem**. They are attached to intermediate layers of the network (here, to the outputs of `inception4a` and `inception4d`), creating additional paths for the network to backpropagate through.

Another way to think about it is that they introduce additional feedback signals that can be used to adjust weights, giving lower layers more fine-tuning.
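One way these extra feedback signals could enter training is by adding the auxiliary losses to the main loss with a small weight. The sketch below is only an illustration: `combined_loss` is a hypothetical helper that is not part of this notebook, the logits are random stand-ins, and the 0.3 weight is the value reported in the original GoogLeNet paper.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, aux_logits2, aux_logits1, target, aux_weight=0.3):
    """Main loss plus down-weighted auxiliary losses (hypothetical helper)."""
    loss = F.cross_entropy(logits, target)
    for aux in (aux_logits2, aux_logits1):
        if aux is not None:  # auxiliary outputs may be disabled
            loss = loss + aux_weight * F.cross_entropy(aux, target)
    return loss

torch.manual_seed(0)
target = torch.randint(0, 10, (4,))                         # 4 samples, 10 classes
main, aux2, aux1 = (torch.randn(4, 10) for _ in range(3))   # random stand-in logits

total = combined_loss(main, aux2, aux1, target)
main_only = combined_loss(main, None, None, target)
print(total.item() > main_only.item())  # True
```

Because the auxiliary terms are part of the same scalar loss, a single `backward()` call sends gradients through both the main head and the auxiliary heads.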

Let's break down what the `InceptionAux` class does.

Like the previous components, this class is constructed with multiple layers so let's see what happens when we pass an input tensor into this "network".

1. The adaptive average pooling layer will reduce the spatial dimension to 4x4
    - Think of this like an average-pooling layer where we define the output dimensions instead of the kernel
    - The stride + kernel size needed to make it happen are selected automatically
2. The result is passed through a 1x1 convolutional layer that reduces the depth to 128 channels
3. The convolutional layer result is then flattened into a vector of 128 x 4 x 4 = 2048 values and passed through the first fully-connected layer, followed by ReLU
    - It is necessary to flatten convolutional results before passing them on to fully-connected layers
4. Dropout is applied after the first fully-connected layer
5. The final fully-connected layer produces the final class scores

The result of this class should be an output of shape (N, `num_classes`), i.e. one score per class for each sample.
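The shapes in steps 1 to 3 can be traced with a few lines. This is a sketch under assumptions: the input is a random stand-in with 512 channels on a 14x14 grid, and a plain `nn.Conv2d` is used in place of `BasicConv2d` so the snippet stays self-contained.

```python
import torch
from torch import nn

# Random stand-in for the output of an intermediate Inception block
x = torch.randn(2, 512, 14, 14)

pooled = nn.AdaptiveAvgPool2d((4, 4))(x)               # spatial size forced to 4x4
reduced = nn.Conv2d(512, 128, kernel_size=1)(pooled)   # depth reduced to 128
flat = torch.flatten(reduced, 1)                       # everything but the batch dim

print(pooled.shape)   # torch.Size([2, 512, 4, 4])
print(reduced.shape)  # torch.Size([2, 128, 4, 4])
print(flat.shape)     # torch.Size([2, 2048])
```

The flattened size of 2048 is what the first fully-connected layer, `nn.Linear(2048, 1024)`, expects as its input.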

Great! With all our components ready, it is time to build up GoogLeNet!

In [None]:
class GoogLeNet(nn.Module):
    __constants__ = ["aux_logits", "transform_input"]

    def __init__(
        self,
        num_classes: int = 1000,
        aux_logits: bool = True,
        transform_input: bool = False,
        dropout: float = 0.2,
        dropout_aux: float = 0.7
    ) -> None:
        super(GoogLeNet, self).__init__()
        self.aux_logits = aux_logits
        self.transform_input = transform_input

        self.conv1 = BasicConv2d(3, 64, kernel_size=(7,7), stride=(2,2), padding=(3,3))
        self.maxpool1 = nn.MaxPool2d((3,3), (2,2), ceil_mode=True)
        self.conv2 = BasicConv2d(64, 64, kernel_size=(1,1), stride=(1,1), padding=(0,0))
        self.conv3 = BasicConv2d(64, 192, kernel_size=(3,3), stride=(1,1), padding=(1,1))
        self.maxpool2 = nn.MaxPool2d((3,3), (2,2), ceil_mode=True)

        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d((3,3), (2,2), ceil_mode=True)

        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d((2,2), (2,2), ceil_mode=True)

        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)

        if aux_logits:
            self.aux1 = InceptionAux(512, num_classes, dropout_aux)
            self.aux2 = InceptionAux(528, num_classes, dropout_aux)
        else:
            self.aux1 = None
            self.aux2 = None

        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.dropout = nn.Dropout(dropout, True)
        self.fc = nn.Linear(1024, num_classes)

        self._initialize_weights()

    @torch.jit.unused
    def eager_outputs(self, x: Tensor, aux2: Optional[Tensor], aux1: Optional[Tensor]) -> GoogLeNetOutputs | Tensor:
        if self.training and self.aux_logits:
            return GoogLeNetOutputs(x, aux2, aux1)
        else:
            return x
        
    def forward(self, x: Tensor) -> Tuple[Tensor, Optional[Tensor], Optional[Tensor]]:
        out = self._forward_impl(x)
        return out
    
    def _transform_input(self, x: Tensor) -> Tensor:
        if self.transform_input:
            x_ch0 = torch.unsqueeze(x[:, 0], 1) * (0.229 / 0.5) + (0.485 - 0.5) / 0.5
            x_ch1 = torch.unsqueeze(x[:, 1], 1) * (0.224 / 0.5) + (0.456 - 0.5) / 0.5
            x_ch2 = torch.unsqueeze(x[:, 2], 1) * (0.225 / 0.5) + (0.406 - 0.5) / 0.5
            x = torch.cat((x_ch0, x_ch1, x_ch2), 1)
        return x
    
    def _forward_impl(self, x: Tensor) -> GoogLeNetOutputs:
        x = self._transform_input(x)

        out = self.conv1(x)
        out = self.maxpool1(out)
        out = self.conv2(out)
        out = self.conv3(out)
        out = self.maxpool2(out)

        out = self.inception3a(out) 
        out = self.inception3b(out)
        out = self.maxpool3(out)
        out = self.inception4a(out)
        aux1: Optional[Tensor] = self.aux1(out) if self.aux1 is not None and self.training else None

        out = self.inception4b(out)
        out = self.inception4c(out)
        out = self.inception4d(out)
        aux2: Optional[Tensor] = self.aux2(out) if self.aux2 is not None and self.training else None

        out = self.inception4e(out) 
        out = self.maxpool4(out)
        out = self.inception5a(out)
        out = self.inception5b(out)

        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.dropout(out)
        aux3 = self.fc(out)

        if torch.jit.is_scripting():
            return GoogLeNetOutputs(aux3, aux2, aux1)
        else:
            return self.eager_outputs(aux3, aux2, aux1)
        
    def _initialize_weights(self) -> None:
        for module in self.modules():
            if isinstance(module, nn.Conv2d) or isinstance(module, nn.Linear):
                torch.nn.init.trunc_normal_(module.weight, mean=0.0, std=0.01, a=-2, b=2)
            elif isinstance(module, nn.BatchNorm2d):
                nn.init.constant_(module.weight, 1)
                nn.init.constant_(module.bias, 0)
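The channel bookkeeping between consecutive Inception blocks is easy to get wrong, so it is worth checking separately. The sketch below uses the block configurations from Table 1 of the GoogLeNet paper and verifies that the channels produced by each block match the `in_channels` of the next one (the max-pooling layers in between do not change the channel count). The names `INCEPTION_CONFIGS` and `out_channels` are made up for this example.

```python
# (in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj)
INCEPTION_CONFIGS = {
    "3a": (192, 64, 96, 128, 16, 32, 32),
    "3b": (256, 128, 128, 192, 32, 96, 64),
    "4a": (480, 192, 96, 208, 16, 48, 64),
    "4b": (512, 160, 112, 224, 24, 64, 64),
    "4c": (512, 128, 128, 256, 24, 64, 64),
    "4d": (512, 112, 144, 288, 32, 64, 64),
    "4e": (528, 256, 160, 320, 32, 128, 128),
    "5a": (832, 256, 160, 320, 32, 128, 128),
    "5b": (832, 384, 192, 384, 48, 128, 128),
}

def out_channels(cfg):
    """Channels after concatenating the four branches of one block."""
    _, ch1x1, _, ch3x3, _, ch5x5, pool_proj = cfg
    return ch1x1 + ch3x3 + ch5x5 + pool_proj

# Each block must produce exactly what the next block expects as input.
names = list(INCEPTION_CONFIGS)
for prev, nxt in zip(names, names[1:]):
    produced = out_channels(INCEPTION_CONFIGS[prev])
    expected = INCEPTION_CONFIGS[nxt][0]
    assert produced == expected, f"{prev} -> {nxt}: {produced} != {expected}"

final_channels = out_channels(INCEPTION_CONFIGS["5b"])
print(final_channels)  # 1024
```

The final value of 1024 matches the input size of the last layer, `nn.Linear(1024, num_classes)`, and the outputs of blocks 4a (512) and 4d (528) match the `in_channels` of the two auxiliary classifiers.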


This implementation follows the architecture diagram shown in the GoogLeNet paper, but here are some key things to note.

**`eager_outputs`**

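Determines the format of the output of the forward pass. When the model is in training mode and `aux_logits` is enabled, it returns a `GoogLeNetOutputs` tuple with the main and auxiliary logits; otherwise it returns only the main logits tensor.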

**`_transform_input`**

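Re-normalizes the input tensor `x` by rescaling each channel (R, G, B) individually and then concatenating the results. It assumes `x` was already normalized with the ImageNet statistics below and converts it to a normalization with a mean of 0.5 and a standard deviation of 0.5 for every channel.

The constants used in the calculation are the usual per-channel mean and standard deviation values derived from ImageNet.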

- Mean : [0.485, 0.456, 0.406]
- Standard Deviation : [0.229, 0.224, 0.225]
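The arithmetic in `_transform_input` can be checked on a single value. In this sketch, `rescale` is a hypothetical helper that repeats the per-channel formula from the method, and `raw_pixel` is an arbitrary example value in the [0, 1] range.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

def rescale(x_norm: float, mean: float, std: float) -> float:
    """Same arithmetic as one channel of `_transform_input`."""
    return x_norm * (std / 0.5) + (mean - 0.5) / 0.5

raw_pixel = 0.8  # hypothetical pixel value before any normalization
results = []
for mean, std in zip(IMAGENET_MEAN, IMAGENET_STD):
    x_norm = (raw_pixel - mean) / std           # ImageNet-normalized input
    results.append(rescale(x_norm, mean, std))  # what the model then sees

# Every channel ends up at approximately (0.8 - 0.5) / 0.5 = 0.6
print(results)
```

All three channels give the same result, which shows that the method undoes the ImageNet normalization and replaces it with `(x - 0.5) / 0.5`.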

### Training the Model

With the model built, let's see how we can train it.

Let's start by importing the libraries we will need.

In [None]:
import os
import time

import torch
from torch import nn
from torch import optim
from torch.cuda import amp
from torch.optim import lr_scheduler
from torch.optim.swa_utils import AveragedModel
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

import config
from dataset import CUDAPrefetcher, ImageDataset
from utils import accuracy, load_state_dict, make_directory, save_checkpoint, Summary, AverageMeter, ProgressMeter
import model

Great, now let's see the code that we will utilize to build our training process.

In [None]:
def load_dataset() -> tuple[CUDAPrefetcher, CUDAPrefetcher]:
    
    # Get the training and validation datasets
    train_dataset = ImageDataset(config.train_image_dir, config.image_size, "Train")
    valid_dataset = ImageDataset(config.valid_image_dir, config.image_size, "Valid")

    # Create dataloaders to efficiently load and batch the data
    train_dataloader = DataLoader(train_dataset,
                                  batch_size=config.batch_size,
                                  shuffle=True,
                                  num_workers=config.num_workers,
                                  pin_memory=True,
                                  drop_last=True,
                                  persistent_workers=True)
    valid_dataloader = DataLoader(valid_dataset,
                                  batch_size=config.batch_size,
                                  shuffle=False,
                                  num_workers=config.num_workers,
                                  pin_memory=True,
                                  drop_last=False,
                                  persistent_workers=True)

    # Optimize the loading pipeline by prefetching data batches on the GPU
    train_prefetcher = CUDAPrefetcher(train_dataloader, config.device)
    valid_prefetcher = CUDAPrefetcher(valid_dataloader, config.device)

    return train_prefetcher, valid_prefetcher


def build_model() -> tuple[nn.Module, nn.Module]:

    # Initialize the GoogLeNet model
    googlenet_model = model.__dict__[config.model_arch_name](num_classes=config.model_num_classes, aux_logits=False, transform_input=True)

    # Move the model to the appropriate device + format
    googlenet_model = googlenet_model.to(device=config.device, memory_format=torch.channels_last)

    # The exponential moving average (EMA) function
    ema_avg = lambda averaged_model_parameter, model_parameter, num_averaged: (1 - config.model_ema_decay) * averaged_model_parameter + config.model_ema_decay * model_parameter

    # Initialize the EMA model
    ema_googlenet_model = AveragedModel(googlenet_model, avg_fn=ema_avg)

    return googlenet_model, ema_googlenet_model


def define_loss() -> nn.CrossEntropyLoss:

    # Initialize the cross-entropy loss function
    criterion = nn.CrossEntropyLoss(label_smoothing=config.loss_label_smoothing)

    # Move the loss function to the appropriate device + format
    criterion = criterion.to(device=config.device, memory_format=torch.channels_last)

    return criterion


def define_optimizer(model) -> optim.SGD:

    # Initialize the optimizer
    optimizer = optim.SGD(model.parameters(), lr=config.model_lr, momentum=config.model_momentum, weight_decay=config.model_weight_decay)

    return optimizer


def define_scheduler(optimizer: optim.SGD) -> lr_scheduler.CosineAnnealingWarmRestarts:

    # Initialize a Learning Rate scheduler
    scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer,
                                                         config.lr_scheduler_T_0,
                                                         config.lr_scheduler_T_mult,
                                                         config.lr_scheduler_eta_min)

    return scheduler

That's a lot of functions so let's go through what each one does.

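`load_dataset()` : loads the training and validation images from the directories set in `config`, batches them with `DataLoader`, and wraps each loader in a `CUDAPrefetcher` so that upcoming batches are moved to the GPU ahead of time.

`build_model()` : sets up the model by looking up the constructor named by `config.model_arch_name` in the `model` module and calling it with the configured settings (the arguments in `()`).

In addition, the function wraps the model in an `AveragedModel` to maintain an Exponential Moving Average (EMA) copy of its parameters.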
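The weighting used by the `ema_avg` lambda in `build_model()` can be sketched on plain floats. The `decay` value below is a hypothetical stand-in for `config.model_ema_decay`, whose actual value is not shown in this notebook. Note that with this formula `decay` is the weight given to the *new* parameter value.

```python
def ema_avg(averaged_param: float, model_param: float, decay: float) -> float:
    """Same weighting as the lambda passed to AveragedModel in build_model()."""
    return (1 - decay) * averaged_param + decay * model_param

decay = 0.25     # hypothetical value, chosen for readability
averaged = 1.0   # current averaged value
history = []
for new_value in (3.0, 3.0, 3.0):
    averaged = ema_avg(averaged, new_value, decay)
    history.append(averaged)

print(history)  # [1.5, 1.875, 2.15625]
```

The averaged value moves toward the new parameter value step by step instead of jumping to it, which is what smooths out noisy updates during training.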