# I) Summary

- The paper [Going Deeper with Convolutions][paper] introduces the first version of the Inception model, called GoogLeNet.


- At ILSVRC-2014, it took 1st place in the classification task (top-5 test error = 6.67%).


- It has around 6.8 million parameters, which is about 9x fewer than AlexNet (ILSVRC-2012 winner) and about 20x fewer than its competitor VGG-16.


- In most standard network architectures, it is not clear why and when to apply max-pooling and when to apply a convolution. For example, in AlexNet the convolution and max-pooling operations follow each other, whereas in VGGNet there are 3 convolutions in a row followed by 1 max-pooling layer.


- Thus, **the idea behind GoogLeNet is to use all the operations at the same time**. It applies multiple kernels of different sizes to the same input map in parallel and concatenates their results into a single output. This is called an **Inception module**.

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/689776141221101629/unknown.png"
         height="100%" width="100%">
</div>

- Consider the following:

<div style="text-align: center">
    <img src="https://media.discordapp.net/attachments/676833120053493770/690136206092402715/unknown.png"
         height="50%" width="70%">
</div>
 

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/690138280058028146/unknown.png"
         height="50%" width="90%">
</div>


- The naive approach is computationally expensive:
    - Computation cost = ((28 x 28 x 5 x 5) x 192) x 32 $\simeq$ **120 Mil**
        - For each of the 32 filters, we perform (28 x 28 x 5 x 5) multiplications on each of the 192 input channels.


- The dimension reduction approach is **less** computationally expensive:
    - 1st layer computation cost = ((28 x 28 x 1 x 1) x 192) x 16 $\simeq$ 2.4 Mil
    - 2nd layer computation cost = ((28 x 28 x 5 x 5) x 16) x 32 $\simeq$ 10 Mil  
    - Total computation cost $\simeq$ **12.4 Mil**
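
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `conv_cost` is made up for illustration, and it assumes one multiplication per kernel weight per output position, with padding chosen so the output stays 28 x 28 (additions and biases are ignored).

```python
def conv_cost(out_h, out_w, kernel, in_channels, out_channels):
    """Multiplications for a kernel x kernel convolution that produces an
    out_h x out_w x out_channels map from in_channels input channels."""
    return out_h * out_w * kernel * kernel * in_channels * out_channels


# Naive approach: 5x5 convolution applied directly to the 28x28x192 input.
naive = conv_cost(28, 28, 5, 192, 32)

# Dimension reduction: 1x1 convolution down to 16 channels, then the 5x5.
reduce_1x1 = conv_cost(28, 28, 1, 192, 16)
conv_5x5 = conv_cost(28, 28, 5, 16, 32)
reduced = reduce_1x1 + conv_5x5

print(f"naive:   {naive:,}")
print(f"reduced: {reduced:,} ({reduce_1x1:,} + {conv_5x5:,})")
print(f"ratio:   {naive / reduced:.1f}x")
```

Under these assumptions the 1x1 bottleneck cuts the cost of the 5x5 branch by a factor of a little under 10.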

---

Here is its architecture:

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/690150147392667651/unknown.png"
         height="100%" width="100%">
</div>

- There are:
    - 9 Inception modules (red box)
    - Global average pooling is used instead of a fully-connected layer.
        - This makes it easier to adapt and fine-tune the network.
    - 2 auxiliary softmax classifiers (green box)
        - Their role is to push the network toward its goal and to help ensure that the intermediate features are good enough for the network to learn.
        - It turns out that softmax0 and softmax1 also have a regularization effect.
        - They are discarded during inference.
        - Structure:
            - Average pooling layer with 5×5 filter size and stride 3, resulting in an output size of:
                - For 1st green box: 4x4x512.
                - For 2nd green box: 4x4x528.
            - 128 1x1 convolutions + ReLU.
            - Fully-connected layer with 1024 units + ReLU.
            - Dropout = 70%.
            - Linear layer (1000 classes) + Softmax.
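
The auxiliary structure listed above could be written in PyTorch roughly as follows. This is only a sketch: the class name `AuxClassifier` is hypothetical, the 14x14 input size is inferred from the 4x4 pooled outputs, and the module returns logits on the assumption that the softmax is folded into the loss (as with `nn.CrossEntropyLoss`).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxClassifier(nn.Module):
    """Auxiliary head (hypothetical name) following the structure listed above."""

    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)  # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.pool(x)
        x = F.relu(self.conv(x))
        x = torch.flatten(x, 1)      # keep the batch dimension
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)           # logits; softmax is left to the loss


# One head per green box: 512 input channels for the first, 528 for the second.
aux0 = AuxClassifier(512).eval()
aux1 = AuxClassifier(528).eval()
with torch.no_grad():
    out0 = aux0(torch.randn(2, 512, 14, 14))
    out1 = aux1(torch.randn(2, 528, 14, 14))
print(out0.shape, out1.shape)
```

Both heads map their feature map to one score per class, so they can share the same training loss as the main classifier.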

<br>

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/690147584534511659/unknown.png"
         height="100%" width="80%">
</div>

[paper]: https://arxiv.org/pdf/1409.4842.pdf

# II) Implementation

- The auxiliary softmax classifiers will not be implemented since pretrained weights will be loaded.
- Local Response Normalization will be replaced by Batch Normalization.

In [2]:
from utils import *
from classes import class_names
from collections import OrderedDict
from matplotlib import pyplot as plt
import numpy as np
import os
import cv2
import urllib.request
import torch.nn as nn
import torch.nn.functional as F  # F.relu is used in ConvB
import torchvision.transforms as transforms
from torchsummary import summary
from torch.utils.data import Dataset, DataLoader

## a) Architecture build

In [5]:
class ConvB(nn.Module):
    """Convolution -> Batch Normalization -> ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(ConvB, self).__init__()

        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
        # Batch Normalization replaces the paper's Local Response Normalization.
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return F.relu(x, inplace=True)

In [6]:
class InceptionB(nn.Module):
    
    def __init__(self):
        super(InceptionB, self).__init__()
        
        
        
    def forward(self, x):
        pass
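
One possible way to fill in the Inception block is sketched below. It is not the notebook's implementation: the names `InceptionSketch` and `conv_bn_relu`, the argument order, and the stride-1 "same" padding are assumptions. It only illustrates the idea described in the summary, namely four parallel branches (1x1, 1x1 then 3x3, 1x1 then 5x5, max-pooling then 1x1) concatenated along the channel dimension, using the same Conv -> BatchNorm -> ReLU pattern as `ConvB`.

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_ch, out_ch, kernel_size, padding=0):
    """Conv -> BatchNorm -> ReLU with stride 1 (hypothetical helper)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class InceptionSketch(nn.Module):
    """Four parallel branches whose outputs are concatenated along channels."""

    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = conv_bn_relu(in_ch, c1, 1)
        self.branch3 = nn.Sequential(
            conv_bn_relu(in_ch, c3_reduce, 1),           # 1x1 bottleneck
            conv_bn_relu(c3_reduce, c3, 3, padding=1),
        )
        self.branch5 = nn.Sequential(
            conv_bn_relu(in_ch, c5_reduce, 1),           # 1x1 bottleneck
            conv_bn_relu(c5_reduce, c5, 5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            conv_bn_relu(in_ch, pool_proj, 1),
        )

    def forward(self, x):
        branches = [
            self.branch1(x),
            self.branch3(x),
            self.branch5(x),
            self.branch_pool(x),
        ]
        return torch.cat(branches, dim=1)  # spatial size is unchanged


# Widths of the first Inception module: 28x28x192 in, 64 + 128 + 32 + 32 = 256 out.
block = InceptionSketch(192, 64, 96, 128, 16, 32, 32).eval()
with torch.no_grad():
    y = block(torch.randn(1, 192, 28, 28))
print(y.shape)
```

Every branch keeps the 28 x 28 spatial size, which is what makes the channel-wise concatenation possible. The 5x5 branch here is the 192 -> 16 -> 32 case used in the cost comparison of the summary.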

In [None]:
class GoogLeNet(nn.Module):
    
    def __init__(self):
        super(GoogLeNet, self).__init__()
        
       
    
    
    def forward(self, x):
        pass
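
Before wiring up the full network, it can help to write down the widths of the 9 Inception modules and check that they chain together. The table below is a sketch: the filter counts are taken from Table 1 of the paper rather than from this notebook, and the names `INCEPTION_CONFIG` and `output_channels` are made up for illustration.

```python
# (name, 1x1, 3x3 reduce, 3x3, 5x5 reduce, 5x5, pool proj), per Table 1 of the paper.
INCEPTION_CONFIG = [
    ("3a", 64, 96, 128, 16, 32, 32),
    ("3b", 128, 128, 192, 32, 96, 64),
    ("4a", 192, 96, 208, 16, 48, 64),
    ("4b", 160, 112, 224, 24, 64, 64),
    ("4c", 128, 128, 256, 24, 64, 64),
    ("4d", 112, 144, 288, 32, 64, 64),
    ("4e", 256, 160, 320, 32, 128, 128),
    ("5a", 256, 160, 320, 32, 128, 128),
    ("5b", 384, 192, 384, 48, 128, 128),
]


def output_channels(config):
    """Channels after concatenation: 1x1 + 3x3 + 5x5 + pool projection."""
    return {name: c1 + c3 + c5 + proj for name, c1, _, c3, _, c5, proj in config}


out_channels = output_channels(INCEPTION_CONFIG)
for name, channels in out_channels.items():
    print(f"inception ({name}): {channels} output channels")
```

The outputs of modules 4a and 4d come out at 512 and 528 channels, which matches the 4x4x512 and 4x4x528 inputs of the two auxiliary classifiers, and the last module ends at 1024 channels before global average pooling.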