## LeNet-5

![image.png](attachment:f438813f-0865-4780-a96e-57f63548542d.png)

- LeNet-5 is the most widely known CNN architecture. The MNIST images are 28x28 pixels, but they are zero-padded to 32x32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size of the image keeps shrinking as it progresses through the network.
- The average pooling layers are more complex than usual. Each neuron computes the mean of its inputs, multiplies it by a learnable coefficient (one per map), adds a learnable bias (also one per map), and finally applies the activation function.
- Most neurons in the C3 maps are connected to neurons in only three or four of the S2 maps (instead of all 6).
- In the output layer, instead of computing the dot product of the input vector and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. Nowadays the cross-entropy cost function is preferred, as it penalizes wrong predictions much more, producing larger gradients and converging faster.
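
The shrinking feature maps can be checked with a little arithmetic. The sketch below assumes the usual LeNet-5 layout (5x5 kernels for the conv. layers C1, C3 and C5, and 2x2 pooling with stride 2 for S2 and S4); those kernel sizes and the helper names are not stated in these notes, so treat them as assumptions.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output height/width of a conv. or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (layer name, kernel size, stride); assumed classic LeNet-5 settings.
LENET5_LAYERS = [
    ("C1", 5, 1),
    ("S2", 2, 2),
    ("C3", 5, 1),
    ("S4", 2, 2),
    ("C5", 5, 1),
]


def trace_sizes(input_size, layers):
    """Return [(layer name, feature map size)] with no padding anywhere."""
    sizes = [("input", input_size)]
    size = input_size
    for name, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
        sizes.append((name, size))
    return sizes


for name, size in trace_sizes(32, LENET5_LAYERS):
    print(f"{name}: {size}x{size}")
```

Under these assumptions the 32x32 input is reduced to a single 1x1 position by the time it leaves C5, which is why the remaining layers can be plain fully connected layers.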

## AlexNet

- Similar to LeNet-5 but much larger and deeper. It was the first to stack conv. layers directly on top of one another, instead of stacking a pooling layer on top of each conv. layer.

![image.png](attachment:8af138d0-4bfe-484f-bae5-bfd03b3757de.png)

- To reduce overfitting, the authors used two regularization techniques. First, they applied dropout with a 50% dropout rate during training to the outputs of layers F9 and F10. Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

## Data augmentation

- Data augmentation artificially increases the size of the training set by generating many realistic variants of the existing instances. This reduces overfitting, which makes it a regularization technique.
- The generated instances should be realistic enough that a human cannot tell whether an image was augmented or not. Simply adding white noise will not help; the modifications should be learnable (white noise is not).

- For example, we can slightly shift, rotate and resize every picture in the training set by various amounts and add the resulting pictures to the training set. This forces the model to be more tolerant to variations in the position, orientation and size of the objects in the picture.
- For a model that is more tolerant of lighting conditions, we can generate many images with various contrasts. In general, we can also flip the images horizontally, except for text and other asymmetric objects. Combining these transformations greatly increases the number of instances in the training set.
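
As a rough illustration of the shift and flip transformations described above, here is a framework-free sketch that treats an image as a list of pixel rows. The function names, the `max_shift` parameter and the tiny 3x3 "image" are made up for the example; in practice one would more likely rely on the augmentation utilities of a deep learning library.

```python
import random


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift(image, dx, dy, fill=0):
    """Shift the image right by dx and down by dy, filling vacated pixels."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            new_y, new_x = y + dy, x + dx
            if 0 <= new_y < height and 0 <= new_x < width:
                shifted[new_y][new_x] = image[y][x]
    return shifted


def augment(image, rng, max_shift=1):
    """Return one random variant: a small shift, then a flip half the time."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    variant = shift(image, dx, dy)
    if rng.random() < 0.5:
        variant = flip_horizontal(variant)
    return variant


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
rng = random.Random(42)  # fixed seed so the run is repeatable
augmented_set = [image] + [augment(image, rng) for _ in range(4)]
print(len(augmented_set), "images in the augmented set")
```

Every variant keeps the original shape, so the augmented images can be added to the training set without any other change.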

![image.png](attachment:ab8194ad-1b81-4174-a487-c49bd59be663.png)

- AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization (LRN): the most strongly activated neurons inhibit the neurons located at the same position in neighbouring feature maps. Such competitive activation has been observed in biological neurons. It encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, which ultimately improves generalization.

![image.png](attachment:e7ab07a5-04a8-4397-8200-9db8c1824c76.png)

- `b_i` is the normalized output of the neuron located in feature map `i`, at some row `u` and column `v`.
- `a_i` is the activation of that neuron after the ReLU step but before normalization.
- `k`, `alpha`, `beta` and `r` are hyperparameters. `k` is called the bias and `r` is called the depth radius.
- `fn` is the number of feature maps.
- For example, if `r = 2` and a neuron is strongly activated, it will inhibit the activations of the neurons located in the feature maps immediately above and below its own.
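
The definitions above can be turned into a small sketch. It assumes the formula in the figure is the usual LRN expression, `b_i = a_i * (k + alpha * sum(a_j ** 2)) ** (-beta)`, with `j` running from `max(0, i - r/2)` to `min(i + r/2, fn - 1)`. The function name is made up, and the demo deliberately uses an exaggerated `alpha` so that the inhibition is visible.

```python
def local_response_normalization(activations, r=2, k=1.0, alpha=0.00002, beta=0.75):
    """Normalize the activations found at one (u, v) position.

    `activations[i]` is a_i, the post-ReLU activation in feature map i.
    The defaults are the AlexNet values quoted in these notes.
    """
    fn = len(activations)  # number of feature maps
    half = r // 2          # how many neighbouring maps on each side
    normalized = []
    for i, a_i in enumerate(activations):
        j_low = max(0, i - half)
        j_high = min(i + half, fn - 1)
        sum_sq = sum(activations[j] ** 2 for j in range(j_low, j_high + 1))
        normalized.append(a_i * (k + alpha * sum_sq) ** (-beta))
    return normalized


# One strongly activated map (index 2) surrounded by weaker ones.
a = [1.0, 1.0, 10.0, 1.0, 1.0]
b = local_response_normalization(a, r=2, k=1.0, alpha=1.0, beta=1.0)
print(b)
```

With `r = 2`, the maps at indices 1 and 3 sit next to the strong activation and are scaled down far more than the maps at indices 0 and 4, which matches the inhibition described above.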

- In AlexNet, the hyperparameters are set as follows: `r = 2`, `alpha = 0.00002`, `beta = 0.75` and `k = 1`. This step can be implemented using `tf.nn.local_response_normalization()`, which can be wrapped in a Lambda layer in order to use it in a Keras model.
- ZFNet is a variant of AlexNet. It is essentially AlexNet with a few tweaked hyperparameters.

## GoogLeNet

- Performs much better than the previous architectures, largely because the network is much deeper. This was made possible by subnetworks called inception modules, which allow GoogLeNet to use parameters much more efficiently than previous architectures. GoogLeNet actually has about 10 times fewer parameters than AlexNet: roughly 6 million instead of 60 million.

![image.png](attachment:25c0b813-7a43-4618-8a94-4447aa6b3783.png)

- The figure shows the architecture of an inception module. The notation '3 x 3 + 1(S)' means that the layer uses a 3x3 kernel, a stride of 1 and 'same' padding.
- The input signal is first copied and fed to 4 different layers. All conv. layers use the ReLU activation function. The second set of conv. layers uses different kernel sizes (1x1, 3x3 and 5x5) so as to capture patterns at different scales. Every single layer, including the max pooling layer, uses a stride of 1 and 'same' padding, so the outputs all have the same height and width as their inputs.
- This makes it possible to concatenate all the outputs along the depth dimension in the final depth concatenation layer, which stacks the feature maps from all 4 top conv. layers. This concatenation layer can be implemented in TF using `tf.concat()` with `axis=3`.
- The conv. layers with 1x1 kernels in the inception module serve 3 purposes:
    - Although they cannot capture spatial patterns, they can capture patterns along the depth dimension.
    - They are configured to output fewer feature maps than their inputs, so they act as bottleneck layers that reduce dimensionality. This cuts the computational cost and the number of parameters, which speeds up training and improves generalization.
    - Each pair of conv. layers (1x1 and 3x3, 1x1 and 5x5) acts as a single powerful conv. layer capable of capturing more complex patterns. Instead of sweeping a simple linear classifier across the image, as a single conv. layer does, the pair sweeps a two-layer NN across the image.
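
To get a feel for how much a 1x1 bottleneck saves, the sketch below counts the parameters of a 5x5 conv. layer applied directly to its input versus the same layer preceded by a 1x1 bottleneck. The map counts (192 in, 16 after the bottleneck, 32 out) are illustrative assumptions, not values taken from these notes.

```python
def conv_params(kernel, in_maps, out_maps):
    """Weights plus biases of a conv. layer with square kernels."""
    return kernel * kernel * in_maps * out_maps + out_maps


IN_MAPS = 192         # hypothetical depth of the module's input
BOTTLENECK_MAPS = 16  # hypothetical output depth of the 1x1 layer
OUT_MAPS = 32         # hypothetical output depth of the 5x5 layer

# 5x5 conv. applied directly to the 192-map input.
direct = conv_params(5, IN_MAPS, OUT_MAPS)

# 1x1 bottleneck first, then the 5x5 conv. on the reduced depth.
with_bottleneck = (conv_params(1, IN_MAPS, BOTTLENECK_MAPS)
                   + conv_params(5, BOTTLENECK_MAPS, OUT_MAPS))

print("direct 5x5:      ", direct)           # 153632
print("1x1 then 5x5:    ", with_bottleneck)  # 15920
print("reduction factor:", round(direct / with_bottleneck, 1))
```

The pair has roughly a tenth of the parameters of the direct layer in this example, even though it contains one more layer.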

![image.png](attachment:f9156397-a0b1-4c6a-a242-f5b2cc0f98f3.png)

- The number of conv. kernels for each conv. layer is a hyperparameter. Since an inception module contains six conv. layers, every inception module added to the network brings six more hyperparameters to tune.
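
One way to keep track of these six hyperparameters is to group them in a small record, as in the sketch below. The class name, field names and example filter counts are hypothetical; the only thing taken from the description above is that the depth concatenation stacks the outputs of the four top conv. layers.

```python
from dataclasses import dataclass, fields


@dataclass
class InceptionFilters:
    """The six filter counts of one inception module (hypothetical names)."""
    conv1x1: int     # 1x1 branch that feeds the concatenation directly
    reduce3x3: int   # 1x1 bottleneck in front of the 3x3 layer
    conv3x3: int     # 3x3 layer
    reduce5x5: int   # 1x1 bottleneck in front of the 5x5 layer
    conv5x5: int     # 5x5 layer
    pool_proj: int   # 1x1 layer that follows the max pooling layer

    def output_depth(self):
        """Depth after concatenating the four top conv. layers."""
        return self.conv1x1 + self.conv3x3 + self.conv5x5 + self.pool_proj


module = InceptionFilters(conv1x1=64, reduce3x3=96, conv3x3=128,
                          reduce5x5=16, conv5x5=32, pool_proj=32)
print(len(fields(module)), "hyperparameters, output depth", module.output_depth())
```

The two bottleneck counts do not appear in the output depth: they only control how much the input is compressed before the 3x3 and 5x5 layers.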

- The GoogLeNet architecture includes 9 inception modules. All conv. layers use the ReLU activation function. The first 2 layers divide the image's height and width by 4 (so its area is divided by 16) to reduce the computational load. The first layer uses a large kernel size so that much of the information is preserved.
- Then a local response normalization layer ensures that the previous layers learn a wide variety of features.
- Two conv. layers follow, where the first acts as a bottleneck layer. Think of this pair as a single, smarter conv. layer.
- Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.
- Next, a max pooling layer reduces the image's height and width by a factor of two to speed up computations.
- Then comes the stack of 9 inception modules, interleaved with a couple of max pooling layers to reduce dimensionality and speed up the net.
- Then the global average pooling layer outputs the mean of each feature map. This drops any remaining spatial information, which is fine as there is not much spatial information left at that point.
- GoogLeNet input images are typically expected to be 224x224 pixels, so after five stride-2 layers (the first conv. layer and four max pooling layers), each halving the height and width, the feature maps are down to 7x7. Moreover, this is a classification task, not localization, so it does not matter where the object is. Thanks to the dimensionality reduction brought by this layer, there is no need for several fully connected layers at the top of the CNN, which considerably reduces the number of parameters and limits the risk of overfitting.
- Finally come a dropout layer with a 50% dropout rate and a Dense layer with 1000 units (one per class) and a softmax activation.
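
The spatial arithmetic and the global average pooling step can be sketched in a few lines. The helper names are made up, and the sketch assumes every downsampling layer uses a stride of 2 with 'same' padding, so each one simply halves the height and width (rounding up).

```python
import math


def same_padding_output(size, stride):
    """Output height/width of a layer that uses 'same' padding."""
    return math.ceil(size / stride)


def global_average_pool(feature_maps):
    """Collapse each 2D feature map (a list of rows) into its mean."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# Five stride-2 layers starting from a 224x224 input.
sizes = [224]
for _ in range(5):
    sizes.append(same_padding_output(sizes[-1], 2))
print(sizes)

# Two tiny 2x2 feature maps stand in for the final 7x7 maps.
maps = [[[1, 2], [3, 4]],
        [[0, 0], [0, 8]]]
print(global_average_pool(maps))
```

Each feature map ends up as a single number, so the layer that follows only sees one value per map, whatever the spatial size was.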

- This is a simplified version. The original GoogLeNet architecture also included two auxiliary classifiers plugged on top of the third and sixth inception modules. Each of these classifiers contained an average pooling layer, one conv. layer, two fully connected layers and a softmax activation function. During training, their loss, scaled down by 70% (i.e., weighted by 0.3), was added to the overall loss.
- The idea was to fight the vanishing gradients problem and to regularize the network. However, it was later shown that their effect was relatively minor.
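
The way the auxiliary losses enter the training objective can be written down directly. This is a minimal sketch with a made-up function name and made-up loss values; it reads "scaled down by 70%" as a weight of 0.3.

```python
def total_training_loss(main_loss, auxiliary_losses, aux_weight=0.3):
    """Main loss plus the down-weighted losses of the auxiliary classifiers."""
    return main_loss + aux_weight * sum(auxiliary_losses)


# Hypothetical loss values for one training batch.
loss = total_training_loss(main_loss=1.0, auxiliary_losses=[2.0, 1.0])
print(round(loss, 3))
```

The auxiliary classifiers only matter during training; at inference time only the main output is used.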

## VGGNet

- It has a very simple and classical architecture: 2 or 3 conv. layers and a pooling layer, then again 2 or 3 conv. layers and a pooling layer, and so on, plus a final dense network with 2 hidden layers and the output layer. Depending on the VGG variant, this adds up to 16 or 19 weight layers in total (13 or 16 conv. layers plus the 3 dense layers). It uses only 3x3 filters, but many of them.
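
The layer counts can be checked quickly. The sketch assumes the commonly cited VGG-16 and VGG-19 configurations, given as the number of conv. layers in each block (every block ends with a pooling layer); these per-block numbers are not taken from the notes.

```python
# Conv. layers per block; assumed standard VGG configurations.
VGG_CONV_BLOCKS = {
    "VGG-16": [2, 2, 3, 3, 3],
    "VGG-19": [2, 2, 4, 4, 4],
}
DENSE_LAYERS = 3  # two hidden dense layers plus the output layer


def count_layers(variant):
    """Return (conv. layers, conv. + dense layers) for a VGG variant."""
    conv_layers = sum(VGG_CONV_BLOCKS[variant])
    return conv_layers, conv_layers + DENSE_LAYERS


for variant in VGG_CONV_BLOCKS:
    conv_layers, weight_layers = count_layers(variant)
    print(f"{variant}: {conv_layers} conv. layers, {weight_layers} weight layers")
```

The number in the variant's name counts the conv. layers and the dense layers together; pooling layers have no weights and are not counted.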

## ResNet

- Uses an extremely deep CNN composed of 152 layers; other variants use 34, 50 and 101 layers. It confirmed the general trend of models getting deeper and deeper with fewer parameters.
- The key to being able to train such a deep network is to use skip connections, also called shortcut connections: the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack.
- When training a NN, the goal is to make it model a target function h(x). If we add the input x to the output of the network, i.e., if we add a skip connection, then the network will be forced to model `f(x) = h(x) - x` rather than h(x). This is called residual learning.

![image.png](attachment:4407507c-a84a-4947-9f30-5ca350c49def.png)

- When we initialize a regular NN, its weights are close to zero, so it outputs values that are close to zero. If we add a skip connection, the resulting network just outputs a copy of its inputs, i.e., it initially models the identity function. If the target function is fairly close to the identity function, which is often the case, this speeds up training considerably.
- Moreover, if we add many skip connections, the network can start making progress even if several layers have not started learning yet. Thanks to the skip connections, the signal can easily make its way across the whole network.
- The deep residual network can be seen as a stack of residual units (RUs), where each residual unit is a small NN with a skip connection.
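
The identity behaviour at initialization is easy to see in a toy version. The sketch below uses small dense layers in place of conv. layers and leaves out batch normalization and the activation that real residual units apply after the addition, so it only illustrates the skip connection itself; all names are made up.

```python
def dense_layer(x, weights, biases):
    """Plain fully connected layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def residual_unit(x, w1, b1, w2, b2):
    """Compute f(x) with two layers, then add the skip connection."""
    f_x = dense_layer(relu(dense_layer(x, w1, b1)), w2, b2)
    return [f + xi for f, xi in zip(f_x, x)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


x = [1.0, -2.0, 3.0]
n = len(x)
w, b = zeros(n, n), [0.0] * n  # weights that have not learned anything yet

# A stack of 10 untrained residual units still passes the signal through.
signal = x
for _ in range(10):
    signal = residual_unit(signal, w, b, w, b)
print(signal)
```

With all weights at zero, f(x) is zero and every unit returns its input unchanged, so the signal reaches the top of the stack intact. Without the skip connections, the same stack would output only zeros.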

![image.png](attachment:3ef1f814-4c38-445e-959d-cc74da46ddbb.png)

- ResNet's architecture is surprisingly simple. It starts and ends exactly like GoogLeNet (except without a dropout layer), and in between is just a very deep stack of simple residual units. Each residual unit is composed of 2 conv. layers (and no pooling layer), with batch normalization and ReLU activation, using 3x3 kernels and preserving spatial dimensions (stride 1, 'same' padding).

![image.png](attachment:e4c9927e-219b-4a54-9f70-03a462d82fc1.png)

- The number of feature maps is doubled every few residual units, and at the same time the height and width are halved by using a stride of 2 in the conv. layer. When this happens, the inputs cannot be added directly to the outputs of the residual unit because their shapes do not match. To solve this problem, the inputs are passed through a 1x1 conv. layer with stride 2 and the right number of output feature maps.
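
The shape bookkeeping behind this fix can be sketched as follows. The helper name and the 56x56x64 input shape are illustrative assumptions; shapes are written as (height, width, feature maps), and every layer is assumed to use 'same' padding.

```python
import math


def conv_output_shape(shape, filters, stride):
    """Output shape of a conv. layer with 'same' padding."""
    height, width, _ = shape
    return (math.ceil(height / stride), math.ceil(width / stride), filters)


input_shape = (56, 56, 64)  # hypothetical input to the residual unit

# Main path: a 3x3 conv. with stride 2 that doubles the maps, then a 3x3 conv. with stride 1.
main_path = conv_output_shape(input_shape, filters=128, stride=2)
main_path = conv_output_shape(main_path, filters=128, stride=1)

# Skip path: a 1x1 conv. with stride 2 and the same number of maps.
skip_path = conv_output_shape(input_shape, filters=128, stride=2)

print("input:    ", input_shape)
print("main path:", main_path)
print("skip path:", skip_path)
```

The untouched input (56, 56, 64) could not be added to the main path's output, whereas the projected skip path has exactly the same shape, so the elementwise addition is well defined again.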

![image.png](attachment:7919ef01-67d6-4ed7-9a3c-44794d350318.png)

- ResNet-34 is the ResNet with 34 layers (counting only the conv. layers and the fully connected layer). It contains 3 RUs that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps and 3 RUs with 512 maps.
- Deeper ResNets, such as ResNet-152, use slightly different residual units. Instead of two 3x3 conv. layers with, say, 256 feature maps, they use 3 conv. layers: first a 1x1 conv. layer with just 64 feature maps, which acts as a bottleneck layer, then a 3x3 layer with 64 feature maps, and finally another 1x1 layer with 256 feature maps that restores the original depth.
- ResNet-152 contains 3 such RUs that output 256 maps, then 8 RUs with 512 maps, 36 RUs with 1024 maps, and finally 3 RUs with 2048 maps.
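
The layer counts in the names follow from the RU counts listed above. The sketch assumes one conv. layer before the stack of RUs and one fully connected layer after it, and it does not count the 1x1 conv. layers used on skip connections; the function name is made up.

```python
def resnet_depth(units_per_stage, convs_per_unit):
    """Initial conv. layer + conv. layers inside the RUs + final dense layer."""
    return 1 + convs_per_unit * sum(units_per_stage) + 1


# ResNet-34: RUs with two 3x3 conv. layers each.
print(resnet_depth([3, 4, 6, 3], convs_per_unit=2))

# ResNet-152: bottleneck RUs with three conv. layers each.
print(resnet_depth([3, 8, 36, 3], convs_per_unit=3))
```

Both results match the numbers in the model names, which confirms that only conv. layers and the fully connected layer are counted.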

- Google's Inception-v4 combined the ideas of GoogLeNet and ResNet and achieved a top-5 error rate of close to 3% on ImageNet classification.