### AlexNet

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

#### Data Augmentation

AlexNet has about 60 million parameters but was trained on only about 1.2 million images, so overfitting is a real risk. One way to combat overfitting is to artificially enlarge the training set with data augmentation.

AlexNet uses two forms of data augmentation:

1. Image translations and horizontal reflections.
2. Altering the intensities of the RGB channels.
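The first form of augmentation can be sketched in a few lines of plain Python. This is only an illustration, not the paper's implementation: the function names (`random_crop`, `horizontal_flip`, `augment`) and the toy 4x4 image are assumptions made for the example, and a translation is modeled as taking a crop at a random offset.

```python
import random

def random_crop(image, crop_h, crop_w, rng=random):
    """Return a crop_h x crop_w window at a random offset (a random translation)."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def augment(image, crop_h, crop_w, rng=random):
    """Random crop followed by a horizontal flip with probability 0.5."""
    crop = random_crop(image, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        crop = horizontal_flip(crop)
    return crop

# Toy 4x4 single-channel "image"; each pixel could equally be an (R, G, B) tuple.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
sample = augment(image, 3, 3, random.Random(0))
```

Each call to `augment` returns a slightly different view of the same image, which is how one labeled example becomes many.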

![](alexnet.png)


#### Same Convolution

A same convolution is a convolution whose output matrix has the same dimensions as its input matrix.

For an $n \times n$ input matrix $A$ and an $f \times f$ filter matrix $F$, the output of the convolution $A * F$ has dimension:
$$\left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\right) \times \left(\left\lfloor\frac{n+2p-f}{s}\right\rfloor+1\right)$$
where $s$ is the stride and $p$ is the padding.

For a same convolution:
- s = 1,  
- p = $\frac{f - 1}{2}$, and   
- f is an odd number
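As a quick check of these conditions, the sketch below computes the output size as $\lfloor (n + 2p - f)/s \rfloor + 1$ and confirms that $p = (f-1)/2$ with $s = 1$ preserves the input size. The helper names `conv_output_size` and `same_padding` are made up for this example.

```python
def conv_output_size(n, f, p=0, s=1):
    """Side length of the output for an n x n input, f x f filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Padding that keeps the output size equal to the input size when s = 1."""
    if f % 2 == 0:
        raise ValueError("a same convolution needs an odd filter size")
    return (f - 1) // 2

# With no padding the output shrinks: a 6x6 input and a 3x3 filter give 4x4.
shrunk = conv_output_size(6, 3)

# With same padding and stride 1 the size is preserved for any odd f.
n = 32
preserved = [conv_output_size(n, f, same_padding(f), s=1) for f in (1, 3, 5, 7)]
```

The odd-filter requirement is what makes $(f-1)/2$ an integer, so the same amount of padding can be added on every side.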

### VGG-16

Karen Simonyan and Andrew Zisserman (2014). Visual Geometry Group Lab of Oxford University.

https://arxiv.org/abs/1409.1556

The goal of the research was to analyze how increasing the depth of convolutional networks affects their accuracy.

All convolutions are same convolutions.

![](VGG.png)



* ~138 Million parameters
* 3x3 filters
* Stride = 1
* Number of filters 64->128->256->512

Notice that the number of channels increases with depth (which increases the number of parameters) while the spatial size decreases (which reduces the amount of computation and the size of the input to the fully connected layers).
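The ~138 million figure can be reproduced by counting weights and biases layer by layer. The sketch below assumes the 16-layer configuration from the VGG paper (13 convolutional layers with 3x3 filters, a 224x224x3 input, five 2x2 max-pools, and fully connected layers of 4096, 4096, and 1000 units); these details come from the paper rather than from the notes above, and the helper names are illustrative.

```python
def conv_params(c_in, c_out, f=3):
    """Weights plus biases of a conv layer with f x f filters."""
    return f * f * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Output channels of the 13 conv layers (assumed from the VGG paper).
conv_channels = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]

total_conv = 0
c_in = 3  # RGB input
for c_out in conv_channels:
    total_conv += conv_params(c_in, c_out)
    c_in = c_out

# Five 2x2 max-pools shrink 224 -> 7, so the flattened input is 7 * 7 * 512.
fc_sizes = [7 * 7 * 512, 4096, 4096, 1000]
total_fc = sum(fc_params(a, b) for a, b in zip(fc_sizes, fc_sizes[1:]))

total = total_conv + total_fc
print(f"conv: {total_conv:,}  fc: {total_fc:,}  total: {total:,}")
```

Under these assumptions most of the parameters sit in the fully connected layers, not in the convolutions.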

## ResNet

He, Zhang, Ren, and Sun (2015) [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385)

Why doesn't adding more layers improve training and test error?

One explanation is that weight decay, small random initialization, and L2 regularization bias learning toward zero. As layers are added, the model tends to learn the zero function, F(x) = 0.

![](MoreLayers.png)

$\text{Train and Test Error for 20 and 56 layer networks}$

### Residual Blocks

![](ResNetBlk.png)

$\text{Residual Building Block}$



Solution: make the identity function, rather than the zero function, the default.

The block learns F(x) + x, where F(x) is the change to x made by the layer(s).

![](ResBlock.png)

<div style="font-size: 115%;">
$$a^{l+2} = ReLU((W^{l+2}\cdot{a^{l+1}}+b^{l+2}) + a^l)$$
</div>

If the weights and bias are 0 (because of weight decay, small random initialization, or L2 regularization), then, since $a^l$ is itself the output of a ReLU and therefore non-negative:

<div style="font-size: 115%;">
$$ a^{l+2} = ReLU(a^l) = a^l$$
</div>

The residual block then computes the identity function. The path that carries $a^l$ around the layers is called a "skip" or "shortcut" connection.
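A small numeric check makes this concrete. The sketch below evaluates $a^{l+2} = ReLU(W^{l+2} a^{l+1} + b^{l+2} + a^l)$ on plain Python lists; the helper names and the example vectors are invented for illustration, and fully connected layers stand in for the convolutions.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def matvec(W, v):
    """Matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_output(W, b, a_mid, a_skip):
    """a^{l+2} = ReLU(W a^{l+1} + b + a^l), where a_skip is a^l."""
    z = matvec(W, a_mid)
    return relu([zi + bi + si for zi, bi, si in zip(z, b, a_skip)])

a_l = [0.5, 0.0, 2.0]    # non-negative, as the output of an earlier ReLU
a_mid = [1.0, 3.0, 4.0]  # a^{l+1}; its value is irrelevant once W is zero

W_zero = [[0.0] * 3 for _ in range(3)]
b_zero = [0.0] * 3

out = residual_output(W_zero, b_zero, a_mid, a_l)
assert out == a_l  # the block passes a^l through unchanged
```

With a plain (non-residual) layer the same zero weights would give `relu([0, 0, 0])`, the zero function, which is the contrast the skip connection is meant to fix.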

#### ResNet Architecture

![](ResNet.png)

VGG-19 requires 19.6 billion FLOPs, whereas the 34-layer ResNet requires 3.6 billion FLOPs. ResNet is therefore less computationally expensive even though it is deeper. Each layer only has to learn a small change, so it does not need as many parameters.

### 1x1 Convolutions

https://medium.com/analytics-vidhya/talented-mr-1x1-comprehensive-look-at-1x1-convolution-in-deep-learning-f6b355825578

The dotted lines in the ResNet architecture mark where the number of channels increases.

A 1x1 convolution projects the input to a different number of channels (here, a higher one) without changing its spatial size.
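One way to see why the spatial size is unchanged: a 1x1 convolution applies the same small linear map to the channel vector at every pixel. The sketch below is a plain-Python illustration under that view; `conv1x1`, the 2x2x2 input, and the weight values are all made up for the example.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in (one row per output channel)
    bias:        C_out values
    """
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) + b
             for w_row, b in zip(weights, bias)]
            for pixel in row
        ]
        for row in feature_map
    ]

# 2x2 spatial grid with 2 channels per pixel.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]

# Project 2 channels up to 3: copy each channel, then add their sum.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

out = conv1x1(feature_map, weights, bias)
```

The output keeps the 2x2 grid but each pixel now has three channels; using fewer rows in `weights` would reduce the channel count in the same way.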

![](Bottleneck.png)

$\text{Bottleneck Building Block}$

### Batch Normalization

Ioffe and Szegedy (2015) [Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)

In each training iteration, BN normalizes the output of each hidden layer node 
(on each layer where it is applied) by subtracting its mean and dividing by its standard deviation, estimating both based on the current minibatch.

For each hidden layer on which batch normalization is applied:

![](BatchNorm.png)

For convolutional layers, batch normalization occurs after the convolution computation and before the application of the activation function.
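The normalization step described in this section can be sketched for a minibatch of feature vectors as follows. This is a simplified training-time illustration in plain Python: the learned scale `gamma`, shift `beta`, and the small constant `eps` follow the batch normalization paper, while the function name and the example batch are assumptions for this sketch. Running averages for inference are left out.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then scale and shift.

    batch: list of examples, each a list of feature values
    gamma, beta: per-feature scale and shift
    """
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(x[j] for x in batch) / n for j in range(num_features)]
    variances = [
        sum((x[j] - means[j]) ** 2 for x in batch) / n
        for j in range(num_features)
    ]
    return [
        [
            gamma[j] * (x[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(num_features)
        ]
        for x in batch
    ]

# Three examples with two features on very different scales.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma = 1` and `beta = 0`, both columns of `normalized` end up with mean close to 0 and variance close to 1, regardless of their original scale.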

### References

Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, Dive into Deep Learning

Andrew Ng, DeepLearning.AI

Jason Brownlee, A Gentle Introduction to 1×1 Convolutions to Manage Model Complexity

https://www.youtube.com/watch?v=GWt6Fu05voI&ab_channel=YannicKilcher