# ImageNet Classification with Deep Convolutional Neural Networks

## 1. Introduction


### Context

* Machine learning methods of that time worked well on relatively small datasets (NORB, Caltech, CIFAR).
* Larger datasets such as ImageNet and LabelMe had emerged only recently.
* Existing methods did not work well on these large datasets.
* Learning about thousands of objects from millions of images requires a model with a large learning capacity.

### CNNs before this:

* CNNs have far fewer connections and parameters than standard feedforward networks with similarly sized layers, so they are easier to train.
* CNNs perform well for the number of parameters they have.
* They were still expensive to apply to large, high-resolution images because no efficient implementation existed.
* The paper provides a highly optimized GPU implementation to address this.

### Specific contributions:

* Achieved the best results reported at that time on the subsets of ImageNet used in ILSVRC-2010 and ILSVRC-2012.
* One of the largest convolutional neural networks at that time.
* Highly optimized GPU implementation for CNNs.
* A number of new and unusual features of the network that improve its results.
* Techniques to avoid overfitting.
* Final network consists of five convolutional and three fully connected layers.
* Fully utilized 2x NVIDIA GTX 580 3GB GPUs.

----

## 2. Dataset

Large dataset with many labels (the ILSVRC subset of ImageNet used here covers 1000 classes).

## 3. The architecture

<img src="pinecone-image.png" alt="Architecture diagram" width="800">

Here are the novel or unusual features of this network described by the paper:

### ReLU Nonlinearity

<div style="overflow: auto;">
<img src="relu.png" alt="ReLU diagram" width="300" style="float: right; margin-top: 20px; margin-left: 20px; margin-bottom: 10px;">

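* AlexNet popularized the use of ReLUs, $f(x) = \max(0, x)$, for deep CNNs.
* Trains faster, which was demonstrated on the CIFAR-10 dataset (a four-layer CNN reached 25% training error about six times faster with ReLUs than with tanh).
* Contrast with prior work:
    * Previous works instead used $f(x) = \tanh(x)$, $f(x) = (1 + e^{-x})^{-1}$, or $f(x) = |\tanh(x)|$.
    * That work focused on preventing overfitting on small datasets rather than on fitting large datasets quickly.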

</div>
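
The paper describes tanh and the sigmoid as saturating nonlinearities. A minimal NumPy sketch of what that means for gradients is shown below; the function names are illustrative and do not come from the paper.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for positive inputs and 0 otherwise (0 is chosen at x == 0).
    return (x > 0).astype(float)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-5.0, -1.0, 0.5, 5.0])
print("relu      :", relu(x))
print("relu grad :", relu_grad(x))
print("tanh grad :", tanh_grad(x))     # shrinks toward 0 for large |x|
print("sigm grad :", sigmoid_grad(x))  # shrinks toward 0 for large |x|
```

For large positive inputs the tanh and sigmoid gradients become very small, while the ReLU gradient stays at 1, which is one common explanation for the faster training.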

### Training on 2 GPUs

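* Each GTX 580 GPU had only 3 GB of memory, which limits the size of the network that can be trained on it.
* The network was therefore split across two GPUs.
* Each GPU holds half of the kernels (and so half of the parameters), and the GPUs communicate only in certain layers.
* Slightly faster to train than a one-GPU network with half as many kernels in each convolutional layer, and it reduces the top-1 and top-5 error rates by 1.7% and 1.2%.
* Made use of the NVIDIA CUDA framework.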

### Local Response Normalization

$$
b^{i}_{x,y} =
\frac{a^{i}_{x,y}}
{\left(
k + \alpha
\sum_{j=\max(0,\, i - \frac{n}{2})}^{\min(N - 1,\, i + \frac{n}{2})}
\left(a^{j}_{x,y}\right)^2
\right)^{\beta}}
$$

What each term means:
* $b^{i}_{x,y}$: activation after normalization
* $a^{i}_{x,y}$: activation of feature map $i$ at position $(x, y)$ after convolution and ReLU
* $i$: feature map index
* $j$: neighbouring feature map index
* $x, y$: fixed pixel position in the feature map
* $n$: number of adjacent feature maps included in the sum
* $N$: total number of feature maps in the layer
* $k$: baseline constant that keeps the denominator numerically stable
* $\sum_{j=\max(0,\, i - \frac{n}{2})}^{\min(N - 1,\, i + \frac{n}{2})} \left(a^{j}_{x,y}\right)^2$: measures how strong nearby filters are at $(x, y)$
* $\alpha$: normalization strength; scales the influence of the neighbours
* $\beta$: exponent; controls how non-linear the suppression is

----

Notes on LRN in later work:
* Batch Normalization is more effective.
* LRN adds computational cost.
* Its benefits do not scale to even deeper networks.
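
The normalization formula above can be sketched directly in NumPy. The default values `k=2`, `n=5`, `alpha=1e-4`, `beta=0.75` are the ones reported in the paper. The function name, the `(N, H, W)` array layout, and the use of integer division for $n/2$ are assumptions made for this sketch.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize across feature maps; a has shape (N, H, W), taken after ReLU."""
    a = np.asarray(a, dtype=float)
    N = a.shape[0]
    squared = a ** 2
    b = np.empty_like(a)
    half = n // 2  # assumption: n/2 rounded down, so n=5 covers maps i-2 .. i+2
    for i in range(N):
        lo = max(0, i - half)
        hi = min(N - 1, i + half)
        # Sum of squared activations over neighbouring maps at every (x, y).
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        b[i] = a[i] / denom
    return b

activations = np.ones((8, 4, 4))
normalized = local_response_norm(activations)
print(normalized[0, 0, 0], normalized[4, 0, 0])
```

Maps near the edges (such as index 0) have fewer neighbours inside the window than maps in the middle (such as index 4), so their denominators differ slightly.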

### Overlapping pooling

<img src="maxpooling.png" alt="Max pooling diagram" width="300">

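* Pooling summarizes local neighborhoods within a feature map.
* AlexNet uses overlapping pooling: window size 3×3 with stride 2, so adjacent windows overlap.
* Models with overlapping pooling are slightly harder to overfit.
* Reduces the top-1 error rate by 0.4% (top-5 by 0.3%) compared with non-overlapping 2×2 pooling with stride 2.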
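
A minimal sketch of the two pooling schemes follows, using the paper's symbols `z` (window size) and `s` (stride). The function name and the toy 7×7 input are illustrative only.

```python
import numpy as np

def max_pool2d(x, z=3, s=2):
    """Max-pool a 2-D feature map with a z x z window and stride s."""
    h, w = x.shape
    out_h = (h - z) // s + 1
    out_w = (w - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = x[r * s:r * s + z, c * s:c * s + z].max()
    return out

x = np.arange(49).reshape(7, 7)
overlapping = max_pool2d(x, z=3, s=2)      # s < z: neighbouring windows share a row/column
non_overlapping = max_pool2d(x, z=2, s=2)  # s == z: traditional pooling
print(overlapping)
print(non_overlapping)
```

With `s < z` each input value near a window border contributes to more than one output, which is what "overlapping" refers to here.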

----

## 4. Reducing overfitting

* Even though the dataset has 1000 classes, the model's 60 million parameters make it easy to overfit.
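
The 60 million figure can be checked with a rough count. The layer shapes below are taken from the paper's architecture description rather than from these notes, and the 6×6×256 input to the first fully connected layer is the usual reading of the pooled conv5 output, so treat the table as an assumption of this sketch.

```python
# (layer name, weights per kernel or unit, number of kernels or units)
layers = [
    ("conv1", 11 * 11 * 3, 96),
    ("conv2", 5 * 5 * 48, 256),    # each kernel sees only the 48 maps on its own GPU
    ("conv3", 3 * 3 * 256, 384),   # connected to all conv2 maps on both GPUs
    ("conv4", 3 * 3 * 192, 384),
    ("conv5", 3 * 3 * 192, 256),
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]

def count_parameters(layers):
    total = 0
    for _name, fan_in, units in layers:
        total += fan_in * units + units  # weights plus one bias per kernel/unit
    return total

total = count_parameters(layers)
print(f"{total:,} parameters")
```

Almost all of the parameters sit in the three fully connected layers, which is consistent with dropout being applied there.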

### Data augmentation

* The easiest and most common way to reduce overfitting, as described in the paper.
* Augmentations:
    * Image translations and horizontal reflections (random 224×224 patches taken from the 256×256 images, plus their mirror images).
    * Color PCA: to each pixel $I_{xy} = [R, G, B]^T$, add:
    
    $$ [p_1, p_2, p_3] \cdot [\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T $$
    
    Where:
    * $p_i$ = $i$-th eigenvector of RGB covariance
    * $\lambda_i$ = $i$-th eigenvalue
    * $\alpha_i$ = Gaussian random variable (mean 0, standard deviation 0.1), re-drawn each time an image is used for training
    
    **Why this works:**
    * Captures principal axes of variation in natural lighting/colors
    * Introduces realistic color shifts without changing object identity
    * Helps generalization for large-scale datasets like ImageNet
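
Both augmentations above can be sketched in a few lines of NumPy. The function names are hypothetical, and the PCA is fitted on a single random image purely to keep the example self-contained; the paper fits it on the RGB values of the whole training set.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image, size=224):
    """image: (H, W, 3) array, for example 256 x 256 x 3."""
    h, w, _ = image.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = image[top:top + size, left:left + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

def fit_rgb_pca(pixels):
    """pixels: (num_pixels, 3) RGB values; returns eigenvalues and eigenvectors."""
    cov = np.cov(pixels, rowvar=False)      # 3 x 3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # columns of eigvecs are the p_i
    return eigvals, eigvecs

def pca_color_shift(image, eigvals, eigvecs, std=0.1):
    alpha = rng.normal(0.0, std, size=3)    # drawn once per image
    shift = eigvecs @ (alpha * eigvals)     # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return image + shift                    # the same RGB shift is added to every pixel

image = rng.random((256, 256, 3))
eigvals, eigvecs = fit_rgb_pca(image.reshape(-1, 3))
augmented = pca_color_shift(random_crop_and_flip(image), eigvals, eigvecs)
print(augmented.shape)
```

Because the shift is a single 3-vector per image, it changes overall color and intensity without altering spatial structure.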

### Dropout

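* Sets the output of each hidden neuron to zero with probability 0.5 during training.
* Reduces co-adaptation between neurons, which forces the network to learn more robust features.
* Used in the first two fully-connected layers.
* Without dropout the network overfits substantially; with it, training takes roughly twice as many iterations to converge.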
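
A minimal sketch of dropout as the paper describes it: units are zeroed at random during training, and at test time all units are kept but their outputs are multiplied by 0.5. The function names are illustrative. Many current frameworks instead rescale during training ("inverted dropout"), which this sketch does not do.

```python
import numpy as np

def dropout_train(h, rng, p_drop=0.5):
    # Keep each activation with probability 1 - p_drop, zero it otherwise.
    mask = rng.random(h.shape) >= p_drop
    return h * mask

def dropout_test(h, p_drop=0.5):
    # Use every unit, scaled by the keep probability to match training-time expectations.
    return h * (1.0 - p_drop)

rng = np.random.default_rng(0)
hidden = np.ones((2, 8))
print(dropout_train(hidden, rng))
print(dropout_test(hidden))
```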

----

## 5. Details of learning

* **Optimizer:** Stochastic gradient descent (batch size: 128, momentum: 0.9, weight decay: 0.0005)
    
    Update rule:
    $$v_{i+1} := 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}$$
    $$w_{i+1} := w_i + v_{i+1}$$
    
    Where:
    * $i$ = iteration index
    * $v_i$ = momentum variable (velocity)
    * $w_i$ = weights at iteration $i$
    * $\epsilon$ = learning rate
    * $0.9$ = momentum coefficient
    * $0.0005$ = weight decay coefficient
    * $\left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}$ = average gradient over batch $D_i$

* **Weight initialization:** Zero-mean Gaussian, std = 0.01
* **Bias initialization:**
    * Constant 1: Conv layers 2, 4, 5 and the fully-connected hidden layers (gives ReLUs positive inputs, which speeds up early learning)
    * Constant 0: All other layers
* **Learning rate:** Started at 0.01, reduced by 10x when validation error plateaued
* **Training:** 90 epochs, 5-6 days on 2x NVIDIA GTX 580 GPUs
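
The update rule above can be sketched as a single NumPy step. The quadratic toy loss and the names `sgd_step` and `target` are hypothetical and serve only to make the example runnable; the hyperparameter defaults are the ones listed in this section.

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    # v_{i+1} = 0.9 * v_i - 0.0005 * lr * w_i - lr * <dL/dw>
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    # w_{i+1} = w_i + v_{i+1}
    w_next = w + v_next
    return w_next, v_next

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.01, size=5)  # zero-mean Gaussian init, std 0.01
v = np.zeros_like(w)
target = np.ones_like(w)

for _ in range(200):
    grad = w - target              # gradient of the toy loss 0.5 * ||w - target||^2
    w, v = sgd_step(w, v, grad)

print(w)
```

Note that the weight decay term is scaled by the learning rate, so it acts as part of the update itself and not only as a penalty added to the loss.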


