# AlexNet  
- paper review

## Abstract  
- We trained a large, deep CNN to classify the 1.2 million images in ImageNet.  
- Our model achieved top-1 and top-5 error rates of 37.5% and 17.0%.  
- These results are considerably better than the previous state of the art.  
- The neural network has 60 million parameters and 650,000 neurons.  
- To make training faster, we used non-saturating neurons and a very efficient GPU implementation.  
- To reduce overfitting in the fully-connected layers, we employed "dropout".  

## Introduction  
- To improve performance, we can collect larger datasets, learn more powerful models, and use better techniques for preventing overfitting.  
- For example, the current best error rate on the MNIST digit-recognition task approaches human performance.  
- But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets.  
- The new, larger datasets consist of hundreds of thousands to millions of labeled images.  
- To learn about thousands of objects from millions of images, we need a model with a large learning capacity.  
- However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so the model should also have a lot of prior knowledge to compensate for the data we do not have.  
- CNNs are one such class of models: their capacity can be controlled by varying their depth and breadth, and they also make strong and mostly correct assumptions about the nature of images (stationarity of statistics and locality of pixel dependencies).  
- Thus, compared to standard feedforward networks with similarly-sized layers, CNNs have far fewer connections and parameters and so are easier to train, while their theoretically-best performance is likely to be only slightly worse.  
- Despite the attractive qualities of CNNs, they have still been prohibitively expensive to apply in large scale to high-resolution images.  
- The specific contributions of this paper are as follows:
    - We trained one of the largest CNNs to date on the subsets of ImageNet used in the ILSVRC-2010 and ILSVRC-2012 competitions and achieved by far the best results reported on these datasets.  
    - We wrote a highly-optimized GPU implementation of 2D convolution.  
    - The size of our network made overfitting a significant problem, even with 1.2 million labeled training examples, so we used several effective techniques for preventing overfitting.  
    - Our final network contains five convolutional and three fc layers.  
    - Our network takes between five and six days to train on two GTX 580 3GB GPUs.  
    - All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

## The Dataset  
- The ILSVRC dataset contains roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.  
- ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.  
- Therefore, we down-sampled the images to a fixed resolution $256\times 256$.  
- We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel.  
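The mean-subtraction step can be sketched as follows. This is a minimal illustration, assuming the mean is taken per pixel position over the training set and that images are stored as flat lists of equal length; `mean_image` and `subtract_mean` are hypothetical helper names, not part of the paper's code.

```python
def mean_image(images):
    """Per-position mean over a list of equally sized, flattened images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]


def subtract_mean(image, mean):
    """Center one image by subtracting the training-set mean image."""
    return [p - m for p, m in zip(image, mean)]


# Two toy "images" with four pixel values each (assumed data).
train = [[0.0, 10.0, 20.0, 30.0], [10.0, 30.0, 20.0, 50.0]]
mean = mean_image(train)
centered = [subtract_mean(img, mean) for img in train]
```

The same mean image, computed once on the training set, would also be subtracted from validation and test images.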

## The Architecture  
- The architecture of our network is summarized in Figure 2.  
- It contains eight learned layers: five convolutional and three fully-connected.  

### ReLU Nonlinearity  

<img src = "https://github.com/Sangh0/Classification/blob/main/AlexNet/figure/figure1.png?raw=true" width=500>

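- The standard way to model a neuron's output $f$ as a function of its input $x$ is $f(x)=\tanh(x)$ or the sigmoid $f(x)=(1+e^{-x})^{-1}$.  
- In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating ReLU, $f(x)=\max(0, x)$.  
- Deep CNNs with ReLUs train several times faster than their equivalents with tanh units.  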
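One way to see the difference between saturating and non-saturating units is to compare their derivatives for growing inputs. The snippet below is a small sketch using only the standard library; the function names are illustrative.

```python
import math


def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which shrinks toward 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


# The tanh gradient vanishes for large inputs, while the ReLU gradient stays at 1.
for x in (0.5, 2.0, 5.0):
    print(x, relu_grad(x), tanh_grad(x))
```

With a tanh unit, a large pre-activation yields a near-zero gradient and slow weight updates; a ReLU keeps a constant gradient for any positive input.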

### Training on Multiple GPUs  
- A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it.  
- It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU.  
- Therefore we spread the net across two GPUs.  
- The two-GPU net takes slightly less time to train than one-GPU net.  

### Local Response Normalization  
- ReLUs have the desirable property that they do not require input normalization to prevent them from saturating.  
- If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron.  
- However, we still find that the following local normalization scheme aids generalization.  

$$b_{x,y}^i=a_{x,y}^i/\left(k+\alpha\sum_{j=\max\left(0, i-n/2\right)}^{\min\left(N-1, i+n/2\right)}\left(a_{x,y}^j\right)^2\right)^{\beta}$$  

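- where $a_{x,y}^i$ is the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU, and $b_{x,y}^i$ is the response-normalized activity.  
- The sum runs over $n$ "adjacent" kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer.  
- The constants $k$, $n$, $\alpha$ and $\beta$ are hyperparameters.  
- We used $k=2$, $n=5$, $\alpha=10^{-4}$ and $\beta=0.75$.  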
- We applied this normalization after applying the ReLU.  
- Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively.  
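The normalization formula can be sketched for a single spatial position as below. This is a plain-Python illustration, not the paper's GPU implementation; it assumes $n/2$ is rounded down (so $n=5$ covers two neighbors on each side), and `local_response_norm` is a hypothetical name.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize the activations of N kernel maps at one position (x, y).

    a[i] is the ReLU output of kernel i; the result b[i] divides a[i] by a
    term built from the squared activations of up to n adjacent kernel maps.
    """
    N = len(a)
    b = []
    for i in range(N):
        lo = max(0, i - n // 2)       # assumption: n/2 uses floor division
        hi = min(N - 1, i + n // 2)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        b.append(a[i] / (k + alpha * sq_sum) ** beta)
    return b


# Toy example: seven kernel maps, all with activation 1.0 at this position.
normalized = local_response_norm([1.0] * 7)
```

Kernel maps near the edge of the channel range have fewer neighbors in the sum, so they are divided by a slightly smaller term.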

### Overlapping Pooling  
- Pooling layers in CNNs summarize the outputs of neighboring groups of neurons.  
- A pooling layer can be viewed as a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z\times z$; here $s$ is the stride and $z$ is the kernel size.  
- If we set $s=z$, we obtain traditional pooling in CNNs.  
- If we set $s<z$, we obtain overlapping pooling and we use $s=2$ and $z=3$.  
- This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme $s=2$, $z=2$.  
- We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.  
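The difference between the two settings is easiest to see in one dimension. The sketch below is an illustration on a toy row of values, not the paper's implementation; `max_pool_1d` is a hypothetical helper.

```python
def max_pool_1d(x, z, s):
    """Max pooling over a 1D list with window size z and stride s."""
    return [max(x[i:i + z]) for i in range(0, len(x) - z + 1, s)]


row = [1, 5, 2, 3, 9, 0, 4]

# Traditional pooling (s = z = 2): windows [1,5], [2,3], [9,0] share no elements.
non_overlapping = max_pool_1d(row, z=2, s=2)

# Overlapping pooling (s = 2, z = 3): windows [1,5,2], [2,3,9], [9,0,4]
# share their boundary elements.
overlapping = max_pool_1d(row, z=3, s=2)
```

Both settings produce three outputs here, but with $z=3$ the value 9 contributes to two adjacent windows.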

### Overall Architecture  

<img src = "https://github.com/Sangh0/Classification/blob/main/AlexNet/figure/figure2.png?raw=true" width=700>

- Our architecture is shown in Figure 2.  
- The network contains five convolutional and three fully-connected layers.  
- The output of the last fc layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.  
- The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU.  
- The kernels of the third convolutional layer are connected to all kernel maps in the second layer.  
- Response-normalization layers follow the first and second convolutional layers.  
- Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer.  
- The ReLU is applied to the output of every convolutional and fc layer.  
- The first conv layer filters the $224\times 224\times 3$ input image with 96 kernels of size $11\times 11\times 3$ with a stride of 4 pixels.  
- The second conv layer has 256 kernels of size $5\times 5\times 48$.  
- The third, fourth, and fifth conv layers are connected to one another without any intervening pooling or normalization layers.  
- The third conv layer has 384 kernels of size $3\times 3\times 256$.  
- The fourth conv layer has 384 kernels of size $3\times 3\times 192$.  
- The fifth conv layer has 256 kernels of size $3\times 3\times 192$.  
- The fc layers have 4096 neurons each.  
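The kernel shapes above are enough to sanity-check the "60 million parameters" figure from the abstract. The sketch below counts weights plus one bias per kernel or neuron. Two values are assumptions not stated in these notes: the second conv layer has 256 kernels (consistent with the $3\times 3\times 256$ kernels of the third layer), and the first fc layer receives a $6\times 6\times 256$ feature map, the commonly cited size after the last max-pooling layer.

```python
# (name, number of kernels, kernel height, kernel width, kernel depth)
CONV_LAYERS = [
    ("conv1", 96, 11, 11, 3),
    ("conv2", 256, 5, 5, 48),   # assumption: 256 kernels
    ("conv3", 384, 3, 3, 256),
    ("conv4", 384, 3, 3, 192),
    ("conv5", 256, 3, 3, 192),
]

# (name, inputs, outputs)
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),  # assumption: 6x6x256 input feature map
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def count_params():
    """Return a dict of per-layer parameter counts (weights + biases)."""
    counts = {}
    for name, kernels, h, w, depth in CONV_LAYERS:
        counts[name] = kernels * (h * w * depth + 1)
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_out * (n_in + 1)
    return counts


counts = count_params()
total = sum(counts.values())
print(counts)
print(total)
```

Under these assumptions the total comes to about 61 million, and the three fc layers account for most of it, which is consistent with dropout being applied to the fc layers.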

## Reducing Overfitting  
- We describe the two primary ways in which we combat overfitting.  

### Data Augmentation  
- The easiest and most common method to reduce overfitting is to enlarge the dataset.  
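- In our implementation, the transformed images are generated on the CPU while the GPU is training on the previous batch of images.  
- The first form of augmentation consists of generating image translations and horizontal reflections.  
- We do this by extracting random $224\times 224$ patches (and their horizontal reflections) from the $256\times 256$ images and training the network on these patches.  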
- The second form of augmentation consists of altering the intensities of the RGB channels in training images.  
- We perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1.  
- Therefore, to each RGB image pixel $I_{xy}=\left[I_{xy}^R, I_{xy}^G, I_{xy}^B\right]^T$ we add the following quantity:  

$$\left[\textbf{p}_1, \textbf{p}_2, \textbf{p}_3\right]\left[\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3\right]^T$$  
$$\text{Covariance matrix: } C = \frac{1}{n-1}\sum_{i=1}^n\left(X_i-\bar{X}\right)\left(X_i-\bar{X}\right)^T$$  

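- where $\textbf{p}_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the $3\times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the Gaussian random variable described above, drawn once per image each time that image is used for training.  
- This scheme approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination.  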
- This scheme reduces the top-1 error rate by over 1%.  
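Both forms of augmentation can be sketched in plain Python. This is an illustration only: images are nested lists, the eigenvectors and eigenvalues are assumed to be precomputed from the training set, and `random_crop_flip` and `pca_color_shift` are hypothetical names.

```python
import random


def random_crop_flip(image, crop=224, rng=random):
    """Take a random crop x crop patch and mirror it horizontally half the time.

    image is a list of rows; each row is a list of pixels.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]
    return patch


def pca_color_shift(eigvecs, eigvals, std=0.1, rng=random):
    """Return the RGB offset sum_i p_i * alpha_i * lambda_i for one image.

    eigvecs[i] is the eigenvector p_i as [R, G, B]; eigvals[i] is lambda_i.
    One alpha_i is drawn per eigenvector from a zero-mean Gaussian.
    """
    alphas = [rng.gauss(0.0, std) for _ in eigvals]
    return [
        sum(p[c] * a * lam for p, a, lam in zip(eigvecs, alphas, eigvals))
        for c in range(3)
    ]


# Toy usage: an 8x8 image of (row, col) labels and made-up eigen-decomposition.
rng = random.Random(0)
toy_image = [[(r, c) for c in range(8)] for r in range(8)]
patch = random_crop_flip(toy_image, crop=6, rng=rng)
shift = pca_color_shift([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.3, 0.2, 0.1], rng=rng)
```

The returned offset would be added to every pixel of the image, so a single draw of the $\alpha_i$ shifts the whole image's color balance at once.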

### Dropout  
- Combining the predictions of many different models is a very successful way to reduce test errors, but it appears to be too expensive for big neural networks that already take several days to train.  
- We use the recently introduced technique called "dropout", which sets the output of each hidden neuron to zero with probability 0.5.  
- The neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in backpropagation.  
- This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons.  
- We use dropout in the first two fc layers of our architecture.  
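A minimal sketch of dropout on a list of activations follows. The training-time behavior matches the description above; the test-time scaling by 0.5 comes from the paper rather than these notes, and both function names are illustrative.

```python
import random


def dropout_train(activations, p_drop=0.5, rng=random):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep all neurons but scale outputs by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]


# Toy usage with a seeded generator so the run is repeatable.
hidden = [1.0] * 1000
dropped = dropout_train(hidden, rng=random.Random(0))
scaled = dropout_test([2.0, 4.0])
```

Because a different random subset of neurons is active for every input, each presentation effectively trains a different architecture, all of which share weights.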

## Details of learning  
- optimizer: SGD momentum 0.9  
- batch size: 128  
- weight decay: 0.0005  
- initial learning rate: 0.01  
- epochs: 90
- The update rule for weight $w$ was  
$$v_{i+1} = 0.9\cdot v_i -0.0005\cdot \epsilon\cdot w_i - \epsilon \cdot \left<\frac{\partial{L}}{\partial{w}}\vert_{w_i}\right>$$  
$$w_{i+1} = w_i + v_{i+1}$$  
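- where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and the bracketed term is the average over the $i$th batch of the derivative of the objective with respect to $w$, evaluated at $w_i$.  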
- We initialized the weights in each layer from a zero-mean Gaussian distribution with std 0.01.  
- We initialized the neuron biases in the second, fourth and fifth conv layers, as well as in the fc hidden layers, with the constant 1.  
- We initialized the neuron biases in the remaining layers with constant 0.  
- The heuristic we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate.  
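The update rule can be written out for a single scalar weight as below. This is a sketch of the two equations with the listed hyperparameters as defaults; `sgd_momentum_step` is a hypothetical name, and `grad` stands for the batch-averaged gradient.

```python
def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the update rule for a single weight.

    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


# Toy usage: one step from w = 1.0 with zero initial momentum.
w1, v1 = sgd_momentum_step(w=1.0, v=0.0, grad=2.0, lr=0.01)

# Dividing the learning rate by 10 gives a proportionally smaller first step.
w2, v2 = sgd_momentum_step(w=1.0, v=0.0, grad=2.0, lr=0.001)
```

Note that the weight decay term is scaled by the learning rate, so lowering the learning rate also weakens the decay applied per step.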

## Results  
- On ILSVRC-2010, our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0%.  
