### 1. What is the COVARIATE SHIFT Issue, and how does it affect you?


Covariate shift occurs when the distribution of the input variables (covariates) in the training data differs from the distribution seen in testing or real-world data. A model trained under one input distribution may make wrong predictions once it is deployed, and its accuracy can drop significantly on new data drawn from a different distribution.

![image.png](attachment:image.png)
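
As a rough illustration of the effect, the sketch below fits a straight line on inputs drawn from one range and then evaluates it on inputs drawn from a different range. The target function `x ** 2`, the two input ranges, and the sample sizes are all assumptions chosen for the demo; the relationship between input and output is the same in both sets, and only the input distribution changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical ground-truth relationship; identical for both data sets.
    return x ** 2

# Training inputs come from [0, 1]; "deployment" inputs come from [2, 3].
x_train = rng.uniform(0.0, 1.0, size=200)
x_shift = rng.uniform(2.0, 3.0, size=200)

# Fit a straight line using the training range only.
slope, intercept = np.polyfit(x_train, target(x_train), deg=1)

def mse(x):
    # Mean squared error of the fitted line against the true target.
    pred = slope * x + intercept
    return float(np.mean((pred - target(x)) ** 2))

mse_train = mse(x_train)
mse_shift = mse(x_shift)
print(f"MSE on training distribution: {mse_train:.4f}")
print(f"MSE on shifted distribution:  {mse_shift:.4f}")
```

The line is a reasonable approximation inside the training range but a poor one outside it, which is why the error on the shifted inputs is much larger.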

### 2. What is the process of BATCH NORMALIZATION?


Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and can dramatically reduce the number of training epochs required to train deep networks.
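
A minimal NumPy sketch of the training-time forward pass may help make the standardization step concrete. The learnable scale `gamma`, shift `beta`, and the small constant `eps` follow the usual formulation and are assumptions here, since the text above does not mention them; running statistics for inference are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # 32 samples, 4 features
out = batch_norm_forward(batch, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```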

### 3. Using our own terms and diagrams, explain LENET ARCHITECTURE.



![image.png](attachment:image.png)

The LeNet architecture is an excellent “first architecture” for Convolutional Neural Networks (especially when trained on the MNIST dataset, an image dataset for handwritten digit recognition).

LeNet is small and easy to understand — yet large enough to provide interesting results. Furthermore, the combination of LeNet + MNIST is able to run on the CPU, making it easy for beginners to take their first step in Deep Learning and Convolutional Neural Networks.

The LeNet architecture stacks convolutional, pooling, and fully-connected layers, as shown in the diagram above.

Instead of explaining the number of convolution filters per layer, the size of the filters themselves, and the number of fully-connected nodes right now, I’m going to save this discussion until our “Implementing LeNet with Python and Keras” section of the blog post where the source code will serve as an aid in the explanation.

In the meantime, let’s look at our project structure, a structure that we are going to reuse many times in future PyImageSearch blog posts.

To keep our code organized, we’ll define a package named pyimagesearch. Within the pyimagesearch module, we’ll create a cnn sub-module, which is where we’ll store our Convolutional Neural Network implementations, along with any helper utilities related to CNNs.

Taking a look inside cnn, you’ll see the networks sub-module: this is where the network implementations themselves will be stored. As the name suggests, the lenet.py file will define a class named LeNet, which is our actual LeNet implementation in Python + Keras.

The lenet_mnist.py script will be our driver program used to instantiate the LeNet network architecture, train the model (or load the model, if our network is pre-trained), and then evaluate the network performance on the MNIST dataset.

Finally, the output directory will store our LeNet model after it has been trained, allowing us to classify digits in subsequent calls to lenet_mnist.py without having to re-train the network.
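
Since the layer sizes are not given in this section, the sketch below traces feature-map shapes through the convolution and pooling stages under assumed classic LeNet-5 settings (32×32 single-channel input, 5×5 convolutions with 6 and 16 filters, 2×2 pooling). These numbers are assumptions for illustration and may differ from the Keras variant described above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kind, kernel, stride, out_channels): assumed classic LeNet-5 sizes.
LAYERS = [
    ("conv1", "conv", 5, 1, 6),
    ("pool1", "pool", 2, 2, None),
    ("conv2", "conv", 5, 1, 16),
    ("pool2", "pool", 2, 2, None),
]

def trace_shapes(size=32, channels=1):
    """Return (name, height, width, channels) after each layer."""
    shapes = []
    for name, kind, kernel, stride, out_ch in LAYERS:
        size = conv_out(size, kernel, stride)
        if kind == "conv":
            channels = out_ch  # pooling keeps the channel count unchanged
        shapes.append((name, size, size, channels))
    return shapes

shapes = trace_shapes()
for name, h, w, c in shapes:
    print(f"{name}: {h}x{w}x{c}")

# Number of values fed into the first fully-connected layer.
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("flattened features:", flattened)
```

Under these assumptions the spatial size shrinks from 32 to 5 while the channel count grows from 1 to 16, so the fully-connected layers receive 400 values.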

### 4. Using our own terms and diagrams, explain ALEXNET ARCHITECTURE.



![image.png](attachment:image.png)
One thing to note here: since AlexNet is a deep architecture, the authors introduced padding to prevent the feature maps from shrinking drastically. The input to this model is an image of size 227×227×3.

The AlexNet architecture consists of eight layers: five convolutional layers and three fully-connected layers. But this isn’t what makes AlexNet special; these are some of the features it used that were new approaches to convolutional neural networks:
* ReLU Nonlinearity. AlexNet uses Rectified Linear Units (ReLU) instead of the tanh function, which was standard at the time. ReLU’s advantage is in training time; a CNN using ReLU was able to reach a 25% training error on the CIFAR-10 dataset six times faster than a CNN using tanh.
* Multiple GPUs. Back in the day, GPUs were still rolling around with 3 gigabytes of memory (nowadays those kinds of memory would be rookie numbers). This was especially bad because the training set had 1.2 million images. AlexNet allows for multi-GPU training by putting half of the model’s neurons on one GPU and the other half on another GPU. Not only does this mean that a bigger model can be trained, but it also cuts down on the training time.
* Overlapping Pooling. CNNs traditionally “pool” outputs of neighboring groups of neurons with no overlapping. However, when the authors introduced overlap (a 3×3 window moved with a stride of 2), they saw the top-1 and top-5 error rates fall by about 0.4% and 0.3% respectively, and found that models with overlapping pooling are generally slightly harder to overfit.
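
To make the overlapping pooling idea concrete, here is a small NumPy sketch that pools the same feature map with a non-overlapping window (size 2, stride 2) and an overlapping one (size 3, stride 2). The helper `max_pool2d` and the 5×5 toy input are made up for the demo and are not taken from any AlexNet implementation.

```python
import numpy as np

def max_pool2d(x, size, stride):
    """Max-pool a 2-D array with a square window of side `size`."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + size, c:c + size].max()
    return out

feature_map = np.arange(25).reshape(5, 5)  # toy 5x5 input holding 0..24

# Window equals stride: neighbouring windows never share an input value.
non_overlapping = max_pool2d(feature_map, size=2, stride=2)

# Window larger than stride: neighbouring windows share a row or column.
overlapping = max_pool2d(feature_map, size=3, stride=2)

print(non_overlapping)
print(overlapping)
```

Both settings produce a 2×2 output here, but with the overlapping window each output also sees values that belong to its neighbour's window.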


### 5. Describe the vanishing gradient problem.

### 6. What is NORMALIZATION OF LOCAL RESPONSE?

Local Response Normalization is a normalization layer that implements the idea of lateral inhibition. Lateral inhibition is a concept in neurobiology that refers to the phenomenon of an excited neuron inhibiting its neighbours: this leads to a peak in the form of a local maximum, creating contrast in that area and increasing sensory perception. In practice, we can either normalize within the same channel or normalize across channels when we apply LRN to convolutional neural networks.
![image.png](attachment:image.png)
where size is the number of neighbouring channels used for normalization, α is a multiplicative factor, β an exponent, and k an additive factor.
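
The cross-channel variant can be sketched in NumPy as follows. This assumes the common convention in which the multiplicative factor is divided by the window size, an odd window size, and default values (`size=5`, `alpha=1e-4`, `beta=0.75`, `k=1.0`) picked for illustration; the original AlexNet formulation applies the multiplicative factor without dividing by the window size, so treat the exact scaling as an assumption.

```python
import numpy as np

def local_response_norm(a, size=5, alpha=1e-4, beta=0.75, k=1.0):
    """Cross-channel LRN for an array shaped (channels, height, width).

    Assumes an odd `size`, so the window is centred on each channel.
    """
    channels = a.shape[0]
    half = size // 2
    squared = a ** 2
    out = np.empty_like(a, dtype=float)
    for c in range(channels):
        # Neighbouring channels, clipped at the first and last channel.
        lo, hi = max(0, c - half), min(channels, c + half + 1)
        window_sum = squared[lo:hi].sum(axis=0)
        out[c] = a[c] / (k + alpha / size * window_sum) ** beta
    return out

rng = np.random.default_rng(0)
activations = rng.normal(size=(8, 4, 4))  # 8 channels of 4x4 feature maps
normalized = local_response_norm(activations)
print(normalized.shape)
```

A strongly activated channel increases the denominator for its neighbours, which is the lateral inhibition effect described above.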

### 7. In AlexNet, what WEIGHT REGULARIZATION was used?

The regularization used in AlexNet is L2 with a weight decay of 5e-4. It was trained on two GTX 580 GPUs, each with 3 GB of memory. It achieved a top-5 error rate of 16.4% in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
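
As a sketch of how an L2 weight decay of 5e-4 enters a training step, the snippet below applies one momentum-SGD update in NumPy. The momentum of 0.9 and learning rate of 0.01 are assumed values for illustration, and `sgd_step` is a hypothetical helper rather than code from any AlexNet implementation.

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """One momentum-SGD update with L2 weight decay.

    The decay term pulls every weight towards zero in proportion to its
    current value, independently of the loss gradient.
    """
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_new, v_new

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
zero_grad = np.zeros_like(w)  # no loss gradient, so only the decay acts

w_next, v_next = sgd_step(w, v, zero_grad)
print(w_next)  # both weights move slightly towards zero
```

With a zero gradient the only change comes from the decay term, which makes its shrinking effect easy to see in isolation.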

### 8. Using our own terms and diagrams, explain VGGNET ARCHITECTURE.



![image.png](attachment:image.png)
This architecture is from the VGG group at Oxford. It improves on AlexNet by replacing the large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. For a given receptive field (the effective area of the input image on which an output depends), a stack of several smaller kernels is better than a single larger kernel, because the multiple non-linear layers increase the depth of the network, which enables it to learn more complex features, and at a lower cost.

For example, three 3×3 filters stacked on top of each other with stride 1 have a receptive field of size 7, but the number of parameters involved is 3 × (9C^2) = 27C^2, compared with 49C^2 parameters for a single kernel of size 7. Here, it is assumed that the number of input and output channels of each layer is C. Also, 3×3 kernels help in retaining finer-level properties of the image. The network architecture is given in the table.
You can see that in VGG-D, there are blocks with the same filter size applied multiple times to extract more complex and representative features. This concept of blocks/modules became a common theme in the networks after VGG.
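
The receptive-field and parameter arithmetic above can be checked with a few lines of Python. The channel count `C = 64` is just an example value, biases are ignored, and both helper functions are hypothetical names introduced for this sketch.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

def conv_weights(kernel, channels, layers=1):
    """Weights (no biases) for `layers` convs with C inputs and C outputs."""
    return layers * kernel * kernel * channels * channels

C = 64  # example channel count

rf_stack = stacked_receptive_field(kernel=3, layers=3)
params_stack = conv_weights(kernel=3, channels=C, layers=3)  # 27 * C^2
params_single = conv_weights(kernel=7, channels=C)           # 49 * C^2

print("receptive field of three 3x3 layers:", rf_stack)
print("weights, three 3x3 layers:", params_stack)
print("weights, one 7x7 layer:  ", params_single)
```

For the same 7×7 receptive field, the stacked version needs 27/49 of the weights, a little over half.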

The VGG convolutional layers are followed by 3 fully connected layers. The width of the network starts at a small value of 64 and increases by a factor of 2 after every sub-sampling/pooling layer. It achieves a top-5 accuracy of 92.3% on ImageNet.

### 9. Describe VGGNET CONFIGURATIONS.



Table 1, shown below (and also displayed in the previous section), lists all network configurations of the VGG architecture. These networks follow the same design principles, but differ in depth.
![image.png](attachment:image.png)
Table 1. ConvNet configurations (displayed in columns). The depth of the configurations increases from left (A) to right (E), as more layers are added (displayed in bold). The convolutional layer parameters are denoted as “conv(receptive field size)-(number of channels)”. The ReLU activation function is not shown here.
![image-2.png](attachment:image-2.png)

Table 2. Number of parameters (in millions)
* Table 1 presents a lot of information and appears in almost every description of VGG.
* Point 1: It compares 6 networks; from A to E the network gets deeper, with layers added at each step to verify their effect.
* Point 2: Each column describes the structure of one network in detail.
* Point 3: The approach starts from the simplest solution to the problem at hand and then gradually refines it.
* Network A: Start with a shallow network, which converges easily on ImageNet.
* Network A-LRN: Add LRN, which AlexNet had reported to be effective, but here it turns out not to help.
* Network B: Add 2 more layers. This appears to be effective.
* Network C: Add three more layers of 1×1 convolution. The network still converges, and the results are better again.
* Network D: Change the 1×1 convolution kernels to 3×3. The results improve once more, and this appears to be the best configuration (2014).
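
The depth and parameter counts of the configurations can be reproduced with a short Python sketch. The channel lists below follow the widely used layout for configurations A, B, D and E (`"M"` marks a max-pooling layer); configurations A-LRN and C are left out for brevity. The 224×224×3 input, 1000 output classes, two 4096-unit fully-connected layers, and the inclusion of biases are assumptions of this sketch.

```python
# Channel counts per 3x3 conv layer; "M" marks a 2x2 max-pooling layer.
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M",
          512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def weight_layers(cfg):
    """Conv layers plus the three fully-connected layers."""
    return sum(1 for item in cfg if item != "M") + 3

def count_parameters(cfg, in_channels=3, image_size=224, num_classes=1000):
    """Total weights and biases for one configuration."""
    total = 0
    channels, size = in_channels, image_size
    for item in cfg:
        if item == "M":
            size //= 2  # each pooling layer halves the spatial size
        else:
            total += 3 * 3 * channels * item + item  # 3x3 kernels + biases
            channels = item
    fc_sizes = [channels * size * size, 4096, 4096, num_classes]
    for n_in, n_out in zip(fc_sizes, fc_sizes[1:]):
        total += n_in * n_out + n_out
    return total

for name, cfg in CONFIGS.items():
    millions = count_parameters(cfg) / 1e6
    print(f"VGG-{name}: {weight_layers(cfg)} weight layers, {millions:.0f}M parameters")
```

Under these assumptions the deeper configurations add relatively few parameters, because most of the weights sit in the first fully-connected layer.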

### 10. What regularization methods are used in VGGNET to prevent overfitting?


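VGGNet training is regularized by weight decay (an L2 penalty with a multiplier of 5e-4) and by dropout with a ratio of 0.5 on the first two fully-connected layers. Moreover, using convolutional layers may itself be considered a form of regularization (weight sharing). Section 3 of the VGG paper also discusses the impact of weight initialization on performance and describes data augmentation by scale jittering.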
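
Data augmentation by scaling can be sketched in NumPy as follows: rescale the image so that its shorter side equals a randomly chosen length, then take a random fixed-size crop. The scale range of 256 to 512, the 224×224 crop, and the nearest-neighbour resize are assumptions made to keep the example self-contained; a real pipeline would normally use an image library with proper interpolation.

```python
import numpy as np

def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of an (H, W, C) array."""
    rows = np.arange(new_h) * image.shape[0] // new_h
    cols = np.arange(new_w) * image.shape[1] // new_w
    return image[rows[:, None], cols]

def scale_jitter_crop(image, rng, s_min=256, s_max=512, crop=224):
    """Rescale so the shorter side is a random S, then crop randomly."""
    s = int(rng.integers(s_min, s_max + 1))  # upper bound is exclusive
    h, w = image.shape[:2]
    factor = s / min(h, w)
    new_h = max(s, round(h * factor))
    new_w = max(s, round(w * factor))
    resized = resize_nearest(image, new_h, new_w)
    top = int(rng.integers(0, new_h - crop + 1))
    left = int(rng.integers(0, new_w - crop + 1))
    return resized[top:top + crop, left:left + crop]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)  # fake image
patch = scale_jitter_crop(image, rng)
print(patch.shape)
```

Because the scale changes from call to call, the network sees the same objects at different sizes, which acts as a regularizer alongside the other methods.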