# Inception V4
**In this exercise you’ll need to explain the <a href="https://arxiv.org/abs/1602.07261">Inception V4 (GoogLeNet V4)</a> architecture.**

**source links:**<br>https://dl.acm.org/doi/10.5555/3298023.3298188<br>https://arxiv.org/pdf/1602.07261.pdf<br>

The Inception module was first introduced in `Inception-v1 / GoogLeNet`.

When **Microsoft** researchers published their `ResNet` findings, **Google** followed suit with `Inception v4`.


*Advantages*
- Uniform simplified architecture
- More Inception modules
- DistBelief replaced by TensorFlow

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. 

**from the paper:**
*These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge*


## 1. Explain all the different parts of the architecture. 

**source link:**<br>https://medium.com/@ilias_mansouri/part-4-image-classification-9a8bc9310891<br>

The `Inception` architecture is highly tunable: the number of filters in the various layers can be changed in many ways without affecting the quality of the fully trained network. To optimize training speed, the authors used to tune the layer sizes carefully in order to balance the computation between the various model sub-networks.

Within an Inception module, the input passes through 1×1, 3×3 and 5×5 convolutions and a max-pooling branch in parallel, and the branch outputs are concatenated to form the module's output. Thus, we don’t need to decide which single filter size should be used at each layer.
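The parallel-branch idea can be sketched at the level of tensor shapes. This is only a minimal sketch, not code from the paper: the helper name `naive_inception_output_shape` and the filter counts are hypothetical, and every branch is assumed to use stride 1 with 'same' padding so that only the channel dimension changes.

```python
def naive_inception_output_shape(height, width, in_channels,
                                 filters_1x1, filters_3x3, filters_5x5):
    """Output shape of a naive Inception module (hypothetical helper).

    Assumes stride 1 and 'same' padding in every branch, so height and
    width are preserved and only the channel counts differ.
    """
    branch_channels = [
        filters_1x1,   # 1x1 convolution branch
        filters_3x3,   # 3x3 convolution branch
        filters_5x5,   # 5x5 convolution branch
        in_channels,   # max pooling keeps the input channel count
    ]
    # The branch outputs are concatenated along the channel axis.
    return (height, width, sum(branch_channels))


# Illustrative numbers only (not taken from the paper).
shape = naive_inception_output_shape(28, 28, 192, 64, 128, 32)
print(shape)
```

Because the pooling branch passes all input channels through, the channel count grows with every module unless it is reduced again, which is one motivation for the 1×1 reductions discussed later.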

![Screenshot%20from%202020-06-12%2015-20-55.png](attachment:Screenshot%20from%202020-06-12%2015-20-55.png)

The step time of Inception-v4 proved to be significantly slower in practice than that of Inception-ResNet-v2, which has a similar raw computational cost, probably due to the larger number of layers.

As we can see above, the team put a lot of effort into obtaining a more general and simplified architecture. Unfortunately, the internal workings of Inception are still far from simple, all the more so because the authors give little explanation for their architectural choices.




## What are the different layers of the network? 

**source links:**<br>https://www.youtube.com/watch?v=JNKnlNGPpS4<br>https://www.youtube.com/watch?v=C86ZXvgpejM<br>https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

 The input image first passes through the `Stem`.
 ![Screenshot%20from%202020-06-12%2018-43-57.png](attachment:Screenshot%20from%202020-06-12%2018-43-57.png)

The output of the `Stem` is then fed to the `Inception A` block. For the sake of clarity, you can compare the different Inception blocks below:
![Screenshot%20from%202020-06-12%2018-46-35.png](attachment:Screenshot%20from%202020-06-12%2018-46-35.png)

We can see that all blocks make use of the aforementioned dimensionality reduction paradigm by using 1x1 convolutions. The reason is that even a few 5x5 convolutions on top of a convolutional layer with a large number of filters quickly become very expensive, so 1x1 convolutions are used to reduce the number of channels before the more expensive convolutions.
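A rough cost comparison shows why the 1x1 reduction helps. The sketch below counts multiplications only (biases ignored) for a stride-1, 'same'-padded layer; the helper `conv_multiplies` and all layer sizes are hypothetical values chosen for illustration, not figures from the paper.

```python
def conv_multiplies(height, width, kernel, in_channels, out_channels):
    """Multiplications for a stride-1, 'same'-padded square convolution."""
    return height * width * kernel * kernel * in_channels * out_channels


# Hypothetical layer sizes, for illustration only.
H, W = 28, 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_multiplies(H, W, 5, C_IN, C_OUT)

# 1x1 reduction to C_REDUCED channels, followed by the 5x5 convolution.
reduced = (conv_multiplies(H, W, 1, C_IN, C_REDUCED)
           + conv_multiplies(H, W, 5, C_REDUCED, C_OUT))

print(direct, reduced)
```

With these assumed sizes the reduced path needs roughly a tenth of the multiplications of the direct 5x5 convolution.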


Inception v4 introduced specialized “Reduction Blocks” which are used to change the width and height of the grid.
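How a reduction block shrinks the grid can be checked with the usual output-size formula for an unpadded ('valid') operation. In this sketch the helper name `valid_output_size` is hypothetical, and the grid sizes 35 and 17 with a 3x3 kernel and stride 2 are assumptions based on the sizes commonly reported for Inception-v4, since the text above does not state them.

```python
def valid_output_size(size, kernel, stride):
    """Output length of a convolution or pooling op with no padding."""
    return (size - kernel) // stride + 1


# Assumed grid sizes: 35x35 before the first reduction block,
# 17x17 before the second one.
after_first_reduction = valid_output_size(35, kernel=3, stride=2)
after_second_reduction = valid_output_size(after_first_reduction, kernel=3, stride=2)

print(after_first_reduction, after_second_reduction)  # 17 8
```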

![Screenshot%20from%202020-06-12%2022-08-56.png](attachment:Screenshot%20from%202020-06-12%2022-08-56.png)


## What is the final Accuracy score? 

![Screenshot%20from%202020-06-12%2023-13-46.png](attachment:Screenshot%20from%202020-06-12%2023-13-46.png)




Inception-v4: a pure Inception variant without residual connections with roughly
the same recognition performance as Inception-ResNet-v2.

To prevent the middle part of the network from “dying out”, the authors of GoogLeNet (Inception-v1) introduced two auxiliary classifiers. They essentially applied softmax to the outputs of two of the inception modules and computed an auxiliary loss over the same labels. `The total loss function` is a weighted sum of the auxiliary losses and the real loss. The weight used in the GoogLeNet paper was 0.3 for each auxiliary loss. The `auxiliary loss` is used purely for training and is ignored during inference.
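The weighted sum can be written down in a few lines. This is a minimal sketch: the function name `total_training_loss` and the loss values are hypothetical, the main loss is assumed to have weight 1, and only the 0.3 weight comes from the text above.

```python
def total_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus each auxiliary loss scaled by aux_weight."""
    return main_loss + aux_weight * sum(aux_losses)


# Made-up loss values for one training step.
main_loss = 1.2
aux_losses = [1.5, 1.4]  # one value per auxiliary classifier

train_loss = total_training_loss(main_loss, aux_losses)
# At inference time the auxiliary heads are dropped entirely.
print(train_loss)
```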

The term `top-5 error rate` refers to a method of benchmarking machine learning models in the ImageNet Large Scale Visual Recognition Competition.
The model is considered to have classified a given image correctly if the target label is one of the model’s top 5 predictions.

First, you make a prediction using the model and obtain the predicted multinomial distribution over the classes ($\sum_{c} p_{c} = 1$).
Now, in the case of `top-1 score`, you check if the top class (the one having the highest probability) is the same as the target label.
In the case of `top-5 score`, you check if the target label is one of your top 5 predictions (the 5 ones with the highest probabilities).

In both cases, the `top score` is computed as the number of times a predicted label matched the target label, divided by the number of data points evaluated. The corresponding error rate is one minus this score.
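The top-k check described above can be sketched as follows. The function name `top_k_accuracy` and the toy probabilities are made up for illustration; with only three classes the example uses k = 1 and k = 2, but top-5 works the same way with k = 5.

```python
def top_k_accuracy(probabilities, targets, k):
    """Fraction of samples whose target is among the k most probable classes."""
    hits = 0
    for probs, target in zip(probabilities, targets):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy predictions over three classes (each row sums to 1).
probs = [
    [0.10, 0.60, 0.30],  # target 1: correct at k=1
    [0.50, 0.20, 0.30],  # target 2: wrong at k=1, correct at k=2
    [0.70, 0.20, 0.10],  # target 2: wrong at k=1 and k=2
    [0.25, 0.35, 0.40],  # target 2: correct at k=1
]
targets = [1, 2, 2, 2]

top1 = top_k_accuracy(probs, targets, k=1)
top2 = top_k_accuracy(probs, targets, k=2)
print(top1, top2)              # 0.5 0.75
print(1 - top1, 1 - top2)      # matching error rates: 0.5 0.25
```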

![Screenshot%20from%202020-06-12%2022-27-20.png](attachment:Screenshot%20from%202020-06-12%2022-27-20.png)

## How many parameters are present in the network?


**source links:**<br>https://pdfs.semanticscholar.org/73ac/009051bba99eaea799172b28d69168b6aa02.pdf<br>https://medium.com/@sh.tsang/review-inception-v3-1st-runner-up-image-classification-in-ilsvrc-2015-17915421f77c<br>https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202

GoogLeNet has 9 inception modules stacked linearly. It is 22 layers deep (27, including the pooling layers). It uses global average pooling at the end of the last inception module.


    AlexNet [2]: 60 million parameters
    VGGNet [3]: 3× more parameters than AlexNet
    GoogLeNet / Inception-v1 [4]: 7 million parameters

## 2. Explain what is the difference between GoogLeNet V2, V3 and V4?


**source link:**<br>https://datascience.stackexchange.com/questions/15328/what-is-the-difference-between-inception-v2-and-inception-v3<br>

There are 4 versions. The first GoogLeNet is Inception-v1, but the Inception-v3 paper contains numerous typos that have led to wrong descriptions of the Inception versions. These may be due to the intense ILSVRC competition at the time. Consequently, many reviews on the internet mix up v2 and v3, and some even treat v2 and v3 as the same network with only minor differences in settings.

Nevertheless, in the Inception-v4 paper, Google gives a much clearer description of the versions:
   
    “The Inception deep convolutional architecture was introduced as GoogLeNet in (Szegedy et al. 2015a), here named Inception-v1. Later the Inception architecture was refined in various ways, first by the introduction of batch normalization (Ioffe and Szegedy 2015) (Inception-v2). Later by additional factorization ideas in the third iteration (Szegedy et al. 2015b) which will be referred to as Inception-v3 in this report.”

Main differences:
*  The earlier versions didn’t explicitly have reduction blocks, but the functionality was implemented.
*  Inception v2 used a separable convolution of depth 64 as its first layer. It was dropped in v3, v4 and Inception-ResNet, but re-introduced and heavily used in MobileNet later.
*  Batch normalization (BN) and factorization were introduced in Inception-v2.
*  In Inception-v2, the 5×5 conv was replaced by two 3×3 convs to reduce the number of parameters.
*  Inception-v4, evolved from GoogLeNet / Inception-v1, has a more uniform simplified architecture and more inception modules than Inception-v3.
*  Inception-v4 itself has no residual connections. The residual variants from the same paper (Inception-ResNet-v1 and Inception-ResNet-v2) combine Inception modules with residual connections; Inception-ResNet-v1 has a computational cost similar to Inception-v3.
*  The techniques from Inception-v1 to Inception-v3 are used in Inception-v4.
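The saving from replacing one 5×5 convolution with two stacked 3×3 convolutions can be checked with a small count. This is a sketch under stated assumptions: the helper names are hypothetical, biases are ignored, and the channel count `C` is an arbitrary illustrative value that is kept the same for every layer.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in a convolution layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    return 1 + sum(k - 1 for k in kernel_sizes)


C = 64  # hypothetical channel count, same for input and output

single_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)
saving = 1 - two_3x3 / single_5x5

print(single_5x5, two_3x3)               # 102400 73728
print(stacked_receptive_field([3, 3]))   # 5, same coverage as one 5x5
```

Under these assumptions the two 3×3 layers cover the same 5×5 region with 28% fewer weights.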

## 3. What is Batch Normalization? 

**source links:**<br>https://towardsdatascience.com/review-inception-v4-evolved-from-googlenet-merged-with-resnet-idea-image-classification-5e8c339d18bc<br>https://medium.com/@ilias_mansouri/part-4-image-classification-9a8bc9310891

In February 2015, the same team of researchers at Google introduced the concept of `Batch Normalization` (BN) to the `Inception architecture`.



`Batch Normalization` is the solution to the `Internal Covariate Shift phenomenon`, which arises because "the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change". <br>This slows down training because it requires lower learning rates and careful parameter initialization, and it makes models with saturating nonlinearities hard to train. `BN` forces the inputs of each layer to have similar distributions at each training step by normalizing the mean and variance of each of the features at every level of representation during training. `BN` was the next coolest kid on the block and is now a widely adopted technique.<br>
The `Google` team did not stop there: in December 2015 they presented another paper containing a further set of upgrades to the initial Inception architecture. The first concrete improvement was the factorization of convolutions with large filter sizes.

![Screenshot%20from%202020-06-12%2021-46-50.png](attachment:Screenshot%20from%202020-06-12%2021-46-50.png)

## Why is it important here?

ReLU is used as the activation function to address the saturation problem and the resulting vanishing gradients, but it also makes the output more irregular. It is advantageous for the distribution of a layer's input X to remain fixed over time, because a small change is amplified as the network goes deeper. With `BN` keeping these distributions stable, a higher learning rate can be used.

An efficient grid size reduction module was also introduced, which is less expensive while keeping the network effective.
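The normalization step itself is short. The following is a minimal sketch for a single feature over one mini-batch, using only training-time batch statistics; the function name `batch_norm` and the sample values are hypothetical, `gamma` and `beta` stand for the learnable scale and shift, and the running averages used at inference time are left out.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps avoids division by zero when the batch variance is tiny.
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta
            for v in values]


batch = [2.0, 4.0, 6.0, 8.0]  # one feature, four samples (made-up values)
normalized = batch_norm(batch)

new_mean = sum(normalized) / len(normalized)
new_variance = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(new_mean, new_variance)  # close to 0 and close to 1
```

Whatever the scale of the incoming values, the next layer sees inputs with roughly zero mean and unit variance, which is what lets the distribution stay stable during training.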

Factorization was introduced in the convolution layers to further reduce the number of parameters, which in turn helps against overfitting. For example:
- Using a 3×3 filter, number of parameters = 3×3 = 9
- Using 3×1 and 1×3 filters, number of parameters = 3×1 + 1×3 = 6
- The number of parameters is reduced by 33%
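A small numeric check illustrates this factorization. In the sketch below the helper `correlate2d_valid`, the kernels and the toy image are all made up. It shows that applying a 3×1 filter followed by a 1×3 filter gives the same result as the 3×3 filter formed by their outer product. This equality only holds for such separable kernels and without a nonlinearity in between, so it is an illustration of the parameter saving rather than a claim about how the network's learned filters behave.

```python
def correlate2d_valid(image, kernel):
    """Plain 2D cross-correlation with no padding (lists of lists)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


column = [[1], [2], [1]]   # 3x1 filter: 3 weights
row = [[1, 0, -1]]         # 1x3 filter: 3 weights
# Equivalent 3x3 filter (outer product): 9 weights.
full = [[c[0] * r for r in row[0]] for c in column]

# Small integer test image, 5x5.
image = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]

two_pass = correlate2d_valid(correlate2d_valid(image, column), row)
one_pass = correlate2d_valid(image, full)

print(two_pass == one_pass)  # True
```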
