# 📐 Architecture 


Convolutional neural networks (CNNs) are crazy and amazing. Here we will break the network into 2 parts, or 2 stages. The first is the part **that we are interested in**, and the second is what we already know about.


<img src="../images/cnn-arc.png" height=400 width=500>

**Part 1:** 
- Is a **feature transformer** as it usually is in the regular ANN.
- This time, it is made ***specially*** for the images.
- It is filled with **convolution** and **pooling** layers.
- Here each convolution layer **extracts** some feature from the image and forwards that information to the next layers.

**Part 2:** 
- Is made of a **bunch of regular dense** layers.
- They perform the usual **non-linear** transformations.
- These layers are called "**fully connected layers**" because every unit in a layer is connected to every output of the previous layer, so the first dense layer sees all of the convolution and pooling output at once.

## 🏊 Pooling
*Nah, the emoji isn't perfect.*

> Pooling, at a high level, performs a ***downsampling*** operation. It **shrinks** the output of the previous conv layer.

<img src="../images/pooling.png" height=400 width=500>

#### `2` kinds of pooling:
1. Max *(more common)*
2. Average

<img src="../images/pooling-types.png" height=400 width=500>

The given example is of **max** pooling. Average pooling works the same way, except that each window is replaced by its average instead of its maximum.
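
To make the two modes concrete, here is a minimal pure-Python sketch. The function name `pool2d`, its parameters, and the 4x4 numbers are made up for illustration; they are not from the lecture or from any library.

```python
def pool2d(image, size=2, stride=2, mode="max"):
    """Slide a size x size window over a 2D list and reduce each window."""
    rows, cols = len(image), len(image[0])
    reduce_fn = max if mode == "max" else (lambda w: sum(w) / len(w))
    output = []
    for r in range(0, rows - size + 1, stride):
        out_row = []
        for c in range(0, cols - size + 1, stride):
            # Collect every value covered by the window at (r, c).
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            out_row.append(reduce_fn(window))
        output.append(out_row)
    return output

# Arbitrary 4x4 "image" used only for this example.
image = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
max_pooled = pool2d(image, mode="max")      # [[6, 5], [7, 9]]
avg_pooled = pool2d(image, mode="average")  # [[3.5, 2.0], [2.5, 6.0]]
```

Both calls turn the 4x4 input into a 2x2 output. Only the rule used inside each window differs.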

#### But why?
- **Practical**: Less data to process, so it is faster.
- **Translational Invariance**: Pooling keeps the strongest response in each window and drops its exact position.
    - Meaning, in a simple image a nose can be anywhere, right?
    - Pooling says: *"I don't care **where** the feature is, I care that **it does** exist."*
    - Thus, it helps the model to **generalize better**.

<img src="../images/transational0invariance.png" height=400 width=500>

In the image above ↑ we have the letter "A". During training, the NN sees the shape "A" in different places, and without pooling it may end up memorizing exact pixel positions, like:

> *hey, if the point is on 5th row and 55th col and 556th row and 6678th col, then it is A*.

This simply isn't a well-generalized model. If "A" happens to be somewhere else, then this will cause a problem. This is **where pooling helps**. It makes the model more ***translationally invariant***, that is, less sensitive to where the feature appears.
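
A tiny sketch of that idea, assuming a made-up 4x4 feature map where the value 9 stands for "feature detected here". The helper `max_pool2d` is hypothetical and written only for this example.

```python
def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equals the window size)."""
    return [
        [max(fmap[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(fmap[0]) - size + 1, size)]
        for r in range(0, len(fmap) - size + 1, size)
    ]

# The same feature (the 9) detected at two nearby positions.
feature_at_top_left = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
feature_shifted = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
pooled_a = max_pool2d(feature_at_top_left)
pooled_b = max_pool2d(feature_shifted)
same_after_pooling = pooled_a == pooled_b   # True: both are [[9, 0], [0, 0]]
```

Note that a single 2x2 pooling only absorbs shifts that stay inside the same window. Larger shifts are handled gradually, as more conv and pooling layers are stacked.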
___

Here is a nice [question](https://stats.stackexchange.com/questions/208936/what-is-translation-invariance-in-computer-vision-and-convolutional-neural-netwo) thread that discusses the same topic and the image below shows the same problem:

<img src="https://i.stack.imgur.com/iY5n5.png" height=700 width=500>

###  Unlike convolution...

1. A pooling window ***can be*** non-square. Meaning, it can be $3\times2$ or $5\times2$ etc.
    - Well, a convolution filter ***can be*** too, but that's not very conventional.
2. Pooling windows can **overlap**.
    - Here we can set the hyperparameter "stride" to control the overlap.

<img src="../images/pooling-types.png" height=400 width=500>

This image has:
- **Pooling**: 2x2 | **Stride**:2
    - That's why the boxes **didn't** overlap

If:
- **Pooling**: 2x2 | **Stride**:1
    - Then the box **would** overlap
- **Pooling**: 3x3 | **Stride**:2
    - Then the box **would** overlap
- **Pooling**: 3x3 | **Stride**:3
     - Then the box **wouldn't** overlap

I think this gives a pretty good understanding of how the stride works.
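
The four cases above boil down to one comparison: windows overlap whenever the stride is smaller than the pool size. Here is a small sketch of that rule, plus the usual no-padding output-size formula. The function names and the 6-pixel-wide input are assumptions for illustration.

```python
def pooled_size(input_size, pool, stride):
    """Number of window positions along one axis (no padding)."""
    return (input_size - pool) // stride + 1

def windows_overlap(pool, stride):
    """Adjacent windows share pixels whenever the step is smaller than the window."""
    return stride < pool

# (pool size, stride) pairs taken from the list above.
cases = [(2, 2), (2, 1), (3, 2), (3, 3)]

overlaps = {case: windows_overlap(*case) for case in cases}
# {(2, 2): False, (2, 1): True, (3, 2): True, (3, 3): False}

sizes = {case: pooled_size(6, *case) for case in cases}
# for a 6-pixel-wide input: {(2, 2): 3, (2, 1): 5, (3, 2): 2, (3, 3): 2}
```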

## 🌴 CNN learns hierarchically
<img src="../images/c-pool-c-pool.png" height=400 width=500>

This answers the question: *"why conv layer - pooling - conv - pooling...?"*

The reason is: CNN learns hierarchically. Research has shown that the NN learns the **basic** features like simple strokes first, then looks for **higher-order** features, and so on. So, each feature map *(here, image)* is *pooled* before the next convolution.

#### After pooling...

<img src="../images/after-pooling.png" height=400 width=500>

The image size shrinks. **But** the filter size stays the same.

<img src="../images/increased-relation.png" height=400 width=500>

See how, in the ***bigger*** image, the filter could only learn small stuff like strokes, and as the image shrinks, the same-sized filter covers more of it and starts learning bigger shapes like the **whole face**.

> ***This is what leads the CNN to learn the features hierarchically.***
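
One way to put numbers on this is the *receptive field*: how many pixels of the original image a single output value depends on. The sketch below uses the standard receptive-field recurrence. The layer stack (3x3 convs, 2x2 pools) is an assumed example, not the network from the figures.

```python
def receptive_fields(layers):
    """Track how many input pixels one output value 'sees' after each layer.

    layers: list of (name, kernel_size, stride) tuples, applied in order.
    """
    field, jump = 1, 1   # one input pixel sees itself; neighbours are 1 pixel apart
    history = []
    for name, kernel, stride in layers:
        field += (kernel - 1) * jump   # each extra kernel tap adds `jump` input pixels
        jump *= stride                 # striding spreads neighbouring outputs apart
        history.append((name, field))
    return history

# Hypothetical conv -> pool -> conv -> pool -> conv stack.
stack = [("conv1", 3, 1), ("pool1", 2, 2), ("conv2", 3, 1), ("pool2", 2, 2), ("conv3", 3, 1)]
growth = receptive_fields(stack)
# [('conv1', 3), ('pool1', 4), ('conv2', 8), ('pool2', 10), ('conv3', 18)]
```

The filter is 3x3 in every conv layer, yet by `conv3` each output value depends on an 18-pixel-wide patch of the original image.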

## 🗺️ Losing information?
At each successive layer, we are *shrinking* the image. Thus, there is *some* loss.
- Since we **care** about the feature's existence, we keep values which **indicate** that the feature is there, but we **don't know** exactly where it is.
- From the first layer on, we look for features, and when one is found, the following convolutions and poolings carry it into the upcoming layers.
- The ***spatial*** information decreases at each layer, **but** the number of **features** increases.
- So, we don't care **where** the feature was found, but we care that it **was** found.
- Hence, the number of feature maps increases. There can be many features at once.
    > **Feature maps** are the output of one layer propagated to the next layer. If you recall, we stack all feature maps into a single image, making it 3-dimensional. See the image below to get the idea... <br> <br>  <br><img src="../images/featuremap-increases.png" height=400 width=500>

## 👍 Rule of Thumb
We studied ANN, right? There weren't many hyperparameters. There were indeed more than in classical machine learning, but on our side it wasn't a very high number. **Here**, we have many more.

- Choose filter size
- Number of feature maps
- Number of layers
- Pool size
- Pool mode

...

### Guideline
There is a general pattern that many people follow in this space. So, we can use that as our starting point.
- Small filter size relative to the image: $3 \times 3, 5 \times 5, 7\times 7$
- Repeat: Conv → Pool → Conv → Pool
- Increase # of feature maps: 32, 64, 128, 128...
- *Read lots of papers!*


*(The given guidelines are from the lecture itself, I have not altered anything)*

## 😮‍💨  Convolution can have "stride"
What does that mean? Let's understand.

- **Striding** means the filter ***skips*** pixels as it slides over the image.
- We do it because, generally in an image, the **neighboring pixels are almost always highly correlated**.
- Processing nearly the same information again and again costs more computation, so instead we just skip those pixels.
- **Doing so** will ***reduce*** the size of the output!

Have a look at this animation:

<img src="../images/convolution-strided.gif" height=300 width=300>

So, **with stride** we are already shrinking the image. Researchers have found that using stride in the convolution and *not* using a pooling layer can work just as well, and is sometimes more efficient.
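
Here is a minimal sketch of a strided convolution in pure Python, to show the shrinking directly. The function `conv2d`, the 5x5 ramp image and the diagonal kernel are all made up for illustration. Like most deep learning libraries, it computes a cross-correlation (no kernel flip) with no padding.

```python
def conv2d(image, kernel, stride=1):
    """'Valid' 2D convolution (cross-correlation, no padding) on nested lists."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            # Multiply the window at (r, c) by the kernel and sum everything up.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

image = [[r * 5 + c for c in range(5)] for r in range(5)]   # 5x5 ramp image
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                   # sums the window's diagonal

dense_out = conv2d(image, kernel, stride=1)    # 3x3 output
strided_out = conv2d(image, kernel, stride=2)  # 2x2 output: [[18, 24], [48, 54]]
```

The strided result is exactly every second value of the stride-1 result, so the skipped positions are simply never computed.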

<img src="../images/stride-vs-pool.png" height=400 width=500>

## 📔📕 What about the 2nd part?
The dense layer?

That fully connected layer has to take the **image** in a **flattened** form: a batch of shape $N \times H \times W \times C$ becomes $N \times D$, where $D = H \cdot W \cdot C$.
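
A dense layer works on one plain vector per sample, so flattening just unrolls each image into a single row. Below is a small sketch, assuming a channels-last layout (N x H x W x C). The function name `flatten_batch` and the numbers are hypothetical.

```python
def flatten_batch(batch):
    """Turn N x H x W x C nested lists into N rows of length H*W*C."""
    return [[value for row in image for pixel in row for value in pixel]
            for image in batch]

# Two tiny "images", each 2 x 2 with 3 feature maps (values are arbitrary).
batch = [
    [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
    [[[0, 0, 0], [1, 1, 1]], [[2, 2, 2], [3, 3, 3]]],
]
flat = flatten_batch(batch)
shape = (len(flat), len(flat[0]))   # (2, 12): N samples, H*W*C values each
```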

## ◼️▪️◻️▫️ Different shaped images?
For differently sized images... say:
- An image of 32 x 32 is passed and after 4 convolutions it becomes 4 x 4
    - Now, with 10 feature maps we will have a vector of 4 x 4 x 10 = 160 values (for the fully connected layer)
- If another image of 64 x 64 is passed, after the same 4 convolutions it becomes 8 x 8 *(since it was bigger)*
    - Now, with 10 feature maps we will have a vector of 8 x 8 x 10 = 640 values (for the fully connected layer)

We can see that **it results** in a different output shape, which is not something the dense layers can support: they expect a fixed input size, but here the vector length changes with the image. So, we have something called **global max / average pooling**.

Global pooling takes the max / average over the whole spatial extent of each channel (feature map), so the result is a vector with one value per channel, no matter how big the image was. <br>
<img src="../images/global-pooling.png" height=400 width=500>

The example given above is of global max pooling, which takes `1` max from each channel.
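
A short sketch of why this fixes the shape problem, reusing the 4 x 4 x 10 and 8 x 8 x 10 sizes from the example above. The helpers `global_max_pool` and `make_maps` are hypothetical, and the feature-map values are arbitrary filler.

```python
def global_max_pool(feature_maps):
    """feature_maps: list of C two-dimensional maps -> list of C maxima."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]

def make_maps(size, channels):
    """Hypothetical feature maps filled with arbitrary numbers."""
    return [[[(r * size + c) * (ch + 1) for c in range(size)] for r in range(size)]
            for ch in range(channels)]

small = global_max_pool(make_maps(4, 10))   # from 4 x 4 x 10
large = global_max_pool(make_maps(8, 10))   # from 8 x 8 x 10
lengths = (len(small), len(large))          # (10, 10)
```

Both inputs end up as a vector of 10 values, one per feature map, so the dense layers after it always see the same input size.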

## 📑 Summary

<img src="../images/cnn-summary.png" height=400 width=500>

There, you can see:
- We start with $28 \times 28 \times 1$ *(grey)* image.
- That goes into the $5 \times 5$ convolution with 32 feature maps *(meaning apply 32 convolutions)*.
- That turns the original $28 \times 28$ image into $24 \times 24 \times 32$ *(where 32 is the number of feature maps)*.
- The result $24 \times 24 \times 32$ is then passed through $2 \times 2$ pooling, which results in the **shrunk** image of $12 \times 12 \times 32$.
- Again, we are applying the convolution of $5 \times 5$ but here with 64 feature maps *(meaning applying 64 convolutions)*.
    > Thing to note here, <br> <br> Since in the previous layer we had a $12 \times 12 \times 32$ image, in the next layer *(after 5x5 convolution)* it should be $8 \times 8 \times 32 \times 64$, right? But nah! Each convolution (each one of the 64) sums over the input channels into a single map. Thus, $8 \times 8 \times 32$ for each convolution will become $8 \times 8$ and thus for 64 fmaps, we will have $8 \times 8 \times 64$.
- Finally we apply the flatten layer and things are straightforward!
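
The shape bookkeeping in the steps above can be checked with a few lines of Python. This is only a sketch that follows the steps as listed (conv, pool, conv, flatten), assuming no padding and stride 1 for the convolutions. The helper names are made up, and the figure may contain extra layers that are not traced here.

```python
def conv_shape(shape, kernel, filters):
    """'Valid' convolution, stride 1: every filter spans all input channels."""
    height, width, _channels = shape
    # The input channels are summed away; one output map per filter remains.
    return (height - kernel + 1, width - kernel + 1, filters)

def pool_shape(shape, size):
    """Non-overlapping pooling acts on each channel separately."""
    height, width, channels = shape
    return (height // size, width // size, channels)

trace = [(28, 28, 1)]                                      # grey input image
trace.append(conv_shape(trace[-1], kernel=5, filters=32))  # (24, 24, 32)
trace.append(pool_shape(trace[-1], size=2))                # (12, 12, 32)
trace.append(conv_shape(trace[-1], kernel=5, filters=64))  # (8, 8, 64)

height, width, channels = trace[-1]
flattened_length = height * width * channels               # 8 * 8 * 64 = 4096
```

Notice that `conv_shape` ignores the incoming channel count when building the output shape. That is the "sum over channels" point from the note above: the depth of the output depends only on the number of filters.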

<img src="../images/stepwise-cnn.png" height=500 width=700>

Do read the steps above and see the illustration to understand what's going on.


# Amazing 
Let's catch up in the next book where we will write our first CNN code.