### `Convolutional Neural Network`


- When a computer tries to identify an image, it sees the image as a grid of numbers (each RGB value ranges from `0` to `255`).

<img src="images\cnn\1.png" width=600>

- The issue with matching against a fixed grid is that it is too hard-coded: if the object changes its position in the image, the computer can no longer recognize it.

<img src="images\cnn\2.png" width=600>


- Also, because the digits are handwritten, there will be variations. Each variation changes the two-dimensional representation of the number, so it no longer matches the original grid.

<img src="images\cnn\3.png" width=600>
<img src="images\cnn\4.png" width=600>

- To handle this variety in digits we can use a simple `ANN`.
- This technique works well for a small, simple image such as a handwritten number. But consider a large image, e.g. `(1920 X 1080 X 3)`, where `3` is the number of `rgb` channels. In this case:
  - First layer neurons = **`1920 X 1080 X 3 ~ 6 million`**
  - Hidden layer neurons = **`Imagine we have 4 million`**
  - Weights between **Input** and **Hidden** layer = **`6 million X 4 million = 24 trillion`**
- So about `24 trillion` weights need to be calculated for the first layer alone. A deep neural network may also have multiple hidden layers, so the total grows even larger. This is too much computation for the `ANN`.
- The disadvantages of the `ANN` for image classification are:
  - It requires too much computation.
  - It treats neighbouring pixels the same as pixels that are far apart.
  - It is sensitive to the location of an object in the image. Each weight is tied to a fixed pixel position, so if the pixels move around it becomes hard for the `ANN` to detect the object.
- To overcome these problems we use a `CNN`.
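
The weight count above can be checked with a few lines of arithmetic. This is only a sketch: the 4-million-neuron hidden layer is the illustrative size assumed in the notes, not a recommended design.

```python
# Sizes taken from the example above; the hidden layer size is an assumption.
width, height, channels = 1920, 1080, 3

input_neurons = width * height * channels   # one input neuron per pixel value
hidden_neurons = 4_000_000

# In a fully connected layer every input neuron connects to every hidden neuron.
weights = input_neurons * hidden_neurons

print(f"input neurons: {input_neurons:,}")   # 6,220,800
print(f"weights:       {weights:,}")         # 24,883,200,000,000
```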


<hr style="border:2px solid black">

**How do humans recognize an image?**

- Say we want to recognize the image of a koala. We first look at individual features such as the eyes, nose, and ears, and detect these features one by one.
- In the human brain, different sets of neurons work on these different features and recognize them.

<img src="images\cnn\5.png" width=600>

- These neurons are connected to another set of neurons that aggregates the results: if the image contains a koala's eyes, nose, and ears, then there is a koala's face in the image.

<img src="images\cnn\6.png" width=600>

- There is also a separate set of neurons that identifies the koala's hands and legs. These are connected to another set of neurons that decides there is a koala's body in the image.

<img src="images\cnn\7.png" width=600>


- At the final stage, another set of neurons decides that if the image contains a koala's head and body, then the image is of a koala.

<img src="images\cnn\8.png" width=800>

- Applying the same logic, we can identify handwritten numbers: each part of the digit is identified separately, and at the end the digit is identified as a whole.

<img src="images\cnn\9.png" width=800>


<hr style="border:2px solid black">

**Now applying the same logic so that a computer can recognize each feature**

- To do this we use the concept of a **Filter**. For the handwritten digit `9`, we use `3` **Filters** to identify `3` features of the image.

<img src="images\cnn\10.png" width=800>

- Here we take the original image and apply a ***Convolution*** (or **Filter**) operation. For example, we identify the head of the digit `9` with one **Filter**.

<img src="images\cnn\11.png" width=800>

- The ***Convolution*** operation takes a `3 X 3` grid from the original image and multiplies each number with the corresponding number in the filter. It then averages the products and puts the result in a new grid called the **Feature Map**. (Standard implementations sum the products instead of averaging; the two differ only by a constant factor.)
- The **Filter** can be of any size. We repeat this step for every `3 X 3` grid in the image, and the results together form the **Feature Map**.
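
The sliding-window step can be sketched in plain Python. This is a minimal illustration, assuming no padding, a stride of 1, and the averaging used in these notes; the `convolve2d` name, the toy image, and the loop-shaped filter are all made up for the example (`-1` stands for background, `1` for ink).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply each pixel in the window with the matching filter value.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total / (kh * kw))  # average of the products
        feature_map.append(row)
    return feature_map


# Hypothetical 4 x 4 image with a small loop in the top-left corner.
image = [
    [ 1,  1,  1, -1],
    [ 1, -1,  1, -1],
    [ 1,  1,  1, -1],
    [-1, -1,  1, -1],
]

# Hypothetical "loopy pattern" filter.
loop_filter = [
    [1,  1, 1],
    [1, -1, 1],
    [1,  1, 1],
]

feature_map = convolve2d(image, loop_filter)
for row in feature_map:
    print([round(v, 2) for v in row])
```

The top-left entry of the feature map is exactly `1.0` because the window there matches the filter perfectly; every other position scores lower.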

<img src="images\cnn\12.png" width=800>
<img src="images\cnn\13.png" width=800>
<img src="images\cnn\14.png" width=800>


- The benefit is that wherever the **Feature Map** contains `1`, or a number close to `1`, there is a ***loopy circle*** pattern at that position. So by applying this filter we can identify loopy patterns in an image.
- If several areas of the image contain loopy features, the loopy-pattern detector is activated in each of those areas.

<img src="images\cnn\15.png" width=800>
<img src="images\cnn\16.png" width=800>


<u>**Convolutional Padding**</u>

- When we apply the filters we find the following:

<img src="images\cnn\33.png" width=800>

- Applying a `3 X 3` filter reduces the `5 X 7` input to a `3 X 5` **Feature Map**.
- This is called a **Valid Convolution** or **Valid Padding**.
- The problem with this approach is that the corner pixels are covered by the filter only once, so they play a smaller role in `feature detection`.
- To solve this problem we can put padding around the original input image (here we use a padding of `1` on every side). We can choose the value used for the padding; here we use `-1`, which represents blank background. Now the `3 X 3` filter can start from the very top-left corner of the padded image and move across to the right edge. As a result the corner pixels of the original image are covered multiple times when detecting features.

<img src="images\cnn\34.png" width=800>

- Here the original image was `5 X 7`, but with a padding of `1` it becomes `7 X 9`. After applying a `3 X 3` filter, the output, i.e. the **Feature Map**, is `5 X 7`. So the **Feature Map** has the same size as the original image, and the corner pixels play a better role in `feature detection`. This is called **Same Convolution**.
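
The sizes quoted above follow from the usual output-size rule `(n + 2p - f) // s + 1` for input length `n`, filter length `f`, padding `p`, and stride `s`. The sketch below checks them; the helper names `pad` and `output_size` are hypothetical, and the padding value `-1` is the blank-background value used in these notes.

```python
def pad(image, amount=1, value=-1):
    """Surround `image` with `amount` rows and columns filled with `value`."""
    width = len(image[0]) + 2 * amount
    top = [[value] * width for _ in range(amount)]
    middle = [[value] * amount + list(row) + [value] * amount for row in image]
    bottom = [[value] * width for _ in range(amount)]
    return top + middle + bottom


def output_size(n, f, p=0, s=1):
    """Feature-map length along one axis."""
    return (n + 2 * p - f) // s + 1


image = [[-1] * 7 for _ in range(5)]          # blank 5 x 7 input
padded = pad(image, amount=1)
print(len(padded), len(padded[0]))            # 7 9

valid = (output_size(5, 3), output_size(7, 3))
same = (output_size(5, 3, p=1), output_size(7, 3, p=1))
print("valid:", valid)                        # (3, 5)
print("same: ", same)                         # (5, 7)
```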

<img src="images\cnn\35.png" width=800>

- So **Valid Convolution** means *no padding* and **Same Convolution** means *pad such that the output size is the same as the input size*.
- In the `Tensorflow` API there is an argument `padding`, which accepts `same` or `valid`. By default `padding` is `valid`.
> `layers.Conv2D(16, 3, padding="same", activation="relu")`
- In summary, when we apply a filter, i.e. a **Convolution** operation, we generate a **Feature Map** that shows where that particular feature was detected.
- **So *`Filters`* are nothing but *`feature detectors`***.
- In the koala image we can detect eyes using such a **Filter**. Even if the eyes are at different locations it will still detect them, because we move the filter across the whole image.

<img src="images\cnn\17.png" width=800>

- It is **location invariant**, meaning it can detect the eyes at any location in the image and activate those particular regions.
- Here we have `6` eyes from `3` different koalas, and the filter detects all `6` eyes.

<img src="images\cnn\18.png" width=800>

<hr style="border:2px solid black">

- In the number `9` example we need to identify three features, so we create `3` filters and get `3` **Feature Maps**.

<img src="images\cnn\19.png" width=800>

- Stacking those `3` **Feature Maps** gives a `3d` volume that represents the image.

<img src="images\cnn\20.png" width=800>

- The same logic applies to the koala. Here we can detect the head of the koala using a filter.

<img src="images\cnn\21.png" width=800>

- Using the same logic, we apply head and body filters to detect where the koala's head and body are in the image. Then we flatten each **Feature Map** into an array and join the two arrays together. The joined array is fed into a ***Fully Connected Dense Neural Network*** for the classification.
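
The flatten-and-join step is simple enough to show directly. The two tiny feature maps below are made-up values, used only to show the shape of the array that reaches the dense network.

```python
def flatten(feature_map):
    """Turn a 2-D feature map into a flat list, row by row."""
    return [value for row in feature_map for value in row]


# Hypothetical 2 x 2 feature maps from a "head" filter and a "body" filter.
head_map = [[0.0, 1.0],
            [0.0, 0.0]]
body_map = [[0.0, 0.0],
            [0.0, 1.0]]

# Join the two flattened arrays into one input vector for the dense network.
dense_input = flatten(head_map) + flatten(body_map)
print(dense_input)        # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(len(dense_input))   # 8
```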

<img src="images\cnn\22.png" width=800>

- We need the ***Fully Connected Dense Neural Network*** so that even if we get a different image of a koala, where the eyes and ears are at different locations, we can still detect it as a koala. In the second image, because the eyes and ears are at different locations, we get a different flattened array.

<img src="images\cnn\23.png" width=800>

- **Basics of Neural Network:**
> **Neural Networks are used to handle the variety in the inputs in such a way that they can `classify` that variety of inputs in a generic way.**
- The first part, where we use the **Convolution Operation**, is called the **Feature Extraction** part because it detects all the features (ears, nose, eyes, head, body). The second part, where we use the **Dense Neural Network**, is called the **Classification** part because it is responsible for the classification.

<img src="images\cnn\24.png" width=800>

<hr style="border:2px solid black">

- But the above is not the complete **Convolutional Neural Network**; there are two more components:
  - **ReLU Activation Function:**
    - It is used to bring **Non Linearity** into the model.
    - It takes the **Feature Map** and replaces every negative value with `0`; positive values are kept as they are.
    <img src="images\cnn\25.png" width=800>
  - **Pooling layer:**
    - The **Pooling** layer is used to reduce the size of the **Feature Map**.
    - This reduces the computation.
    - The first **Pooling** operation is **Max Pooling**:
      - Here we take a `2 X 2` window, select the maximum number in that window, and put it in the new, smaller **Feature Map** (`2 X 2` in this example).
      - So we take the **Feature Map**, apply the **Pooling**, and create a new **Feature Map**.
      - The new **Feature Map** is smaller than the old one, so it saves computation.
      <img src="images\cnn\26.png" width=800>
      - Here `stride = 2` means that once we are done with one window we move two pixels further.
      - For our handwritten number `9` we can take `stride = 1`, which gives the following.
      <img src="images\cnn\27.png" width=800>
      - In the `Tensorflow` API we can also set the `strides`. For `Conv2D` the default is `(1, 1)`, i.e. the window moves one pixel at a time. For the pooling layers (e.g. `MaxPool2D`), `strides` defaults to the pool size.
      > `tf.keras.layers.Conv2D(filters, kernel_size, strides=(1,1), padding="valid", data_format=None)`
      - If the number is shifted we get a different **Max Pooling** map, but it still detects the loopy pattern at the top.
      <img src="images\cnn\28.png" width=800>
      <img src="images\cnn\29.png" width=800>
      - So **Max Pooling** together with **Convolution** helps with **Position Invariant Feature Detection**: no matter where the feature is in the image, it is still detected.
    - There is another **Pooling** operation called **Average Pooling**:
      - Here we take the average of each window to create the new **Feature Map**.
      <img src="images\cnn\31.png" width=800>
      - But **Max Pooling** is the one mostly used.
  - **Advantages of `Pooling`**:
    - It reduces dimensions and computation.
    - It reduces **Overfitting** as there are fewer parameters.
    - It makes the model tolerant towards variations and distortions: by picking the maximum number in each window we capture the main feature and filter out the noise.
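
Both components can be sketched in a few lines of plain Python. This is an illustration only: the `relu` and `pool2d` helper names and the `4 X 4` feature map values are made up for the example, and `pool2d` assumes no padding.

```python
def relu(feature_map):
    """Replace negative values with 0 and keep positive values unchanged."""
    return [[max(0, v) for v in row] for row in feature_map]


def mean(values):
    return sum(values) / len(values)


def pool2d(feature_map, size=2, stride=2, op=max):
    """Apply `op` (max or mean) to each size x size window, moving by `stride`."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(op(window))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 4 feature map.
feature_map = [
    [ 5,  1, -3,  0],
    [ 8, -2,  9,  2],
    [-1,  3,  2, -4],
    [ 1,  1,  2,  6],
]

activated = relu(feature_map)
max_pooled = pool2d(activated)              # 2 x 2 window, stride 2
avg_pooled = pool2d(activated, op=mean)
stride_one = pool2d(activated, stride=1)    # overlapping windows

print(max_pooled)        # [[8, 9], [3, 6]]
print(avg_pooled)        # [[3.5, 2.75], [1.25, 2.5]]
print(len(stride_one), len(stride_one[0]))  # 3 3
```

With `stride=2` the `4 X 4` map shrinks to `2 X 2`; with `stride=1` the windows overlap and the output is `3 X 3`.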


- So the complete **Convolutional Neural Network** will look like the following:

<img src="images\cnn\30.png" width=800>

- Here we have a **`Convolution + ReLU`** layer, then **`Pooling`**, then another **`Convolution + ReLU`** layer and **`Pooling`**.
- There can be **n** such **Convolution** and **Pooling** layers, and at the end we have a **Fully Connected Dense Neural Network**.
- The network learns the values of these **Filters** on its own during training.
- **Benefits of `Convolution` Operations**:
  - Connection sparsity reduces **Overfitting**. Connection sparsity means that not every node is connected to every other node, unlike in an `ANN`, which we call a `Dense Network`.
  - We move a filter around the image and at any one time look only at a **local region**, so each filter position does not depend on the whole image.
  - `Convolution` and `Pooling` give location-invariant feature detection, meaning the location of the feature does not matter.
  - Parameter sharing: once we learn the parameters for filter `a`, we can apply them across the entire image.
- **Benefits of `ReLU`**:
  - It introduces **Non Linearity**, which is needed because the problems we solve with `DL` are **non-linear** by nature.
  - It speeds up training because it is cheap to compute.
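
To make the effect of parameter sharing concrete, the sketch below compares the trainable parameters of a small convolution layer with those of a dense layer on the same `1920 X 1080 X 3` image. The layer sizes (16 filters of `3 X 3`, 16 dense units) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, n_filters):
    """Each filter has one weight per kernel cell per input channel, plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * n_filters


def dense_params(n_inputs, n_units):
    """Each unit has one weight per input, plus one bias."""
    return (n_inputs + 1) * n_units


# Assumed sizes: an RGB image of 1920 x 1080, 16 filters of 3 x 3, 16 dense units.
conv = conv_params(3, 3, 3, 16)
dense = dense_params(1920 * 1080 * 3, 16)

print(f"convolution layer: {conv:,} parameters")   # 448
print(f"dense layer:       {dense:,} parameters")  # 99,532,816
```

The convolution layer's count does not depend on the image size at all, because the same filter weights are reused at every position.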
  

<hr style="border:2px solid black">

#### Rotation and Thickness:

- A **CNN** by itself doesn't take care of **rotation** and **scale**.
- We need to have rotated and scaled samples in the training dataset.
- If we don't have such samples, we need to use **Data Augmentation** methods to generate new rotated/scaled samples from the existing training samples.
- **Data Augmentation:** Pick a few samples from the training set and rotate them manually, make them larger or smaller, or make them thicker or thinner, to create new samples.

<img src="images\cnn\32.png" width=800>


<hr style="border:2px solid black">