# Convolutional Neural Network Tutorial

<br><br>
<img src="https://i.imgur.com/2GXmcvN.png" width="700" hight = "600" > <br><br>


**Convolutional neural networks (CNNs)** are one of the main neural network architectures for image classification and image recognition. Scene labeling, object detection, and face recognition are some of the areas where they are widely used.

A CNN takes an image as input, processes it, and classifies it under a category such as dog, cat, lion, or tiger. The computer sees an image as an array of pixels whose size depends on the resolution of the image: <b>h * w * d</b>, where h = height, w = width, and d = depth (the number of color channels). For example, a 6 x 6 RGB image is a <b>6 * 6 * 3</b> array, and a 4 x 4 grayscale image is a <b>4 * 4 * 1</b> array.
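To make the **h * w * d** layout concrete, here is a minimal sketch using plain nested Python lists. The pixel values and the helper name `shape` are made up for illustration; real code would normally use an array library instead.

```python
# A tiny 2 x 2 RGB image stored as height x width x depth nested lists.
# The pixel values are invented for illustration.
rgb_image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]


def shape(image):
    """Return (height, width, depth) of a nested-list image."""
    return (len(image), len(image[0]), len(image[0][0]))


# Convert to grayscale by averaging the three channels of each pixel,
# keeping a depth axis of size 1.
gray_image = [[[sum(pixel) // 3] for pixel in row] for row in rgb_image]

print(shape(rgb_image))   # (2, 2, 3)
print(shape(gray_image))  # (2, 2, 1)
```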

In a CNN, each input image passes through a sequence of convolution layers with filters (also known as kernels), pooling layers, and fully connected layers. After that, the softmax function is applied to classify the object with probability values between 0 and 1.

<br>
<img src="https://i.imgur.com/FLtPNbD.png" width="700" hight = "500" > <br>
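The softmax step can be sketched in a few lines of standard-library Python. The raw scores below are hypothetical, and the class names are borrowed from the example categories above; the point is only that the outputs lie between 0 and 1 and sum to 1.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing
    # and does not change the result.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for four classes.
classes = ["dog", "cat", "lion", "tiger"]
probs = softmax([2.0, 1.0, 0.1, -1.0])

# The predicted class is the one with the highest probability.
predicted = classes[probs.index(max(probs))]
print(predicted)
```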


## Convolution Layer

The convolution layer is the first layer that extracts features from an input image. By learning image features from small squares of input data, it preserves the spatial relationship between pixels. Convolution is a mathematical operation that takes two inputs: an image matrix and a kernel (or filter).

<br>
<img src="https://i.imgur.com/cH8PwEE.png" width="1200" hight = "800" > <br>
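As a rough sketch of the operation, the function below slides a kernel over an image with stride 1 and no padding, multiplying and summing at each position. Note that, as in most CNN libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation. The 5 x 5 image and 3 x 3 kernel are illustrative values, not taken from the figure.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for i in range(n_rows - k_rows + 1):
        row = []
        for j in range(n_cols - k_cols + 1):
            # Multiply the kernel with the patch under it and add the products.
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_rows)
                for b in range(k_cols)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```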

## Strides

Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; when the stride is 2, we move it 2 pixels at a time. The following figure shows how the convolution works with a stride of 2.

<br>
<img src="https://i.imgur.com/21Pg9Dl.png" width="700" hight = "600" > <br>
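The effect of the stride on where the filter lands can be sketched along a single axis. The helper names `window_starts` and `output_size` and the sizes used are assumptions for illustration; padding is ignored here.

```python
def window_starts(n, f, stride):
    """Starting positions a size-f filter visits along an axis of length n."""
    return list(range(0, n - f + 1, stride))


def output_size(n, f, stride):
    """Number of positions along one axis: floor((n - f) / stride) + 1."""
    return (n - f) // stride + 1


# A 7-pixel-wide input with a 3-pixel-wide filter.
print(window_starts(7, 3, 1))  # [0, 1, 2, 3, 4]
print(window_starts(7, 3, 2))  # [0, 2, 4]
print(output_size(7, 3, 2))    # 3
```

A larger stride visits fewer positions, so the output feature map gets smaller.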

## Padding

Padding plays a crucial role in building a convolutional neural network. Without it, every convolution shrinks the image, so a network with hundreds of layers would be left with a very small image at the end.

What happens if we convolve a grayscale image with a 3 x 3 filter?

<br>
<img src="https://i.imgur.com/pkGIrTl.png" width="800" hight = "700" > <br>

It is clear from the above picture that a pixel in the corner is covered only once, while a pixel in the middle is covered several times. This means the output carries more information about the middle pixels than about the corner pixels. Convolution without padding therefore has two downsides:

- Shrinking outputs
- Losing information on the corner of the image.

To overcome this, we add padding to the image.

**"Padding is simply a process of adding layers to our input images so as to avoid the problems mentioned above."**

<br>
<img src="https://i.imgur.com/s8Ub8TZ.png" width="450" hight = "400" > <br>

- This prevents shrinking.
    If **p** is the number of layers of zeros added to the border of the image, then our **(n x n)** image becomes **(n + 2p) x (n + 2p)** after padding. Applying a convolution **with an (f x f) filter** and a stride of 1 then gives an output of size **(n + 2p – f + 1) x (n + 2p – f + 1)**.
    **For example,** adding one layer of **padding** to an **(8 x 8)** image and using a **(3 x 3)** filter gives an **(8 x 8)** output after the convolution operation.


- This increases the contribution of the pixels at the border of the original image by bringing them into the middle of the padded image. Thus, information on the borders is preserved as well as the information in the middle of the image.
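The size arithmetic for padding can be checked with a short sketch. The helper names `zero_pad` and `conv_output_size` are illustrative, and a stride of 1 is assumed throughout.

```python
def zero_pad(image, p):
    """Surround a 2-D list with p layers of zeros."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]            # top border
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)    # left and right borders
    padded.extend([0] * width for _ in range(p))        # bottom border
    return padded


def conv_output_size(n, f, p):
    """Output side length for an n x n image, f x f filter, padding p, stride 1."""
    return n + 2 * p - f + 1


image = [[1] * 8 for _ in range(8)]
padded = zero_pad(image, 1)

print(len(padded), len(padded[0]))  # 10 10
print(conv_output_size(8, 3, 0))    # 6 (no padding: the output shrinks)
print(conv_output_size(8, 3, 1))    # 8 (one layer of padding: size preserved)
```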

<br>

## ReLU Layer

ReLU stands for **rectified linear unit**. Once the feature maps are extracted, the next step is to pass them through a ReLU layer.

ReLU performs an element-wise operation that sets all negative values to 0. It introduces non-linearity to the network, and the generated output is a rectified feature map. Below is the graph of a ReLU function:

<br>
<img src="https://i.imgur.com/oRxY9rf.png" width="600" hight = "500" > <br>
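In code, ReLU amounts to taking `max(0, x)` for every value. The small feature map below is made up for illustration.

```python
def relu(feature_map):
    """Replace every negative value with 0, element by element."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [
    [3, -1, 0],
    [-5, 2, -2],
]

rectified = relu(feature_map)
print(rectified)  # [[3, 0, 0], [0, 2, 0]]
```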


## Pooling Layer

The pooling layer plays an important role in reducing the size of the feature maps. When the images are too large, it reduces the number of parameters and the amount of computation in the layers that follow. Pooling is a **"downscaling"** of the feature maps obtained from the previous layers, comparable to shrinking an image to reduce its pixel density. Spatial pooling, also called downsampling or subsampling, reduces the dimensionality of each map while retaining the important information. The common types of spatial pooling are described below.

## Max Pooling

Max pooling is a **sample-based discretization process.** Its main objective is to downscale an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions.

Max pooling is done by applying a max filter to non-overlapping sub-regions of the initial representation.

<br>
<img src="https://i.imgur.com/rYu8XFm.png" width="800" hight = "800" > 
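A minimal sketch of max pooling over non-overlapping 2 x 2 blocks follows. The input values are illustrative and not taken from the figure.

```python
def max_pool(feature_map, size=2):
    """Take the maximum of each non-overlapping size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 1, 8],
    [0, 9, 3, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 5], [9, 8]]
```

Each 2 x 2 block collapses to a single value, so a 4 x 4 map becomes 2 x 2.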

## Average Pooling

Average pooling downscales the input by dividing it into rectangular pooling regions and computing the average value of each region.

**Syntax** (MATLAB Deep Learning Toolbox)

    layer = averagePooling2dLayer(poolSize)
    layer = averagePooling2dLayer(poolSize,Name,Value)


## Sum Pooling

The sub-regions for **sum pooling** or **mean pooling** are set up exactly as for **max pooling**, but instead of the max function we use the sum or the mean.
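Because only the reducing function changes, one way to sketch this is a single pooling helper that takes the function as an argument. The name `pool`, the `reduce_fn` parameter, and the input values are assumptions made for this example.

```python
def pool(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [
                feature_map[i + a][j + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 2],
    [4, 8, 4, 0],
    [6, 2, 1, 7],
    [0, 4, 3, 5],
]

sum_pooled = pool(feature_map, 2, sum)    # [[16, 8], [12, 16]]
mean_pooled = pool(feature_map, 2, mean)  # [[4.0, 2.0], [3.0, 4.0]]
max_pooled = pool(feature_map, 2, max)    # [[8, 4], [6, 7]]
```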

<br>

## Fully Connected Layer

In the fully connected layer, the output of the previous layers is flattened into a vector and passed through. The layer transforms this vector into the desired number of output classes.

<br>
<img src="https://i.imgur.com/OoPI95y.png" width="800" hight = "600" > <br>

In the above diagram, the feature map matrix is converted into a vector **x1, x2, x3... xn** and fed to the fully connected layers. These layers combine the features to create a model, and an activation function such as **softmax** or **sigmoid** is applied to classify the output as a car, dog, truck, etc.

<br>
<img src="https://i.imgur.com/iTmqRJK.png" width="800" hight = "600" > <br>
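A fully connected layer can be sketched as one weighted sum per class over the input vector. The weights, biases, and feature values below are invented for illustration; in a trained network they would be learned from data.

```python
def fully_connected(inputs, weights, biases):
    """Compute one score per class: dot(weights[k], inputs) + biases[k]."""
    return [
        sum(w * x for w, x in zip(class_weights, inputs)) + b
        for class_weights, b in zip(weights, biases)
    ]


# Flattened features x1..x4 and made-up parameters for three classes.
labels = ["car", "dog", "truck"]
x = [1.0, 0.0, 2.0, 1.0]
weights = [
    [0.5, 0.1, 0.2, 0.0],  # car
    [0.0, 0.3, 0.1, 0.4],  # dog
    [0.2, 0.2, 0.0, 0.1],  # truck
]
biases = [0.0, 0.1, -0.1]

scores = fully_connected(x, weights, biases)
best = labels[scores.index(max(scores))]
print(best)
```

An activation such as softmax or sigmoid would then be applied to `scores` to obtain the final class probabilities.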


#### Here’s how the structure of the convolutional neural network looks so far:

<br>
<img src="https://i.imgur.com/jUWQ3cg.png" width="800" hight = "600" > <br>

The next step in the process is called flattening. Flattening converts all the two-dimensional arrays of the pooled feature maps into a single long, continuous linear vector.


<br>
<img src="https://i.imgur.com/UVKMDlt.png" width="700" hight = "600" > <br>
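Flattening can be sketched as a nested loop that reads every pooled map row by row. The two 2 x 2 maps below are placeholder values.

```python
def flatten(feature_maps):
    """Concatenate every row of every 2-D map into one flat list."""
    return [value for fmap in feature_maps for row in fmap for value in row]


pooled_maps = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]

vector = flatten(pooled_maps)
print(vector)  # [1, 2, 3, 4, 5, 6, 7, 8]
```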


The flattened vector is fed as input to the fully connected layer to classify the image.

<br>
<img src="https://i.imgur.com/7rQNVED.png" width="970" hight = "800" > <br>

<br>
<img src="https://i.imgur.com/kfNEtRM.png" width="970" hight = "800" > <br>

<br>
<img src="https://i.imgur.com/cjjlns7.png" width="970" hight = "800" > <br>


**Here’s how exactly a CNN recognizes a bird:**

- The pixels from the image are fed to the convolutional layer, which performs the convolution operation
- The result is a convolved map
- A ReLU function is applied to the convolved map to generate a rectified feature map
- The image is processed with multiple convolution and ReLU layers to locate the features
- Different pooling layers with various filters are used to identify specific parts of the image
- The pooled feature map is flattened and fed to a fully connected layer to get the final output
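The steps above can be chained into a toy forward pass. This is only a sketch under strong assumptions: a single hand-picked vertical-edge kernel, fixed made-up class weights, square inputs, and no training. The two output scores stand for the hypothetical classes "edge" and "no edge".

```python
def conv2d(image, kernel):
    """Stride-1, no-padding sliding-window product (square inputs assumed)."""
    k = len(kernel)
    out = len(image) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out)
        ]
        for i in range(out)
    ]


def relu(fmap):
    """Set negative values to 0."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Maximum of each non-overlapping size x size block."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def flatten(fmap):
    """Read a 2-D map row by row into a flat list."""
    return [v for row in fmap for v in row]


def dense(inputs, weights):
    """One weighted sum per class (biases omitted for brevity)."""
    return [sum(w * x for w, x in zip(ws, inputs)) for ws in weights]


def forward(image, kernel, class_weights):
    """Convolution -> ReLU -> max pooling -> flatten -> fully connected."""
    convolved = conv2d(image, kernel)
    rectified = relu(convolved)
    pooled = max_pool(rectified)
    features = flatten(pooled)
    return dense(features, class_weights)


# 6 x 6 image: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
# Made-up weights for two hypothetical classes: "edge" and "no edge".
class_weights = [[0.25] * 4, [-0.25] * 4]

scores = forward(image, kernel, class_weights)
print(scores)  # [3.0, -3.0]
```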

<br>
<img src="https://i.imgur.com/HPIgTYf.png" width="970" hight = "800" > <br>
