# <p style='text-align: center;'> Convolutional Neural Network (CNN) </p>

## 1. Introduction to Convolutional Neural Network (CNN):
Have you ever wondered how facial recognition works on social media, or how object detection helps in building self-driving cars, or how disease detection is done using visual imagery in healthcare? It’s all possible thanks to convolutional neural networks (CNN). 


<b> Here’s an example of convolutional neural networks that illustrates how they work:


Imagine there’s an image of a bird, and you want to identify whether it’s really a bird or some other object. The first thing you do is feed the pixels of the image in the form of arrays to the input layer of the neural network (multi-layer networks used to classify things). The hidden layers carry out feature extraction by performing different calculations and manipulations. There are multiple hidden layers like the convolution layer, the ReLU layer, and pooling layer, that perform feature extraction from the image. Finally, there’s a fully connected layer that identifies the object in the image.


![image.png](attachment:image.png)



## 2. Convolutional Neural Network (CNN):
**Convolutional neural networks** are one of the main families of neural networks used for image classification and image recognition. Scene labeling, object detection, and face recognition are some of the areas where convolutional neural networks are widely used.


A **convolutional neural network** is a feed-forward neural network that is generally used to analyze visual images by processing data with grid-like topology. It’s also known as a **ConvNet**.



A CNN takes an image as input, processes it, and classifies it under a category such as dog, cat, lion, or tiger. The computer sees an image as an array of pixels whose size depends on the resolution of the image. Based on the image resolution, it sees an array of h * w * d, where h = height, w = width, and d = depth (the number of color channels). For example, a small RGB image could be a 6 * 6 * 3 array, and a small grayscale image a 4 * 4 * 1 array. An RGB image is nothing but a matrix of pixel values with three planes, whereas a grayscale image has a single plane.


In a CNN, each input image passes through a sequence of convolution layers (whose filters are also known as kernels), pooling layers, and fully connected layers. After that, we apply the softmax or sigmoid function to classify the object with probability values between 0 and 1.
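As a minimal sketch of that last step, the two functions can be written in a few lines of NumPy. The raw scores below are hypothetical and only serve to show that the outputs land between 0 and 1.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def sigmoid(score):
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

raw_scores = np.array([2.0, 1.0, 0.1])  # hypothetical scores for three classes
probabilities = softmax(raw_scores)

print(probabilities, probabilities.sum())
print(sigmoid(0.0))  # 0.5
```

Softmax is the usual choice when there are several mutually exclusive classes, while sigmoid fits a single yes/no output.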




## 3. Why should we use CNN ?
<b> Problem with Feedforward Neural Network
    
Suppose you are working with the MNIST dataset. Each image in MNIST is 28 x 28 x 1 (a grayscale image has only 1 channel), so the input layer needs 28 x 28 = 784 neurons, which is manageable. But what if the image is 1000 x 1000? Then you need 10⁶ neurons in the input layer, and every one of them is connected to every neuron in the next layer, which is computationally inefficient. This is where the convolutional neural network (CNN) comes in. In simple words, a CNN extracts the features of an image and converts them into a lower-dimensional representation without losing the image's essential characteristics. In the following example, the initial size of the image is 224 x 224 x 3. Without convolution you would need 224 x 224 x 3 = 150,528 neurons in the input layer, but after applying the convolutional layers the tensor is reduced to 1 x 1 x 1000. This means you only need 1000 neurons in the first layer of the fully connected network.
    
![image.png](attachment:image.png)
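The input sizes in this argument are plain multiplications, and a short sketch makes them easy to check. The helper name `flattened_input_size` is an assumption made for illustration.

```python
def flattened_input_size(height, width, channels):
    """Number of input neurons needed if every pixel value feeds its own neuron."""
    return height * width * channels

print(flattened_input_size(28, 28, 1))      # 784 (MNIST)
print(flattened_input_size(1000, 1000, 1))  # 1000000
print(flattened_input_size(224, 224, 3))    # 150528
```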
    
    

## 4. Few Definitions
<b> There are a few definitions you should know before understanding CNNs

### 4.1 Image Representation:
Thinking about images, it's easy to see that an image has a height and a width, so it would make sense to represent the information in it with a two-dimensional structure (a matrix). But images also have colors, and to add information about the colors we need another dimension. That is when tensors become particularly helpful.

    
Images are encoded into color channels: the image data is represented as the intensity of each color channel at a given point. The most common encoding is RGB, which stands for Red, Green, and Blue. The information contained in an image is the intensity of each color channel at every position across the width and height of the image, like this:
    
![image.png](attachment:image.png)
    
    
So the intensity of the red channel at each point across the width and height can be represented as a matrix. The same goes for the green and blue channels, so we end up with three matrices, and when these are stacked they form a tensor.
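A small NumPy sketch of this idea, assuming a hypothetical 4 x 5 image that is pure red: three separate channel matrices are stacked along a new last axis to form one tensor.

```python
import numpy as np

height, width = 4, 5  # hypothetical image size, chosen only for illustration

red = np.full((height, width), 255, dtype=np.uint8)  # full intensity everywhere
green = np.zeros((height, width), dtype=np.uint8)    # no green
blue = np.zeros((height, width), dtype=np.uint8)     # no blue

# Stack the three matrices along a new last axis: height x width x channels.
image = np.stack([red, green, blue], axis=-1)

print(image.shape)  # (4, 5, 3)
print(image[0, 0])  # [255   0   0]
```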
    
    
### 4.2 Edge Detection:
Every image contains vertical and horizontal edges that combine to form its shapes. The convolution operation is used with suitable filters to detect these edges. Suppose you have a grayscale image of dimension 6 x 6 and a filter of dimension 3 x 3. When the 6 x 6 image is convolved with the 3 x 3 filter, we get a 4 x 4 output. First, the 3 x 3 filter is multiplied element-wise with the top-left 3 x 3 region of the image and the products are summed. Then we shift the filter one column to the right and repeat until we reach the end of the row, after which we shift down one row and start again from the left, and so on.
    
![image-2.png](attachment:image-2.png)
    
    
If we have an N x N image and an F x F filter, then after convolution the result will be
    
   (N x N) * (F x F) = (N-F+1) x (N-F+1)
    
that is, each side of the output has length
    
   N - F + 1

For the case above, N = 6 and F = 3, so the output is 4 x 4.
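The sliding-window procedure and the N - F + 1 output size can be sketched as follows. This is a minimal illustration, not an optimized implementation: the function name, the toy image (bright left half, dark right half) and the vertical-edge filter are assumptions. As is usual in CNNs, the filter is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    n_rows, n_cols = image.shape
    f_rows, f_cols = kernel.shape
    out = np.zeros((n_rows - f_rows + 1, n_cols - f_cols + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + f_rows, j:j + f_cols]
            out[i, j] = np.sum(window * kernel)  # multiply element-wise, then add
    return out

# Hypothetical 6 x 6 image: bright left half, dark right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)
# Hypothetical 3 x 3 filter that responds to vertical edges.
vertical_edge_filter = np.array([[1, 0, -1]] * 3, dtype=float)

feature_map = convolve2d_valid(image, vertical_edge_filter)
print(feature_map.shape)  # (4, 4)
print(feature_map[0])     # [ 0. 30. 30.  0.]
```

The large values in the middle columns mark the position of the vertical edge between the bright and dark halves.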
    
   
### 4.3 Stride and Padding:
**Stride** denotes how many pixels the filter moves at each step of the convolution. By default it is one, but we can choose a different stride size based on the requirements. The stride is applied in the example in the "Working of Convolutional Neural Network (CNN)" section below.
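To make the effect of the stride concrete, the sketch below lists where the filter window starts along one axis and how many positions it visits. The helper names and the 7-pixel input are assumptions; the output-size expression `(N - F) // S + 1` is the standard formula for an unpadded input.

```python
def window_starts(n, f, stride=1):
    """Start indices of an F-wide window sliding over N pixels."""
    return list(range(0, n - f + 1, stride))

def strided_output_size(n, f, stride=1):
    """Number of window positions along one axis (no padding)."""
    return (n - f) // stride + 1

print(window_starts(7, 3, stride=1))        # [0, 1, 2, 3, 4]
print(window_starts(7, 3, stride=2))        # [0, 2, 4]
print(strided_output_size(7, 3, stride=2))  # 3
```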
    
    
As we have seen, the output is smaller than the input after applying a filter. To keep the output the same size as the input, we use **padding**. **Padding** is the process of adding zeros symmetrically around the input matrix. 
    
**Padding** refers to the number of pixels added around an image before it is processed by the kernel of a CNN. With zero padding, every pixel that is added has the value zero. If the padding is set to zero, no pixels are added. If the zero padding is set to one, a one-pixel border with pixel value zero is added around the image.
    
![image-3.png](attachment:image-3.png)    
    
    
Let ‘p’ be the padding.

Initially (without padding)
    
   (N x N) * (F x F) = (N-F+1) x (N-F+1) ---(1)
    
    
After applying padding
    
![image-4.png](attachment:image-4.png)
    
    
If we apply an F x F filter to the padded (N+2p) x (N+2p) input matrix, we get an output matrix of dimension (N+2p-F+1) x (N+2p-F+1). Since we want the output to have the same dimension as the original input (N x N), we require
    
    
   (N+2p-F+1) x (N+2p-F+1) equivalent to N x N
    
   N+2p-F+1 = N ---(2)
    
   p = (F-1)/2 ---(3)
    
    
Equation (3) shows that the padding depends on the dimension of the filter. For p to be a whole number, F must be odd, which is one reason why filter sizes such as 3 x 3 and 5 x 5 are common.
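The padding formula can be checked with a short sketch, assuming a stride of 1 and an odd filter size. The helper names `same_padding` and `output_size` are hypothetical; `np.pad` is the NumPy function that adds the border of zeros.

```python
import numpy as np

def same_padding(filter_size):
    """p = (F - 1) / 2, which is a whole number only for odd F (stride 1)."""
    if filter_size % 2 == 0:
        raise ValueError("this formula assumes an odd filter size")
    return (filter_size - 1) // 2

def output_size(n, f, p=0):
    """Side length of the output for an N x N input, F x F filter, padding p."""
    return n + 2 * p - f + 1

image = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6 x 6 input
p = same_padding(3)
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

print(p)                     # 1
print(padded.shape)          # (8, 8)
print(output_size(6, 3))     # 4 (without padding)
print(output_size(6, 3, p))  # 6 (same as the input)
```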
    

### 4.4 Depth:
**Depth** corresponds to the number of filters we use for the convolution operation. In the network shown in the figure below, we convolve the original boat image with three distinct filters, producing three different feature maps. You can think of these three feature maps as stacked 2D matrices, so the ‘depth’ of the feature map is three. More filters let the network detect more kinds of features, although they also add parameters and computation.

![image.png](attachment:image.png)
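As a rough sketch of depth, the code below applies three hypothetical 3 x 3 filters to a hypothetical 6 x 6 image and stacks the results, so the last axis of the output has length three. It assumes NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6 x 6 grayscale image

# Three hypothetical 3 x 3 filters, stacked along the first axis.
filters = np.stack([
    np.array([[1, 0, -1]] * 3, dtype=float),              # vertical edges
    np.array([[1] * 3, [0] * 3, [-1] * 3], dtype=float),  # horizontal edges
    np.full((3, 3), 1 / 9),                               # simple blur
])

windows = sliding_window_view(image, (3, 3))                 # shape (4, 4, 3, 3)
feature_maps = np.einsum("ijkl,fkl->ijf", windows, filters)  # one map per filter

print(feature_maps.shape)  # (4, 4, 3)
```

Each slice `feature_maps[..., k]` is the 4 x 4 feature map produced by filter `k`.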



## 5. Layers in CNN/How Convolutional Neural Network Works?

<b> Generally, a convolutional neural network has four types of layers:

- Convolutional
- ReLU Layer
- Pooling
- Fully Connected Layer
    
    
<b> Approach:

- Build a small convolutional neural network as defined in the architecture below.


- Select images to train the convolutional neural network.


- Extraction of feature filters/feature maps.


- Implementation of the convolutional layer.


- Apply the ReLu Activation function on the convolutional layer to convert all negative values to zero.


- Then apply max pooling on convolutional layers.


- Make a fully connected layer


- Then input an image into CNN to predict the image content


- Use backpropagation to compute the gradients of the error and update the filters and weights
    
    

## 6. Working of Convolutional Neural Network (CNN):

Generally, a convolutional neural network has four types of layers. We will go through them one by one with the help of an example: a classifier that can tell an image of an X from an image of an O.

- Convolutional Layer (Operation)
- ReLU Layer
- Pooling
- Fully Connected Layer


<b> Let's have a look at an image of an X and an O. With this case, we will understand all four layers:
    
![image-2.png](attachment:image-2.png)
 
    
A computer understands an image using numbers at each pixel.

    
In our example, a yellow pixel has the value 1 and a white pixel has the value -1. This is simply the encoding we have chosen to differentiate the pixels in a basic binary classification.
    
![image-3.png](attachment:image-3.png)    
    
    
<b> Train the Convolutional Neural Network For Image X
    
<b> Feature Filters extraction from image X  
    
In convolutional networks, you look at an image through a smaller window and move that window to the right and down. That way you can find features in that window, for example, a horizontal line or a vertical line or a curve, etc… What exactly a convolutional neural network considers an important feature is defined while learning.
    

Wherever you find those features, you report that in the feature maps. A certain combination of features in a certain area can signal a larger, more complex feature exists there.

    
For example, your first feature map could look for curves. The next feature map could look at a combination of curves that build circles.
    
![image-4.png](attachment:image-4.png)
    
![image-5.png](attachment:image-5.png)
    
![image-6.png](attachment:image-6.png)    
    

### 6.1 Convolutional Layers (Operation):
The convolution layer is the first layer that extracts features from an input image. By learning image features from small squares of input data, the convolutional layer preserves the spatial relationship between pixels. Convolution is a mathematical operation that takes two inputs: an image matrix and a kernel (or filter).

<b> 6.1.1 Convolutional Layer 1 (Image X with filter 1)
    
In a CNN convolutional layer, a 3×3 matrix called the ‘feature filter’, ‘kernel’, or ‘feature detector’ slides over the image, and the matrix of results it produces is the output of the convolutional layer (the feature map). It is important to note that filters act as feature detectors on the original input image. Here, image X is matched with filter #1 with a stride of 1.

![image.png](attachment:image.png)
    
    
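Each pixel value of the highlighted region is multiplied by the corresponding pixel value of the filter. The products are then added up and divided by the total number of pixels in the filter. This is a **"dot product"** normalized by the filter size, so with pixel values of 1 and -1 a perfect match gives a value of 1.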
    
![image-2.png](attachment:image-2.png)
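The matching step for a single window can be sketched as below. The diagonal filter and the two patches are hypothetical examples that use the 1 / -1 pixel encoding from this section; `match_score` is an illustrative name.

```python
import numpy as np

def match_score(patch, feature_filter):
    """Multiply matching pixels, add the products, divide by the pixel count."""
    return np.sum(patch * feature_filter) / feature_filter.size

# Hypothetical 3 x 3 filter that looks for a diagonal stroke.
feature_filter = np.array([[ 1, -1, -1],
                           [-1,  1, -1],
                           [-1, -1,  1]])

perfect_patch = feature_filter.copy()        # identical to the filter
partial_patch = np.array([[-1, -1, -1],      # two corner pixels differ
                          [-1,  1, -1],
                          [-1, -1, -1]])

print(match_score(perfect_patch, feature_filter))  # 1.0
print(match_score(partial_patch, feature_filter))  # 5/9, roughly 0.56
```

Seven matching pixels contribute +1 each and two mismatched pixels contribute -1 each, which is where the 5/9 comes from.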
    
    
<b> Here you will see how the filter shifts on pixels with a stride of 1. 
    
**Strides:** Stride is the number of pixels by which the filter shifts over the input matrix. When the stride is 1, we move the filter 1 pixel at a time; similarly, when the stride is 2, we move the filter 2 pixels at a time. The following figures show how the convolution works with a stride of 1.
    
![image-3.png](attachment:image-3.png)
    
![image-4.png](attachment:image-4.png)    
    
![image-5.png](attachment:image-5.png)
    
![image-6.png](attachment:image-6.png)
    
![image-7.png](attachment:image-7.png)
    
![image-8.png](attachment:image-8.png)
    
    
And so on.
    
    
![image-10.png](attachment:image-10.png)   
    
    
<b> Convolution Layer Output
    
We move the filter to every other position of the image and see how well the feature matches that area. Finally, we get the following output:
    
    
![image-11.png](attachment:image-11.png)   
    
    
<b> Similarly, we perform the same convolution with every other filter. 
    
<b> Hence Convolutional Layer 1, 2 and 3 (Image X with filter 1, 2 and 3)
    
![image-12.png](attachment:image-12.png)   
    
    
    

### 6.2 RELU Layer:
In this layer we remove every negative value from the filtered images and replace it with zero.


This is done to keep the values from summing up to zero, and it introduces the non-linearity the network needs to learn more than linear combinations of its inputs.


**Rectified Linear Unit (ReLU)** is a transform function that only activates a node if the input is above a certain threshold: while the input is below zero, the output is zero, and when the input rises above zero, the output has a linear relationship with the input, i.e. f(x) = max(0, x).
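In code, ReLU is a one-liner. The feature-map values below are hypothetical and only illustrate that negative entries become zero while positive entries pass through unchanged.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(x, 0)

# Hypothetical 3 x 3 slice of a feature map after convolution.
feature_map = np.array([[ 0.77, -0.11,  0.33],
                        [-0.11,  1.00, -0.33],
                        [ 0.33, -0.55,  0.55]])

print(relu(feature_map))
```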


<b> Apply ReLu Activation Function on Convolutional layers: Convert all negative values to zero

<b> 6.2.1 Relu layer For Convolutional Layer 1, 2 and 3:
    
Apply ReLu activation Function on Convolutional Layer 1, 2 and 3 to convert all the negative values to zero.
    
![image.png](attachment:image.png)
    
![image-2.png](attachment:image-2.png)
    
![image-3.png](attachment:image-3.png)    
    

### 6.3 Pooling:
The pooling layer plays an important role in reducing the size of the feature maps. It reduces the number of values, and so the number of parameters needed in later layers, when the images are too large. Pooling is a "downscaling" of the feature maps obtained from the previous layers. It can be compared to shrinking an image to reduce its pixel density. Spatial pooling is also called downsampling or subsampling: it reduces the dimensionality of each map but retains the important information. The common types of spatial pooling are:

<b> Max Pooling: 
    
- Max pooling is a sample-based discretization process. Its main objective is to downscale an input representation, reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions.

    
- Max pooling is done by applying a max filter to non-overlapping sub-regions of the initial representation.


<b> Average Pooling:
    
- Average pooling downscales the input by dividing it into rectangular pooling regions and computing the average value of each region.
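Average pooling over non-overlapping regions can be sketched with a reshape, assuming the height and width of the map are exact multiples of the window size. The 4 x 4 input is a hypothetical example.

```python
import numpy as np

def average_pool2d(feature_map, window=2):
    """Average over non-overlapping window x window regions."""
    rows, cols = feature_map.shape
    # Assumes rows and cols are exact multiples of the window size.
    blocks = feature_map.reshape(rows // window, window, cols // window, window)
    return blocks.mean(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)  # hypothetical input
print(average_pool2d(feature_map))
# [[ 2.5  4.5]
#  [10.5 12.5]]
```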
    
    
    
In this layer, we shrink the image stack to a smaller size. Pooling is done after the activation layer. We do this with the following 4 steps:

- Pick a **window size** (often 2 or 3)
    
- Pick a **stride** (usually 2)
    
- **Walk** your Window **across** your **filtered** images
    
- From each **Window**, take the **maximum** value
    
    
Let us understand this with an example. Consider performing pooling with a window size of 2 and a stride of 2.
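The four steps can be sketched as a small function. This is a simplified illustration: the 4 x 4 input is hypothetical, and any partial window at the right or bottom edge is dropped, which may differ from how the figures in this section treat the border.

```python
import numpy as np

def max_pool2d(feature_map, window=2, stride=2):
    """Walk a window across the map and keep the maximum of each window."""
    rows, cols = feature_map.shape
    out_rows = (rows - window) // stride + 1
    out_cols = (cols - window) // stride + 1
    pooled = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + window, c:c + window].max()
    return pooled

# Hypothetical 4 x 4 feature map after the ReLU step.
feature_map = np.array([[1.0, 0.2, 0.0, 0.5],
                        [0.3, 0.4, 0.9, 0.1],
                        [0.0, 0.0, 0.6, 0.7],
                        [0.8, 0.1, 0.2, 0.3]])

print(max_pool2d(feature_map))
# [[1.  0.9]
#  [0.8 0.7]]
```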
    
<b> After applying the convolutional and ReLU layers, we now apply max pooling to convolutional layers 1, 2 and 3 and extract the maximum feature values from the image. 

<b> 6.3.1 Max pooling For Convolutional Layer 1
    
![image.png](attachment:image.png)
    
![image-2.png](attachment:image-2.png)
    
    
And so on.
    
![image-3.png](attachment:image-3.png)
    
    
<b> Similarly, we perform the same Max pooling For Convolutional Layer 2 and 3.

![image-4.png](attachment:image-4.png)
    
![image-5.png](attachment:image-5.png)    
    

**6.3.2 Further Max Pooling:** Further Max Pooling for use in fully connected layer

<b> Further Max pooling for convolutional layer 1
    
![image.png](attachment:image.png)    

<b> Similarly, we perform the same Max pooling For Convolutional Layer 2 and 3.
    
    
![image.png](attachment:image.png)    
    
    

### 6.4 Flattening:
In this step, we convert all the resulting 2-dimensional arrays into a single long continuous linear vector.

![image.png](attachment:image.png)


Now, this single long vector provides the input nodes of our fully connected layer.
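A minimal sketch of flattening, assuming three hypothetical 2 x 2 pooled maps: each map is unrolled row by row and the pieces are joined end to end. The values are illustrative and are not read from the figures.

```python
import numpy as np

# Three hypothetical 2 x 2 pooled feature maps (illustrative values only).
pooled_maps = [
    np.array([[1.00, 0.55], [0.55, 1.00]]),
    np.array([[1.00, 0.55], [0.55, 0.55]]),
    np.array([[0.55, 1.00], [1.00, 0.55]]),
]

# Unroll each map row by row and join the pieces into one long vector.
flat_vector = np.concatenate([m.ravel() for m in pooled_maps])

print(flat_vector.shape)  # (12,)
print(flat_vector)
```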

### 6.5 Fully Connected Layer ( for X ):
The last layer in the network is **fully connected**, meaning that neurons of preceding layers are connected to every neuron in subsequent layers.

This **mimics high-level reasoning** where all possible pathways from the input to output are considered.

Then we take the shrunk images and put them into a single list. In other words, after passing the image through two rounds of convolution, ReLU and pooling, we convert the result into a single list, or vector.

We read off the values of the pooled maps one after another (1, 0.55, 0.55, 1, and so on) and place them in one list, which is nothing but a vector. The fully connected layer is the last layer, where the classification happens. Here we took our filtered and shrunk images and put them into one single list, as shown below.

![image.png](attachment:image.png)


<b> Output:
    
When we feed in 'X' and 'O', some elements of the vector will be high. Consider the image below: for 'X' certain elements are high, and for 'O' a different set of elements is high.

If we repeat the entire process discussed above for each class, we find which values in the list are high for that class. For an X, the 1st, 4th, 5th, 10th, and 11th elements of the vector are high. For an O, the 2nd, 3rd, 9th, and 12th elements are high. So if an input image produces a vector whose 1st, 4th, 5th, 10th, and 11th elements are high, we can classify it as X. Similarly, if the 2nd, 3rd, 9th, and 12th elements are high, we can classify it as O.
    
![image-3.png](attachment:image-3.png)
    
    
If the 1st, 4th, 5th, 10th, and 11th values are high, we can classify the image as 'X'. The concept is the same for other letters as well: when certain values are high in a characteristic pattern, they can be mapped to the letter or number we are looking for.
    

<b> Comparing the Input Vector with X
    
After training is done for both 'X' and 'O', suppose a new input image gives us a 12-element vector with values such as 0.9, 0.65, and so on. How do we classify it as X or O? We compare it with the lists for X and O obtained in the previous step, where we have two different lists, one for X and one for O. First, let us compare it with X. For X, the 1st, 4th, 5th, 10th, and 11th values are the high ones, each equal to 1, so their sum is 1 + 1 + 1 + 1 + 1 = 5. Next we sum the corresponding values of our input vector: the 1st value is 0.9, the 4th is 0.87, the 5th is 0.96, the 10th is 0.89, and the 11th is 0.94, which gives 4.56. Dividing 4.56 by 5 gives 0.91.
    
![image-5.png](attachment:image-5.png)   
    
    
<b> Comparing the Input Vector with O

We then repeat the same process for O, where the 2nd, 3rd, 9th, and 12th elements of the vector are high. Summing these values gives 4, and summing the corresponding values of our input vector gives 2.07. Dividing 2.07 by 4 gives approximately 0.52.
    
![image-7.png](attachment:image-7.png)
    
    
<b> Result:

Now, 0.91 is higher than the value of about 0.5 obtained for O. Comparing the input image with the values of X gave a higher score than comparing it with the values of O, so the input image is classified as X.
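The comparison can be reproduced with a few lines of arithmetic. The five X-position values and the O-position sum of 2.07 are the ones quoted in the text; the individual O-position values are not given, so the sketch works from their sum. The variable names are assumptions.

```python
import numpy as np

# Input-vector values at the positions that are high for X (1st, 4th, 5th, 10th, 11th).
values_at_x_positions = np.array([0.9, 0.87, 0.96, 0.89, 0.94])
x_score = values_at_x_positions.sum() / 5  # the five reference values sum to 5

# Only the sum at the O positions (2nd, 3rd, 9th, 12th) is stated in the text.
sum_at_o_positions = 2.07
o_score = sum_at_o_positions / 4           # the four reference values sum to 4

predicted = "X" if x_score > o_score else "O"
print(round(x_score, 2))  # 0.91
print(o_score)
print(predicted)          # X
```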
    
