## CNN:
In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network most commonly applied to analyzing visual imagery. When we think of a neural network we usually think of dense matrix multiplications, but a ConvNet replaces most of them with a special operation called convolution. It is specifically designed to process pixel data and is used in image recognition and processing.

![convnet](../images/covnet.png)

1. Question: ***Why do we need CNN?***

Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have $32*32*3 = 3072$ weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have $200*200*3 = 120{,}000$ weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
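To make the scaling argument concrete, here is a small sketch that counts the weights of one fully-connected neuron and of a whole hidden layer. The helper name `fc_weights_per_neuron` and the hidden-layer width of 1000 are illustrative assumptions, not values from these notes.

```python
def fc_weights_per_neuron(width, height, channels):
    """Weights needed by ONE fully-connected neuron that sees the whole image."""
    return width * height * channels

cifar = fc_weights_per_neuron(32, 32, 3)      # CIFAR-10 sized image
larger = fc_weights_per_neuron(200, 200, 3)   # "more respectable" image size

hidden_units = 1000  # assumed layer width, for illustration only
layer_weights = larger * hidden_units

print(cifar, larger, layer_weights)  # 3072 120000 120000000
```

Even a modest hidden layer on a 200x200x3 image would already need on the order of a hundred million weights, which is the motivation for local connectivity and parameter sharing.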

Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in `3 dimensions: width, height, depth(channel)`.

### 1. Layers used to build ConvNets:
A simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer.

1. ***INPUT $[32*32*3]$*** will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B.

2. ***CONV layer*** will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as $[32*32*12]$ if we decided to use 12 filters.

3. ***RELU layer*** will apply an elementwise activation function, such as $max(0,x)$, which thresholds at zero. This leaves the size of the volume unchanged ($[32*32*12]$).

4. ***POOL layer*** will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as $[16*16*12]$.

5. ***FC (i.e. fully-connected) layer*** will compute the class scores, resulting in a volume of size $[1*1*10]$, where each of the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.
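The five steps above can be traced as a sequence of volume shapes. This is only a sketch: the notes do not say which filter size produces the $[32*32*12]$ volume, so the code assumes 3x3 filters with 1 pixel of zero-padding and a 2x2 pool with stride 2. The helper names `conv_shape` and `pool_shape` are hypothetical.

```python
def conv_shape(shape, num_filters, f, stride=1, pad=0):
    """Output (width, height, depth) of a CONV layer."""
    w, h, _ = shape
    out_w = (w - f + 2 * pad) // stride + 1
    out_h = (h - f + 2 * pad) // stride + 1
    return (out_w, out_h, num_filters)

def pool_shape(shape, f=2, stride=2):
    """Output (width, height, depth) of a POOL layer; depth is unchanged."""
    w, h, d = shape
    return ((w - f) // stride + 1, (h - f) // stride + 1, d)

trace = []
shape = (32, 32, 3)
trace.append(("INPUT", shape))
shape = conv_shape(shape, num_filters=12, f=3, pad=1)  # assumed 3x3 filters, pad 1
trace.append(("CONV", shape))
trace.append(("RELU", shape))       # elementwise, so the shape is unchanged
shape = pool_shape(shape)           # assumed 2x2 pool, stride 2
trace.append(("POOL", shape))
shape = (1, 1, 10)                  # FC layer: one score per CIFAR-10 class
trace.append(("FC", shape))

for name, s in trace:
    print(f"{name:5s} {s}")
```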

### 2. Convolutional Layer:
The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting.


***Overview:***

The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. It is mainly used to extract the features from the image. 

For example, a typical filter on the first layer of a ConvNet might have size ***5x5x3***. During the forward pass we slide (convolve) each filter across the width and height of the input volume and:
- compute dot products between the entries of the filter and the input at every position;
- produce a 2-dimensional activation map that gives the responses of that filter at every spatial position.

In neural-network terms, every entry in the 3D output volume can also be interpreted as the output of a neuron that looks at only a small region of the input and shares parameters with all neurons to its left and right spatially (since these numbers all result from applying the same filter).

In a convolutional layer, each neuron connects to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently, this is the filter size).
1. Example: if the input volume has size **[32 x 32 x 3]** and the ***receptive field (or filter size) is [5 x 5]***, then each neuron in the Conv Layer will have weights to a ***[5 x 5 x 3]*** region in the input volume, for a total of ***$5*5*3 = 75$*** weights (and +1 bias parameter) and 75 connections.

2. Example: if the input volume has size ***[16 x 16 x 20]*** and the ***receptive field (or filter size) is [3 x 3]***, then each neuron in the Conv Layer will have weights to a ***[3 x 3 x 20]*** region in the input volume, for a total of $3*3*20 = 180$ weights (and +1 bias parameter) and 180 connections.
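As a rough illustration of local connectivity, the sketch below computes the activation of a single neuron from the first example: it sees a 5x5 patch across the full depth of a [32 x 32 x 3] volume. The patch position, the bias value and the random data are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
volume = rng.standard_normal((32, 32, 3))  # input volume [32 x 32 x 3]
weights = rng.standard_normal((5, 5, 3))   # one neuron's weights: 5x5 receptive field, full depth
bias = 0.1                                 # assumed bias value

row, col = 10, 20                          # hypothetical top-left corner of the receptive field
patch = volume[row:row + 5, col:col + 5, :]

# The neuron's output is a dot product over the local region plus the bias.
activation = np.sum(patch * weights) + bias

print(weights.size, weights.size + 1)  # 75 weights, 76 parameters including the bias
print(3 * 3 * 20)                      # second example: 180 weights
```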

![convolution](../images/convolutional.png)

#### Questions according to Convolutional Layer:
1. What is local connectivity?
2. What is a filter, and how many filters can a convolutional layer have?
3. What does "spatial" mean?
4.  

#### How many neurons are there in the output volume, and how are they arranged?
Three hyperparameters control the size of the output volume: the ***depth, stride and zero-padding***.

1. ***Depth (hyperparameter):*** The depth of the output volume corresponds to the number of filters we want to use, each learning to look for something different in the input. For example, if the first conv layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of variously oriented edges or blobs of color. We refer to a set of neurons that all look at the same region of the input as a depth column.

image-1|img-2|
---|:---:|
![img](../images/filter-2.png)|![img](../images/filter-1.png)

2. ***Stride (hyperparameter):*** The stride is the step with which the filter slides over the image or video frame. When the stride is 1 we move the filter one pixel at a time; when the stride is 2 the filter jumps two pixels at a time, which produces a spatially smaller output volume.

3. ***Zero-padding (hyperparameter):*** Padding refers to the number of pixels added around the border of an image when it is processed by the kernel of a CNN; with zero-padding those pixels are all zero. The nice feature of zero-padding is that it allows us to control the spatial size of the output volume.


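##### Calculating the output size: examples
The spatial output size is $(N - F + 2P)/S + 1$, where $N$ is the input size, $F$ the filter size, $S$ the stride and $P$ the amount of zero-padding.
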
1. Example: 7x7 input (spatially) assume 3x3 filter; what is the output size?

Soln: Here, $N=7; F = 3; S=1; P=0;$<br>
***output size= $(7-3 +2*0)/1 + 1 =  4+1 = 5 -> 5*5$***

2. Example: 7x7 input (spatially) assume 3x3 filter applied with stride 2; what is the output size?

Soln: Here, $N=7; F = 3; S=2; P=0;$<br>
***output size= $(7-3 +2*0)/2 + 1 =  2+1 = 3 -> 3*3$***

3. Example: 7x7 input (spatially) assume 3x3 filter applied with stride 3; what is the output size?

Soln: Here, $N=7; F = 3; S=3; P=0;$<br>
***output size= $(7-3 +2*0)/3 + 1 =  1.33+1 = 2.33$***, which is not an integer, so the filter does not fit with this stride.

4. Example: input 7x7 and 3x3 filter, applied with stride 1 pad with 1 pixel border => what is the output size?

Soln: Here, $N=7; F = 3; S=1; P=1;$<br>
***output size= $(7-3 +2*1)/1 + 1 =  6+1 = 7 -> 7*7$***

5. Example: Input volume(32x32x3) and 10 (5x5) filters with stride 1, pad 2 then Output volume size: ?

Soln: Here, $N=32; F = 5; S=1; P=2;$<br>
***output size= $(32-5 +2*2)/1 + 1 =  31+1 = 32 -> 32*32 -> 32*32*10$***

6. Example: Input volume(32x32x3) and 10 (5x5) filters with stride 1, pad 2 then Number of parameters in this layer?

Soln: each filter has $5*5*3+1= 76$ params (+1 for the bias) => $76*10 = 760$ params in total.
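The six worked examples above can be checked with a few lines of code. This is a minimal sketch: the function name `conv_output_size` and the choice to raise an error when the filter does not fit are assumptions made here, not a library convention.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size (N - F + 2P) / S + 1; raises if the filter does not fit."""
    span = n - f + 2 * pad
    if span % stride != 0:
        raise ValueError(f"filter does not fit: ({n} - {f} + 2*{pad}) / {stride} is not an integer")
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))          # example 1 -> 5
print(conv_output_size(7, 3, stride=2))          # example 2 -> 3
try:
    conv_output_size(7, 3, stride=3)             # example 3 -> does not fit
except ValueError as err:
    print(err)
print(conv_output_size(7, 3, stride=1, pad=1))   # example 4 -> 7
print(conv_output_size(32, 5, stride=1, pad=2))  # example 5 -> 32 (so 32x32x10 with 10 filters)

params = (5 * 5 * 3 + 1) * 10                    # example 6: weights + bias, times 10 filters
print(params)                                    # 760
```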


1. How can we set zero-padding efficiently?<br>
In general, it is common to see CONV layers with stride 1, filters of size F x F, and zero-padding of (F-1)/2, which preserves the spatial size.
For example, with a 7x7 input and stride 1, to preserve the spatial size:
    - $F= 3$ => zero-padding = $(3-1)/2=1$ and output-size=$(7-3+2*1)/1 +1 = 7->7*7$
    - $F= 5$ => zero-padding = $(5-1)/2=2$ and output-size=$(7-5+2*2)/1 +1 = 7->7*7$
    - $F= 7$ => zero-padding = $(7-1)/2=3$ and output-size=$(7-7+2*3)/1 +1 = 7->7*7$
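The size-preserving rule can be verified for the three filter sizes listed above. The sketch assumes stride 1 and odd filter sizes, so that $(F-1)/2$ is an integer; `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size (N - F + 2P) / S + 1."""
    return (n - f + 2 * pad) // stride + 1

n = 7
results = {}
for f in (3, 5, 7):
    pad = (f - 1) // 2  # size-preserving padding for odd F and stride 1
    results[f] = (pad, conv_output_size(n, f, stride=1, pad=pad))

print(results)  # {3: (1, 7), 5: (2, 7), 7: (3, 7)}
```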

#### Summary of the Conv Layer:
1. Accepts a volume of size $W_1 \times H_1 \times D_1$
2. Requires four hyperparameters:
    - Number of filters $K$
    - their spatial extent $F$
    - the stride $S$
    - the amount of zero padding $P$.

3. Produces a volume of size $W_2 \times H_2 \times D_2$ where:
    - $W_2 = (W_1 - F + 2P)/S + 1$
    - $H_2 = (H_1 - F + 2P)/S + 1$ (i.e. width and height are computed equally by symmetry)
    - $D_2 = K$
4. With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
5. In the output volume, the $d^{th}$ depth slice (of size $W_2 \times H_2$) is the result of performing a valid convolution of the $d^{th}$ filter over the input volume with a stride of $S$, and then offset by $d^{th}$ bias.

***A common setting of the hyperparameters is F=3,S=1,P=1.***
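The summary above translates fairly directly into a naive forward pass. The following is a slow, loop-based sketch for illustration only (real libraries use much faster routines); the function name `conv_forward_naive` and the (height, width, depth) array layout are assumptions.

```python
import numpy as np

def conv_forward_naive(x, w, b, stride=1, pad=0):
    """x: (H1, W1, D1), w: (K, F, F, D1), b: (K,) -> output of shape (H2, W2, K)."""
    h1, w1, _ = x.shape
    k, f, _, _ = w.shape
    assert (h1 - f + 2 * pad) % stride == 0 and (w1 - f + 2 * pad) % stride == 0
    h2 = (h1 - f + 2 * pad) // stride + 1
    w2 = (w1 - f + 2 * pad) // stride + 1
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero-pad the spatial dims only
    out = np.zeros((h2, w2, k))
    for i in range(h2):
        for j in range(w2):
            patch = xp[i * stride:i * stride + f, j * stride:j * stride + f, :]
            for d in range(k):
                # d-th depth slice: d-th filter applied to the patch, offset by the d-th bias
                out[i, j, d] = np.sum(patch * w[d]) + b[d]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))
w = rng.standard_normal((10, 5, 5, 3))  # K=10 filters with F=5
b = rng.standard_normal(10)
out = conv_forward_naive(x, w, b, stride=1, pad=2)

print(out.shape)        # (32, 32, 10)
print(w.size + b.size)  # 760 parameters: (F*F*D1)*K weights + K biases
```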


#### Implementation as Matrix Multiplication:
Note that the convolution operation essentially performs dot products between the filters and local regions of the input.

Suppose the input is [227x227x3], the filters are [11x11x3], and the stride is S = 4.
1. ***im2col:*** The local regions in the input image are stretched out into columns in an operation commonly called im2col.
    - Each [11x11x3] block of pixels is stretched into a column vector with $11*11*3 = 363$ rows.
    - Iterating this process over the input at a stride of 4 gives $((227-11)/4)+1 = 55$ locations along both width and height, leading to an output matrix `X_col` of size [363 x 3025], where every column is a stretched-out receptive field and there are $55*55 = 3025$ columns in total.
    - ***Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.***
2. ***W_row:*** The weights of the CONV layer are similarly stretched out into rows. For example, if there are `96` filters of size [11x11x3] this would give a matrix `W_row` of size [96 x 363].

3. The result of a convolution is now equivalent to performing one large matrix multiply `np.dot(W_row, X_col)`, which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.

4. The result must finally be reshaped back to its proper output dimension [55x55x96]
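The four steps above can be sketched in NumPy with the same sizes as the example. This version uses plain loops for readability and omits the bias and zero-padding; the function name `im2col` and the (height, width, depth) layout are assumptions made here.

```python
import numpy as np

def im2col(x, f, stride):
    """Stretch every f x f x D receptive field of x (H, W, D) into one column."""
    h, w, d = x.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    cols = np.empty((f * f * d, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            cols[:, i * out_w + j] = patch.ravel()
    return cols, out_h, out_w

rng = np.random.default_rng(0)
x = rng.standard_normal((227, 227, 3))
filters = rng.standard_normal((96, 11, 11, 3))

X_col, out_h, out_w = im2col(x, f=11, stride=4)       # [363 x 3025]
W_row = filters.reshape(96, -1)                       # [96 x 363]
out = np.dot(W_row, X_col)                            # [96 x 3025]
out = out.reshape(96, out_h, out_w).transpose(1, 2, 0)  # back to [55 x 55 x 96]

print(X_col.shape, W_row.shape, out.shape)
```

The filters are flattened in the same (row, column, depth) order as the patches, which is what makes each row-by-column product equal to one filter response.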

#### How does backpropagation work in a ConvNet?
The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially flipped filters).
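A small 1-D, single-channel check of this claim, assuming stride 1 and no padding: the forward pass is a valid cross-correlation, the gradient with respect to the input is a full convolution of the upstream gradient with the filter (which flips it), and the gradient with respect to the filter is a valid cross-correlation of the input with the upstream gradient. The loop-based reference gradients are included only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(7)  # input signal
w = rng.standard_normal(3)  # filter

# Forward pass: y[i] = sum_k w[k] * x[i + k]
y = np.correlate(x, w, mode="valid")
dy = rng.standard_normal(y.shape)  # upstream gradient dL/dy (arbitrary values)

# Backward pass expressed as convolutions
dx = np.convolve(dy, w, mode="full")     # dL/dx: full convolution, filter is flipped
dw = np.correlate(x, dy, mode="valid")   # dL/dw: correlation of input with dL/dy

# Reference gradients accumulated directly from the definition of the forward pass
dx_ref = np.zeros_like(x)
dw_ref = np.zeros_like(w)
for i in range(len(y)):
    for k in range(len(w)):
        dx_ref[i + k] += dy[i] * w[k]
        dw_ref[k] += dy[i] * x[i + k]

print(np.allclose(dx, dx_ref), np.allclose(dw, dw_ref))
```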

#### Dilated convolutions:
A dilated convolution is a type of convolution that introduces gaps (holes) between the kernel elements. This is equivalent to inserting zeros between the kernel's weights, which enlarges the receptive field without adding parameters.

This can be very useful in conjunction with undilated (standard) filters, because it lets you merge spatial information across the input much more aggressively with fewer layers.

For example, if you stack two 3x3 CONV layers on top of each other, the neurons in the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5). With dilated convolutions this effective receptive field grows much more quickly.
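The receptive-field arithmetic can be sketched as follows. The code assumes stride 1 everywhere and uses the convention of `torch.nn.Conv2d`, where `dilation=1` means an ordinary filter and `dilation=2` inserts one gap between kernel elements; the helper names are hypothetical.

```python
def effective_kernel(f, dilation=1):
    """Spatial extent covered by an f x f kernel with the given dilation."""
    return f + (f - 1) * (dilation - 1)

def stacked_receptive_field(layers):
    """Receptive field of stacked stride-1 conv layers given as (filter_size, dilation)."""
    rf = 1
    for f, d in layers:
        rf += effective_kernel(f, d) - 1
    return rf

print(stacked_receptive_field([(3, 1), (3, 1)]))  # two plain 3x3 layers -> 5
print(stacked_receptive_field([(3, 1), (3, 2)]))  # second layer dilated by 2 -> 7
```

With the same number of weights per layer, dilating the second layer widens the effective receptive field from 5x5 to 7x7.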

In [30]:
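out_size = ((227 - 11) / 4) + 1  # im2col example: number of filter locations per spatial dimension
print(out_size, 11 * 11 * 3)     # ... and the length of each stretched-out column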


55.0 363


#### CONV layer in Torch:

In [None]:
import torch
import torch.nn.functional as F
import torchinfo

class CNNNet(torch.nn.Module):
    """LeNet-style ConvNet for 32x32 inputs; shape comments assume a batch of size 1."""

    def __init__(self, in_channels=3, num_classes=10):
        super().__init__()
        # params per conv layer = F*F*D_in*K + K; per linear layer = in*out + out
        self.conv1 = torch.nn.Conv2d(in_channels, out_channels=6, kernel_size=(5, 5))  # params: 5*5*3*6 + 6 = 456
        self.pool = torch.nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))  # no params
        self.conv2 = torch.nn.Conv2d(in_channels=6, out_channels=16, kernel_size=(5, 5))  # params: 5*5*6*16 + 16 = 2416
        self.conv3 = torch.nn.Conv2d(16, 120, 5)  # params: 5*5*16*120 + 120 = 48120
        self.flat = torch.nn.Flatten()  # no params
        self.fc1 = torch.nn.Linear(120, 64)  # params: 120*64 + 64 = 7744
        self.fc2 = torch.nn.Linear(64, num_classes)  # params: 64*10 + 10 = 650

    def forward(self, x):
        x = F.relu(self.conv1(x))  # (32-5+2*0)/1 + 1 = 28 -> [1, 6, 28, 28]; col_row: 456*28*28
        x = self.pool(x)           # (28-2)/2 + 1 = 14 -> [1, 6, 14, 14]
        x = F.relu(self.conv2(x))  # (14-5)/1 + 1 = 10 -> [1, 16, 10, 10]; col_row: 2416*10*10
        x = self.pool(x)           # (10-2)/2 + 1 = 5 -> [1, 16, 5, 5]
        x = F.relu(self.conv3(x))  # (5-5)/1 + 1 = 1 -> [1, 120, 1, 1]; col_row: 48120*1*1
        x = self.flat(x)           # [1, 120*1*1] -> [1, 120] (same as x.reshape(x.shape[0], -1))
        x = F.relu(self.fc1(x))    # [1, 64]; col_row: 7744
        x = self.fc2(x)            # [1, 10]; col_row: 650
        return x

model = CNNNet()
torchinfo.summary(
    model,
    input_size=(1, 3, 32, 32),
    col_names=("input_size", "output_size", "num_params", "mult_adds"),
    verbose=2,
)

#### CONV layer in TensorFlow:

### Pooling Layer:
The pooling layer's function is to progressively reduce the spatial size of the representation in order to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, most commonly using the MAX operation.

The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations.

Every MAX operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice).

1. ***Average Pooling*** takes a sliding window (for example, 2x2 pixels) and computes the average of the values within the window.
2. ***Max Pooling*** replaces the window with its maximum value. The idea behind max pooling is to detect the presence of a certain pattern within the sliding window.

More generally, the pooling layer:

1. Accepts a volume of size $W_1 \times H_1 \times D_1$
2. Requires two hyperparameters:
    - their spatial extent $F$,
    - the stride $S$,
3. Produces a volume of size $W_2 \times H_2 \times D_2$ where:
    - $W_2 = (W_1 - F)/S + 1$
    - $H_2 = (H_1 - F)/S + 1$
    - $D_2 = D_1$
4. Introduces zero parameters since it computes a fixed function of the input.
5. For Pooling layers, it is not common to pad the input using zero-padding.

In practice, two variations of the max pooling layer are commonly seen: a pooling layer with $F=3,S=2$ (also called overlapping pooling), and more commonly $F=2,S=2$. Pooling sizes with larger receptive fields are too destructive.
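A loop-based sketch of the pooling layer described above, operating independently on every depth slice. The function name `pool2d` and its `op` parameter (which switches between max and average pooling) are assumptions for illustration.

```python
import numpy as np

def pool2d(x, f=2, stride=2, op=np.max):
    """Pool x of shape (H, W, D) with an f x f window; op is np.max or np.mean."""
    h, w, d = x.shape
    out_h = (h - f) // stride + 1
    out_w = (w - f) // stride + 1
    out = np.empty((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f, :]
            out[i, j, :] = op(window, axis=(0, 1))  # reduce each depth slice separately
    return out

x = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
max_pooled = pool2d(x)               # F=2, S=2 max pooling
avg_pooled = pool2d(x, op=np.mean)   # F=2, S=2 average pooling
kept = max_pooled.size / x.size      # fraction of activations that survive

print(max_pooled.shape, kept)        # (2, 2, 2) 0.25, i.e. 75% are discarded
print(pool2d(np.zeros((5, 5, 1)), f=3, stride=2).shape)  # overlapping pooling -> (2, 2, 1)
```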

![pooling-layer](../images/pooling-layer.png)


#### Back Propagation:
During the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
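The idea of "switches" can be sketched on a single depth slice: the forward pass remembers which element in each window was the maximum, and the backward pass sends the upstream gradient only to that element. Function names and the example values are assumptions for illustration.

```python
import numpy as np

def max_pool_forward(x, f=2, stride=2):
    """x: (H, W). Returns the pooled output and the flat argmax index of each window."""
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    out = np.empty((out_h, out_w))
    switches = np.empty((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + f, j * stride:j * stride + f]
            switches[i, j] = window.argmax()  # position of the max inside the window
            out[i, j] = window.max()
    return out, switches

def max_pool_backward(dout, switches, x_shape, f=2, stride=2):
    """Route each upstream gradient to the input position that produced the max."""
    dx = np.zeros(x_shape)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            r, c = divmod(switches[i, j], f)  # flat window index -> (row, col) in window
            dx[i * stride + r, j * stride + c] += dout[i, j]
    return dx

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [6., 2., 3., 1.]])
out, switches = max_pool_forward(x)
dout = np.array([[10., 20.], [30., 40.]])  # arbitrary upstream gradient
dx = max_pool_backward(dout, switches, x.shape)
print(out)
print(dx)
```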

### Fully-connected layer
Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset.
![FC-layer](../images/fc-layer.png)
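As a minimal sketch, a fully-connected layer is one matrix multiplication followed by a bias offset. The sizes (120 inputs, 64 outputs) are borrowed from the `fc1` layer of the Torch model earlier in these notes; the random values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(120)        # activations of the previous layer
W = rng.standard_normal((64, 120))  # one row of weights per output neuron
b = rng.standard_normal(64)         # one bias per output neuron

out = W @ x + b                     # matrix multiplication followed by a bias offset

print(out.shape, W.size + b.size)   # (64,) 7744
```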

### FC layers vs CONV layers:
The only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region of the input, and that many of the neurons in a CONV volume share parameters.

However the neurons in both layers still compute dot products, so their functional form is identical.

#### Converting FC and CONV:
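One way to see the equivalence, sketched under assumed sizes: a CONV layer whose filters are exactly as large as its input volume produces a 1x1xK output, and that output matches an FC layer that uses the same weights flattened into a matrix. The sizes below (a 5x5x16 input and 120 filters) mirror `conv3` in the Torch model earlier in these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 16))             # input volume
filters = rng.standard_normal((120, 5, 5, 16))  # 120 filters, each covering the whole input
bias = rng.standard_normal(120)

# CONV view: each filter fits at exactly one position, giving a 1x1x120 output.
conv_out = np.array([np.sum(x * filters[k]) + bias[k] for k in range(120)])

# FC view: flatten every filter into one row of a weight matrix.
W = filters.reshape(120, -1)                    # [120 x 400]
fc_out = W @ x.ravel() + bias

print(np.allclose(conv_out, fc_out))
```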



In [27]:
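import numpy as np

# 5x5 single-channel input
w = np.array([[1,1,1,0,0],
              [0,1,1,1,0],
              [0,0,1,1,1],
              [0,0,1,1,0],
              [0,1,1,0,0],
              ])
# 3x3 filter
f = np.array([[1,0,1],
              [0,1,0],
              [1,0,1]
              ])
# top-left 3x3 patch of the input, i.e. w[:3, :3]
w1 = np.array([[1,1,1],
               [0,1,1],
               [0,0,1]
               ])
# elementwise product of patch and filter; its sum is one output value of the convolution
w1 * f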

array([[1, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])

- Input:      Color images of size 227x227x3
- Conv-1:     96 kernels of size 11×11, stride: 4, padding: 0, out: 55x55x96
- Max-pool-1: pool: 3x3, stride: 2, out: 27x27x96
- Conv-2:     256 kernels of size 5x5, stride: 1, padding: 2, out: 27x27x256
- Max-pool-2: pool: 3x3, stride: 2, out: 13x13x256
- Conv-3:     384 kernels of size 3x3, stride: 1, padding: 1, out: 13x13x384
- Conv-4:     384 kernels of size 3x3, stride: 1, padding: 1, out: 13x13x384
- Conv-5:     256 kernels of size 3x3, stride: 1, padding: 1, out: 13x13x256
- Max-pool-3: pool: 3x3, stride: 2, out: 6x6x256
- FC-1:       4096 neurons, out: 4096×1
- FC-2:       4096 neurons, out: 4096×1
- FC-3:       1000 neurons, out: 1000×1
- Output:     1000×1
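The output sizes in this list can be reproduced with the output-size formula from the Conv Layer section. The sketch assumes ordinary (ungrouped) convolutions exactly as listed above, so the parameter counts follow $F \cdot F \cdot D_1 \cdot K + K$; the helper name `out_size` and the tuple layout are assumptions.

```python
def out_size(n, f, stride, pad=0):
    """Spatial output size (N - F + 2P) / S + 1."""
    return (n - f + 2 * pad) // stride + 1

# (name, filter size, stride, padding, number of filters or None for pooling)
layers = [
    ("Conv-1", 11, 4, 0, 96),
    ("Max-pool-1", 3, 2, 0, None),
    ("Conv-2", 5, 1, 2, 256),
    ("Max-pool-2", 3, 2, 0, None),
    ("Conv-3", 3, 1, 1, 384),
    ("Conv-4", 3, 1, 1, 384),
    ("Conv-5", 3, 1, 1, 256),
    ("Max-pool-3", 3, 2, 0, None),
]

size, depth = 227, 3
shapes = {}
params = {}
for name, f, s, p, k in layers:
    if k is not None:                 # conv layer: weights plus one bias per filter
        params[name] = f * f * depth * k + k
        depth = k
    size = out_size(size, f, s, p)
    shapes[name] = (size, size, depth)
    print(name, shapes[name], params.get(name, 0))

fc1_inputs = size * size * depth      # flattened input to FC-1
print(fc1_inputs)                     # 9216
```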

![link](../images/AlexNet-1.png)

In [None]:
## Alexnet:
conv-1 -> input=[1, 3, 227, 227] output=[1, 96, 55, 55] params: (11*11*3*96)+96 = 34944 col_row: 34944*55*55
max-pool-1 -> input=[1, 96, 55, 55] output=[1, 96, 27, 27] params: null
conv-2 -> input=[1, 96, 27, 27] output=[1, 256, 27, 27] params: (5*5*96*256)+256 = 614656 col_row: 614656*27*27
conv-3-> 