# Back propagation in CNN

A CNN is built mainly from two types of layers: **convolution layers and pooling layers.**
```{figure} https://www.researchgate.net/publication/327571320/figure/fig4/AS:700669442932736@1544064021996/The-network-architecture-composed-of-five-convolutional-layers-three-pooling-layers-and.jpg
:align: center
:width: 500px
```

Let's consider them separately:

## Back propagation in Convolution layer

Backpropagation in CNN (Convolutional Neural Network) is a process where the network adjusts its internal parameters based on the difference between predicted and actual outcomes, helping it learn and improve its ability to recognize features in images.
```{figure} back_prop/111.png
:align: center
:width: 500px
```
Using the input layer and filter, let's perform the convolution operation.
```{figure} https://miro.medium.com/v2/resize:fit:488/format:webp/1*4h_J0Zpx93_sFHKxWUoHAw.gif
:align: center
```
The kernel first moves horizontally across the input layer, then shifts down and moves horizontally again.
At each position, the sum of the element-wise products of the image pixel values and the kernel values gives one entry of the output matrix.

**Stride** denotes how many positions the kernel moves at each step of the convolution.\
**Padding** is the process of adding zeros symmetrically around the input matrix.\
Padding and strides provide control over the spatial dimensions of the output, help preserve information at the borders, and contribute to the efficiency and effectiveness of the convolutional operation in neural networks.
```{figure} https://miro.medium.com/v2/resize:fit:1400/format:webp/1*17TNPi4m0pBqOCGrXzU27w.gif
:align: center
:width: 400px
```
In the above illustration, the extra grey blocks denote the padding. Here it is used to make the output the same size as the input.

To explain backpropagation in a convolutional neural network (CNN), we use a convolution with no padding (padding = 0) and a stride of two (stride = 2).
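As a quick check on how padding and stride control the output size, here is a minimal sketch. The helper name `conv_output_size` is hypothetical, and the 5x5 input and 3x3 kernel are assumptions read off the indices $a_1 \dots a_{25}$ and $w_1 \dots w_9$ used in the equations of this page.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Size of the convolution output along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Setting used in this walkthrough: no padding, stride 2 (assumed 5x5 input, 3x3 kernel).
size_walkthrough = conv_output_size(5, 3, padding=0, stride=2)

# Padding of 1 with stride 1 keeps the output the same size as the input.
size_same = conv_output_size(5, 3, padding=1, stride=1)

print(size_walkthrough, size_same)
```

With these assumptions the walkthrough setting gives a 2x2 output, which matches the four values $z_1 \dots z_4$.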

#### Forward Propagation
Using the input layer and filter, we can perform forward propagation to find the output layer z.
```{figure} back_prop/222.gif
:align: center
:width: 800px
```
After the convolution we end up with an output matrix which we can call layer 1.

$$
\begin{align*}
z_1 &= w_1 \times a_1 + w_2 \times a_2 + w_3 \times a_3 + w_4 \times a_6 + \dots + w_9 \times a_{13} \\
z_2 &= w_1 \times a_3 + w_2 \times a_4 + w_3 \times a_5 + w_4 \times a_8 + \dots + w_9 \times a_{15} \\
z_3 &= w_1 \times a_{11} + w_2 \times a_{12} + w_3 \times a_{13} + w_4 \times a_{16} + \dots + w_9 \times a_{23} \\
z_4 &= w_1 \times a_{13} + w_2 \times a_{14} + w_3 \times a_{15} + w_4 \times a_{18} + \dots + w_9 \times a_{25}
\end{align*}
$$
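The four equations above can be reproduced with a small NumPy sketch. The function name `conv2d_forward` and the numeric values are illustrative assumptions: the input holds the numbers 1 to 25 so that the entry $a_k$ equals $k$, and the kernel is all ones so that each $z$ is simply the sum of the inputs it touches.

```python
import numpy as np


def conv2d_forward(a, w, stride=1):
    """Naive valid convolution (no padding) of a square input with a square kernel."""
    k = w.shape[0]
    out = (a.shape[0] - k) // stride + 1
    z = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # Patch of the input currently covered by the kernel.
            patch = a[i * stride:i * stride + k, j * stride:j * stride + k]
            z[i, j] = np.sum(patch * w)
    return z


a = np.arange(1, 26, dtype=float).reshape(5, 5)  # a_1 ... a_25, row by row
w = np.ones((3, 3))                              # illustrative kernel w_1 ... w_9
z = conv2d_forward(a, w, stride=2)
print(z)  # z[0, 0] is z_1, z[0, 1] is z_2, z[1, 0] is z_3, z[1, 1] is z_4
```

For example, `z[0, 0]` sums $a_1, a_2, a_3, a_6, a_7, a_8, a_{11}, a_{12}, a_{13}$, which are exactly the inputs in the equation for $z_1$.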

Then, we flatten out layer 1 and output a prediction which we will denote as $\hat{y}$. $\hat{y}$ can then be used to calculate the loss.
```{figure} back_prop/333.png
:align: center
:width: 700px
```

#### Back Propagation
In order to update the weight, you can use the following formula:

$$
\begin{align*}
new\_{w_i} &= w_i - \alpha \times \frac{\partial L}{\partial w_i}
\end{align*}
$$

Here,\
$\quad$ $\alpha$ is the learning rate\
$\quad$ $i$ is the index of the weight, ranging from 1 to 9\
$\quad$ $w_i$ is the $i$-th weight of the kernel

The unknown in this formula is the partial derivative of the loss with respect to each weight. These partial derivatives can be represented as a matrix.
```{figure} back_prop/444.png
:align: center
:width: 200px
```
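The update rule can be sketched in a few lines of NumPy. The kernel values, the gradient matrix, and the learning rate below are made-up numbers chosen only to illustrate the formula; in practice the gradient comes from backpropagation.

```python
import numpy as np

alpha = 0.1  # learning rate (illustrative value)

# Illustrative 3x3 kernel holding w_1 ... w_9.
w = np.array([[1.0, 0.0, -1.0],
              [2.0, 0.0, -2.0],
              [1.0, 0.0, -1.0]])

# Made-up matrix of partial derivatives dL/dw_i, same shape as the kernel.
dL_dw = np.full((3, 3), 0.5)

# new_w_i = w_i - alpha * dL/dw_i, applied to all nine weights at once.
new_w = w - alpha * dL_dw
print(new_w)
```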

<span style="display:none" id="q_learing_rate">W3sicXVlc3Rpb24iOiAiV2hhdCBpcyB0aGUgcHVycG9zZSBvZiB0aGUgbGVhcm5pbmcgcmF0ZSBpbiBhIG5ldXJhbCBuZXR3b3JrPyIsICJ0eXBlIjogIm1hbnlfY2hvaWNlIiwgImFuc3dlcnMiOiBbeyJhbnN3ZXIiOiAiVG8gY29udHJvbCB0aGUgc3BlZWQgb2Ygd2VpZ2h0IHVwZGF0ZXMiLCAiY29ycmVjdCI6IHRydWUsICJmZWVkYmFjayI6ICJUaGUgbGVhcm5pbmcgcmF0ZSBjb250cm9scyB0aGUgc3BlZWQgYXQgd2hpY2ggdGhlIHdlaWdodHMgYXJlIHVwZGF0ZWQgZHVyaW5nIHRyYWluaW5nLCBhZmZlY3RpbmcgdGhlIGNvbnZlcmdlbmNlIG9mIHRoZSBuZXVyYWwgbmV0d29yay4ifSwgeyJhbnN3ZXIiOiAiVG8gZGV0ZXJtaW5lIHRoZSBudW1iZXIgb2YgbGF5ZXJzIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdCBwbGVhc2UgdHJ5IGFnYWluIn0sIHsiYW5zd2VyIjogIlRvIHNldCB0aGUgYWN0aXZhdGlvbiBmdW5jdGlvbiIsICJjb3JyZWN0IjogZmFsc2UsICJmZWVkYmFjayI6ICJJbmNvcnJlY3QgcGxlYXNlIHRyeSBhZ2FpbiJ9LCB7ImFuc3dlciI6ICJUbyBpbml0aWFsaXplIHRoZSB3ZWlnaHRzIiwgImNvcnJlY3QiOiBmYWxzZSwgImZlZWRiYWNrIjogIkluY29ycmVjdCBwbGVhc2UgdHJ5IGFnYWluIn1dfV0=</span>

from jupyterquiz import display_quiz
display_quiz('#q_learing_rate')



Let's calculate the partial derivative of the loss with respect to the weights.\
Start with $w_1$. A change in $w_1$ changes all the $z$ values, because $w_1$ appears in every equation for $z$. The change in the $z$ values causes $\hat{y}$ to change, which in turn causes the loss to change.
```{figure} back_prop/555.png
:align: center
:width: 700px
```

$$
\frac{\partial L}{\partial w_1} = \frac{\partial z_1}{\partial w_1} \frac{\partial \hat{y}}{\partial z_1} \frac{\partial L}{\partial \hat{y}} + \frac{\partial z_2}{\partial w_1} \frac{\partial \hat{y}}{\partial z_2} \frac{\partial L}{\partial \hat{y}} + \frac{\partial z_3}{\partial w_1} \frac{\partial \hat{y}}{\partial z_3} \frac{\partial L}{\partial \hat{y}} + \frac{\partial z_4}{\partial w_1} \frac{\partial \hat{y}}{\partial z_4} \frac{\partial L}{\partial \hat{y}}
$$

Simplify the equation by transforming $\frac{\partial \hat{y}}{\partial z_1} \frac{\partial L}{\partial \hat{y}}$ to $\frac{\partial L}{\partial z_1}$, and apply the same logic to others.

$$
\frac{\partial L}{\partial w_1} = \frac{\partial z_1}{\partial w_1} \frac{\partial L}{\partial z_1} + \frac{\partial z_2}{\partial w_1} \frac{\partial L}{\partial z_2} + \frac{\partial z_3}{\partial w_1} \frac{\partial L}{\partial z_3} + \frac{\partial z_4}{\partial w_1} \frac{\partial L}{\partial z_4}
$$

Looking at the terms where we take the partial derivative with respect to $w_1$, we notice that they can be read directly from the equations generated during forward propagation. These equations determine the filter gradients.

$$
\begin{align*}
z_1 &= w_1 \times a_1 + w_2 \times a_2 + w_3 \times a_3 + w_4 \times a_6 + \dots + w_9 \times a_{13} \\
z_2 &= w_1 \times a_3 + w_2 \times a_4 + w_3 \times a_5 + w_4 \times a_8 + \dots + w_9 \times a_{15} \\
z_3 &= w_1 \times a_{11} + w_2 \times a_{12} + w_3 \times a_{13} + w_4 \times a_{16} + \dots + w_9 \times a_{23} \\
z_4 &= w_1 \times a_{13} + w_2 \times a_{14} + w_3 \times a_{15} + w_4 \times a_{18} + \dots + w_9 \times a_{25}
\end{align*}
$$

Simplify the equation by transforming $\frac{\partial z_1}{\partial w_1}$ to $a_1$, and apply the same logic to others.

$$
\frac{\partial L}{\partial w_1} = a_1 \frac{\partial L}{\partial z_1} + a_3 \frac{\partial L}{\partial z_2} + a_{11} \frac{\partial L}{\partial z_3} + a_{13} \frac{\partial L}{\partial z_4}
$$

$$
\frac{\partial L}{\partial w_2} = a_2 \frac{\partial L}{\partial z_1} + a_4 \frac{\partial L}{\partial z_2} + a_{12} \frac{\partial L}{\partial z_3} + a_{14} \frac{\partial L}{\partial z_4}
$$ 
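The same sums can be computed for all nine weights at once. The sketch below is a minimal illustration, not a reference implementation: `conv_filter_grad` is a hypothetical helper, the input holds the numbers 1 to 25 so that $a_k = k$, and the values of $\frac{\partial L}{\partial z_j}$ are made up.

```python
import numpy as np


def conv_filter_grad(a, dL_dz, kernel_size, stride):
    """Accumulate dL/dw as the sum over outputs of (input patch) * dL/dz."""
    dL_dw = np.zeros((kernel_size, kernel_size))
    out_h, out_w = dL_dz.shape
    for i in range(out_h):
        for j in range(out_w):
            # Input patch that produced the output z[i, j] in the forward pass.
            patch = a[i * stride:i * stride + kernel_size,
                      j * stride:j * stride + kernel_size]
            dL_dw += patch * dL_dz[i, j]
    return dL_dw


a = np.arange(1, 26, dtype=float).reshape(5, 5)  # a_1 ... a_25, row by row
dL_dz = np.array([[1.0, 2.0],
                  [3.0, 4.0]])                   # made-up dL/dz_1 ... dL/dz_4
dL_dw = conv_filter_grad(a, dL_dz, kernel_size=3, stride=2)
print(dL_dw)
```

With these numbers, `dL_dw[0, 0]` equals $a_1 \cdot 1 + a_3 \cdot 2 + a_{11} \cdot 3 + a_{13} \cdot 4 = 92$, following the formula for $\frac{\partial L}{\partial w_1}$.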

```{figure} back_prop/p_2.png
:align: center
:width: 500px
```
To see the pattern in the filter gradient values, dilate the output gradient tensor: insert zeros between its entries so that they are spaced according to the stride. With this modification, the filter gradient values line up in a regular pattern. The non-zero entries of this tensor are the partial derivatives of the loss with respect to the terms in layer 1, $\frac{\partial L}{\partial z_1}, \dots, \frac{\partial L}{\partial z_4}$.

```{figure} back_prop/p_3.gif
:align: center
:width: 500px
```

Next, copy the values from the input layer and multiply them by the partial derivative of the loss with respect to $z_1$. Then repeat exactly the same steps for the remaining terms, up to the fourth. Adding the resulting matrices together gives the matrix containing the partial derivatives of the loss with respect to the weights.
```{figure} back_prop/p_4.png
:align: center
:width: 500px
```
Multiplying this matrix by the learning rate $\alpha$ and subtracting it from the kernel gives the updated weights, exactly as in the formula that we looked at earlier.
```{figure} back_prop/p_5.png
:align: center
:width: 500px
```
$$
\begin{align*}
new\_{w_i} &= w_i - \alpha \times \frac{\partial L}{\partial w_i}
\end{align*}
$$
Computing the filter gradient in this way is equivalent to convolving the input tensor with a dilated version of the output gradient tensor, using a stride of 1.
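This equivalence can be checked numerically. The sketch below assumes a 5x5 input with $a_k = k$, a forward stride of 2, and made-up values for the output gradient; `dilate` and `correlate_valid` are hypothetical helpers written for this illustration.

```python
import numpy as np


def dilate(g, stride):
    """Insert (stride - 1) zeros between neighbouring entries of g."""
    h, w = g.shape
    out = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    out[::stride, ::stride] = g
    return out


def correlate_valid(x, k):
    """Slide k over x with stride 1 and no padding, summing element-wise products."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


a = np.arange(1, 26, dtype=float).reshape(5, 5)  # a_1 ... a_25, row by row
dL_dz = np.array([[1.0, 2.0],
                  [3.0, 4.0]])                   # made-up output gradient

dilated = dilate(dL_dz, stride=2)                # 3x3 tensor with zeros in between
dL_dw = correlate_valid(a, dilated)              # 3x3 filter gradient
print(dilated)
print(dL_dw)
```

The top-left entry of `dL_dw` is $a_1 \cdot 1 + a_3 \cdot 2 + a_{11} \cdot 3 + a_{13} \cdot 4$, the same sum as in the formula for $\frac{\partial L}{\partial w_1}$.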


The following animation created by [Tamas Szilagyi](http://tamaszilagyi.com/blog/2017/2017-11-11-animated_net/) shows a neural network model learning. The animation shows a feedforward neural network rather than a convolutional neural network, but the learning principle is the same. In this animation each line represents a weight. The number shown next to the line is the weight value.

```{figure} https://glassboxmedicine.files.wordpress.com/2020/08/learningweights-1.gif
:align: center
:width: 500px
```


## Back propagation in Max Pooling


To improve the performance of a CNN, a pooling layer is applied after the convolution layer, which helps the network generalize better and reduces overfitting. This is because, within each pooling window (pooling height x pooling width), we keep only one value, ignoring the other elements and suppressing noise.

### Forward propagation

Let's say we have a 4x4 grid and the following parameters for max pooling:

$$
\begin{align}
pooling\ height = 2\\
pooling\ width = 2\\
stride\ = 2\\
\end{align}
$$

Pooling is similar to convolution, but here we simply select the maximum element from each region. The following visualization will clarify:

```{figure} https://miro.medium.com/v2/resize:fit:932/format:webp/1*9kMkohwhU2SvtbMtc4vr_Q.gif
:align: center
```

The **output shape** after the pooling operation is obtained using the following formula:

$$
\begin{align}
Height\_out\ = \frac{\text{Height} - \text{Pool_Height}}{\text{Stride}} + 1\\
Width\_out\ = \frac{\text{Width} - \text{Pool_Width}}{\text{Stride}} + 1\\
\end{align}
$$
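A minimal NumPy sketch of the forward pass for max pooling is shown below. The function name `maxpool_forward` and the values in the 4x4 grid are illustrative assumptions; the pooling parameters are the ones listed above (2x2 window, stride 2).

```python
import numpy as np


def maxpool_forward(x, pool_h, pool_w, stride):
    """Naive max pooling over a 2D input."""
    out_h = (x.shape[0] - pool_h) // stride + 1
    out_w = (x.shape[1] - pool_w) // stride + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Region of the input covered by the current pooling window.
            region = x[i * stride:i * stride + pool_h,
                       j * stride:j * stride + pool_w]
            y[i, j] = region.max()
    return y


# Illustrative 4x4 grid (made-up values).
x = np.array([[1.0, 9.0, 2.0, 1.0],
              [4.0, 2.0, 0.0, 5.0],
              [7.0, 1.0, 3.0, 2.0],
              [0.0, 6.0, 8.0, 4.0]])

y = maxpool_forward(x, pool_h=2, pool_w=2, stride=2)
print(y.shape)
print(y)
```

Each entry of `y` is the largest value in the corresponding 2x2 region of `x`.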

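For the 4x4 grid with a 2x2 pooling window and a stride of 2, it is calculated as:

$$
\begin{align}
Height\_out\ = \frac{4 - 2}{2} + 1 = 2\\
Width\_out\ = \frac{4 - 2}{2} + 1 = 2\\
\end{align}
$$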



### Back Propagation
Compared to the convolution layer, we don't have to compute weight and bias derivatives, as a pooling operation has no parameters. Thus, the only derivative we need to compute is with respect to the input: $\frac{\partial Y}{\partial X}$


As we know, the derivative with respect to the input has the same shape as the input. Let's look at the first element of $\frac{\partial Y}{\partial X}$, which is $\frac{\partial Y}{\partial x_{11}}$.

```{figure} https://miro.medium.com/v2/resize:fit:932/format:webp/1*wzrnpOCydh7QcMIF_ZaO3w.png
:align: center
```

The derivative $\frac{\partial Y}{\partial x_{11}}$ is non-zero only if $x_{11}$ is the maximum element of the first pooling region. Assuming $x_{12}$ is the maximum element of that region, $y_{11} = x_{12}$, so $\frac{\partial y_{11}}{\partial x_{12}} = \frac{\partial x_{12}}{\partial x_{12}} = 1$, and the derivatives with respect to the other elements of the region are zero. With the incoming derivative $dy_{11}$, the gradients in this region are zero except for the one at $x_{12}$, which equals $dy_{11} \times 1 = dy_{11}$. Let's apply the same approach to the whole grid:

```{figure} https://miro.medium.com/v2/resize:fit:866/format:webp/1*kvJtYeTgNDrO85jvUdYHKw.gif
:align: center
```
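The routing of gradients through max pooling can be sketched as follows. This is an illustration under stated assumptions: `maxpool_backward` is a hypothetical helper, the grid values and incoming gradients are made up, and ties are resolved by sending the gradient to the first maximum found (the behaviour of `np.argmax`).

```python
import numpy as np


def maxpool_backward(x, dy, pool_h, pool_w, stride):
    """Send each incoming gradient to the position of the max in its region."""
    dx = np.zeros_like(x)
    out_h, out_w = dy.shape
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            region = x[r:r + pool_h, c:c + pool_w]
            # Row and column of the maximum inside the region.
            m, n = np.unravel_index(np.argmax(region), region.shape)
            dx[r + m, c + n] += dy[i, j]
    return dx


# Illustrative 4x4 grid; in the first region the maximum is x_12 (value 9).
x = np.array([[1.0, 9.0, 2.0, 1.0],
              [4.0, 2.0, 0.0, 5.0],
              [7.0, 1.0, 3.0, 2.0],
              [0.0, 6.0, 8.0, 4.0]])

# Made-up incoming gradients dy_11, dy_12, dy_21, dy_22.
dy = np.array([[10.0, 20.0],
               [30.0, 40.0]])

dx = maxpool_backward(x, dy, pool_h=2, pool_w=2, stride=2)
print(dx)
```

The result has the same shape as the input, and only the four positions that held a maximum receive a non-zero gradient.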

<span style="display:none" id="q_maxpool">WyAgewogICAgICAgICJxdWVzdGlvbiI6ICJXaGF0IGlzIHRoZSBwYXJhbWV0ZXIgb2YgcG9vbGluZyBvcGVyYXRpb24/IiwKICAgICAgICAidHlwZSI6ICJtYW55X2Nob2ljZSIsCiAgICAgICAgImFuc3dlcnMiOiBbCiAgICAgICAgICAgIHsKICAgICAgICAgICAgICAgICJhbnN3ZXIiOiAiV2VpZ2h0IiwKICAgICAgICAgICAgICAgICJjb3JyZWN0IjogZmFsc2UsCiAgICAgICAgICAgICAgICAiZmVlZGJhY2siOiAiVGhhdCBpcyB0aGUgcGFyYW1ldGVyIG9mIHRoZSBjb252b2x1dGlvbiBsYXllciIKICAgICAgICAgICAgfSwKICAgICAgICAgICAgewogICAgICAgICAgICAgICAgImFuc3dlciI6ICJJdCBkb2Vzbid0IGhhdmUgbGVhcm5hYmxlIHBhcmFtZXRlcnMiLAogICAgICAgICAgICAgICAgImNvcnJlY3QiOiB0cnVlLAogICAgICAgICAgICAgICAgImZlZWRiYWNrIjogIkNvcnJlY3QuIgogICAgICAgICAgICB9LAogICAgICAgICAgICB7CiAgICAgICAgICAgICAgICAiYW5zd2VyIjogIk51bWJlciBvZiBsYXllcnMiLAogICAgICAgICAgICAgICAgImNvcnJlY3QiOiBmYWxzZSwKICAgICAgICAgICAgICAgICJmZWVkYmFjayI6ICJJdCBpcyBub3QgYSBwYXJhbWV0ZXIgb2YgcG9vbGluZyBsYXllcnMiCiAgICAgICAgICAgIH0sCiAgICAgICAgICAgIHsKICAgICAgICAgICAgICAgICJhbnN3ZXIiOiAiQmlhcyIsCiAgICAgICAgICAgICAgICAiY29ycmVjdCI6IGZhbHNlLAogICAgICAgICAgICAgICAgImZlZWRiYWNrIjogIlRoYXQgaXMgdGhlIHBhcmFtZXRlciBvZiB0aGUgY29udm9sdXRpb24gbGF5ZXIiCiAgICAgICAgICAgIH0KICAgICAgICBdCiAgICB9Cl0=</span>

from jupyterquiz import display_quiz
display_quiz('#q_maxpool')
