# Convolutional Networks

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously successful in practical applications, and were able to achieve superhuman performance on some complex visual tasks.

## Convolutional Layers

The most important building block of a CNN is the convolutional layer. Neurons in the first convolutional layer are not connected to every single pixel in the input image (as they were in dense layers), but only to the pixels in a small region of it. That region in the input image is called the *local receptive field* of the hidden neuron: a little window on the input pixels. Each connection learns a weight, and the hidden neuron learns an overall bias as well.

<figure>
  <img src='images/local receptive field.png' style="background-color: white;">
  <figcaption>Local receptive field</figcaption>
</figure>

In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, then assemble them into larger higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.


### How conv layers work
 
A convolutional layer works as follows. We start at the top-left corner of the input and apply the convolution operation between the filter (also called the kernel) and the patch of the input it covers: the two are multiplied element-wise and the products are summed. The result is the value that reaches the neuron at the top-left corner of the output. Next, we slide the window (the local receptive field) one step to the right and repeat the computation to obtain the input to the second neuron. When we reach the end of a row, we slide the window down and start again from the left, and so on.

**1. Step one: start from top left**

<img src="images/conv p1.png" height="300px">

**2. Step two: slide to the right**

<img src="images/conv p2.png" height="300px">

**3. Continue:**

<img src="images/convolutions example.png">

Here is an example of the full process:

<img src="images/convolution.gif">

<video controls src="videos\conv_kiank.mp4" height='500px'></video>
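
The sliding-window procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single-channel input, stride 1, and no padding; the function name `conv2d` and the toy `image` and `kernel` are made up for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):        # move the window down one row at a time
        for j in range(out_w):    # move the window right one column at a time
            window = image[i:i + kh, j:j + kw]    # local receptive field
            out[i, j] = np.sum(window * kernel)   # multiply element-wise, then sum
    return out

# Hypothetical toy input: a 4x4 image holding 0..15 and a 2x2 filter of ones.
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
feature_map = conv2d(image, kernel)
print(feature_map.shape)   # (3, 3)
print(feature_map[0, 0])   # 0 + 1 + 4 + 5 = 10.0
```

The same `kernel` array is reused at every window position, which is why the number of weights does not depend on the size of the image.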

The filters are the trainable parameters of the convolutional layers, and they are shared by all the neurons. This is one of the most important features of convolutional networks, so let's look at what distinguishes them.

### Edge Detection

Consider the following picture (big matrix) and filter (small matrix). High values in the image matrix mean brighter colors, and low values mean darker colors. Multiplying the covered area by the filter weights and adding up the products gives a high value in a single convolution step when the values are large in the left part of the covered area and small in the right part. This filter is therefore able to detect vertical edges with bright pixels on the left and dark pixels on the right.

<img src="images/edge_detection.png">
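
To make this concrete, here is a small sketch with made-up numbers (not necessarily the ones in the figure): a 6x6 image whose left half is bright (10) and right half is dark (0), filtered with a 3x3 vertical edge detector. The helper `conv2d` is a hypothetical stride-1, no-padding implementation written for this example.

```python
import numpy as np

def conv2d(image, kernel):
    """Stride-1, no-padding sliding-window filter (illustrative helper)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Assumed example image: bright left half, dark right half.
image = np.zeros((6, 6))
image[:, :3] = 10.0

# Positive weights on the left, negative on the right:
# responds strongly to bright-to-dark transitions.
vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0]])

response = conv2d(image, vertical_edge)
print(response)  # each row is 0, 30, 30, 0: large only where the window straddles the edge

# Mirroring the image (dark left, bright right) flips the sign of the response.
flipped_response = conv2d(image[:, ::-1], vertical_edge)
```

The output is zero wherever the window covers a uniform region and large where it covers the edge, so the feature map acts as a map of where the vertical edge is.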

### Why are convolutional networks effective

**1. Parameter sharing:** 

CNNs utilize parameter sharing through convolutional filters, significantly reducing the number of parameters compared to fully connected networks. This makes CNNs more computationally efficient and reduces the risk of overfitting, especially when dealing with large input data like images. We are able to do that because a feature detector that is useful in one part of the image is probably also useful in another part of the image. In simple words, we use the same weights and bias for every hidden neuron in a feature map.

<img src="images/weights.png" height="200px">

<img src="images/shared_weights.png" height="200px">
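
A quick count shows how large the saving is. The sketch below compares a convolutional layer with a fully connected layer that produces an output of the same size. The layer sizes (a 32x32x3 input, six 5x5 filters, no padding, stride 1) are assumptions chosen for illustration, as are the helper names.

```python
def conv_params(f, c_in, c_out):
    """Each filter has f*f*c_in weights plus one bias; there are c_out filters."""
    return (f * f * c_in + 1) * c_out

def dense_params(n_in, n_out):
    """Each output unit has n_in weights plus one bias."""
    return (n_in + 1) * n_out

h = w = 32           # assumed input height and width
c_in, c_out = 3, 6   # assumed input channels and number of filters
f = 5                # assumed filter size

out_h = h - f + 1    # 28, with no padding and stride 1
out_w = w - f + 1    # 28

conv = conv_params(f, c_in, c_out)
dense = dense_params(h * w * c_in, out_h * out_w * c_out)

print(conv)    # 456
print(dense)   # 14455392
```

Under these assumptions the convolutional layer needs 456 parameters, while a dense layer mapping the same input to the same number of outputs needs roughly 14 million.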


**2. Locality:**

One of the most common types of structure in data is "locality" -- the most relevant information for understanding or predicting a pixel is a small number of pixels around it.

Locality is a fundamental feature of the physical world, so it shows up in data drawn from physical observations, like photographs and audio recordings.

Locality means most meaningful linear transformations of our input only have large weights in a small number of entries that are close to one another, rather than having equally large weights in all entries.

**3. Translation Equivariance:**

Another type of structure commonly observed is "translation equivariance" -- the top-left pixel position is not, in itself, meaningfully different from the bottom-right position or a position in the middle of the image. Relative relationships matter more than absolute relationships.

Translation equivariance arises in images because there is generally no privileged vantage point for taking the image. We could just as easily have taken the image while standing a few feet to the left or right, and all of its contents would shift along with our change in perspective.

Translation equivariance means that a linear transformation that is meaningful at one position in our input is likely to be meaningful at all other points. We can learn something about a linear transformation from a datapoint where it is useful in the bottom-left and then apply it to another datapoint where it's useful in the top-right.
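
The property can be checked numerically: shifting the input shifts the feature map by the same amount. The sketch below uses a random image and filter and a hypothetical `conv2d` helper (stride 1, no padding). `np.roll` wraps pixels around the border, so the comparison is restricted to the columns that are not affected by the wrap-around.

```python
import numpy as np

def conv2d(image, kernel):
    """Stride-1, no-padding sliding-window filter (illustrative helper)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.random((3, 3))

shift = 2
shifted_image = np.roll(image, shift, axis=1)   # move the content 2 pixels to the right

out = conv2d(image, kernel)
out_shifted = conv2d(shifted_image, kernel)

# Away from the wrapped border, the feature map has moved by the same 2 pixels.
equivariant = np.allclose(out_shifted[:, shift:], out[:, :-shift])
print(equivariant)  # True
```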

### Strides

Stride refers to the number of pixels by which the filter/kernel is shifted over the input image. When performing convolution, the filter slides over the input image with a certain step size determined by the stride. A stride of 1 means the filter moves one pixel at a time, while a stride of 2 means the filter moves two pixels at a time, and so on. Larger strides lead to smaller output feature maps because the filter covers fewer positions.

<figure>
  <img src="images/strides.png" height="300px">
  <figcaption>Convolution layer with strides=2</figcaption>
</figure>
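
As a rough sketch of the effect on the output size, the function below applies a filter with a configurable stride. The name `conv2d_strided` and the 7x7 toy input are assumptions made for this example; there is no padding.

```python
import numpy as np

def conv2d_strided(image, kernel, stride=1):
    """Sliding-window filter that moves `stride` pixels per step (no padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride   # window origin in the input
            out[i, j] = np.sum(image[top:top + kh, left:left + kw] * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))

out_s1 = conv2d_strided(image, kernel, stride=1)
out_s2 = conv2d_strided(image, kernel, stride=2)
print(out_s1.shape)  # (5, 5)
print(out_s2.shape)  # (3, 3)
```

With stride 2 the filter visits every other position, so `out_s2` equals `out_s1[::2, ::2]`.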

### Padding

Padding is the process of adding extra border pixels around the input image. This is often done before applying convolution to prevent the spatial dimensions of the feature maps from shrinking too much.

There are two types of padding in TensorFlow:

- Valid (No padding): In this case, no padding is added to the input image. As a result, the spatial dimensions of the feature maps decrease after convolution.

- Same (Zero padding): Here, zeros are added around the input in such a way that, for a stride of 1, the output feature maps have the same spatial dimensions as the input image. This preserves information at the borders of the image and keeps the feature maps from shrinking when multiple layers are stacked.

<figure>
<img src="images/zero_padding.png" height="500px">
</figure>
<figure>
  <img src="images/padding.png" height="500px">
  <figcaption>Convolution layer "same" padding</figcaption>
</figure>
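
The two modes can be compared with `np.pad`. This is a minimal sketch assuming stride 1 and an odd filter size `f`, in which case padding `p = (f - 1) / 2` pixels on each side keeps the output the same size as the input. The 5x5 input is made up for the example.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)
f = 3                  # assumed (odd) filter size
p = (f - 1) // 2       # padding needed on each side for "same" output at stride 1

# Add a border of p zeros around the image.
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

valid_size = image.shape[0] - f + 1    # no padding: the output shrinks
same_size = padded.shape[0] - f + 1    # zero padding: the output keeps the input size

print(padded.shape)  # (7, 7)
print(valid_size)    # 3
print(same_size)     # 5
```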


### Size of layers
The height and width of the output of layer $l$ are given by:

$$ n_{\text{H or W}}^{[l]} = \lfloor \frac{n_{\text{H or W}}^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1\rfloor$$

$$A^{[l]} \implies m \times n_{\text{H}}^{[l]} \times n_{\text{W}}^{[l]} \times n_c^{[l]}$$

Where:
- $m$ is the number of samples
- $f^{[l]}$ is the filter size
- $p^{[l]}$ is the padding
- $s^{[l]}$ is the stride
- $n_c^{[l]}$ is the number of filters (and therefore the number of output channels)

Each filter is $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$

The weights have size $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$
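
The formulas above translate directly into code. The sketch below is illustrative: the function names and the example layer (a batch of ten 32x32x3 images, six 5x5 filters) are assumptions, and Python's integer division `//` plays the role of the floor.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1 for one spatial dimension."""
    return (n + 2 * p - f) // s + 1

def conv_layer_shapes(m, n_h, n_w, n_c_prev, f, p, s, n_c):
    """Shapes of the activations, a single filter, and the full weight tensor."""
    out_h = conv_output_size(n_h, f, p, s)
    out_w = conv_output_size(n_w, f, p, s)
    return {
        "activations": (m, out_h, out_w, n_c),
        "filter": (f, f, n_c_prev),
        "weights": (f, f, n_c_prev, n_c),
    }

shapes = conv_layer_shapes(m=10, n_h=32, n_w=32, n_c_prev=3, f=5, p=0, s=1, n_c=6)
print(shapes["activations"])  # (10, 28, 28, 6)
print(shapes["weights"])      # (5, 5, 3, 6)

print(conv_output_size(7, f=3, p=0, s=2))  # 3
print(conv_output_size(6, f=3, p=1, s=2))  # 3, because the floor rounds 2.5 down to 2
```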


## Pooling

The goal of a pooling layer is to subsample (i.e., shrink) its input in order to reduce the computational load, the memory usage, and the number of parameters in the layers that follow (thereby limiting the risk of overfitting). The layer takes either the maximum or the average of the values in each window, and it has no weights to learn.

- **Max-Pooling (POOL):** The value in the output matrix corresponds to the maximum value in the covered area
- **Average-Pooling (AVG):** The value in the output matrix corresponds to the average value in the covered area

The output size of a pooling layer is:

$$ n_{\text{H or W}}^{[l]} = \lfloor \frac{n_{\text{H or W}}^{[l-1]} - f^{[l]}}{s^{[l]}} + 1\rfloor$$
$$A^{[l]} \implies m \times n_{\text{H}}^{[l]} \times n_{\text{W}}^{[l]} \times n_c^{[l]}$$

Pooling is applied to each channel independently, so $n_c^{[l]} = n_c^{[l-1]}$. Padding is rarely used with pooling. Here is an example of a max pooling layer:

<figure>
  <img src="images/max pooling.png" height="300px">
  <figcaption>Max pooling layer (2x2 pooling kernel, stride 2, no padding)</figcaption>
</figure>
<figure>
<img src="images/max_avg_pooling.png">
</figure>
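
Both pooling variants can be sketched with the same loop, differing only in the reduction applied to each window. The function `pool2d` and the 4x4 input below are made up for illustration; the example uses a 2x2 window with stride 2 and no padding on a single channel.

```python
import numpy as np

def pool2d(x, f=2, stride=2, mode="max"):
    """Max or average pooling over f x f windows of a single-channel input."""
    out_h = (x.shape[0] - f) // stride + 1
    out_w = (x.shape[1] - f) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            out[i, j] = reduce(x[top:top + f, left:left + f])
    return out

x = np.array([[1.0, 3.0, 2.0, 1.0],
              [4.0, 6.0, 5.0, 0.0],
              [7.0, 2.0, 9.0, 8.0],
              [1.0, 0.0, 3.0, 4.0]])

max_out = pool2d(x, mode="max")   # rows: 6, 5 and 7, 9
avg_out = pool2d(x, mode="avg")   # rows: 3.5, 2.0 and 2.5, 6.0
print(max_out)
print(avg_out)
```

Note that `pool2d` has no trainable arguments: the window size and stride are fixed hyperparameters.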


### What do CNNs learn?

The following is a visualization of what a CNN might learn:

<img src="images/net_full_layer_0.png" height="300px" />

<img src="images/visualizing_what_convnets_learn_17_1.png">

## Classical CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps), thanks to the convolutional layers. At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

<img src="images/typical cnn.png">

### LeNet-5

LeNet-5 was developed by Yann LeCun in 1998 and trained on the MNIST dataset, a collection of handwritten digits. This CNN is quite old and small compared to current CNNs (approximately 60k trainable parameters). It uses sigmoid or tanh as activation functions in the hidden layers. It does not use softmax as the classifier in the last layer, whereas today we probably would. We can further notice that it only uses valid convolutions (i.e., no padding), which results in the feature maps becoming smaller and smaller.

<img src="images/lenet-5.png">
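
The shrinking feature maps and the parameter count can be traced layer by layer. The sketch below assumes the simplified LeNet-5 variant that is commonly taught: a 32x32x1 input, two 5x5 valid convolutions with 6 and 16 filters, 2x2 pooling with stride 2 and no trainable parameters, and fully connected layers of 120, 84, and 10 units. The original network differs in some details (for example, its second convolutional layer is not fully connected to the first), so the total is an approximation.

```python
def conv_out(n, f, p=0, s=1):
    """Output size of a convolution or pooling window along one dimension."""
    return (n + 2 * p - f) // s + 1

# (kind, window size, number of filters); assumed simplified LeNet-5 layout.
layers = [("conv", 5, 6), ("pool", 2, None), ("conv", 5, 16), ("pool", 2, None)]

n, c = 32, 1      # input height/width and channels
total = 0
trace = []
for kind, f, n_filters in layers:
    if kind == "conv":
        total += (f * f * c + 1) * n_filters   # weights plus one bias per filter
        n, c = conv_out(n, f), n_filters       # valid convolution, stride 1
    else:
        n = conv_out(n, f, s=f)                # pooling: stride equals window size
    trace.append((kind, (n, n, c)))

units = n * n * c                              # flattened size: 5 * 5 * 16 = 400
for n_out in (120, 84, 10):                    # fully connected layers
    total += (units + 1) * n_out
    units = n_out

for kind, shape in trace:
    print(kind, shape)   # conv (28, 28, 6), pool (14, 14, 6), conv (10, 10, 16), pool (5, 5, 16)
print(total)             # 61706
```

Under these assumptions the total is 61,706 parameters, in line with the figure of roughly 60k, and most of them sit in the first fully connected layer.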

### AlexNet

AlexNet was proposed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012, and it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) that year by a significant margin. It takes color images as input, so its input layer has several channels. Its architecture is similar to LeNet-5, but it is much bigger (approximately 60M trainable parameters). It uses ReLU as the activation function in the hidden layers and softmax as the classifier in the last layer. Its performance was far better than that of LeNet-5, which inspired many researchers to adopt deep learning for computer vision.

<img src="images/alexnet.png">

### VGG-16

VGG-16 was developed by the Visual Geometry Group (VGG) at the University of Oxford in 2014. It is a 16-layer CNN with approximately 138M trainable parameters. All of its convolutional layers use "same" convolutions. The following image shows a simplified representation of the 16 layers:

<img src="images/vgg-16.png">
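
The parameter count can be reproduced with a short tally. The sketch below assumes the standard VGG-16 layout (3x3 "same" convolutions, 2x2 max pooling with stride 2, a 224x224x3 input, and three fully connected layers of 4096, 4096, and 1000 units); the list `cfg` is an illustrative encoding of that layout, with `"M"` marking a pooling layer.

```python
# Number of filters per convolutional layer; "M" marks a 2x2 max pooling layer.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

size, channels = 224, 3   # assumed input resolution and channels
total = 0
weight_layers = 0
for item in cfg:
    if item == "M":
        size //= 2                               # pooling halves height and width
    else:
        total += (3 * 3 * channels + 1) * item   # 3x3 filters plus one bias each
        channels = item                          # "same" padding keeps the spatial size
        weight_layers += 1

units = size * size * channels                   # 7 * 7 * 512 = 25088
for n_out in (4096, 4096, 1000):                 # fully connected layers
    total += (units + 1) * n_out
    units = n_out
    weight_layers += 1

print(weight_layers)  # 16
print(total)          # 138357544
```

Most of the parameters come from the first fully connected layer, not from the convolutional layers.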

### Residual Networks

Recent networks have become very deep. The CNN that Microsoft used to win the ImageNet competition in 2015 was 152 layers deep! As discussed earlier, deep networks usually suffer from vanishing (or, in rare cases, exploding) gradients, so the gradient for the earlier layers decreases to zero very rapidly as training proceeds:

<img src="images/vanishing_grad.png" height="200px">

In order to train such very deep CNNs without suffering from exploding/vanishing gradients, we use special building blocks known as residual blocks. A residual block consists of two conventional layers together with a shortcut (called a skip connection):

<img src="images/residual-block.png" height="200px">

This residual block uses the activation of the previous layer $a^{[l]}$ for the calculation of its activation in the second layer. This calculation can be seen in the next equation:

$$ a^{[l+2]}=g(W^{[l+2]} \cdot g(W^{[l+1]}a^{[l]}+b^{[l+1]})+b^{[l+2]}+a^{[l]}) $$

If the weights inside a residual block become very small because of vanishing gradients, the activation of the previous layer $a^{[l]}$ dominates the calculation of the activation of the second layer. Skip connections therefore make it easy for a residual block to learn the identity function when the weights become very small. This makes learning the optimal parameters much simpler. In the limit where the weights and biases are zero, what remains is:

$$ a^{[l+2]} = g(a^{[l]})$$

Since $a^{[l]}$ is itself the output of a **ReLU**, it is non-negative, so applying ReLU again leaves it unchanged and the output is:

$$ a^{[l+2]} = a^{[l]} $$
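
The identity behaviour can be verified with a small sketch of the equation above. It uses fully connected layers instead of convolutions to keep the code short, and the names `relu` and `residual_block` and the layer width `n` are assumptions made for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a, W1, b1, W2, b2):
    """a[l+2] = g(W2 @ g(W1 @ a[l] + b1) + b2 + a[l]) with g = ReLU."""
    a1 = relu(W1 @ a + b1)      # first layer of the block
    z2 = W2 @ a1 + b2           # second layer, before the activation
    return relu(z2 + a)         # add the skip connection, then activate

n = 4
rng = np.random.default_rng(0)
a = relu(rng.normal(size=n))    # a[l] comes out of a ReLU, so it is non-negative

# With all weights and biases at zero, only the skip connection contributes.
W_zero = np.zeros((n, n))
b_zero = np.zeros(n)
out = residual_block(a, W_zero, b_zero, W_zero, b_zero)

print(np.array_equal(out, a))   # True: the block reduces to the identity
```

Note that adding `a` to `z2` requires both to have the same shape, which is why the two layers in this sketch keep the width `n`.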

Kaiming He et al. won the ILSVRC 2015 challenge using **ResNet**. It achieved state-of-the-art performance on various image recognition tasks, including the ILSVRC and COCO challenges, and delivered a top-five error rate under 3.6% on ILSVRC. The architecture of ResNet looks like the following:

<img src="images/resnet simple.png" height="300px">

<img src="images/resnet.png" height="300px">

### Inception model

<table style="background-color: white; color: black;">
<tr>
  <th>Simple inception module</th>
  <th>Complex inception module</th>
</tr>
<tr>
  <td>
   <img src="images/inception-module-simple.png" />
  </td>
  <td>
   <img src="images/inception-module-complex.png"/>
  </td>
</tr>
</table>

<img src="images/inception-model.png" style="background-color: white;">
