## Convolution in Practice

Find out why convolutional and pooling layers are the building blocks of Convolutional Neural Networks.

We will be covering:

- Spatial dimensions
- Pooling layers

In real-life applications, most images are in fact 3D tensors with width, height, and 3 channels (R, G, B) as dimensions.

In that case, the kernel should also be a 3D tensor ($k \times k \times \text{channels}$). Remember that the sliding happens only across width and height: at each position, we take the dot product between the kernel and the input patch over all input channels, which yields a single value. Each kernel therefore produces one 2D feature map, that is, one output channel.
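To make the channel-wise dot product concrete, here is a minimal sketch that computes one output value by hand and compares it with `torch.nn.functional.conv2d`. The tensor names `image` and `kernel` and the 7x7 and 3x3 sizes are illustrative assumptions, not part of the lesson's own code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

image = torch.rand(3, 7, 7)    # (channels, height, width)
kernel = torch.rand(3, 3, 3)   # (channels, k, k): one kernel spanning all input channels

# Manual computation of the top-left output value: multiply the 3x3x3 input
# patch element-wise with the kernel and sum over channels, rows, and columns.
patch = image[:, 0:3, 0:3]
manual_value = (patch * kernel).sum()

# Reference: conv2d expects a batch dimension on the input and an
# output-channel dimension on the weight.
feature_map = F.conv2d(image.unsqueeze(0), kernel.unsqueeze(0))

print(feature_map.shape)  # torch.Size([1, 1, 5, 5]): a single 2D feature map
print(torch.allclose(manual_value, feature_map[0, 0, 0, 0], atol=1e-5))
```

Note that the three input channels are collapsed into one output channel; only the width and height survive as spatial dimensions.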

In practice, we tend to use more than 1 kernel in order to capture different kinds of features at the same time.

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig17.png)

As you may have guessed, our learnable weights are now the values of our filters and can be trained with backpropagation, as usual. We can add a bias into each filter as well.

Convolutional layers can be stacked on top of others. Since convolutions are linear operators, we include non-linear activation functions in between just as we did in fully connected layers.
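As a rough sketch of such stacking, the snippet below chains two convolutional layers with a ReLU after each one. The channel counts (3, 8, 16), the choice of ReLU, and the 32x32 input are arbitrary assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Two stacked convolutional layers with a non-linearity after each one.
# The channel counts (3 -> 8 -> 16) are illustrative choices.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3),
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3),
    nn.ReLU(),
)

x = torch.rand(1, 3, 32, 32)  # (batch, channels, height, width)
out = model(x)
print(out.shape)  # torch.Size([1, 16, 28, 28])
```

The output channels of the first layer must match the input channels of the second one.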

To recap, you have to think in terms of input channels, output channels, and kernel size. And that is exactly how we are going to define it in PyTorch.

To define a convolutional layer in PyTorch, we write:

```python
import torch.nn as nn

conv_layer = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)
print(conv_layer)
```

Output:

```
Conv2d(3, 5, kernel_size=(5, 5), stride=(1, 1))
```


The above layer receives an image with 3 channels (i.e., R, G, B) and outputs 5 feature maps (channels), each produced by its own kernel of size 5x5x3. For simplicity, we just say a 5x5 kernel.
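One way to see the 5x5x3 kernels and the per-filter biases is to inspect the layer's parameters. This is a small sketch using the standard `weight` and `bias` attributes of `nn.Conv2d`; the variable `num_params` is introduced here only for illustration.

```python
import torch.nn as nn

conv_layer = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)

# Weight layout: (out_channels, in_channels, kernel_height, kernel_width)
print(conv_layer.weight.shape)  # torch.Size([5, 3, 5, 5])
# One bias per output channel
print(conv_layer.bias.shape)    # torch.Size([5])

# Total learnable parameters: 5 * (3 * 5 * 5) weights + 5 biases
num_params = sum(p.numel() for p in conv_layer.parameters())
print(num_params)  # 380
```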

### Spatial dimensions

Because figuring out the dimensions of each tensor can quickly become complicated, let’s examine a simple example of a convolutional layer. Assume that we have a 7x7 image and a 3x3 filter. Sliding the filter one pixel at a time, it fits in 7 - 3 + 1 = 5 positions along each dimension, so the feature map will be of size 5x5.

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig18.PNG)

Sliding the kernel one pixel at a time is the most common approach, but it is not the only one.

Sometimes, we may want to slide/move the kernel every 2 pixels instead of every 1, thus introducing an extra hyperparameter called **stride**.

In some cases, we can also pad the image around the edges with zeros in order to control the output dimensions. The amount of zero-padding on the edges introduces another parameter in the mix called **zero-padding** or simply **padding**.

If we introduce a stride of 2 and zero-padding of 1, we obtain a feature map of size 4x4:

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig19.PNG)

To summarize, given an input of size $W_1 \times H_1 \times D_1$, a number of output channels $K$ with kernel size $F \times F$, stride $S$ and padding $P$, we acquire an output of size $W_2 \times H_2 \times D_2$ where:

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig20.PNG)
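Written out in the notation above, the standard output-size relations are as follows. This restatement assumes the same padding $P$ on every side; the floor handles strides that do not divide evenly, which matches how PyTorch's `Conv2d` computes its output shape.

$$
W_2 = \left\lfloor \frac{W_1 - F + 2P}{S} \right\rfloor + 1, \qquad
H_2 = \left\lfloor \frac{H_1 - F + 2P}{S} \right\rfloor + 1, \qquad
D_2 = K
$$

For the 7x7 input with $F = 3$, $S = 2$, and $P = 1$, this gives $(7 - 3 + 2 \cdot 1)/2 + 1 = 4$ along each spatial dimension.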

The stride-2, padding-1 example above can be validated in a few lines of PyTorch code.

In the input tensor, the leading 1 is the batch size. You can ignore it for now.

```python
import torch
import torch.nn as nn

# Input shape: (batch, channels, height, width)
input_img = torch.rand(1, 3, 7, 7)
layer = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=2, padding=1)
out = layer(input_img)
print(out.shape)
```

Output:

```
torch.Size([1, 6, 4, 4])
```
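The same check can be done on paper. The sketch below wraps the usual output-size rule, floor((W - F + 2P) / S) + 1, in a hypothetical helper called `conv_output_size` (not a PyTorch function) and compares its prediction with what `nn.Conv2d` actually returns.

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension: floor((W - F + 2P) / S) + 1."""
    return (size - kernel_size + 2 * padding) // stride + 1

# 7x7 input, 3x3 kernel, stride 1, no padding
print(conv_output_size(7, kernel_size=3))  # 5

# 7x7 input, 3x3 kernel, stride 2, padding 1
predicted = conv_output_size(7, kernel_size=3, stride=2, padding=1)
layer = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=3, stride=2, padding=1)
actual = layer(torch.rand(1, 3, 7, 7)).shape[-1]
print(predicted, actual)  # 4 4
```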


### Pooling layer

Many popular CNN architectures utilize another type of layer besides the convolutional layer, known as the pooling layer. Pooling layers can be thought of as a way to **downsample** the features without having any learnable parameters. In other words, pooling layers have no weights of their own to train, although gradients still flow through them during backpropagation.

These layers work similarly to convolutional layers in that we apply a function to a chunk of the input and produce a single scalar. The most common variant is known as max-pooling and works as shown below:

![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig21.PNG)

From each 2x2 chunk of our image, we keep only the largest element in our feature map, resulting in a tensor with half the width and half the height of our initial input. In other words, we collapse each non-overlapping chunk into a single value.
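A small worked example may help here. The 4x4 values below are made up for illustration and are not taken from the figure.

```python
import torch
import torch.nn as nn

# A 4x4 single-channel input, reshaped to (batch, channels, height, width)
x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

# Non-overlapping 2x2 windows: kernel_size equals stride
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(x)
print(pooled)
# tensor([[[[6., 8.],
#           [3., 4.]]]])
```

Each output value is the maximum of one 2x2 window, for instance 6 for the top-left window containing 1, 3, 5, and 6.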

One reason that we may introduce pooling is that it adds invariance to minor spatial changes. For example, two inputs that differ by a small translation can result in the same pooling map.
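The following sketch illustrates this with a toy input. The helper `single_pixel` is hypothetical and exists only for this example. It also shows the limit of the invariance: a shift that crosses a pooling-window boundary does change the output.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

def single_pixel(row, col):
    """Hypothetical helper: a 1x1x4x4 image of zeros with one bright pixel."""
    img = torch.zeros(1, 1, 4, 4)
    img[0, 0, row, col] = 1.0
    return img

a = pool(single_pixel(0, 0))
b = pool(single_pixel(1, 1))  # shifted by one pixel, still in the same 2x2 window
c = pool(single_pixel(2, 2))  # shifted further, into a different window

print(torch.equal(a, b))  # True
print(torch.equal(a, c))  # False
```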

Another reason is that we want to gradually reduce the resolution of the input as we perform the forward pass. That’s because the deeper layers should have a larger receptive field, meaning that each of their units should depend on a larger and larger region of the image.
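To get a feel for how downsampling enlarges the receptive field, here is a minimal sketch based on the commonly used recurrence for stacked layers. The function `receptive_field` and the two layer lists are assumptions made for illustration, and padding is ignored.

```python
def receptive_field(layers):
    """Receptive field of one output unit, in input pixels.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Three 3x3 convolutions with stride 1
conv_only = [(3, 1), (3, 1), (3, 1)]
# The same convolutions with a 2x2, stride-2 pooling layer after the first two
with_pooling = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]

print(receptive_field(conv_only))     # 7
print(receptive_field(with_pooling))  # 18
```

With the same three convolutions, inserting two pooling layers grows the receptive field from 7 to 18 input pixels in this toy setup.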

After all, our ultimate goal is to classify if an image contains a cat or a dog and not detect the corners.

Finally, pooling makes the learned features more abstract.

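```python
import torch
import torch.nn as nn

# Input shape: (batch, channels, height, width)
input_img = torch.rand(1, 3, 8, 8)
# 2x2 windows with stride 2 halve the height and width; channels are unchanged
layer = nn.MaxPool2d(kernel_size=2, stride=2)
out = layer(input_img)
print(out.shape)
```

Output:

```
torch.Size([1, 3, 4, 4])
```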
