# CNN architecture

## Building blocks of CNN architectures

The basic components of a convolutional neural network (CNN) are convolutional layers and downsampling operations. In this notebook we will downsample using max pooling.

### Convolutions

The convolutional operation works similarly to what you have previously seen for hand engineered localised feature descriptors (week 8) such as the Gray level co-occurrence matrix and Sobel, Laplacian and Gaussian filters.

These work by translating a hand-engineered filter kernel across an image. At each position, each element of the filter is multiplied with the image element it overlaps, and the products are summed to give one output pixel. This operation is shown below for the vertical Sobel edge filter, at the first filter location:

<img src="https://drive.google.com/uc?id=1gb2QaE8lW6GgNFwCEk5mJqP0Uo91xxn8" alt="Drawing" style="width: 500px;"/>

And at the second location (translated one step to the right):

<img src="https://drive.google.com/uc?id=1gcsv7870kjTCwnUV043k2u4iM2uydzBn" alt="Drawing" style="width: 500px;"/>

This continues until the filter has been applied at all possible locations in the image. The final output of the convolution is another image:

<img src="https://drive.google.com/uc?id=1geRXevPNs_sSMeKWyEnEEKFszz3ptqcV" alt="Drawing" style="width: 500px;"/>

This will be slightly smaller than the input image by default, since it corresponds to all locations in the input image on which the filter can be centered. This excludes the outer rows and columns of the image.

PyTorch can, however, return an image of the same size if you pad the input. Note that `torch.nn.Conv2d` uses `padding=0` by default, so padding has to be requested explicitly:

<img src="https://drive.google.com/uc?id=1ggS0vucaH_mM9X1aBTRX4lSEDnZlmyZg" alt="Drawing" style="width: 500px;"/>
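The effect of padding on the output size can be checked on a toy example. The sketch below is only an illustration: the 6x6 image and the variable names are made up for this purpose, and it uses `torch.nn.functional.conv2d`, which multiplies the kernel with the overlapping pixels without flipping it, as in the figures above.

```python
import torch
import torch.nn.functional as F

# Vertical-edge Sobel kernel, shaped (out_channels, in_channels, kH, kW)
sobel = torch.tensor([[-1.0, 0.0, 1.0],
                      [-2.0, 0.0, 2.0],
                      [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

# Toy 6x6 image: dark left half (0), bright right half (1), shaped (N, C, H, W)
image = torch.zeros(1, 1, 6, 6)
image[..., 3:] = 1.0

# No padding: the filter only visits positions where it fits entirely
valid = F.conv2d(image, sobel)
# One pixel of zero padding on every side keeps the spatial size
same = F.conv2d(image, sobel, padding=1)

print(valid.shape)  # torch.Size([1, 1, 4, 4])
print(same.shape)   # torch.Size([1, 1, 6, 6])
print(valid[0, 0])  # every row is [0., 4., 4., 0.]: response only at the edge
```

Without padding, a 3x3 filter loses one row and one column on each side, so 6x6 becomes 4x4. With `padding=1` the output stays 6x6.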

**The key difference between CNNs and traditional filters is that CNNs learn the best filters for a specific feature recognition problem, whereas traditional feature detectors are hand engineered**

### Max pooling

The next most important component of a CNN is downsampling. Downsampling allows a CNN to increase its _receptive field_ :

<img src="https://drive.google.com/uc?id=1gV1ZqsV5JGkWys5H0bDcnw9pgR-EYtvf" alt="Drawing" style="width: 500px;"/>

<img src="https://drive.google.com/uc?id=1gYKvI9ULXFhdyHSGp4XoUmyMFmCijbsI" alt="Drawing" style="width: 500px;"/>

As you saw in the previous section, the filters themselves are small. The only way the network can 'see' the full context of the image is by aggregating across layers and by downsampling at regular intervals. In this way, as you go deeper through the network, the filters learn more and more complex textures at larger scales, until they can recognise whole objects.
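As a minimal sketch of what max pooling does, the example below applies a 2x2 pooling window with stride 2 to a made-up 4x4 input (the values 0 to 15 are chosen only so the result is easy to read).

```python
import torch

# 4x4 single-channel "image" holding the values 0..15, shaped (N, C, H, W)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# 2x2 max pooling with stride 2: keep the largest value in each 2x2 block
pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
y = pool(x)

print(y.shape)  # torch.Size([1, 1, 2, 2])
print(y[0, 0])  # tensor([[ 5.,  7.], [13., 15.]])
```

Each output pixel now summarises a 2x2 region of the input, so a 3x3 filter applied after this pooling step covers a 6x6 region of the original image.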


## Coding CNNs in PyTorch

You already saw last week how to set up a basic fully connected network, choose a cost function and write a training loop. All this stays the same for CNNs. All you need in addition are convolutional and max pooling layers.

In PyTorch, the 2D convolution class is defined within the `torch.nn` module as follows:

<img src="https://drive.google.com/uc?id=12hvQSk-kCsPWTnEE16KKPkA0zo1R3Wzc" alt="Drawing" style="width: 800px;"/>

The max pooling layer in PyTorch is [```nn.MaxPool2d```](https://pytorch.org/docs/stable/nn.html?highlight=maxpool#torch.nn.MaxPool2d).

Inputs of 2D convolutional layers must
have a shape $N\times C\times H\times W$, where $N$ is the number of images
in a batch, $C$ is the number of channels, $H$ is the image height and
$W$ is the image width. The code below creates a random image and passes it through a convolutional layer. Run the code.

In [None]:
import torch

# create a random input image
input_image = torch.randint(0, 255, (1, 1, 64, 64)).float()
print('Input size: ',input_image.shape)

# create a convolutional layer
conv = torch.nn.Conv2d(1,8,5,padding=2)

# pass the random image through the convolutional layer
output = conv(input_image)
print('Output size: ',output.shape)
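Unlike the Sobel filter, the kernels of a `Conv2d` layer are learnable parameters. As a small sketch, the snippet below creates a layer with the same arguments as above and inspects what is stored inside it. The name `n_params` is introduced here only for illustration.

```python
import torch

conv = torch.nn.Conv2d(1, 8, 5, padding=2)

# One 5x5 filter per (output channel, input channel) pair
print(conv.weight.shape)          # torch.Size([8, 1, 5, 5])
# One bias per output channel
print(conv.bias.shape)            # torch.Size([8])
# The filters are parameters that the optimiser updates during training
print(conv.weight.requires_grad)  # True

# Total number of learnable values: 8*1*5*5 weights + 8 biases
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                   # 208
```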

__Activity 1:__ Answer the following questions:
* What are the batch size and the number of channels of `input_image`?
* How many channels does the convolutional layer output?
* What happens if you change the `padding` parameter?

__Answer:__


__Activity 2:__ Implement a convolutional layer as follows:
* Create a random image with spatial dimensions $100\times 100$ and 3 channels;
* Implement a convolutional layer that outputs 5 channels and has a
kernel size of $3\times 3$. Pass the image through it and print out the dimensions
of the result;
* Change the convolutional layer so that its output has spatial dimensions $100\times 100$


In [None]:
# create a random input image
input_image2 = None
print('Input size: ',input_image2.shape)

# create a convolutional layer
conv2 = None

# pass the random image through the convolutional layer
output2 = None
print('Output size: ',output2.shape)

__Activity 3:__ Change the stride of the convolutional layer, so that the output has spatial dimensions $20\times 20$

In [None]:
# create a convolutional layer
conv3 = None

# pass the random image through the convolutional layer
output3 = None
print('Output size: ',output3.shape)

__Activity 4:__ Instead of changing the stride, implement a max-pooling operation to reduce the dimension of the output of the convolutional layer with stride 1 to $20\times 20$.

In [None]:
# create a convolutional layer
conv4 = None

# pass the random image through the convolutional layer
output4 = None
print('Output size: ',output4.shape)

# max pooling
maxpool = None
downsampled = None
print(downsampled.shape)

## Exercise

You are given code that implements this CNN architecture:

<img src="images/CNN.png" alt="Drawing" style="width: 500px;"/>

The `CNNModel` has four blocks: two convolutional blocks followed by two linear blocks. Each block is implemented using `nn.Sequential`. The convolutional blocks consist of a convolutional layer, a ReLU activation and a pooling layer. The linear blocks consist of a linear layer and a ReLU activation.

Run the code and study the size of the input and the outputs of each block. Note that the shape of the output of each block needs to match the input of the following block. After the second convolutional block, we also need to reshape the output into a vector using `view` so that it can be fed to a linear layer.

In [None]:
import torch.nn as nn

# CNN architecture
class CNNModel(nn.Module):
    def __init__(self):
        super(CNNModel, self).__init__()

        self.conv_block1 = nn.Sequential(
            nn.Conv2d(1,8,5,padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2,stride=2))
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(8,16,5,padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2,stride=2))
        self.fc_block1 = nn.Sequential(
            nn.Linear(16*16*16, 128),
            nn.ReLU())
        self.fc_block2 = nn.Sequential(
            nn.Linear(128,10),
            nn.ReLU())

    def forward(self, x):
        x = self.conv_block1(x)
        print('Output 1: ', x.shape)
        x = self.conv_block2(x)
        print('Output 2: ', x.shape)
        x = x.view(-1, 16*16*16)
        x = self.fc_block1(x)
        print('Output 3: ', x.shape)
        x = self.fc_block2(x)
        print('Output 4: ', x.shape)

        return x
    
# input image
input_image = torch.randint(0, 255, (1, 1, 64, 64)).float()
print('Input: ', input_image.shape)
# create CNN model
net = CNNModel()
# predict output for input_image
o = net(input_image)
# shape is as expected
print('Final output: ', o.shape)
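The value `16*16*16` used for the first linear layer follows from the shapes produced by the two convolutional blocks. The sketch below traces them using the output-size formula given in the PyTorch documentation for `Conv2d` and `MaxPool2d` (with the default rounding down). The helper `out_size` and the names `blocks`, `features` and `flat_features` are hypothetical and not part of the exercise code.

```python
import torch
import torch.nn as nn

def out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one spatial dimension (hypothetical helper)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

size = 64
size = out_size(size, kernel_size=5, padding=2)   # conv in block 1 -> 64
size = out_size(size, kernel_size=2, stride=2)    # pool in block 1 -> 32
size = out_size(size, kernel_size=5, padding=2)   # conv in block 2 -> 32
size = out_size(size, kernel_size=2, stride=2)    # pool in block 2 -> 16
flat_features = 16 * size * size                  # 16 channels * 16 * 16
print(size, flat_features)                        # 16 4096

# Cross-check against real layers with the same settings as the two conv blocks
blocks = nn.Sequential(
    nn.Conv2d(1, 8, 5, padding=2), nn.ReLU(), nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(8, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(kernel_size=2, stride=2))
features = blocks(torch.zeros(1, 1, 64, 64))
print(features.shape)             # torch.Size([1, 16, 16, 16])
print(features.flatten(1).shape)  # torch.Size([1, 4096])
```

If the number of channels or the input size changes, the number of input features of the first linear layer has to be recomputed in the same way.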

Modify the `CNNModel` by changing each of the following:
* the name of the model to `CNNModel2`;
* the number of output channels of the first and second convolutional
block to 4 and 6 respectively;
* the number of outputs of the first and second fully connected block
to 32 and 2, respectively

Create an instance `net2` of the new model. Perform a forward pass with `input_image`. Check that you have 2 outputs at the end.

In [None]:
import torch.nn as nn

# CNN architecture
class CNNModel2(nn.Module):
    def __init__(self):
        super(CNNModel2, self).__init__()

        self.conv_block1 = nn.Sequential(
            nn.Conv2d(1,None,5,padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2,stride=2))
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(None,None,5,padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2,stride=2))
        self.fc_block1 = nn.Sequential(
            nn.Linear(None, None),
            nn.ReLU())
        self.fc_block2 = nn.Sequential(
            nn.Linear(None,None),
            nn.ReLU())

    def forward(self, x):
        x = self.conv_block1(x)
        print('Output 1: ', x.shape)
        x = self.conv_block2(x)
        print('Output 2: ', x.shape)
        x = x.view(-1, None)
        x = self.fc_block1(x)
        print('Output 3: ', x.shape)
        x = self.fc_block2(x)
        print('Output 4: ', x.shape)

        return x
    
# input image
input_image = torch.randint(0, 255, (1, 1, 64, 64)).float()
print('Input: ', input_image.shape)
# create CNN model
net2 = CNNModel2()
# predict output for input_image
o = net2(input_image)
# shape is as expected
print('Final output: ', o.shape)