# **Applied AI for Health Research**

# Practical 3: Convolutional Neural Networks

Tutorial by Emma Robinson. Edited by Mariana da Silva.

## **Why Convolutions?**

In this notebook we will motivate the development and design of convolutional neural networks (CNNs).

The design of CNNs arose from a need to overcome the computational constraints met when upscaling deep learning to the processing of high dimensional images, but took significant inspiration from biological vision networks. Representations are learnt over many convolutional layers, where early layers can be seen to act as edge detectors and higher layers detect more complex textures or whole objects.

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/convnetwork.png?raw=true" alt="Drawing"  width="800px;"/>
</figure>

This has strong advantages for image recognition, object localisation and segmentation tasks as it allows images to be compared without any requirement for spatial normalisation or image registration. In other words, there is no assumption that corresponding image pixels, at the same relative locations in the image, represent the same content.



### **The convolutional operation**

So how do convolutional operations support comparisons between images? Let's first look at an example of the convolutional operation applied with a hand-engineered filter kernel known as a Sobel filter (designed to detect edges).

Here we apply it to a small part of a 2D slice from a brain scan, at a point where we know there is a sharp change in image intensity. The numbers in the central grid reflect the intensity on a scale from 0 to 255.

The convolutional operation then results from translating the convolutional kernel across the image and performing elementwise multiplication and sum at each location. The output of the operation is assigned to the pixel location at the centre of the filter:

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/sobelconv2.png?raw=true" alt="Drawing"  width="800px;"/>
</figure>
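
As a minimal sketch of this operation, the snippet below applies a horizontal-gradient Sobel kernel to a small synthetic patch using `torch.nn.functional.conv2d`. The patch values are made up for illustration (they are not the intensities from the figure). Note also that PyTorch's `conv2d` does not flip the kernel, so strictly speaking it computes a cross-correlation, which is the usual convention in deep learning.

```python
import torch
import torch.nn.functional as F

# Illustrative 5x5 patch with a sharp vertical edge (made-up intensities)
patch = torch.tensor([[10., 10., 10., 200., 200.]] * 5)

# Sobel kernel that responds to horizontal changes in intensity
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

# conv2d expects input (N, C, H, W) and weight (out_channels, in_channels, kH, kW)
response = F.conv2d(patch.view(1, 1, 5, 5), sobel_x.view(1, 1, 3, 3))

print(response.shape)   # torch.Size([1, 1, 3, 3]): 2 fewer rows and columns
print(response[0, 0])   # 0 in the flat region, 760 wherever the window spans the edge
```

The response is zero where the window covers a flat region and large where it straddles the edge, which is what makes the kernel an edge detector.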

Notice that the result of the full convolutional operation fills a grid with 2 fewer rows and 2 fewer columns than the original, corresponding to the number of positions at which the kernel fits entirely inside the image. This can be corrected for using padding:

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/padding.png?raw=true" alt="Drawing"  width="800px;"/>
</figure>


A related concept is strides, which allow the convolutional operation to skip over locations in the image, with the result that the output is downsampled (e.g. for an input dimension $h = 7$, kernel size $f = 3$ and a stride $s = 2$):

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/strides.png?raw=true" alt="Drawing"  width="800px;"/>
</figure>


The dimensions of the output following strided and/or padded convolutions can be determined from the following formula, where $p$ is the padding added to each side of the input:

$O = \lfloor (h+2p -f)/s \rfloor +1$.
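
The formula can be checked against PyTorch directly. The sketch below uses a hypothetical helper, `conv_output_size` (not part of PyTorch), and compares it with the shape returned by `nn.Conv2d` for the strided example above ($h = 7$, $f = 3$, $s = 2$, no padding).

```python
import torch
import torch.nn as nn

def conv_output_size(h, f, s=1, p=0):
    """Output size along one dimension: floor((h + 2p - f) / s) + 1."""
    return (h + 2 * p - f) // s + 1

print(conv_output_size(h=7, f=3, s=2))  # 3

# Same configuration applied to a dummy single-channel 7x7 input
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=2)
out = conv(torch.zeros(1, 1, 7, 7))
print(out.shape)  # torch.Size([1, 1, 3, 3])
```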




## **Convolutional Neural Networks**

Images may be compared using hand-engineered feature detectors by comparing the responses of images to these features. The advantage of convolutional neural networks over traditional approaches is that, rather than relying on hand-designed techniques to detect features in images, CNNs learn bespoke filter kernels that detect them. The CNN filters correspond to the weights of the network; these are optimised through minimisation of a loss with respect to an image classification, regression or segmentation task, for example.


### **The building blocks of Convolutional Neural Networks (CNNs)**

The essential components of a CNN are the convolutional layers, downsampling operations (performed through pooling or striding) and the activations (which support learning of non-linear interactions).



### **Convolutional layers**

We just looked at what is meant by the convolutional operation, but how is this implemented within deep networks and how does it relate to parameter (weights) learning during optimisation?

In contrast to fully connected layers, CNNs do not employ full connectivity between each incoming feature and each neuron in the layer. Rather each neuron in a CNN has a very restricted field of view, constrained to the dimensions of some local filter kernel fit at each location. Let's give this filter dimensions $f \times f \times d_0$, where $f$ represents the height and width of the kernel and $d_0$ represents the channel depth.

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/convolutionalfiltersfigure.png?raw=true" alt="Drawing"  width="800px;"/>
</figure>


The receptive field (or scale $f$) of the filters varies (although these are typically odd numbers, e.g. $3 \times 3$, $5 \times 5$, $7 \times 7$); however, the depth must always equal the depth of the incoming data array, e.g. for the input layer this might be 3 channels for an RGB image, or 1 for greyscale.

All points in the image are constrained to learn the same filter weights as their neighbours (otherwise known as parameter sharing). This operation therefore reduces to learning a set of convolutional filters, which operate on the image to return activation maps. Training CNNs in this way significantly reduces the number of parameters which need to be learnt (relative to a comparable fully connected network). It also allows CNNs to take advantage of the hierarchical, multi-scale properties of images, in a similar way to biological networks.
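
To give a feel for the saving, the sketch below counts the parameters of a small convolutional layer and compares them with a fully connected layer mapping the same input to the same number of outputs. The sizes (a $3 \times 32 \times 32$ input and 8 filters of size $3 \times 3$) are assumptions chosen purely for illustration.

```python
import torch.nn as nn

# Convolutional layer: 8 filters, each of size 3 x 3 x 3, plus one bias per filter
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())

# A fully connected layer producing the same 8 x 30 x 30 output from the
# flattened 3 x 32 x 32 input would need one weight per input-output pair
in_features = 3 * 32 * 32
out_features = 8 * 30 * 30
dense_params = in_features * out_features + out_features

print(conv_params)   # 224
print(dense_params)  # 22125600
```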


### **Downsampling**

An important feature of convolutional networks is that they use downsampling to increase the receptive field of filter kernels so as to learn to recognise objects over a hierarchy of scales. There are two different mechanisms that CNNs use for downsampling: pooling and striding. 

**Pooling** works by fitting a pooling kernel to typically non-overlapping patches of an image and then applying simple min, max or averaging operations to aggregate/filter those values.

**Strides** on the other hand work by applying a convolutional operation whilst skipping out certain locations in the image, for example every alternate kernel centre. This offers the advantage of effectively learning the downsampling operation, but at the cost of learning more parameters.
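
The trade-off can be seen in a small sketch. Here an average pool and a strided convolution both halve the spatial size of a dummy activation block; the tensor size, channel counts and the helper `n_params` are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 4, 8, 8)  # dummy activations: 4 channels, 8 x 8

# Pooling: a fixed aggregation with nothing to learn (stride defaults to the kernel size)
pool = nn.AvgPool2d(kernel_size=2)

# Strided convolution: a learnt downsampling
strided = nn.Conv2d(4, 4, kernel_size=3, stride=2, padding=1)

def n_params(module):
    """Hypothetical helper: total number of learnable parameters."""
    return sum(p.numel() for p in module.parameters())

print(pool(x).shape, n_params(pool))        # torch.Size([1, 4, 4, 4]) 0
print(strided(x).shape, n_params(strided))  # torch.Size([1, 4, 4, 4]) 148
```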


### **Activations**

Convolutional networks require activation layers in order to learn non-linear mappings of the data. It is typical to implement activations (within the body of the CNN) as ReLU functions. 





### **Optional elements**

In addition to the key components of convolutional networks, several other operations have been introduced to regularise, speed up training and/or improve the efficiency or generalisability of the network:

**$1\times 1$ convolutions**

The motivation behind $1\times 1$ convolutions is to support compression or upsampling of the channel dimension of an activation block. This can be useful when the goal is to perform some parameter-heavy operation: the channels are first compressed, the operation is applied, and the data can then be expanded back to its previous depth. As always, the depth of the kernel must be equal to the depth of the incoming activation tensor.
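
As a small sketch, the snippet below compresses a 64-channel activation block to 16 channels and then expands it back, leaving the spatial dimensions untouched. The channel counts and spatial size are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # dummy activation block

squeeze = nn.Conv2d(64, 16, kernel_size=1)  # compress the channel dimension
expand = nn.Conv2d(16, 64, kernel_size=1)   # expand it back again

z = squeeze(x)
y = expand(z)

print(z.shape)               # torch.Size([1, 16, 32, 32])
print(y.shape)               # torch.Size([1, 64, 32, 32])
print(squeeze.weight.shape)  # torch.Size([16, 64, 1, 1]): kernel depth equals input depth
```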

**Batch Normalisation (batchnorm)**

As we will see in the session on optimisers, deep networks are typically trained with variants of stochastic gradient descent. This samples batches from the data sets and estimates average loss for each batch rather than estimating it across all examples. This can lead to noisy gradient updates, since the composition of each batch is subject to change. [Batch-norm](https://arxiv.org/abs/1502.03167) seeks to address this by normalising and rescaling the activations of each batch, at every layer throughout the network. In doing so it can considerably speed up training.
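
The normalising effect can be seen in a short sketch: in training mode, `nn.BatchNorm2d` standardises each channel using the statistics of the current batch. The input here is random data with an arbitrary mean and scale, chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 10, 10) * 5 + 2  # batch of 8, 3 channels, mean ~2, std ~5

bn = nn.BatchNorm2d(num_features=3)
bn.train()  # use batch statistics (the behaviour during training)
y = bn(x)

# Per-channel statistics over the batch and spatial dimensions
print(y.mean(dim=(0, 2, 3)))                 # approximately 0 for each channel
print(y.var(dim=(0, 2, 3), unbiased=False))  # approximately 1 for each channel
```

In evaluation mode (`bn.eval()`) the layer instead uses running estimates accumulated during training, which is why setting the network mode correctly matters.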

**Dropout regularisation**

Finally, dropout is a technique which can be used to perform network regularisation. It works by randomly dropping activations during training, by applying a randomised masking operation on the output of each layer. In doing so, the approach stops individual network components from memorising the inputs, something which would lead to overfitting. There is a trade-off however, as dropping out too many activations will prevent the network from learning well enough and will cause underfitting.
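
A minimal sketch of the masking behaviour, using a vector of ones as a stand-in for a layer's activations: in training mode `nn.Dropout` zeroes entries at random and rescales the survivors by $1/(1-p)$, while in evaluation mode it leaves the input unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)  # stand-in activations

drop.train()
y_train = drop(x)
print(y_train.unique())  # only 0 (dropped) and 2 (kept, rescaled by 1 / (1 - p))

drop.eval()
y_eval = drop(x)
print(torch.equal(y_eval, x))  # True: dropout is switched off at test time
```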


In what order should these be placed? It has been common convention to place the batchnorm before the ReLU with the pool (or downsample) at the end.

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/convorder1.png?raw=true" alt="Drawing"  width="600px;"/>
</figure>


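The order of the ReLU and max pool is actually unimportant: ReLU is an elementwise, monotonically non-decreasing function, so it commutes with taking the maximum over a patch (the same is not true of average pooling).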
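
This commutation can be checked numerically. The sketch below applies ReLU and max pooling in both orders to a random tensor, and then shows with a hand-picked $2 \times 2$ patch that the two orders differ if average pooling is used instead; all values are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 6, 6)

# ReLU and max pooling give the same result in either order
a = F.relu(F.max_pool2d(x, kernel_size=2))
b = F.max_pool2d(F.relu(x), kernel_size=2)
print(torch.equal(a, b))  # True

# With average pooling the order matters
patch = torch.tensor([[[[-1., 3.], [-2., 2.]]]])
print(F.relu(F.avg_pool2d(patch, 2)).item())  # 0.5
print(F.avg_pool2d(F.relu(patch), 2).item())  # 1.25
```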

More frequently nowadays pooling operations are replaced with strided convolutions (for example as seen in all variants of the ResNet):

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/convorder2.png?raw=true" alt="Drawing"  width="600px;"/>
</figure>

And in certain recent forum posts it has been suggested that greater performance is achieved through switching the position of the batchnorm and ReLU:

<figure>
<img src="https://github.com/IS-pillar-3/miscellaneous/blob/main/convorder3.png?raw=true" alt="Drawing"  width="600px;"/>
</figure>


Dropout for regularisation has largely fallen out of favour as certain [sources](https://arxiv.org/abs/1801.05134) suggest its operations clash with batch normalisation. 


---


## Exercise 1 - Implementing Convolutional Operations in PyTorch

### Ex 1.1 - Implementing a 2D convolution

A 2D convolution class is defined within the `torch.nn` module as follows:

<img src="https://drive.google.com/uc?id=12hvQSk-kCsPWTnEE16KKPkA0zo1R3Wzc" alt="Drawing" width="700px;"/>

Let's look at implementing a 2D Convolution with stride = 1, kernel size = 3x3 and 2 output channels, applied to a random 'image' of shape $3 \times 100 \times 100$ (noting here channels are specified first, following the required PyTorch convention). 

In [None]:
import torch
import torch.nn as nn # importing torch.nn 

# the first dimension has size N where N is the number of images. 
# here it is simply 1 
input_image = torch.randint(0, 255, (1, 3,100,100)) # our random image. 

# Ex 1.1. Implement 2d convolution with in_channels = 3,out_channels = 2, kernel_size = 3
operation = None

print(operation) # we can see our convolution operation by printing it

First try the following operation - observe the ```RuntimeError```

In [None]:
result = operation(input_image)

The operation fails because the convolution weights are floating point whereas the input is an integer tensor. Let us convert the input into a float tensor first:

In [None]:
input_image = input_image.to(torch.float)
result = operation(input_image)
print(result.shape)


Observe the shape of the ```result``` with respect to the shape of the original image. We see that the number of channels reduces from 3 to 2 as required, but we lose one pixel around each edge of the 2D image.

In [None]:
print(result.shape,input_image.shape)

We can correct this using padding, as:

In [None]:
operation = nn.Conv2d(in_channels = 3,out_channels = 2, kernel_size = 3, padding=1)
input_image = input_image.to(torch.float)
result = operation(input_image)
print(result.shape)

The result is the output of a 2D convolution between our image and some randomly initialised kernel. What if we wanted to inspect that kernel? We can use:

In [None]:
for name, param in operation.named_parameters(): # for each named parameter
    print(name, param.data.shape)

Now we can see that our convolution weight tensor is of shape [2,3,3,3] (two $3 \times 3 \times 3$ convolutional filters) and has a bias of shape [2].





### Ex 1.2 - Max pooling

The maxpool function in PyTorch is [```nn.MaxPool2d```](https://pytorch.org/docs/stable/nn.html?highlight=maxpool#torch.nn.MaxPool2d). As we have seen in our lectures the max pool operation downsamples an image by selecting the maximum intensity of an image patch to represent the whole patch.

Generate a random integer array to represent 5 images which have 3 channels and are of size (100 x 100). Perform a 2D maxpool on the images using PyTorch. Your max pooling operations should have:

**Task 1.2.1.** filter size 3x3, stride = 1 x 1

**Task 1.2.2.** filter size 4 x 2, stride = 2 x 2

**Hint** check the docs (linked above)


In [None]:
#STUDENT CODE HERE (replace Nones)

random_ims = torch.randint(0, 255, (5,3,100,100)).to(torch.float)

#1.2.1 implement max pool with filter size 3x3 and stride 1x1
maxpoolop = None
print(maxpoolop)
r = maxpoolop(random_ims)
print(r.shape)

#1.2.2 implement max pool with filter size 4x2 and stride 2x2
maxpoolop = None
print(maxpoolop)
r = maxpoolop(random_ims)
print(r.shape)


## Exercise 2: MNIST classification using a simple convolutional network

We will next implement a convolutional network for MNIST classification and compare it against the MLP that we created in the last practical.

First we must download the MNIST dataset from torchvision and generate the DataLoaders:

In [None]:
import torchvision
import numpy as np
from torchvision import datasets, models, transforms

# load datasets from torchvision - set test/train - convert to tensors
mnist_train_dataset = datasets.MNIST(root = 'mnist_data/train', download= True, train = True, transform = transforms.ToTensor())
mnist_test_dataset = datasets.MNIST(root = 'mnist_data/test', download= True, train = False, transform = transforms.ToTensor())

# pass these to the DataLoader class to create instances for each of test and train
# batch size is now smaller (8)
train_loader = torch.utils.data.DataLoader(
       mnist_train_dataset, batch_size= 8, shuffle = True)

test_loader = torch.utils.data.DataLoader(
       mnist_test_dataset, batch_size = 8, shuffle = True)

# class labels for plotting function
classes = ('0', '1', '2', '3',
          '4', '5', '6', '7', '8', '9')

In [None]:
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
    
print(device) 

### Ex 2.1 - Create a CNN network class 

We wish to create a CNN with 2 convolutional layers, using max pooling to downsample and implementing ReLU non-linearities.

Following the two convolutional layers we implement two fully connected layers in order to compress the features to 10 output neurons (one for each class). Here, the first layer has 50 neurons, each of which will be connected to all activations from the final convolutional layer. The second fully connected layer connects the 50 neurons of the penultimate layer to 10 output neurons, followed by a softmax to return the probabilities of the training example belonging to each class (in practice `nn.CrossEntropyLoss` applies the softmax internally, so the network itself can output raw scores).



---



**Task 2.1.1. Edit the number of input and output channels of the convolutional layers**

1. MNIST is grayscale; thus how many input channels do you think this will be?
2. The first convolutional layer has a kernel size of $5 \times 5 $ and learns 10 output channels (in other words the neurons learn 10 $5 \times 5$ image filters).
3. The second layer has a kernel size of $5 \times 5 $ and learns 20 output channels. 



---



**Task 2.1.2. Define MaxPool2d**
Using the syntax shown above define a maxpool2D operation with filter size $2 \times 2$ and stride of $2$



---



**Task 2.1.3. The first fully connected layer has 50 neurons and the second has 10 neurons; set the number of input and output features for these layers.**

The dimensions of the first linear layer are trickier to work out. Each neuron in this layer must connect to each activation output by the second convolutional block. We see that the network applies $5\times 5$ kernels with no padding and downsamples through pooling twice. This results in the following reductions in spatial dimensions:
- conv1 reduces from $28 \times 28$ to $24 \times 24$
- maxpool1 downsamples from $24 \times 24$ to $12 \times 12$
- conv2 reduces from $12 \times 12$ to $8 \times 8$
- maxpool2 downsamples from $8 \times 8$ to $4 \times 4$
- Thus the activations of the final convolutional block have spatial dimensions $4 \times 4$; how many output features is this in total?

Note, if you are having problems working this out you can always print the shapes of your tensors while debugging!
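
One way to do this, sketched below, is to push a dummy tensor through the layers one at a time and print the shape after each step. The helper `trace_shapes` is hypothetical, and the toy stack operates on a 3-channel $32 \times 32$ input with made-up channel counts; it is deliberately not the MNIST network from this exercise.

```python
import torch
import torch.nn as nn

def trace_shapes(layers, x):
    """Hypothetical helper: apply each layer in turn, printing and collecting output shapes."""
    shapes = []
    for layer in layers:
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")
    return shapes

# Toy stack with illustrative sizes
layers = [
    nn.Conv2d(3, 6, kernel_size=5),
    nn.MaxPool2d(2),
    nn.Conv2d(6, 16, kernel_size=5),
    nn.MaxPool2d(2),
    nn.Flatten(),  # flattens everything except the batch dimension
]

shapes = trace_shapes(layers, torch.zeros(1, 3, 32, 32))
```

The size of the last dimension after flattening is the number of input features the first linear layer would need for this toy stack.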



---



**Task 2.1.4.** Implement the forward function to contain:

**a) the first convolutional block** Here the first layer is implemented for you. See that it references the layer instantiated in the constructor by name (`self.conv1`), the argument is the input data `x`, and the output is also called `x`. Now implement the max pool and ReLU (using the `nn.functional` form). Don't forget that, each time, the output of each operation of the forward pass (here `x`) becomes the input of the next layer (also `x`). If you get stuck go back to look at the MLP from the last lecture, which applies the same basic structure, just with a different combination of layers

**b) the second convolutional block** Here repeat the process but for the second convolutional layer

**c) the linear layers** we implement the flattening for you. Please implement the 2 linear layers with *one* relu activation only (between them). 

In [None]:
import torch.nn as nn
import torch.nn.functional as F


class MNIST_Model(nn.Module):
    def __init__(self):
        super(MNIST_Model, self).__init__()
        # replace Nones with correct code #
        # Task 2.1.1. a) edit the number of input and output channels of the convolutional layers
        # MNIST is grayscale; thus how many input channels do you think this will be? We want to learn 10 filters, kernel size 5x5
        self.conv1=nn.Conv2d(None, None, kernel_size=5)
        
        # Task 2.1.1. b) edit the number of input and output channels of the convolutional layers
        # the previous layer learnt 10 kernels; this one shall learn 20, kernel size 5x5
        self.conv2=nn.Conv2d(None, None, kernel_size=5)
        
        # Task 2.1.3. create the linear layers with the correct numbers of input and output features
        self.fc1=None
        self.fc2=None

        # Task 2.1.2. definition of maxpool
        self.maxpool=None
        
    
    def forward(self, x):
        # Task 2.1.4. construct forward function
        # a) implement first convolutional block: conv1 -> maxpool -> relu
        # We implement the first conv layer for you
        x = self.conv1(x)
        x = None
        x = None
        # b) implement second convolutional block: conv2 -> maxpool -> relu
        x = None
        x = None
        x = None
        # c) implement linear layers: linear ->  relu  -> linear
        x = x.view(x.size(0),-1)
        x = None
        x = None
        x = None

        return x

    
net = MNIST_Model() 
print(net)
net = net.to(device)

### Loss function and optimizer

We again need to define our loss and optimizers. In this case, since we're doing classification, we use **CrossEntropy Loss**, a commonly used loss function for classification.


In [None]:
import torch.optim as optim

loss_fun = nn.CrossEntropyLoss()
loss_fun = loss_fun.to(device)
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
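
One detail worth knowing is that `nn.CrossEntropyLoss` expects raw, unnormalised scores (logits) and applies a log-softmax internally before computing the negative log-likelihood. The sketch below checks this equivalence on a couple of made-up score vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative raw scores for 2 examples and 3 classes, with their true classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

ce = nn.CrossEntropyLoss()(logits, targets)

# Equivalent two-step computation: log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, manual))  # True
```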

### Ex 2.2 - Training 

We will now train our classifier model. In the training code, the outer loop allows for iterating over epochs (with each epoch defining a pass through all the data), while the inner loop iterates over batches. In the first instance we will leave the number of epochs as 1 but if you are training on GPU you might choose to increase it.

Implement the following training steps:

**2.2.1.** Set the network mode to train

**2.2.2.** Load data and labels to device

**2.2.3.** Clear the gradient

**2.2.4.** Feed the input and acquire the output from network

**2.2.5.** Calculate the loss from predicted class and real label

**2.2.6.** Compute the gradient and backpropagate

**2.2.7.** Update the parameters (calling the update step on the optimizer)

**Hint** - You can follow the example from the MLP tasks from the Session 2 Practical.

In [None]:
epochs = 1
for epoch in range(epochs): 
    # 2.2.1. set mode to train
    None
    # enumerate can be used to output iteration index i, as well as the data 
    for i, (data, labels) in enumerate(train_loader, 0):
        # replace Nones with correct code #

        # 2.2.2 load data and labels to device
        data = None
        labels = None
        
        # 2.2.3: clear the gradient
        None

        # 2.2.4 feed the input and acquire the output from network
        outputs = None

        # 2.2.5 calculating the loss
        loss = None

        # 2.2.6 compute the gradient on the loss tensor
        None

        # 2.2.7 update the parameters (calling the update on the optimiser object)
        None
        

        # print statistics
        ce_loss = loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.3f' %
                 (epoch + 1, i + 1, ce_loss))


### Ex 2.3 - Testing

We will now use our trained network to make a prediction on our test set. Run the below code to obtain your test accuracy:

In [None]:
#make an iterator from test_loader
test_iterator = iter(test_loader)
#Get a batch of testing images
images, labels = next(test_iterator)
images = images.to(device)
labels = labels.to(device)

In [None]:
import matplotlib.pyplot as plt

# set as test
net.eval()
#forward pass
y_score = net(images)
# get predicted class from the class probabilities
_, y_pred = torch.max(y_score, 1)

print('Predicted: ', ' '.join('%5s' % classes[y_pred[j]] for j in range(8)))
rows = 2
columns = 4
# plot y_score - true label (t) vs predicted label (p)
fig2 = plt.figure()
for i in range(8):
    fig2.add_subplot(rows, columns, i+1)
    plt.title('t: ' + classes[labels[i].cpu()] + ' p: ' + classes[y_pred[i].cpu()])
    img = images[i]     # ToTensor already scales pixels to [0, 1], so no un-normalisation is needed
    img = torchvision.transforms.ToPILImage()(img.cpu())
    plt.axis('off')
    plt.imshow(img)
plt.show()


**Computing classification scores**

We will now use the predictions to compute the accuracy, f1 score, precision and recall. These are scores commonly used to evaluate classification, in particular the f1 score is a good measure for datasets with imbalanced classes.

You can use sklearn classification metrics to calculate the scores; you will need to input the true labels and the predicted classes.

See https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics for more details.

In [None]:
# first convert tensors to numpy
y_true = labels.data.cpu().numpy()
y_pred = y_pred.data.cpu().numpy()

In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='macro')
precision = precision_score(y_true, y_pred, average='macro')
recall = recall_score(y_true, y_pred, average='macro')
print('accuracy:', accuracy, ', f1 score:', f1, ', precision:', precision, ', recall:', recall)
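
To make the `average='macro'` option concrete, here is a small sketch that computes per-class precision, recall and F1 by hand and then takes their unweighted mean. The labels are made up for illustration, and the helper `macro_scores` is hypothetical (it is not part of sklearn); it assumes every class appears at least once in both the true and predicted labels, so no division by zero occurs.

```python
import torch

def macro_scores(y_true, y_pred, num_classes):
    """Hypothetical helper: unweighted mean of per-class precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = ((y_pred == c) & (y_true == c)).sum().item()  # true positives
        fp = ((y_pred == c) & (y_true != c)).sum().item()  # false positives
        fn = ((y_pred != c) & (y_true == c)).sum().item()  # false negatives
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r))
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

toy_true = torch.tensor([0, 0, 1, 1, 2, 2])
toy_pred = torch.tensor([0, 1, 1, 1, 2, 0])

precision_m, recall_m, f1_m = macro_scores(toy_true, toy_pred, num_classes=3)
print(f"{precision_m:.3f} {recall_m:.3f} {f1_m:.3f}")  # 0.722 0.667 0.656
```

Because each class contributes equally to the average regardless of how many examples it has, macro-averaged scores are sensitive to poor performance on rare classes.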

## Ex 3 - Extras (Optional) 

### Sequential layers

It should be clear that as networks become more and more complicated, the forward function can quickly become long and cluttered. For these reasons PyTorch provides functionality to combine steps by stacking Modules in blocks using `nn.Sequential`.

In [None]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

        # first block: conv -> batchnorm -> relu
        self.conv_block1 = nn.Sequential(
            nn.Conv2d(in_channels=3,
                      out_channels=64, kernel_size=5),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )

        # second block: conv -> batchnorm -> relu
        self.conv_block2 = nn.Sequential(
            nn.Conv2d(in_channels=64,
                      out_channels=128, kernel_size=5),
            nn.BatchNorm2d(128),
            nn.ReLU()
        )

        # illustrative size only: in_features must match the flattened output of conv_block2
        self.fc1 = nn.Linear(320, 10)

    def forward(self, x):
        # each block is called like a single layer
        x = self.conv_block1(x)
        x = self.conv_block2(x)
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        return x

net = Model()


Sequential blocks are advantageous as they keep both the model definition and the forward function compact. The one limitation, however, is that the outputs of the intermediate steps stacked inside are not directly returned. If these are required, an alternative approach can be to use a `ModuleList` or `ModuleDict`. For more on `nn.Sequential`, `nn.ModuleList` and `nn.ModuleDict`, please read https://github.com/FrancescoSaverioZuppichini/Pytorch-how-and-when-to-use-Module-Sequential-ModuleList-and-ModuleDict
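
As a minimal sketch of the `nn.ModuleList` alternative, the hypothetical model below keeps its layers in a list and loops over them in the forward function, collecting every intermediate output along the way. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class ListModel(nn.Module):
    """Hypothetical model that stores its layers in an nn.ModuleList."""

    def __init__(self):
        super().__init__()
        # ModuleList registers the layers so their parameters are tracked
        self.layers = nn.ModuleList([
            nn.Linear(8, 16),
            nn.ReLU(),
            nn.Linear(16, 4),
        ])

    def forward(self, x):
        intermediates = []
        for layer in self.layers:
            x = layer(x)
            intermediates.append(x)  # keep the output of every step
        return x, intermediates

out, feats = ListModel()(torch.zeros(2, 8))
print([tuple(f.shape) for f in feats])  # [(2, 16), (2, 16), (2, 4)]
```

Unlike a plain Python list, an `nn.ModuleList` makes its layers visible to `net.parameters()` and `net.to(device)`, so they are optimised and moved to the GPU as usual.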



---





**Ex 3.1.** Try reducing the lines of code in the convolutional neural network class by grouping lines of code into `nn.Sequential` blocks. 
   - An example of how this can be done for linear layers is:
```
self.lin_blocks = nn.Sequential(
            nn.Linear(320, 50),
            nn.ReLU(),
            nn.Linear(50, 10),
            
        )
```
  - try swapping out the linear layers of the network for this sequential block
  - then create your own sequential convolutional blocks
  
**Ex 3.2.** Try adding dropout and batchnorm to your convolutional blocks

**Ex 3.3.** Try removing/changing/adding layers to see how it impacts performance



In [None]:
# STUDENTS CODE HERE
# Copy and paste the network from above but this time swap out the conv->maxpool->relu operations (and linear layers) for sequential blocks

class MNIST_Sequential_Model(nn.Module):
    def __init__(self):
        super(MNIST_Sequential_Model, self).__init__()
        # insert first sequential conv block
        
        # insert second sequential conv block
        
        
        # linear sequential block
        self.lin_blocks = nn.Sequential(
            nn.Linear(320, 50),
            nn.ReLU(),
            nn.Linear(50, 10),
        )

        # (optional) dropout
        self.dropout=nn.Dropout2d() # could also go in sequential blocks of course
    
    def forward(self, x):
        # implement  with sequential blocks 
      
        return F.log_softmax(x,dim=1)

    
net_seq = MNIST_Sequential_Model() 
print(net_seq)
net_seq = net_seq.to(device)

In [None]:
# STUDENTS CODE HERE
# regenerate optimiser for this new network

optimizer = None

# Now copy and paste the training loop and run it for the new network
# Don't forget to change the call to the correct network object!

epochs = 1
for epoch in range(epochs): 

    # enumerate can be used to output iteration index i, as well as the data 
    for i, (data, labels) in enumerate(train_loader, 0):
        # STUDENTS CODE - insert code for training loop #

        
        # print statistics
        ce_loss = loss.item()
        if i % 100 == 0:
            print('[%d, %5d] loss: %.3f' %
                 (epoch + 1, i + 1, ce_loss))


In [None]:
# Test performance 

# keeping test batch constant for comparison i.e. using image and labels from above

# STUDENT CODE - get prediction by implementing forward pass through network for test data
y_score = None
# get predicted class from the class probabilities
_, y_pred = torch.max(y_score, 1)

print('Predicted: ', ' '.join('%5s' % classes[y_pred[j]] for j in range(8)))
rows = 2
columns = 4
# plot y_score - true label (t) vs predicted label (p)
fig2 = plt.figure()
for i in range(8):
    fig2.add_subplot(rows, columns, i+1)
    plt.title('t: ' + classes[labels[i].cpu()] + ' p: ' + classes[y_pred[i].cpu()])
    img = images[i]     # ToTensor already scales pixels to [0, 1], so no un-normalisation is needed
    img = torchvision.transforms.ToPILImage()(img.cpu())
    plt.axis('off')
    plt.imshow(img)
plt.show()
