<font size=25, color='#ED1F24'>Laboratory 5

<font size=25, color='#ED1F24'>Convolutional Neural Networks (CNNs)

**Summary:**


*   learn about convolutions 
*   learn to build Convolutional Neural Networks (CNNs)
*   train CNNs
*   use pretrained models





**Motivation**

Fully Connected Layer 

<div>
<img src=https://drive.google.com/uc?id=1xaImbI9cZfaMPjc1UfG8OXymqYAl3aCN width="600"/>
</div>

---
Images
<div>
<img src=https://drive.google.com/uc?id=1MZ64Ay5PIozenvyIzN1TveNEj-ATqnLd width="600"/>
</div>

---

In order to use a fully connected layer for processing an image, we need to flatten the image ($⇒$ get a vector). The input size of our layer will be equal to the number of pixels contained in the image, multiplied by the number of channels. So, if we are working with RGB images, at a resolution of ($width$, $height$), the input size would be $3 * width * height$.

After image flattening, the input would look something like this:
<div>
<img src=https://drive.google.com/uc?id=1I5nv1Ly6EUAAZbdntYxId2o7CKD78SCa width="1000"/>
</div>

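This approach is undesirable: the layer needs one weight for every (input value, output unit) pair, and the spatial arrangement of the pixels is lost. During this laboratory, we will explore an alternative approach.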
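
To get a feeling for the numbers, the sketch below computes the flattened input size and the parameter count of a single fully connected layer. The $224 \times 224$ resolution and the 500 output units are illustrative assumptions, and both helper names are hypothetical.

```python
def flattened_size(width, height, channels=3):
    # number of values obtained after flattening a (channels, height, width) image
    return channels * width * height

def linear_layer_params(in_features, out_features, bias=True):
    # one weight per (input, output) pair, plus one bias per output unit
    return in_features * out_features + (out_features if bias else 0)

in_features = flattened_size(224, 224)              # RGB image, 224 x 224 (assumed)
fc_params = linear_layer_params(in_features, 500)   # 500 output units (assumed)
print(in_features, fc_params)
```

Even this single layer already holds tens of millions of parameters, which is the first drawback listed for fully connected networks.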

---



**Question 1** 
If we stack multiple fully connected layers of the form $y=Wx$, can we model complex functions?

**Question 2** Are there any limitations on the functions that we can model using a fully connected network (composed of multiple fully connected layers, with non-linear activations)?

Main drawbacks of Fully Connected Networks:

*   many parameters $⇒$ require a lot of training examples
*   each pixel is treated differently $⇒$ it has to learn redundant features
*   small image translations would induce different outputs

Insights for processing images (useful biases):

*   close neighbors of one pixel are more relevant than distant ones
*   an observed pattern should have the same meaning regardless of its position in the image
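
The second bias can be illustrated on a 1D toy signal: sliding the same small kernel over the input means that shifting the pattern simply shifts the response. This is a minimal pure-Python sketch; `correlate_valid` and the toy values are hypothetical and not part of the laboratory code.

```python
def correlate_valid(signal, kernel):
    # slide the kernel over the signal (no padding) and take the weighted sum at each position
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

toy_signal = [0, 0, 1, 2, 1, 0, 0, 0]
toy_shifted = [0] + toy_signal[:-1]      # same pattern, moved one step to the right
toy_kernel = [1, 0, -1]

out = correlate_valid(toy_signal, toy_kernel)
out_shifted = correlate_valid(toy_shifted, toy_kernel)
print(out)
print(out_shifted)
```

Away from the borders, the second response is the first one moved by one position, because the same weights are reused at every location.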







Convolutional Neural Networks (CNNs) exploit the above-mentioned biases and tackle the main drawbacks of Fully Connected Networks. 

<div>
<img src=https://drive.google.com/uc?id=1PtO6mNFtm71ftrn2ea0huyvEk9PpmXSd width="750"/>
</div>

*Image Source: [FloydHub](https://blog.floydhub.com/building-your-first-convnet/)*


CNNs can be employed for multiple modalities:

*   computer vision 
*   speech recognition and speech synthesis 
*   natural language processing 
*   protein/DNA binding problem 
*   any problem with a spatial (or sequential) structure



# <font color='ED1F24'>Part I: Convolutions 




Convolution is the mathematical way of combining two signals to form a third signal. 

In the current laboratory, we will discuss convolutions in the context of image processing. 

Convolution in image processing = the process of transforming an image by applying a kernel over each pixel and its local neighbors across the entire image.

The most common types of convolutions are:

*   1D convolutions (e.g. temporal data)
*   2D convolutions (e.g. images with height and width)
*   3D convolutions (e.g. videos with height, width and time)

Informal: How to think about xD convolutions? We are working over data with x main dimensions along which we expect a neighbourhood to have a specific meaning. Each xD point can additionally have multiple features (channels).

E.g.:
  - 1D convolutions: assume we track chocolate sales, collecting data at regular time intervals on the price, the marketing spend and whether it is a weekend or not; we expect a certain temporal consistency between neighboring data samples 
  - 2D convolutions: images are 3D tensors (channels, height, width), where each pixel has 3 color channels (RGB features); we expect a certain spatial consistency between neighboring pixels 
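
In PyTorch terms, the two examples correspond to inputs of shape `(N, C, L)` for `nn.Conv1d` and `(N, C, H, W)` for `nn.Conv2d`. The sketch below only checks shapes; the sequence length, image size, number of filters and kernel sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# 1D case: 1 sequence, 3 features per time step (price, marketing spend, weekend flag), 30 time steps
sales = torch.randn(1, 3, 30)
conv1d = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=5)
sales_out = conv1d(sales)
print(sales_out.shape)    # 8 feature maps, shorter sequence

# 2D case: 1 RGB image of size 32 x 32
rgb = torch.randn(1, 3, 32, 32)
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
rgb_out = conv2d(rgb)
print(rgb_out.shape)      # 8 feature maps, slightly smaller image
```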


In [None]:
import torch 
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from tqdm import tqdm
from typing import Iterator, List, Callable, Tuple
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/gdrive')
torch.manual_seed(115)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 2D convolutions over single-channel data 




We will start working over single-channel images (grayscale images):
<div>
<img src=https://drive.google.com/uc?id=19lH10oZetZ-VFOAJUso4FR4l9Q7GPzPJ width="600"/>
</div>

**Convolution** 
Over a $4\times 4$ image, with a $3\times 3$ kernel

<div>
<img src=https://drive.google.com/uc?id=1uiSMfwIsmR72m6j7PGsknzemp9S1GGFc width="200"/>
</div>

*Image Source: [GitHub](https://github.com/vdumoulin/conv_arithmetic)*

<div>
<img src=https://drive.google.com/uc?id=1w2V0Ew6V6_WExaQop02G_YAQ0g4ZiifW width="500"/>
</div>


In [None]:
# initialize a single channel image of size 4 x 4
image = torch.arange(0,16).reshape((4,4))
# initialize a 3 x 3 kernel
kernel = torch.tensor([[1,0,-1],[0,1,0],[-1, 0, 1]]) 

print('The image:')
print(image)
print('The kernel:')
print(kernel)

In [None]:
# TODO - complete the function implementing the above convolution
def convolution(image, kernel):
  in_height, in_width = image.shape
  kernel_height, kernel_width = kernel.shape
  out_height = ...
  out_width = ...
  result = torch.zeros((out_height, out_width))
  for i in range(out_height):
    for j in range(out_width):
      image_crop = ...
      result[i,j] = ...
  return result

res = convolution(image, kernel)
print('Result:')
print(res) 

We can see that the implemented convolution reduces the image size. If we wish to maintain the original size, we can add a border around the initial image. 

**Padding** - prevents shrinking by adding an image border before convolution
<div>
<img src=https://drive.google.com/uc?id=19OqGj6PpoBT50HQ8q0mKhhESaz8QGk-- width="200"/>
</div>

*Image Source: [GitHub](https://github.com/vdumoulin/conv_arithmetic)*
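
A hand-written convolution can be checked against a reference. The sketch below uses `torch.nn.functional.conv2d`, which slides the kernel without flipping it, on the same $4\times 4$ image and $3\times 3$ kernel, with and without zero padding. The `ref_` names are hypothetical, and the check is only a suggestion, not part of the exercise.

```python
import torch
import torch.nn.functional as F

ref_image = torch.arange(0, 16, dtype=torch.float32).reshape(4, 4)
ref_kernel = torch.tensor([[1., 0., -1.], [0., 1., 0.], [-1., 0., 1.]])

# F.conv2d expects (N, C, H, W) inputs and (C_out, C_in, kH, kW) weights
inp = ref_image[None, None]
weight = ref_kernel[None, None]

valid = F.conv2d(inp, weight)[0, 0]              # no padding: the output shrinks
padded = F.conv2d(inp, weight, padding=1)[0, 0]  # zero border of 1 pixel: size is preserved
print(valid)
print(padded.shape)
```

The interior of the padded result coincides with the unpadded one; only the border values depend on the added zeros.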





In [None]:
# TODO - complete the function implementing the above convolution with padding 
# the image will be zero padded
def convolution_with_padding(image, kernel, padding):
  in_height, in_width = image.shape
  kernel_height, kernel_width = kernel.shape
  out_height = ... 
  out_width = ...
  result = torch.zeros((out_height, out_width))
  # add border 
  image = ...
  for i in range(out_height):
    for j in range(out_width):
      image_crop = ...
      result[i,j] = ...
  return result

padding = 1
res = convolution_with_padding(image, kernel, padding)
print('Result:')
print(res) 

### Edge detection



Edge detection looks for sharp changes in image brightness, which are likely to correspond to discontinuities in depth, discontinuities in surface orientation, illumination variations or texture changes ([more about edges](https://en.wikipedia.org/wiki/Edge_detection)). 

We will implement a simple edge detector, using the [Sobel operator](https://en.wikipedia.org/wiki/Sobel_operator).

The operator uses two $3\times3$ kernels which are convolved with the original image. The result approximates the image derivatives $⇒$ brightness changes along the horizontal and vertical axes. Further, using these maps we can obtain the magnitude of the gradient and estimate the edge intensity in each pixel.

**Estimate vertical edges**

In order to estimate the vertical edges, we need to look for brightness changes along the horizontal dimension. To achieve this, we can use the following Sobel filter:

\begin{bmatrix}
  -1 & 0 & 1\\ 
  -2 & 0 & 2 \\
  -1 & 0 & 1
\end{bmatrix}

The filter is convolved over the image, resulting in a map that contains in each pixel the approximation of the image derivative along the horizontal dimension. 
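
To see what this filter responds to, the sketch below applies it to a tiny synthetic image containing a single vertical step (dark left half, bright right half). It uses `torch.nn.functional.conv2d` as a stand-in for the hand-written convolution; the image and the variable names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

# 5 x 6 image: columns 0-2 are black (0), columns 3-5 are white (1)
step = torch.zeros(5, 6)
step[:, 3:] = 1.0

sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]])

# no padding, so the 5 x 6 image yields a 3 x 4 response map
response = F.conv2d(step[None, None], sobel_x[None, None])[0, 0]
print(response)
```

The response is zero in the flat regions and positive only in the columns whose window straddles the step, i.e. where brightness changes along the horizontal dimension.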



In [None]:
# TODO - define a function that computes the image derivatives along the horizontal axis
def sobel_vertical_edges(image):
  # initialize the kernel/filter
  kernel_x = torch.tensor(...)
  # apply the kernel over the image 
  edges_x = ...
  return edges_x

# TODO - replace with path to the chessboard.png image 
synthetic_image_path = '/content/gdrive/MyDrive/course_fmi_2022/Lab 5/chessboard.png'
transf = transforms.ToTensor()
image = transf(Image.open(synthetic_image_path).convert(mode='L'))[0]
edges_x = sobel_vertical_edges(image)

print('Original Image')
plt.imshow(image, cmap='gray')
plt.show()
print('Vertical Edges')
plt.imshow(abs(edges_x), cmap='gray')

**Estimate horizontal edges**
In order to estimate the horizontal edges, we need to define a second filter to be convolved over the image such that the convolution result approximates the image derivatives along the vertical dimension. 


In [None]:
# TODO - define a function that computes the image derivatives along the vertical axis
def sobel_horizontal_edges(image):
  ...
  return edges_y

edges_y = sobel_horizontal_edges(image)

print('Original Image')
plt.imshow(image, cmap='gray')
plt.show()
print('Horizontal Edges')
plt.imshow(abs(edges_y), cmap='gray')

**Edges Map**

Using the horizontal and vertical edges, we can compute the final edges map. 
Let $G_x$ be the vertical edges and  $G_y$ the horizontal ones. 

The gradient magnitude can be computed as follows:
$G = \sqrt{G_x^2+G_y^2}$
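
As a quick numeric check of the formula, the sketch below evaluates the magnitude element-wise on two small, made-up gradient maps (the values are arbitrary).

```python
import torch

# hypothetical 2 x 2 derivative maps
g_x = torch.tensor([[3.0, 0.0], [0.0, -6.0]])
g_y = torch.tensor([[4.0, 2.0], [0.0, 8.0]])

# element-wise gradient magnitude; the sign of each derivative does not matter
magnitude = torch.sqrt(g_x ** 2 + g_y ** 2)
print(magnitude)
```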

In [None]:
# TODO - define a function that given an input image, estimates the edges map
# the function will return the edges map along with the maps corresponding to vertical and horizontal edges
def sobel_edges(image):
  ...
  return edges, edges_x, edges_y

edges, _, _ = sobel_edges(image)

print('Original Image')
plt.imshow(image, cmap='gray')
plt.show()
print('Edges')
plt.imshow(edges, cmap='gray')
plt.show()

In [None]:
# now, let's apply the edge detector over a real image 
# TODO - replace with path to the city.png image 
real_image_path = '/content/gdrive/MyDrive/course_fmi_2022/lab5_images/city.png'
transf = transforms.ToTensor()
image = transf(Image.open(real_image_path).convert(mode='L'))[0]

edges, edges_x, edges_y = sobel_edges(image)
print('Original Image')
plt.imshow(image, cmap='gray')
plt.show()
print('Vertical Edges')
plt.imshow(edges_x, cmap='gray')
plt.show()
print('Horizontal Edges')
plt.imshow(edges_y, cmap='gray')
plt.show()
print('Edges')
plt.imshow(edges, cmap='gray')
plt.show()

## 2D convolutions over multi-channel data








We will illustrate convolutions over three-channel images (RGB images):
<div>
<img src=https://drive.google.com/uc?id=1MZ64Ay5PIozenvyIzN1TveNEj-ATqnLd width="600"/>
</div>

The operations can be extrapolated to higher numbers of channels.

<div>
<img src=https://drive.google.com/uc?id=1Vu2CCnsmH7EGR5_nTDfh7ygOXnIP83CG width="600"/>
</div>

*Image Source: [link](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)*

Note: we will reuse the functions implemented above; in the next section we will learn how to use PyTorch's efficient implementation of convolutions. 

### Compute Grayscale Images



Let $R$, $G$ and $B$ be the red, green and blue channels of an image. 

We will consider the weighted method for computing the grayscale representation of the image, as follows:

$grayscale = 0.299 R + 0.587 G + 0.114 B$
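
The weighted sum can be checked directly on a few pixels before implementing it with convolutions. The sketch below applies the weights through broadcasting to a made-up $1\times 3$ image holding a pure red, a pure green and a white pixel; the variable names are hypothetical.

```python
import torch

gray_weights = torch.tensor([0.299, 0.587, 0.114])

# (channels, height, width) = (3, 1, 3): red pixel, green pixel, white pixel
pixels = torch.zeros(3, 1, 3)
pixels[0, 0, 0] = 1.0   # red
pixels[1, 0, 1] = 1.0   # green
pixels[:, 0, 2] = 1.0   # white

# weight every channel, then sum over the channel dimension
gray = (gray_weights[:, None, None] * pixels).sum(dim=0)
print(gray)
```

Since the three weights sum to 1, a white pixel stays white in the grayscale image.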

In [None]:
# TODO - we will implement the color conversion using convolution operations 
# Note: this is a highly inefficient approach and is used solely for exemplification
def rgb_2_grayscale(image):
  # define the kernel
  kernel = ...
  # compute the grayscale image 
  grayscale = convolution(...)+convolution(...)+convolution(...)
  return grayscale

# TODO - replace with path to the city.png image 
image_path = '/content/gdrive/MyDrive/course_fmi_2022/Lab 5/city.png'
transf = transforms.ToTensor()
image = transf(Image.open(image_path))

grayscale = rgb_2_grayscale(image)

print('Original Image')
plt.imshow(image.permute(1,2,0))
plt.show()
print('Grayscale Image')
plt.imshow(grayscale, cmap='gray')
plt.show()

### Compute Blurred Grayscale Images


What if we also want to compute the blurred version of the image? In that case, we receive an RGB image and want to apply a convolution that returns the blurred grayscale version of the image. 

In order to blur an image, we can simply perform an average of pixels in a local neighborhood. 

E.g. the following kernel can perform a blur over a $3\times 3$ neighborhood

$\begin{bmatrix}\frac{1}{9} & \frac{1}{9} & \frac{1}{9} \\ \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \\ \frac{1}{9} & \frac{1}{9} & \frac{1}{9} \end{bmatrix}$
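
The effect of the averaging kernel is easiest to see on an impulse: a single bright pixel gets spread evenly over its neighborhood. This is a sketch using `torch.nn.functional.conv2d` and a made-up $5\times 5$ image; the names are hypothetical.

```python
import torch
import torch.nn.functional as F

blur = torch.full((3, 3), 1.0 / 9.0)   # the entries sum to 1, so overall brightness is preserved

impulse = torch.zeros(5, 5)
impulse[2, 2] = 9.0                    # a single bright pixel in the centre

# no padding: every 3 x 3 window of the 5 x 5 image contains the centre pixel
blurred = F.conv2d(impulse[None, None], blur[None, None])[0, 0]
print(blurred)
```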

In [None]:
# TODO - implement the blur & grayscale conversion using convolution operations 
# Note: this is a highly inefficient approach and is used solely for exemplification
def rgb_2_blurred_grayscale(image):
  # define the kernel that will perform the conversion
  # (3 x 3 averaging kernel, scaled by the grayscale weight of the red channel)
  kernel_r = torch.ones((3,3))* (1/9) * 0.299
  kernel_g = ... 
  kernel_b = ... 
  kernel = torch.cat(...) 
  # compute the result 
  blurred_grayscale = convolution(...)+convolution(...)+convolution(...)
  return blurred_grayscale

# TODO - replace with path to the city.png image 
image_path = '/content/gdrive/MyDrive/course_fmi_2022/Lab 5/city.png'
transf = transforms.ToTensor()
image = transf(Image.open(image_path))

blurred_grayscale = rgb_2_blurred_grayscale(image)

print('Original Image')
plt.imshow(image.permute(1,2,0))
plt.show()
print('Blurred Grayscale Image')
plt.imshow(blurred_grayscale, cmap='gray')
plt.show()
print('Grayscale Image')
plt.imshow(grayscale, cmap='gray')
plt.show()

# <font color='ED1F24'>Part II: CNN Layers  



<div>
<img src=https://drive.google.com/uc?id=1PtO6mNFtm71ftrn2ea0huyvEk9PpmXSd width="750"/>
</div>

*Image Source: [FloydHub](https://blog.floydhub.com/building-your-first-convnet/)*

## Convolution Layer



[torch.nn.Conv2d](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) - applies a 2D convolution over a single / multi-channel input signal


<div>
<img src=https://drive.google.com/uc?id=1BvNJZQqLIBHZC7d5Ojix8orIOYQ9fbSl width="1000"/>
</div>

Input size: $(N, C_{in}, H, W)$

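Output size: $(N, C_{out}, H_{out}, W_{out})$, where the output spatial size $(H_{out}, W_{out})$ depends on the kernel size, padding, stride and dilation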

$N$ - batch size / number of examples

$(H, W)$ - image size

$C_{in}$ - number of input channels (e.g. 3 for RGB images)

$C_{out}$ - number of output channels (the number of filters)

The output of the layer:

$out(N_i, {C_{out}}_{j})=bias({C_{out}}_{j}) + \sum_{k=0}^{C_{in}-1}weight({C_{out}}_{j}, k)\ast input(N_i, k)$

**Stride** - defines the kernel shift 
<div>
<img src=https://drive.google.com/uc?id=18yxT_b-cJvZAwxi1FrssToqmOJyzfvPC width="200"/>
</div>

*Image Source: [GitHub](https://github.com/vdumoulin/conv_arithmetic)*

**Dilation**

<div>
<img src=https://drive.google.com/uc?id=1abJdd9TECdpoImtWW8MSxzioa-50znHB width="200"/>
</div>

*Image Source: [GitHub](https://github.com/vdumoulin/conv_arithmetic)*
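
Kernel size, padding, stride and dilation together determine the spatial size of the output. The helper below is a sketch of the output-size formula given in the `torch.nn.Conv2d` documentation, $H_{out} = \left\lfloor \frac{H + 2p - d(k-1) - 1}{s} \right\rfloor + 1$, applied to one dimension; the function name `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    # effective kernel extent is dilation * (kernel_size - 1) + 1
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv_output_size(4, 3))                 # plain 3 x 3 kernel over a 4-pixel side
print(conv_output_size(4, 3, padding=1))      # zero padding of 1 keeps the size
print(conv_output_size(5, 3, stride=2))       # stride 2 skips every other position
print(conv_output_size(7, 3, dilation=2))     # dilation 2 makes the kernel span 5 pixels
```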

**Groups**

<div>
<img src=https://drive.google.com/uc?id=1CuK7nOzuFgOdhBDf723PKQcrz1Hkr-gV width="1000"/>
</div>

*Image Source: [link](https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215)*
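
With groups, each filter only sees a slice of the input channels, which shows up in the weight shape $(C_{out}, C_{in}/groups, k_H, k_W)$. The sketch below compares a standard and a grouped layer; the channel counts (4 and 8) are arbitrary assumptions chosen only for illustration.

```python
import torch.nn as nn

dense_conv = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, bias=False)
grouped_conv = nn.Conv2d(in_channels=4, out_channels=8, kernel_size=3, groups=2, bias=False)

# every filter of the grouped layer covers only 4 / 2 = 2 input channels
print(dense_conv.weight.shape, dense_conv.weight.numel())
print(grouped_conv.weight.shape, grouped_conv.weight.numel())
```

In this configuration the grouped layer uses half as many weights as the standard one.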

Now, let's play with a few convolutions

In [None]:
conv_layer = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
print(conv_layer.weight.shape)
print(conv_layer.weight) 
print(conv_layer.bias.shape)
print(conv_layer.bias)

In [None]:
conv_layer = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(3,5))
print(conv_layer.weight.shape)
print(conv_layer.weight) 
print(conv_layer.bias.shape)
print(conv_layer.bias)

In [None]:
conv_layer = torch.nn.Conv2d(in_channels=3, out_channels=1, kernel_size=(3,3))
print(conv_layer.weight.shape)
print(conv_layer.weight) 
print(conv_layer.bias.shape)
print(conv_layer.bias)

In [None]:
# Try running the following configuration 
# Is everything working fine? Can you explain why?
conv_layer = torch.nn.Conv2d(in_channels=3, out_channels=1, kernel_size=(3,5), groups=3)
print(conv_layer.weight.shape)
print(conv_layer.weight) 
print(conv_layer.bias.shape)
print(conv_layer.bias)

In [None]:
# We can drop the bias term
conv_layer = torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=(3,5), groups=3, bias=False)
print(conv_layer.weight.shape)
print(conv_layer.weight) 
# with bias=False the layer has no bias term, so conv_layer.bias is None (and has no shape)
print(conv_layer.bias)

### Edge detection


Use torch.nn.Conv2d to implement the edge detector

In [None]:
def sobel_edges_with_conv2d(image):
  # layer computing the derivative along the horizontal axis (responds to vertical edges)
  sobel_x_layer = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)
  # layer computing the derivative along the vertical axis (responds to horizontal edges)
  sobel_y_layer = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

  kernel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
  kernel_y = torch.tensor([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]])
  with torch.no_grad():
    # the layer weights have shape (1, 1, 3, 3); the 3 x 3 kernels are broadcast on copy
    sobel_x_layer.weight.data.copy_(kernel_x)
    sobel_y_layer.weight.data.copy_(kernel_y)
  
  # note that the conv2d layer receives a batch of N images: (N, C, H, W)
  edges_x = sobel_x_layer(image[None,None,:,:]).squeeze().detach()
  edges_y = sobel_y_layer(image[None,None,:,:]).squeeze().detach()
  edges = torch.sqrt(torch.pow(edges_x,2)+torch.pow(edges_y,2))
  return edges, edges_x, edges_y
  
real_image_path = '/content/gdrive/MyDrive/course_fmi_2022/lab5_images/img_square_resize.jpg'
transf = transforms.ToTensor()
image = transf(Image.open(real_image_path).convert(mode='L'))[0]

edges, edges_x, edges_y = sobel_edges_with_conv2d(image)

print('Original Image')
plt.imshow(image, cmap='gray')
plt.show()
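# edges_x holds the derivative along the horizontal axis, i.e. the vertical edges
print('Vertical Edges')
plt.imshow(edges_x, cmap='gray')
plt.show()
# edges_y holds the derivative along the vertical axis, i.e. the horizontal edges
print('Horizontal Edges')
plt.imshow(edges_y, cmap='gray')
plt.show()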
print('Edges')
plt.imshow(edges, cmap='gray')
plt.show()


## Pooling Layer



<div>
<img src=https://drive.google.com/uc?id=1a8mTuEPXVXpcTnV08V3PjOIh7AYTlTme width="400"/>
</div>

*Image Source: [link](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)*
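
Pooling summarizes each window with a single value. The sketch below runs max and average pooling on a small deterministic input so that the results can be verified by hand; the `pool_` names are hypothetical.

```python
import torch
import torch.nn as nn

# 4 x 4 input holding the values 0..15, with batch and channel dimensions added
pool_input = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# kernel_size=2 with the default stride (equal to the kernel size): non-overlapping 2 x 2 windows
pool_max = nn.MaxPool2d(kernel_size=2)(pool_input)[0, 0]
pool_avg = nn.AvgPool2d(kernel_size=2)(pool_input)[0, 0]
print(pool_max)
print(pool_avg)
```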

### Max Pooling





[torch.nn.MaxPool2d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html#torch.nn.MaxPool2d)

<div>
<img src=https://drive.google.com/uc?id=1uInwM4lPwuda5i_yxggSMYL9aLFhMJGi width="1000"/>
</div>

In [None]:
x = torch.rand((4,4))
print(x)
maxpool_layer = torch.nn.MaxPool2d(kernel_size=2)
x = x.unsqueeze(0).unsqueeze(0)
x = maxpool_layer(x)
print(x)

### Average Pooling



[torch.nn.AvgPool2d](https://pytorch.org/docs/stable/generated/torch.nn.AvgPool2d.html#torch.nn.AvgPool2d)

<div>
<img src=https://drive.google.com/uc?id=1fs2PhzwTIKZe9WHtP1nJ-tTLR5sD5bHC width="1000"/>
</div>


In [None]:
x = torch.rand((4,4))
print(x)
avgpool_layer = torch.nn.AvgPool2d(kernel_size=2)
x = x.unsqueeze(0).unsqueeze(0)
x = avgpool_layer(x)
print(x)

# <font color='ED1F24'> Part III: CNNs for Image Classification

## MNIST Dataset








For this section we will use [The MNIST Database](http://yann.lecun.com/exdb/mnist/).


*   handwritten digits
*   size-normalized and centered in a fixed-size image ($28 \times 28$)

*   training set - 60 000 examples 
*   test set - 10 000 examples 

In [None]:
# download the MNIST dataset
!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

In [None]:
# we create a loader to iterate through the dataset
# https://pytorch.org/vision/stable/datasets.html
batch_size = 64
train_dataloader = torch.utils.data.DataLoader(
    datasets.MNIST('./', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=batch_size, shuffle=True,drop_last=True)

test_dataloader = torch.utils.data.DataLoader(
    datasets.MNIST('./', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=batch_size, shuffle=False,drop_last=True)

first_train_batch_imgs, first_train_batch_labels = next(iter(train_dataloader))
print(first_train_batch_imgs.shape)
print(first_train_batch_labels.shape)

f, axarr = plt.subplots(1,5)
for i in range(5):
  axarr[i].imshow(first_train_batch_imgs[i,0], cmap='gray')
print(f'Labels of the shown images: {first_train_batch_labels[:5]}')

## Image Classification with a Fully Connected Network

We will try to solve the problem with a fully connected network first.





In [None]:
class FCNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1 = nn.Linear(in_features=1*28*28, out_features=500)
        self.output_layer = nn.Linear(in_features=500, out_features = 10)
        self.activation_fn = nn.ReLU()

    def forward(self, x):
        # flatten the images
        x = x.view(x.shape[0], -1)
        x = self.activation_fn(self.layer_1(x))
        x = self.output_layer(x)
        return x

In [None]:
torch.manual_seed(115)
fcnn_model = FCNNClassifier()

num_params = 0
print("Model's parameters: ")
for n, p in fcnn_model.named_parameters():
    print('\t', n, ': ', p.size())
    num_params += p.numel()
print("Number of model parameters: ", num_params)

We will define the training and validation loops.
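
Both loops rely on two small building blocks: `nn.CrossEntropyLoss`, which takes raw scores (logits) of shape `(N, num_classes)` and integer class labels of shape `(N,)`, and an accuracy computed from the `argmax` of the scores. The sketch below shows them on made-up scores for 4 examples and 3 classes; all `demo_` names and values are hypothetical.

```python
import torch
import torch.nn as nn

# raw scores for 4 examples and 3 classes (no softmax applied)
demo_logits = torch.tensor([[2.0, 0.0, 0.0],
                            [0.0, 3.0, 0.0],
                            [0.0, 0.0, 1.0],
                            [1.0, 0.0, 0.0]])
demo_labels = torch.tensor([0, 1, 2, 1])

# the loss applies log-softmax internally and averages over the batch
demo_loss = nn.CrossEntropyLoss()(demo_logits, demo_labels)

# predicted class = index of the largest score; the last example is misclassified
demo_pred = demo_logits.argmax(dim=1)
demo_accuracy = (demo_pred == demo_labels).float().mean().item()
print(demo_loss.item(), demo_accuracy)
```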

In [None]:
def train_epoch(model, train_dataloader, loss_crt, optimizer, device):
    """
    model: Model object 
    train_dataloader: DataLoader over the training dataset
    loss_crt: loss function object
    optimizer: Optimizer object
    device: torch.device('cpu') or torch.device('cuda')

    The function returns: 
     - the epoch training loss, which is an average over the individual batch
       losses
     - the epoch training accuracy (in %), averaged over the batches
    """
    model.train()
    epoch_loss = 0.0
    epoch_accuracy = 0.0
    num_batches = len(train_dataloader)
    
    for batch_idx, batch in tqdm(enumerate(train_dataloader)):
        # shape: batch_size x 1 x 28 x 28, batch_size x 1
        batch_img, batch_labels = batch
        
        # move data to GPU
        batch_img = batch_img.to(device)
        batch_labels = batch_labels.to(device)
        
        # initialize as zeros all the gradients of the model
        model.zero_grad()

        # get predictions from the FORWARD pass 
        # shape: batch_size x 10
        output = model(batch_img)

        loss = loss_crt(output, batch_labels.squeeze())       
        loss_scalar = loss.item()

        # BACKPROPAGATE the gradients
        loss.backward()
        # use the gradients to OPTIMISE the model
        optimizer.step()
        
        epoch_loss += loss_scalar

        pred = output.argmax(dim=1, keepdim=True)
        epoch_accuracy += pred.eq(batch_labels.view_as(pred)).float().mean().item()
        
    epoch_loss = epoch_loss/num_batches
    epoch_accuracy = 100. * epoch_accuracy/num_batches
    return epoch_loss, epoch_accuracy

def eval_epoch(model, val_dataloader, loss_crt, device):
    """
    model: Model object 
    val_dataloader: DataLoader over the validation dataset
    loss_crt: loss function object
    device: torch.device('cpu') or torch.device('cuda')

    The function returns: 
     - the epoch validation loss, which is an average over the individual batch
       losses
     - the epoch validation accuracy (in %), averaged over the batches
    """
    model.eval()
    epoch_loss = 0.0
    epoch_accuracy = 0.0
    num_batches = len(val_dataloader)
    with torch.no_grad():
        for batch_idx, batch in tqdm(enumerate(val_dataloader)):
            # shape: batch_size x C x H x W (1 x 28 x 28 for MNIST), batch_size x 1
            batch_img, batch_labels = batch
            current_batch_size = batch_img.size(0)

            # move data to GPU
            batch_img = batch_img.to(device)
            batch_labels = batch_labels.to(device)
 
            # batch_size x 10
            output = model(batch_img)

            loss = loss_crt(output, batch_labels.squeeze())
            loss_scalar = loss.item()

            epoch_loss += loss_scalar

            pred = output.argmax(dim=1, keepdim=True)
            epoch_accuracy += pred.eq(batch_labels.view_as(pred)).float().mean().item()

    epoch_loss = epoch_loss/num_batches
    epoch_accuracy = 100. * epoch_accuracy/num_batches
    return epoch_loss, epoch_accuracy

In [None]:
# move the model to GPU (when available)
fcnn_model.to(device)

# create a SGD optimizer
optimizer = torch.optim.SGD(fcnn_model.parameters(), lr=0.01, momentum=0.9)

# set up loss function
loss_criterion = nn.CrossEntropyLoss()

num_epochs = 10
train_losses = []
train_accuracies = []
val_losses = []
val_accuracies = []
for epoch in range(1, num_epochs+1):
  train_loss, train_accuracy = train_epoch(fcnn_model, train_dataloader, loss_criterion, optimizer, device)
  val_loss, val_accuracy = eval_epoch(fcnn_model, test_dataloader, loss_criterion, device)
  train_losses.append(train_loss)
  val_losses.append(val_loss)
  train_accuracies.append(train_accuracy)
  val_accuracies.append(val_accuracy)
  print('\nEpoch %d'%(epoch))
  print('train loss: %10.8f, accuracy: %10.8f'%(train_loss, train_accuracy))
  print('val loss: %10.8f, accuracy: %10.8f'%(val_loss, val_accuracy))


In [None]:
# plot loss & accuracy
plt.figure()
plt.plot(train_losses, label='train_loss', color='red')
plt.plot(val_losses, label='val_loss', color='blue')
plt.legend()
plt.show()
plt.plot(train_accuracies, label='train_accuracy', color='red')
plt.plot(val_accuracies, label='val_accuracy', color='blue')
plt.legend()
plt.show()


## Image Classification with a Convolutional Neural Network





Now, we will solve the same classification problem using a CNN model. 

The architecture:

*   Conv Layer: 20 filters, kernel size: 5x5, stride: 1
*   ReLU
*   Max Pool Layer: kernel size: 2x2, stride: 2
*   Conv Layer: 50 filters, kernel size: 5x5, stride:1
*   ReLU 
*   Max Pool Layer: kernel size: 2x2, stride: 2
*   Fully Connected Layer: 500 neurons
*   ReLU
*   Fully Connected Layer: 10 neurons 
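
One way to find the input size of the first fully connected layer is to push a dummy input through the convolutional part and read off the resulting shape. The sketch below does this for a deliberately different, hypothetical toy stack (one convolution with 4 filters of size $3\times 3$ followed by $2\times 2$ max pooling), so its numbers do not match the architecture above.

```python
import torch
import torch.nn as nn

# hypothetical toy feature extractor, NOT the architecture from the list above
toy_features = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

# a single fake MNIST-sized image is enough to trace the shapes
dummy = torch.zeros(1, 1, 28, 28)
with torch.no_grad():
    toy_maps = toy_features(dummy)

# flatten everything except the batch dimension
toy_flat_size = toy_maps.flatten(start_dim=1).shape[1]
print(toy_maps.shape, toy_flat_size)
```

The same trick applies to any stack of convolution and pooling layers, which avoids working out the sizes by hand.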

In [None]:
# implement the above architecture 
class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(...)
        self.conv2 = nn.Conv2d(...)
        # note that the maps are flattened before this layer
        # you need to infer the input size of the fully connected layer
        self.fc1 = nn.Linear(...)
        self.fc2 = nn.Linear(...)
        self.activation_fn = ...
        self.pool = nn.MaxPool2d(...)

    def forward(self, x):
        x = ...
        return x

In [None]:
torch.manual_seed(115)
cnn_model = CNNClassifier()

num_params = 0
print("Model's parameters: ")
for n, p in cnn_model.named_parameters():
    print('\t', n, ': ', p.size())
    num_params += p.numel()
print("Number of model parameters: ", num_params)

In [None]:
# move the model to GPU (when available)
cnn_model.to(device)

# create a SGD optimizer
optimizer = torch.optim.SGD(cnn_model.parameters(), lr=0.01, momentum=0.9)

# set up loss function
loss_criterion = nn.CrossEntropyLoss()

num_epochs = 10
train_losses = []
train_accuracies = []
val_losses = []
val_accuracies = []
for epoch in range(1, num_epochs+1):
  train_loss, train_accuracy = train_epoch(cnn_model, train_dataloader, loss_criterion, optimizer, device)
  val_loss, val_accuracy = eval_epoch(cnn_model, test_dataloader, loss_criterion, device)
  train_losses.append(train_loss)
  val_losses.append(val_loss)
  train_accuracies.append(train_accuracy)
  val_accuracies.append(val_accuracy)
  print('\nEpoch %d'%(epoch))
  print('train loss: %10.8f, accuracy: %10.8f'%(train_loss, train_accuracy))
  print('val loss: %10.8f, accuracy: %10.8f'%(val_loss, val_accuracy))

In [None]:
# plot loss & accuracy
plt.figure()
plt.plot(train_losses, label='train_loss', color='red')
plt.plot(val_losses, label='val_loss', color='blue')
plt.legend()
plt.show()
plt.plot(train_accuracies, label='train_accuracy', color='red')
plt.plot(val_accuracies, label='val_accuracy', color='blue')
plt.legend()
plt.show()

# <font color='ED1F24'> Part IV: Transfer Learning 


<div>
<img src=https://drive.google.com/uc?id=1Lx_VHuOyjariEPPrB6iahV0_30mrXLB7 width="700"/>
</div>

**How**:


*   finetuning - start from the pretrained weights and retrain the whole model.
*   feature extraction - use the pretrained model solely as a feature extractor and train a classifier on top of those features.

In this laboratory we will implement the feature extraction approach.
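
The idea can be sketched on a tiny stand-in model: the backbone parameters are excluded from training by setting `requires_grad = False`, and only a new classification head remains trainable. The backbone below is a hypothetical toy network, not ResNet, and the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# hypothetical toy "pretrained" backbone producing an 8-dimensional feature vector
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
# freeze the backbone: no gradients are computed for these parameters
for p in toy_backbone.parameters():
    p.requires_grad = False

# new classifier for 2 classes, trained from scratch
toy_head = nn.Linear(8, 2)
toy_model = nn.Sequential(toy_backbone, toy_head)

total_params = sum(p.numel() for p in toy_model.parameters())
trainable_params = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(total_params, trainable_params)

# the optimizer only needs the parameters that are still trainable
toy_optimizer = torch.optim.SGD(
    [p for p in toy_model.parameters() if p.requires_grad], lr=0.01)
```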



## ResNet Architecture



<div>
<img src=https://drive.google.com/uc?id=1QX_AYUEYYzQcBLVGaRHfXj24ttmHFdHN width="1000"/>
</div>

*Image Source: He et al. Deep Residual Learning for Image Recognition [paper](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)*


Main advantages
*   train deeper models
*   easily optimise by allowing direct paths between lower and upper levels

More details regarding the ResNet architecture will be discussed in the next laboratory and during the lectures. For the moment, we will treat it as a black box.

## Pretrained ResNet

Multiple model definitions along with their pretrained weights are available in the [torchvision.models](https://pytorch.org/vision/0.8/models.html). The models have been pretrained on the 1000-class ImageNet dataset.

The model inputs are mini-batches of 3-channel images, each of shape $(3\times height \times width)$. $height$ and $width$ are expected to be at least $224$.

The input images should be in range $[0,1]$, normalized using:

$mean=[0.485, 0.456, 0.406]$

$std=[0.229, 0.224, 0.225]$
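
The normalization is applied per channel as $(x - mean) / std$. The sketch below performs it by hand with broadcasting on a made-up $2\times 2$ image whose pixels equal the channel means, so the expected result is all zeros; the variable names are hypothetical.

```python
import torch

# per-channel statistics, reshaped to (3, 1, 1) so they broadcast over height and width
norm_mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
norm_std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

# made-up 3 x 2 x 2 image in [0, 1] where every pixel equals the channel mean
mean_image = norm_mean.expand(3, 2, 2).clone()

normalized = (mean_image - norm_mean) / norm_std
print(normalized)
```

`torchvision.transforms.Normalize` performs the same per-channel operation on tensors produced by `ToTensor()`.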



In [None]:
import torchvision.models as models 
# load the ResNet-18 model, with randomly initialized weights 
resnet18_random = models.resnet18(pretrained=False)
# load the ResNet-18 model, with weights pretrained on ImageNet 
resnet18_pretrained = models.resnet18(pretrained=True)

num_params = 0
print("Model's parameters: ")
for n, p in resnet18_pretrained.named_parameters():
    print('\t', n, ': ', p.size())
    num_params += p.numel()
print("Number of model parameters: ", num_params)

In [None]:
# TODO - apply the ResNet model over a randomly generated image (224 x 224) & check the output dimension 
...



In [None]:
# TODO - apply the ResNet model over a real image, after resizing it to 224 x 224
# do not forget to apply the required transformations over the image
# try resizing the image to 28 x 28 & see what happens 
# TODO - replace with path to the city.png image 
image_path = '/content/gdrive/MyDrive/course_fmi_2022/Lab 5/city.png'
# define transformations to be applied over the image
transf = ...
image = ...
# apply ResNet
res = resnet18_pretrained(...)
print(res.shape)


## Cats vs. Dogs Dataset





We will work on the Cats vs. Dogs Dataset introduced in Laboratory 3.

*   classify images based on their content
*   images containing cats or dogs $⇒$ 2 classes
*   training set - 2000 examples 
*   test set - 1000 examples

In [None]:
import zipfile
import os

# Download the samples
!wget --no-check-certificate \
https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
-O /tmp/cats_and_dogs_filtered.zip

# Extract the data
local_zip = '/tmp/cats_and_dogs_filtered.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp')
zip_ref.close()

# set up train and validation dirs
base_dir = '/tmp/cats_and_dogs_filtered'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')

In [None]:
class CatsAndDogsDataset(Dataset):
  def __init__(self, data_dir: str, transform=None):
    # dir structure:
    # /
    #  /cats/
    #        cat.0.jpg
    #        cat.1.jpg
    #        ...
    #  /dogs/
    #        dog.0.jpg  
    #        dog.1.jpg
    #        ...           
    #  
    self.data_dir = data_dir

    # directory containing cat pictures
    self.cats_dir = os.path.join(self.data_dir, 'cats')

    # directory containing dog pictures
    self.dogs_dir = os.path.join(self.data_dir, 'dogs')

    self.cat_fnames = [os.path.join(self.cats_dir, fname) \
                        for fname in os.listdir(self.cats_dir)]
    self.dog_fnames = [os.path.join(self.dogs_dir, fname) \
                        for fname in os.listdir(self.dogs_dir)]

    self.fnames = self.cat_fnames + self.dog_fnames

    self.labels = len(self.cat_fnames) * [0] + len(self.dog_fnames) * [1]

    # TODO - implement the transformations 
    #      - resize images to 224 x 224 
    #      - normalize images 
  
    self.image_transforms = ...

  def __getitem__(self, index):
    fname = self.fnames[index]
    img_obj = Image.open(fname)
    img_tensor = self.image_transforms(img_obj)
    
    # retrieve the image's label and store it into a Tensor
    label = self.labels[index]
    label_tensor = torch.tensor([label])

    return img_tensor, label_tensor
  
  def __len__(self):
    return len(self.fnames)

In [None]:
# get train & validation datasets 
train_dataset = CatsAndDogsDataset(data_dir=train_dir)
validation_dataset = CatsAndDogsDataset(data_dir=validation_dir)
# instantiate the dataloaders
batch_size = 64
train_dataloader = DataLoader(
    dataset=train_dataset, 
    batch_size=batch_size,
    shuffle=True
)
validation_dataloader = DataLoader(
    dataset=validation_dataset, 
    batch_size=batch_size
)

# visualize a few examples
train_batch_imgs, train_batch_labels = next(iter(train_dataloader))
print(train_batch_imgs.shape)
print(train_batch_labels.shape)

f, axarr = plt.subplots(1,5)
for i in range(5):
  axarr[i].imshow(train_batch_imgs[i].permute(1,2,0))
print(f'Labels of the shown images: {train_batch_labels[:5]}')

## Feature extraction

We use the pretrained model solely for feature extraction and we train a classifier on top of those features.
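
To make the idea concrete, here is a minimal sketch on a toy model rather than on ResNet-18. The names `toy_backbone`, `toy_head` and `toy_model` are hypothetical, and the tiny convolutional stack only stands in for a pretrained feature extractor. The backbone is frozen, so a training step updates the classifier alone.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained feature extractor
toy_backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
# New classifier trained on top of the extracted features (2 classes)
toy_head = nn.Linear(8, 2)
toy_model = nn.Sequential(toy_backbone, toy_head)

# Freeze the feature extractor: its parameters will not receive gradients
for param in toy_backbone.parameters():
    param.requires_grad = False

num_params = sum(p.numel() for p in toy_model.parameters())
num_trainable = sum(p.numel() for p in toy_model.parameters() if p.requires_grad)
print(num_params, num_trainable)

# One optimisation step on random data (illustration only)
optimizer = torch.optim.SGD(
    [p for p in toy_model.parameters() if p.requires_grad], lr=0.01
)
images = torch.randn(4, 3, 16, 16)
labels = torch.tensor([0, 1, 0, 1])
backbone_before = toy_backbone[0].weight.clone()

loss = nn.CrossEntropyLoss()(toy_model(images), labels)
loss.backward()
optimizer.step()

# The frozen weights are untouched by the update
backbone_unchanged = torch.equal(backbone_before, toy_backbone[0].weight)
print(backbone_unchanged)
```

Only the parameters of the linear head are counted as trainable here, which is the same pattern the cells below apply to the pretrained network.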

In [None]:
# function counting the number of parameters and the number of trainable parameters of a model 
# optionally, it will also display the layers
def check_model_parameters(model, display_layers=False):
  num_params = 0
  num_trainable_params = 0
  if display_layers:
    print("Model's parameters: ")
  # iterate over the parameters of the model received as argument
  for n, p in model.named_parameters():
      if display_layers:
        print('\t', n, ': ', p.size())
      num_params += p.numel()
      if p.requires_grad:
        num_trainable_params += p.numel()
  print("Number of model parameters: ", num_params)
  print("Number of trainable parameters: ", num_trainable_params)

In [None]:
# load the pretrained model and freeze its parameters
import torchvision.models as models 
# load the ResNet-18 model, with weights pretrained on ImageNet 
resnet18_pretrained = models.resnet18(pretrained=True)

# check the number of parameters and the number of trainable parameters
check_model_parameters(resnet18_pretrained, display_layers=False)

# freeze all the layers
for param in resnet18_pretrained.parameters():
  param.requires_grad = False 

# check the number of parameters and the number of trainable parameters
check_model_parameters(resnet18_pretrained, display_layers=False)


In [None]:
torch.manual_seed(115)
# TODO - change the last layer of the model to adapt it for our task
# Hint - look at the layers of the model, you need to change the last one s.t. the number of output classes is 2, instead of 1000
#      - this last layer is a simple linear layer
...

check_model_parameters(resnet18_pretrained, display_layers=True)

In [None]:
def train_epoch(model, train_dataloader, loss_crt, optimizer, device):
    """
    model: Model object 
    train_dataloader: DataLoader over the training dataset
    loss_crt: loss function object
    optimizer: Optimizer object
    device: torch.device('cpu) or torch.device('cuda')

    The function returns: 
     - the epoch training loss, which is an average over the individual batch
       losses
    """
    model.train()
    epoch_loss = 0.0
    epoch_accuracy = 0.0
    num_batches = len(train_dataloader)
    
    for batch_idx, batch in tqdm(enumerate(train_dataloader)):
        # shape: batch_size x 3 x 224 x 224, batch_size x 1
        batch_img, batch_labels = batch
        
        # move data to GPU
        batch_img = batch_img.to(device)
        batch_labels = batch_labels.to(device)
        
        # initialize as zeros all the gradients of the model
        model.zero_grad()

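        # get predictions from the FORWARD pass 
        # shape: batch_size x num_classes (2 for cats vs. dogs)
        output = model(batch_img)

        # squeeze only the label dimension, so a batch with a single example still works
        loss = loss_crt(output, batch_labels.squeeze(1))       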
        loss_scalar = loss.item()

        # BACKPROPAGATE the gradients
        loss.backward()
        # use the gradients to OPTIMISE the model
        optimizer.step()
        
        epoch_loss += loss_scalar

        pred = output.argmax(dim=1, keepdim=True)
        epoch_accuracy += pred.eq(batch_labels.view_as(pred)).float().mean().item()
        
    epoch_loss = epoch_loss/num_batches
    epoch_accuracy = 100. * epoch_accuracy/num_batches
    return epoch_loss, epoch_accuracy

def eval_epoch(model, val_dataloader, loss_crt, device):
    """
    model: Model object 
    val_dataloader: DataLoader over the validation dataset
    loss_crt: loss function object
    device: torch.device('cpu) or torch.device('cuda')

    The function returns: 
     - the epoch validation loss, which is an average over the individual batch
       losses
    """
    model.eval()
    epoch_loss = 0.0
    epoch_accuracy = 0.0
    num_batches = len(val_dataloader)
    with torch.no_grad():
        for batch_idx, batch in tqdm(enumerate(val_dataloader)):
            # shape: batch_size x 3 x 224 x 224, batch_size x 1
            batch_img, batch_labels = batch
            current_batch_size = batch_img.size(0)

            # move data to GPU
            batch_img = batch_img.to(device)
            batch_labels = batch_labels.to(device)
 
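            # shape: batch_size x num_classes (2 for cats vs. dogs)
            output = model(batch_img)

            # squeeze only the label dimension, so a batch with a single example still works
            loss = loss_crt(output, batch_labels.squeeze(1))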
            loss_scalar = loss.item()

            epoch_loss += loss_scalar

            pred = output.argmax(dim=1, keepdim=True)
            epoch_accuracy += pred.eq(batch_labels.view_as(pred)).float().mean().item()

    epoch_loss = epoch_loss/num_batches
    epoch_accuracy = 100. * epoch_accuracy/num_batches
    return epoch_loss, epoch_accuracy
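
The accuracy bookkeeping used in both functions can be checked on a tiny hand-made batch. This is only a sketch: the logits and labels below are made-up values, chosen so that the result is easy to verify by hand.

```python
import torch

# Hypothetical logits for a batch of 4 images and 2 classes
output = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2],
                       [0.3, 0.9],
                       [1.5, 0.2]])
# Labels stored as in the dataset above: shape batch_size x 1
batch_labels = torch.tensor([[0], [1], [0], [0]])

# Predicted class = index of the largest logit; keepdim gives shape batch_size x 1
pred = output.argmax(dim=1, keepdim=True)
# Element-wise comparison with the labels, then the fraction of correct predictions
correct = pred.eq(batch_labels.view_as(pred))
batch_accuracy = correct.float().mean().item()

print(pred.squeeze(1).tolist())
print(batch_accuracy)
```

Three of the four predictions match their labels, so the batch accuracy is 0.75. Note that the epoch accuracy is an average of such per-batch values, so a smaller final batch carries the same weight as a full one.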

In [None]:
resnet18_pretrained.to(device)

# create an SGD optimizer (frozen parameters receive no gradients, so only the trainable ones are updated)
optimizer = torch.optim.SGD(resnet18_pretrained.parameters(), lr=0.01, momentum=0.9)

# set up loss function
loss_criterion = nn.CrossEntropyLoss()

# evaluate the initial model 
val_loss, val_accuracy = eval_epoch(resnet18_pretrained, validation_dataloader, loss_criterion, device)
print('Validation performance before finetuning -- loss: %10.8f, accuracy: %10.8f'%(val_loss, val_accuracy))

# finetune the model 
num_epochs = 2
train_losses = []
train_accuracies = []
val_losses = []
val_accuracies = []
for epoch in range(1, num_epochs+1):
  train_loss, train_accuracy = train_epoch(resnet18_pretrained, train_dataloader, loss_criterion, optimizer, device)
  val_loss, val_accuracy = eval_epoch(resnet18_pretrained, validation_dataloader, loss_criterion, device)
  train_losses.append(train_loss)
  val_losses.append(val_loss)
  train_accuracies.append(train_accuracy)
  val_accuracies.append(val_accuracy)
  print('\nEpoch %d'%(epoch))
  print('train loss: %10.8f, accuracy: %10.8f'%(train_loss, train_accuracy))
  print('val loss: %10.8f, accuracy: %10.8f'%(val_loss, val_accuracy))

In [None]:
# TODO - train the resnet architecture from scratch and compare the obtained results 
#      - we will employ the same training strategy 
torch.manual_seed(115)
...