# Convolutional Neural Networks Exercise 2.

In this exercise you will fine-tune/re-train pre-trained models to help you with image classification (still using the CIFAR10 dataset).

The code below is almost ready to run, and making it run requires very little of you. So, the exercise here is to try out different pre-trained models (I have suggested some below) and maybe experiment with only fine-tuning some (later) layers instead of the whole network.

Neither I nor the TAs are going to police anybody, but I would recommend that you try out at least 2 different models beyond the one already implemented below.

Given the right model and settings, you should be able to get an accuracy above 80% and perhaps even above 90%.

Have fun.

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
from torchvision import datasets
import torchvision.transforms as transforms
from torchsummary import summary                    

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Load models

We will start by testing a pre-trained vgg16 model. You can read a bit about the pytorch implementation here:

- https://pytorch.org/vision/master/models/generated/torchvision.models.vgg16.html#torchvision.models.vgg16


And here is a schematic of the vgg16 architecture:
- https://miro.medium.com/max/850/1*_Lg1i7wv1pLpzp2F4MLrvw.png

Hopefully the architecture looks somewhat familiar and you can recognize the "encoder" form. The reason it is denoted *16* is that this vgg implementation has 16 trainable layers (the blue and green blocks in the architecture schematic above).

The point for this exercise is to try different pre-trained networks out. That also means that most of the code below needs no tweaking. When you have changed what little needs to be changed for the vgg16 model to run, and you have noted its performance, I suggest you try to implement these models:

- resnet18
- mobilenet_v3_small
- shufflenet_v2_x0_5
- squeezenet1_0

You can read about these, and many more, under *classification* at:
- https://pytorch.org/vision/master/models.html#classification

Note that some of these networks are quite a bit more complicated, and larger, than e.g. the vgg16. This should not discourage you. For now, you do not have to understand the more complicated architectures. You just need to be able to re-train the networks. Of course, you would always want to read the article(s) presenting the networks if you were to use them for actual research, but since we are just playing around, the pytorch documentation will do.

In [None]:
# Load the vgg16 model-builder

from torchvision.models import vgg16_bn # bn is simply the version with batch normalization

In [None]:
# let's also define our classes here. We will need them in a moment.
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
num_classes = len(classes)

In [None]:
# instantiate a pretrained vgg16_bn model (and send it to the gpu)

model = vgg16_bn(pretrained = True).to(device)

The fact that you have instantiated the model as "pre-trained" means that all the weights in all the filters/kernels are already trained.


In [None]:
# print the model architectures
print(model)

You can also use
`summary(model, (3,224,224))`
or
`for k in model._modules.keys():
  print(k)`
to survey the model if you want to.

As you might see, this model is somewhat similar to what you created last week. The most central part of this printout (right now) is the last layer. As you can see, the output of the model is currently 1000 different classes. If we were using a dataset with the same 1000 classes we could just go ahead. But we only have 10 classes in our dataset, CIFAR10. So we need to change this last layer.

If you look at the very bottom of the printout above you should see something like:

`(classifier): Sequential(`   
`...`  
`(6): Linear(in_features=4096 , out_features=1000, bias=True)`  
`)`

So, we want to change the object at index 6 in the classifier *block*. Specifically, you want to insert a new linear layer here. It should take the same number of in_features, but the number of out_features should correspond to the number of classes in our task.

Note that this is specific to vgg16 (and a couple of other models, such as AlexNet). Other models have different architectures, and the last layer that needs changing might be called something else, be placed differently, or might not even be Linear (remember that you can do convolutions all the way down). So you will need to check each model. If you get stuck here, this page might help you out a bit. Go to **Initialize and Reshape the Networks**:

https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html


Also note: if you do not want to hard-code or remember the number of input features use:

`model.classifier[6].in_features`

In [None]:
model.classifier[...] = nn.Linear(..., ...).to(device)

If you now print the last layer, you should see the appropriate number of out_features:

In [None]:
print(model.classifier[...])


Note that this layer is now completely new. There is nothing pre-trained in it, only "randomly" initialized weights. These need to be trained. But! All the other layers are still pre-trained.

Now, just as we changed the output of our model, we must also change our input, i.e. the images (not the model this time). All pre-trained models expect an input of a specific size (and dimension). For AlexNet, which uses the same preprocessing as vgg16, you can see it here (bottom of the page):

https://pytorch.org/vision/master/models/generated/torchvision.models.alexnet.html#torchvision.models.alexnet

What you should focus on (right now) is that the network expects an image of 224x224 pixels. It also expects the image to be normalized. A couple of things are worth noting here:

- Since these are CNNs they can take images of many different sizes, but the results will suffer (a lot sometimes) if you do not resize your input images accordingly.
- Whether you resize to 256 and then center crop to 224 (as stated in the manual) is less important. You can also resize to 256 and random crop for better data augmentation (during training only, of course).
- According to the manual you should first re-scale the image to [0,1] and then normalize using mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. In practice, normalizing the CIFAR10 images with mean=[0.5, 0.5, 0.5] and std=[0.5, 0.5, 0.5] achieves something similar: zero-centered values, here in [-1, 1].
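
If you want to see what the two normalization choices actually do to the value range, the arithmetic is simple enough to check by hand. The sketch below uses plain Python; `normalized_range` is a hypothetical helper name, and it assumes the pixel values have already been re-scaled to [0, 1] (which is what `ToTensor` does).

```python
def normalized_range(mean, std):
    """Range of (x - mean) / std over all channels when x lies in [0, 1]."""
    lows = [(0.0 - m) / s for m, s in zip(mean, std)]
    highs = [(1.0 - m) / s for m, s in zip(mean, std)]
    return min(lows), max(highs)


imagenet = normalized_range([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
simple = normalized_range([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(f"ImageNet statistics: [{imagenet[0]:.2f}, {imagenet[1]:.2f}]")
print(f"mean = std = 0.5:    [{simple[0]:.2f}, {simple[1]:.2f}]")
```

So the ImageNet statistics give values of roughly [-2.12, 2.64], while mean = std = 0.5 gives exactly [-1, 1]. Both are centered around zero, which is the part the network cares most about.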

I have put a (working) suggestion below, but feel free to experiment with different transformations. 

In [None]:
transform_train = transforms.Compose([transforms.ToTensor(), transforms.Resize(256), transforms.RandomCrop(224), transforms.RandomHorizontalFlip(p=0.5), transforms.RandomRotation(10), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
transform_test = transforms.Compose([transforms.ToTensor(), transforms.Resize(256), transforms.CenterCrop(224), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

And we can now create our data loaders:

In [None]:
os.makedirs('data', exist_ok = True)
os.makedirs('models', exist_ok = True)

trainset = torchvision.datasets.CIFAR10(root='data', train=True, download=True, transform=transform_train)
testset = torchvision.datasets.CIFAR10(root='data', train=False, download=True, transform=transform_test)

batch_size = 4

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=2)


# get some random training images
dataiter=iter(trainloader)
images,labels=next(dataiter)

print(f"shape of image batch: {images.shape} (batch size, color channels, height, width)")
print(f"shape of label batch: {labels.shape}")

Note from the print out above that the images are now 3x224x224 and not 3x32x32. (the first dim = 4 is the batch dimension)

Before we go further, I will just show a couple of different ways you can examine your network architecture, as supplements to just "print(model)":

In [None]:
summary(model, (3,224,224))

In [None]:
for k in model._modules.keys():
  print(k)

We can also check which parameters are trainable:

In [None]:
print("Params to learn:")

for name,param in model.named_parameters():
  if param.requires_grad == True:
    print("\t",name)

Also, we do not need to re-train all layers. If we want to, we can choose to retrain only the last layer (or layers). This is called feature extraction. That is, we are not just using a pre-trained network as a starting point; we are using the specific features learned in that network without changing them (or most of them).

In the code below I freeze all parameters apart from the bias and weight in the last layer. Again note that this code is not universal. It works for vgg16 and a couple others, but networks with different architectures might need different handling.

In [None]:
for param in list(model.parameters())[:-2]: # [:-2] skips the last two parameters (weight and bias of the final layer in vgg16)
  param.requires_grad = False

If we now print the trainable parameters, we see that only the last layer (the weights and the biases here) is trainable:

In [None]:
print("Params to learn:")

for name,param in model.named_parameters():
  if param.requires_grad == True:
    print("\t",name)
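
Freezing by position works, but it depends on the order of the parameter list. An alternative is to freeze by parameter name. The sketch below shows the idea on a tiny stand-in network so it runs instantly; `TinyNet` and `freeze_all_but` are hypothetical names of my own, and the stand-in only mimics the `features`/`classifier` layout of vgg16.

```python
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in with the same two-block layout as vgg16.

    forward() is left out on purpose; only the parameter names matter here.
    """

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 10))


def freeze_all_but(model, trainable_prefix):
    """Freeze every parameter whose name does not start with trainable_prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)


tiny = TinyNet()
freeze_all_but(tiny, "classifier.2")  # for vgg16 the last layer would be "classifier.6"

trainable = [name for name, p in tiny.named_parameters() if p.requires_grad]
print(trainable)
```

On the real model, something like `freeze_all_but(model, "classifier.")` would then leave the whole dense classifier trainable and freeze the convolutional part.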

You could also choose to keep more parameters learnable, like all three layers in the dense classifier for vgg16 (-6 instead of -2). For now, however, we want to retrain all parameters (feel free to go back and experiment later).

In [None]:
# train all parameters
for param in list(model.parameters()):
  param.requires_grad = True


And we can check that all the parameters are now trainable again:

In [None]:
print("Params to learn:")

for name,param in model.named_parameters():
  if param.requires_grad == True:
    print("\t",name)


And now we simply fine-tune/re-train our pre-trained model; there is nothing more to it. Feel free to experiment with different hyperparameters and optimizers.

# NOTE THAT THESE ARE LARGE MODELS. THEY WILL TAKE TIME TO TRAIN EVEN ON A GPU.

So go read the curriculum or some such in the meantime.

In [None]:
lr = 0.001
momentum = 0.9 # only relevant if you switch to optim.SGD; Adam below does not use it

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
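
If you want to try a different optimizer, SGD with momentum is the obvious candidate, since a `momentum` value is already defined. Below is a minimal sketch of how that could look. `toy_model` is a hypothetical stand-in so the snippet runs on its own; in the notebook you would pass the parameters of `model` instead.

```python
import torch.nn as nn
import torch.optim as optim

lr = 0.001
momentum = 0.9

# hypothetical stand-in for the real network, just to have some parameters
toy_model = nn.Linear(4, 2)

# only hand the optimizer the parameters that are still trainable,
# which matters if you have frozen part of the network
trainable_params = [p for p in toy_model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable_params, lr=lr, momentum=momentum)
print(optimizer)
```

Which optimizer and learning rate work best here is something I would treat as an open question to experiment with, not a given.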

In [None]:
# set the model to train mode
model.train()

n_epochs = 4
history_loss = []  # per-batch training loss, kept across all epochs

for epoch in range(n_epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        outputs = model(inputs)

        # loss + backward
        loss = criterion(outputs, labels)
        loss.backward()

        # optimize
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        history_loss.append(loss.item())
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0

print('Finished Training')
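
If you keep a list of per-batch loss values during training, a smoothed curve is usually easier to read than the raw numbers, because the loss of a single small batch jumps around a lot. Here is a minimal sketch of a moving average in plain Python. `moving_average` is a hypothetical helper, and the numbers are made up purely to show the call.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (simple smoothing)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# made-up numbers, only to show the call
losses = [2.3, 2.1, 2.4, 1.9, 1.7, 1.8, 1.5]
smoothed = moving_average(losses, window=3)
print(smoothed)
```

With real training losses you could then do something like `plt.plot(moving_average(your_loss_list, 100))` to get a curve that shows the trend.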

Let's plot some test images


In [None]:
dataiter_test = iter(testloader)
images_test, labels_test = next(dataiter_test)
images_test = images_test.to(device)
labels_test = labels_test.to(device)

model.eval() # inference mode for batch norm and dropout
with torch.no_grad():
    outputs = model(images_test)
_, predicted = torch.max(outputs, 1)

def show_predictions(images = images_test, labels = labels_test, classes = classes, predicted = predicted, n = 4):

    plt.figure(figsize=(10,8))
    
    for i in range(n):
      plt.subplot(1,n,i+1)
    
      img = images[i] / 2 + 0.5 # one image from batch and unnormalize
      npimg = img.cpu().numpy() # from tensor to numpy
      plt.imshow(np.transpose(npimg, (1, 2, 0))) # from shape (3,224,224) -> (224,224,3) bc imshow...
      plt.title(f'true class: {classes[labels[i]]}\npredicted class: {classes[predicted[i]]}')
    plt.show()

# show images
show_predictions(images = images_test, labels = labels_test)

Find the overall accuracy

In [None]:
correct = 0
total = 0

model.eval() # inference mode for batch norm and dropout

# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        images, labels = data

        # if you run on gpu
        images = images.to(device)
        labels = labels.to(device)

        # calculate outputs by running images through the network
        outputs = model(images)
        # the class with the highest score is what we choose as prediction
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct / total:.1f} %')

And the class specific accuracy

In [None]:
# prepare to count predictions for each class
correct_pred = {classname: 0 for classname in classes}
total_pred = {classname: 0 for classname in classes}

# again no gradients needed

model.eval() # still needed: torch.no_grad() only turns off gradients, it does not switch batch norm and dropout to inference mode

with torch.no_grad():
    for data in testloader:
        images, labels = data
        # if you run on gpu
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        _, predictions = torch.max(outputs, 1)
        # collect the correct predictions for each class
        for label, prediction in zip(labels, predictions):
            if label == prediction:
                correct_pred[classes[label]] += 1
            total_pred[classes[label]] += 1


# print accuracy for each class
for classname, correct_count in correct_pred.items():
    accuracy = 100 * float(correct_count) / total_pred[classname]
    print(f'Accuracy for class: {classname:5s} is {accuracy:.1f} %')
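
A quick aside on `model.eval()` versus `torch.no_grad()`, since the two are easy to mix up: `no_grad` only switches off gradient tracking, while `eval` changes how layers such as dropout and batch normalization behave. The sketch below shows this with a single dropout layer (a toy setup of my own, not part of the model above).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

with torch.no_grad():
    drop.train()
    out_train = drop(x)  # entries are randomly zeroed, survivors scaled by 1 / (1 - p) = 2
    drop.eval()
    out_eval = drop(x)   # dropout does nothing in eval mode

print(out_train)
print(out_eval)
```

Even inside `no_grad`, the layer still drops values while in train mode. That is why you want `model.eval()` before measuring accuracy.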

Now go back and try a different model, e.g. mobilenet_v3_small. See if you can beat vgg16_bn.

And now a couple of concluding remarks:

1. The next real challenge is using your own data. First you need to label it. This takes time but given that you now know a bit about using pre-trained networks, you might only need to label 100-1000 images to get okay results.

2. It can be a hassle creating your own data loader for custom data, but it is 100% doable. It might just require some trial and error.

3. We have only really done image classification here. Chances are that if you are doing anything more serious you might want to do object detection or segmentation. Don't worry: the models are a bit more complicated but they exist and you can use pre-trained models here as well.

4. Also, for custom annotations for object detection and segmentation, tools such as LabelImg can help: https://github.com/tzutalin/labelImg.

5. The Pytorch API for pre-trained models is going to change a bit in the future. Nothing major, but you will be able to choose between different pre-trained weights for each model. That is super nice, but it also means that the `pretrained = True` argument is going to be deprecated. See https://pytorch.org/blog/introducing-torchvision-new-multi-weight-support-api/ for more.

6. Lastly, for this (and the previous) exercise we have imported models straight from pytorch and used native pytorch for everything. This works very well, but if you are going to use pre-trained models and want something a bit more powerful, look into detectron2. This is especially relevant if you want to do more complicated stuff like object detection and semantic segmentation. Detectron2 is kind of an API wrapper around a lot of pre-defined and pre-trained models, all implemented in pytorch. It is super powerful and it works very similarly to what you just did. Indeed, it is actually a bit more intuitive here and there once you get to know it. These links might prove useful:


https://ai.facebook.com/blog/-detectron2-a-pytorch-based-modular-object-detection-library-/   
https://detectron2.readthedocs.io/en/latest/   
https://detectron2.readthedocs.io/en/latest/tutorials/getting_started.html   
https://github.com/facebookresearch/detectron2   
https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md
