# Student: DERVISHI Megi

## Visualization of CNN: Grad-CAM
* **Objective**: Convolutional Neural Networks (CNNs) are widely used in computer vision and are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we use Grad-CAM, which highlights the regions of the input image that were important for the neural network's prediction.

* **To be submitted within 2 weeks**: this notebook, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!

* NB: if `PIL` is not installed, try `conda install pillow`.


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms
import matplotlib.pyplot as plt
import pickle
import urllib.request

import numpy as np
from PIL import Image

%matplotlib inline

### Download the Model
We provide a pretrained `ResNet-34` model for the `ImageNet` classification dataset.
* **ImageNet**: A large dataset of photographs with 1 000 classes.
* **ResNet-34**: A deep architecture for image classification.

In [None]:
resnet34 = models.resnet34(pretrained=True)
resnet34.eval() # set the model to evaluation mode

![ResNet34](https://miro.medium.com/max/1050/1*Y-u7dH4WC-dXyn9jOG4w0w.png)

In [None]:
classes = pickle.load(urllib.request.urlopen('https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl') )

### Input Images
We provide 20 images from ImageNet (download link on the course webpage, or download them directly with the command line below).<br>
To use the pretrained resnet34 model, the input image should be resized to `(224, 224)` and normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`.
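
As a minimal sketch of what this normalization does, and of how it can be undone to display an image, the snippet below applies the per-channel formula by hand with NumPy. The helper names `normalize` and `denormalize` are illustrative (they are not torchvision functions), and the random array only stands in for a real image.

```python
import numpy as np

# ImageNet channel statistics quoted above, shaped (3, 1, 1) to broadcast over (3, H, W).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32).reshape(3, 1, 1)

def normalize(img_chw):
    """Per-channel (x - mean) / std for a (3, H, W) array with values in [0, 1]."""
    return (img_chw - MEAN) / STD

def denormalize(img_chw):
    """Invert normalize() and return an (H, W, 3) array in [0, 1], ready for plt.imshow."""
    img = img_chw * STD + MEAN
    return np.clip(img, 0.0, 1.0).transpose(1, 2, 0)

rng = np.random.default_rng(0)
fake_image = rng.random((3, 224, 224), dtype=np.float32)  # stand-in for a ToTensor() output
normalized = normalize(fake_image)
restored = denormalize(normalized)
```

With a tensor from the dataset, one would presumably pass `dataset[index][0].numpy()` to `denormalize` to recover the resized image before normalization.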

In [None]:
def preprocess_image(dir_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    dataset = datasets.ImageFolder(dir_path, transforms.Compose([
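            transforms.Resize(256), # resize the shorter side to 256 pixels
            transforms.CenterCrop(224), # crop the central 224x224 region
            transforms.ToTensor(), # convert the PIL image to a tensor with values in [0, 1]
            normalize])) # normalize each channel with the ImageNet mean and std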

    return (dataset)

In [None]:
# The images should be in a *sub*-folder of "data/" (ex: data/TP2_images/images.jpg) and *not* directly in "data/"!
# otherwise the function won't find them

import os
os.makedirs("data/TP2_images", exist_ok=True) # creates both folders; no error if they already exist
!cd data/TP2_images && wget "https://www.lri.fr/~gcharpia/deeppractice/2022/TP2/TP2_images.zip" && unzip TP2_images.zip
dir_path = "data/" 
dataset = preprocess_image(dir_path)

In [None]:
# show the original image
index = 5
input_image = Image.open(dataset.imgs[index][0]).convert('RGB')
plt.imshow(input_image)

In [None]:
output = resnet34(dataset[index][0].view(1, 3, 224, 224))
values, indices = torch.topk(output, 3)
print("Top 3-classes:", indices[0].numpy(), [classes[x] for x in indices[0].numpy()])
print("Raw class scores:", values[0].detach().numpy())

### Grad-CAM 
* **Overview:** Given an image and a category (‘tiger cat’) as input, we forward-propagate the image through the model to obtain the `raw class scores` before softmax. The gradient of the output is set to zero for all classes except the desired class (tiger cat), for which it is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, from which we compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the given images. For each image, choose the top-3 possible labels as the desired classes. Compare the heatmaps of the three classes, and conclude. 


* **Hints**: 
 + We need to record the output and grad_output of the feature maps to compute Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the tutorial on [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) carefully.
 + The pretrained model resnet34 doesn't have an activation function after its last layer, so its output is directly the `raw class scores`; you can use them as they are.
 + The size of the feature maps is 7x7, so your heatmap will have the same size. You need to upsample the heatmap to the size of the resized image (224x224, i.e. the resized image before normalization, not the original one) to inspect it properly. The function [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=interpolate#torch.nn.functional.interpolate) may help.
 + Here is the link to the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)

![Grad-CAM](https://da2so.github.io/assets/post_img/2020-08-10-GradCAM/2.png)
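
The steps in the overview and hints can be sketched on a toy network, without downloading any weights. Everything below is an illustrative assumption: `TinyCNN`, the `store` dictionary and the 16x16 random input are hypothetical stand-ins, and `register_full_backward_hook` requires a reasonably recent PyTorch version. It is a minimal sketch of the technique, not the solution expected for the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyCNN(nn.Module):
    """Hypothetical toy classifier: two conv layers, global average pooling, linear head."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(4, 8, kernel_size=3, padding=1)
        self.fc = nn.Linear(8, n_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))              # (N, 8, H, W)
        return self.fc(x.mean(dim=(2, 3)))     # raw class scores, no softmax

model = TinyCNN().eval()
store = {}

def forward_hook(module, inputs, output):
    store["activation"] = output               # feature maps of the hooked layer

def backward_hook(module, grad_input, grad_output):
    store["gradient"] = grad_output[0]         # d(score) / d(feature maps)

h_fwd = model.conv2.register_forward_hook(forward_hook)
h_bwd = model.conv2.register_full_backward_hook(backward_hook)

image = torch.rand(1, 3, 16, 16)
scores = model(image)
target = scores.argmax(dim=1).item()           # explain the top-1 class
model.zero_grad()
scores[0, target].backward()                   # one-hot gradient on the chosen class

# Channel weights: spatial average of the gradients, shape (1, 8, 1, 1).
weights = store["gradient"].mean(dim=(2, 3), keepdim=True)
# Weighted sum of the feature maps followed by ReLU, shape (1, 1, 16, 16).
cam = F.relu((weights * store["activation"]).sum(dim=1, keepdim=True))
# Upsample to a larger grid and rescale to [0, 1] (epsilon guards against a constant map).
cam = F.interpolate(cam, size=(32, 32), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
cam = cam.detach().squeeze()

h_fwd.remove()
h_bwd.remove()
```

Calling `backward()` on a single class score is equivalent to backpropagating a gradient that is 1 for that class and 0 for all others, which is the signal described in the overview.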

In [None]:
class GradCam():
    def __init__(self):
        self.model = resnet34
        self.gradient = None
        self.activation = None
        self.hooks = []

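        # The hooks store the feature maps (forward pass) and their gradients (backward pass).
        def backward_hook(module, grad_inp, grad_out):
            self.gradient = grad_out[0] # gradient of the class score w.r.t. the layer output
            return None
        def forward_hook(module, input, output):
            self.activation = output # feature maps of the layer, shape (1, 512, 7, 7)
            return None

        # Hook the last convolution of the last residual block.
        h = self.model.layer4[2].conv2.register_forward_hook(forward_hook)
        self.hooks.append(h)
        # register_backward_hook is deprecated in recent PyTorch versions; use the full variant.
        h = self.model.layer4[2].conv2.register_full_backward_hook(backward_hook)
        self.hooks.append(h)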

    def run(self, index_img, i):
        img = dataset[index_img][0].view(1, 3, 224, 224)

        output = self.model(img)
        values, indices = torch.topk(output, 3)
        class_index = indices[0].numpy()[i] #class id
        class_score = values[0][i] #score 

        self.model.zero_grad()
        class_score.backward()
          
        # channel weights: average of the gradients over the spatial dimensions
        alpha_ck = torch.mean(self.gradient, dim=[2,3])
        alpha_ck.unsqueeze_(-1).unsqueeze_(-1) #reshape as (1,512,1,1)

        # weighted sum of the feature maps, followed by ReLU
        lc_gradcam = F.relu((alpha_ck*self.activation).sum(1, keepdim=True))
        upsample_lc = F.interpolate(lc_gradcam, size=(224,224), mode="bilinear", align_corners=False)
        # min-max scaling to [0, 1]; the small constant avoids 0/0 when the map is constant
        normalized_lc = (upsample_lc - upsample_lc.min())/(upsample_lc.max() - upsample_lc.min() + 1e-8)
        self.gradient, self.activation = None, None
        return normalized_lc.squeeze(), class_index

    def remove_hooks(self):
        for h in self.hooks: #remove hooks
            h.remove()



In [None]:
#visualize
import cv2
from textwrap import wrap
gradcam = GradCam()

nb_images = 20
for index_img in range(nb_images):
  fig, axs = plt.subplots(1,4, figsize=(20,10))
  
  img = dataset[index_img][0].permute(1,2,0).detach().numpy() # normalized input image, shape (H, W, C)
  img = (img - img.min())/(img.max()-img.min()) # rescale to [0, 1] for display

  axs[0].imshow(img)
  axs[0].set_title(f"Original image i={index_img}")
  for i in range(3):
    heatmap, label = gradcam.run(index_img, i)
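    heatmap = heatmap.detach().numpy()
    heatmap = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB) # OpenCV returns BGR, matplotlib expects RGB

    img_heatmap = heatmap/255 *0.4  + img # overlay the heatmap on the image
    img_heatmap /= img_heatmap.max()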

    axs[i+1].set_title("\n".join(wrap(f'Label: {classes[label]}', 60)))
    axs[i+1].imshow(img_heatmap, interpolation='nearest')
  fig.tight_layout()
  plt.show()
    
gradcam.remove_hooks()

# Comments

In general, all three heatmaps highlight similar areas when the three most likely labels are very similar classes.

For example, "African elephant", "Indian elephant" and "tusker" are very similar labels, and their heatmaps are nearly identical.
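
One possible way to make "nearly identical" quantitative is to compare two heatmaps with a cosine similarity. The sketch below is only an illustration: `heatmap_cosine_similarity` and `blob` are hypothetical helpers, and the Gaussian blobs are synthetic stand-ins, not heatmaps produced by this notebook.

```python
import numpy as np

def heatmap_cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two heatmaps of the same shape (1 means same direction)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Synthetic 224x224 heatmaps: Gaussian blobs centred at (cy, cx).
ys, xs = np.mgrid[0:224, 0:224]

def blob(cy, cx, sigma=30.0):
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

same_region = heatmap_cosine_similarity(blob(112, 112), blob(115, 110))  # overlapping blobs
far_apart = heatmap_cosine_similarity(blob(40, 40), blob(180, 180))      # disjoint blobs
```

Applied to the heatmaps returned by `gradcam.run` (after `.detach().numpy()`), a value close to 1 would support the claim that two labels rely on the same image region.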

However, in some cases the model attends to different parts of the image for different classes and hence produces very different heatmaps. For example, on the image with index 15, the first label (the correct one) is "sea lion": the heatmap covers the whole body of the animal and the prediction is correct. In the second run, however, the model seems to attend only to part of the animal's body, excluding the head, which leads it to predict "balance beam" or "cowboy boot".

In general, the model focuses on the most distinctive features of the animal in order to classify it. For example, for image i=19 the model focuses on the thorns.
