
<font color="blue"> <strong>Students:</strong> Biel CASTAÑO and David FAGET



## Visualization of CNN: Grad-CAM
* **Objective**: Convolutional Neural Networks are widely used in computer vision and are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we use Grad-CAM, which highlights the regions of the input image that were important for the network's prediction.


* NB: if `PIL` is not installed, try `conda install pillow`.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, datasets, transforms
import matplotlib.pyplot as plt
import pickle
import urllib.request

import numpy as np
from PIL import Image

import cv2

%matplotlib inline

### Download the Model
We provide a pretrained `ResNet-34` model for the `ImageNet` classification dataset.
* **ImageNet**: A large dataset of photographs with 1 000 classes.
* **ResNet-34**: A deep architecture for image classification.

In [None]:
resnet34 = models.resnet34(weights='ResNet34_Weights.IMAGENET1K_V1')  # New PyTorch interface for loading weights!
resnet34.eval() # set the model to evaluation mode

![ResNet34](https://miro.medium.com/max/1050/1*Y-u7dH4WC-dXyn9jOG4w0w.png)


Input image must be of size (3x224x224).

First convolution layer with maxpool.
Then 4 ResNet blocks.

Output of the last ResNet block is of size (512x7x7).

Average pooling is applied to this layer to obtain a 1D array of 512 features, which is fed to a linear layer that outputs 1000 values (one per class). There is no softmax in this model: the outputs are already the raw class scores.
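
The shape bookkeeping of this classification head can be sketched with NumPy alone. This is only an illustration: the feature maps and the linear weights below are random stand-ins (the names `feature_maps`, `W` and `b` are hypothetical), not the trained ResNet-34 parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the output of the last ResNet block: 512 channels on a 7x7 grid.
feature_maps = rng.standard_normal((512, 7, 7))

# Global average pooling: one number per channel.
pooled = feature_maps.mean(axis=(1, 2))            # shape (512,)

# Hypothetical linear classifier (random weights, not the trained ones).
W = rng.standard_normal((1000, 512)) * 0.01
b = np.zeros(1000)
raw_scores = W @ pooled + b                        # shape (1000,)

# The model stops at the raw scores; a softmax is only needed
# if probabilities are wanted.
shifted = raw_scores - raw_scores.max()            # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()

print(pooled.shape, raw_scores.shape)
```

Since softmax is monotonic, the ranking of the classes is the same whether one looks at the raw scores or at the probabilities.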

In [None]:
classes = pickle.load(urllib.request.urlopen('https://gist.githubusercontent.com/yrevar/6135f1bd8dcf2e0cc683/raw/d133d61a09d7e5a3b36b8c111a8dd5c4b5d560ee/imagenet1000_clsid_to_human.pkl'))

# classes is a dictionary mapping each class index to its human-readable name
print(classes)

### Input Images
We provide 20 images from ImageNet (use the download link on the course webpage, or download them directly with the command line given below).<br>
In order to use the pretrained model resnet34, the input image should be normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`, and resized to `(224, 224)`.
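
As a minimal sketch of what this normalization does per channel, and of how to undo it for display, assuming an image stored as a float array of shape `(3, H, W)` with values in `[0, 1]` (the helper names `normalize` and `denormalize` are hypothetical):

```python
import numpy as np

# Per-channel statistics quoted above, reshaped to broadcast over (3, H, W).
MEAN = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
STD = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize(img):
    """img: float array of shape (3, H, W) with values in [0, 1]."""
    return (img - MEAN) / STD

def denormalize(x):
    """Inverse of normalize: recovers values in [0, 1], e.g. for plotting."""
    return x * STD + MEAN

rng = np.random.default_rng(0)
img = rng.random((3, 224, 224))      # random stand-in for a real image
x = normalize(img)
restored = denormalize(x)
```

The inverse is `x * std + mean`, which is useful when the tensor fed to the network has to be shown again as an image.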

In [None]:
def preprocess_image(dir_path):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    # Note: to invert the normalisation (e.g. for display), compute x * std + mean per channel

    dataset = datasets.ImageFolder(dir_path, transforms.Compose([
            transforms.Resize(256),  # resize the shorter side to 256
            transforms.CenterCrop(224),  # crop the central 224x224 region
            transforms.ToTensor(),  # convert the PIL image to a tensor in [0, 1]
            normalize]))  # normalize each channel

    return dataset

In [None]:
import os
if not os.path.exists("data"):
    os.mkdir("data")
if not os.path.exists("data/TP2_images"):
    os.mkdir("data/TP2_images")
    !cd data/TP2_images && wget "https://www.lri.fr/~gcharpia/deeppractice/2023/TP2/TP2_images.zip" && unzip TP2_images.zip

dir_path = "data/"
dataset = preprocess_image(dir_path)

In [None]:
# show the original image
index = 5
input_image = Image.open(dataset.imgs[index][0]).convert('RGB')
plt.imshow(input_image)

In [None]:
output = resnet34(dataset[index][0].view(1, 3, 224, 224))
values, indices = torch.topk(output, 3)
print("Top 3-classes:", indices[0].numpy(), [classes[x] for x in indices[0].numpy()])
print("Raw class scores:", values[0].detach().numpy())

### Grad-CAM
* **Overview:** Given an image, and a category (‘tiger cat’) as input, we forward-propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (tiger cat), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, where we can compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the given images. For each image, choose the top-3 possible labels as the desired classes. Compare the heatmaps of the three classes, and conclude.


* **To be submitted within 2 weeks**: this notebook, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!


* **Hints**:
 + We need to record the output and grad_output of the feature maps to achieve Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the [hook](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) tutorial carefully.
 + More on [autograd](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html) and [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks)
 + The pretrained model resnet34 doesn't have an activation function after its last layer, so the output is indeed the `raw class scores`; you can use them directly.
 + The size of the feature maps is 7x7, so your heatmap will have the same size. To visualize it properly, project the heatmap onto the resized image (224x224, before normalization, not the original one). The function [`torch.nn.functional.interpolate`](https://pytorch.org/docs/stable/nn.functional.html?highlight=interpolate#torch.nn.functional.interpolate) may help.
 + Here is the link of the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
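
The overview and hints above can be sketched end to end on a toy network. This is a minimal illustration under stated assumptions, not the expected solution: `TinyNet` is a hypothetical small CNN used only to keep the example fast, and the gradient is captured with a tensor hook (`Tensor.register_hook`) set inside a module forward hook, which is one possible way to record it. Backpropagating the single score `scores[0, label]` is equivalent to backpropagating a one-hot signal over the class scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Hypothetical toy CNN: conv features -> average pooling -> linear scores."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.fc(self.pool(x).flatten(1))

model = TinyNet().eval()
store = {}

def forward_hook(module, inputs, output):
    # Record the feature maps, and ask autograd to hand us their gradient later.
    store["features"] = output.detach()
    output.register_hook(lambda grad: store.__setitem__("gradients", grad.detach()))

handle = model.features.register_forward_hook(forward_hook)

image = torch.randn(1, 3, 32, 32)            # random stand-in for a preprocessed image
scores = model(image)                        # raw class scores, no softmax
label = scores.argmax(dim=1).item()

model.zero_grad()
scores[0, label].backward()                  # same as a one-hot backward signal
handle.remove()

# alpha_k: spatial average of the gradient of the class score w.r.t. feature map k
weights = store["gradients"].mean(dim=(2, 3), keepdim=True)            # (1, 16, 1, 1)
# Grad-CAM map: ReLU of the alpha-weighted sum of the feature maps
cam = F.relu((weights * store["features"]).sum(dim=1, keepdim=True))   # (1, 1, 8, 8)
cam = cam / (cam.max() + 1e-8)               # scale to [0, 1], guarding against an all-zero map
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
heatmap = cam[0, 0].numpy()                  # (32, 32), ready to overlay on the image
```

With ResNet-34 the same steps apply, with the hook placed on the chosen layer and the 7x7 map upsampled to 224x224.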

Class: ‘pug, pug-dog’ | Class: ‘tabby, tabby cat’
- | -
![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/dog.jpg)| ![alt](https://raw.githubusercontent.com/jacobgil/pytorch-grad-cam/master/examples/cat.jpg)

In [None]:
# First, let's declare a class to handle the Grad-CAM algorithm

class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.gradients = None
        self.features = None

        # Register hooks
        self.hook_handlers = [
          target_layer.register_forward_hook(self.save_features),
          target_layer.register_backward_hook(self.save_gradients),
        ]

    def save_features(self, module, input, output):
        self.features = output.detach()

    def save_gradients(self, module, grad_input, grad_output):
        self.gradients = grad_output[0]

    def __call__(self, image, label):
        # Forward pass
        output = self.model(image)

        # Set output for backpropagation
        one_hot_output = F.one_hot(torch.tensor([label]), num_classes=output.size(-1))
        one_hot_output = one_hot_output.to(dtype=torch.float).requires_grad_(True)
        one_hot_output = torch.sum(one_hot_output * output)

        # Backward pass
        self.model.zero_grad()
        one_hot_output.backward()

        # Gradients and features to numpy
        gradients = self.gradients.numpy().squeeze(0)
        features = self.features.numpy().squeeze(0)

        # Compute filter weights
        filter_weights = np.mean(gradients, axis=(1, 2))[:, np.newaxis, np.newaxis]

        # Compute weighted feature map
        heatmap = np.sum(filter_weights * features, axis=0)

        # ReLU on top of the heatmap
        heatmap = np.maximum(heatmap, 0)

        # Normalize the heatmap (the epsilon avoids a division by zero when the map is all zeros)
        heatmap /= (np.max(heatmap) + 1e-8)

        # Resize heatmap
        heatmap = torch.from_numpy(heatmap).unsqueeze(0).unsqueeze(0)  # Add batch and channel dimension
        heatmap = F.interpolate(heatmap, size=image.size()[2:], mode='bilinear')
        heatmap = heatmap.numpy().squeeze(0).squeeze(0)  # remove batch and channel dimension

        self.remove_hooks() # Need to free memory

        return heatmap

    def remove_hooks(self):
        for handler in self.hook_handlers:
            handler.remove()

In [None]:
# Let's now define a function to plot the grid of heatmaps for the selected layer
# containing the 3 top classes for each image on the dataset

def plot_grad_cam(dataset, model, selected_layer, classes):
    for i in range(len(dataset)):
        # Load input
        img_path, _ = dataset.imgs[i]
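        img = Image.open(img_path).convert('RGB')
        # Apply the same resize and crop as the dataset so the overlay lines up with the network input
        img = transforms.CenterCrop(224)(transforms.Resize(256)(img))
        img = np.float32(np.array(img)) / 255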
        input = dataset[i][0].unsqueeze(0)  # Add batch dimension

        # Set model to evaluation mode
        model.eval()

        # Get the 3 top classes via forward pass
        output = model(input)
        values, indices = torch.topk(output, 3)

        # Define image grid
        f, ax = plt.subplots(1, 4, figsize=(20, 5))

        # Plot input image
        ax[0].imshow(img)
        ax[0].set_title(f'Test image nº {i+1}')
        ax[0].axis('off')

        for j in range(1, 4):
            label = indices[0][j-1].item()

            # Compute heatmap
            grad_cam = GradCAM(model, selected_layer)
            heatmap = grad_cam(input, label)

            # Apply colormap to heatmap (OpenCV returns BGR, so convert to RGB to match img)
            heat = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
            heat = np.float32(cv2.cvtColor(heat, cv2.COLOR_BGR2RGB)) / 255

            # Merge heatmap with original image (both are RGB floats in [0, 1])
            merged_image = cv2.addWeighted(heat, 0.5, img, 0.5, 0)
            merged_image = np.uint8(255 * merged_image)

            # Plot heatmap
            ax[j].imshow(merged_image)
            ax[j].axis('off')
            name = classes[label].split(',')[0] if classes else f'Class {label}'
            ax[j].set_title(name)

        plt.show()

In [None]:
# Apply Grad-CAM to get last layer's heatmaps (after bn2).

plot_grad_cam(dataset, resnet34, resnet34.layer4[2].bn2, classes)

<font color="blue"> Comments:

<font color="blue"> The heatmaps frequently converge on the same regions across the three labels, which is explained by the visual similarity of the three predicted animals. However, there are some important and interesting exceptions. We comment on the most curious ones:

 - <font color="blue"> For test image nº8, we observe that the heatmap corresponding to "Chesapeake Bay retriever" focuses much more on the dog that is behind.

 - <font color="blue"> For test image nº11, we observe that only the first prediction is correct. The last two are wrong. Indeed, the network correctly predicts a horse when it focuses on the entire body of the animal. When it focuses only on the neck, it gives incorrect predictions such as ox or basenji.

 - <font color="blue"> Test image nº13 is also curious. We observe that the network fails completely to detect the animal, which is explained by the fact that it focuses more on the borders of the image than on the animal.


 - <font color="blue"> For test image nº16, the network is only right for its first prediction (when it sees the whole animal). We observe that when it predicts a cowboy boot or a balance beam (unrelated objects), the network only sees a small portion of the body.


<font color="blue"> In summary, GradCAM offers valuable insights into the inner workings of the network and its perception during prediction. It aids in comprehending the variations in predictions and their association with particular regions identified by the network within the input image.

<font color="blue"> But, what would the outcome be if we applied Grad-CAM to layers other than the final one? We will answer this question in the next section.


### Complementary questions:

##### Try GradCAM on other convolutional layers, describe and comment on the results

<font color="blue"> We will write layer{i}[j] to refer to the BasicBlock nºj of the layer nºi (see the architecture above). We will always compute the heatmap after bn2 (batch normalization).
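
<font color="blue"> To keep the notation concrete, the following sketch lists, for each stage of torchvision's ResNet-34, the valid block indices and the approximate size of the feature maps (and hence of the raw Grad-CAM heatmap) for a 224x224 input. It is plain arithmetic under the assumption that each stride-2 operation halves the spatial size, which holds exactly when the input size is divisible by 32; the names `STAGES` and `feature_map_sizes` are hypothetical.

```python
# Stage layout of torchvision's ResNet-34: (name, number of BasicBlocks, channels, stride)
STAGES = [
    ("layer1", 3, 64, 1),
    ("layer2", 4, 128, 2),
    ("layer3", 6, 256, 2),
    ("layer4", 3, 512, 2),
]

def feature_map_sizes(input_size=224):
    """Return {stage: (n_blocks, channels, spatial_size)} for a square input."""
    size = input_size // 2   # first convolution, stride 2
    size = size // 2         # max pooling, stride 2
    sizes = {}
    for name, n_blocks, channels, stride in STAGES:
        size = size // stride
        sizes[name] = (n_blocks, channels, size)
    return sizes

sizes = feature_map_sizes(224)
for name, (n_blocks, channels, size) in sizes.items():
    print(f"{name}: blocks 0..{n_blocks - 1}, feature maps {channels} x {size} x {size}")
```

Earlier stages therefore give finer heatmaps (28x28 for layer2, 14x14 for layer3) than the 7x7 map of layer4, before upsampling to 224x224.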

In [None]:
# Apply Grad-CAM to get the layer{4}[1] heatmap (this is the second last BasicBlock; heatmaps for the last one were obtained above)

plot_grad_cam(dataset, resnet34, resnet34.layer4[1].bn2, classes)

In [None]:
# Layer{3}[5]

plot_grad_cam(dataset, resnet34, resnet34.layer3[5].bn2, classes)

In [None]:
# Layer{2}[3]

plot_grad_cam(dataset, resnet34, resnet34.layer2[3].bn2, classes)

<font color="blue"> Comments:

- <font color="blue"> For layer{4}[1], we observe that the heatmaps are close to what we obtained for the last layer, which is consistent with what we expected. However, we also observe some differences. For example, for test image nº8, the first prediction focuses more on the dog that is behind, while the last layer focused more on the dog at the front. Other examples, such as the horse (test image nº11), are also interesting. Here, we observe that when the model predicts an ox, layer{4}[1] focuses more on the tail than the last layer does.

- <font color="blue"> For layer{3}[5], we generally do not observe informative heatmaps. However, there are some exceptions. For example, we see that for test image nº10 all three heatmaps focus on the horns, and that for test image nº16 (sea lion), the last heatmap focuses on the head.


- <font color="blue"> In the case of layer{2}[3], it is the same: heatmaps are generally not informative. However, we observe some details. For example, in test image nº7 (fox), heatmaps focus on the eyes and nose, while in test image nº10, heatmaps focus on the feet.

<font color="blue"> In conclusion, as we move away from the final layer towards the initial layers, the heatmaps become less informative for a human observer. This is expected, since CNNs have a hierarchical structure, with early layers capturing low-level features across the image, such as edges or local patterns. Deeper in the network, layers begin to detect more complex patterns, including parts of objects like ears or eyes. Finally, the last layers of a CNN are highly specialized for the classification task, so the heatmaps from these layers highlight the regions of the image that are important for the classification decision. This explains why it is difficult for humans to interpret the first layers (for example, layer{2}[3], which offers limited information), whereas it is less challenging to interpret middle layers (like layer{3}[5], where we start to see some parts of the objects that were important for the decision, like the horns in test image nº10), and the final layers are the easiest to interpret (as seen in the layer 4 examples, where the parts of the image that were relevant for the classification align with human decisions).

##### What are the principal contributions of GradCAM (the answer is in the paper)?

<font color="blue"> We will <font color="green"> quote <font color="blue"> the authors' contributions verbatim, and then explain and summarise them:

<font color="green"> (1) We introduce Grad-CAM, a class-discriminative localization technique that generates visual explanations for any
CNN-based network without requiring architectural changes
or re-training. We evaluate Grad-CAM for localization (Sec. 4.1),
and faithfulness to model (Sec. 5.3), where it outperforms
baselines.

<font color="green"> (2) We apply Grad-CAM to existing top-performing classification, captioning (Sec. 8.1), and VQA (Sec. 8.2) models.
For image classification, our visualizations lend insight into
failures of current CNNs (Sec. 6.1), showing that seemingly
unreasonable predictions have reasonable explanations. For
captioning and VQA, our visualizations expose that common
CNN + LSTM models are often surprisingly good at localizing discriminative image regions despite not being trained
on grounded image-text pairs.

<font color="green"> (3) We show a proof-of-concept of how interpretable Grad-CAM visualizations help in diagnosing failure modes by
uncovering biases in datasets. This is important not just for
generalization, but also for fair and bias-free outcomes as
more and more decisions are made by algorithms in society.

<font color="green"> (4) We present Grad-CAM visualizations for ResNets [24]
applied to image classification and VQA (Sec. 8.2).

<font color="green"> (5) We use neuron importance from Grad-CAM and neuron
names from [4] and obtain textual explanations for model
decisions (Sec. 7).

<font color="green"> (6) We conduct human studies (Sec. 5) that show Guided
Grad-CAM explanations are class-discriminative and not
only help humans establish trust, but also help untrained users
successfully discern a ‘stronger’ network from a ‘weaker’
one, even when both make identical predictions.


<font color="blue">
In summary, Grad-CAM is a visualization technique that enhances the interpretability of CNN-based models without requiring any architectural modifications or retraining. It is effective in various domains, including image classification, captioning, and visual question answering (VQA), where it provides insights into model decisions. Its key contributions include the ability to localize relevant image regions, to diagnose dataset biases, and to serve as a diagnostic tool for improving model fairness and generalization. Additionally, Grad-CAM facilitates the generation of textual explanations for model decisions, and human studies have shown that it improves trust in AI systems and helps distinguish between models of varying reliability.