## Visualization of CNN: Grad-CAM
* **Objective**: Convolutional neural networks (CNNs) are widely used in computer vision and are powerful for processing grid-like data. However, we hardly know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we introduce Grad-CAM, which produces a heatmap over the input image that highlights the regions most important for a given answer in a visual question answering (VQA) task.

* **To be submitted**: this notebook in two weeks, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!

* NB: if `PIL` is not installed, try `conda install pillow`.


In [None]:
import torch
import torch.nn.functional as F
import numpy as np

import torchvision.transforms as transforms
from PIL import Image

import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')
import cv2

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Visual Question Answering problem
Given an image and a question in natural language, the model chooses the most likely answer from 3,000 classes according to the content of the image. The VQA task is therefore a multi-class classification problem.
<img src="vqa_model.PNG">

We provide you a pretrained model `vqa_resnet` for VQA tasks.

In [None]:
# load model
from load_model import load_model
vqa_resnet = load_model()



In [None]:
print(vqa_resnet) # for more information 

In [None]:
checkpoint = '2017-08-04_00.55.19.pth'
saved_state = torch.load(checkpoint, map_location=device)
# reading vocabulary from saved model
vocab = saved_state['vocab']

# reading word tokens from saved model
token_to_index = vocab['question']

# reading answers from saved model
answer_to_index = vocab['answer']

num_tokens = len(token_to_index) + 1

# reading answer classes from the vocabulary
answer_words = ['unk'] * len(answer_to_index)
for w, idx in answer_to_index.items():
    answer_words[idx]=w

### Inputs
To use the pretrained model, the input image should be resized to `(448, 448)` and normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`. You can call the function `image_to_features` for image preprocessing. For the input question, the function `encode_question` is provided to encode the question into a vector of indices. You can also use the `preprocess` function to preprocess both the image and the question.

In [None]:
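def get_transform(target_size, central_fraction=1.0):
    """ Resize, center-crop, convert to tensor and normalize with ImageNet statistics """
    return transforms.Compose([
        # transforms.Scale was removed from torchvision; Resize is its replacement
        transforms.Resize(int(target_size / central_fraction)),
        transforms.CenterCrop(target_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])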

In [None]:
def encode_question(question):
    """ Turn a question into a vector of indices and a question length """
    question_arr = question.lower().split()
    vec = torch.zeros(len(question_arr), device=device).long()
    for i, token in enumerate(question_arr):
        index = token_to_index.get(token, 0)
        vec[i] = index
    return vec, torch.tensor(len(question_arr), device=device)
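
To see what this encoding produces, here is a minimal sketch that mirrors the logic of `encode_question` with plain Python lists. The vocabulary `toy_token_to_index` and the helper `encode_question_toy` are hypothetical and only serve the illustration; the real vocabulary is the one read from the checkpoint.

```python
# Hypothetical toy vocabulary (assumption); the real one comes from the checkpoint.
toy_token_to_index = {"what": 1, "animal": 2, "color": 3}

def encode_question_toy(question, vocab):
    """Map each lowercased, whitespace-split token to its index (0 if unknown)."""
    tokens = question.lower().split()
    indices = [vocab.get(token, 0) for token in tokens]
    return indices, len(tokens)

indices, length = encode_question_toy("What animal", toy_token_to_index)
print(indices, length)

# Punctuation is not stripped, so "animal?" is a different token from "animal"
unknown, _ = encode_question_toy("What animal?", toy_token_to_index)
print(unknown)
```

Since the question is split on whitespace only, a trailing question mark turns a known word into an unknown token (index 0), which is presumably why the questions in this notebook are written without punctuation.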

In [None]:
# preprocess takes the path of an image and the associated question.
# It returns the inputs in the specific form expected by the VQA model.
def preprocess(dir_path, question):
    q, q_len = encode_question(question)
    img = Image.open(dir_path).convert('RGB')
    image_size = 448  # scale image to given size and center
    central_fraction = 1.0
    transform = get_transform(image_size, central_fraction=central_fraction)
    img_transformed = transform(img)
    img_features = img_transformed.unsqueeze(0).to(device)
    
    inputs = (img_features, q.unsqueeze(0), q_len.unsqueeze(0))
    return inputs

We provide two pictures, each with a question and candidate answers.

In [None]:
Question1 = 'What animal'
Answer1 = ['dog','cat' ]
indices1 = [answer_to_index[ans] for ans in Answer1]# The indices of category 
img1 = Image.open('dog_cat.png')
img1

In [None]:
dir_path = 'dog_cat.png' 
inputs = preprocess(dir_path, Question1)
ans = vqa_resnet(*inputs) # use model to predict the answer
answer_idx = np.argmax(F.softmax(ans, dim=1).detach().cpu().numpy())  # .cpu() keeps this working on GPU
print(answer_words[answer_idx])

In [None]:
Question2 = 'What color'
Answer2 = ['green','yellow' ]
indices2 = [answer_to_index[ans] for ans in Answer2]
img2 = Image.open('hydrant.png')
img2

In [None]:
dir_path = 'hydrant.png' 
inputs = preprocess(dir_path, Question2)
ans = vqa_resnet(*inputs) # use model to predict the answer
answer_idx = np.argmax(F.softmax(ans, dim=1).detach().cpu().numpy())  # .cpu() keeps this working on GPU
print(answer_words[answer_idx])
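
The prediction step turns raw class scores into probabilities with a softmax and keeps the most likely class. As a minimal sketch of that step, the snippet below ranks the top answers from a vector of scores using NumPy only. The helper `top_k_answers` and the toy scores and words are assumptions made for illustration, not outputs of `vqa_resnet`.

```python
import numpy as np

def top_k_answers(scores, answer_words, k=3):
    """Return the k most probable (answer, probability) pairs from raw scores."""
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max()            # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(probs)[::-1][:k]        # indices sorted by decreasing probability
    return [(answer_words[i], float(probs[i])) for i in order]

# Toy data (assumption): four answer classes and made-up raw scores
toy_words = ["dog", "cat", "green", "yellow"]
toy_scores = [2.0, 1.0, -1.0, 0.5]

ranking = top_k_answers(toy_scores, toy_words, k=2)
print(ranking)
```

Because the softmax is monotonic, taking the argmax of the raw scores gives the same class as taking the argmax of the probabilities; the softmax only matters when the probabilities themselves are of interest.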

### Grad-CAM 
* **Overview:** Given an image with a question, and a category ('dog') as input, we forward-propagate the image through the model to obtain the `raw class scores` before the softmax. The gradient of the output is set to zero for all classes except the desired class (dog), for which it is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, where we can compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to achieve the visualization of the two images. For each image, consider the answers we provided as the desired classes. Compare the heatmaps of different answers, and conclude. 


* **Hints**: 
 + We need to record the output and grad_output of the feature maps to compute Grad-CAM. In PyTorch, hooks are provided for this purpose. Read the tutorial on [hooks](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) carefully. 
 + The pretrained model `vqa_resnet` has no activation function after its last layer, so its output is the `raw class scores` and you can use it directly. Run `print(vqa_resnet)` to get more information on the model.
 + The last CNN layer of the model is: `vqa_resnet.resnet_layer4.r_model.layer4[2].conv3` 
 + The feature maps are 14x14, and so is your heatmap. You need to project the heatmap onto the original image (224x224) to see it properly. The function `cv2.resize()` may help.  
 + Here is the link of the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
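
In the paper's notation, with $A^k$ the $k$-th feature map and $y^c$ the raw score of class $c$, the channel weights are $\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A^k_{ij}}$ and the heatmap is $L^c = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$. The following is a minimal sketch of how hooks can record the two quantities these formulas need. It uses a hypothetical `TinyNet` with random weights rather than `vqa_resnet`, and it assumes PyTorch 1.8 or later for `register_full_backward_hook`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Hypothetical stand-in for the real model: one conv layer and a linear head."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        feats = torch.relu(self.conv(x))
        return self.fc(self.pool(feats).flatten(1))

model = TinyNet().eval()
stored = {}

def save_activation(module, inputs, output):
    # Forward hook: called with the layer output after each forward pass
    stored["activations"] = output.detach()

def save_gradient(module, grad_input, grad_output):
    # Backward hook: grad_output[0] is the gradient w.r.t. the layer output
    stored["gradients"] = grad_output[0].detach()

handles = [
    model.conv.register_forward_hook(save_activation),
    model.conv.register_full_backward_hook(save_gradient),
]

image = torch.randn(1, 3, 14, 14, requires_grad=True)
scores = model(image)            # raw class scores, shape (1, 5)
model.zero_grad()
scores[0, 2].backward()          # backpropagate from class 2 only

# alpha: average the gradients over the spatial dimensions, one weight per channel
weights = stored["gradients"].mean(dim=(2, 3), keepdim=True)      # (1, 8, 1, 1)
# weighted sum over channels followed by ReLU
cam = torch.relu((weights * stored["activations"]).sum(dim=1))    # (1, 14, 14)

for handle in handles:
    handle.remove()              # avoid stacking hooks on repeated runs
print(cam.shape)
```

Keeping the handles returned by the `register_*` calls makes it possible to remove the hooks afterwards, so the same model instance can be reused for several classes without the hooks piling up.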

# Answer

In the next cell, we define the inputs used for Grad-CAM.

In [None]:
Question2 = 'What color'
Answer2 = ['green','yellow' ]
dir_path2 = 'hydrant.png' 
img2 = Image.open(dir_path2)

Question1 = 'What animal'
Answer1 = ['dog','cat' ]
dir_path1 = 'dog_cat.png' 
img1 = Image.open(dir_path1)

Questions = [Question1,Question2]
Answers = [Answer1,Answer2]
Dirs = [dir_path1,dir_path2]
Imgs = [img1,img2]

Next, we define the helper functions: the two hooks, the Grad-CAM computation, and the overlay of the heatmap on the image.

In [None]:
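# Both hooks write into the global `dict_`, which is created before each forward pass.
def forward_pass(self, input, output):
    # output has shape (1, C, H, W); keep the feature maps of the single sample
    dict_['forward_activations'] = output[0]
def backword_pass(self, grad_input, grad_output):
    # grad_output is a tuple; its first element is the gradient w.r.t. the layer output
    dict_['backward_activations']= grad_output[0]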

def grad_cam (activations,gradients) :
    """
    Function that computes final 14x14 heatmap. 
    """
    ## calculate importance 
    importance = torch.mean(gradients.view(gradients.shape[0],-1),dim=1)

    heatmap = torch.sum(importance[:,None,None] * activations,dim=0)
    relu = torch.nn.ReLU()
    grad_cam_output = relu(heatmap)
    return grad_cam_output

def heatmap_on_image(heatmap , img , alpha = 0.5):
    """
    Create the array for the original image plus the heatmap.
    Alpha is the weight of the heatmap in the final image.
    Expects img as an RGB uint8 array with the same height and width as heatmap.
    """ 
    min_ = np.min(heatmap)
    max_ = np.max(heatmap)
    heatmap = (heatmap - min_) / (max_ - min_ + 1e-8) ## min-max scale to [0, 1]
    heatmap = cv2.applyColorMap(np.uint8(255 * heatmap), cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB) ## OpenCV returns BGR, img is RGB
    final = cv2.addWeighted(heatmap, alpha, img, 1-alpha, 0)
    return final 
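
To make the overlay step concrete, here is a minimal NumPy-only sketch of the two operations involved: min-max scaling of the heatmap, which divides by the range (max minus min) so that values land in [0, 1], and the weighted sum that `cv2.addWeighted` computes. The helpers `normalize_heatmap` and `blend` are hypothetical names, and a grayscale copy stands in for the JET color map.

```python
import numpy as np

def normalize_heatmap(heatmap):
    """Min-max scale to [0, 1]; a constant heatmap maps to all zeros."""
    heatmap = np.asarray(heatmap, dtype=np.float32)
    span = heatmap.max() - heatmap.min()
    if span == 0:
        return np.zeros_like(heatmap)
    return (heatmap - heatmap.min()) / span

def blend(colored_heatmap, image, alpha=0.5):
    """Weighted sum alpha * heatmap + (1 - alpha) * image, clipped to uint8."""
    mixed = (alpha * colored_heatmap.astype(np.float32)
             + (1 - alpha) * image.astype(np.float32))
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

raw = np.array([[0.0, 2.0], [4.0, 8.0]])
scaled = normalize_heatmap(raw)
gray = np.uint8(255 * scaled)
colored = np.stack([gray] * 3, axis=-1)            # grayscale stand-in for a color map
image = np.full((2, 2, 3), 100, dtype=np.uint8)    # toy uniform image
overlay = blend(colored, image)
print(scaled)
print(overlay[..., 0])
```

With `alpha = 0.5`, the heatmap and the image contribute equally; a smaller `alpha` keeps more of the original image visible.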

Now, for every input picture and candidate answer, we output three images: the input image, the raw heatmap, and the heatmap overlaid on the image.

In [None]:
for question,answer,dir_path,img in zip(Questions,Answers,Dirs,Imgs) :
    ## get vqa inputs and answer indexes
    inputs = preprocess(dir_path, question)
    indexes = [answer_to_index[ans] for ans in answer]
    img_array = np.array(img.convert('RGB'))  ## force 3 channels so the overlay shapes match
    for k,index_answer in enumerate(indexes) :
      
      vqa_resnet = load_model()
      dict_ = {} ## dict to save activations and gradients
      
      ## register hooks 
      vqa_resnet.resnet_layer4.r_model.layer4[2].conv3.register_forward_hook(forward_pass)
      vqa_resnet.resnet_layer4.r_model.layer4[2].conv3.register_backward_hook(backword_pass)

      output = vqa_resnet(*inputs)
      
      ## backpropagate through only the wanted class
      output[:,index_answer].backward()

      ## compute heatmap
      heatmap = grad_cam(activations = dict_['forward_activations'],gradients = dict_['backward_activations'].squeeze(0))
      ## resize
      ## cv2.resize expects dsize as (width, height)
      heatmap_resized = cv2.resize(heatmap.detach().cpu().numpy(),(img_array.shape[1],img_array.shape[0]))

      ## Put gradients to zero
      vqa_resnet.zero_grad()
      ## create heatmap on original image 
      final_image = heatmap_on_image(heatmap_resized,img_array)
      print('The class is %s'%(answer[k]))
      fig,ax = plt.subplots(1,3 , figsize=(12,6))
      ax[0].axis('off')

      ax[0].imshow(img)
      ax[0].set_title('Input image')
      

      ax[1].imshow(heatmap_resized)
      ax[1].set_title('Raw heatmap for the class %s'%(answer[k]))

      ax[2].imshow(final_image)
      ax[2].set_title('Image plus heatmap for the class %s'%(answer[k]))

      plt.show()


With Grad-CAM, we can identify which parts of the image influence the model's score for a given class, by backpropagating from that class only. For the first input, the model uses the correct regions to identify the cat and the dog, with no overlap between the two. However, the animals' bodies (everything but the heads) are not activated, even though they could provide additional useful information for identifying both classes.

For the second image, the model makes a good distinction and relies on the correct region to identify the class green. However, when we backpropagate through the class yellow, the model also "looks" at the pixels corresponding to green. This may explain why the VQA model outputs green in the first place: it seems to struggle with yellow when it appears alongside other colors in an image. One possible remedy is to crop out the yellow object and isolate it from the others.

Hence, Grad-CAM can be an excellent tool for obtaining information about misclassified images. This visualization can guide further image processing to help the model better recognize the different objects.