## Visualization of CNN: Grad-CAM
* **Objective**: Convolutional neural networks (CNNs) are widely used in computer vision and are powerful for processing grid-like data. However, we rarely know how and why they work, because they cannot be decomposed into individually intuitive components. In this assignment, we introduce Grad-CAM, which produces a heatmap over the input image that highlights the regions most important for a visual question answering (VQA) task.

* **To be submitted**: this notebook in two weeks, **cleaned** (i.e. without results, for file size reasons: `menu > kernel > restart and clean`), in a state ready to be executed (if one just presses 'Enter' till the end, one should obtain all the results for all images) with a few comments at the end. No additional report, just the notebook!

* NB: if `PIL` is not installed, try `conda install pillow`.


In [None]:
import torch
import torch.nn.functional as F
import numpy as np

import torchvision.transforms as transforms
from PIL import Image
import cv2

import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Visual Question Answering problem
Given an image and a question in natural language, the model chooses the most likely answer from 3,000 classes according to the content of the image. The VQA task is therefore a multi-class classification problem.
<img src="vqa_model.PNG">

We provide you a pretrained model `vqa_resnet` for VQA tasks.
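
Since the model outputs one raw score per answer class, the predicted answer is simply the class with the highest score. The sketch below illustrates this with a made-up five-word answer list and made-up scores (`toy_answers` and `toy_scores` are assumptions for illustration; the real model scores about 3,000 answers).

```python
import numpy as np

# Hypothetical answer vocabulary and raw scores (logits) for one image/question pair.
toy_answers = ["dog", "cat", "green", "yellow", "two"]
toy_scores = np.array([2.0, 1.0, -1.0, 0.0, 0.5])

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(toy_scores)
best = int(np.argmax(probs))          # same index as the argmax of the raw scores
top3 = np.argsort(probs)[::-1][:3]    # indices of the three most likely answers

print(toy_answers[best])
print([toy_answers[i] for i in top3])
```

Softmax does not change the ranking of the classes, so taking the argmax of the raw scores gives the same answer.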

In [None]:
# load model
from load_model import load_model
vqa_resnet = load_model()

In [None]:
print(vqa_resnet) # for more information 

In [None]:
checkpoint = '2017-08-04_00.55.19.pth'
saved_state = torch.load(checkpoint, map_location=device)
# reading vocabulary from saved model
vocab = saved_state['vocab']

# reading word tokens from saved model
token_to_index = vocab['question']

# reading answers from saved model
answer_to_index = vocab['answer']

num_tokens = len(token_to_index) + 1

# reading answer classes from the vocabulary
answer_words = ['unk'] * len(answer_to_index)
for w, idx in answer_to_index.items():
    answer_words[idx]=w

### Inputs
To use the pretrained model, the input image should be resized to `(448, 448)` and normalized using `mean = [0.485, 0.456, 0.406]` and `std = [0.229, 0.224, 0.225]`. You can call the function `image_to_features` for image preprocessing. For the input question, the function `encode_question` is provided to encode the question into a vector of indices. You can also use the `preprocess` function to preprocess both the image and the question.
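
To make the question encoding concrete, here is a small sketch with a made-up vocabulary. `toy_token_to_index` and `toy_encode` are hypothetical names used only for illustration; the real mapping is loaded from the checkpoint. Words missing from the vocabulary are mapped to index 0.

```python
# Hypothetical vocabulary for illustration only.
toy_token_to_index = {"what": 1, "animal": 2, "color": 3}

def toy_encode(question, vocab):
    """Lower-case, split on whitespace, and map each token to its index (0 if unknown)."""
    tokens = question.lower().split()
    return [vocab.get(tok, 0) for tok in tokens], len(tokens)

print(toy_encode("What animal", toy_token_to_index))  # ([1, 2], 2)
print(toy_encode("What breed", toy_token_to_index))   # ([1, 0], 2)
```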

In [None]:
def get_transform(target_size, central_fraction=1.0):
    return transforms.Compose([
        # transforms.Scale was renamed to transforms.Resize in torchvision
        transforms.Resize(int(target_size / central_fraction)),
        transforms.CenterCrop(target_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

In [None]:
def encode_question(question):
    """ Turn a question into a vector of indices and a question length """
    question_arr = question.lower().split()
    vec = torch.zeros(len(question_arr), device=device).long()
    for i, token in enumerate(question_arr):
        index = token_to_index.get(token, 0)
        vec[i] = index
    return vec, torch.tensor(len(question_arr), device=device)

In [None]:
# preprocess takes the path of an image and the associated question.
# It returns the inputs in the form expected by the VQA model.
def preprocess(dir_path, question):
    q, q_len = encode_question(question)
    img = Image.open(dir_path).convert('RGB')
    image_size = 448  # scale image to given size and center
    central_fraction = 1.0
    transform = get_transform(image_size, central_fraction=central_fraction)
    img_transformed = transform(img)
    img_features = img_transformed.unsqueeze(0).to(device)
    
    inputs = (img_features, q.unsqueeze(0), q_len.unsqueeze(0))
    return inputs

We provide two images, each with a question and candidate answers.

In [None]:
Question1 = 'What animal'
Answer1 = ['dog','cat' ]
indices1 = [answer_to_index[ans] for ans in Answer1]  # indices of the candidate answers
img1 = Image.open('dog_cat.png')
img1

In [None]:
dir_path = 'dog_cat.png' 
inputs = preprocess(dir_path, Question1)
ans = vqa_resnet(*inputs) # use model to predict the answer
answer_idx = np.argmax(F.softmax(ans, dim=1).detach().cpu().numpy())  # .cpu() so this also works on GPU
print(answer_words[answer_idx])

In [None]:
Question2 = 'What color'
Answer2 = ['green','yellow' ]
indices2 = [answer_to_index[ans] for ans in Answer2]
img2 = Image.open('hydrant.png')
print(img2.size)
img2

In [None]:
dir_path = 'hydrant.png' 
inputs = preprocess(dir_path, Question2)
ans = vqa_resnet(*inputs) # use model to predict the answer
answer_idx = np.argmax(F.softmax(ans, dim=1).detach().cpu().numpy())  # .cpu() so this also works on GPU
print(answer_words[answer_idx])

### Grad-CAM 
* **Overview:** Given an image with a question and a category (‘dog’) as input, we forward propagate the image through the model to obtain the `raw class scores` before softmax. The gradients are set to zero for all classes except the desired class (dog), which is set to 1. This signal is then backpropagated to the `rectified convolutional feature map` of interest, from which we compute the coarse Grad-CAM localization (blue heatmap).


* **To Do**: Define your own function Grad_CAM to visualize the two images. For each image, consider the answers we provided as the desired classes. Compare the heatmaps obtained for the different answers and conclude.


* **Hints**: 
 + To compute Grad-CAM, we need to record the output and grad_output of the feature maps. PyTorch provides hooks for this purpose. Read the [hook](https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html#forward-and-backward-function-hooks) tutorial carefully.
 + The pretrained model `vqa_resnet` has no activation function after its last layer, so its output is the `raw class scores` and you can use it directly. Run `print(vqa_resnet)` to get more information on the model.
 + The last CNN layer of the model is: `vqa_resnet.resnet_layer4.r_model.layer4[2].conv3` 
 + The feature maps are 14x14, and so is your heatmap. You need to project the heatmap onto the original image (224x224) for better visualization. The function `cv2.resize()` may help.
 + Here is the link to the paper [Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization](https://arxiv.org/pdf/1610.02391.pdf)
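
The hook mechanism can be tried on a toy network first. The sketch below is not the VQA model: `TinyNet` is a hypothetical small CNN, and a forward hook combined with a tensor hook on the layer output stands in for the forward and backward module hooks described in the tutorial. It records the feature maps and their gradients for a chosen class index.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyNet(nn.Module):
    """Hypothetical stand-in for a CNN classifier (not the VQA model)."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        return self.fc(self.pool(x).flatten(1))  # raw class scores, no softmax

model = TinyNet().eval()
store = {}

def forward_hook(module, inputs, output):
    # Keep the feature maps and ask autograd to report their gradient later.
    store["features"] = output.detach()
    output.register_hook(lambda grad: store.__setitem__("grads", grad.detach()))

handle = model.conv.register_forward_hook(forward_hook)

x = torch.randn(1, 3, 14, 14)
scores = model(x)
target_class = 2                    # the class we want to explain
model.zero_grad()
scores[0, target_class].backward()  # equivalent to backpropagating a one-hot signal
handle.remove()

print(store["features"].shape, store["grads"].shape)
```

Choosing `target_class` explicitly, rather than taking the predicted class, is what allows heatmaps for different candidate answers to be compared.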

<img src="grad_cam.png">
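
Once the feature maps and the gradients of the class score with respect to them are recorded, the heatmap follows two steps: average the gradients over the spatial dimensions to get one weight per channel, then apply a ReLU to the weighted sum of the feature maps. A minimal NumPy sketch on random arrays (the helper name `grad_cam_map` and the shapes are illustrative assumptions):

```python
import numpy as np

def grad_cam_map(features, grads):
    """features, grads: arrays of shape (C, H, W). Returns an (H, W) map in [0, 1]."""
    weights = grads.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(weights, features, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0)                       # ReLU keeps positive evidence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam         # avoid dividing by zero

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 14, 14))
grads = rng.standard_normal((8, 14, 14))
cam = grad_cam_map(features, grads)
print(cam.shape, float(cam.min()), float(cam.max()))
```

The resulting 14x14 map can then be upsampled to the image size, for example with `cv2.resize`, before it is overlaid on the image.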

In [None]:
def Grad_Cam(dir_path, question):
    
    features = None
    grads = None
    
    def forward_hook(module, inputs, output):
        nonlocal features
        # feature maps of the last sample in the batch, shape (C, H, W)
        features = output.detach().cpu().numpy()[-1]
        
    def backward_hook(module, grad_input, grad_output):
        nonlocal grads
        # gradient of the class score w.r.t. the feature maps, shape (C, H, W)
        grads = grad_output[0].detach().cpu().numpy()[-1]

    
    def learn_image(dir_path, question):
        inputs = preprocess(dir_path, question)
    
        input_img = inputs[0]
        input_img.requires_grad = True
        target_layer = vqa_resnet.resnet_layer4.r_model.layer4[2].conv3
        grad_cam_forward_hook = target_layer.register_forward_hook(forward_hook)
        grad_cam_backward_hook = target_layer.register_backward_hook(backward_hook)
        answer = vqa_resnet(*inputs)  # raw class scores
        # explain the answer predicted by the model
        answer_idx = int(np.argmax(answer.detach().cpu().numpy()))
        vqa_resnet.zero_grad()
        # one-hot signal: 1 for the chosen class, 0 elsewhere (same device as the scores)
        one_hot = torch.zeros_like(answer)
        one_hot[0, answer_idx] = 1
        score = torch.sum(one_hot * answer)
        score.backward()
        grad_cam_forward_hook.remove()
        grad_cam_backward_hook.remove()
        
    
    def compute_image():
        # channel weights: gradients averaged over the spatial dimensions
        weights = np.mean(grads, axis=(1, 2))

        # weighted sum of the feature maps
        grayscale = np.zeros(features.shape[1:])
        for i, weight in enumerate(weights):
            grayscale += weight * features[i]
        
        # ReLU: keep only the positive contributions
        grayscale = np.maximum(grayscale, 0)
        
        return grayscale
        
    def merge_image(image, grad_cam):
        # grad_cam is expected in [0, 1]; map it to a JET color heatmap
        heatmap = cv2.applyColorMap(np.uint8(255 * grad_cam), cv2.COLORMAP_JET)
        # note: OpenCV returns BGR channels while PIL reads them as RGB,
        # so high activations are drawn in blue here
        heatmap = Image.fromarray(heatmap)
        # Image.blend needs two images with the same mode and size
        return Image.blend(image.convert('RGB'), heatmap, alpha=0.7)

    learn_image( dir_path, question)
    
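    # open the original image as RGB
    img = Image.open(dir_path).convert('RGB')
    
    grayscale = compute_image()
    # upsample the coarse map to the image size (width, height)
    grayscale = cv2.resize(grayscale, img.size)
    # rescale to [0, 1]; the small constant avoids division by zero
    grayscale = grayscale - np.min(grayscale)
    grayscale = grayscale / (np.max(grayscale) + 1e-8)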

    
    merged_img =  merge_image(img, grayscale)
    return merged_img

In [None]:
img = Grad_Cam('dog_cat.png', Question1)
img

In [None]:
img = Grad_Cam('hydrant.png', Question2)
img

### Results
* The regions related to the answer are highlighted in the image, very clearly for the first image. For the second image, a second, reddish, less visible region appears that seems less related to the answer. Nonetheless, the main region, the green fire hydrant, stands out very clearly.