## Project Introduction

Build a simple convolutional neural network in PyTorch and train it to recognize natural objects using the CIFAR-10 dataset. The notebook is structured as follows:

    1. Introduction
    2. Setting up the Environment
    3. Preparing the Data
    4. Building the Network
    5. Training the Model and Evaluating the Performance
    6. Understanding the Deep Neural Networks


### Setting up the Environment

```
!pip install torch torchvision
!pip install Pillow==4.0.0
```

In [1]:
"""
Imports numpy, torch, torchvision, matplotlib
"""

import cv2
import torch
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import torchvision.models as models
import matplotlib.pyplot as plt
import numpy as np
from copy import deepcopy
from torchvision.utils import make_grid

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split

import math
import os
import argparse

### Some testing for PyTorch

In [2]:
#some testing here

# Operations
y = torch.rand(2, 2)
x = torch.rand(2, 2)

# multiplication
z = x * y
z = torch.mul(x,y) #elementwise #must be broadcastable

#matrix multiplication
tensor1 = torch.randn(3, 4)
tensor2 = torch.randn(4, 3)
torch.matmul(tensor1, tensor2)

# NumPy conversion
x = torch.rand(2,2)
y = x.numpy()
print(type(y))

z1 = torch.from_numpy(y) #sharing the memory space with the numpy ndarray
z2 = torch.tensor(y) #a copy
print(type(z1))
print(type(z2))

# Pytorch attributes and functions for tensors
print(x.shape)
print(x.device)

# autograd

# requires_grad=True lets us compute gradients on the tensor
x = torch.tensor([2,3,5], dtype=float, requires_grad=True)
y =(5 * x**2).sum()

# When the computation is finished, call .backward() to have all the gradients computed automatically
# The gradient for this tensor will be accumulated into its .grad attribute
y.backward()
#print(y.grad) # dy/dy; None here, since y is not a leaf tensor
print(x.grad) # dy/dx

# autograd requires computational resources and can take time.
# Disable autograd for model evaluation by writing your evaluation code inside
# a `with torch.no_grad():` block; this is what is usually done in the evaluation part

<class 'numpy.ndarray'>
<class 'torch.Tensor'>
<class 'torch.Tensor'>
torch.Size([2, 2])
cpu
tensor([20., 30., 50.], dtype=torch.float64)
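
The gradient printed above can be checked by hand: for y = sum(5 * x^2), dy/dx = 10 * x. The short sketch below repeats that check and also shows the effect of `torch.no_grad()`. It is only an illustration; the variable names `grad_matches` and `tracks_grad` are made up for this example.

```python
import torch

# d/dx of sum(5 * x^2) is 10 * x, so the gradient at [2, 3, 5] should be [20, 30, 50].
x = torch.tensor([2.0, 3.0, 5.0], requires_grad=True)
y = (5 * x ** 2).sum()
y.backward()
grad_matches = torch.allclose(x.grad, 10 * x.detach())

# Inside torch.no_grad() no computation graph is recorded,
# so the result does not require (or track) gradients.
with torch.no_grad():
    y_eval = (5 * x ** 2).sum()
tracks_grad = y_eval.requires_grad

print(grad_matches, tracks_grad)
```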


For repeatable experiments, it is recommended to set random seeds for anything that uses random number generation. This includes numpy and random as well!

In [None]:
experiment_name = 'debug'  #Provide name to model experiment
model_name = 'basic' # Choose between [basic, alexnet]
batch_size = 5  #You may not need to change this, but in case you do

torch.manual_seed(42)
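
The cell above only seeds PyTorch. A minimal sketch of seeding all three generators mentioned in the text could look like the following; `seed_everything` is a hypothetical helper name, not part of the notebook or of any library.

```python
import random

import numpy as np
import torch


def seed_everything(seed=42):
    """Hypothetical helper: seed Python's random, NumPy and PyTorch together."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Drawing after re-seeding should reproduce exactly the same numbers.
seed_everything(42)
first_draw = (random.random(), np.random.rand(), torch.rand(1).item())
seed_everything(42)
second_draw = (random.random(), np.random.rand(), torch.rand(1).item())
print(first_draw == second_draw)
```

Note that dataloader shuffling and `random_split` also draw from PyTorch's generator, so the seed should be set before the datasets and dataloaders are created.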

## Preparing the Data

I have written some data-loading functions and a few helper functions below. 

Note that any effects (augmentations) added to the training data should take the properties of the dataset into account. For example, for digit recognition on MNIST, can we rotate the image of the digit 6 by 180 degrees without altering the image label during network training?


In [None]:
def get_transform(model_name):

    if model_name == 'alexnet':
        transform = transforms.Compose([
            transforms.Resize((227, 227)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
        ])

    else:

        transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
        ])
    
    return transform


def get_dataset(model_name, train_percent=0.9):
    '''
    Returns the train, val and test torch Datasets in addition to a list of classes, where the idx of class name corresponds
    to the label used for it in the data
    
    
    @model_name: either 'basic' or 'alexnet'
    @train_percent: percent of training data to keep for training. Rest will be validation.
    '''
    
    transform  = get_transform(model_name)
    
    train_data = CIFAR10(root='./data', train=True, download=first_run, transform=transform)
    test_data  = CIFAR10(root='./data', train=False, download=first_run, transform=transform)

    train_size = int(train_percent * len(train_data))
    val_size = len(train_data) - train_size


    train_data, val_data = random_split(train_data, [train_size, val_size])
    class_names = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
    
    return train_data, val_data, test_data, class_names


def get_dataloader(batch_size, num_workers=1, model_name='basic'):
    '''
    Returns the train, val and test dataloaders in addition to a list of classes, where the idx of class name corresponds
    to the label used for it in the data
    
    Reference for dataloader class: https://pytorch.org/docs/stable/data.html
    @batch_size: batch to be used by dataloader
    @num_workers: number of dataloader workers/instances used
    @model_name: either 'basic' or 'alexnet'
    '''
    
    train_set, val_set, test_set, class_names = get_dataset(model_name)
    trainloader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True)
    valloader = DataLoader(val_set, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)
    testloader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

    
    return trainloader, valloader, testloader, class_names

def makegrid_images(model_name='basic'):
    '''
    For visualization purposes
    
    @model_name: either 'basic' or 'alexnet'
    '''
    

    _, trial_loader, _, _ = get_dataloader(32, model_name=model_name)
    images, labels = next(iter(trial_loader))
    
    grid = make_grid(images)
    
    return grid

def show_img(img, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), viz=True, norm=True):
    '''
    For visualization purposes
    
    B: batch size
    C: channels
    H: height
    W: width
    
    @img: torch.Tensor for the image of type (B, C, H, W) 
    @mean: mean used for normalizing along 3 dimensions C, H, W in get_transform
    @std:  std. deviation used for normalizing along 3 dimensions C, H, W in get_transform
    @viz: whether or not to plt.plot or just return the unnormalized image
    @norm: whether or not to unnormalize. Unnormalizes if True.
    
    Returns:
    Viewable image in (H, W, C) as a numpy array
    '''
    

    if norm:
        for idx in range(img.shape[0]):

            img[idx] = img[idx] * std[idx] + mean[idx]
        
    image = np.asarray(img)

    if viz:
        if len(image.shape) == 4:
            image = image.squeeze()

        plt.imshow(np.transpose(image, (1, 2, 0)))
        plt.show()

    return np.transpose(image.squeeze(), (1, 2, 0))
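
The relationship between the `Normalize` step in `get_transform` and the unnormalization in `show_img` can be sketched with plain tensors. This is an illustration on a random stand-in image, under the assumption of per-channel mean and std of 0.5 as used above; it does not call torchvision.

```python
import torch

# Per-channel statistics, shaped so they broadcast over (C, H, W).
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

pixels = torch.rand(3, 4, 4)         # stand-in image with values in [0, 1)
normalized = (pixels - mean) / std   # per-channel normalization: maps [0, 1] to [-1, 1]
restored = normalized * std + mean   # the inverse, applied before plotting

in_range = bool(normalized.min() >= -1) and bool(normalized.max() <= 1)
round_trip_ok = torch.allclose(restored, pixels, atol=1e-6)
print(in_range, round_trip_ok)
```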

## Building the Network

### Training from scratch

First, we train a shallow convolutional neural network from scratch. The network is defined in **GradBasicNet**; its parameters are randomly initialized and it is trained on the training dataset:

1. Fill out **get_conv_layers()** with a network of **2 conv layers**, each with a kernel size of 5. After the first convolutional layer, add a **max pooling layer** with a kernel size of 2 and a stride of 2. Remember to add the nonlinearities immediately after the conv layers. You must choose the in-channels and out-channels for both layers: the input image is RGB, so there is only one number that can be used for the in-channels of the first conv layer. Use the nn.Sequential API to combine all of this into one module, and return it from the **get_conv_layers()** method.
2. Fill out **final_pool_layer()** with a max pooling layer. Typically, there is another max pooling layer after all the convolutional layers. Use the same kernel size and stride as before, and this time directly return the **nn.MaxPool2d**. 
3. Fill out the **get_fc_layers()** method with a classifier containing 3 linear layers. Feel free to choose the in_features and out_features for these. Once again, use the **nn.Sequential** API and return the object from the method. Inside, alternate the Linear layers with ReLU activations. You do not need a ReLU after the final layer. Remember that the first Linear layer must take in the flattened output of the convolutional part of the network (after the final max pooling layer), so depending on the choices in (1.) there is only one value that the in_features of the first Linear layer can have. Also, the final Linear layer must have out_features=10, since we are performing 10-way classification. Lastly, **remember to add comments** on why you keep the ReLU after the first two layers but remove it after the last layer.
4. Finally, fill out the forward pass using these layers. Remember to use **self.conv_model**, **self.final_max_pool** and **self.fc_model** one after the other.
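
The input size of the first Linear layer follows from simple shape arithmetic. The sketch below works it out under stated assumptions: 32x32 CIFAR-10 inputs, no padding, stride 1 for the conv layers, and a hypothetical choice of 16 output channels for the second conv layer. `conv_output_size` is a helper written for this example only.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a Conv2d or MaxPool2d layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 32                                    # CIFAR-10 images are 32x32
size = conv_output_size(size, 5)             # first conv, kernel 5   -> 28
size = conv_output_size(size, 2, stride=2)   # max pool, kernel 2 / stride 2 -> 14
size = conv_output_size(size, 5)             # second conv, kernel 5  -> 10
size = conv_output_size(size, 2, stride=2)   # final max pool         -> 5

out_channels = 16                            # hypothetical choice for the second conv layer
flattened_features = out_channels * size * size
print(size, flattened_features)
```

With a different number of output channels or with padding, the same helper gives the matching value.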


In [None]:
class GradBasicNet(nn.Module):

    def __init__(self):
        super().__init__()

        self.conv_model = self.get_conv_layers()
        self.final_max_pool = self.final_pool_layer()
        self.fc_model = self.get_fc_layers()

    def get_conv_layers(self):

        #TODO: Group the convolutional layers using nn.Sequential
        
        layers = 

        return layers

    def final_pool_layer(self):

        #TODO Set this to a MaxPool layer
        
        layer = 
        
        return layer

    def get_fc_layers(self):

        # TODO Group the linear layers using nn.Sequential
        
        layers = 
        
        # =================================================
        # Please add the comment here (details in the description above)
        # =================================================
        return layers

    def register_grad_hook(self, grad):
        self.grad = grad


    def forward(self, x):

        #TODO
        x = #call the conv layers 
        #ignore this: relevance for gradcam section
        h = x.register_hook(self.register_grad_hook)

        x = #call the max pool layer
        x = #flatten the output of x 
        x = #call the fully connected layers

        return x

    def get_gradient_activations(self):
        return self.grad

    def get_final_conv_layer(self, x):
        return self.conv_model(x)
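
The `register_hook` call in `forward` is what later makes the gradient at the conv output available. A minimal sketch of the mechanism on a tiny tensor follows; the names `captured` and `save_grad` are made up for this illustration.

```python
import torch

captured = {}


def save_grad(grad):
    # Called during backward() with the gradient of the loss w.r.t. the hooked tensor.
    captured['grad'] = grad


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
h = x * 2                     # intermediate tensor, playing the role of the conv output
h.register_hook(save_grad)    # same mechanism as x.register_hook(...) in forward()

loss = (h ** 2).sum()
loss.backward()

# d(loss)/dh = 2 * h, i.e. 2 * [2, 4, 6]
print(captured['grad'])
```

Intermediate tensors do not keep their `.grad` after the backward pass by default, which is why a hook is used to store it.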

### Finetuning on a pre-trained model

Second, we build a stronger convolutional neural network by starting from a pre-trained model, **AlexNet**:

In this subsection, you will take advantage of a pretrained network in a process called transfer learning, where you only train a few final layers of a neural network. Here we use the AlexNet architecture that revolutionized deep learning. The model pretrained on ImageNet is available in the torchvision library, and we ask PyTorch to allow updates on only a few of the final layers during training. The AlexNet model is wrapped in the same API that is used for the model above, except that it needs an additional **transition layer** function for the added **AvgPool** layer. Visualize the model by running the code cell below. 
1. (features) contains most of the conv layers. We need up to (11) to include every conv layer
2. (features) (12) is the final max pooling layer
3. (avgpool) is the transition average pooling layer
4. (classifier) is the collection of linear layers


In [None]:
example_model = models.alexnet(pretrained=True)
print(example_model)

The TODO task consists of the following:

1. Implement the method **activate_training_layers**, which sets the requires_grad of the relevant parameters to True so that training can occur. You can iterate over the parameters with 

``
for name, param in self.conv_model.named_parameters():
``

and 

``
for name, param in self.fc_model.named_parameters():
``

For the conv layers, every param should have requires_grad set to False, except for the last conv layer (10). 
For the linear layers, all layers must be trainable, i.e. requires_grad must be set to True.

2. Implement the forward pass in the following order: self.conv_model, self.final_max_pool, self.avg_pool, self.fc_model

3. Remember to fill in the comments requested in the code and in the questions below.
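
The index-based freezing described above can be sketched on a toy stand-in for the conv stack. This is not the AlexNet code: `toy_features` and `last_conv_index` are hypothetical, and the layer sizes are arbitrary. The point is only how parameter names such as `'2.weight'` map to layer indices.

```python
import torch.nn as nn

# Toy stand-in for a pretrained feature extractor (assumption, not AlexNet).
toy_features = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),   # index 0
    nn.ReLU(),                        # index 1 (no parameters)
    nn.Conv2d(8, 8, kernel_size=3),   # index 2, the last conv layer in this toy model
    nn.ReLU(),                        # index 3 (no parameters)
)
last_conv_index = 2

for name, param in toy_features.named_parameters():
    # Names look like '0.weight' or '2.bias'; the part before the dot is the layer index.
    layer_index = int(name.split('.')[0])
    param.requires_grad = (layer_index == last_conv_index)

trainable = [name for name, p in toy_features.named_parameters() if p.requires_grad]
print(trainable)
```

Parameters with `requires_grad=False` receive no gradients, so the optimizer leaves them unchanged.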

In [None]:
class GradAlexNet(nn.Module):

    def __init__(self):
        super().__init__()

        self.base_alex_net = models.alexnet(pretrained=True)


        self.conv_model = self.get_conv_layers()
        self.final_max_pool = self.final_pool_layer()
        self.avg_pool = self.transition_layer()
        self.fc_model = self.get_fc_layers()

        self.activate_training_layers()

    def activate_training_layers(self):

        #TODO Fill out the function below
        for name, param in self.conv_model.named_parameters():
            
            print(name)
            #this is the index of the layer this parameter belongs to. From the model printed above, what is
            #the index of the last convolutional layer?
            # =================================================
            # Please comment here
            # =================================================
            number = int(name.split('.')[0])
            
            # TODO: for all layers except the last layer set param.requires_grad = False

        for name, param in self.fc_model.named_parameters():

            # for all of these layers set param.requires_grad as True


    def get_conv_layers(self):

        return self.base_alex_net.features[:12]

    def final_pool_layer(self):
        return nn.MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)

    def transition_layer(self):
        return nn.AdaptiveAvgPool2d(output_size=(6, 6))

    def get_fc_layers(self):
        return nn.Sequential(
            nn.Dropout(p=0.5, inplace=False),
            nn.Linear(in_features=9216, out_features=4096, bias=True),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5, inplace=False),
            nn.Linear(in_features=4096, out_features=4096, bias=True),
            nn.ReLU(inplace=True),
            nn.Linear(in_features=4096, out_features=1000, bias=True),
            nn.ReLU(inplace=True),
            nn.Linear(in_features=1000, out_features=10, bias=True),
        )

    def register_grad_hook(self, grad):
        self.grad = grad

    def forward(self, x):

        #TODO fill out the forward pass
        x = #call the conv layers
        h = x.register_hook(self.register_grad_hook)

        x = #call the max pool layer
        x = #call the avg pool layer
        x = #call fully connected layers
        
        return x

    def get_gradient_activations(self):
        return self.grad

    def get_final_conv_layer(self, x):
        return self.conv_model(x)

#### Please add a comment:
After finishing the implementation and running the experiments, please add a comment answering the questions below:
1. What is the main difference between finetuning and training from scratch?

2. Compare the performance of finetuning from a pre-trained model with that of training from scratch. Why does the better one outperform the other?

## Training the Model and Evaluating the Model's Performance

Note that one pass of the training loop over the whole training set is called one epoch: all of the training data has been used to train the deep neural network once. Please fill out the classifier we use to train these networks. 
1. In the **__init__** method, fill out **self.criterion** and **self.optim**. Remember this is a classification problem, so you will need a cross entropy loss. For the optimizer, we recommend using stochastic gradient descent with a learning rate of 0.001 and momentum of 0.9. The rest of **__init__** has been filled out.
2. Next, fill out the **training loop**. You are expected to iterate over **self.dataloaders['train']** and optimize the loss against the ground-truth labels. The **validation** part of the training loop has been provided, and so has the evaluate method. In the training loop, print the average loss every 1000 images processed. Remember to zero out the gradients before calling loss.backward(), and only after this backward step make a step in the right direction using the optimizer.

At this stage, ignore all methods after evaluate. They will be relevant to later sections, and you will have to return to them when you have more instructions.
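
The order of operations in one training step can be sketched on toy data. Everything below other than the PyTorch calls is an assumption made for illustration: the random inputs, the tiny linear model, and the variable names. It is not the notebook's `Classifier.train` implementation.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-ins: 20 random "images" with labels in 0..9, and a tiny model.
inputs = torch.randn(20, 3, 8, 8)
labels = torch.randint(0, 10, (20,))
loader = DataLoader(TensorDataset(inputs, labels), batch_size=5, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

model.train()
epoch_loss = 0.0
for batch_inputs, batch_labels in loader:
    optimizer.zero_grad()                     # clear gradients left from the previous step
    outputs = model(batch_inputs)             # forward pass: raw class scores (logits)
    loss = criterion(outputs, batch_labels)   # cross entropy against the ground-truth labels
    loss.backward()                           # compute gradients
    optimizer.step()                          # update the parameters
    epoch_loss += loss.item()

epoch_loss /= len(loader)                     # average loss per batch
print(epoch_loss)
```

`nn.CrossEntropyLoss` expects raw logits, which is one reason the final Linear layer is not followed by an activation.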

In [None]:
class Classifier():

    def __init__(self, name, model, dataloaders, class_names, use_cuda=False):
        
        '''
        @name: Experiment name. Will define stored results etc. 
        @model: Either a GradBasicNet() or a GradAlexNet()
        @dataloaders: Dictionary with keys train, val and test and corresponding dataloaders
        @class_names: list of classes, where the idx of class name corresponds to the label used for it in the data
        @use_cuda: whether or not to use cuda
        '''
        
        self.name = name
        if use_cuda and not torch.cuda.is_available():
            raise Exception("Asked for CUDA but GPU not found")
            
        self.use_cuda = use_cuda
        
        self.model = model.to('cuda' if use_cuda else 'cpu')

        #TODO
        self.criterion = #use cross entropy loss
        self.optim = #use SGD with the suggested hyperparams; you must select all the model params

        self.dataloaders = dataloaders
        self.class_names = class_names
        self.activations_path = os.path.join('activations', self.name)
        self.kernel_path = os.path.join('kernel_viz', self.name)

        save_path = os.path.join(os.getcwd(), 'models', self.name)
        if not os.path.exists(save_path):
            os.makedirs(save_path)

        if not os.path.exists(self.activations_path):
            os.makedirs(self.activations_path)

        if not os.path.exists(self.kernel_path):
            os.makedirs(self.kernel_path)
            
        self.save_path = save_path

    def train(self, epochs, save=True):
        '''
        @epochs: number of epochs to train
        @save: whether or not to save the checkpoints
        '''

        best_val_accuracy = - math.inf

        for epoch in range(epochs):

            self.model.train()

            batches_in_pass = len(self.dataloaders['train'])
            
            #You may comment these two lines if you do not wish to use them
            loss_total = 0.0 # Record the total loss within a few steps
            epoch_loss = 0.0 # Record the total loss for each epoch
            
            # TODO Iterate over the training dataloader (see how it is done for validation below) and make sure
            # to call the optim.zero_grad(), loss.backward() and optim.step()

            '''Give validation'''
            epoch_loss /= batches_in_pass

            self.model.eval()
            
            #DO NOT modify this part
            correct = 0.0
            total = 0.0
            for idx, data in enumerate(self.dataloaders['val']):

                inputs, labels = data
                inputs = inputs.to('cuda' if self.use_cuda else 'cpu')
                labels = labels.to('cuda' if self.use_cuda else 'cpu')

                outputs = self.model(inputs)
                _, predicted = torch.max(outputs, 1)

                total += labels.shape[0]
                correct += (predicted == labels).sum().item()

            epoch_accuracy = 100 * correct / total

            print(f'Train Epoch Loss (Avg): {epoch_loss}')
            print(f'Validation Epoch Accuracy:{epoch_accuracy}')
            
            if save:
                #  Make sure that your saving pipeline is working well. 
                # Is os library working on your file system? 
                # Is your model being saved and reloaded fine? 
                # When you do the kernel viz, activation maps, 
                # and GradCAM you must be using the model you have saved before.
                
                torch.save(self.model.state_dict(), os.path.join(self.save_path, f'epoch_{epoch}.pt'))
                
                if epoch_accuracy > best_val_accuracy:

                    torch.save(self.model.state_dict(), os.path.join(self.save_path, 'best.pt'))
                    best_val_accuracy = epoch_accuracy

        print('Done training!')                       

    
    def evaluate(self):
        
        if not os.path.exists(os.path.join(self.save_path, 'best.pt')):
            print('It appears you are testing the model without training. Please train first')
            return
        
        self.model.load_state_dict(torch.load(os.path.join(self.save_path, 'best.pt')))
        self.model.eval()

        #total = len(self.dataloaders['test'])
        
        correct = 0.0
        total = 0.0
        for idx, data in enumerate(self.dataloaders['test']):
            
                inputs, labels = data
                inputs = inputs.to('cuda' if self.use_cuda else 'cpu')
                labels = labels.to('cuda' if self.use_cuda else 'cpu')
                
                outputs = self.model(inputs)
                _, predicted = torch.max(outputs, 1)
                
                total += labels.shape[0]
                correct += (predicted == labels).sum().item()
                
        print(f'Accuracy: {100 * correct/total}%')
        
    def grad_cam_on_input(self, img):
        
        if not os.path.exists(os.path.join(self.save_path, 'best.pt')):
            print('It appears you are testing the model without training. Please train first')
            return

        self.model.load_state_dict(torch.load(os.path.join(self.save_path, 'best.pt')))


        self.model.eval()
        img = img.to('cuda' if self.use_cuda else 'cpu')


        out = self.model(img)

        _, pred = torch.max(out, 1)

        predicted_class = self.class_names[int(pred)]
        print(f'Predicted class was {predicted_class}')

        out[:, pred].backward()
        gradients = self.model.get_gradient_activations()

        print('Gradients shape: ', f'{gradients.shape}')

        mean_gradients = torch.mean(gradients, [0, 2, 3]).cpu()
        activations = self.model.get_final_conv_layer(img).detach().cpu()

        print('Activations shape: ', f'{activations.shape}')

        for idx in range(activations.shape[1]):
            activations[:, idx, :, :] *= mean_gradients[idx]

        final_heatmap = torch.clamp(torch.mean(activations, dim=1).squeeze(), min=0)

        final_heatmap /= torch.max(final_heatmap)

        return final_heatmap
        

    

    def trained_kernel_viz(self):
        
        all_layers = [0, 3]
        all_filters = []
        for layer in all_layers:

            #TODO: blank out first line
            filters = self.model.conv_model[layer].weight
            all_filters.append(filters.detach().cpu().clone()[:8, :8, :, :])

        for filter_idx in range(len(all_filters)):

            filter = all_filters[filter_idx]
            print(filter.shape)
            filter = filter.contiguous().view(-1, 1, filter.shape[2], filter.shape[3])
            image = show_img(make_grid(filter))
            image = 255 * image
            cv2.imwrite(os.path.join(self.kernel_path, f'filter_layer{all_layers[filter_idx]}.jpg'), image)
    

    def activations_on_input(self, img):
        
        img = img.to('cuda' if self.use_cuda else 'cpu')

        all_layers = [0,3,6,8,10]
        all_viz = []

        for each in all_layers:

            current_model = self.model.conv_model[:each+1]
            current_out = current_model(img)
            all_viz.append(current_out.detach().cpu().clone()[:, :64, :, :])

        for viz_idx in range(len(all_viz)):

            viz = all_viz[viz_idx]
            viz = viz.view(-1, 1, viz.shape[2], viz.shape[3])
            image = show_img(make_grid(viz))
            image = 255 * image
            cv2.imwrite(os.path.join(self.activations_path, f'sample_layer{all_layers[viz_idx]}.jpg'), image)

Run the classifier using the basic model by running the following snippets. If all goes well, you should have a test accuracy of about 60-70% at the end. It should take less than 10 minutes to run on a GPU.

In [None]:
experiment_name = 'basic_debug'  #Provide name to model experiment
model_name = 'basic' #Choose between [basic, alexnet]
batch_size = 5  #You may not need to change this, but in case you do
first_run = True #whether or not first time running it

trainloader, valloader, testloader, class_names = get_dataloader(batch_size=batch_size, model_name=model_name)
dataloaders = {'train': trainloader, 'val' : valloader, 'test': testloader, 'mapping': class_names}

if model_name == 'basic':
    model = GradBasicNet()
elif model_name == 'alexnet':
    model = GradAlexNet()
else:
    raise NotImplementedError("This option has not been implemented. Choose between 'basic' and 'alexnet' ")

classifier = Classifier(experiment_name, model, dataloaders, class_names, use_cuda=True)

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz



Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified



In [None]:
# While developing your code, you can save time by running the model
# for 5 epochs. The accuracy after training for 5 epochs is already high
# and close to that of the model after 20 epochs of training.
# (As a reference, Validation Epoch Accuracy is above 61 after training for 5 epochs)

# Note: for your final submission, make sure to train the model 
# for 20 epochs and analyze that model in the later sections.

# classifier.train(epochs=5) # For your reference
classifier.train(epochs=20)
classifier.evaluate()

Now run the classifier using the alexnet model specified above, which should show a notable performance increase. On a GPU, this trained for about 30 mins.

In [None]:
experiment_name = 'alexnet_debug'  #Provide name to model experiment
model_name = 'alexnet' #Choose between [basic, alexnet]
batch_size = 5  #You may not need to change this, but in case you do
first_run = True #whether or not first time running it

trainloader, valloader, testloader, class_names = get_dataloader(batch_size=batch_size, model_name=model_name)
dataloaders = {'train': trainloader, 'val' : valloader, 'test': testloader, 'mapping': class_names}

#model = models.alexnet(pretrained=True)
if model_name == 'basic':

    model = GradBasicNet()

elif model_name == 'alexnet':

    model = GradAlexNet()

else:
    raise NotImplementedError("This option has not been implemented. Choose between 'basic' and 'alexnet' ")

classifier = Classifier(experiment_name, model, dataloaders, class_names, use_cuda=True)

In [None]:
# While developing your code, you can save time by running the model
# for 3 epochs. The accuracy after training for 3 epochs is already high
# and close to that of the model after 20 epochs of training.

# Note: for your final submission, it is recommended to run the model for 20 epochs.
# However, if that takes too long, you should run the model for at least 5 epochs. 

# classifier.train(epochs=3) # For your reference
classifier.train(epochs=20)
classifier.evaluate()

**Before you move forward to the next step**: Try and see what classes your model does well on (you can modify the testing code for this). This will help you pick the best visualization to show later.
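
One way to see which classes the model does well on is to compute the accuracy per class from the predictions and labels collected in the test loop. The helper below is a hypothetical sketch (`per_class_accuracy` is not part of the notebook), shown on a tiny hand-made example rather than real model output.

```python
import torch


def per_class_accuracy(predicted, labels, num_classes):
    """Return a list with the accuracy (in [0, 1]) for each class index."""
    accuracies = []
    for class_idx in range(num_classes):
        mask = labels == class_idx                       # samples whose true label is this class
        total = mask.sum().item()
        correct = (predicted[mask] == class_idx).sum().item()
        accuracies.append(correct / total if total > 0 else float('nan'))
    return accuracies


# Hand-made example: 3 classes, 6 samples.
labels = torch.tensor([0, 0, 1, 1, 1, 2])
predicted = torch.tensor([0, 1, 1, 1, 0, 2])
accuracies = per_class_accuracy(predicted, labels, num_classes=3)
print(accuracies)
```

In the notebook, `predicted` and `labels` would be the concatenated tensors from all test batches, and the class indices map to `class_names`.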

## Deep Neural Networks

### Activation Map

To interpret the alexnet model, we will use a simple version of Grad-CAM (Gradient-weighted Class Activation Mapping, https://arxiv.org/abs/1610.02391). This helps to see which region of the input image the model is 'focusing on' when making its decision. The steps are as follows:

1. First, move the img to cuda if you are using a GPU.
2. The code to load the model trained before has been provided. Use self.model to output the predictions to the variable out. Your output should have dimension (1, 10). 
3. The predicted class is the index of the highest of these 10 values. Use **torch.max** along dim 1 to get the argument of the max value. This method returns two values; the latter is the required argument. The next two lines have been provided: they print the predicted class.
4. Call the **backward** method on out[:, pred]. During the forward pass, we registered a gradient hook on the output of the last convolutional layer, which means that during the backward pass we record the gradient of the score of the **predicted class** with respect to the **output** of the last convolutional layer. This gradient has the same size as the **output** of the last convolutional layer. 
5. After the **backward** call, get the value of the gradient above using the **get_gradient_activations** function on **self.model** and store it in gradients. Now use the torch.mean method to get the mean value of gradients across all dimensions except the channels dimension (number of filters) and store this in **mean_gradients**. The result holds one value per filter, i.e. conceptually shape (1, 64, 1, 1) for 64 filters; note that torch.mean over dims [0, 2, 3] returns it as a 1-D tensor with one entry per filter. If the tensors are on the GPU, move them to the CPU using .detach().cpu() 
6. Use self.model.get_final_conv_layer(img) to store the activations at the final conv layer in activations. This should be of shape (1, 64, H, W). 
7. Now, for each of the 64 filters, **scale** the activations at that filter by the corresponding **mean_gradients** value for that filter. The output should have size (1, 64, H, W). Iterate over the 64 filters in the following way:
 ``
 for idx in range(activations.shape[1]):
 ``
8. As a final step, take the mean across the 64 filters and apply a ReLU to get rid of negative activations before normalizing one last time to get the heatmap. This step has already been done for you.
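
Steps 5 to 8 above can be sketched with random stand-in tensors. This is a simplified illustration and not the notebook's `grad_cam_on_input`: the activations and gradients here are random (an assumption), and broadcasting is used in place of the explicit loop over filters.

```python
import torch

torch.manual_seed(0)

# Stand-ins for the real tensors (assumption): 64 filters on a 6x6 feature map.
activations = torch.rand(1, 64, 6, 6)    # output of the last conv layer
gradients = torch.randn(1, 64, 6, 6)     # d(class score) / d(activations)

# Step 5: one weight per filter, averaged over batch, height and width.
mean_gradients = gradients.mean(dim=(0, 2, 3))             # shape (64,)

# Step 7: scale every filter's activation map by its weight (broadcast over H and W).
weighted = activations * mean_gradients.view(1, -1, 1, 1)  # shape (1, 64, 6, 6)

# Step 8: average over filters, keep positive evidence only (ReLU), scale to [0, 1].
heatmap = torch.clamp(weighted.mean(dim=1).squeeze(0), min=0)
if heatmap.max() > 0:
    heatmap = heatmap / heatmap.max()

print(heatmap.shape)
```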

In [None]:
'''Sample an image from the test set'''

#You may change the sampling code to sample an image as you desire.
#Make sure to NOT move the sampling code to a different cell.

img_batch, labels_batch = next(iter(testloader))
img = img_batch[3]
img = img.unsqueeze(0)

classifier = Classifier(experiment_name, model, dataloaders, class_names, use_cuda=True)
heatmap = classifier.grad_cam_on_input(img)

def visualize(img, heatmap):


    heatmap = heatmap.cpu().numpy()

    img = show_img(img)
    img = np.uint8(255 * img)
    # show_img already returns an RGB (H, W, C) array, which is what plt.imshow expects, so no channel swap is needed
    print(img.shape)
    print(heatmap.shape)
    heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
    heatmap = np.uint8(255 * heatmap)
    heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
    heatmap = cv2.cvtColor(heatmap, cv2.COLOR_BGR2RGB)

    combine = 0.5 * heatmap + img

    #if not os.path.exists(write_path):
    #    os.makedirs(write_path)

    plt.imshow(combine/255)
    plt.show()


visualize(img, heatmap)

Show an example of an image where the method is looking at the object in question and another where it appears to be looking at something completely unrelated. In the latter case, the model might have learnt a spurious correlation, i.e. a bias in the data which always appears to be correlated with a given label. For the ship class, this **might** be the surrounding water or for a **horse** it might be the surrounding grass. In such cases, do you think the model would predict correctly for a ship on sand or a horse in the air? Please leave a comment in a text snippet below. **Causally trained neural networks** 

In [None]:
# Your code here to show a failure case. 
# You can refer to the steps and functions implemented in the previous cell and reuse them.

**Answer**: 

### Kernel and Activation Visualizations

Visualize some learned convolutional kernels for two layers in the conv-net. Study the code provided for **trained_kernel_viz** carefully. You only have to fill out the line for **filters**: access the relevant layer from **self.conv_model** and set **filters** equal to its weight parameter.

Call this function on the alexnet classifier. 

In [None]:
classifier.trained_kernel_viz()
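
The reshaping done inside `trained_kernel_viz` before calling `make_grid` can be sketched on a freshly initialized layer. The layer below is a hypothetical stand-in (3 input channels, 16 filters, kernel size 5), not a trained one, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5)  # hypothetical stand-in layer

# Keep at most the first 8 filters and the first 8 input channels.
# Slicing past the end is safe: this layer only has 3 input channels.
filters = conv.weight.detach().cpu().clone()[:8, :8, :, :]       # shape (8, 3, 5, 5)

# Turn every (filter, input channel) pair into its own single-channel tile,
# which is the (N, 1, H, W) layout the notebook passes to make_grid.
tiles = filters.contiguous().view(-1, 1, filters.shape[2], filters.shape[3])
print(filters.shape, tiles.shape)
```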

Now that the kernel viz is filled out, write the method for activation visualizations, **activations_on_input**. The structure of the code is very similar to the kernel viz, except that you are viewing the output of the model at each stage rather than the kernel at that stage. Once it is filled out, call this method on a few sample images. What do you observe? Elaborate in a text snippet below.

**Answer**: 

In [None]:
'''Sample an image from the test set'''
#You may change the sampling code to sample an image as you desire.
#Make sure to NOT move the sampling code to a different cell.
img_batch, labels_batch = next(iter(testloader))
img = img_batch[3]
img = img.unsqueeze(0)

classifier = Classifier(experiment_name, model, dataloaders, class_names, use_cuda=True)
classifier.activations_on_input(img)

What is your observation about early layers vs. later layers? Please leave a comment in a text snippet below.

**Answer**: 