# Q5: Analysis (20 points)
By now you should know how to train networks from scratch or from pre-trained models, and you should understand the relative performance in either scenario. Needless to say, these models perform better than the non-deep architectures used until 2012. However, final performance is not the only metric we care about. It is also important to get some intuition for what these models are really learning. Let's try some standard techniques.


**FEEL FREE TO WRITE UTIL CODE IN ANOTHER FILE AND IMPORT IN THIS NOTEBOOK FOR EASE OF READABILITY**

## 5.1 Nearest Neighbors (7 pts)
Pick 3 images from the PASCAL test set, each from a different class, and compute the 4 nearest neighbors of each over the entire test set. Compare the following feature representations when finding the nearest neighbors:
1. The features before the final fc layer of the ResNet (finetuned from ImageNet), i.e. the features right before the final class-label output.
2. pool5 features from the CaffeNet (trained from scratch)

You may use [this nearest neighbor function](https://scikit-learn.org/stable/modules/neighbors.html).
Plot the raw images of the ones you picked and their nearest neighbors.
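
The lookup step can be sketched with scikit-learn's `NearestNeighbors`. This is only a minimal illustration under stated assumptions: the random `features` array stands in for real network features, and `find_neighbors` is a hypothetical helper name, not part of the assignment code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def find_neighbors(features, query_indices, k=4):
    """Return indices of the k nearest neighbors of each query row,
    excluding the query itself."""
    # Ask for k + 1 neighbors: each query is also stored in the index and
    # normally comes back as its own closest match (distance 0).
    index = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(features)
    _, idx = index.kneighbors(features[query_indices])
    neighbors = []
    for q, row in zip(query_indices, idx):
        # Drop the query itself, then keep the k closest remaining rows.
        neighbors.append([j for j in row if j != q][:k])
    return np.array(neighbors)


# Stand-in data: random vectors in place of real ResNet / CaffeNet features.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 16))
query_indices = np.array([3, 42, 77])

neighbors = find_neighbors(features, query_indices, k=4)
print(neighbors)  # one row of 4 neighbor indices per query image
```

With real features, each row of `neighbors` indexes back into the test set, so the corresponding raw images can be plotted next to the query image.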

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import models
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
%matplotlib inline

import trainer
from utils import ARGS
from simple_cnn import SimpleCNN
from voc_dataset import VOCDataset

# Load all the test images. Pick 3 indices.
dataset = VOCDataset(split='test', inp_size=224)
loader = DataLoader(dataset, batch_size=512, num_workers=8)
# Sample without replacement so the 3 picked images are distinct.
indices = np.random.choice(len(dataset), size=3, replace=False)
torch_images = []
caffenet_features = []
resnet_features = []

# Calculate the features for all the test images.
class CaffeNet(nn.Module):
    def __init__(self, num_classes=20, inp_size=224, c_dim=3):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 96, 11, stride=4, padding="valid")
        self.pool1 = nn.MaxPool2d(3, 2)
        self.conv2 = nn.Conv2d(96, 256, 5, stride=1, padding="same")
        self.pool2 = nn.MaxPool2d(3, 2)
        self.conv3 = nn.Conv2d(256, 384, 3, stride=1, padding="same")
        self.conv4 = nn.Conv2d(384, 384, 3, stride=1, padding="same")
        self.conv5 = nn.Conv2d(384, 256, 3, stride=1, padding="same")
        self.pool5 = nn.MaxPool2d(3, 2)
        
        self.fc6 = nn.Linear(6400, 4096)
        self.dropout6 = nn.Dropout(0.5)
        self.fc7 = nn.Linear(4096, 4096)
        self.dropout7 = nn.Dropout(0.5)
        
        self.fc8 = nn.Linear(4096, 20)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, x):
        # Keep the returned fc7 features; the call does not modify x in place.
        x = self.forward_features(x)
        x = F.relu(x)
        x = self.dropout7(x)
        x = self.fc8(x)
        x = self.sigmoid(x)
        return x
    
    def forward_features(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.pool2(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.conv5(x)
        x = F.relu(x)
        x = self.pool5(x)
        x = torch.flatten(x, start_dim=1)
        x = self.fc6(x)
        x = F.relu(x)
        x = self.dropout6(x)
        x = self.fc7(x)
        return x
    
class PretrainedResNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.pretrained = models.resnet18(pretrained=True)
        self.pretrained.fc = nn.Linear(512, 20)
        self.sig = nn.Sigmoid()
    
    def forward(self, x):
        logits = self.pretrained(x)
        return self.sig(logits)
    
    def forward_features(self, x):
        x = self.pretrained.conv1(x)
        x = self.pretrained.bn1(x)
        x = self.pretrained.relu(x)
        x = self.pretrained.maxpool(x)
        x = self.pretrained.layer1(x)
        x = self.pretrained.layer2(x)
        x = self.pretrained.layer3(x)
        x = self.pretrained.layer4(x)
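        x = self.pretrained.avgpool(x)
        # Flatten (N, 512, 1, 1) to (N, 512) so each image yields one feature vector.
        x = torch.flatten(x, start_dim=1)
        return x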


caffenet = CaffeNet()
caffenet.load_state_dict(torch.load("models/checkpoint-caffenet_scratch-epoch50.pth"))
resnet = PretrainedResNet()
resnet.load_state_dict(torch.load("models/checkpoint-resnet_finetuned-epoch10.pth"))

# Find the nearest neighbors for the 3 images you picked.


# Plot the images and their neighbors.

## 5.2 t-SNE visualization of intermediate features (7 pts)
We can also visualize how the feature representations specialize for different classes. Take 1000 random images from the PASCAL test set and extract CaffeNet (scratch) fc7 features from them. Compute a 2D [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) projection of the features and plot the points, color-coding each one by the GT class of the corresponding image. If multiple objects are active in an image, compute its color as the "mean" color of the classes active in that image. Add a legend to the graph showing the color of each object class.
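
One way to read the "mean" color rule is to average the RGB colors of all classes active in an image. The sketch below shows that interpretation together with the t-SNE call. The random `features`, `labels`, and `palette` arrays are stand-ins, and `mean_class_colors` is a hypothetical helper, so treat it as an illustration rather than the expected solution.

```python
import numpy as np
from sklearn.manifold import TSNE


def mean_class_colors(labels, palette):
    """Average the palette colors of every class active in each image.

    labels:  (N, C) binary multi-hot matrix of ground-truth classes.
    palette: (C, 3) RGB color per class, values in [0, 1].
    """
    labels = np.asarray(labels, dtype=float)
    counts = labels.sum(axis=1, keepdims=True)
    counts = np.maximum(counts, 1.0)  # avoid dividing by zero for unlabeled rows
    return labels @ palette / counts


# Stand-in data in place of real fc7 features and PASCAL labels.
rng = np.random.default_rng(0)
num_images, num_classes, feat_dim = 200, 20, 64
features = rng.normal(size=(num_images, feat_dim))
labels = (rng.random((num_images, num_classes)) < 0.1).astype(int)
palette = rng.random((num_classes, 3))  # any 20 distinct RGB colors would do

# Project the features to 2D and compute one RGB color per point.
embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
colors = mean_class_colors(labels, palette)
```

The resulting `embedding` and `colors` can be passed to `plt.scatter(embedding[:, 0], embedding[:, 1], c=colors)`, and a legend can be built from one `matplotlib.patches.Patch` per class using the same palette.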

In [None]:
# plot t-SNE here

## 5.3 Are some classes harder? (6 pts)
Show the per-class performance of your caffenet (scratch) and ResNet (finetuned) models. This is an open-ended question and you may use any performance metric that makes sense. Try to explain, by observing examples from the dataset, why some classes are harder or easier than the others (consider the easiest and hardest class). Do some classes see large gains due to pre-training? Can you explain why that might happen?
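
Average precision (AP) per class is one metric that fits a multi-label setting like this one. The following sketch assumes a multi-hot `targets` matrix and a matching `scores` matrix of model outputs; the tiny arrays, the two class names, and the `per_class_ap` helper are illustrative assumptions, not results from any model.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def per_class_ap(targets, scores, class_names):
    """Compute average precision separately for each class column."""
    aps = {}
    for c, name in enumerate(class_names):
        if targets[:, c].sum() == 0:
            # AP is undefined when a class has no positive examples.
            aps[name] = float("nan")
        else:
            aps[name] = average_precision_score(targets[:, c], scores[:, c])
    return aps


# Toy example: 4 images, 2 classes (made-up numbers for illustration only).
class_names = ["cat", "dog"]
targets = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
scores = np.array([[0.9, 0.2], [0.1, 0.3], [0.8, 0.6], [0.2, 0.1]])

aps = per_class_ap(targets, scores, class_names)
for name, ap in aps.items():
    print(f"{name}: AP = {ap:.3f}")
```

Running the same function on the outputs of both models gives two per-class tables that can be compared side by side, for example by sorting classes by the AP difference between the two.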

**YOUR ANSWER HERE**