# Extraction of representations in hidden layers

We now start looking at the representations in the hidden layers of
a trained network.

# Guided exercise

First of all, open

    mnist_cnn.py 

and

- familiarize yourself with the code
- launch the script to train the network and save the model in 
      
        mnist_cnn.pt
       
  using the command `python3 mnist_cnn.py --save-model`

Then we will go back to this notebook and

- load the model (see also
https://pytorch.org/tutorials/beginner/saving_loading_models.html)

- extract and visualize its representations with t-SNE

We will not explain t-SNE here, but the following resources are a good starting point:

- the scikit-learn documentation
  https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

- the original paper
  http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
  (also in the bibliography)

- the *Distill* article on how to read t-SNE plots (Distill is a wonderful source of information on neural networks)
  https://distill.pub/2016/misread-tsne/

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torchsummary import summary
from matplotlib import pyplot as plt

This is the same network as in the script

    mnist_cnn.py
    
Normally you would put this class in a module and import it; it is
reproduced here for convenience.

In [3]:
class Net(nn.Module):
    # define the building blocks that we need:
    def __init__(self):
        super(Net, self).__init__()
        # convolutional layer with 1 input channel and 20 filters of size 5x5
        # (the last argument is the stride, kept at 1):
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        
        # another convolutional layer with 20 input channels
        # (the 20 feature maps produced by the 20 filters of conv1);
        # this convolutional layer has 50 filters of size 5x5:
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        
        # fully connected layer from the 4*4*50 flattened features
        # to 500 units, the width of the last hidden layer:
        self.fc1 = nn.Linear(4*4*50, 500)
        # transformation from the last hidden layer to the 10 outputs:
        self.fc2 = nn.Linear(500, 10)

    # describe the flow of information inside the network:
    def forward(self, x):
        # do a convolution and apply the ReLU:
        x = F.relu(self.conv1(x))
        # max pooling over 2x2 windows, reducing the size of the
        # representation by a factor of 2x2=4:
        x = F.max_pool2d(x, 2, 2)
        # again, do a convolution and apply the ReLU:
        x = F.relu(self.conv2(x))
        # further reduce the dimensionality of the representation:
        x = F.max_pool2d(x, 2, 2)
        # flatten: for each data point we want one vector of length 4*4*50:
        x = x.view(-1, 4*4*50)
        # fully connected hidden layer followed by the ReLU:
        x = F.relu(self.fc1(x))
        # last linear transformation, followed by the last
        # non-linearity (log-softmax over the 10 classes):
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
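
The `4*4*50` that appears in `fc1` and in the `view` call is the size of the feature maps after the second pooling. A minimal sketch of the arithmetic, assuming the usual output-size formula for a convolution or pooling without padding (the helper names `conv_out` and `pool_out` are hypothetical and are not part of the script):

```python
def conv_out(size, kernel, stride=1):
    # spatial size after a convolution without padding
    return (size - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # spatial size after max pooling without padding
    return (size - kernel) // stride + 1

size = 28                      # MNIST images are 28x28
size = conv_out(size, 5)       # conv1: 28 -> 24
size = pool_out(size)          # first pooling: 24 -> 12
size = conv_out(size, 5)       # conv2: 12 -> 8
size = pool_out(size)          # second pooling: 8 -> 4
n_features = size * size * 50  # 50 feature maps of size 4x4
print(size, n_features)        # prints: 4 800
```

The value 800 matches the second dimension of `fc1.weight` in the `state_dict` printed further down.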

In [4]:
# Training settings
input_size=(1,28,28,) # note: the input is no longer flattened, unlike for the MLP
batch_size=64
test_batch_size=1000
epochs=1
lr=0.01
momentum=0.0   
seed=1
log_interval=100

In [5]:
use_cuda = torch.cuda.is_available()
torch.manual_seed(seed)
device = torch.device("cuda" if use_cuda else "cpu")
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}

In [6]:
train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=test_batch_size, shuffle=True, **kwargs)

In [7]:
model = Net().to(device)
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

In [8]:
summary(model,input_size)

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 20, 24, 24]             520
            Conv2d-2             [-1, 50, 8, 8]          25,050
            Linear-3                  [-1, 500]         400,500
            Linear-4                   [-1, 10]           5,010
Total params: 431,080
Trainable params: 431,080
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.12
Params size (MB): 1.64
Estimated Total Size (MB): 1.76
----------------------------------------------------------------


In [9]:
print("model's state_dict:")
for p in model.state_dict():
    print(p, "\t", model.state_dict()[p].size())


print("\noptimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

model's state_dict:
conv1.weight 	 torch.Size([20, 1, 5, 5])
conv1.bias 	 torch.Size([20])
conv2.weight 	 torch.Size([50, 20, 5, 5])
conv2.bias 	 torch.Size([50])
fc1.weight 	 torch.Size([500, 800])
fc1.bias 	 torch.Size([500])
fc2.weight 	 torch.Size([10, 500])
fc2.bias 	 torch.Size([10])

optimizer's state_dict:
state 	 {}
param_groups 	 [{'lr': 0.01, 'momentum': 0.0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'params': [140182433811336, 140182433811480, 140182433811408, 140182433811768, 140182433811912, 140182433811840, 140182433811984, 140182433811696]}]


In [10]:

#optimizer.state_dict??

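`optimizer.step` performs the update $\theta \leftarrow \theta - \alpha \langle \nabla_\theta L \rangle$, where $\langle \cdot \rangle$ is an exponentially weighted average over the gradients of the past steps.

Momentum is added through the `momentum` argument of `optim.SGD`, for example

momentum = 0.9

With `momentum=0.0`, as set above, the update uses only the current gradient.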
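
The update performed by `optimizer.step` can be sketched in plain Python. This is only an illustration on a one-dimensional toy loss, following the rule documented for `torch.optim.SGD` with zero dampening (velocity `v = momentum * v + grad`, then `theta = theta - lr * v`). Up to a normalization factor, `v` is an exponentially weighted average of the past gradients. The names `grad_loss` and `sgd_momentum` are hypothetical.

```python
def grad_loss(theta):
    # gradient of the toy loss L(theta) = 0.5 * theta**2
    return theta

def sgd_momentum(theta, lr=0.01, momentum=0.9, steps=100):
    v = 0.0  # exponentially weighted sum of the past gradients
    for _ in range(steps):
        g = grad_loss(theta)
        v = momentum * v + g      # accumulate the current gradient
        theta = theta - lr * v    # parameter update
    return theta

theta_plain = sgd_momentum(1.0, momentum=0.0)  # plain gradient descent
theta_mom = sgd_momentum(1.0, momentum=0.9)    # same steps with momentum
print(theta_plain, theta_mom)
```

On this toy problem the run with momentum ends much closer to the minimum at 0 after the same number of steps.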

In [11]:
for i, p in enumerate(model.parameters()):
    print(i, p.requires_grad)

0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True


We can choose to freeze the first layers, as we will do in the transfer learning exercise.
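
A minimal sketch of freezing, assuming that we want to keep the two convolutional layers fixed and train only the fully connected ones. `TinyNet` is a hypothetical stand-in with the same layer names as `Net`, so that the block runs on its own, and `frozen_optimizer` is likewise a name chosen only for this illustration:

```python
import torch.nn as nn
import torch.optim as optim

class TinyNet(nn.Module):
    # hypothetical stand-in with the same layers as Net
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

net = TinyNet()

# switch off gradient computation for the convolutional layers
for layer in (net.conv1, net.conv2):
    for p in layer.parameters():
        p.requires_grad = False

# pass only the parameters that are still trainable to the optimizer
trainable = [p for p in net.parameters() if p.requires_grad]
frozen_optimizer = optim.SGD(trainable, lr=0.01)

for name, p in net.named_parameters():
    print(name, p.requires_grad)
```
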

In [12]:
print(list(model.parameters())[4][0,0:10])

tensor([ 0.0272, -0.0286, -0.0184,  0.0234, -0.0131,  0.0106, -0.0341, -0.0158,
        -0.0351,  0.0106], grad_fn=<SliceBackward>)


In [None]:
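# map_location lets a checkpoint saved on a GPU be loaded on a CPU-only machine
model.load_state_dict(torch.load('mnist_cnn.pt', map_location=device))
# switch to evaluation mode before doing inference
model.eval()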

In [None]:
print(list(model.parameters())[4][0,0:10])
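
The effect of `load_state_dict` can be checked on a small example. The sketch below is an illustration under stated assumptions, not the code of `mnist_cnn.py`: it saves the `state_dict` of a small hypothetical linear layer to an in-memory buffer instead of a file, and loads it into a second layer that was initialized independently.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
source = nn.Linear(4, 3)   # plays the role of the trained model
target = nn.Linear(4, 3)   # freshly initialized model with the same shape

# compare the weights before loading
print(torch.equal(source.weight, target.weight))

# save the state_dict (an in-memory buffer stands in for mnist_cnn.pt)
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer)
buffer.seek(0)

# load it back; map_location selects the device the tensors end up on
state = torch.load(buffer, map_location="cpu")
target.load_state_dict(state)
target.eval()

# compare the weights after loading
print(torch.equal(source.weight, target.weight))
```
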

Now that we have the trained model back, we want to extract its representations. But how can we do that?
Think about it for a minute before going on...

In [None]:
inputs,labels = next(iter(test_loader))
print(inputs.shape)

In [None]:
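# move the inputs to the model's device and bring the result back to the CPU for NumPy
output = model(inputs.to(device)).detach().cpu().numpy()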

In [None]:
plt.plot(output[0,:],'-o')
print(labels[0])
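
Since `forward` ends with `log_softmax`, the values plotted above are log-probabilities over the 10 classes. A minimal sketch of how to read them, using random scores as a hypothetical stand-in for the network output:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 10)               # hypothetical scores for 3 inputs
log_probs = F.log_softmax(logits, dim=1)  # what the network returns

probs = log_probs.exp()                   # back to probabilities
print(probs.sum(dim=1))                   # each row sums to 1 (up to rounding)
predicted = log_probs.argmax(dim=1)       # most likely class for each input
print(predicted)
```

For the trained model, the position of the maximum of `output[0,:]` should agree with `labels[0]` for most test images.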

# Extraction of representations

Define a new class identical to `Net` but with a method to extract the activations at
the following places:

- after the first ReLU        **(h1)**
- after the first pooling     **(h2)**
- after the second ReLU       **(h3)**
- after the second pooling    **(h4)**
- after the third ReLU        **(h5)**
- at output                   **(h6)**

Then extract all these activations during the forward pass of the input
and analyze them with t-SNE
(including the input itself in the analysis for comparison).
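
The t-SNE step of the analysis can be tried independently of the network. The sketch below applies `sklearn.manifold.TSNE` to synthetic data; the arrays `fake_activations` and `fake_labels` are hypothetical stand-ins for a batch of flattened activations and their digit labels, and the perplexity value is an arbitrary choice for such a small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
# hypothetical stand-in for flattened activations:
# 60 points in 20 dimensions, grouped in 3 well separated clusters
fake_labels = np.repeat(np.arange(3), 20)
fake_activations = rng.randn(60, 20) + 10.0 * fake_labels[:, None]

# the perplexity must be smaller than the number of points
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(fake_activations)
print(embedding.shape)  # (60, 2)
```

With real activations, each representation first has to be reshaped to `(n_samples, n_features)`; the two columns of `embedding` can then be drawn with `plt.scatter`, using the labels as colors.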

In [None]:
class Net2(nn.Module):
    def __init__(self):
        super(Net2, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)
        x = x.view(-1, 4*4*50)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
    
    def extract(self,x):
        ...
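
One possible way to fill in `extract` is sketched below; try the exercise on your own before reading it. This is a sketch under stated assumptions, not a reference solution: the class name `Net2Sketch` is hypothetical, **h6** is taken to be the log-softmax output, and a random batch stands in for the MNIST images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net2Sketch(nn.Module):
    # hypothetical variant of Net2 with one possible `extract`
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

    def extract(self, x):
        h1 = F.relu(self.conv1(x))                  # after the first ReLU
        h2 = F.max_pool2d(h1, 2, 2)                 # after the first pooling
        h3 = F.relu(self.conv2(h2))                 # after the second ReLU
        h4 = F.max_pool2d(h3, 2, 2)                 # after the second pooling
        h5 = F.relu(self.fc1(h4.view(-1, 4*4*50)))  # after the third ReLU
        h6 = F.log_softmax(self.fc2(h5), dim=1)     # at the output
        return h1, h2, h3, h4, h5, h6

    def forward(self, x):
        # the output is the last extracted representation
        return self.extract(x)[-1]

sketch = Net2Sketch()
sketch.eval()
with torch.no_grad():  # no gradients are needed for the analysis
    hs = sketch.extract(torch.randn(8, 1, 28, 28))  # random stand-in batch
for i, h in enumerate(hs, start=1):
    # flatten each representation to (n_samples, n_features) for t-SNE
    print("h%d" % i, tuple(h.shape), tuple(h.reshape(h.shape[0], -1).shape))
```

Because the layer names are the same as in `Net`, the weights saved in `mnist_cnn.pt` can be loaded into such a class with `load_state_dict`, exactly as done above for `model`.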