# Deep Learning Course Final Project: The Lottery Ticket Hypothesis

### Cristiano De Nobili - My Contacts
For any questions or doubts you can find my contacts here:

<p align="center">

[<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Instagram_logo_2016.svg/2048px-Instagram_logo_2016.svg.png" width="20">](https://www.instagram.com/denocris/?hl=it)
[<img src="https://1.bp.blogspot.com/-Rwqcet_SHbk/T8_acMUmlmI/AAAAAAAAGgw/KD_fx__8Q4w/s1600/Twitter+bird.png" width="30">](https://twitter.com/denocris) 
[<img src="https://loghi-famosi.com/wp-content/uploads/2020/04/Linkedin-Simbolo.png" width="40">](https://www.linkedin.com/in/cristiano-de-nobili/)     

</p>

or here (https://denocris.com).

### Useful Links

All notebooks can be found [here!](https://drive.google.com/drive/folders/1i3cNfzWZTNXfvkFVVIIDXjRDdSa9L9Dv?usp=sharing)

Introductory slides [here!](https://www.canva.com/design/DAEa5hLfuWg/-L2EFFfZLVuiDkmg4KiKkQ/view?utm_content=DAEa5hLfuWg&utm_campaign=designshare&utm_medium=link&utm_source=publishsharelink)

Collection of references: [here!](https://denocris.notion.site/Deep-Learning-References-0c5af2dc5c8d40baba19f1328d596fff)


# Introduction

The aim of this exercise is to explore a somewhat mysterious property of deep neural networks 
(DNN), i.e. the existence of very small subnetworks S of a given network N that are *trainable* 
and can reach the same performance of N (or even higher). 

This intriguing observation has been elevated to the rank of an **hypothesis** in the seminal work by Frankle & Carbin [3]

https://arxiv.org/abs/1803.03635

*Any large network that trains successfully contains a subnetwork that is initialized such that - when trained in isolation - it can match the accuracy of the original network in at most the same number of training iterations*

The authors metaphorically called this subnetwork as a *winning ticket*.

In some cases these subnetworks are really small. We can find subnetworks with the  $\approx 1 \%$ or less of the original connections.

Imagine the possible applications of such tiny DNNs. Being small means less memory and less computation required: this can be crucial in developing embedded systems that, given a target performance to be reached, can afford it with lower resources, including energy consumption.

From a theoretical viewpoint there is also another interesting angle. If these subnetworks are so small then, shall we think differently about the problem of *overparametrization*?

Let us state it more clearly. A very diffuse concern about DNNs is the fact that the number of parameters is often vastly larger then the number of data points. For example we will work in this notebook with a network with $\approx 400,000$ *trainable* parameters on the MNIST dataset, that has 60,000 training samples. This means that on average we have 7 parameters for each data point.
If we compare this situation with polynomial regression of a dataset of 10 points, for example, a polynomial of degree 10 (that has 11 parameters) overfits the data perfectly. But in DNNs overfitting - while still there - is not as severe as we could expect.

But now, if we can reduce by two orders of magnitude the number of parameters in DNNs, does this concern still hold?
We don't know the answer, and that's *one* of the reasons why we are here.

We are PyTorch beginners, but what we saw in the course is enough to approach this *research* problem in DNNs, from an empirical viewpoint. Hopefully, this will give you stimuli to explore more in depth these problems, adding your perspective to it.

Another point of this exercise is to make experience of *new strategies in research and knowledge dissemination*. 
Short communications - going straight to the point - are very effective indeed as a *first* exposition to a new subject. 

### Project Evaluation

In order to receive the certification you should work on the following assignments

- Reading assignments;
- Brief answers to the questions in the notebook;
- Completion of the PyTorch code in this notebook;
- write the conclusions.

# Reading assignment

The first part of this exercise is a *reading assignment*.

You should read

- the research account from the Uber Engineering AI team's blog:
https://eng.uber.com/deconstructing-lottery-tickets/


We suggest if you want to go deeper in this, to read also the original references:

- the original paper of Uber AI

https://arxiv.org/abs/1905.01067

- and the paper by Frankle and Carbin

https://arxiv.org/abs/1803.03635

After the readings, we will try to reproduce one of their numerous experiments of the Uber AI paper. Take your time for this and enjoy your reading!


![](./figs_nb/manuscript.jpg)

# Intro to The Lottery Ticket Hypothesis


A recent work by Frankle & Carbin was thus surprising to many researchers when it presented
a simple algorithm for finding sparse subnetworks within larger networks that are trainable.

Here the key word is *trainable*: it is possible to find sparse subnetworks that perform well but they are difficult 
to train directly from scratch.


You may want to check the research described in the original paper

https://arxiv.org/abs/1803.03635

and a video presentation of the original research at ICLR conference 2019

https://www.youtube.com/watch?v=s7DqRZVvRiQ&t=3s

Briefly, their approach for finding these sparse, performant networks is as follows: 

- train a network
- set all weights smaller than some threshold (in absolute value) to zero
- prune them
- rewind the rest of the weights to their initial configuration
- retrain the network from this starting configuration but with the zero weights frozen (not trained)


Using this approach, they obtained two intriguing results

- the pruned networks performed well
- the network trains well only if it is rewound to its initial state

**The Lottery Ticket Algorithm**

[...]We begin by briefly describing the lottery ticket algorithm (we simplify things a bit with respect
to the paper):

- Initialize a mask m to all ones (in the PyTorch code this will be a list of tensors of the same shapes of the ones given by model.parameters()). 

- Randomly initialize the parameters w of a network
$f(x;w \star m)$ ($\star$ stands for elementwise multiplication), in this case of course the multiplication by the mask does not have any effect.

- Train the parameters w of the network $f(x;w \star m)$ to completion. Denote the initial weights
before training wi and the final weights after training wf

- Mask Criterion. Use the mask criterion $M(wi; wf)$ to produce a masking score for each
currently unmasked weight. 

Method to create the mask: 


<span style="color:red">
Rank the weights in each layer by their scores, set the mask
value for the top $p\% $ to 1, the bottom $(100 - p) \% $ to 0. 
The mask selected weights with large final value corresponding to $M(wi;wf) = |wf|$.
</span>   


This text is colored in red because this is what is done in the paper, but we will do it *differently*, as we will explain below!

3. Mask-1 Action. Take some action with the weights with mask value 1. In [3] these weights
were reset to their initial values and marked for training in the next round.

4. Mask-0 Action. Take some action with the weights with mask value 0. In [3] these weights
were pruned: set to 0 and frozen during any subsequent training [...]


We do not consider iterative pruning (as in the original paper by Frankle and Carbin too): we do it just once.

Other masking criteria are possible:


![](./figs_nb/mask_criteria.png)

we will stick for the moment with the original one, which is the large final (LF) mask.


We will do experiments on a small convolutional network trained on the MNIST dataset.

# Imports

Here we start with the necessary imports for the PyTorch version.

In [1]:
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from tqdm import tqdm
from matplotlib import pyplot as plt
import numpy as np
from torchsummary import summary
import os
from os.path import join

# Exercise 1: the Network

Define the network. It should be composed by

* A 1st 2d convolutional layer with 1 input channel, 20 output channels, and a squared window size of 5 (stride 1 is fine);
* A ReLu activation function;
* A Max Pooling layer of square size 2;
* A 2nd 2d convolutional layer of the same window size and with 50 output channels (you must understand the input size by youself);
* A ReLu activation function;
* A Max Pooling layer of square size 2;
* A 1st Linear layers with input size equal to the flattened output size of the last convolution and an output size of 500;
* A 2nd and last Linear layer with input size 500 and output size 10 (MNIST classes);
* A softmax activation function.

We suggest you to look at the documentation of the layers that are involved: convolutional and pooling
layers.

To check your architecture is correct you should see something like this when printing the model:
    
    model = Net().to(device)
    print(model)

    Net(
    (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=800, out_features=500, bias=True)
    (fc2): Linear(in_features=500, out_features=10, bias=True)
    )


    and summary() should give something like this 

    summary(model, (1,28,28))
    
    ----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1           [-1, 20, 24, 24]             520
                Conv2d-2             [-1, 50, 8, 8]          25,050
                Linear-3                  [-1, 500]         400,500
                Linear-4                   [-1, 10]           5,010
    ================================================================
    Total params: 431,080
    Trainable params: 431,080
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.12
    Params size (MB): 1.64
    Estimated Total Size (MB): 1.76
    ----------------------------------------------------------------

In [2]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1 , 20, kernel_size = 5, stride = 1)
        self.conv2 = nn.Conv2d(20, 50, kernel_size = 5, stride = 1)
        self.linear1 = torch.nn.Linear(800, 500)
        self.linear2 = torch.nn.Linear(500, 10)

    def forward(self, x):

        pooling = nn.MaxPool2d(2)
        relu = nn.ReLU()
        flatten = torch.nn.Flatten()
        
        out = self.conv1(x)
        out = relu(out)
        out = pooling(out)
        out = self.conv2(out)
        out = relu(out)
        out = pooling(out)
        out = self.linear1(flatten(out))
        out = self.linear2(out)
        return F.log_softmax(out, dim = 1)

##### Check your Net

In [3]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# instantiate the model
model = Net()
# put the model on the GPU       
model.to(device)
    
print(model)
print(summary(model, (1,28,28)))

Net(
  (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1))
  (linear1): Linear(in_features=800, out_features=500, bias=True)
  (linear2): Linear(in_features=500, out_features=10, bias=True)
)
Layer (type:depth-idx)                   Output Shape              Param #
├─Conv2d: 1-1                            [-1, 20, 24, 24]          520
├─Conv2d: 1-2                            [-1, 50, 8, 8]            25,050
├─Linear: 1-3                            [-1, 500]                 400,500
├─Linear: 1-4                            [-1, 10]                  5,010
Total params: 431,080
Trainable params: 431,080
Non-trainable params: 0
Total mult-adds (M): 2.29
Input size (MB): 0.00
Forward/backward pass size (MB): 0.12
Params size (MB): 1.64
Estimated Total Size (MB): 1.76
Layer (type:depth-idx)                   Output Shape              Param #
├─Conv2d: 1-1                            [-1, 20, 24, 24]          520
├─Conv2d: 1-2 

# Exercise 2: Training and test functions

Before implementing the LT algorithm there are two changes we have to make to the *standard* 
training functions

- add an optional argument *mask*
- write the code that, if a mask is passed, modifies the update in order to freeze the parameters whose mask value is zero

Notice that in PyTorch we can set the flag require_grad to *tensors*, but not to their individual elements. Let us recall this once.

For tensors we could proceed as follows (we did something like this in the transfer learning, when we froze all the network but the last hidden layer):

In [4]:
# show the requires_grad flags
for p in model.parameters():
    print(p.shape, p.requires_grad)

torch.Size([20, 1, 5, 5]) True
torch.Size([20]) True
torch.Size([50, 20, 5, 5]) True
torch.Size([50]) True
torch.Size([500, 800]) True
torch.Size([500]) True
torch.Size([10, 500]) True
torch.Size([10]) True


In [5]:
# freeze first hidden layer as an example
list(model.parameters())[0].requires_grad=False

In [6]:
# check that the flag has changed
for p in model.parameters():
    print(p.requires_grad)

False
True
True
True
True
True
True
True


Let us re-set to True the flag we changed.

In [7]:
list(model.parameters())[0].requires_grad=True

But in order to *freeze* individual elements of our parameter tensors we have to proceed differently:

- compute the gradient with respect to all the parameters 
- set to zero the gradients of the parameters whose mask value is zero
- do the normal update


Complete the code in the following cell:

In [8]:
def train(model, device, train_loader, optimizer, epoch, mask=None):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        
        #----------------------------------------------------------------------------------------------
        # compute the gradient with respect to all the parameters
        loss.backward()
        
        # set to zero the gradients of the freezed parameters if a mask is passed as a parameter        
        if mask is not None:
            try:
                for i,p in enumerate(model.parameters()):
                    p.data = p.data * mask[i]
                    
            except Exception as e:
                print("Exception {} when applying the mask!".format(e))

        
        # parameters update
        optimizer.step()
        #----------------------------------------------------------------------------------------------

Here the test function

In [9]:
def test(model, device, test_loader,verbose=False):
    model.eval()
    loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    loss /= len(test_loader.dataset)
    acc = 100. * correct / len(test_loader.dataset)
    
    if verbose:
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            loss, correct, len(test_loader.dataset), acc))

    return loss,acc

Now we are ready to do the experiment with the LF mask.
Before doing that we have to 

- set our hyperparameters
- deal with the data

Just evaluate the following two cells.

# Hyperparameters settings

In [10]:
batch_size=64

#number of epochs 
epochs=5

#learning rate
lr=0.01

# keep the momentum to 0, otherwise also freezed parameters 
# will move for the momentum contribution to parameters evolution
momentum=0.0

seed=1
torch.manual_seed(seed)
save_model=1

# Load MNIST Dataset

In [11]:
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True, **kwargs)

# Exercise 3: First Standard Training (with no mask)

During the first standard training you will

* initialize the network and send to device;
* store the initial weights $w_i$ of the network;
* initialize the optimizer;
* train the network without any mask (or passing a mask of ones if you prefer);
* store the final weights $w_f$;
* The final weights will be used later to compute the Large Final (LF) mask.

During the training check that your loss is getting smaller and your accuracy on test set higher.

In [12]:
model = Net().to(device)
if not os.path.isdir('./models'):
    os.mkdir('./models')

torch.save(model.state_dict(), 'models/initial.pt')

In [13]:
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

In [14]:
for epoch in range(1, epochs + 1):
    train(model, device, train_loader, optimizer, epoch, mask=None)
    
if (save_model):
    print('Saving model.')
    torch.save(model.state_dict(), './models/final.pt')

Saving model.


# Masks Definitions

We define different masks

- the identity mask, which is useful for debugging: it will leave the training unaltered;
- the random mask, that we will use to make comparisons;
- the Large Final mask (LF mask).

The first two are given, the implementation of the LF mask is an exercise.

In [15]:
def identity_mask(model):
    '''
    Returns the identity mask for all parameter tensors in a list
    '''
    mask = []
    for p in model.parameters():
        mask.append(torch.ones_like(p))
    return mask

def random_mask(model, level=0.0):
    '''
    Construct random mask with a given level of pruning (probability to assign a zero value)
    '''
    # construct random mask
    mask = []
    frac = 0
    tot = 0
    for i,p in enumerate(model.parameters()):
        mask.append( (torch.rand_like(p) > level).float().to(device) )
        frac += torch.sum( mask[i] ).item() 
        tot += mask[i].numel()
    frac = frac/tot
    return mask, frac

# Exercise 4: build the LF Mask

You will implement here a *variant* of the original LF (which is a bit more sophisticated). 

The general idea is just to mark with 1 the weights that at the end of the training are larger then a certain threshold. 

![LF-masks.png](https://drive.google.com/uc?export=view&id=1Xc50-81zo5YPmn3EBu8eGMDnKlFvbAtv)


Concretely you will do it as follows. Let us denote with

- p a vector that contains all the weights;
- |p| its elementwise absolute value;
- m the average of |p|;
- s the standard deviation of p (notice that now we are considering the original parameters, not their absolute values);
- $\alpha$ is a parameter in the range $[0,2]$;

Then a parameter $p_i$ will be considered relevant (mask = 1 for this parameter) if its final value is such that:

$$
|p_i| > m + \alpha \times s
$$

Your function (LF_mask) will have the following signature:
    
    input : model,alpha
        
    output : mask (a list of tensors with the same shape of list(model.parameters()) ) in which to 
             each parameter is associated a value 1 if relevant, 0 if irrelevant
             
             frac (a the fraction of parameters whose mask is equal to 1)
             
             The latter is useful to keep track of the level of parameters'pruning which is 
             implicitly set by alpha
             
Then we will test it and see how many parameters we pruned with a certain $\alpha$.
             
Hint: Basically you will have to concatenate all the parameters in a long numpy array first, in order to compute the statistics you need. Maybe you will find useful the functions torch.where, torch.ones_like, torch.zeros_like. Furthermore, before acting on torch tensors you will need to detach them from the computational graph (see detach() method), send them to the cpu device, and convert them into numpy arrays.

In [16]:
def LF_mask(model,alpha):
    '''
    Construct large final (LF) mask. The threshold for 
    decision is determined globally.
    '''
    frac = 0
    tot = 0
    
    
    # concatenate all parameters into a numpy array
    params = np.array([])
    for i, p in enumerate( model.parameters() ):
        value = p.to('cpu').detach().numpy().reshape(-1)
        params = np.concatenate((params, value), axis=0)
          
    # compute mean of the absolute values and standard deviation
    mean = np.mean(params)
    std = np.std(params) 
        
    # compute the mask
    mask = []
    for i, p in enumerate( model.parameters() ):
        condition = np.abs(p) > mean + alpha*std
        temp_mask = torch.where(condition,1,0)
        
        mask.append(temp_mask)
        
        frac += torch.sum( mask[i] ).item()
        tot += mask[i].numel()

    frac = frac/tot
    
    return mask, frac

##### Test the LF mask

To make a test let us create the mask with the LF function, then print the fraction of weights that received 
a mask equal to 1.
Notice that with $\alpha=1$ you are already pruning a substantial part of your network ($\approx 98 \%$ of the parameters).

This means that the LF mask defines an highly *sparse* subnetwork.

In [17]:
model.load_state_dict(torch.load('models/final.pt') )
alpha = 1
with torch.no_grad():
    mask, frac = LF_mask(model,alpha)
print('Fraction of weights with mask=1:  {}'.format(np.round(frac,3)) )

Fraction of weights with mask=1:  0.384


We can inspect LF mask values

In [18]:
for m, p in zip(mask, model.parameters()):
  print(m)
  break

tensor([[[[1, 1, 1, 0, 1],
          [1, 0, 1, 1, 0],
          [1, 1, 1, 0, 1],
          [1, 1, 1, 1, 1],
          [1, 1, 1, 1, 0]]],


        [[[1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1],
          [0, 1, 0, 0, 1],
          [1, 0, 1, 1, 1],
          [1, 0, 1, 1, 1]]],


        [[[1, 1, 1, 1, 1],
          [1, 0, 1, 1, 1],
          [1, 1, 1, 1, 1],
          [0, 1, 1, 1, 1],
          [1, 0, 1, 0, 1]]],


        [[[1, 1, 1, 1, 1],
          [1, 1, 0, 0, 1],
          [1, 1, 1, 0, 1],
          [1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1]]],


        [[[1, 1, 1, 1, 1],
          [1, 1, 1, 1, 0],
          [1, 1, 1, 1, 0],
          [1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1]]],


        [[[1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1],
          [0, 1, 1, 1, 1]]],


        [[[1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1],
          [1, 1, 0, 1, 1],
          [1, 1, 1, 1, 1],
          [1, 1, 1, 1, 1]]],


        [[[1, 1, 0, 1, 1],


# Exercise 5: Compare the different trainings 

### Rewind to init state and train again: with LF mask

Now we will verify that this highly sparse subnetwork is also *trainable* when if start from the original weights.
    
We will

- rewind the network to its initial state;
- apply the LF mask *before* training: in this way we will obtain our sparse subnetwork;
- retrain just the subnetwork (it is enough to pass the mask to the train function: convince yourself that this is indeed the case);
- evaluate it at the end.

Verify that the subnetwork is trainable if we start from the original weights (how much accuracy is 
reached on the test set?)

We do the initial step and you will complete the following code

In [19]:
# rewind to initial state
model.load_state_dict(torch.load('models/initial.pt') )

# apply mask
for i,p in enumerate(model.parameters()):
    p.data = p.data * mask[i]
    
# re-instantiate optimizer
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

# train
for epoch in tqdm(range(1, epochs + 1)):
    train(model, device, train_loader, optimizer, epoch, mask=None)
    
# evaluate at the end of training
_, acc_trained_subnetwork = test(model, device, test_loader)

print('Accuracy of the subnetwork after the training: {}'.format(acc_trained_subnetwork) )

100%|██████████| 5/5 [02:34<00:00, 30.85s/it]


Accuracy of the subnetwork after the training: 98.51


### Start from a new random init and train again: with LF mask

What happens if we start from a new initialization? Verify also that the subnetwork is
not as good as before in this case (how much accuracy do you reach now? Notice that in this context if a subnetwork reaches *only* the $\approx 90 \%$ - that one might think is not a bad result after all - of performance we consider it as *not trainable*, meaning only that the training is not completely effective).

In [20]:
# rewind to a random state
model = Net().to(device)

# apply mask
for i,p in enumerate(model.parameters()):
    p.data = p.data * mask[i]
    
# re-instantiate optimizer
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

# train
for epoch in tqdm(range(1, epochs + 1)):
    train(model, device, train_loader, optimizer, epoch, mask=None)
    
# evaluate at the end of training
_, acc_trained_subnetwork = test(model, device, test_loader)

print('Accuracy of the subnetwork after the training: {}'.format(acc_trained_subnetwork) )

100%|██████████| 5/5 [02:32<00:00, 30.55s/it]


Accuracy of the subnetwork after the training: 97.85


### LF on randomly initilized net without training

**Question:** What happens if we do not train the network at all, but simply applying the (learned) mask to the randomly initialized network?

It turns out that with a well-chosen mask, an untrained network can already attain a test accuracy far better than chance. This might come as a surprise, because if you use a randomly initialized and untrained network to, say, classify images of handwritten digits from the MNIST dataset, you would expect accuracy to be no better than chance (about $10 \%$). But now imagine you multiply the network weights by a mask containing only zeros and ones. In this instance, weights are either unchanged or deleted entirely, but the resulting network now achieves nearly 40 percent accuracy at the task! This is strange, but it is exactly what we observe with masks created using the large final criterion.

With the randomly initialized network the application of the LF mask sets to zero all the weights that would have vanished whith the training, leaving non-zero only the weights that would have reached a given threshold after training (from the paper: "the benefit derived from freezing values to zero comes from the fact that those values were moving towards zero anyway"). So, the LF mask acts as a training by allowing, from the very first initial state, to have a lower cost function than the one of the dense neural network.

We will perform these steps:

* rewind our network to its initial state
* evaluate it once
* apply the LF mask
* repeat the evaluation on the subnetwork

Consider that **we are not re-training** our network but just applying the LF mask.

**Rewind to the initial network with random weights**

In [21]:
# rewind to initial state
model.load_state_dict(torch.load('models/initial.pt') )

#evaluate the network:
_, acc_init = test(model, device, test_loader)
    
print('Accuracy of the randomly initialized network, without mask: {}'.format(acc_init) )


Accuracy of the randomly initialized network, without mask: 10.78


**Network with random weights + random mask**

Apply a random mask to the random initialized net. You should obtain around $10\%$ of accuracy (the same performance of a random classifier if the number of classes are 10)

In [22]:
# compute random mask

pruning_fraction = 1 - frac

rmask, rfrac = random_mask(model, pruning_fraction)

# apply random mask
for i,p in enumerate(model.parameters()):
    p.data = p.data * rmask[i]
    
#re-evaluate the network:
_, acc_rand_subnetwork = test(model, device, test_loader)

print('Accuracy of the randomly initialized network, with a random mask: {}'.format(acc_rand_subnetwork) )

Accuracy of the randomly initialized network, with a random mask: 17.02


**Network with random weights + LF mask**

In [23]:
# rewind to random
model.load_state_dict(torch.load('models/initial.pt') )

#evaluate the network:
_, acc_subnetwork = test(model, device, test_loader)

print('Accuracy of the randomly initialized subnetwork WITHOUT a LF mask: {}'.format(acc_subnetwork) )

# apply LF mask
# by setting to zero the weights of the frozen parameters 
for i,p in enumerate(model.parameters()):
    p.data = p.data * mask[i]
    
#re-evaluate the network:
_, acc_subnetwork = test(model, device, test_loader)

print('Accuracy of the randomly initialized subnetwork with a LF mask: {}'.format(acc_subnetwork) )

Accuracy of the randomly initialized subnetwork WITHOUT a LF mask: 10.78
Accuracy of the randomly initialized subnetwork with a LF mask: 78.41


# Conclusions

Write down conclusions about the exercise as if it were a scientific paper. In particular

* Summarise findings;
* Rise possible questions;
* highlight future developments or next steps.


I hope that this exercise was a stimulus for you to delve more into this subject, and more 
generally into deep learning problems.

Good Luck!!!

