<a href="https://colab.research.google.com/github/matteoalberti/Lectures_AdvCV_Experis2020/blob/main/exam/pruning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experis 2020 -  Deep Learning 4 Computer Vision Assignment

# The Lottery Ticket Hypothesis

The aim of this exercise is to explore a somewhat mysterious property of deep neural networks 
(DNN), i.e. the existence of very small subnetworks S of a given network N that are *trainable* 
and can reach the same performance as N (or even higher). 

This intriguing observation has been elevated to the rank of a **hypothesis** in the seminal work by Frankle & Carbin [3]:

https://arxiv.org/abs/1803.03635

*Any large network that trains successfully contains a subnetwork that is initialized such that - when trained in isolation - it can match the accuracy of the original network in at most the same number of training iterations*

The authors metaphorically called this subnetwork a *winning ticket*.

In some cases these subnetworks are really small: we can find subnetworks with $\approx 1 \%$ or less of the original connections.

Imagine the possible applications of such tiny DNNs. Being small means less memory and less computation required: this can be crucial in developing embedded systems that must reach a given target performance with fewer resources, including lower energy consumption.

From a theoretical viewpoint there is also another interesting angle. If these subnetworks are so small, should we think differently about the problem of *overparametrization*?

Let us state it more clearly. A widespread concern about DNNs is that the number of parameters is often vastly larger than the number of data points. For example, in this notebook we will work with a network with $\approx 430,000$ *trainable* parameters on the MNIST dataset, which has 60,000 training samples. This means that on average we have about 7 parameters for each data point.
Compare this with polynomial regression on a dataset of 10 points: a polynomial of degree 10 (which has 11 parameters) fits the data perfectly and therefore overfits. In DNNs, however, overfitting - while still present - is not as severe as we might expect.
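
The polynomial comparison can be made concrete with a small sketch. The data here are hypothetical (10 noisy samples of a sine curve), and the degree is set to 9 rather than 10, since 10 coefficients are already enough to pass through 10 points; it only illustrates what "fits the data perfectly" means, not anything specific to DNNs.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy samples of a smooth target (hypothetical toy data).
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(10)

# Degree 9 means 10 coefficients: one free parameter per data point.
coeffs = np.polyfit(x_train, y_train, deg=9)

# Training error: the fitted curve passes through every training point.
train_err = np.max(np.abs(np.polyval(coeffs, x_train) - y_train))

# Error against the noise-free target on a dense grid.
x_test = np.linspace(-1.0, 1.0, 200)
test_err = np.max(np.abs(np.polyval(coeffs, x_test) - np.sin(np.pi * x_test)))

print(f"max training error:        {train_err:.2e}")
print(f"max error on a dense grid: {test_err:.2e}")
```

The training error is at the level of floating-point round-off, while the error away from the training points is not: the polynomial has memorized the noise.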

But now, if we can reduce by two orders of magnitude the number of parameters in DNNs, does this concern still hold?
We don't know the answer, and that's *one* of the reasons why we are here.

We are PyTorch beginners, but what we saw in the course is enough to approach this *research* problem in DNNs from an empirical viewpoint. Hopefully, this will stimulate you to explore these problems in more depth and to add your own perspective.

Another point of this exercise is to gain experience with *new strategies in research and knowledge dissemination*. 
Short communications - going straight to the point - are very effective as a *first* exposure to a new subject. 

### Evaluation

In order to pass the exam you should work on some mandatory assignments.
These are:

- reading assignments
- brief answers to the questions in the notebook
- completion of the PyTorch code in this notebook

If you want to receive full grades

- do all the optional exercises in PyTorch (there are two)

The optional assignments will be marked with 'OPTIONAL' in the following.

# Reading assignment

The first part of this exercise is a mandatory *reading assignment*.

You should read


- the paper this notebook follows (Zhou et al., *Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask*)
https://arxiv.org/abs/1905.01067

- the research account from the Uber Engineering AI team's blog:
https://eng.uber.com/deconstructing-lottery-tickets/


If you want to go deeper, we also suggest reading the original reference by Frankle & Carbin

https://arxiv.org/abs/1803.03635

After the readings, we will try to reproduce one of their numerous experiments. Take your time for this and enjoy your reading!


![](./figs_nb/manuscript.jpg)

# After the readings

Once we are confident with the ideas in the paper we just read

https://arxiv.org/abs/1905.01067

we will take it as a reference, and intertwine some excerpts from the original text and some of its figures 
with code and exercises.

You may now read the notebook in parallel with the paper.

### Introduction 


[...] A recent work by Frankle & Carbin [3] was thus surprising to many researchers when it presented
a simple algorithm for finding sparse subnetworks within larger networks that are trainable[...] 

Here the key word is *trainable*: it is possible to find sparse subnetworks that perform well but they are difficult 
to train directly from scratch.


You may want (OPTIONAL) to check the research described in the original paper

https://arxiv.org/abs/1803.03635


and a video presentation (OPTIONAL) of the original research at the ICLR 2019 conference

https://www.youtube.com/watch?v=s7DqRZVvRiQ&t=3s


Briefly, their approach for finding these sparse, performant networks is as follows: 

- train a network
- set all weights smaller than some threshold (in absolute value) to zero
- prune them
- rewind the rest of the weights to their initial configuration
- retrain the network from this starting configuration but with the zero weights frozen (not trained)


Using this approach, they obtained two intriguing results

- the pruned networks performed well
- the network trains well only if it is rewound to its initial state

**The Lottery Ticket Algorithm**

[...]We begin by briefly describing the lottery ticket algorithm (we simplify things a bit with respect
to the paper):

- Initialize a mask m to all ones (in the PyTorch code this will be a list of tensors with the same shapes as those given by model.parameters()). 

- Randomly initialize the parameters w of a network
$f(x;w \star m)$ ($\star$ stands for elementwise multiplication), in this case of course the multiplication by the mask does not have any effect.

- Train the parameters w of the network $f(x;w \star m)$ to completion. Denote the initial weights
before training wi and the final weights after training wf

- Mask Criterion. Use the mask criterion $M(wi; wf)$ to produce a masking score for each
currently unmasked weight. 

Method to create the mask: 


<span style="color:red">
Rank the weights in each layer by their scores, set the mask
value for the top $p\%$ to 1, the bottom $(100 - p)\%$ to 0. 
The mask selects the weights with a large final value, corresponding to $M(wi;wf) = |wf|$.
</span>   


This text is colored in red because this is what is done in the paper, but we will do it *differently*, as we will explain below!

- Mask-1 Action. Take some action with the weights with mask value 1. In [3] these weights
were reset to their initial values and marked for training in the next round.

- Mask-0 Action. Take some action with the weights with mask value 0. In [3] these weights
were pruned: set to 0 and frozen during any subsequent training [...]
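
The steps above can be sketched on a single toy tensor. This is only an illustration under stated assumptions: the weight values and the keep percentage `p` are made up, and `torch.topk` is just one of several ways to pick the top-scoring weights; it is not the code used in [3].

```python
import torch

# Hypothetical initial and final weights of one small "layer".
w_i = torch.tensor([0.5, -0.1, 0.3, -0.7])
w_f = torch.tensor([0.9, -0.05, 0.02, -1.2])

p = 50  # percentage of weights to keep (assumption for this example)

# Mask criterion: large final, M(wi; wf) = |wf|.
scores = w_f.abs()

# Keep the top p% of the scores in this layer.
k = int(round(p / 100 * scores.numel()))
threshold = torch.topk(scores, k).values.min()
mask = (scores >= threshold).float()

# Mask-1 action: rewind the surviving weights to their initial values.
# Mask-0 action: set the other weights to zero (they stay frozen later on).
w_next = w_i * mask

print("mask:  ", mask)
print("w_next:", w_next)
```

Note that the mask is computed from the *final* weights but applied to the *initial* ones.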


We do not consider iterative pruning (which is also used in the original paper by Frankle and Carbin): we prune just once.

Other masking criteria are possible:


![](./figs_nb/mask_criteria.png)

We will stick for the moment with the original one, which is the large final (LF) mask.


We will do experiments on a small convolutional network trained on the MNIST dataset.

### Imports

Here we start with the necessary imports for the PyTorch version.

In [None]:
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from tqdm import tqdm
from matplotlib import pyplot as plt
import numpy as np
from torchsummary import summary
import os
from os.path import join

### Exercise 1: the Network

Define the network with a class Net as we already did (check conv.py on day2 for example).

To check your architecture is correct you should see something like this when printing the model:
    
    model = Net().to(device)
    print(model)

    Net(
    (conv1): Conv2d(1, 20, kernel_size=(5, 5), stride=(1, 1))
    (conv2): Conv2d(20, 50, kernel_size=(5, 5), stride=(1, 1))
    (fc1): Linear(in_features=800, out_features=500, bias=True)
    (fc2): Linear(in_features=500, out_features=10, bias=True)
    )


and summary() should give something like this:

    summary(model, (1,28,28))
    
    ----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
    ================================================================
                Conv2d-1           [-1, 20, 24, 24]             520
                Conv2d-2             [-1, 50, 8, 8]          25,050
                Linear-3                  [-1, 500]         400,500
                Linear-4                   [-1, 10]           5,010
    ================================================================
    Total params: 431,080
    Trainable params: 431,080
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.12
    Params size (MB): 1.64
    Estimated Total Size (MB): 1.76
    ----------------------------------------------------------------
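
The parameter counts in the table can be checked by hand. The sketch below uses only the layer shapes printed above; the two helper functions are hypothetical names introduced for this example.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    # one (in_channels x k x k) filter plus one bias per output channel
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    # weight matrix plus one bias per output unit
    return in_features * out_features + out_features


counts = {
    "conv1": conv2d_params(1, 20, 5),
    "conv2": conv2d_params(20, 50, 5),
    "fc1": linear_params(800, 500),
    "fc2": linear_params(500, 10),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", total)
print("parameters per MNIST training sample:", total / 60000)
```

Almost all the parameters sit in `fc1`, so this is also where most of the pruning will happen.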

In [None]:
class Net(nn.Module):
    '''
    your code here
    '''

##### Check your Net

In [None]:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

model = Net().to(device)
print(model)
# summary() prints the table itself, so it does not need to be wrapped in print()
summary(model, (1,28,28))

### Exercise 2: Training and test functions

Before implementing the LT algorithm there are two changes we have to make to the *standard* 
training functions (as used for example in conv.py of day2)

- add an optional argument *mask*
- write the code that, if a mask is passed, modifies the update in order to freeze the parameters whose mask value is zero

Notice that in PyTorch we can set the flag requires_grad on whole *tensors*, but not on their individual elements. Let us recall this once.

For tensors we could proceed as follows (we did something like this in the transfer learning, when we froze all the network but the last hidden layer):

In [None]:
# show the requires_grad flags
for p in model.parameters():
    print(p.shape, p.requires_grad)

In [None]:
# freeze the first parameter tensor (the weights of the first layer) as an example
list(model.parameters())[0].requires_grad=False

In [None]:
# check that the flag has changed
for p in model.parameters():
    print(p.requires_grad)

Let us reset the flag we changed to True.

In [None]:
list(model.parameters())[0].requires_grad=True

But in order to *freeze* individual elements of our parameter tensors we have to proceed differently:

- compute the gradient with respect to all the parameters 
- set to zero the gradients of the parameters whose mask value is zero
- do the normal update
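
The three steps above can be tried out on a single toy tensor before touching the real training loop. This is a minimal sketch under stated assumptions: the parameter `w`, the 0/1 `mask` and the quadratic loss are all made up for the example, and multiplying `w.grad` by the mask is one possible way to zero the selected gradients.

```python
import torch

torch.manual_seed(0)

# Toy parameter tensor and a hypothetical mask (1 = trainable, 0 = frozen).
w = torch.nn.Parameter(torch.randn(4))
mask = torch.tensor([1.0, 0.0, 1.0, 0.0])
w_before = w.detach().clone()

# Plain SGD without momentum, as in the notebook's hyperparameters.
optimizer = torch.optim.SGD([w], lr=0.1, momentum=0.0)

optimizer.zero_grad()
loss = (w ** 2).sum()      # toy loss
loss.backward()            # 1) gradient with respect to all the entries
w.grad.mul_(mask)          # 2) zero the gradient where the mask is zero
optimizer.step()           # 3) normal update

print("before:", w_before)
print("after: ", w.detach())
```

Only the entries with mask value 1 change; the others receive a zero gradient and therefore a zero update.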


Complete the code in the following cell:

In [None]:
def train(model, device, train_loader, optimizer, epoch, mask=None):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        
        #----------------------------------------------------------------------------------------------
        # compute the gradient with respect to all the parameters
        '''
        your code here
        '''
        
        # set to zero the gradients of the frozen parameters if a mask is passed as a parameter        
        '''
        your code here
        '''
        
        # parameters update
        '''
        your code here
        '''
        #----------------------------------------------------------------------------------------------

The test function is left unchanged

In [None]:
def test(model, device, test_loader,verbose=False):
    model.eval()
    loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    loss /= len(test_loader.dataset)
    acc = 100. * correct / len(test_loader.dataset)
    
    if verbose:
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            loss, correct, len(test_loader.dataset), acc))

    return loss,acc

Now we are ready to do the experiment with the LF mask.
Before doing that we have to 

- set our hyperparameters
- deal with the data

Just evaluate the following two cells.

# Hyperparameters settings

In [None]:
batch_size=64

#number of epochs 
epochs=5

#learning rate
lr=0.01

# keep the momentum at 0, otherwise the frozen parameters 
# could also move because of the momentum contribution to the update
momentum=0.0

seed=1
torch.manual_seed(seed)
save_model=1

# Data

In [None]:
kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True, **kwargs)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=batch_size, shuffle=True, **kwargs)

# Exercise 3: First Training

During the first training you will

- initialize the network and send it to the device
- store the initial weights (wi) of the network
- init the optimizer
- train the network without any mask (or passing a mask of ones if you prefer)
- store the final weights (wf)

The final weights will be used later to compute the LF mask.

During the training check that your loss is getting smaller and your accuracy on test set higher.

In [None]:
model = Net().to(device)
os.makedirs('./models', exist_ok=True)

torch.save(model.state_dict(), 'models/initial.pt')

In [None]:
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

In [None]:
for epoch in range(1, epochs + 1):
    '''
    your code here
    '''
    
if (save_model):
    print('Saving model.')
    
    '''
    your code here
    '''

### Masks

We define different masks.

- the identity mask, which is useful for debugging: it will leave the training unaltered
- the random mask, that we will use to make comparisons
- the LF (large final) mask

The first two are given, the implementation of the LF mask is an exercise.

In [None]:
def identity_mask(model):
    '''
    Returns the identity mask for all parameter tensors in a list
    '''
    mask = []
    for p in model.parameters():
        mask.append(torch.ones_like(p))
    return mask

def random_mask(model, level=0.0):
    '''
    Construct random mask with a given level of pruning (probability to assign a zero value)
    '''
    # construct random mask
    mask = []
    frac = 0
    tot = 0
    for i,p in enumerate(model.parameters()):
        mask.append( (torch.rand_like(p) > level).float().to(device) )
        frac += torch.sum( mask[i] ).item() 
        tot += mask[i].numel()
    frac = frac/tot
    return mask, frac

### Exercise 4: the LF Mask

The original LF mask (as described in the introduction in the red text) is left as an **OPTIONAL Exercise 1 (see at the end of this notebook)**.
You will implement here a *variant* of the original LF (large final) mask. 

The general idea is just to mark with 1 the weights that at the end of the training are 
larger (in absolute value) than a certain threshold. This is explained in the following figure

![](./figs_nb/LF_mask.png)

Concretely you will do it as follows. Let us denote with

- p a vector that contains all the weights
- |p| its elementwise absolute value
- m the average of |p|
- s the standard deviation of p (notice that now we are considering the original parameters, not their absolute values)
- $\alpha$ is a parameter in the range $[0,2]$

Then a parameter $p_i$ will be considered relevant (mask = 1 for this parameter) if its final value is such that:

$$
|p_i| > m + \alpha \times s
$$
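
A worked numerical example of this criterion, on a made-up vector of six "final weights" (the values and the choice $\alpha = 0.5$ are assumptions for illustration only). It uses NumPy's default population standard deviation; `torch.std` uses the unbiased estimator by default, which gives a slightly different threshold.

```python
import numpy as np

# Hypothetical final weights, all concatenated into one vector.
p = np.array([0.1, -0.2, 0.05, 1.5, -2.0, 0.3])
alpha = 0.5

m = np.mean(np.abs(p))   # average of |p|
s = np.std(p)            # standard deviation of p (not of |p|)
threshold = m + alpha * s

# A weight is relevant (mask = 1) if |p_i| exceeds the threshold.
mask = (np.abs(p) > threshold).astype(float)
frac = mask.sum() / mask.size

print("m =", m, " s =", s, " threshold =", threshold)
print("mask:", mask)
print("fraction of weights kept:", frac)
```

Only the two weights with the largest magnitude survive here; increasing `alpha` raises the threshold and prunes more.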

Your function (LF_mask) will have the following signature:
    
    input : model, alpha
        
    output : mask (a list of tensors with the same shapes as list(model.parameters())) in which 
             each parameter is assigned the value 1 if relevant, 0 if irrelevant
             
             frac (the fraction of parameters whose mask is equal to 1)
             
             The latter is useful to keep track of the level of parameter pruning, which is 
             implicitly set by alpha
             
Then we will test it and see how many parameters we pruned with a certain $\alpha$.
             
Hint: Basically you will first have to concatenate all the parameters into one long numpy array, in order to compute the statistics you need. You may find the functions torch.where, torch.ones_like and torch.zeros_like useful. Furthermore, before converting torch tensors you will need to detach them from the computational graph (see the detach() method), send them to the cpu device, and convert them into numpy arrays.

In [None]:
def LF_mask(model,alpha):
    '''
    Construct large final (LF) mask. The threshold for 
    decision is determined globally.
    '''
    frac = 0
    tot = 0
    
    
    # concatenate all parameters into a numpy array
    '''
    your code here
    '''
       
    # compute mean of the absolute values and standard deviation         
    '''
    your code here
    '''
        
        
    # compute the mask
    mask = []
    for i, p in enumerate( model.parameters() ):
        
        '''
        your code here
        '''
        
        mask.append(...)
        
        frac += torch.sum( mask[i] ).item()
        tot += mask[i].numel()

    frac = frac/tot
    
    return mask, frac

##### Test the LF mask

To make a test let us create the mask with the LF function, then print the fraction of weights that received 
a mask equal to 1.
Notice that with $\alpha=1$ you are already pruning a substantial part of your network ($\approx 98 \%$ of the parameters).

This means that the LF mask defines a highly *sparse* subnetwork.

In [None]:
model.load_state_dict(torch.load('models/final.pt') )
alpha = 1
with torch.no_grad():
    mask, frac = LF_mask(model,alpha)
print('Fraction of weights with mask=1:  {}'.format(np.round(frac,3)) )

# Exercise 5: Retraining with the LF mask

Now we will verify that this highly sparse subnetwork is also *trainable* when we start from the original weights.
    
We will

- rewind the network to its initial state
- apply the LF mask *before* training: in this way we will obtain our sparse subnetwork
- retrain just the subnetwork (it is enough to pass the mask to the train function: convince yourself that this is indeed the case)
- evaluate it at the end

Verify that the subnetwork is trainable if we start from the original weights (how much accuracy is 
reached on the test set?).
What happens if we start from a new initialization? Verify also that the subnetwork is
not as good as before in this case. How much accuracy do you reach now? Notice that in this context, if a subnetwork reaches *only* $\approx 90 \%$ performance (which one might think is not a bad result after all), we consider it *not trainable*, meaning only that the training is not completely effective.

We do the initial step and you will complete the following code

In [None]:
# rewind
'''
your code here
'''

# apply mask
'''
your code here
'''
    
# re-instantiate optimizer
'''
your code here
'''

# train
for epoch in tqdm(range(1, epochs + 1)):
    '''
    your code here
    '''
    
# evaluate at the end of training
_, acc_trained_subnetwork = test(model, device, test_loader)

print('Accuracy of the subnetwork after the training: {}'.format(acc_trained_subnetwork) )

In [None]:
# rewind to a random state
model = Net().to(device)

# apply mask
'''
your code here
'''
    
# re-instantiate optimizer
'''
your code here
'''

# train
for epoch in tqdm(range(1, epochs + 1)):
    '''
    your code here
    '''
    
# evaluate at the end of training
_, acc_trained_subnetwork = test(model, device, test_loader)

print('Accuracy of the subnetwork after the training: {}'.format(acc_trained_subnetwork) )

# Exercise 6: The LF is actually a very good mask (not yet the Supermask but...)

Let us read again this sentence from the paper in section **5 Supermasks**:

[...] The hypothesis above suggests that for certain mask criteria, like large final, that masking is training:
the masking operation tends to move weights in the direction they would have moved during training.

If so, just how powerful is this training operation? To answer this question, we can start all the way
from the beginning—not training the network at all, but simply applying a mask to the randomly
initialized network.

It turns out that with a well-chosen mask, an untrained network can already attain a test accuracy
far better than chance. This might come as a surprise, because if you use a randomly initialized and
untrained network to, say, classify images of handwritten digits from the MNIST dataset, you would
expect accuracy to be no better than chance (about $10 \%$). But now imagine you multiply the network
weights by a mask containing only zeros and ones. In this instance, weights are either unchanged or
deleted entirely, but the resulting network now achieves nearly 40 percent accuracy at the task! This
is strange, but it is exactly what we observe with masks created using the large final criterion [...]


First of all two questions:

- what is the hypothesis the authors are referring to at the beginning of this excerpt?
- why is it, in your opinion, reasonable that you could - at least in principle - obtain better than chance results by applying the LF mask to a randomly initialized network?

Write short answers to these questions below.

### Answer 1

### Answer 2

Our aim is now to reproduce the phenomenon shown in this plot

![](./figs_nb/supermask.png)

We do not want to reproduce the numbers but just the *masking is training* effect.

We will perform these steps:

- rewind our network to its initial state
- evaluate it once
- apply the LF mask
- repeat the evaluation on the subnetwork

Consider that we are not re-training our network but just applying the LF mask.

Then do the same for a random mask, creating one, with a level of pruning that
matches the one found for the LF mask.

    Hint: you should use 
    
    rmask, rfrac = random_mask(model, 1-frac)
    
    where frac is computed on the LF mask

What do you see? Are these results qualitatively similar to the ones plotted above?
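
The idea behind the hint, matching the pruning level of the random mask to that of the LF mask, can be checked in isolation. In this sketch the value of `frac` and the stand-in tensor `p` are assumptions; the mask is built with the same `torch.rand_like(p) > level` comparison used in `random_mask`.

```python
import torch

torch.manual_seed(0)

frac = 0.02          # hypothetical fraction of weights kept by the LF mask
level = 1 - frac     # probability of assigning a zero in the random mask

# Stand-in for a parameter tensor (only its shape matters here).
p = torch.zeros(1_000_000)

# Each entry is kept independently with probability 1 - level = frac.
rmask = (torch.rand_like(p) > level).float()
rfrac = rmask.mean().item()

print("target fraction kept:", frac)
print("random mask fraction:", rfrac)
```

The two fractions agree only on average: the random mask keeps about the same *number* of weights as the LF mask, but not the same ones.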

In [None]:
# rewind
'''
your code here
'''

# eval
_, acc_init = test(model, device, test_loader)

print('Accuracy of the randomly initialized network: {}'.format(acc_init) )

In [None]:
# apply mask
'''
your code here
'''
    
# eval the sub-network
_, acc_subnetwork = test(model, device, test_loader)

print('Accuracy of the randomly initialized subnetwork: {}'.format(acc_subnetwork) )

In [None]:
# create random mask
'''
your code here
'''

print('Fraction of weights with mask=1:  {}'.format(np.round(rfrac,3)) )

In [None]:
# rewind
'''
your code here
'''

# apply random mask
'''
your code here
'''

# eval the random sub-network
_, acc_rand_subnetwork = test(model, device, test_loader)

print('Accuracy of the randomly initialized network, with a random mask: {}'.format(acc_rand_subnetwork) )

Let us stop for a moment and think about what we have obtained so far.
With this version of LF masking we were able to identify a sub-network S 
of the original network N (orders of magnitude smaller) that had the following properties

- it was *trainable*, recovering almost the original performance
- it was still very good at recognizing digits (well above chance) *without* any training

These are amazing results, aren't they?

What is left, for our experiments, is to see how far we can go with the pruning. So far we
have a subnetwork whose size is $\approx 2\%$ of the original.

### Exercise 7: How far can we go with the miniaturization?

The next task is to understand how much we can shrink our network without
substantial harm.

We will do the same analysis as in Exercise 5 for a sequence of increasing $\alpha$ from 0 to 2, and 
then we will plot the results.

The useful plots are fraction of surviving weights vs test accuracy, like the ones in this plot

![](./figs_nb/plot_example.png)


(but notice that your plots will not start from $100 \%$ weights remaining, because for $\alpha=0$ you are
already pruning weights...)
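
To get a feeling for why $\alpha=0$ already prunes, here is a back-of-the-envelope calculation under a strong assumption: that the weights follow a zero-mean Gaussian with standard deviation $\sigma$. Trained weights need not be Gaussian, so the fractions you measure in the notebook will differ; the function name is hypothetical.

```python
import math


def surviving_fraction(alpha, sigma=1.0):
    """Fraction of zero-mean Gaussian weights with |p| > m + alpha * s."""
    m = sigma * math.sqrt(2.0 / math.pi)   # mean of |p| for a Gaussian
    threshold = m + alpha * sigma          # here s = sigma
    # P(|p| > t) = erfc(t / (sigma * sqrt(2))) for a zero-mean Gaussian
    return math.erfc(threshold / (sigma * math.sqrt(2.0)))


for alpha in (0.0, 0.5, 1.0, 2.0):
    print(f"alpha = {alpha}: fraction kept = {surviving_fraction(alpha):.4f}")
```

Under this assumption, at $\alpha=0$ the threshold is the mean absolute value, and a bit more than 40% of the weights lie above it: more than half of the weights are pruned before $\alpha$ is even increased.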


For each value of alpha do just one run (we will not compute error bars because it would take too long).
At the end you will have a plot with just one curve, without error bars.

If it takes too long just check a few values of $\alpha$, 
for example $\alpha=0.5,1.0,1.5$ and print a list of (fraction, performance)

You will have to complete the function

    alpha_training
   
below.

In [None]:
num_levels=10
# alpha ranges from 0 to 2, as stated in the text of Exercises 4 and 7
alphas = np.linspace(0,2,num_levels)

# uncomment this if it takes too long
#alphas = np.array([0.5, 1.0, 1.5])

In [None]:
def alpha_training(model, alpha, epochs):
    
    # load final weights and compute mask at level alpha
    '''
    your code here
    '''
        
    # rewind
    '''
    your code here
    '''
        
    # apply mask
    '''
    your code here
    '''
    
    # reinit optimizer
    '''
    your code here
    '''
    
    # train
    stats = []
    for epoch in range(1, epochs + 1):
        '''
        your code here
        '''
    stats = np.array(stats)
        
    return stats,frac

Iterate and save results

In [None]:
Stats = []
Fracs = []
for alpha in tqdm(alphas):  
    stats, fracs = alpha_training(model, alpha, epochs)
    Stats.append(stats)
    Fracs.append(fracs)
Stats = np.asarray(Stats)
Fracs = np.asarray(Fracs)
np.save('models/Stats_LF_mask.npy', Stats)
np.save('models/Fracs_LF_mask.npy', Fracs)

# Plots

In [None]:
fig = plt.figure(figsize=(10,7))
plt.semilogx(Fracs*100, Stats[:,-1,-1],'-bo')
plt.ylabel('test accuracy')
plt.xlabel('weights remaining %')
plt.ylim([90,100])
plt.gca().invert_xaxis()
plt.show()

# Exercise 8: Compare LF mask with random masking

The next task is to compare the efficiency of the LF mask with that of random masking. This goes as before, the only difference being how you compute the mask.
You will just have to complete the function
    
    random_training
    
    
When you finish, plot the results of the LF mask and the random mask together.

In [None]:
def random_training(model, level, epochs):
    
    # compute mask at a pruning level
    '''
    your code here
    '''
        
    # rewind
    '''
    your code here
    '''
        
    # apply mask
    '''
    your code here
    '''
    
    # reinit optimizer
    '''
    your code here
    '''
    
    # train
    stats = []
    for epoch in range(1, epochs + 1):
        '''
        your code here
        '''
    stats = np.array(stats)
        
    return stats,frac

In [None]:
# we define the levels of pruning here, from 0 to 1
pruning_levels = np.linspace(0,1,num_levels)
print(pruning_levels)

In [None]:
Stats = []
Fracs = []

for level in tqdm(pruning_levels):  
    stats, fracs = random_training(model, level, epochs)
    Stats.append(stats)
    Fracs.append(fracs)
Stats = np.asarray(Stats)
Fracs = np.asarray(Fracs)

np.save('models/Stats_random_mask.npy', Stats)
np.save('models/Fracs_random_mask.npy', Fracs)

In [None]:
Stats = np.load('models/Stats_LF_mask.npy')
Fracs = np.load('models/Fracs_LF_mask.npy')
Stats_rand = np.load('models/Stats_random_mask.npy')
Fracs_rand = np.load('models/Fracs_random_mask.npy')

# Plot together

In [None]:
fig = plt.figure(figsize=(10,7))
plt.semilogx(Fracs*100, Stats[:,-1,-1],'-bo',label='LF mask (freeze at zero)')
# we leave out the last point if the frac is 0, in order to avoid the log(0)
plt.semilogx(Fracs_rand[:-1]*100, Stats_rand[:-1,-1,-1],'-ro',label='random mask')
plt.ylabel('test accuracy')
plt.xlabel('weights remaining %')
# a log-scaled axis cannot start at 0, so we only set the upper limit
plt.xlim(right=110)
plt.ylim([90,100])
plt.gca().invert_xaxis()
plt.legend()
plt.show()

It should be clear from this plot that 

- if we do random masking, pruning more than $20 \%$ of the parameters severely affects the network performance 
- if we use the mask given by the LF criterion, we can prune much more without appreciably affecting the performance

This means that, for example, we could endow an embedded system - at least in principle - with a tenfold smaller network just by using a very simple masking method.

# Conclusions

We hope that this exercise was a stimulus for you to delve more into this subject, and more 
generally into deep learning problems.

Good Luck!!!



Not really!

# OPTIONAL ASSIGNMENTS

If you want to receive full grades you should work on the following two exercises.

<span style="color:blue">These exercises are just sketched: incomplete but reasonable attempts to solve them will be considered for evaluation. Variations on these themes, inspired by your curiosity (for example in the case of Exercise 1 you could also implement another type of mask, I would suggest in this case the *magnitude increase* mask), will be considered equivalently.</span>


### Exercise 1

Implement the original version of the LF mask (or another one of your choice).
Check that it works for a few values of the pruning level and, if you have the time and the will, plot the results as a function of the pruning level.

### Exercise 2

Pruning means setting exactly to zero the weights we want to eliminate.
What if, on the contrary, we keep these weights frozen at their initialization values? 

In this case notice that we are *not* pruning the network.
In order to do that there is just a very simple thing to do (or not do) in the function alpha_training.
Find it, make this new experiment and plot all the 
three results together (with LF masking, random masking and this version). 

(You can do that with both versions of the LF masking: the original version of the optional Exercise 1 or our version: choose whatever you want).

If you want, you could also add a test to check that the frozen parameters 
do not move from their initial values, by constructing a suitable alert function.