# Homework 1

Provide the solutions to the homework either as (a) a Python script or (b) a Jupyter notebook (preferred choice).

Points 1, 2, 4 should be provided as code.

Point 3 should be provided in a markdown cell (if Jupyter notebook) or in a multiline comment (if Python script).

1. Taking inspiration from the notebook `01-intro-to-pt.ipynb`, build a class for the Multilayer Perceptron (MLP) whose scheme is drawn in the last figure of the notebook. As stated there, no layer should have bias units, and the activation for each hidden layer should be the Rectified Linear Unit (ReLU) function, also called the ramp function. The activation leading to the output layer should instead be the softmax function, which Prof. Ansuini explained during the last lecture. The notebook also contains some notes on it.


2. After having defined the class, create an instance of it and print a summary using a method of your choice.


3. Provide detailed calculations (layer-by-layer) on the exact number of parameters in the network.

   1. Provide the same calculation in the case that the bias units are present in all layers (except input).
   
   
4. For each layer within the MLP, calculate the L2 norm and L1 norm of its parameters.

## Point 1

Let us suppose we wish to build a larger model from the graph below.

![](img/mlp_graph_larger.jpg)

We suppose that

1. The layers have no bias units
2. The activation function for hidden layers is `ReLU`

Moreover, we suppose that this is a classification problem.

As you might recall, when the number of classes is greater than 2, we encode the problem so that the output layer has as many neurons as there are classes. This establishes a one-to-one correspondence between output units and classes. The value of the $j$-th neuron represents the **confidence** of the network in assigning a given data instance to the $j$-th class.

Classically, when the network is encoded in this way, the activation function for the final layer is the **softmax** function.
If $C$ is the total number of classes and $z_1, \ldots, z_C$ are the values of the output units before activation,

$\mathrm{softmax}(z_j) = \frac{\exp(z_j)}{\sum_{k=1}^C \exp(z_k)}$

where $j\in \{1,\ldots,C\}$ is one of the classes.

If we repeat this calculation for every $j$, we end up with $C$ normalized values (each between 0 and 1, and summing to 1), which can be interpreted as the probabilities that the network assigns the instance to the corresponding classes.
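As a quick check of the formula above, here is a minimal pure-Python sketch of softmax. The function name `softmax` and the example scores are hypothetical and chosen only for illustration; subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(z):
    """Return the softmax of a list of raw scores z (one score per class)."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    z_max = max(z)
    exps = [math.exp(z_j - z_max) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs of a network with C = 4 classes.
scores = [2.0, 1.0, 0.1, -1.0]
probs = softmax(scores)

print(probs)       # four values, each strictly between 0 and 1
print(sum(probs))  # 1.0 up to floating-point rounding
```

The largest score receives the largest probability, so the predicted class is the index of the maximum entry of `probs`.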

In [1]:
import torch

In [2]:
class MLP(torch.nn.Module):
    
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_features =  5, out_features = 11, bias = False)
        self.layer2 = torch.nn.Linear(in_features = 11, out_features = 16, bias = False)
        self.layer3 = torch.nn.Linear(in_features = 16, out_features = 13, bias = False)
        self.layer4 = torch.nn.Linear(in_features = 13, out_features = 8, bias = False)
        self.layer5 = torch.nn.Linear(in_features =  8, out_features = 4, bias = False)
        
    def forward(self, X):
        out = self.layer1(X)
        out = torch.nn.functional.relu(out)
        out = self.layer2(out)
        out = torch.nn.functional.relu(out)
        out = self.layer3(out)
        out = torch.nn.functional.relu(out)
        out = self.layer4(out)
        out = torch.nn.functional.relu(out)
        out = self.layer5(out)
        # Normalize over the class dimension (the last one); omitting dim is deprecated.
        out = torch.nn.functional.softmax(out, dim=-1)
        return out

## Point 2

In [3]:
model = MLP()
model

MLP(
  (layer1): Linear(in_features=5, out_features=11, bias=False)
  (layer2): Linear(in_features=11, out_features=16, bias=False)
  (layer3): Linear(in_features=16, out_features=13, bias=False)
  (layer4): Linear(in_features=13, out_features=8, bias=False)
  (layer5): Linear(in_features=8, out_features=4, bias=False)
)

In [4]:
model.state_dict()

OrderedDict([('layer1.weight',
              tensor([[ 0.2176, -0.2026, -0.1615,  0.3598, -0.3251],
                      [-0.3639,  0.2191,  0.2360,  0.4328,  0.3723],
                      [-0.0855,  0.0101, -0.2407,  0.3295, -0.1332],
                      [-0.0401,  0.2312,  0.4157, -0.0795,  0.3541],
                      [-0.3643, -0.3908, -0.0275, -0.1664,  0.4195],
                      [-0.3084, -0.0194,  0.1615,  0.2360, -0.0530],
                      [ 0.1561, -0.1288, -0.2637, -0.0754, -0.1483],
                      [ 0.2527,  0.2568, -0.3527,  0.2794,  0.1436],
                      [ 0.0716, -0.2470, -0.1089, -0.1401,  0.1881],
                      [-0.2801,  0.0522, -0.2089, -0.1880,  0.3127],
                      [-0.0431,  0.2016, -0.1951,  0.1926,  0.2564]])),
             ('layer2.weight',
              tensor([[ 0.1107, -0.2437, -0.0147, -0.0664,  0.1571,  0.0306,  0.0584,  0.0386,
                       -0.2491, -0.0664,  0.0020],
                      [ 0.0498

In [5]:
from torchsummary import summary

_ = summary(model)

Layer (type:depth-idx)                   Param #
├─Linear: 1-1                            55
├─Linear: 1-2                            176
├─Linear: 1-3                            208
├─Linear: 1-4                            104
├─Linear: 1-5                            32
Total params: 575
Trainable params: 575
Non-trainable params: 0


## Point 3

Without bias units, a layer with $n_{in}$ inputs and $n_{out}$ outputs has $n_{in} \times n_{out}$ parameters. The first layer has $5 \times 11 = 55$ parameters, the second $11 \times 16 = 176$, the third $16 \times 13 = 208$, the fourth $13 \times 8 = 104$, and the fifth $8 \times 4 = 32$, so in total we have $55 + 176 + 208 + 104 + 32 = 575$ parameters.

If we also include the bias, each output unit gains one extra parameter, so a layer has $(n_{in} + 1) \times n_{out}$ parameters. The first layer has $(5 + 1) \times 11 = 66$ parameters, the second $(11 + 1) \times 16 = 192$, the third $(16 + 1) \times 13 = 221$, the fourth $(13 + 1) \times 8 = 112$, and the fifth $(8 + 1) \times 4 = 36$, so in total we have $66 + 192 + 221 + 112 + 36 = 627$ parameters.
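The same arithmetic can be reproduced with a short pure-Python sketch. The helper `count_parameters` is a hypothetical name introduced here for illustration; the layer widths are the ones used in the `MLP` class.

```python
# Layer widths of the MLP: input, four hidden layers, output.
layer_sizes = [5, 11, 16, 13, 8, 4]

def count_parameters(sizes, bias):
    """Return the per-layer parameter counts of a fully connected network."""
    counts = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # n_in * n_out weights, plus one bias per output unit if requested.
        counts.append(n_in * n_out + (n_out if bias else 0))
    return counts

without_bias = count_parameters(layer_sizes, bias=False)
with_bias = count_parameters(layer_sizes, bias=True)

print(without_bias, sum(without_bias))  # [55, 176, 208, 104, 32] 575
print(with_bias, sum(with_bias))        # [66, 192, 221, 112, 36] 627
```

The difference between the two totals, 52, is the number of non-input units (11 + 16 + 13 + 8 + 4), one bias each.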

## Point 4

In [6]:
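print('Computation of the L1 norm of the parameters:')
# Note: for a 2-D weight, torch.linalg.norm(W, 1) is the induced matrix 1-norm
# (maximum absolute column sum), not the sum of the absolute values of all entries.
for n, (name, weight) in enumerate(model.state_dict().items()):
    print(f'Layer {n+1} has norm {torch.linalg.norm(weight, 1)}')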

Computation of the L1 norm of the parameters:
Layer 1 has norm 2.7061924934387207
Layer 2 has norm 3.1074023246765137
Layer 3 has norm 2.067498207092285
Layer 4 has norm 1.487830400466919
Layer 5 has norm 0.8834104537963867


In [7]:
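print('Computation of the L2 norm of the parameters:')
# Note: for a 2-D weight, torch.linalg.norm(W, 2) is the spectral norm
# (largest singular value), not the square root of the sum of squared entries.
for n, (name, weight) in enumerate(model.state_dict().items()):
    print(f'Layer {n+1} has norm {torch.linalg.norm(weight, 2)}')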

Computation of the L2 norm of the parameters:
Layer 1 has norm 1.1267356872558594
Layer 2 has norm 1.3487457036972046
Layer 3 has norm 0.9211719036102295
Layer 4 has norm 0.9701603055000305
Layer 5 has norm 0.7465438842773438
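The values above are induced matrix norms, because `torch.linalg.norm` with `ord` set to 1 or 2 treats a 2-D tensor as a matrix. If the exercise is read as asking for entrywise norms (all weights of a layer taken as one flat vector), `torch.linalg.vector_norm(weight, ord=1)` and `torch.linalg.vector_norm(weight, ord=2)` would be the corresponding calls, since they flatten the tensor when no `dim` is given. The pure-Python sketch below, on a small hypothetical matrix `W`, illustrates how the two readings differ for the L1 case and shows the entrywise L2 norm.

```python
import math

# Hypothetical 2 x 3 weight matrix (rows = output units, columns = input units).
W = [[1.0, -2.0, 3.0],
     [-4.0, 5.0, -6.0]]

# Flatten the matrix: the entrywise norms ignore the row/column structure.
entries = [w for row in W for w in row]

# Entrywise L1 norm: sum of absolute values of all weights.
l1_entrywise = sum(abs(w) for w in entries)

# Entrywise L2 norm (Frobenius norm): square root of the sum of squares.
l2_entrywise = math.sqrt(sum(w * w for w in entries))

# Induced matrix 1-norm: largest absolute column sum.
n_cols = len(W[0])
l1_induced = max(sum(abs(row[j]) for row in W) for j in range(n_cols))

print(l1_entrywise)  # 21.0
print(l2_entrywise)  # sqrt(91), about 9.539
print(l1_induced)    # 9.0
```

The entrywise L1 norm is never smaller than the induced 1-norm, which is consistent with the relatively small L1 values printed above for layers with dozens of weights.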
