# Homework 1

Provide the solutions to the homework in either (a) a Python script or (b) a Jupyter notebook (preferred).

Points 1, 2, 4 should be provided as code.

Point 3 should be provided in a markdown cell (if Jupyter notebook) or in a multiline comment (if Python script).

1. Taking inspiration from the notebook `01-intro-to-pt.ipynb`, build a class for the Multilayer Perceptron (MLP) whose scheme is drawn in the last figure of the notebook. As written there, no layer should have bias units, and the activation for each hidden layer should be the Rectified Linear Unit (ReLU) function, also called the ramp function. The activation leading to the output layer should instead be the softmax function, which Prof. Ansuini explained during the last lecture. You can also find some notes on it in the notebook.


2. After having defined the class, create an instance of it and print a summary using a method of your choice.


3. Provide detailed, layer-by-layer calculations of the exact number of parameters in the network.

   1. Provide the same calculation in the case that the bias units are present in all layers (except input).
   
   
4. For each layer within the MLP, calculate the L2 norm and L1 norm of its parameters.

## Point 1

Let us suppose we wish to build a larger model from the graph below.

![](img/mlp_graph_larger.jpg)

We suppose that

1. The layers have no bias units
2. The activation function for hidden layers is `ReLU`

Moreover, we suppose that this is a classification problem.

As you might recall, when the number of classes is greater than 2, we encode the problem so that the output layer has as many neurons as there are classes. This establishes a one-to-one correspondence between output units and classes. The value of the $j$-th neuron represents the **confidence** of the network in assigning a given data instance to the $j$-th class.

Classically, when the network is encoded in this way, the activation function for the final layer is the **softmax** function.
If $C$ is the total number of classes,

$\mathrm{softmax}(z_j) = \frac{\exp(z_j)}{\sum_{k=1}^C \exp(z_k)}$

where $j\in \{1,\cdots,C\}$ is one of the classes.

If we repeat this calculation for every $j$, we end up with $C$ normalized values (each between 0 and 1, and summing to 1), which can be interpreted as the probabilities that the network assigns the instance to the corresponding classes.
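As a quick numerical check of the formula above, here is a minimal sketch in plain Python. The function name `softmax` and the scores in `z` are hypothetical and chosen only for illustration; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(z):
    """Softmax of a list of scores, following the formula above."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one instance with C = 4 classes.
z = [1.0, 2.0, 0.5, -1.0]
p = softmax(z)

print(p)       # one value per class, each between 0 and 1
print(sum(p))  # 1.0 up to floating-point rounding
```

The largest score (the second one here) receives the largest probability, so taking the index of the maximum output gives the predicted class.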

In [1]:
import torch

In [2]:
class MLP(torch.nn.Module):
    
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_features =  5, out_features = 11, bias = False)
        self.layer2 = torch.nn.Linear(in_features = 11, out_features = 16, bias = False)
        self.layer3 = torch.nn.Linear(in_features = 16, out_features = 13, bias = False)
        self.layer4 = torch.nn.Linear(in_features = 13, out_features = 8, bias = False)
        self.layer5 = torch.nn.Linear(in_features =  8, out_features = 4, bias = False)
        
    def forward(self, X):
        out = self.layer1(X)
        out = torch.nn.functional.relu(out)
        out = self.layer2(out)
        out = torch.nn.functional.relu(out)
        out = self.layer3(out)
        out = torch.nn.functional.relu(out)
        out = self.layer4(out)
        out = torch.nn.functional.relu(out)
        out = self.layer5(out)
        # Normalize over the class dimension; omitting `dim` is deprecated
        out = torch.nn.functional.softmax(out, dim=-1)
        return out

## Point 2

In [3]:
model = MLP()
model

MLP(
  (layer1): Linear(in_features=5, out_features=11, bias=False)
  (layer2): Linear(in_features=11, out_features=16, bias=False)
  (layer3): Linear(in_features=16, out_features=13, bias=False)
  (layer4): Linear(in_features=13, out_features=8, bias=False)
  (layer5): Linear(in_features=8, out_features=4, bias=False)
)

In [4]:
model.state_dict()

OrderedDict([('layer1.weight',
              tensor([[ 0.3575, -0.1398, -0.4320, -0.4417, -0.3660],
                      [-0.0365, -0.1304,  0.1207,  0.0152,  0.0177],
                      [ 0.0810, -0.4347,  0.3170,  0.0636,  0.1497],
                      [-0.3158, -0.3486,  0.1617,  0.1791, -0.3645],
                      [ 0.2803, -0.3764, -0.3359,  0.3304, -0.2652],
                      [ 0.2848, -0.2202,  0.1533,  0.0190,  0.4040],
                      [-0.3167,  0.0412,  0.0585,  0.1546,  0.4256],
                      [ 0.1684,  0.2240,  0.1046,  0.0644,  0.1343],
                      [-0.0099, -0.1314, -0.4399,  0.2924, -0.1028],
                      [-0.1140, -0.1220,  0.3910, -0.0302,  0.1748],
                      [ 0.0529,  0.3622,  0.0514,  0.3865,  0.3588]])),
             ('layer2.weight',
              tensor([[-0.0064, -0.0029,  0.0103, -0.2114,  0.0764, -0.1479, -0.2010, -0.0172,
                        0.0823, -0.1868, -0.2468],
                      [ 0.0897

In [5]:
from torchsummary import summary

_ = summary(model)

Layer (type:depth-idx)                   Param #
├─Linear: 1-1                            55
├─Linear: 1-2                            176
├─Linear: 1-3                            208
├─Linear: 1-4                            104
├─Linear: 1-5                            32
Total params: 575
Trainable params: 575
Non-trainable params: 0


In [6]:
def pretty_print(obj, title=None):
    if title is not None:
        print(title)
    print(obj)
    print("\n")

In [8]:
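# torch.range is deprecated; torch.arange(10) yields the same values 0, 1, ..., 9
x1 = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
x2 = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
x3 = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
x4 = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
x5 = torch.arange(10, dtype=torch.float32).unsqueeze(-1)
X = torch.cat((x1, x2, x3, x4, x5), dim=1)
# Add Gaussian noise (mean 0, std 0.3) to every covariate
eps = torch.normal(0, .3, (10, 5))
X += eps

y = torch.arange(10, dtype=torch.float32).unsqueeze(-1)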


pretty_print(X, "X (covariates)")
pretty_print(y, "y (response)")

X (covariates)
tensor([[ 0.2183, -0.0832, -0.1295,  0.0803,  0.1045],
        [ 1.0737,  0.9031,  0.3763,  1.1204,  0.5320],
        [ 2.0429,  1.7837,  1.9326,  2.2841,  2.0563],
        [ 2.8074,  2.8198,  3.3179,  2.7253,  3.1570],
        [ 3.9642,  3.8509,  4.5325,  4.3087,  4.1998],
        [ 4.9211,  5.0005,  4.5771,  5.3332,  4.9440],
        [ 5.8898,  6.2374,  6.3189,  6.3029,  6.0736],
        [ 7.4177,  7.2779,  6.7890,  6.9428,  6.7353],
        [ 8.4458,  8.0981,  7.8257,  7.9372,  8.1026],
        [ 8.8283,  8.5995,  9.3828,  9.3346,  9.1644]])


y (response)
tensor([[0.],
        [1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.],
        [7.],
        [8.],
        [9.]])





In [9]:
y_hat = model(X)
pretty_print(y_hat, "Predictions")

Predictions
tensor([[0.2490, 0.2512, 0.2504, 0.2494],
        [0.2492, 0.2520, 0.2501, 0.2486],
        [0.2463, 0.2591, 0.2489, 0.2457],
        [0.2441, 0.2644, 0.2478, 0.2437],
        [0.2420, 0.2698, 0.2470, 0.2412],
        [0.2415, 0.2717, 0.2473, 0.2395],
        [0.2389, 0.2781, 0.2459, 0.2371],
        [0.2384, 0.2793, 0.2465, 0.2358],
        [0.2359, 0.2854, 0.2455, 0.2332],
        [0.2329, 0.2930, 0.2434, 0.2307]], grad_fn=<SoftmaxBackward>)






In [10]:
def accuracy(y, y_hat):
    # Assign each instance to the class with the highest predicted probability
    pred_classes = y_hat.argmax(dim=1)
    # Flatten the targets to shape (N,) so the comparison is element-wise, not broadcast
    correct = (pred_classes == y.squeeze(-1).long()).sum()
    return (correct / y.shape[0]).item()

In [11]:
accuracy(y, y_hat)

0.10000000149011612

## Point 3

In the first layer we have $5 \times 11 = 55$ parameters, in the second one $11 \times 16 = 176$, in the third one $16 \times 13 = 208$, in the fourth one $13 \times 8 = 104$, and in the fifth one $8 \times 4 = 32$, so in total we have $575$ parameters.

If we also include the bias, then in the first layer we have $(5 + 1) \times 11 = 66$ parameters, in the second one $(11 + 1) \times 16 = 192$, in the third one $(16 + 1) \times 13 = 221$, in the fourth one $(13 + 1) \times 8 = 112$, and in the fifth one $(8 + 1) \times 4 = 36$, so in total we have $627$ parameters.
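The arithmetic above can be double-checked with a few lines of plain Python. This is only a sketch: the helper `count_params` is a hypothetical name, and the list `sizes` is simply the layer widths read off the `MLP` class defined in Point 1.

```python
# Layer widths of the MLP: input, four hidden layers, output.
sizes = [5, 11, 16, 13, 8, 4]

def count_params(sizes, bias):
    """Return the per-layer parameter counts of a fully connected network."""
    counts = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # n_in * n_out weights, plus one bias per output unit if requested.
        counts.append(n_in * n_out + (n_out if bias else 0))
    return counts

no_bias = count_params(sizes, bias=False)
with_bias = count_params(sizes, bias=True)

print(no_bias, sum(no_bias))      # [55, 176, 208, 104, 32] 575
print(with_bias, sum(with_bias))  # [66, 192, 221, 112, 36] 627
```

The difference between the two totals, $627 - 575 = 52$, equals the number of non-input units ($11 + 16 + 13 + 8 + 4$), since each of them gains exactly one bias.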

## Point 4

In [12]:
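print('Computation of the L1 norm of the parameters:')
# Note: for a 2-D weight matrix, torch.linalg.norm(w, 1) is the induced matrix
# 1-norm (largest absolute column sum), not the sum of all absolute entries.
for n, (name, weight) in enumerate(model.state_dict().items()):
    print(f'Layer {n} has norm {torch.linalg.norm(weight, 1)}')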

Computation of the L1 norm of the parameters:
Layer 0 has norm 2.763221263885498
Layer 1 has norm 2.6886987686157227
Layer 2 has norm 2.1026058197021484
Layer 3 has norm 1.5712890625
Layer 4 has norm 1.1986228227615356


In [14]:
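print('Computation of the L2 norm of the parameters:')
# Note: for a 2-D weight matrix, torch.linalg.norm(w, 2) is the spectral norm
# (largest singular value), not the square root of the sum of squared entries.
for n, (name, weight) in enumerate(model.state_dict().items()):
    print(f'Layer {n} has norm {torch.linalg.norm(weight, 2)}')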

Computation of the L2 norm of the parameters:
Layer 0 has norm 1.227863073348999
Layer 1 has norm 1.0856382846832275
Layer 2 has norm 1.0596067905426025
Layer 3 has norm 0.853816032409668
Layer 4 has norm 0.8204464912414551
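The values above come from calling `torch.linalg.norm` on 2-D weight matrices with `ord=1` and `ord=2`, which returns induced matrix norms. If "norm of the parameters" is instead read as the entrywise norm of all weights in a layer, `torch.linalg.vector_norm` computes that reading. The sketch below contrasts the two on a random stand-in matrix; the tensor `w` and the seed are assumptions for illustration, not the model's actual weights.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the sketch reproducible

# Stand-in for one layer's weight matrix (same shape as layer1.weight).
w = torch.randn(11, 5)

# Entrywise norms: treat all weights of the layer as one flat vector.
l1_entrywise = torch.linalg.vector_norm(w, ord=1)  # sum of |w_ij|
l2_entrywise = torch.linalg.vector_norm(w, ord=2)  # sqrt of the sum of w_ij^2

# Induced matrix norms: what torch.linalg.norm returns for a 2-D input.
l1_induced = torch.linalg.norm(w, 1)  # largest absolute column sum
l2_induced = torch.linalg.norm(w, 2)  # largest singular value

print(f"entrywise L1 = {l1_entrywise.item():.4f}, induced 1-norm = {l1_induced.item():.4f}")
print(f"entrywise L2 = {l2_entrywise.item():.4f}, induced 2-norm = {l2_induced.item():.4f}")
```

To apply the entrywise version to the network, one could loop over `model.named_parameters()` and call `torch.linalg.vector_norm` on each tensor. The induced norms never exceed the corresponding entrywise ones, so the two readings generally give different numbers.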
