# Defining and Training Neural Networks

What we will learn:
- How to initialize a NN
- Forward pass
- Backward pass
- Optimization of the network parameters

## PyTorch: <code>nn</code>

The <code>nn</code> package defines a set of Modules (i.e. neural network layers).

Each module receives an input and produces an output.

The <code>nn</code> package also defines losses. 
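
As a minimal sketch of both ideas, the snippet below calls one layer Module and one loss Module on dummy data. The names `layer`, `inputs`, and `targets` and the batch size of 4 are illustrative assumptions, not part of the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are reproducible

layer = nn.Linear(in_features=3, out_features=1)  # a Module: maps 3 inputs to 1 output
loss_fn = nn.MSELoss()                            # a loss, also implemented as a Module

inputs = torch.randn(4, 3)    # hypothetical batch of 4 samples with 3 features each
targets = torch.zeros(4, 1)   # dummy targets, for illustration only

outputs = layer(inputs)            # calling the Module runs its forward pass
loss = loss_fn(outputs, targets)   # the loss is returned as a scalar tensor

print(outputs.shape)  # torch.Size([4, 1])
print(loss.dim())     # 0
```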

In [1]:
# Import libs
import torch
import torch.nn as nn

import math
from IPython import display

### Objective
Create a model that approximates the $\sin(x)$ function.

In [2]:
# Create Tensors to hold input and outputs.
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

In [8]:
# For this example, the network will learn the sine function using a polynomial approximation.
# The output y is modeled as a linear function of (x, x^2, x^3), so
# we can treat it as the output of a single linear layer. Let's prepare the
# tensor (x, x^2, x^3).
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)
print(xx)
print(xx.size())

tensor([[ -3.1416,   9.8696, -31.0063],
        [ -3.1384,   9.8499, -30.9133],
        [ -3.1353,   9.8301, -30.8205],
        ...,
        [  3.1353,   9.8301,  30.8205],
        [  3.1384,   9.8499,  30.9133],
        [  3.1416,   9.8696,  31.0063]])
torch.Size([2000, 3])
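
The feature tensor is built by broadcasting: `unsqueeze(-1)` turns the vector into a column, and `pow` with a vector of exponents expands that column into one column per exponent. A small sketch on three values (the name `x_small` is an assumption used only here) makes the shapes easier to follow.

```python
import torch

x_small = torch.tensor([1.0, 2.0, 3.0])   # shape (3,)
p_small = torch.tensor([1, 2, 3])         # exponents, shape (3,)

col = x_small.unsqueeze(-1)               # shape (3, 1): one row per sample
features = col.pow(p_small)               # (3, 1) broadcast against (3,) gives (3, 3)

print(col.shape)       # torch.Size([3, 1])
print(features.shape)  # torch.Size([3, 3])
print(features)        # each row holds (x, x^2, x^3) for one sample
```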


In [9]:
# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. The Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# The Flatten layer flattens the output of the linear layer to a 1D tensor,
# to match the shape of `y`.

# Build the object that contains the neural network:
model = torch.nn.Sequential(
    # linear layer
    torch.nn.Linear(3, 1),
    # only needed to flatten the output
    torch.nn.Flatten(0, 1)
)
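
To see why the `Flatten(0, 1)` layer is needed, it helps to track the shapes through the two layers. This is a sketch on a hypothetical batch of 5 samples; `model_demo` and `batch` are names introduced only for this illustration.

```python
import torch

model_demo = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1),
)

batch = torch.randn(5, 3)               # 5 samples, 3 features each

after_linear = model_demo[0](batch)     # output of the linear layer alone
after_flatten = model_demo(batch)       # output of the full sequence

print(after_linear.shape)   # torch.Size([5, 1])
print(after_flatten.shape)  # torch.Size([5])
```

Without the flatten step, the prediction would have shape `(N, 1)` while the target `y` has shape `(N,)`, and the loss would broadcast the two against each other.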

In [10]:
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.

# Create an object holding the MSE loss function:
loss_fn = torch.nn.MSELoss()
# Define a learning rate:
learning_rate = 1e-3
# Construct the Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters
# which are members of the model.

# Use SGD:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

### MSE Loss
$$\mathcal{L}_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
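
The definition can be checked numerically by comparing a hand-written mean of squared errors with `nn.MSELoss`. The four values below are made up for the example.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])   # illustrative targets
y_hat = torch.tensor([1.5, 2.0, 2.0, 4.0])    # illustrative predictions

# Errors are (-0.5, 0, 1, 0), squares are (0.25, 0, 1, 0), mean is 1.25 / 4
manual = (y_true - y_hat).pow(2).mean()
builtin = nn.MSELoss()(y_hat, y_true)

print(manual.item())   # 0.3125
print(builtin.item())  # 0.3125
```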

In [11]:
# Loop over the training set
for t in range(2000):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(xx)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y) # (y_pred - y).pow(2).mean()
    
    # Print loss every 200 epochs
    if t % 200 == 199:
        # to print the numeric value of the loss, use loss.item()
        print(t, loss.item())
    
    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()
    
    # Alternative: zero the gradients of the model
    # model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward() 

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()
    
    # Alternative: Update the weights using gradient descent MANUALLY. Each parameter is a Tensor, so
    # we can access its gradients.
    """
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
    """

199 0.2046881467103958
399 0.1372782289981842
599 0.09257517009973526
799 0.06292480230331421
999 0.043254755437374115
1199 0.030203089118003845
1399 0.021541040390729904
1599 0.01579095609486103
1799 0.011972922831773758
1999 0.009437178261578083
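
The reason for zeroing the gradients at every iteration can be seen on a single scalar parameter. This is a small sketch, independent of the model above; `w` is a hypothetical parameter.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the stored one
(w ** 2).backward()
accumulated = w.grad.item()

# After zeroing, a backward pass gives the plain gradient again
w.grad.zero_()
(w ** 2).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 4.0 8.0 4.0
```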


In [12]:
# You can access the first layer of `model` like accessing the first item of a list
linear_layer = model[0]

# For linear layer, its parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

Result: y = -0.029610617086291313 + 0.7630205154418945 x + 0.005108322948217392 x^2 + -0.07999949157238007 x^3
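
Because the model is linear in its parameters, the coefficients that SGD is moving toward can also be computed in closed form with a least-squares solve. The sketch below assumes the same data as above and uses `torch.linalg.lstsq`; the names `x_ref`, `y_ref`, and `A` are introduced only for this check. For this data the solution should come out close to 0.857 x - 0.093 x^3, with bias and x^2 coefficient near zero, which suggests that the run above has not fully converged after 2000 steps.

```python
import math
import torch

x_ref = torch.linspace(-math.pi, math.pi, 2000)
y_ref = torch.sin(x_ref)

# Design matrix with columns (1, x, x^2, x^3); the column of ones plays the role of the bias
A = torch.stack([torch.ones_like(x_ref), x_ref, x_ref**2, x_ref**3], dim=1)

# Least-squares solution of A @ coeffs ≈ y_ref
coeffs = torch.linalg.lstsq(A, y_ref.unsqueeze(-1)).solution.squeeze(-1)

print(coeffs)  # order: bias, x, x^2, x^3
```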


In [13]:
# Has the network actually learned something?
idx = 1000
print(xx[idx]) # x[1000] is approximately 0
print("%.6f %.6f" % (model(xx)[idx].item(), torch.sin(x)[idx]))

tensor([1.5717e-03, 2.4701e-06, 3.8821e-09])
-0.028411 0.001572


In [14]:
# plot results
import matplotlib.pyplot as plt

# `res.plot_lib` is a course-specific helper module that is not shipped with this
# notebook; its styling is optional, so skip it when the module is not available.
try:
    from res.plot_lib import set_default
    set_default()
except ModuleNotFoundError:
    pass

yy = model(xx)

plt.plot(x, y, label='Sin(x)')
plt.plot(x, yy.detach().numpy(), label='model(x)')
plt.legend()
plt.show()

## Custom models

In [15]:
class SinModel(nn.Module):
    def __init__(self, in_dim = 3, out_dim = 1):
        """
        In the constructor we instantiate all the layers of the NN
        """
        super().__init__()
        # self.model = nn.Sequential(
        #     nn.Linear(in_dim, out_dim),
        #     nn.Flatten(0, 1)
        # )
        """
        Alternatively, we could define each layer individually
        """
        # Linear layer of the network:
        self.l1 = nn.Linear(in_dim, out_dim)
        # Layer that flattens the output:
        self.flt = nn.Flatten(0, 1)
        
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        #return self.model(x)

        # It is usually better to call the individual modules of the network rather than feeding everything to self.model():
        x = self.l1(x)
        # do stuff: any extra operations on the output of the linear layer can go here
        return self.flt(x)

In [16]:
# Construct our model by instantiating the class defined above
model = SinModel()

print(model)

SinModel(
  (l1): Linear(in_features=3, out_features=1, bias=True)
  (flt): Flatten(start_dim=0, end_dim=1)
)
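
Layers assigned as attributes in `__init__` are registered automatically, which is what lets `model.parameters()` find their weights. A sketch with a hypothetical stand-in class `TinyModel`, mirroring the structure above, lists the registered parameters by name.

```python
import torch.nn as nn

class TinyModel(nn.Module):  # hypothetical stand-in with the same two layers
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(3, 1)
        self.flt = nn.Flatten(0, 1)

    def forward(self, x):
        return self.flt(self.l1(x))

tiny = TinyModel()

# named_parameters() yields (name, tensor) pairs; Flatten has no learnable parameters
names_and_shapes = {name: tuple(p.shape) for name, p in tiny.named_parameters()}
print(names_and_shapes)  # {'l1.weight': (1, 3), 'l1.bias': (1,)}
```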


In [17]:
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters (defined 
# with torch.nn.Parameter) which are members of the model.
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for t in range(2000):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(xx)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 200 == 199:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

199 0.21490709483623505
399 0.1475391834974289
599 0.10184041410684586
799 0.07080710679292679
999 0.0497097373008728
1199 0.03535117581486702
1399 0.02556794509291649
1599 0.0188945010304451
1799 0.014337141066789627
1999 0.011221330612897873
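
For plain SGD (no momentum, no weight decay), `optimizer.step()` amounts to subtracting the learning rate times the gradient from each parameter. The sketch below checks this on two identical copies of a small layer; the data, the names `net_a` and `net_b`, and the single-step setup are assumptions made for the comparison.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
inputs = torch.randn(8, 3)    # illustrative data
targets = torch.randn(8, 1)
lr = 1e-3

net_a = nn.Linear(3, 1)
net_b = copy.deepcopy(net_a)  # same initial weights as net_a

# Variant A: let the optimizer perform the update
opt = torch.optim.SGD(net_a.parameters(), lr=lr)
opt.zero_grad()
nn.functional.mse_loss(net_a(inputs), targets).backward()
opt.step()

# Variant B: apply the gradient descent rule by hand
net_b.zero_grad()
nn.functional.mse_loss(net_b(inputs), targets).backward()
with torch.no_grad():
    for param in net_b.parameters():
        param -= lr * param.grad

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(net_a.parameters(), net_b.parameters())
)
print(same)  # True
```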


In [23]:
# Has the network actually learned something?
idx = 500
print(xx[idx]) # x[500] = -pi/2
print("%.6f %.6f" % (model(xx)[idx].item(), torch.sin(x)[idx]))

tensor([-1.5700,  2.4649, -3.8700])
-0.853562 -1.000000


In [None]:
yy = model(xx)

plt.plot(x,y, label='Sin(x)')
plt.plot(x,yy.detach().numpy(), label='model(x)')
plt.legend()
plt.show()

In [None]:
# Ex1: write a model (using custom modules) where the output y is a function of (x, x^2)
# and it approximates the cosine function

In [39]:
# First, create the dataset:
x_cosine = torch.linspace(-torch.pi, torch.pi, 2000)
y_cosine = torch.cos(x_cosine)
# Change the tensor of exponents, since here we work with a second-degree polynomial:
power = torch.tensor([1, 2])
# Build the input features for the model:
xx_cosine = x_cosine.unsqueeze(-1).pow(power)
print(xx_cosine.size())
print(xx_cosine)
print(y_cosine.size())

torch.Size([2000, 2])
tensor([[-3.1416,  9.8696],
        [-3.1384,  9.8499],
        [-3.1353,  9.8301],
        ...,
        [ 3.1353,  9.8301],
        [ 3.1384,  9.8499],
        [ 3.1416,  9.8696]])
torch.Size([2000])


In [40]:
class CosineModule(nn.Module):
    def __init__(self, input_dim, output_dim):
        # parent class constructor:
        super().__init__()
        # create the linear layer:
        self.l1 = nn.Linear(input_dim, output_dim)
        self.f = nn.Flatten(0, 1)

    def forward(self, x):
        result = self.l1(x)
        return self.f(result)
    
# Instantiate the model:
cosine_nn = CosineModule(2, 1)
print(cosine_nn)
# Instantiate the loss function:
my_loss_function = nn.MSELoss()
# Instantiate the optimizer:
my_optimizer = torch.optim.SGD(cosine_nn.parameters(), lr=1e-3)

CosineModule(
  (l1): Linear(in_features=2, out_features=1, bias=True)
  (f): Flatten(start_dim=0, end_dim=1)
)


In [45]:
# Train the network:
for t in range(2000):
    # compute the prediction:
    y_pred = cosine_nn(xx_cosine)
    # compute the loss and print it every 99 iterations:
    loss = my_loss_function(y_pred, y_cosine)
    if t % 99 == 0:
        print(t, loss.item())
    # backward pass and update of the network weights:
    my_optimizer.zero_grad()
    loss.backward()
    my_optimizer.step()


0 1.643784761428833
99 0.8743630051612854
198 0.6672363877296448
297 0.6009028553962708
396 0.5742444396018982
495 0.559681236743927
594 0.549544095993042
693 0.5415782332420349
792 0.5350212454795837
891 0.5295374989509583
990 0.5249274969100952
1089 0.5210455060005188
1188 0.5177748799324036
1287 0.5150187611579895
1386 0.5126962065696716
1485 0.5107388496398926
1584 0.5090893507003784
1683 0.5076992511749268
1782 0.5065277218818665
1881 0.5055404305458069
1980 0.5047084093093872


In [60]:
# Let's check whether the network has learned:
idx = 750
print("%.6f %.6f" % (cosine_nn(xx_cosine)[idx].item(), torch.cos(x_cosine)[idx]))

-0.088470 0.707940


In [None]:
# Ex2: write a model (using custom modules) where the output y is a function of (x, x^2, x^3)
# and it approximates the function -5 + 2*x + (3/4)*x^2 + 7*x^3.
# Which values do you expect the bias and weights to have after training?
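
Since the target is itself a cubic polynomial, a linear layer on (x, x^2, x^3) can represent it exactly: a bias of -5 and weights (2, 0.75, 7) should give a loss of essentially zero. The sketch below sets those values by hand to confirm this. The interval [-1, 1] and the names `x_chk`, `y_chk`, `feats`, and `layer` are assumptions made for the check.

```python
import torch
import torch.nn as nn

x_chk = torch.linspace(-1, 1, 2000)   # assumed interval for the check
y_chk = -5 + 2 * x_chk + (3 / 4) * x_chk**2 + 7 * x_chk**3
feats = x_chk.unsqueeze(-1).pow(torch.tensor([1, 2, 3]))

layer = nn.Linear(3, 1)
with torch.no_grad():
    # Overwrite the random initialization with the coefficients of the target polynomial
    layer.bias.copy_(torch.tensor([-5.0]))
    layer.weight.copy_(torch.tensor([[2.0, 0.75, 7.0]]))

loss = nn.functional.mse_loss(layer(feats).flatten(), y_chk)
print(loss.item())  # zero up to floating-point rounding
```

Training with SGD should move the parameters toward these values; how close it gets depends on the learning rate and the number of iterations.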

In [87]:
# Create the dataset:
x = torch.linspace(-1, +1, 2000)  # use a narrower interval to avoid excessively large losses
y = (-5 + 2*x + (3/4)*x**2 + 7*x**3).to(torch.float32)
p = torch.tensor([1, 2, 3], dtype=torch.float32)
xx = x.unsqueeze(-1).pow(p)
print("Size of x: ", x.size())
print("Size of y: ", y.size())
print("Size of xx: ", xx.size())

Size of x:  torch.Size([2000])
Size of y:  torch.Size([2000])
Size of xx:  torch.Size([2000, 3])


In [88]:
class ApproximatorModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        # call the parent class constructor:
        super().__init__()
        # create the linear layer:
        self.l1 = nn.Linear(input_dim, output_dim)
        # create the flattening layer:
        self.f = nn.Flatten(0, 1)

    def forward(self, x):
        result = self.l1(x)
        return self.f(result)

# Instantiate the model:
approximator = ApproximatorModel(3, 1)
# Instantiate the loss function:
my_loss_function = nn.MSELoss()
# Instantiate the optimizer:
optimizer = torch.optim.SGD(approximator.parameters(), lr=1e-3)

print(approximator)

ApproximatorModel(
  (l1): Linear(in_features=3, out_features=1, bias=True)
  (f): Flatten(start_dim=0, end_dim=1)
)


In [92]:
# Train the model:
for t in range(10000):
    # compute the prediction:
    y_pred = approximator(xx)
    # compute the loss and print it every 99 iterations:
    loss = my_loss_function(y_pred, y)
    if t % 99 == 0:
        print(t, loss.item())
    # backward pass + weight update:
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Has the network learned anything?
idx = 500
print(xx[idx])
print("%.6f %.6f" % (approximator(xx)[idx].item(), y[idx]))

0 0.9663819074630737
99 0.8944557905197144
198 0.8332880139350891
297 0.7810604572296143
396 0.7362793684005737
495 0.6977102160453796
594 0.6643339991569519
693 0.6353052854537964
792 0.6099230051040649
891 0.5876033902168274
990 0.5678605437278748
1089 0.5502913594245911
1188 0.5345574617385864
1287 0.5203779339790344
1386 0.5075181126594543
1485 0.4957819879055023
1584 0.48500487208366394
1683 0.47504961490631104
1782 0.4658012092113495
1881 0.4571630656719208
1980 0.44905394315719604
2079 0.44140613079071045
2178 0.43416205048561096
2277 0.42727360129356384
2376 0.4206998944282532
2475 0.41440650820732117
2574 0.40836402773857117
2673 0.4025477170944214
2772 0.39693647623062134
2871 0.3915121853351593
2970 0.3862591087818146
3069 0.38116395473480225
3168 0.3762151300907135
3267 0.3714025914669037
3366 0.3667176365852356
3465 0.3621525168418884
3564 0.3577004373073578
3663 0.3533552885055542
3762 0.34911176562309265
3861 0.3449651896953583
3960 0.3409111797809601
4059 0.336945623159

In [93]:
# In a custom module, each layer is accessed as an attribute (here `l1`)
linear_layer = approximator.l1

# For a linear layer, the parameters are stored as `weight` and `bias`.
print(f'Result: y = {linear_layer.bias.item()} + {linear_layer.weight[:, 0].item()} x + {linear_layer.weight[:, 1].item()} x^2 + {linear_layer.weight[:, 2].item()} x^3')

Result: y = -4.878225803375244 + 3.766690731048584 x + 0.4140462577342987 x^2 + 4.204785346984863 x^3
