# Introduction 

The objective of this session is to implement a multi-layer perceptron with one hidden layer from scratch and test it on MNIST.
You can get information about the practical sessions and the provided helper functions on the course’s website.

# 1 Activation function

Write the two functions
    
    def sigma(x)
    def dsigma(x)
 
that take as input a float tensor and returns a tensor of same size, obtained by applying component-wise respectively tanh, and the first derivative of tanh.

Hint: The functions should have no python loop, and use in particular torch.tanh , torch.exp , torch.mul , and torch.pow . My versions are 34 and 62 character long.

In [9]:
import math
import torch
from torch import Tensor

import dlc_practical_prologue as prologue

In [4]:
def sigma(x):
    return torch.tanh(x)

def dsigma(x):
    tanX2 = torch.tanh(x).pow(2)
    return 1 - tanX2

In [7]:
x = torch.Tensor(2,2).random_()
print(x)
print(sigma(x))
print(dsigma(x))

tensor([[ 2131524., 12771452.],
        [ 6708555., 13028076.]])
tensor([[1., 1.],
        [1., 1.]])
tensor([[0., 0.],
        [0., 0.]])


### Correction

In [10]:
def sigma(x):
    return x.tanh()

def dsigma(x):
    return 4 * (x.exp() + x.mul(-1).exp()).pow(-2)

In [11]:
print(sigma(x))
print(dsigma(x))

tensor([[1., 1.],
        [1., 1.]])
tensor([[0., 0.],
        [0., 0.]])


### Note :
Pas de faute, mais faire gaffe à bien se rappeler formule tanh

# 2 Loss

Write the two functions

    def loss(v, t)
    def dloss(v, t)

that take as input two float tensors of same dimensions with v the predicted tensor and t the target one, and return respectively ∥t − v ∥2 , and a tensor of same size equal to the gradient of that quantity as a function of v.

Hint: The functions should have no python loop, and use in particular torch.sum , torch.pow . My versions are 48 and 40 character long.

In [12]:
def loss(v, t):
    return (t-v).pow(2).sum()

def dloss(v, t):
    ret = (t-v)*loss(v,t).sum(0)
    return ret

### Correction

In [13]:
def loss(v, t):
    return (v - t).pow(2).sum()

def dloss(v, t):
    return 2 * (v - t)

### Note:

Revoir sa définition du gradient

# 3 Forward and backward passes

Write a function

    def forward ̇pass(w1, b1, w2, b2, x)

whose arguments correspond to an input vector to the network, and the weight and bias of the two layers, and returns a tuple composed of the corresponding x(0), s(1), x(1), s(2), and x(2).

Write a function
    
    def backward ̇pass(w1, b1, w2, b2,
                  t,
                  x, s1, x1, s2, x2,
                  dl ̇dw1, dl ̇db1, dl ̇dw2, dl ̇db2)

whose arguments correspond to the target vector, the quantities computed by the forward pass, and the tensors used to store the cumulated sums of the gradient on individual samples, and update the latters according to the formula of the backward pass.

Hint: The functions should have no python loop, and use in particular torch.t, torch.mv, torch.mm , and torch.view , and the functions previously written. The main difficulty is to deal
properly with the tensor size and transpose. My versions are 165 and 436 character long.


In [14]:
def forward_pass(w1, b1, w2, b2, x):
    s1 = torch.mm(w1, x) + b1
    x1 = sigma(s1)
    s2 = torch.mm(w2, x1) + b2
    x2 = sigma(s2)
    return x, s1, x1, s2, x2
    
def backward_pass(w1, b1, w2, b2, t, x, s1, x1, s2, x2, d1_dw1, d1_db1, d1_dw2, d1_db2):
    w1 = w1 - t*d1_dw1
    w2 = w2 - t*d1_dw2
    b1 = b1 - t*d1_db1
    d2 = b2 - t*d1_db2
    

### Correction

In [16]:
def forward_pass(w1, b1, w2, b2, x):
    x0 = x
    s1 = w1.mv(x0) + b1
    x1 = sigma(s1)
    s2 = w2.mv(x1) + b2
    x2 = sigma(s2)

    return x0, s1, x1, s2, x2

def backward_pass(w1, b1, w2, b2,
                  t,
                  x, s1, x1, s2, x2,
                  dl_dw1, dl_db1, dl_dw2, dl_db2):
    x0 = x
    dl_dx2 = dloss(x2, t)
    dl_ds2 = dsigma(s2) * dl_dx2
    dl_dx1 = w2.t().mv(dl_ds2)
    dl_ds1 = dsigma(s1) * dl_dx1

    dl_dw2.add_(dl_ds2.view(-1, 1).mm(x1.view(1, -1)))
    dl_db2.add_(dl_ds2)
    dl_dw1.add_(dl_ds1.view(-1, 1).mm(x0.view(1, -1)))
    dl_db1.add_(dl_ds1)

### Note: 

Claire incompréhension, relire le cours

# 4 Training the network

Write the code to train and test a MLP with one hidden layer of 50 units. This network should have an input dimension of 784, which is the dimension of the MNIST training set, and an output dimension of 10, which is the number of classes.
You code should:

1. Load the data using the provided prologue.load ̇data function, with one-hot label vectors and normalized inputs. Multiply the target label vectors by ζ = 0.9 (so that they are strictly in the value range of tanh).
2. Create the four weight and bias tensors, and fill them with random values sampled according to N(0, ε) with ε = 1e − 6.
3. Create the four tensors to sum up the gradients on individual samples, with respect to the weights and biases.
4. Perform 1, 000 gradient steps with a step size η equal to 0.1 divided by the number of training samples.

Each of these steps requires to reset to zero the tensors for summing up the gradients, and doing a forward and a backward pass for each training example.

Compute and print the training loss, training error and test error after every step using the class of maximum response as the predicted one.

Hint: My solution is 1987 character long and achieves 3.6% training error and 15.70% test error with 50 hidden units. It takes 1min40s to finish on a Intel i7 with no GPU, using the default small sets of prologue.load ̇data.

In [27]:
train_input, train_target, test_input, test_target = prologue.load_data()
train_target *= 0.9
test_target*= 0.9
n_classes = 10
hidden_dimension = 50
input_dimension = 784
n_train = train_input.size(0)
n_test = test_input.size(0)

epsilon = (10**(-6))
w1 = Tensor(hidden_dimension, train_input.size(1)).normal_(0, epsilon)
b1 = Tensor(hidden_dimension).normal_(0, epsilon)
w2 = Tensor(n_classes, hidden_dimension).normal_(0, epsilon)
b2 = Tensor(n_classes).normal_(0, epsilon)

dl_dw1 = Tensor(w1.size())
dl_dw2 = Tensor(w2.size())
dl_db1 = Tensor(b1.size())
dl_db2 = Tensor(b2.size())

eta = 0.1/n_train

for i in range(1000):
    acc_loss = 0
    nb_train_errors = 0

    dl_dw1.zero_()
    dl_db1.zero_()
    dl_dw2.zero_()
    dl_db2.zero_()
    
    for j in range(n_train):
        x, s1, x1, s2, x2 = forward_pass(b1=b1, b2=b2, w1=w1, w2=w2, x=train_input[j])
        backward_pass(w1, b1, w2, b2,
                  train_target[j].float(),
                  x, s1, x1, s2, x2,
                  dl_dw1, dl_db1, dl_dw2, dl_db2)
    
    w1 = w1 - eta * dl_dw1
    b1 = b1 - eta * dl_db1
    w2 = w2 - eta * dl_dw2
    b2 = b2 - eta * dl_db2
    
    
    
    
    

* Using MNIST
** Reduce the data-set (use --full for the full thing)
** Use 1000 train and 1000 test samples


In [28]:
print(w1)
print(w2)
print(b1)
print(b2)

tensor([[ 7.9368e-07, -2.9087e-07, -1.9056e-07,  ..., -4.5025e-09,
          5.2184e-07,  1.2222e-07],
        [ 1.6297e-07, -8.9330e-07, -2.4524e-07,  ..., -2.8647e-06,
          7.7882e-07,  1.0929e-06],
        [ 1.3245e-06, -5.0951e-07, -2.6161e-07,  ...,  8.5209e-07,
          1.3358e-06,  3.9310e-07],
        ...,
        [-5.2221e-07,  2.6156e-06, -2.7214e-07,  ...,  1.5129e-07,
          8.7381e-07, -1.8542e-07],
        [ 7.0642e-07, -1.5696e-06,  1.2612e-06,  ...,  1.4513e-06,
          5.7798e-07,  1.4351e-06],
        [-8.2720e-07,  1.1239e-06,  8.2591e-07,  ...,  1.5426e-06,
          6.9170e-07, -5.1783e-07]])
tensor([[ 6.2110e-07,  4.6845e-07,  2.0607e-06, -6.8884e-07, -8.4178e-07,
          8.1810e-08,  2.0056e-06,  1.1767e-06, -3.0302e-07, -1.0057e-06,
          3.2718e-07, -5.1883e-07,  1.6384e-07,  3.6668e-07, -2.5713e-07,
          4.6792e-08,  1.0478e-06, -1.2150e-06, -1.2761e-06, -2.8307e-07,
          6.8178e-07, -5.6880e-07,  6.6020e-08,  3.8623e-07,  7.7206e-07

### Correction:

In [29]:
train_input, train_target, test_input, test_target = prologue.load_data(one_hot_labels = True,
                                                                        normalize = True)

nb_classes = train_target.size(1)
nb_train_samples = train_input.size(0)

zeta = 0.90

train_input = train_input * zeta
test_input = test_input * zeta

nb_hidden = 50
eta = 1e-1 / nb_train_samples
epsilon = 1e-6

w1 = Tensor(nb_hidden, train_input.size(1)).normal_(0, epsilon)
b1 = Tensor(nb_hidden).normal_(0, epsilon)
w2 = Tensor(nb_classes, nb_hidden).normal_(0, epsilon)
b2 = Tensor(nb_classes).normal_(0, epsilon)

dl_dw1 = Tensor(w1.size())
dl_db1 = Tensor(b1.size())
dl_dw2 = Tensor(w2.size())
dl_db2 = Tensor(b2.size())

for k in range(0, 1000):

    # Back-prop

    acc_loss = 0
    nb_train_errors = 0

    dl_dw1.zero_()
    dl_db1.zero_()
    dl_dw2.zero_()
    dl_db2.zero_()

    for n in range(0, nb_train_samples):
        x0, s1, x1, s2, x2 = forward_pass(w1, b1, w2, b2, train_input[n])

        pred = x2.max(0)[1][0]
        if train_target[n, pred] < 0: 
            nb_train_errors = nb_train_errors + 1
            
        acc_loss = acc_loss + loss(x2, train_target[n])

        backward_pass(w1, b1, w2, b2,
                      train_target[n],
                      x0, s1, x1, s2, x2,
                      dl_dw1, dl_db1, dl_dw2, dl_db2)

    # Gradient step

    w1 = w1 - eta * dl_dw1
    b1 = b1 - eta * dl_db1
    w2 = w2 - eta * dl_dw2
    b2 = b2 - eta * dl_db2

    # Test error

    nb_test_errors = 0

    for n in range(0, test_input.size(0)):
        _, _, _, _, x2 = forward_pass(w1, b1, w2, b2, test_input[n])

        pred = x2.max(0)[1][0]
        if test_target[n, pred] < 0: nb_test_errors = nb_test_errors + 1
    if i%100 == 0:
        print('{:d} acc_train_loss {:.02f} acc_train_error {:.02f}% test_error {:.02f}%'
              .format(k,
                      acc_loss,
                      (100 * nb_train_errors) / train_input.size(0),
                      (100 * nb_test_errors) / test_input.size(0)))

* Using MNIST
** Reduce the data-set (use --full for the full thing)
** Use 1000 train and 1000 test samples




In [30]:
print(w1)
print(w2)
print(b1)
print(b2)

tensor([[-0.0062, -0.0062, -0.0062,  ..., -0.0062, -0.0062, -0.0062],
        [-0.0030, -0.0030, -0.0030,  ..., -0.0030, -0.0030, -0.0030],
        [-0.0021, -0.0021, -0.0021,  ..., -0.0021, -0.0021, -0.0021],
        ...,
        [ 0.0007,  0.0008,  0.0008,  ...,  0.0008,  0.0008,  0.0008],
        [ 0.0095,  0.0095,  0.0095,  ...,  0.0095,  0.0095,  0.0095],
        [ 0.0090,  0.0090,  0.0090,  ...,  0.0090,  0.0090,  0.0090]])
tensor([[-0.0100,  0.0813,  0.1226,  0.0825,  0.1825, -0.0622, -0.1460,  0.4874,
          0.0037, -0.0068,  0.3498,  0.3570, -0.4793,  0.1693, -0.0469,  0.0732,
         -0.4761,  0.0552, -0.0450, -0.2526,  0.1625,  0.2181, -0.0513, -0.2099,
         -0.0087, -0.0637,  0.0406,  0.1108,  0.1112, -0.2052, -0.1830,  0.2817,
          0.2019,  0.0639, -0.1164, -0.1447,  0.1226,  0.1929,  0.0133, -0.0197,
          0.1552, -0.0178, -0.1760,  0.0067, -0.1987, -0.0190, -0.2251,  0.1664,
          0.2019,  0.2134],
        [-0.3586, -0.2734, -0.2013, -0.2553, -0.0488