# Feed-forward neural networks


### 2.2 Learning rate and training algorithms
The **learning rate** is usually chosen experimentally, guided by the figure below. Note that in practice, variants of the gradient descent update are used; an implementation with 3D visualizations can be found [here](Gradient_Methods.ipynb), and several variants are available in the [Keras library](https://keras.io/optimizers/):
- Stochastic Gradient Descent with Momentum (**SGD+Momentum**)
- [**Adadelta**](https://arxiv.org/abs/1212.5701)
- [**Adam**](https://arxiv.org/pdf/1412.6980v8.pdf)

<p align="center">
<img src="images/learningrates.jpeg" width="450" title="Learning Rates" >
</p> 
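
To make the role of the learning rate and of momentum concrete, here is a minimal sketch in plain Python. The toy objective $f(x) = (x - 3)^2$, the function names and the hyper-parameter values are assumptions chosen for illustration; the gradient is exact rather than stochastic, so this only mimics the update rule of SGD+Momentum.

```python
def sgd_momentum(grad, x0, lr, momentum=0.0, steps=100):
    """Minimise a 1-D function from its gradient with a momentum update."""
    x, velocity = x0, 0.0
    for _ in range(steps):
        # The velocity accumulates past gradients; momentum=0 gives plain gradient descent.
        velocity = momentum * velocity - lr * grad(x)
        x += velocity
    return x


def grad_f(x):
    # Gradient of the toy objective f(x) = (x - 3)^2, whose minimum is at x = 3.
    return 2.0 * (x - 3.0)


# A moderate learning rate with momentum settles near the minimum.
well_tuned = sgd_momentum(grad_f, x0=0.0, lr=0.1, momentum=0.9, steps=200)
# A learning rate that is too large overshoots further at every step and diverges.
too_large = sgd_momentum(grad_f, x0=0.0, lr=1.1, momentum=0.0, steps=20)
print(well_tuned, too_large)
```

With these assumed settings the first run ends close to 3, while the second moves away from it, which is the kind of behaviour the learning-rate figure illustrates.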


## 3. Training 

### 3.1 Avoid over-fitting

<p align="center">
<img src="images/fittings.jpg" width="550" title="Types of fittings during training." >
</p> 


### Separate the data into three folds
<p align="center">
<img src="images/training_splits.png" width="500" title="Dataset separation." >
</p> 
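
A three-way split could be sketched as follows in plain Python. The helper name `split_dataset` and the 70/15/15 proportions are assumptions for illustration, not values taken from the figure.

```python
import random


def split_dataset(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle `samples` and cut them into train / validation / test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The training part is used to fit the weights, the validation part to compare models and hyper-parameters, and the test part is kept aside for the final assessment.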


### 3.2 Assessing the model performance on the validation set


<p align="center">
<img src="images/accuracies.jpeg" width="350" title="Dataset separation." >
</p> 
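
One possible way to use validation accuracies is to keep the epoch where they peak. The sketch below assumes hypothetical accuracy curves (the numbers are made up for illustration and are not read from the figure), and `best_epoch` is an illustrative helper.

```python
def best_epoch(val_accuracies):
    """Return the 0-based index of the epoch with the highest validation accuracy."""
    return max(range(len(val_accuracies)), key=lambda i: val_accuracies[i])


# Hypothetical curves, made up for illustration only.
train_acc = [0.70, 0.82, 0.90, 0.95, 0.98, 0.99]
val_acc = [0.68, 0.79, 0.85, 0.86, 0.84, 0.81]

stop_at = best_epoch(val_acc)
# A growing gap between training and validation accuracy is a sign of over-fitting.
gap_at_best = train_acc[stop_at] - val_acc[stop_at]
gap_at_end = train_acc[-1] - val_acc[-1]
print(stop_at, gap_at_best, gap_at_end)
```

In this made-up example the training accuracy keeps rising while the validation accuracy starts to fall, so stopping at the peak would be the reasonable choice.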


## 4. Further readings

- [Deep learning book](http://deeplearningbook.org/)
- [Stanford's CS231n](http://cs231n.github.io/)
- [Washington University in St. Louis](https://github.com/jeffheaton/t81_558_deep_learning)
- [Deep learning paper](http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf)
- [Tensorflow/Keras tutorial](https://www.tensorflow.org/guide/keras)

## 5. Example

### 5.1 MNIST (handwritten digit classification)

In [1]:
import torch as tc
import torch.nn as nn # contains methods to create neural networks
import torchvision as tv # allows access to various computer vision databases and image processing functions 


We define here the set of transformations applied to each image before it reaches the network:
1. We convert the image to a Torch tensor
2. We normalize the image: (image - mean) / std
3. We flatten the image: from $(1 \times 28 \times 28)$ to a vector of size $784$
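
The arithmetic behind the normalization and flattening steps can be sketched in plain Python. The 2 x 2 image is a stand-in for a real 28 x 28 digit, and the helper names are illustrative. Mean and std are both assumed to be 0.5, which maps pixel values from $[0, 1]$ to $[-1, 1]$.

```python
def normalize(pixel, mean=0.5, std=0.5):
    # Same formula as in the text: (image - mean) / std, applied pixel by pixel.
    return (pixel - mean) / std


def flatten(image):
    # Concatenate the rows of a 2-D image into a single 1-D list.
    return [pixel for row in image for pixel in row]


tiny_image = [[0.0, 0.5], [1.0, 0.25]]  # pixel values already scaled to [0, 1]
flat = flatten([[normalize(p) for p in row] for row in tiny_image])
print(flat)

# A 28 x 28 image flattens to a vector of 784 values.
n_features = len(flatten([[0.0] * 28 for _ in range(28)]))
print(n_features)
```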

In [33]:
transform = tv.transforms.Compose([tv.transforms.ToTensor(),
                              tv.transforms.Normalize(0.5, 0.5),
                              tv.transforms.Lambda(lambda x: tc.flatten(x))])

The MNIST dataset is downloaded and stored in the folder $\textit{./mnist}$ (if not already downloaded).

In the same process, the training images are transformed using the previously defined transformations.

In [87]:
# Using only: trainingset = tv.datasets.MNIST(root="./mnist", train=True, download=True, transform=transform)
# is supposed to be enough to download the dataset.
# But you may run into an HTTP error, with some of the resources on Yann LeCun's page being unavailable

# See the issue here: https://github.com/pytorch/vision/issues/3549

# The following resources list is just a temporary fix

tv.datasets.MNIST.resources = [
            ('https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz', 'f68b3c2dcbeaaa9fbdd348bbdeb94873'),
            ('https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz', 'd53e105ee54ea40749a09fcbcd1e9432'),
            ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz', '9fb629c4189551a2d022fa330f9573f3'),
            ('https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz', 'ec29112dd5afa0611ce80d1b7f02629c')
        ]

trainingset = tv.datasets.MNIST(root="./mnist", train=True, download=True, transform=transform)

We now create a loader, i.e., an object that lets us (easily) iterate over our training set.
1. Each iteration of the loader returns a batch of 64 images along with their labels
2. The loader also reshuffles the dataset at the start of every pass (each time a new iterator is created), so the batches differ from one pass to the next

In [88]:
trainloader = tc.utils.data.DataLoader(trainingset, batch_size=64, shuffle=True)
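
What such a loader does can be approximated in a few lines of plain Python. This is only a sketch of shuffled mini-batching, not PyTorch's implementation; `iterate_batches` is an illustrative name, and the figure of 60000 training images is the standard MNIST training set size.

```python
import random


def iterate_batches(samples, batch_size, shuffle=True, seed=None):
    """Yield successive batches, reshuffling the order once per pass."""
    order = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


batches = list(iterate_batches(list(range(10)), batch_size=4, seed=0))
sizes = [len(b) for b in batches]  # the last batch is smaller when sizes do not divide evenly
print(sizes)

# Number of batches in one full pass over 60000 images with batches of 64 (ceiling division).
n_mnist_batches = -(-60000 // 64)
print(n_mnist_batches)
```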

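Let's now define our neural network. It's a simple multi-layer perceptron consisting of:
1. three Dense layers, with 256, 128 and `n_classes` output units
2. a ReLU activation after each of the first two Dense layers
3. a Dropout layer (currently commented out)
4. and a final Softmax layer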

In [89]:

class MLP(nn.Module):
    def __init__(self, n_inputs, n_classes, dropout_prob=0.3):
        super().__init__()

        # dropout_prob is only used by the Dropout layer, which is currently commented out
        self.pipe = nn.Sequential(
            nn.Linear(n_inputs, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
            #nn.Dropout(p=dropout_prob),
            nn.Softmax(dim=-1) # turns the n_classes scores into probabilities that sum to 1
        )

    def forward(self, x):
        return self.pipe(x)


In [90]:
# Our images are vectors of size 784
# and we have 10 classes

net = MLP(n_inputs=784, n_classes=10)
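
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes fully connected layers with one bias per output unit; `count_parameters` is an illustrative helper, not a PyTorch function.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each fully connected layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Layer widths of the MLP above: 784 inputs -> 256 -> 128 -> 10 classes.
per_layer = [count_parameters([a, b]) for a, b in [(784, 256), (256, 128), (128, 10)]]
total = count_parameters([784, 256, 128, 10])
print(per_layer, total)
```

Most of the parameters sit in the first layer, because it is connected to all 784 input pixels.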

The only things left are 
- the loss function
- and the optimizer

We will be using the Cross-Entropy loss function.
And for the optimizer a standard Stochastic Gradient Descent (SGD) should be good enough.

In [91]:
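# Loss function definition
# Note: nn.CrossEntropyLoss applies a log-softmax internally and expects raw scores (logits).
# Feeding it the output of a Softmax layer still runs, but with 10 classes the loss
# can then never drop below about 1.46, and the gradients are much smaller.
criterion = nn.CrossEntropyLoss()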
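
The cross-entropy computation can be reproduced in plain Python to see what the loss values mean. This sketch assumes the usual definition (the negative log of the softmax probability of the true class); the helper names and the example scores are illustrative.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log of the softmax probability assigned to the true class."""
    return -math.log(softmax(scores)[label])


# An untrained network gives roughly equal scores to all 10 classes: loss = ln(10).
uniform = cross_entropy([0.0] * 10, label=3)
# Confident raw scores (logits) for the right class drive the loss towards 0.
confident = cross_entropy([10.0 if k == 3 else 0.0 for k in range(10)], label=3)
# If the inputs are already probabilities, even a perfect one-hot vector leaves a large loss.
on_probabilities = cross_entropy([1.0 if k == 3 else 0.0 for k in range(10)], label=3)
print(uniform, confident, on_probabilities)
```

The first value, $\ln(10) \approx 2.30$, is the loss of a classifier that is undecided between 10 classes.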

In [92]:
# The optimizer takes as input the parameters to optimize and a learning rate
optimizer = tc.optim.SGD(net.parameters(), lr=0.5)

###### Let the training begin now

In [93]:
epochs = 100 # number of training iterations; each one processes a single batch of 64 images, not a full pass over the dataset

In [94]:
iter_train = iter(trainloader) # converting the loader to a Python iterator, so we can use 'next' to navigate
for e in range(epochs):
    images, labels = next(iter_train) # getting the next batch of images and their labels

    optimizer.zero_grad() # zeroing out the gradients accumulated during the previous iteration

    output = net(images) # we apply the model to the batch of images

    loss = criterion(output, labels) # we compute the loss between the model's outputs and the labels

    loss.backward() # we backpropagate the loss through the model to compute all the gradients

    optimizer.step() # finally, we update the weights

    print("Epoch {}/{} - loss {}".format(e+1, epochs, loss.item()))


Epoch 1/100 - loss 2.301060438156128
Epoch 2/100 - loss 2.3007211685180664
Epoch 3/100 - loss 2.3013782501220703
Epoch 4/100 - loss 2.2947072982788086
Epoch 5/100 - loss 2.2953991889953613
Epoch 6/100 - loss 2.3019230365753174
Epoch 7/100 - loss 2.29240083694458
Epoch 8/100 - loss 2.296236753463745
Epoch 9/100 - loss 2.294034481048584
Epoch 10/100 - loss 2.2830793857574463
Epoch 11/100 - loss 2.2933497428894043
Epoch 12/100 - loss 2.290843963623047
Epoch 13/100 - loss 2.2861928939819336
Epoch 14/100 - loss 2.2870659828186035
Epoch 15/100 - loss 2.2778518199920654
Epoch 16/100 - loss 2.270634889602661
Epoch 17/100 - loss 2.2710959911346436
Epoch 18/100 - loss 2.2508740425109863
Epoch 19/100 - loss 2.2316088676452637
Epoch 20/100 - loss 2.2321789264678955
Epoch 21/100 - loss 2.2499282360076904
Epoch 22/100 - loss 2.2377519607543945
Epoch 23/100 - loss 2.2352070808410645
Epoch 24/100 - loss 2.208195209503174
Epoch 25/100 - loss 2.336005687713623
Epoch 26/100 - loss 2.220587730407715
Epoch

###### Testing

In [95]:
testingset = tv.datasets.MNIST(root="./mnist", train=False, download=True, transform=transform)
testloader = tc.utils.data.DataLoader(testingset, batch_size=1, shuffle=True)

In [86]:
iter_test = iter(testloader) # creating the iterator once, instead of re-creating it at every step
for k in range(10):
    images, labels = next(iter_test)
    with tc.no_grad():
        output = net(images)

    print(tc.argmax(output), labels)

tensor(1) tensor([2])
tensor(1) tensor([5])
tensor(1) tensor([6])
tensor(9) tensor([9])
tensor(9) tensor([8])
tensor(1) tensor([4])
tensor(1) tensor([2])
tensor(1) tensor([3])
tensor(1) tensor([9])
tensor(1) tensor([9])
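
The ten (prediction, label) pairs printed above can be summarised as an accuracy. The sketch below copies those values by hand into plain Python lists; `accuracy` is an illustrative helper, and ten samples are far too few for a reliable estimate.

```python
# Values copied from the ten lines printed above: predicted class, then true label.
predictions = [1, 1, 1, 9, 9, 1, 1, 1, 1, 1]
labels = [2, 5, 6, 9, 8, 4, 2, 3, 9, 9]


def accuracy(predictions, labels):
    """Fraction of samples whose predicted class matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


acc = accuracy(predictions, labels)
print(acc)
```

Only one of the ten samples is classified correctly, which is what random guessing among 10 classes would give on average. This is consistent with the training loss having barely moved from its starting value in the log above.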
