## What is deep learning?

- Subcase of machine learning, which is itself a subcase of artificial intelligence
- Artificial intelligence is about understanding concepts
- Machine learning is about learning concepts through example
- Deep learning is a specific machine learning method which works (very) well in some cases 

## Learning tasks

In machine learning, we teach a learning algorithm to learn to perform a task by showing it examples. We want the learning algorithm to find general patterns in what it is shown, and not to memorize. We typically represent an example with a vector $x$. If we're doing supervised learning then there will also be an accompanying label $y$. Here are a few common tasks:

- **Classification**: the algorithm has to predict one of $k$ possible classes. Binary classification is special case where $k = 1$. In multi-output classification, there can be more than one class.
- **Regression**: the algorithm has to predict a real-valued output. In multi-output regression, there can be multiple real-valued outputs to predict.
- **Denoising**: the algorithm takes as input a noisy vector $\tilde{x}$, and has to produce a clean output vector $x$. 
- **Auto-encoding**: the algorithm takes as input $x$, and has to predict the exact same output $x$. It is however constrained by it's capacity. If the algorithm doesn't manage to reproduce an example, then that can be indicator that the input is an outlier/anomaly with respect to the training data.
- **Density estimation**: the algorithm has to output the density $p$ of a vector $x$. In effect the learning algorithm acts as a non-parametric probability distribution that fits the data.
- **Optimal control**: The algorithm inputs a state $s$ of the *dynamic system/environment* and has to output an action $a$ then apply it in the *dynamic system/environment* (Deep reinforcement learning). 
- **Generative models**: Generate data belonging to the training data distribution $p_{x}$, starting from a vector (usually random) $z$. GANs or diffusion models are generative models.

# Yet another explanation of backprop

There are many tutorials on backpropagation out there. I've skimmed through a bunch of them, and overall my favorite was [this one](https://www.ritchievink.com/blog/2017/07/10/programming-a-neural-network-from-scratch/) by Ritchie Vink. I preferred because the code examples are of good quality and give a lot of leeway for improvement. [This](https://victorzhou.com/blog/intro-to-neural-networks/) blogpost by Victor Zhou also helped me develop a mental model of what's going on.

## Neural networks in a nutshell

A neural network is a sequence of layers. Every layer takes as input $x$ and outputs $z$. We can denote this by a function which we call $f$:

$$z = f(x)$$

Note that the input $x$ can be a set of features, as well as the output from another layer. In the case of a dense layer, $f$ is an affine transformation:

$$z = w x + b$$

When we stack layers, we are simply chaining functions:

$$\hat{y} = f(f(f(\dots(f(x)))))$$

In the case of dense layers, which are linear, chaining them essentially results in a linear function. This means that even if we have a million dense layers stacked together, we still won't be able to learn non-linear patterns such as the XOR function. To add non-linearity, we add an *activation function* after each layer. Let's call these activation functions $g$. The output from the activation functions will be called $a$.

$$a = g(f(x))$$

When we stack layers, our final output is:

$$\hat{y} = g(f(g(f(\dots(g(f(x)))))))$$

Of course there are many more flavors of neural networks but that's the general idea. In the case of using dense layers, we're looking to tune the weights $w$ and biases $b$. That's where backpropagation comes in.

## Loss functions
A loss function is a function that takes as input the output of the network $\hat{y}$ and the ground truth $y$ and outputs a scalar value. The goal of the loss function is to indicate how far off the network is from the ground truth. The learning algorithm will then try to minimize the loss function. Here are a few examples of loss functions:

**Mean Squared Error (MSE)**

Used for general purpose regression. Penalizes large mistakes.

$$L(y, \hat{y}) = (y - \hat{y}) ^ 2$$

**Logistic loss**

The most common loss used for classification is the cross-entropy loss. It's also called the logistic loss. It's used for binary classification, but can be extended to multi-class classification. It's also used for estimating probabilities.

$$L(y, p) = log(1 + exp(-yp))$$

**Poisson loss**

Used for estimating counts (arrivals in an airport, number of call events to call center, etc.), which is a specific case of regression.

$$L(y, \hat{y}) = \hat{y} - y \times log(\hat{y})$$

**Hinge loss**

$$L(y, p) = max(0, 1 - yp)$$

Loss functions have a big impact on the learning algorithm. For instance the only difference between linear regression and logistic regression is that linear regression is a linear model with a squared loss whereas logistic regression is a linear model with a logistic loss.

There are many more loss functions that you can use in machine learning. You can even design your own! For example the deep learning community introduced the [focal loss](https://arxiv.org/abs/1708.02002) to deal with imbalanced datasets. Vowpal Wabbit also [designed](https://arxiv.org/abs/1011.1576) a set of loss functions that support importance weights.


## Backpropagation

First of all, let's get the chain rule out of the way. Say you have a function $f$, a function $g$, and an input $x$. If we compose our functions and apply them to $x$ we get $g(f(x))$. Now say we want to find the derivative of $g$ with respect to $x$. The trick is that there the function $f$ in between $g$ and $x$. In this case we use the chain rule, which gives us:

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial f} \times \frac{\partial f}{\partial x}$$

In other words, in order to compute $\frac{\partial g}{\partial x}$, we have to compute $\frac{\partial g}{\partial f}$ and $\frac{\partial f}{\partial x}$ and multiply them together. The chain rule is thus just a tool that we can add to our toolkit. In the case of neural networks it's super useful because we're basically just chaining functions. 

Let's say we're looking at the weights of the final layer. We'll call them $w$. The output of the network is denoted as $\hat{y}$ whilst the ground truth is $y$. We have a loss function $L$ which indicates the error between $y$ and $\hat{y}$. To update the weights, we need to calculate the gradient of the loss function with respect to the weights:

$$\frac{\partial L}{\partial w}$$

In between $w_i$ and $L$, there is the application of the dense layer and the activation function. We can thus apply the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}$$

In the case where our loss function is the mean squared error, the derivative is:

$$\frac{\partial L}{\partial a} = 2 \times (a - y)$$

For a sigmoid activation function, the derivative is:

$$\frac{\partial a}{\partial z} = \sigma(z) (1 - \sigma(z))$$

where $\sigma$ is in fact the sigmoid function. In the case of a dense layer, the derivative is:

$$\frac{\partial z}{\partial w} = x$$

We simply have to multiply all these elements together in order to obtain $\frac{\partial L}{\partial w}$:

$$\frac{\partial L}{\partial w} = (2 \times (a - y)) \times (\sigma(z) (1 - \sigma(z))) \times x$$

Recall that $a$ is the output of the network after having been processed by the activation function. We could have as well called it $\hat{y}$ because we're looking at the final layer, but we use $a$ because it's more generic and applies to each layer in the network. $z$ is the output of the network *before* being processed by the activation function. Note that implementation wise we thus have to keep both in memory. We can't just obtain $a$ and erase $z$.

If we plug in a different activation function and/or a different loss function, then everything will still work as long as each element is differentiable. Note that if we use the identity activation function (which doesn't change the input and has a derivative of 1), then we're simply doing linear regression!

Now how about the weights of the penultimate layer (the one just before the last one). Well we "just" have write it down using the chain rule. Here goes:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial a_3} \times \frac{\partial a_3}{\partial z_3} \times \frac{\partial z_3}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial z_2}{\partial w_2}$$

We've indexed the $a$s and $z$s because we're looking at multiple layer. In this case $a_3$ is the output of the 3rd layer (we called it $a$ before) whilst $a_2$ is the output of the 2nd layer. An important thing to notice is that we're using $\frac{\partial L}{\partial a_3} \times \frac{\partial a_3}{\partial z_3}$, which we already calculated previously. We can exploit this when we implement backpropagation in order to speed up our code but also make it shorter.

Here is the gradients for the weights of the 1st layer:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial a_3} \times \frac{\partial a_3}{\partial z_3} \times \frac{\partial z_3}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_1}{\partial w_1}$$

Again the first four elements of the product have already been computed.

How about the biases $b_i$? Well in a dense layer the derivative with respect to the biases is 1 (it was $x$ with respect to the weights). For the 3rd layer this will result in:

$$\frac{\partial L}{\partial b} = (2 \times (a - y)) \times (\sigma(z) (1 - \sigma(z))) \times 1$$

## Stochastic gradient descent

1. For each observation $(x_i, y_i)$, the gradient $\nabla_i$ is obtained
2. An optimizer takes care of obtaining the new weights $w_{i+1}$ by modifying the current weights $w_i$ and using the current gradient $\nabla_i$
3. We can loop multiple times through the dataset; each iteration is called an **epoch**

A general formulation of stochastic gradient descent (SGD):

$$w_{i+1} \leftarrow f(w_i, \nabla_i, \eta_i)$$

$\eta_i$ is the learning rate at iteration $i$, it's *extremely* important and we'll come back to it very soon.

<div class="alert alert-block alert-info">
    
<b> Exercise 0: </b>  
In your words comment the following code. Explain what is happening at each step.
</div>

In [1]:
import numpy as np
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing


class ReLU:
    """Rectified Linear Unit (ReLU) activation function."""

    @staticmethod
    def activation(z):
        z[z < 0] = 0
        return z

    @staticmethod
    def gradient(z):
        z[z < 0] = 0
        z[z > 0] = 1
        return z


class Sigmoid:
    """Sigmoid activation function."""

    @staticmethod
    def activation(z):
        return 1 / (1 + np.exp(-z))

    @staticmethod
    def gradient(z):
        s = Sigmoid.activation(z)
        return s * (1 - s)


class Identity:
    """Identity activation function."""

    @staticmethod
    def activation(z):
        return z

    @staticmethod
    def gradient(z):
        return np.ones_like(z)


class MSE:
    """Mean Squared Error (MSE) loss function."""

    @staticmethod
    def loss(y_true, y_pred):
        return np.mean((y_pred - y_true) ** 2)

    @staticmethod
    def gradient(y_true, y_pred):
        return 2 * (y_pred - y_true)


class SGD:
    """Stochastic Gradient Descent (SGD)."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def step(self, weights, gradients):
        weights -= self.learning_rate * gradients


class NN:
    """

    Parameters:
        dimensions (tuples of ints of length n_layers)

    """

    def __init__(self, dimensions, activations, loss, optimizer):
        self.n_layers = len(dimensions)
        self.loss = loss
        self.optimizer = optimizer

        # Weights and biases are initiated by index. For a one hidden layer net you will have a w[1] and w[2]
        self.w = {}
        self.b = {}

        # Activations are also initiated by index. For the example we will have activations[2] and activations[3]
        self.activations = {}
        for i in range(len(dimensions) - 1):
            self.w[i + 1] = np.random.randn(dimensions[i], dimensions[i + 1]) / np.sqrt(dimensions[i])
            self.b[i + 1] = np.zeros(dimensions[i + 1])
            self.activations[i + 2] = activations[i]

    def _feed_forward(self, X):
        """Executes a forward pass through the neural network.

        This will return the state at each layer of the network, which includes the output of the
        network.

        Parameters:
            X (array of shape (batch_size, n_features))

        """

        # z = w(x) + b
        z = {}

        # a = f(z)
        a = {1: X}  # First layer has no activations as input

        for i in range(2, self.n_layers + 1):
            z[i] = np.dot(a[i - 1], self.w[i - 1]) + self.b[i - 1]
            a[i] = self.activations[i].activation(z[i])

        return z, a

    def _backprop(self, z, a, y_true):
        """Backpropagation.

        Parameters:
            z (dict of length n_layers - 1):

                z = {
                    2: w1 * x + b1
                    3: w2 * (w1 * x + b1) + b2
                    4: w3 * (w2 * (w1 * x + b1) + b2) + b3
                    ...
                }

            a (dict of length n_layers):

                a = {
                    1: x,
                    2: f(w1 * x + b1)
                    3: f(w2 * (w1 * x + b1) + b2)
                    4: f(w3 * (w2 * (w1 * x + b1) + b2) + b3)
                    ...
                }

            y_true (array of shape (batch_size, n_targets))

        """

        # Determine the partial derivative and delta for the output layer
        y_pred = a[self.n_layers]
        final_activation = self.activations[self.n_layers]
        delta = self.loss.gradient(y_true, y_pred) * final_activation.gradient(y_pred)
        dw = np.dot(a[self.n_layers - 1].T, delta)

        update_params = {
            self.n_layers - 1: (dw, delta)
        }

        # Go through the layers in reverse order
        for i in range(self.n_layers - 2, 0, -1):
            delta = np.dot(delta, self.w[i + 1].T) * self.activations[i + 1].gradient(z[i + 1])
            dw = np.dot(a[i].T, delta)
            update_params[i] = (dw, delta)

        # Update the parameters
        for k, (dw, delta) in update_params.items():
            self.optimizer.step(weights=self.w[k], gradients=dw)
            self.optimizer.step(weights=self.b[k], gradients=np.mean(delta, axis=0))

    def fit(self, X, y, epochs, batch_size, print_every=np.inf):
        """Trains the neural network.

        Parameters:
            X (array of shape (n_samples, n_features))
            y (array of shape (n_samples, n_targets))
            epochs (int)
            batch_size (int)

        """

        # As a convention we expect y to be 2D, even if there is only one target to predict
        if y.ndim == 1:
            y = np.expand_dims(y, axis=1)

        # Go through the epochs
        for i in range(epochs):

            # Shuffle the data
            idx = np.arange(X.shape[0])
            np.random.shuffle(idx)
            x_ = X[idx]
            y_ = y[idx]

            # Iterate over the training data in mini-batches
            for j in range(X.shape[0] // batch_size):
                start = j * batch_size
                stop = (j + 1) * batch_size
                z, a = self._feed_forward(x_[start:stop])
                self._backprop(z, a, y_[start:stop])

            # Display the performance every print_every eooch
            if (i + 1) % print_every == 0:
                y_pred = self.predict(X)
                print(f'[{i+1}] train loss: {self.loss.loss(y, y_pred)}')

    def predict(self, X):
        """Predicts an output for each sample in X.

        Parameters:
            X (array of shape (n_samples, n_features))

        """
        _, a = self._feed_forward(X)
        return a[self.n_layers]

California housing dataset

## Input normalization

When you're doing gradient descent, the scale of the features matters a lot. Indeed the magnitudes of the gradient descent steps are influenced by the absolute values of the features. If the features are too large, then the gradient steps might be too large and the model will diverge.

**Always scale your data**

99% of the time, it is recommended to scale your data so that each feature has mean 0 and standard deviation 1. See [this](http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html) for a deeper explanation.
 
You can use scikit-learn's `scale` method from the `preprocessing` module.

In [2]:
from sklearn.datasets import fetch_california_housing

In [3]:
housing = fetch_california_housing()

np.random.seed(1)

housing = fetch_california_housing()
X = housing["data"]
y = housing["target"]

X = preprocessing.scale(X)

# Split into train and test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y,
    test_size=.3,
    shuffle=True,
    random_state=42
)

nn = NN(
    dimensions=(8, 10, 1),
    activations=(ReLU, Identity),
    loss=MSE,
    optimizer=SGD(learning_rate=1e-3)
)
nn.fit(X_train, y_train, epochs=30, batch_size=8, print_every=10)

y_pred = nn.predict(X_test)

print(metrics.mean_absolute_error(y_test, y_pred))

[10] train loss: 0.40180578575148257
[20] train loss: 0.36574521376471303
[30] train loss: 0.35598143129945997
0.4271864357461832


Digits.

In [4]:
np.random.seed(1)

X, y = datasets.load_digits(return_X_y=True)

# One-hot encode y
y = np.eye(10)[y]

# Split into train and test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y,
    test_size=.3,
    shuffle=True,
    random_state=42
)

nn = NN(
    dimensions=(64, 15, 10),
    activations=(ReLU, Sigmoid),
    loss=MSE,
    optimizer=SGD(learning_rate=1e-3)
)
nn.fit(X_train, y_train, epochs=50, batch_size=16, print_every=10)

y_pred = nn.predict(X_test)

print(metrics.classification_report(y_test.argmax(1), y_pred.argmax(1)))

[10] train loss: 0.008308476136280956
[20] train loss: 0.004984925198988305
[30] train loss: 0.004102445263740697
[40] train loss: 0.0029634369443098745
[50] train loss: 0.0018708680417568037
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        53
           1       0.96      0.98      0.97        50
           2       0.94      1.00      0.97        47
           3       0.96      0.96      0.96        54
           4       0.98      1.00      0.99        60
           5       0.94      0.97      0.96        66
           6       0.98      0.98      0.98        53
           7       1.00      0.98      0.99        55
           8       1.00      0.93      0.96        43
           9       0.98      0.93      0.96        59

    accuracy                           0.97       540
   macro avg       0.98      0.97      0.97       540
weighted avg       0.97      0.97      0.97       540



**If you want to understand what's going on under the hood of your favorite deep learning framework [here](https://github.com/3outeille/Yaae)**

## 1. Deep learning framework : Pytorch

Pytorch is a deep learning framework allowing to automate many operations.
It also provides a lot of tools to create and train deep learning models.

In deep learning, the type of data used is the **Tensor**.
A tensor is an array of data that can contain vectors,images and much more ! 


You can instantiate a tensor using the `torch.tensor` method.


<div class="alert alert-block alert-info">
    
<b> Exercise 1.0: </b>
* Create a tensor `age_1` containing your age with type `long` and a tensor `size_1` containing your height in cm with type `float32`.
</div>

In [6]:
import torch
age_1 = torch.tensor(24).float()
size = torch.tensor(173).float()

<div class="alert alert-block alert-info">
    
<b> Exercise 1.1: </b>  
Imagine your task is to predict the life expectancy of a person $y_i$ from a set of bioligical measure $(x_i^0,x_i^1,x_i^2,\ldots,x_i^6)$.
* Create three tensors `x_1`, `x_2` and `x_3` of biological data and form a data batch of size 3 `batch_x` with these three tensors.

**Tip**: Use `torch.stack`.
</div>

In [7]:
x_1 = torch.tensor([0.5, 1.2, 3.3, 4.4, 5.5, 6.6, 7.7])
x_2 = torch.tensor([1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7])
x_3 = torch.tensor([2.1, 3.2, 4.3, 5.4, 6.5, 7.6, 8.7])

batch_x = torch.stack([x_1, x_2, x_3])

You can also check some attributes of your tensor.
For example you can look at the shape of the tensor using the `shape` attribute, the gradient of a tensor using the `grad` attribute and the type using `dtype`.

You can also see under which device your tensor is with the attribute `device`.
Finally you can also put your data on gpu using the `torch.tensor.to` method.
The possible devices are "cpu" and "cuda".

<div class="alert alert-block alert-info">
    
<b> Exercise 1.2: </b>  
* Look at the device, gradient, type and shape of your tensors.
* Change the device of your tensors to "cuda".
</div>

In [9]:
# Look at the device, gradient, type and shape of your tensors
print(f"Device of age_1: {age_1.device}, Gradient: {age_1.grad}, Type: {age_1.dtype}, Shape: {age_1.shape}")
print(f"Device of size: {size.device}, Gradient: {size.grad}, Type: {size.dtype}, Shape: {size.shape}")
print(f"Device of batch_x: {batch_x.device}, Gradient: {batch_x.grad}, Type: {batch_x.dtype}, Shape: {batch_x.shape}")

# Change the device of your tensors to "cuda"
age_1 = age_1.to("cuda")
size = size.to("cuda")
batch_x = batch_x.to("cuda")

# Verify the device change
print(f"Device of age_1 after transfer: {age_1.device}")
print(f"Device of size after transfer: {size.device}")
print(f"Device of batch_x after transfer: {batch_x.device}")

Device of age_1: cpu, Gradient: None, Type: torch.float32, Shape: torch.Size([])
Device of size: cpu, Gradient: None, Type: torch.float32, Shape: torch.Size([])
Device of batch_x: cpu, Gradient: None, Type: torch.float32, Shape: torch.Size([3, 7])


AssertionError: Torch not compiled with CUDA enabled

**Now you will using again the California housing dataset.** 

To manage more easily the processing of your data, pytorch proposes a tool: the `Dataset` class.

This class allows the creation of a data generator which will be very useful when training your model.

This dataset allows to retrieve the data as a tensor.
The dataset implements 2 methods : The `__get_item__` method which allows to access to a sample and the `__len__` method which returns the number of sample.

<div class="alert alert-block alert-info">
    
<b> Exercise 1.4: </b>  
Implement a torch dataset allowing you to access the boston dataset.
* The `__get_item__` method should return a dict containing the data and labels in tensor form.
* The `__len__` method should return the number of samples.

</div>

In [12]:
class BostonDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __getitem__(self, idx):
        return {"data": self.X[idx], "label": self.y[idx]}

    def __len__(self):
        return len(self.X)

NameError: name 'Dataset' is not defined

In [None]:
from torch.utils.data import Dataset

In [None]:
class TabularDataset(Dataset):
  def __init__(self,X,y):
    self.X = ...
    self.y = ...
  def __getitem__(self,idx): 
    return {"data": ...,"label": ...}

  def __len__(self):
    return ...


<div class="alert alert-block alert-info">
    
<b> Exercise 1.5: </b>  
Set up a dataset for the train set and one for the validation set

</div>

In [None]:
train_dataset = ...
val_dataset = ...

In [None]:
print(f"First sample (x,y) of the dataset : {train_dataset[0]} \n") # get_item method
print(f"There are {len(train_dataset)} samples in the dataset.") # len method

Once the dataset is created, another class is used to wrap this dataset: The `Dataloader`.
A dataloader allows to sample batches of dataset data and to parallelize the batch formation on several workers.

## Mini-batching

At each iteration $i$, instead of updating the weights $w_i$ by using the gradient $g_i$, we can accumulate the gradients and only update the weights every $k$ iterations.

The gradient we will use to update the weights will thus be the average of the past $k$ gradients. This is called **mini-batch gradient descent**. Stochastic gradient descent can be seen as a special case of mini-batch gradient descent when the batch size is set to 1.

![mini-batch](mini-batch.png)

The batch size is important. Small batch size work well because they are a form of regula

Here are some links if you want to get some intuitions:

- [Tradeoff batch size vs. number of iterations to train a neural network - Cross Validated](https://stats.stackexchange.com/questions/164876/tradeoff-batch-size-vs-number-of-iterations-to-train-a-neural-network)
- [What is batch size in neural network? - Cross Validated](https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network)
- [In deep learning, why don't we use the whole training set to compute the gradient? - Quora](https://www.quora.com/In-deep-learning-why-dont-we-use-the-whole-training-set-to-compute-the-gradient)

<div class="alert alert-block alert-info">
    
<b> Exercise 1.6: </b>  
Create a train dataloader and a validation dataloader that returns sample batches of size 16.

</div>

In [None]:

from torch.utils.data import DataLoader

In [None]:
batch_size = ...
num_workers = ...
train_dataloader = ...
val_dataloader = ...

<div class="alert alert-block alert-info">
    
<b> Exercise 1.7: </b>  
* Inspect the first batch of your training loader.
* Create a data variable and a label variable containing respectively, the data and the labels of the first batch

**Tip**: To get the first element of the loader use: `next(iter(loader))`
</div>

In [None]:
first_batch = ...
data = ...
label = ...

In [None]:
import torch.nn as nn
class MLPRegression(nn.Module): # All pytorch models must inherit the nn.Module
  def __init__(self,in_features):
     super(MLPRegression, self).__init__() # The constructor of the class calls the constructor of its parent class with the keyword "super".
     self.layer1 = ...
  def forward(self,x):
    value = ...
    return value


<div class="alert alert-block alert-info">
    
<b> Exercise 1.9: </b>  
Use your neural network to make a prediction on the first batch of data.
</div>

In [None]:
in_features = ...
mlp_regression = ...

In [None]:
preds = ...

Now we will move on to the most important part of the session: Training the model.
Writing a **training loop** is not a simple thing at first.
The following code implements a generic training loop that I advise you to keep for your future work in deep learning.

<div class="alert alert-block alert-info">
    
<b> Exercise 1.10: </b>  
* Explain the role of each argument of the function
* Fill in the blank code
* Comment on each line of the training loop (except the lines used for visualization).
* Raise your hand to give me an oral report on this question 
</div>

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
sns.set()
def train_regressor(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    train_loader: DataLoader,
    valid_loader: DataLoader,
    nb_epoch: int,
    criterion: nn.Module,
    batch_size:int=16,
    device: torch.device = torch.device("cuda:0"),
    
    verbose: bool=True,
) -> None:
    """
    Pytorch training loop
    Args:
        model (nn.Module): Pytorch classification model
        optimizer (torch.optim.Optimizer): Optimizer for the model
        train_loader (DataLoader): DataLoader for training fold
        valid_loader (DataLoader): DataLoader for validation fold
        nb_epoch (int): Number of epoch
        criterion (nn.Module): Loss
        device (torch.device): .Defaults to `torch.device("cuda:0")`
        verbose (bool): Verbose term
    """
    loaders = {"train": train_loader, "validation": valid_loader}
    model.to(device)
    train_loss = []
    val_loss = []
    for epoch in range(1, nb_epoch + 1):
        if verbose:
          print("-" * 80)
        for phase in ["train", "validation"]:
            if phase == "train":
                model.train()
            else:
                model.eval()
            running_loss = 0.0
            for sample in loaders[phase]:
                data = ...
                label = ...
                optimizer.zero_grad()
                data, label = data.to(device), label.to(device)
                with torch.set_grad_enabled(phase == "train"):
                    output = ...
                    pred_label = ...
                    loss = ...
                    if phase == "train":
                        ... # Compute the gradient
                        ... # Make a gradient step
                running_loss += loss.item()
            epoch_loss = running_loss / (len(loaders[phase].dataset)/batch_size)
            if phase == "train":
              train_loss.append(epoch_loss)
            else:
              val_loss.append(epoch_loss)
            if verbose:
              print(
                  f" Epoch number: {epoch}, Phase: {phase}, Loss value: {epoch_loss:.4f}"
              )
    fig,(axe1,axe2) = plt.subplots(2,figsize=(10,10))
    fig.suptitle('Training and validation statistics')


    y_pred = model(torch.from_numpy(X_val).to(device)).detach().cpu().numpy()
    axe1.plot(np.arange(len(train_loss)),train_loss,label="train MSE")
    axe1.plot(np.arange(len(val_loss)),val_loss,label = "val MSE")
    axe1.set_title("MSE loss")
    axe1.legend()
    axe2.scatter(range(len(y_val)), scaler.inverse_transform(y_val), label='target');
    axe2.scatter(range(len(y_val)), scaler.inverse_transform(y_pred), label='prediction');
    axe2.set_title(f"Prediction // MSE = {mean_absolute_error(y_pred, y_val)}")
    axe1.legend()
    plt.show()
    return train_loss,val_loss

<div class="alert alert-block alert-info">
    
<b> Exercise 1.11: </b>  
Instantiate the list of parameters necessary to launch the function and justify each of these parameters (Why use this loss, why use this optimizer, why set this learning rate, why use this architecture ...)  
</div>

**Tip**: Use `Adam` optimizer and have a look to https://pytorch.org/docs/stable/nn.html#loss-functions

In [None]:
from torch.optim import Adam,SGD
mlp_logistic = ...
batch_size = ...
num_workers = ...
train_dataloader = ...
val_dataloader = ...
learning_rate = ...
optimizer = ...
criterion = ...
device = ...
nb_epochs = ...
verbose = ...

In [None]:
train_loss,val_loss= ...

<div class="alert alert-block alert-info">
    
<b> Exercise 1.12: </b>  
Comment the results on : 
* The loss function
* The model error

Can we do better and how?
</div>

<div class="alert alert-block alert-info">
    
<b> Exercise 3.13: </b>  
* Implement a new deeper architecture.
* Set the necessary parameters for the training again and restart the procedure.
* Comment on the new results obtained
</div>

In [None]:
import torch.nn as nn
class MLPDeep(nn.Module):
  def __init__(self,in_features):
     super(MLPDeep, self).__init__()
     ...
  def forward(self,x):
    ...
    return value

In [None]:
mlp_deep = ...
batch_size = ...
num_workers = ...
train_dataloader = ...
val_dataloader = ...
learning_rate = ...
optimizer = ...
criterion = ...
device = ...
nb_epochs = ...
verbose = ...

In [None]:
train_loss,val_loss= ...

<div class="alert alert-block alert-info">
    
<b> Exercise 1.14: </b>  
* Train the model for 100 epochs and comment on the results obtained.
* Plot the loss function and the model error as a function of the epochs. For the training and validation set.
</div>

The phenomena of exercise 1.14 is called overfitting.


<div class="alert alert-block alert-info">
<b> Exercise 1.15: </b>   
* Explain in your own words what is overfitting.
* How can we avoid this phenomenon?
</div>

<div class="alert alert-block alert-info">
    
<b> Exercise 3.14: </b>  
* Implement early stopping in the training loop.
* Add some regularization 
* Add weight decay on the optimizer and explain in your own words what is weight decay.
</div>

## Regularization

In the previous cells we just mentionned the word **regularization**, what is it exactly?

1. A machine learning model learns from a training set 
2. We want the model to perform well on a test set it hasn't seen
3. The stronger the model, the more there is a chance that it overfits by memorizing patterns that only exists in the training set
4. Regularizing a model means that we make it's life harder  
5. There are many ways to regularize a neural network:
    1. Use more data! The more training data there is, the more the model will focus on general patterns
    2. Penalize the updates made to the weights by the optimizer 
    3. Use dropout
    4. Use batch normalization
    5. Use early stopping
    
![complexity](complexity.png)

# How di I choose the architecture of my neural network ?
There are so many choices you can make that it can quickly become overwhelming. Choosing the right pieces of the puzzle is very much an art rather than a science. There is no getting around trying things out. It is thus very important to setup a stable environment for testing and evaluating model choices. You should always start by defining a reliable testing procedure.

Here is some general advice:

1. Start with a very simple model (for example a logistic regression)
2. Add complexity to the model as long as the validation score improves
3. If the model is overfitting (you can detect it by comparing the training and validation scores) then add regularization
4. Last but not least, spend most of your time checking that the data you're using is correct

Here are some more links if you're interested:

- [How to decide neural network architecture? - Data Science exchange](https://datascience.stackexchange.com/questions/20222/how-to-decide-neural-network-architecture)
- [How to choose the number of hidden layers and nodes in a feedforward neural network? - Cross Validated](https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw)