In [None]:
import time

from tensorflow import keras 
from tensorflow.keras import models
from tensorflow.keras import optimizers
from tensorflow.keras import layers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from pandas import DataFrame
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
%matplotlib inline

# Deep Learning week - Multiclass Classification Exercise

The data are created with the `make_blobs` function of scikit-learn. 
It returns labeled data, so this notebook is a multiclass classification task: based on the input data $x$, predict whether the sample belongs to the first, second, third, ... category.

# Create data

The `make_blobs` function [(see documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) lets you draw: 
- an arbitrary number of data samples, argument `n_samples`
- an arbitrary number of features per data sample, argument `n_features`
- an arbitrary number of categories, argument `centers`
- a spread (standard deviation) for each category, argument `cluster_std`; the larger it is, the more the categories overlap

There is also the `random_state` argument, which makes the draw deterministic so that the same data can be reproduced. Two people who choose the same `random_state` will get the same data.
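
As a small illustration of these arguments, the sketch below draws a toy dataset and checks that two draws with the same `random_state` coincide. The sizes are arbitrary and deliberately differ from the ones asked in the exercise; the names `X_toy`, `y_toy`, `X_again` and `y_again` are only used here.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Toy values chosen for illustration only (not the ones asked in the exercise).
X_toy, y_toy = make_blobs(n_samples=10, n_features=2, centers=3,
                          cluster_std=1.0, random_state=0)
X_again, y_again = make_blobs(n_samples=10, n_features=2, centers=3,
                              cluster_std=1.0, random_state=0)

print(X_toy.shape, y_toy.shape)          # one row per sample, one label per sample
print(np.array_equal(X_toy, X_again))    # same random_state, same data
```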

### Question : Generate data with : 
- 1200 samples
- 8 features per sample
- 7 categories of data
- 8 as the `cluster_std` (the spread of each category)

Select a `random_state` equal to 1.

Print the shapes and check that they are (1200, 8) for `X` and (1200,) for `y`

In [None]:
X, y = make_blobs(### TODO) 
X.shape, y.shape

### Question 

Using matplotlib, plot two (arbitrary) dimensions of the input data. Each dot should be colored according to the category it belongs to.

In [None]:
######## Plot

### TODO 

### Question : repeat the operation on other dimensions, to check visually that the data are not easily separable

In [None]:
######## Plot

### TODO 

For now, `y` is a list of integers, each corresponding to the category of the related input sample.
It looks like `[3, 2, 2, 3, 0, 5, 1, 1, 0, 5, ...]` (in this example, we have 6 categories, from 0 to 5).

However, for a categorical task in Keras, the target should have as many columns as there are categories. Each row, corresponding to one input sample, lists the probabilities that this input belongs to each category. Since the true category is known here, its probability is 1 and all the others are 0, so it should look like

```
[
[0, 0, 0, 1, 0, 0], 
[0, 0, 1, 0, 0, 0], 
[0, 0, 1, 0, 0, 0], 
[0, 0, 0, 1, 0, 0], 
[1, 0, 0, 0, 0, 0], 
[0, 0, 0, 0, 0, 1], 
[0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1],
...
]
```

Each column corresponds to a category. Each row corresponds to a target, the 1 being the category the input data belongs to.

To transform `y` into this one-hot format, use the `to_categorical` function from Keras (already imported). 
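
To make the target structure concrete, here is a NumPy-only sketch that builds the same kind of one-hot matrix by hand. It is a stand-in meant to illustrate the expected output, not Keras' own implementation, and the names `y_example`, `num_classes` and `y_example_cat` are only used for this illustration.

```python
import numpy as np

# First labels of the 6-category example (categories from 0 to 5).
y_example = np.array([3, 2, 2, 3, 0, 5])
num_classes = 6

# Row i of the identity matrix is the one-hot vector of category i.
y_example_cat = np.eye(num_classes)[y_example]
print(y_example_cat)
```

Each row contains a single 1, located in the column of the sample's category.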


### Question: First print `y`, then apply `to_categorical` to it, store the result in `y_cat`, and print `y_cat` to see the new structure.

In [None]:
print(y)

### TODO 

print(y_cat)

### Question : Split the initial dataset into a train and test set (size: 70/30%)

Remark : Please call the variables `X_train`, `y_train`, `X_test` and `y_test`. Since the model is trained on one-hot targets, split `X` together with `y_cat`.

In [None]:
### TO DO 

For technical reasons, the data should be rescaled so that each feature is centered on 0 with unit variance, which keeps most values within a few units of 0.
The `StandardScaler` class from scikit-learn [(see documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) does this easily.

[Advanced notion] The technical reason for this standardisation is partly due to the activation functions, whose non-linearity and _variations_ are concentrated around 0.

The scaler should be applied as 
```
SScaler = StandardScaler()
SScaler.fit(X)             ### Used to fit the coefficients of the standardisation (mean and standard deviation)
X = SScaler.transform(X)   ### Used to rescale X with the fitted coefficients
```
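
To see what `fit` and `transform` actually compute, the following sketch compares `StandardScaler` with the same operation written by hand in NumPy. The toy array and the names `X_toy`, `scaler`, `X_scaled` and `X_manual` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 4 samples, 2 features on very different scales (illustrative values).
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

scaler = StandardScaler()
scaler.fit(X_toy)                      # learns the per-feature mean and standard deviation
X_scaled = scaler.transform(X_toy)     # applies (x - mean) / std with the learned values

# The same computation written by hand with NumPy.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(scaler.mean_, scaler.scale_)
print(np.allclose(X_scaled, X_manual))
```

The coefficients learned by `fit` are stored in the scaler, so `transform` can later be applied to any other array with the same features.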

### Question: Given that you split your dataset into `X_train` and `X_test`, how would you perform this task? 

In [None]:
### TODO 

# Initialize the model

Once the data is set, we will initialize your first Neural Network.

In [None]:
def initialize_model():
    
    ### The first lines are as in the previous model, except for the input_dimension that corresponds to the
    ### number of features per sample we have, i.e. 8.
    model = models.Sequential()
    model.add(layers.Dense(100, input_dim=8, activation='relu'))
    model.add(layers.Dense(7, activation='softmax'))
    
    ### Here, the real difference is the name of the loss. The loss is not designed to distinguish between two categories
    ### but between multiple categories.
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    
    return model

# Fit the model - Reminder

Reminder : One parameter-update step relies on _backpropagation_. It is done by evaluating the prediction on a set of `N` data samples, called the batch. `N` is thus the batch size. One iteration over the data (an epoch) is complete when the updates have been made for all the batches, i.e. the model went through all the data once, and only once.

This is an example of the model fitting, with :
- training data (input and output)
- a validation set, i.e. data not used for training, on which the model computes the loss and metrics to monitor its generalization
- the epochs, i.e. the number of iterations over the whole training data
- a batch_size
- verbose: a commonly used argument to output some logs. It usually goes from 0 (no logs) to greater numbers, each associated with a certain amount of logs.
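
As a quick arithmetic check of the reminder, the sketch below counts the parameter updates made during training. It assumes a training set of 840 samples (70% of 1200) and reuses the `batch_size` and `epochs` values of the fit call of this notebook; the variable names are only illustrative.

```python
import math

n_train = 840      # assumed: 70% of the 1200 samples
batch_size = 50
epochs = 400

updates_per_epoch = math.ceil(n_train / batch_size)   # the last batch may be smaller
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)
```

A smaller `batch_size` therefore means more updates per epoch, which is worth keeping in mind when comparing training times later on.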


In [None]:
model = initialize_model()

history = model.fit(X_train, y_train, 
                    validation_data=(X_test, y_test), 
                    epochs=400, 
                    batch_size=50,
                    verbose=0)

You can check the results on the train set with the following command. The results contain the list of evaluated values: first the loss (here, the categorical_crossentropy), then the metrics that were listed in the `metrics` argument.
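
To make these two numbers less abstract, here is a NumPy sketch of how a categorical cross-entropy and an accuracy can be computed from one-hot targets and predicted probabilities. The toy values are made up for the illustration, and Keras' own implementation may differ in numerical details (for instance, it clips the probabilities).

```python
import numpy as np

# Toy one-hot targets and predicted probabilities for 3 samples, 3 categories.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.5, 0.2]])

# Categorical cross-entropy: minus the log-probability given to the true category,
# averaged over the samples.
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Accuracy: fraction of samples whose most probable category is the true one.
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(loss, accuracy)
```

The third sample is misclassified, which both lowers the accuracy and raises the loss.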

In [None]:
results = model.evaluate(X_train, y_train, verbose=0)
print('Train loss: {} - Train accuracy: {}'.format(results[0], results[1]))

results = model.evaluate(X_test, y_test, verbose=0)
print('Test loss: {} - Test accuracy: {}'.format(results[0], results[1]))

### Question : Write a function, that given the `history` returned by the `model.fit`, plots two figures:


- The first figure represents two curves, the first being the value of the train loss during the iterations, the second being the value of the test loss during the iterations.

- The second figure has also two curves, the train accuracy and the test accuracy at each iteration.

### Question bis : Use this function on the history you got previously and comment on it

In [None]:
def plot_loss_accuracy(history):
    ### TODO

In [None]:
plot_loss_accuracy(history)

You again see a strong overfitting effect: the Neural Network gets better and better on the examples it sees, but it lacks generalization, in the sense that the test loss and accuracy get worse.

### Question: As in the previous notebook, use the following Early Stopping Criterion : 

`es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=30)`

To call it, you should just add `callbacks=[es]` to your `model.fit`.
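
The role of `patience` can be illustrated without Keras. The sketch below is a simplified, hypothetical re-implementation of the stopping rule (it ignores options such as `min_delta` or `restore_best_weights`): training stops once the monitored loss has not improved on its best value for `patience` consecutive epochs. The function name `early_stopping_epoch` and the toy losses are assumptions made for the illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement on the best value seen so far
            if wait >= patience:
                return epoch
    return None

# Toy validation losses: the best value is reached at epoch 2.
toy_val_losses = [1.0, 0.8, 0.6, 0.7, 0.65, 0.9, 1.1]
print(early_stopping_epoch(toy_val_losses, patience=3))
```

A larger `patience` tolerates longer plateaus before stopping, at the cost of more epochs.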

In [None]:
### Initialize the model, initialize the stopping criterion, and fit the model (store the output in `history`)
### TO DO 

### Question: Use the previously defined plot function to look at both the loss and accuracy now. What do you conclude?

In [1]:
### TODO 

In the previous set-up, you used the test set as the validation set. 

### Question: Would it be correct to say that your final accuracy is thus the one on this validation set?


To use a validation set independent from the test set, you can use the `validation_split` argument [(See documentation)](https://keras.io/models/sequential/) in `model.fit`. Use a split value of `0.3`, so that 70% of the `X_train` data are used for training and the remaining 30% are used as the validation set.
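
The Keras documentation describes `validation_split` as the fraction of the training data set aside for validation, taken from the last samples of the arrays before any shuffling. The NumPy sketch below mimics that behaviour on a toy array; it is not Keras' actual code, and the value `0.25` and the names `X_toy`, `X_fit` and `X_val` are illustrative only.

```python
import numpy as np

# Toy training set of 8 samples; 0.25 is an illustrative split value.
X_toy = np.arange(8).reshape(8, 1)
validation_split = 0.25

# The validation samples are the last ones of the array, taken before any shuffling.
split_at = int(len(X_toy) * (1 - validation_split))
X_fit, X_val = X_toy[:split_at], X_toy[split_at:]

print(len(X_fit), len(X_val))
```

Because the held-out samples come from the end of the arrays, data sorted by category should be shuffled before being passed to `model.fit` with this argument.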

### Question: Report now the loss and accuracy value on the `X_test` set thanks to `model.evaluate`

_Hint_ : It is written `results = model.evaluate(X_test, y_test, verbose=0)`, after initializing and training the model again


In [None]:
### TODO 

### Question: What is now your best accuracy? 

### Question/Answer/Remark : 

You might say that reporting a single test value is not correct, as we should do a proper K-fold cross-validation. This is perfectly correct. As the procedure within each fold is the same, we consider that you can do the K-fold on your own in a real setting. However, here, training K models each time would take too long for the allocated time (for your information, some Facebook/Google neural networks are trained for weeks on large distributed clusters).
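
For reference, the index bookkeeping of a K-fold can be sketched with scikit-learn's `KFold`. The toy array and the name `held_out` are assumptions for the illustration; the model training itself is left as a comment, since it is the same procedure in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)   # 6 toy samples, 2 features

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy)):
    # In a real run: initialize a fresh model, fit on train_idx, evaluate on test_idx.
    print(fold, train_idx, test_idx)
    held_out.extend(test_idx.tolist())
```

Every sample ends up in the held-out part of exactly one fold, and the reported score is usually the average over the folds.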


<hr><hr>

We will now look at the effect of the batch_size on the learning

### Question : Complete the function that, given `X`, `y` and the `batch_size`, does the following : 
- Splits `X` and `y` into a training and a test set
- Initializes the model
- Fits the model
- Evaluates the loss and accuracy on the test set
- Returns the `history` of the fit and the `results` of the evaluation on the test set.

In [None]:
def run_model_batch(X, y, batch_size):    
    ### Data split
    ### TODO 
    
    ### Model initialization
    ### TODO 
    
    ### Fitting the model 
    ### TODO 
    
    ### Evaluate on the test set
    ### TODO 
    
    ### Return 
    return history, results

### Question : Run this function for `batch_size=50`, then plot the `history` and print the results (loss and accuracy)

In [None]:
### TODO 

### Question : Run this function on multiple batch_sizes. 

! Warning : The function `time.time()` returns the current time. Use it twice to compute the time it takes for each `run_model_batch` to run.

At each iteration, plot the history, store the loss and accuracy and store the time it takes to run.
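
The timing pattern itself can be tried on any piece of code. In the sketch below, `dummy_training` is a hypothetical stand-in for a call to `run_model_batch`, used only so that the example runs on its own.

```python
import time

def dummy_training(n_steps):
    # Hypothetical stand-in for a call to run_model_batch.
    total = 0
    for i in range(n_steps):
        total += i * i
    return total

start = time.time()                 # time before the computation
dummy_training(100_000)
elapsed = time.time() - start       # elapsed time, in seconds
print('Elapsed time: {:.4f} s'.format(elapsed))
```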

In [None]:
loss = []
accuracy = []
elapsed_time = []
batch_sizes = [20, 50, 100, 250, 500]

for batch_size in batch_sizes:
    ### TODO : Starting time
    
    ### Computing history and results, and appending to the correct list
    ### TODO 
    
    ### TODO : Final time, and appending it to elapsed_time


### Question: plot the loss and accuracy with respect to the batch_size. 

### Question bis: Also plot the elapsed_time with respect to the `batch_size`. What explains this trend?

In [None]:
########## PLOT

### TODO 

In the following, we will fix the `batch_size` to 50 and the `patience` of the Early Stopping Criterion to 30.

### Write a function that, given `X_train`, `y_train`, `X_test`, `y_test`, `model` does the following : 
- Initializes the early stopping criterion (verbose to 1)
- Fits the model with a `validation_split` equal to 0.3 and 2000 `epochs` (do not forget the batch_size and the early stopping criterion)
- Evaluates the model on the test set
- Returns the `history` of the fit and the `results` of this evaluation

In [None]:
def run_model(X_train, y_train, X_test, y_test, model):    
    ### Early stopping criterion
    ### TODO 
    
    ### Fitting the model 
    ### TODO 
    
    ### Evaluation on the test set
    ### TODO 
    
    ### Return the results
    return history, results

### Question : Run the previous function on a newly initialized model, and print the results

In [None]:
### Initialize the model
### TODO 

### Run the model 
### TODO 

### Plot
### TODO 

### Question: Write a function `init` that initializes a model similarly to `initialize_model`, except that the activation function of the first layer _AND_ the loss function are parameters of `init`

In [None]:
def init(activation, loss):
    ### TODO 
    
    return model

### Question : Use the previous functions to do : 
- initialize a model with the `categorical_crossentropy` loss and the `relu` activation function
- use `run_model` to run the model
- print the results

In [None]:
#### TODO 

### Question : Now, loop over the different activation functions you can find [here](https://keras.io/activations/) (`relu`, previously used, is one of them) to see which one gives the best result


Store the results so that you can plot them  

In [None]:
accuracy = []

for activation in ['relu', 'softmax', 'linear', 'tanh']:
    ### TODO 

plt.plot(accuracy)

The `categorical_crossentropy` is not the only loss you can use. There are two more for multiclass classification tasks offered by Keras [(see here)](https://keras.io/losses/).

### Question: Do the same as previously, but for the `kullback_leibler_divergence` loss.

In [None]:
### TODO

# Now, let's look deeper at the optimizer.

In the 2-category example (first tutorial), we initialized the optimizer with a string : `'sgd'`, `'adam'`, `'adadelta'`, ... In fact, each of these optimizers depends on hyperparameters that have default values. There is no reason for these default values to be the best for the problem at hand; therefore, we will dig a bit deeper into their tuning.

!! Essential !! : If there is _one_ hyperparameter to remember, it is the _learning rate_.

### Question: Write an `init_2` function similar to the previous `init`, but instead of taking the activation and the loss as arguments, it takes the `optimizer`.

Set the loss to be `kullback_leibler_divergence` and the activation to be `relu`.

In [None]:
def init_2(optimizer):
    ### TODO 
    
    return model

### Question: Initialize a model, and run it as previously. As for the optimizer, you can use any string you want among the ones previously mentioned

In [None]:
### TODO 

Now, let's look at how to initialize an optimizer with non-default values, instead of the defaults you get when you give a string. This is an example where the learning rate `learning_rate` is equal to 0.01

In [None]:
sgd = optimizers.SGD(learning_rate=0.01)

model = init_2(sgd)
history, results = run_model(X_train, y_train, X_test, y_test, model)

### Question: Now, do the same with different values of the learning rate. Store and plot the accuracies.

In [None]:
## TODO 

In [None]:
###### PLOT
### TODO

Look at how many epochs it took for the early stopping criterion to stop the training, for different values of the learning rate. 

The reason for the differences is that the learning rate is the coefficient that scales each parameter update, as in this picture : 

![Learning rate](learning_rate.png)

Therefore, a learning rate that is too large prevents the algorithm from converging well. On the other hand, a learning rate that is too small makes the algorithm converge very slowly.
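
This behaviour can be reproduced on a one-dimensional toy problem. The sketch below runs plain gradient descent on $f(w) = w^2$, whose minimum is at $w = 0$, for three learning rates; the function `gradient_descent` and the chosen values are assumptions made for the illustration and have nothing to do with the Keras optimizers' internals.

```python
def gradient_descent(learning_rate, n_steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * 2 * w   # step against the gradient, scaled by the learning rate
    return w

for lr in [0.01, 0.4, 1.1]:
    print(lr, gradient_descent(lr))
```

With the smallest value, `w` is still far from 0 after 20 steps; with the intermediate one, it is essentially at the minimum; with the largest one, each step overshoots and `w` moves away from 0.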

<hr><hr>

Now, let's try the Adam optimizer, which has three main parameters: `learning_rate`, and `beta_1` and `beta_2`, which are both between 0 and 1 and generally close to 1.

It is written as : `adam = optimizers.Adam(learning_rate=0.001, beta_1=0.99, beta_2=0.99)`
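
To get an intuition of what `beta_1` and `beta_2` control, here is a NumPy sketch of one textbook Adam update, applied to the toy function $f(w) = w^2$. The function `adam_step` and its default values are assumptions for the illustration; the Keras implementation may differ in details such as how the epsilon term is handled.

```python
import numpy as np

def adam_step(w, grad, m, v, t, learning_rate=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One textbook Adam update; t is the step number, starting at 1."""
    m = beta_1 * m + (1 - beta_1) * grad          # running average of the gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2     # running average of the squared gradients
    m_hat = m / (1 - beta_1 ** t)                 # bias corrections for the zero initialization
    v_hat = v / (1 - beta_2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w**2 (gradient 2*w), starting from w = 1.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    w, m, v = adam_step(w, 2 * w, m, v, t)
    print(t, w)
```

`beta_1` sets how long past gradients are remembered, and `beta_2` does the same for their squared values, which rescale the step size of each parameter.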

### Question: Run the model with the adam optimizer and different values of the three above mentioned parameters. Look at the different accuracies.

In [None]:
accuracy = []

for lr in [0.001, 0.01, 0.1]:
    for beta_1 in [0.8, 0.9, 0.95, 0.99]:
        for beta_2 in [0.8, 0.9, 0.95, 0.99]:
            # TODO 

In [None]:
########### PLOT
### TODO 

### Try another optimizer in the documentation (https://keras.io/optimizers/)

In [None]:
### TODO 

# Now, let's change the architecture of the model ! 

### Write a new function to initialize the model, `init_3`, where you can change the number of layers.

The parameter of `init_3` is `latent_dims`, which is a list of integers: the length of `latent_dims` is the number of hidden layers, and each integer is the number of neurons in the corresponding layer.

Therefore, the Neural Network is made of 
- a first layer with `input_dim = 8`, whose output dimension is the first integer in `latent_dims`
- as many further layers as there are integers in `latent_dims`, minus 1, each with the related integer as output dimension
- a last layer whose input dimension is the last integer in `latent_dims` and whose output dimension is the number of classes in the dataset.

For example `latent_dims=[10, 3, 10]` means that the neural net is made of
- a layer of input dim 8, output dim 10
- a layer of input dim 10, output dim 3
- a layer of input dim 3, output dim 10
- a layer of input dim 10, output dim 7
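
The mapping from `latent_dims` to layer dimensions can be checked with a few lines of plain Python, independently of Keras. The helper names `layer_shapes` and `count_parameters` are hypothetical and only serve this illustration; the parameter count assumes fully connected layers with one bias per output neuron.

```python
def layer_shapes(latent_dims, input_dim=8, n_classes=7):
    """Return the (input dim, output dim) pair of each Dense layer."""
    dims = [input_dim] + list(latent_dims) + [n_classes]
    return list(zip(dims[:-1], dims[1:]))

def count_parameters(shapes):
    # A Dense layer has in*out weights plus one bias per output neuron.
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

shapes = layer_shapes([10, 3, 10])
print(shapes)
print(count_parameters(shapes))
```

Comparing this count with the output of `model.summary()` is a simple way to verify that a model was built as intended.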

You can use any loss and optimizer you want

In [None]:
def init_3(latent_dims):
    ### TODO 
    
    return model

### Question: Initialize a model with `latent_dims=[15, 6, 15]`, and run it

In [None]:
### TODO 

### Question: Test multiple architectures with only one hidden layer but different numbers of neurons. Look at their relative predictive power (i.e. their accuracies)

In [None]:
accuracy = []

for ld in [[8], [16], [32], [50], [100], [200], [400]]:
    #### TODO 

In [None]:
### PLOT
### TODO 

### Question: You are now set to try any architecture you want, feel free to add additional layers with different number of neurons

In [None]:
accuracy = []

for ld in [[10], [10,10], [10,10,10], [25,10,25], [30,30], [10,10,10,10], [50,50]]:
    ### TODO 

In [None]:
########## PLOT
### TODO 

### Question: Now, for a given architecture, remake the dataset with an increasing number of samples. For each, look at the time to train the model.


In [None]:
accuracy = []
elapsed_time = []

for s in [100, 500, 1000, 2000, 5000]:
    ### Create data
    ### TODO 

    ### Init the model, run it and store the results
    ### TODO 
    

In [None]:
############### PLOT
### TODO 

### Optional : Go beyond by studying the effect of the dataset : 
- the number of features
- the number of categories
- the distance between the groups