# Multiclass classification

We've just solved a binary classification problem. What about a multiclass one?

### Exercise Objectives:
- Write a Neural Network for multiclass classification
- Observe overfitting during the model convergence

# 1. Create the data


The `make_blobs` function [(see documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) lets you draw:
- an arbitrary number of data samples, argument `n_samples`
- an arbitrary number of features per data sample, argument `n_features`
- an arbitrary number of categories, argument `centers`
- a spread (standard deviation) for each category, argument `cluster_std`

There is also the `random_state` argument, which makes the draw deterministic so that the same data can be reproduced. Two people who choose the same `random_state` will get the same data.
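As a minimal sketch of these arguments, the snippet below draws two small datasets with the same `random_state` and checks that they are identical. The sizes used here (10 samples, 3 features, 4 categories, `random_state=42`) are arbitrary illustration values, not the ones requested in the exercise.

```python
from sklearn.datasets import make_blobs

# Toy values chosen for illustration only (not the exercise settings)
X_a, y_a = make_blobs(n_samples=10, n_features=3, centers=4,
                      cluster_std=2.0, random_state=42)
X_b, y_b = make_blobs(n_samples=10, n_features=3, centers=4,
                      cluster_std=2.0, random_state=42)

print(X_a.shape, y_a.shape)   # one row per sample, one label per sample
print((X_a == X_b).all())     # same random_state, so same data
```

Changing `random_state` in one of the two calls should give different samples.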

❓ **Question** ❓ Based on the documentation, generate data with : 
- 1200 samples
- 8 features per sample
- 7 categories of data
- a `cluster_std` of 8

Select a `random_state` equal to 1.

Print the shapes and check that they are `(1200, 8)` for `X` and `(1200,)` for `y`.

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Using matplotlib, draw a scatter plot of two (arbitrary) dimensions of the input data. Each dot should be colored according to the category it belongs to.

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Repeat the operation on other dimensions, to check visually that the data are not easily separable.

In [0]:
# YOUR CODE HERE

For now, `y` is a list of integers, each corresponding to the category of the related input data.
It looks like `[3, 2, 2, 3, 0, 5, 1, 1, 0, 5, ...]` (in this example, we have 7 categories, from 0 to 6).

However, for a classification task in Keras, the **output should have a number of columns equal to the number of different categories**. Each row corresponds to one input sample and lists the probabilities that this sample belongs to each category. Here, the true category of each sample is known, so its probability is 1 and all the others are 0. The target should look like

```
[
[0, 0, 0, 1, 0, 0, 0], 
[0, 0, 1, 0, 0, 0, 0], 
[0, 0, 1, 0, 0, 0, 0], 
[0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0],
...
]
```

Each column corresponds to a category. Each row corresponds to a target, the 1 being the category the input data belongs to.

To transform `y` into this format, use the `to_categorical` function from Keras.
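To make the target format concrete, here is a NumPy-only sketch of the same one-hot encoding. The helper `one_hot` is a hypothetical stand-in written for illustration; it is not the Keras function itself (which, among other differences, returns floats by default).

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: one row per label, with a single 1 in the label's column."""
    labels = np.asarray(labels)
    # Row i of the identity matrix is the one-hot vector for category i
    return np.eye(num_classes, dtype=int)[labels]

y_example = [3, 2, 2, 3, 0]                     # first labels from the example above
y_example_cat = one_hot(y_example, num_classes=7)
print(y_example_cat)
```

Taking `y_example_cat.argmax(axis=1)` recovers the original integer labels.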


❓ **Question** ❓ First print `y`, then apply `to_categorical` to it, store the result in `y_cat`, and print `y_cat` to see the new structure.

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Split the dataset $X$ and $y_{cat}$ into a train set and a test set (70% / 30%)

Remark : Please call the variables `X_train`, `y_train`, `X_test` and `y_test`

In [0]:
# YOUR CODE HERE

In deep learning, the input data should generally be standard-scaled, so that each feature is centered on 0 with unit variance and most values lie _approximately_ in [-1, 1]. (We will see later why).
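The idea behind standard scaling can be sketched with plain NumPy on made-up data: the mean and standard deviation are computed on the train set only, then applied to both sets. The arrays `X_train_toy` and `X_test_toy` are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data, far from the [-1, 1] range on purpose
X_train_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_toy = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Statistics come from the train set only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# The same transformation is applied to both sets
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(np.mean(np.abs(X_train_scaled) <= 1))  # share of values inside [-1, 1]
```

The scaled test set is only approximately centered, since its own statistics were never used.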

❓ **Question** ❓ Fit a sklearn [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) on the train set and transform both your train and test set.

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Complete the following function to initialize a model that has 
- a first layer with 50 neurons (activation being `relu` and appropriate input dimension)
- an output layer designed for a multiclass classification task, which outputs a probability for each class

In [0]:
def initialize_model():
    ### Model architecture
    pass  # YOUR CODE HERE
    
    ### Model optimization : Optimizer, loss and metric 
    model.compile(loss='categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])
    
    return model 

### Note that the loss differs from the one used for binary classification: the task
### now has more than two categories, so the loss changes accordingly (more on this tomorrow)

model = initialize_model()

❓ **Question** ❓ How many parameters (a.k.a. weights) are there in the model? How many would a logistic regression have had on the same data?

In [0]:
# YOUR CODE HERE
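As a worked example of the counting rule, a fully connected layer has one weight per (input, neuron) pair plus one bias per neuron. The sketch below applies it to hypothetical sizes (4 features, 10 hidden neurons, 3 classes), which are deliberately not those of the exercise; `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return (n_inputs + 1) * n_units

# Hypothetical sizes, for illustration only
n_features, n_hidden, n_classes = 4, 10, 3

hidden_layer = dense_params(n_features, n_hidden)   # (4 + 1) * 10 = 50
output_layer = dense_params(n_hidden, n_classes)    # (10 + 1) * 3 = 33
network_total = hidden_layer + output_layer

# A multiclass logistic regression maps the features directly to the classes
logistic_total = dense_params(n_features, n_classes)  # (4 + 1) * 3 = 15

print(network_total, logistic_total)  # 83 15
```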

❓ **Question** ❓ Fit your model on the train data for 50 epochs and plot the history

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Evaluate your model on the test set and print the accuracy

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Is this a good score? You should compare it to some sort of benchmark value. In this case, what score would a random guess give? Store this baseline score in the `accuracy_baseline` variable.

In [0]:
# YOUR CODE HERE

In [0]:
from nbresult import ChallengeResult
result = ChallengeResult('baseline',
                         accuracy=accuracy_baseline)
result.write()
print(result.check())

❗ **Remark** ❗ Wait... If you take a closer look at the plot of the loss, it seems that the loss was still decreasing after 50 epochs. Why stop so soon? Let's re-initialize the model, train it again for 1000 epochs, and plot the history.

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ 
- What can you say about the new loss? 
- Evaluate your model once again on the test set and print the accuracy

In [0]:
# YOUR CODE HERE

❗ **Remark** ❗ The loss (computed on the train set) is smaller than with 50 epochs. However, the accuracy on the test set got worse than before...

❓ **Question** ❓ What is this phenomenon called?

> YOUR ANSWER HERE

❗ **Remark** ❗ Overfitting sets in at some point during the gradient descent iterations, once the accuracy starts getting worse on the test set. The fitting therefore needs to be stopped at that point.

Let's see when the test loss starts to increase in practice. (Yes, this is data leakage: in reality, we should create a separate validation set for that...)
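One simple way to locate that point, sketched here on made-up numbers, is to look for the epoch with the lowest test loss. The list `val_loss` below is invented for illustration; with a real training run it would be replaced by `history.history['val_loss']`.

```python
# Made-up test-loss curve: decreases, then rises again as the model overfits
val_loss = [1.20, 0.90, 0.75, 0.70, 0.72, 0.78, 0.85]

# Index of the smallest test loss (epochs are counted from 0 here)
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

print(best_epoch, val_loss[best_epoch])  # 3 0.7
```

On a noisy curve the minimum can be a fluke, so it is worth checking it against the plot.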

❓ **Question** ❓ Run the following command and plot the history

In [0]:
model = initialize_model()

history = model.fit(X_train, y_train, 
                    validation_data=(X_test, y_test), 
                    epochs=500, 
                    batch_size=16,
                    verbose=0)
plot_history(history)

❓ **Question** ❓ Plot the values of the loss and accuracy on the train set (in blue) and on the test set (in orange). What do you observe?

In [0]:
import matplotlib.pyplot as plt

def plot_loss_accuracy(history):
    # First figure: loss per epoch; second figure: accuracy per epoch
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='best')
    plt.show()
    
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model Accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Test'], loc='best')
    plt.show()

In [0]:
# YOUR CODE HERE

❓ **Question** ❓ Reproduce similar results by defining a more complex architecture that includes:

- a first layer with 25 neurons 
- a second layer with 15 neurons
- a third layer with 10 neurons
- a final layer that outputs a probability for each class



In [0]:
def initialize_model_2():
    pass  # YOUR CODE HERE


❗ **Remark** ❗ 
- We clearly see that overfitting can happen during training. More on this in our next lecture.
- The model overfits because its number of parameters is very large (compare the number of weights with a logistic regression on the same data).

**🏁 Congratulations! Commit and push your notebook**