_In this notebook, every question will be marked by a blue border, and answers should be provided in cells in a green border. All code-related answers are preceded by a #TODO._

## Students (to fill in)

 - Nguyen Y-Quynh (group A2)
 - Cossoul Lucile (group A2)

# Introduction

The objective of this lab is to dive into a particular kind of neural network: the *Multi-Layer Perceptron* (MLP).

To start, let us take the dataset from the previous lab (hydrodynamics of sailing boats) and use scikit-learn to train an MLP instead of our hand-made single perceptron.
The code below is already complete and is meant to give you an idea of how to construct an MLP with scikit-learn. You can execute it, taking the time to understand the idea behind each cell.

In [14]:
# Importing the dataset
import numpy as np
dataset = np.genfromtxt("https://arbimo.github.io/tp-supervised-learning/tp1/yacht_hydrodynamics.data", delimiter='')
X = dataset[:, :-1]
Y = dataset[:, -1]

In [15]:
# Preprocessing: scale input data 
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)

In [16]:
# Split dataset into training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y,random_state=1, test_size = 0.20)

In [17]:
# Define a multi-layer perceptron (MLP) network for regression
from sklearn.neural_network import MLPRegressor
mlp = MLPRegressor(max_iter=3000, random_state=1) # define the model, with default params
mlp.fit(x_train, y_train) # train the MLP



In [18]:
# Evaluate the model
from matplotlib import pyplot as plt

print('Train score: ', mlp.score(x_train, y_train))
print('Test score:  ', mlp.score(x_test, y_test))
plt.plot(mlp.loss_curve_)
plt.xlabel("Iterations")
plt.ylabel("Loss")


AttributeError: module 'matplotlib' has no attribute 'get_data_path'

In [None]:
# Plot the results
num_samples_to_plot = 20
plt.plot(y_test[0:num_samples_to_plot], 'ro', label='y')
yw = mlp.predict(x_test)
plt.plot(yw[0:num_samples_to_plot], 'bx', label=r'$\hat{y}$')  # raw string so that \h is not read as an escape sequence
plt.legend()
plt.xlabel("Examples")
plt.ylabel("f(examples)")

### Analyzing the network

Many details of the network are currently hidden as default parameters.

Using the [documentation of the MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html), answer the following questions.
<!-- Question Start -->
<div style="border: 1px solid blue; padding: 20px;border-radius: 5px;">
    
- What is the structure of the network?
- What is the algorithm used for training? Is there an algorithm available that we mentioned during the courses?
- How does the training algorithm decide to stop the training?
</div>
<!-- Question End -->

In [None]:
print('mlp.n_layers_ :',  mlp.n_layers_)
print('mlp.hidden_layer_sizes :',  mlp.hidden_layer_sizes)

<!-- Answer Section Start -->
<div style="border: 1px solid green; padding: 10px; margin-top: 10px; border-radius: 5px">

- What is the structure of the network?
The network is a multi-layer perceptron with 3 layers in total (`mlp.n_layers_` counts the input and output layers): an input layer, a single hidden layer of 100 neurons (the default `hidden_layer_sizes=(100,)`), and an output layer.

- What is the algorithm used for training? Is there an algorithm available that we mentioned during the courses?
The default solver is Adam, a stochastic gradient-based optimizer. Stochastic gradient descent, which was mentioned during the courses, is also available with `solver='sgd'`.

- How does the training algorithm decide to stop the training?
Several conditions can make the algorithm stop:
    - The algorithm stops after `max_iter` iterations (3000 here).
    - The algorithm stops if the loss or score does not improve by at least `tol` (1e-4 by default) for 10 consecutive iterations (`n_iter_no_change`).

In this example, the algorithm stopped at 1600 iterations because of the `tol` criterion.
    
    
    
</div><!-- Answer Section End -->
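
The defaults discussed above can be read directly from a fitted model. The cell below is a minimal sketch on synthetic data (the `make_regression` dataset and the `demo_` names are assumptions used only to keep the example self-contained, they are not the yacht data); it prints the attributes that describe the structure, the solver and the stopping criteria.

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the yacht data (assumption: 6 input features).
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=0.1, random_state=0)

# Same call as in the lab, so every other hyperparameter keeps its default value.
demo_mlp = MLPRegressor(max_iter=3000, random_state=1)
demo_mlp.fit(X_demo, y_demo)

print("layers (input + hidden + output):", demo_mlp.n_layers_)
print("hidden layer sizes:", demo_mlp.hidden_layer_sizes)
print("weight matrix shapes:", [w.shape for w in demo_mlp.coefs_])
print("activation / solver:", demo_mlp.activation, "/", demo_mlp.solver)
print("tol / n_iter_no_change:", demo_mlp.tol, "/", demo_mlp.n_iter_no_change)
print("iterations actually run:", demo_mlp.n_iter_, "out of max_iter =", demo_mlp.max_iter)
```

Comparing `n_iter_` with `max_iter` is one way to tell which stopping condition was reached: if `n_iter_` is smaller, training stopped on the `tol` criterion.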

# Onto a more challenging dataset: house prices

For the rest of this lab, we will use the (more challenging) [California Housing Prices dataset](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

In [None]:
# clean all previously defined variables for the sailing boats
%reset -f

In [None]:
"""Import the required modules"""
from sklearn.datasets import fetch_california_housing
from sklearn.utils import shuffle
import pandas as pd

cal_housing = fetch_california_housing()
print(f"dataset type : {type(cal_housing)}")
print(f"number of data : {len(cal_housing.data)}")
X_all = pd.DataFrame(cal_housing.data,columns=cal_housing.feature_names)
y_all = pd.DataFrame(cal_housing.target,columns=["target"])

X_all, y_all = shuffle(X_all, y_all, random_state=1)

display(X_all.head(10)) # print the first 10 values
display(y_all.head(10))

Note that each row of the dataset represents a **group of houses** (one district). The `target` variable denotes the median house value in units of 100,000 USD. Median income is expressed in units of 10,000 USD.

### Data Preparation

The dataset consists of roughly 20,000 samples. We first set aside the last 5,000 as test samples, which we will use later.

For training and validation, we will use a subset of only 2,000 samples to speed up computations.

<!-- Question Start -->
<div style="border: 1px solid blue; padding: 20px;border-radius: 5px;">
    
- Split these 2,000 samples between a training set and a validation set (see usage of `train_test_split` function earlier)
- Why did you choose this partition?
- What is the purpose of each subset (train, validation, test) ?

</div>
<!-- Question End -->


Please use the conventional names `X_train`, `X_val`, `y_train` and `y_val`.

In [None]:
# use the last N samples for test (for later use)
num_test_samples = 5000
X_test, y_test = X_all[-num_test_samples:], y_all[-num_test_samples:]

# only use the first N samples to limit training time
num_samples = 2000
X, y = X_all[:num_samples], y_all[:num_samples]

In [None]:
# TODO 
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,random_state=1, test_size = 0.20)

#x_train size
print(f"x_train size : {len(X_train)}")
#x_val size
print(f"x_val size : {len(X_val)}")


<!-- Answer Section Start -->
<div style="border: 1px solid green; padding: 10px; margin-top: 10px; border-radius: 5px">

**Your answer here:**

- Why did you choose this partition?
A common convention for train-validation-test splits is 60-80% training data, 10-20% validation data, and 10-20% test data.
Out of the roughly 20,000 samples, we use 7,000 for this part of the lab: 5,000 for testing and 2,000 for training and validation.
Out of those 2,000 samples, we chose to use 80% for training and 20% for validation. (The training and validation pool could have been the 15,000 samples that remain once the test set is set aside, but we only use 2,000 to speed up computations.)


- What is the purpose of each subset (train, validation, test)?
To detect overfitting to one particular set of samples, we split the data in 3: train, validation and test.
    - The training set is used to fit the parameters (weights) of the model.
    - The validation set is used to evaluate the model after each training run; it guides the choice of hyperparameters.
    - The test set is used only once the final model is chosen; since it played no role in training or model selection, it gives a less biased estimate of how accurate the model is on unseen data.


</div>
<!-- Answer Section End -->
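
As an illustration of the three roles described above, a 60/20/20 partition can be obtained with two successive calls to `train_test_split`. This is only a sketch on random toy data (the array sizes and the `_toy`, `_tr`, `_va`, `_te` names are assumptions); the lab itself takes the test set by slicing and only splits train and validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the housing samples (assumption: 1000 rows, 8 features).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 8))
y_toy = rng.normal(size=1000)

# First hold out the test set, then split the remainder into train and validation.
X_rest, X_te, y_rest, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_tr), len(X_va), len(X_te))  # 600 200 200
```

The second call uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full data.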

### Scaling the input data


A **scaling** step is often useful to ensure that all input features are centered on 0 and have a fixed variance.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance). The function `StandardScaler` from `sklearn.preprocessing` computes the standard score of a sample as:

```
z = (x - u) / s
```

where `u` is the mean of the training samples, and `s` is the standard deviation of the training samples.
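
To make the formula concrete, the short sketch below computes the standard score by hand on a tiny made-up array (`X_small` is an assumption for illustration) and checks that it matches what `StandardScaler` returns. Note that the standard deviation used here is the population one (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_small = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

u = X_small.mean(axis=0)  # per-feature mean
s = X_small.std(axis=0)   # per-feature standard deviation (population, ddof=0)
z_manual = (X_small - u) / s

z_sklearn = StandardScaler().fit_transform(X_small)
print(np.allclose(z_manual, z_sklearn))  # True
```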

<!-- Question Start -->
<div style="border: 1px solid blue; padding: 20px;border-radius: 5px;">
    
- Using the `StandardScaler`, first fit this scaler on your training dataset (`X_train`), then use this fitted scaler to transform the training dataset, the validation dataset (`X_val`), and the test dataset (`X_test`).


- Why is it important to fit the scaler only on the training data and not on the entire dataset or separately on each dataset?

</div>
<!-- Question End -->

[Documentation of standard scaler in scikit learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)



In [None]:
# TODO
print("Initial train data:\n", X_train)



# fit the standard scaler on the training inputs X_train
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
fitted_scaler = sc.fit(X_train)

# transform X_train, X_val and X_test
std_X_train = fitted_scaler.transform(X_train)
std_X_val = fitted_scaler.transform(X_val)
std_X_test = fitted_scaler.transform(X_test)
print("Standardized data x_train:\n", std_X_train)
print("Standardized val data x_val:\n", std_X_val)
print("Standardized test data x_test:\n", std_X_test)


<!-- Answer Section Start -->
<div style="border: 1px solid green; padding: 10px; margin-top: 10px; border-radius: 5px">

**Your answer here:**

- Why is it important to fit the scaler only on the training data and not on the entire dataset or separately on each dataset?

We use the scaler to put all features on a comparable scale, so that features with large numeric ranges do not dominate the others.

We fit the scaler only on the training data because fitting is what computes the mean and variance used for scaling. Fitting on the whole dataset would leak information about the distribution of the validation and test sets into training and give optimistically biased estimates of the model's performance. Fitting separately on each dataset would apply a different transformation to each set, so the same raw value would be mapped to different inputs for the network.
We therefore only call `transform()` on the validation and test data, reusing the scaling parameters learned on the training data.

</div>
<!-- Answer Section End -->
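
The point made in the answer above can be checked numerically. In this sketch (random toy data; the shapes and the `_tr`/`_va` names are assumptions), the scaler is fitted on the training part only: the training features come out with mean 0 and unit standard deviation, while the validation features are only approximately centered because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(800, 3))
X_va = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

scaler = StandardScaler().fit(X_tr)  # statistics come from the training data only
X_tr_std = scaler.transform(X_tr)
X_va_std = scaler.transform(X_va)    # reuses the training mean and std

print("train mean:", X_tr_std.mean(axis=0).round(3))
print("val mean:  ", X_va_std.mean(axis=0).round(3))
```

A validation mean that is close to, but not exactly, zero is expected and is not a sign of a mistake.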

## Overfitting

In this part, we are only interested in maximizing the **train score**, i.e., having the network memorize the training examples as well as possible. While doing this, you should (1) remain within two minutes of training time, and (2) obtain a score that is greater than 0.90.

<!-- Question Start -->
<div style="border: 1px solid blue; padding: 20px;border-radius: 5px;">
    
- Propose a parameterization of the network (number of neurons per layer, number of layers, epochs, learning rates) that will maximize the train score (without considering the test score). Ensure that you disable any form of internal validation checks such as early stopping to promote overfitting.

- Is the **validation** score substantially smaller than the **train** score (indicator of overfitting) ?
- Explain how the parameters you chose allow the learned model to overfit.
</div>
<!-- Question End -->

In [None]:
# TODO

# Define a multi-layer perceptron (MLP) network for regression
from sklearn.neural_network import MLPRegressor
import numpy as np

mlp = MLPRegressor(hidden_layer_sizes = (200,200,200,), max_iter=4000, random_state=1, learning_rate_init=0.001, activation='tanh') # define the model; early_stopping keeps its default value (False)
mlp.fit(std_X_train, np.ravel(y_train)) # train the MLP
print('Training score: ', mlp.score(std_X_train, np.ravel(y_train)))

In [None]:
print('Validation score: ', mlp.score(std_X_val, np.ravel(y_val)))


print('Test score: ', mlp.score(std_X_test, np.ravel(y_test)))

<!-- Answer Section Start -->
<div style="border: 1px solid green; padding: 10px; margin-top: 10px; border-radius: 5px">

**Your answer here:**

- Is the **validation** score substantially smaller than the **train** score (indicator of overfitting) ?
The validation score is considerably smaller (training: 0.99, validation: 0.67). We are indeed overfitting the training data: when the model is evaluated on another dataset (the validation one), the score is much lower.

- Explain how the parameters you chose allow the learned model to overfit.
In this example, changing the learning rate (between 0.1 and 0.00001) did not change the score much, and neither did the number of iterations (between 2000 and 5000).
The parameters that allowed our model to overfit were the activation function and the number of neurons in each layer (more than the number of layers itself).
    - We doubled the size of each layer and added 2 more layers compared with the default network. A wide layer has more redundancy, so it is more likely to converge to a model that fits the training data closely, and more parameters give the network enough capacity to memorize the training examples. Adding many more hidden layers than necessary lowers the accuracy on unseen data because of this overfitting.
    - We changed the activation function from ReLU (rectified linear unit) to tanh (hyperbolic tangent). ReLU is piecewise linear while tanh is smooth and S-shaped; in our experiments tanh fit the training data more closely. Tanh is slower, but on our reduced dataset it works well for overfitting.





</div>
<!-- Answer Section End -->
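
The train/validation gap used above as an overfitting indicator can be observed on a toy problem as well. The sketch below is not the lab's experiment: the noisy sine data, the two network sizes and all variable names are assumptions chosen to keep the run short. It trains a small and a larger network and prints both scores for each, so that the gaps can be compared.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(150, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * rng.normal(size=150)  # signal plus noise

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

scores = {}
for name, layers in {"small": (5,), "large": (100, 100)}.items():
    model = MLPRegressor(hidden_layer_sizes=layers, activation="tanh",
                         max_iter=2000, early_stopping=False, random_state=1)
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_va, y_va))
    print(f"{name}: train={scores[name][0]:.2f}  val={scores[name][1]:.2f}")
```

A larger network usually reaches a higher train score, but the exact numbers depend on the data and the random seed, so the printed values should be read rather than assumed.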

## Hyperparameter tuning

In this section, we are now interested in maximizing the ability of the network to predict the value of unseen examples, i.e., maximizing the **validation** score.
You should experiment with the possible parameters of the network in order to obtain a good test score, ideally with a small learning time.

Parameters to vary:

- number and size of the hidden layers
- activation function
- stopping conditions
- maximum number of iterations
- initial learning rate value

Results to present for the tested configurations:

- Train/val score
- training time

<!-- Question Start -->
<div style="border: 1px solid blue; padding: 20px;border-radius: 5px;">
Present in a table the various parameters tested and the associated results. 
</div>
<!-- Question End -->

You can find in the notebook a cell with a code snippet that lets you display tables from Python data structures.
Be methodical in the way you run your experiments and collect data. For each run, you should record the parameters and results into an external data structure.
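
One possible way to follow this advice is to loop over a list of configurations and append one dictionary per run. The sketch below is a minimal version of that idea on synthetic data (the dataset, the three configurations and the variable names are all assumptions for illustration); it records the parameters, the train and validation scores and the training time, then sorts the table by validation score.

```python
import time
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)

configs = [
    {"hidden_layer_sizes": (20,), "activation": "relu", "learning_rate_init": 0.01},
    {"hidden_layer_sizes": (20, 20), "activation": "relu", "learning_rate_init": 0.01},
    {"hidden_layer_sizes": (20,), "activation": "tanh", "learning_rate_init": 0.01},
]

results = []
for params in configs:
    model = MLPRegressor(max_iter=500, random_state=1, **params)
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start  # training time in seconds
    results.append({**params,
                    "train_score": model.score(X_tr, y_tr),
                    "val_score": model.score(X_va, y_va),
                    "time": elapsed})

table = pd.DataFrame(results).sort_values(by="val_score", ascending=False)
print(table)
```

Fixing `random_state` keeps runs comparable, since otherwise two runs of the same configuration can give different scores.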

(Note that, while we encourage you to explore the solution space manually, there are existing methods in scikit-learn and other learning frameworks to automate this step as well, e.g., [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html))
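
For reference, an automated search with `GridSearchCV` could look like the sketch below. It is not the procedure used in this lab: it runs on synthetic data, the parameter grid is a small assumed example, and `GridSearchCV` scores each configuration by cross-validation on the data it is given rather than on a single fixed validation set.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Putting the scaler in the pipeline refits it on each training fold only.
pipe = make_pipeline(StandardScaler(), MLPRegressor(max_iter=500, random_state=1))

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "mlpregressor__hidden_layer_sizes": [(20,), (20, 20)],
    "mlpregressor__learning_rate_init": [0.001, 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=3)  # default scoring: the estimator's R^2
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best mean CV score:", round(search.best_score_, 3))
```

The full record of every tested configuration is available in `search.cv_results_`, which can be passed to `pd.DataFrame` to build a results table.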

In [None]:
# TODO
from sklearn.neural_network import MLPRegressor
import numpy as np

mlp = MLPRegressor(activation='relu', hidden_layer_sizes = (300,300,), max_iter=5000, learning_rate_init=0.0001, early_stopping=False)
mlp.fit(std_X_train, np.ravel(y_train)) # train the MLP
print('Training score: ', mlp.score(std_X_train, np.ravel(y_train)))
print('Validation score: ', mlp.score(std_X_val, np.ravel(y_val)))



In [None]:
# Code snippet to display a nice table in jupyter notebooks  (remove from report)
import pandas as pd
import numpy as np
data = []

data.append({'activation': 'relu', 'nb layers' : 3, 'size' : 100, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.83, 'val_score': 0.74, 'time' : 5.6})
data.append({'activation': 'relu', 'nb layers' : 5, 'size' : 100, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.97, 'val_score': 0.67, 'time' : 11.6})
data.append({'activation': 'relu', 'nb layers' : 4, 'size' : 100, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.83, 'val_score': 0.72, 'time' : 4.1})
data.append({'activation': 'relu', 'nb layers' : 3, 'size' : 200, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.85, 'val_score': 0.76,  'time' : 8.0})
data.append({'activation': 'relu', 'nb layers' : 3, 'size' : 200, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': True, 'train_score': 0.73, 'val_score': 0.71,  'time' : 6.3})
data.append({'activation': 'relu', 'nb layers' : 3, 'size' : 200, 'max_iter': 3000, 'learning rate' : 0.0001, 'early_stopping': False, 'train_score': 0.75, 'val_score': 0.73,  'time' : 12.2})
data.append({'activation': 'tanh', 'nb layers' : 3, 'size' : 200, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.78, 'val_score': 0.74,  'time' : 11.3})
data.append({'activation': 'tanh', 'nb layers' : 3, 'size' : 100, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.77, 'val_score': 0.75,  'time' : 10.3})
data.append({'activation': 'relu', 'nb layers' : 3, 'size' : 300, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.84, 'val_score': 0.73,  'time' : 7.7})
data.append({'activation': 'relu', 'nb layers' : 3, 'size' : 300, 'max_iter': 3000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.84, 'val_score': 0.73,  'time' : 7.7})
data.append({'activation': 'relu', 'nb layers' : 4, 'size' : 200, 'max_iter': 5000, 'learning rate' : 0.001, 'early_stopping': False, 'train_score': 0.91, 'val_score': 0.74,  'time' : 8.8})
data.append({'activation': 'relu', 'nb layers' : 4, 'size' : 300, 'max_iter': 5000, 'learning rate' : 0.0001, 'early_stopping': False, 'train_score': 0.90, 'val_score': 0.76,  'time' : 43.3})


table = pd.DataFrame.from_dict(data)
table = table.replace(np.nan, '-')
table = table.sort_values(by='val_score', ascending=False)
table

| | activation | nb layers | size | max_iter | learning rate | early_stopping | train_score | val_score | time |
|---|---|---|---|---|---|---|---|---|---|
| 3 | relu | 3 | 200 | 3000 | 0.001 | False | 0.85 | 0.76 | 8.0 |
| 11 | relu | 4 | 300 | 5000 | 0.0001 | False | 0.9 | 0.76 | 43.3 |
| 7 | tanh | 3 | 100 | 3000 | 0.001 | False | 0.77 | 0.75 | 10.3 |
| 0 | relu | 3 | 100 | 3000 | 0.001 | False | 0.83 | 0.74 | 5.6 |
| 6 | tanh | 3 | 200 | 3000 | 0.001 | False | 0.78 | 0.74 | 11.3 |
| 10 | relu | 4 | 200 | 5000 | 0.001 | False | 0.91 | 0.74 | 8.8 |
| 5 | relu | 3 | 200 | 3000 | 0.0001 | False | 0.75 | 0.73 | 12.2 |
| 8 | relu | 3 | 300 | 3000 | 0.001 | False | 0.84 | 0.73 | 7.7 |
| 9 | relu | 3 | 300 | 3000 | 0.001 | False | 0.84 | 0.73 | 7.7 |
| 2 | relu | 4 | 100 | 3000 | 0.001 | False | 0.83 | 0.72 | 4.1 |


## Evaluation
<!-- Question Start -->
<div style="border: 1px solid blue; padding: 20px;border-radius: 5px;">
    
- From your experiments, what seems to be the best model (i.e. set of parameters) for predicting the value of a house?
- Evaluate the score of your model on the test set that was not used for training nor for model selection.
- Train a model using your optimal parameters on the initial 15,000 data points. Evaluate the performance using the test set. What are your thoughts on the amount of data used? Do you believe the time spent is worthwhile in terms of the improvement in performance?
</div>
<!-- Question End -->

In [None]:
#TODO
#Best model : (76% on validation)
from sklearn.neural_network import MLPRegressor
import numpy as np
mlp = MLPRegressor(activation='relu', hidden_layer_sizes = (200,), max_iter=3000, learning_rate_init=0.001, early_stopping=False)
mlp.fit(std_X_train, np.ravel(y_train)) # train the MLP
print('Training score: ', mlp.score(std_X_train, np.ravel(y_train)))
print('Test score: ', mlp.score(std_X_test, np.ravel(y_test)))


In [None]:
# new training set from the first 15000 samples (the last 5000 are kept for test)
num_test_samples = 5000
X_test, y_test = X_all[-num_test_samples:], y_all[-num_test_samples:]

num_train_samples = 15000
X_train, y_train = X_all[:num_train_samples], y_all[:num_train_samples]


# fit the standard scaler on the training inputs X_train
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
fitted_scaler = sc.fit(X_train)

# transform X_train and X_test
std_X_train = fitted_scaler.transform(X_train)
std_X_test = fitted_scaler.transform(X_test)
print("Standardized data x_train:\n", std_X_train)
print("Standardized test data x_test:\n", std_X_test)

#Best model : with 15000 data
from sklearn.neural_network import MLPRegressor
import numpy as np
mlp = MLPRegressor(activation='relu', hidden_layer_sizes = (200,), max_iter=3000, learning_rate_init=0.001, early_stopping=False)
mlp.fit(std_X_train, np.ravel(y_train)) # train the MLP
print('Training score: ', mlp.score(std_X_train, np.ravel(y_train)))
print('Test score: ', mlp.score(std_X_test, np.ravel(y_test)))

NameError: name 'X_all' is not defined

<!-- Answer Section Start -->
<div style="border: 1px solid green; padding: 10px; margin-top: 10px; border-radius: 5px">

**Your answer here:**

- From your experiments, what seems to be the best model (i.e. set of parameters) for predicting the value of a house?

From our experiments, we found that the best model used:
    activation function: rectified linear unit (ReLU)
    number of layers: 3 (input, one hidden layer, output, i.e. `hidden_layer_sizes=(200,)`)
    size of each hidden layer: 200
    max number of iterations: 3000
    learning rate: 0.001
    early stopping: no


With these parameters, we obtained a score of 0.85 on the training set and 0.76 on the validation set, in a time of 8 seconds.


Another good contender used:
    activation function: rectified linear unit (ReLU)
    number of layers: 4 (input, two hidden layers, output, i.e. `hidden_layer_sizes=(300,300,)`)
    size of each hidden layer: 300
    max number of iterations: 5000
    learning rate: 0.0001
    early stopping: no


With these parameters, we obtained a score of 0.90 on the training set and 0.76 on the validation set, but in a time of 43 seconds. This is too slow, considering that we are using a small subset: with all of the data, training would take too long.
    


- Train a model using your optimal parameters on the initial 15,000 data points. Evaluate the performance using the test set. What are your thoughts on the amount of data used? Do you believe the time spent is worthwhile in terms of the improvement in performance?


</div>
<!-- Answer Section End -->