<a href="https://colab.research.google.com/github/FatimaEzzedinee/ML-bachelor-course-labs-sp24/blob/main/03_model_performance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning SP 2023/2024

- Prof. Cesare Alippi
- Alvise Dei Rossi ([`alvise.dei.rossi@usi.ch`](mailto:alvise.dei.rossi@usi.ch))<br>
- Fatima Ezzeddine ([`fatima.ezzeddine@usi.ch`](mailto:fatima.ezzeddine@usi.ch))<br>
- Alessandro Manenti ([`alessandro.manenti@usi.ch`](mailto:alessandro.manenti@usi.ch))

---

# Lab 03: Model Performance

The objectives of the lab are as follows:

- Evaluate the performance of a model
- Perform splits on the data
- Assess which model is the best among many models
- Gain familiarity with the concept of model hyper-parameter tuning

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# --- Auxiliary code -------------------- #

# function to plot decision boundaries
def plot_decision_surface(model, x, y, transform=lambda x:x, title=""):

  from matplotlib.colors import ListedColormap
  # color_maps
  cm = plt.cm.RdBu
  cols = ['#FF0000', '#0000FF']
  cm_bright = ListedColormap(cols)

  #init figure
  fig = plt.figure()

  # Create mesh
  h = .1  # step size in the mesh
  x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
  y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                       np.arange(y_min, y_max, h))

  # plot train data
  cy = [cols[int(yi)] for yi in y] # list of color for every observation based on its class
  plt.scatter(x[:, 0], x[:, 1], c=cy, cmap=cm_bright,
              edgecolors='k')
  plt.xlim(xx.min(), xx.max())
  plt.ylim(yy.min(), yy.max())

  plt.xlabel(r'$x_1$')
  plt.ylabel(r'$x_2$')
  plt.title(title)

  y_pred = model.predict(transform(np.c_[xx.ravel(), yy.ravel()])) # predict for every point in the mesh
  # note that we predict the class directly, we don't use predict_proba

  y_pred = y_pred.reshape(xx.shape)
  plt.contourf(xx, yy, y_pred > 0.5, cmap=cm, alpha=.5)

## Use-case scenario
Consider the following use-case scenario: A company has approached us with a request to develop a machine learning model for one of their machines. We have been provided with a labeled dataset consisting of $(x_i, y_i)$ for $i=1, ..., N$.

Our objective is to identify the best possible model, denoted as $f(x; \hat \theta)$, and provide an estimate of its performance, which is represented as $V(\hat \theta)$.

In [None]:
# Prepare some data
N = 400

np.random.seed(42)
# generating data points by sampling from a random distribution and then changing the means of the distributions by adding constants to the values
Xa = np.random.randn(N//4, 2) # second cluster
Xb = np.random.randn(N//4, 2) + np.array([ 8.,  1.]) # fourth cluster
Xc = np.random.randn(N//4, 2) + np.array([-4., -1.]) # first cluster
Xd = np.random.randn(N//4, 2) + np.array([ 4., -1.]) # third cluster

#stacking the features together
X = np.vstack([Xa, Xb, Xc, Xd])

plt.scatter(X[:, 0], X[:, 1]);

We now assign the labels so that the two classes have equal size: the first half of the elements in `y` (clusters 2 and 4) is class 0 and the second half (clusters 1 and 3) is class 1.

In [None]:
# Assigning labels to our dataset

# Creates a NumPy array y of size N with all elements initialized to zero.
y = np.zeros((N,)) # First half of the elements in y will be class zero. (clusters 2 and 4)

# Sets the second half of the elements in y to one. (clusters 1 and 3)
y[N//2:] = 1

# plot the data
plt.scatter(X[:N//2, 0], X[:N//2, 1], c="red", label="class 0")
plt.scatter(X[N//2:, 0], X[N//2:, 1], c="blue", label="class 1")
plt.legend()
plt.show()

In [None]:
X.shape, y.shape

## Train some models

Let's start from a logistic regression ([sklearn doc](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)).


We will try 2 types of models:
- Logistic regression as the linear model
- Feed Forward Neural Network as the non-linear model

For this lab, since we want to focus on the evaluation of model performance, we're going to use the sklearn implementation of the neural network. In the next labs, as we explore more complex architectures, we're going to use PyTorch instead.

In [None]:
from sklearn.linear_model import LogisticRegression

#initiate the class
logreg = LogisticRegression()
#feed it with X and y
logreg.fit(X, y)
#plot the decision surface
plot_decision_surface(model=logreg, x=X, y=y)

Let's try a feed-forward neural net.

The sklearn implementation is quite simple and abstracts away most of the complexity of neural networks. You don't need to specify several components, such as the loss function, that you would need in a more complex (and complete) framework like PyTorch.

Binary cross entropy is a common loss function used in machine learning, particularly for binary classification tasks where the goal is to predict one of two possible outcomes. It measures the difference between the predicted probabilities and the actual binary labels of a given dataset.

When training neural networks for binary classification, we take the loss to be the __cross-entropy error function__:

$$
L({\boldsymbol \theta}) =  -\frac1n \sum_{i=1}^n \bigg[y_i  \log \hat y_i + (1 - y_i)  \log (1 - \hat y_i)\bigg]
$$
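
As a minimal sketch of the formula above, the loss can be computed directly with NumPy and compared against `sklearn.metrics.log_loss`. The helper `binary_cross_entropy`, the clipping constant `eps` and the toy labels and probabilities are assumptions introduced here for illustration only.

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(y_true, y_hat, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # clip the probabilities to avoid log(0); eps is an illustrative choice
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# hypothetical labels and predicted probabilities of class 1
y_toy = np.array([0, 0, 1, 1])
p_toy = np.array([0.1, 0.4, 0.35, 0.8])

bce_manual = binary_cross_entropy(y_toy, p_toy)
bce_sklearn = log_loss(y_toy, p_toy)  # same quantity computed by sklearn
print(bce_manual, bce_sklearn)
```

Confident correct predictions contribute little to the loss, while the third sample (true class 1, predicted probability 0.35) contributes the most.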

In [None]:
from sklearn.neural_network import MLPClassifier

nn_kwargs= {"hidden_layer_sizes":(150,), # number of neurons
            "activation":"relu", # non-linear activation function (through all the network)
            "max_iter":250, # epochs
            "solver":"adam"} # optimizer

custom_nn_kwargs = {} # change this to customize your neural net!
# entries in custom_nn_kwargs override the defaults in nn_kwargs
ffnn = MLPClassifier(**{**nn_kwargs, **custom_nn_kwargs})
ffnn.fit(X, y)

plot_decision_surface(model=ffnn, x=X, y=y)

**Task** : Play around with the hyperparameters of your neural network!

Check the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) and modify the `custom_nn_kwargs` dictionary with your preferences.

## Performance assessment

In [None]:
# Set sizes
n = int(N * .8)  # Training points, 80%
l = N - n        # Test points, the remaining 20%
print("num training observations: n=",  n)
print("num test observations:     l= ", l)

# Data split
X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]

In [None]:
# Train the two models
# logistic regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
# neural network
ffnn = MLPClassifier(**nn_kwargs)
ffnn.fit(X_train, y_train)

# Accuracy: rate of correct classifications:
# Logistic regression
correct_classif = (logreg.predict(X_test) == y_test).astype(int)
print("LR acc   :", np.mean(correct_classif))
# Neural network
y_pred = ffnn.predict(X_test)
correct_classif = (y_pred == y_test).astype(int)
print("NN acc   :", np.mean(correct_classif))

# Plot boundaries
plot_decision_surface(model=logreg, x=X_test, y=y_test, title="Logistic Regression")
plot_decision_surface(model=ffnn,   x=X_test, y=y_test, title="Feed forward Neural Network")

#### What's wrong?

Let's look at the confusion matrix.

A confusion matrix is a table that summarizes the performance of a classification model by comparing the predicted labels with the actual labels. The confusion matrix shows the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each class.

<div style="text-align:center;">
    <img src="https://assets-global.website-files.com/6266b596eef18c1931f938f9/644aea65cefe35380f198a5a_class_guide_cm08.png" width="40%">
</div>


1. True Negative (TN) – the number of negative samples that the model correctly predicts as negative
2. False Positive (FP) – the number of negative samples that the model wrongly predicts as positive
3. False Negative (FN) – the number of positive samples that the model wrongly predicts as negative
4. True Positive (TP) – the number of positive samples that the model correctly predicts as positive
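
The four counts can be obtained by hand with boolean masks. The sketch below uses made-up label vectors (`y_true_toy`, `y_pred_toy`) and checks the result against `sklearn.metrics.confusion_matrix`, whose rows correspond to true labels and columns to predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# count each outcome with boolean masks
tn = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))
tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))

# for binary labels, ravel() returns the entries in the order tn, fp, fn, tp
tn_sk, fp_sk, fn_sk, tp_sk = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(tn, fp, fn, tp)
```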

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay

# the NN Confusion matrix
print(f'accuracy score: {accuracy_score(y_test, y_pred)}')
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

In [None]:
# Plot split data
plt.subplot(121)
plt.scatter(X_train[:, 0], X_train[:, 1], c=np.where(y_train==1, "blue", "red"))
plt.title("Training set")
plt.xlim([min(X[:,0])-0.5, max(X[:,0])+0.5])
plt.ylim([min(X[:,1])-0.5, max(X[:,1])+0.5])

plt.subplot(122)
plt.scatter(X_test[:, 0], X_test[:, 1], c=np.where(y_test==1, "blue", "red"))
plt.title("Test set")
plt.xlim([min(X[:,0])-0.5, max(X[:,0])+0.5])
plt.ylim([min(X[:,1])-0.5, max(X[:,1])+0.5])
plt.show()

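We did not shuffle the data. The samples are ordered by cluster, so the last 20% (our test set) contains only class 1 points from a single cluster, while that same cluster is under-represented in the training set.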
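
Counting the labels in each part of an ordered 80/20 split makes the problem visible. The sketch rebuilds a label vector with the same layout as above; the names with the `_ordered` suffix are introduced here only for illustration.

```python
import numpy as np

N_ordered = 400
y_ordered = np.zeros(N_ordered)
y_ordered[N_ordered // 2:] = 1       # second half is class 1, as in the lab data
n_ordered = int(N_ordered * 0.8)     # first 80% used for training, no shuffling

train_classes, train_counts = np.unique(y_ordered[:n_ordered], return_counts=True)
test_classes, test_counts = np.unique(y_ordered[n_ordered:], return_counts=True)

print("train:", dict(zip(train_classes, train_counts)))
print("test: ", dict(zip(test_classes, test_counts)))
```

The test part contains a single class, so accuracy on it says nothing about how well class 0 is recognized.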

In [None]:
# Shuffle the data!
# take a random permutation of the indices 0, ..., N-1
p = np.random.permutation(N)
idx_train = p[:n] # split is still 80/20
idx_test = p[n:]

# Data split
X_train, y_train = X[idx_train], y[idx_train]
X_test, y_test = X[idx_test], y[idx_test]

# Plot split data
plt.subplot(121)
plt.scatter(X_train[:, 0], X_train[:, 1], c=np.where(y_train==1, "blue", "red"))
plt.title("Training set")
plt.subplot(122)
plt.scatter(X_test[:, 0], X_test[:, 1], c=np.where(y_test==1, "blue", "red"))
plt.title("Test set")
plt.show()


Scikit-learn provides many [utilities](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to split the data, for example `train_test_split`, `GroupShuffleSplit` and `StratifiedShuffleSplit`.
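
As an illustration of one of these utilities, the sketch below uses `StratifiedShuffleSplit` on a small imbalanced toy dataset (`X_imb`, `y_imb` are made up for this example) and checks that the class proportions are preserved in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X_imb = rng.normal(size=(100, 2))
y_imb = np.array([0] * 70 + [1] * 30)   # 70% class 0, 30% class 1

# one shuffled split with 20% of the data in the test part
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X_imb, y_imb))

print("sizes:", len(train_idx), len(test_idx))
print("fraction of class 1:", y_imb[train_idx].mean(), y_imb[test_idx].mean())
```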


Try to use the first one, reading [its documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), to split the data. Reminder: we want the test set to be 20% of the data available. Then retrain the models using the training set and assess their performance in terms of accuracy on the test set (you may want to use directly the [appropriate sklearn function](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to do so).


In [None]:
from sklearn.model_selection import train_test_split

... # split the data

... # instantiate your logistic regression model

... # fit it on the training set

... # get the predictions on the test set

... # compute the accuracy

# do it again for the feed-forward neural network

##### Solution:

In [None]:
from sklearn.model_selection import train_test_split
idx_train, idx_test = train_test_split(np.arange(N), test_size=0.2, shuffle=True, random_state=42)
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Data split
X_train, y_train = X[idx_train], y[idx_train]
X_test, y_test = X[idx_test], y[idx_test]

# Plot split data
plt.subplot(121)
plt.scatter(X_train[:, 0], X_train[:, 1], c=np.where(y_train==1, "blue", "red"))
plt.title("Training set")
plt.subplot(122)
plt.scatter(X_test[:, 0], X_test[:, 1], c=np.where(y_test==1, "blue", "red"))
plt.title("Test set")
plt.show()

In [None]:
# Train the two models
#logistic
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
#nn
ffnn = MLPClassifier(**nn_kwargs)
ffnn.fit(X_train, y_train)
y_pred = ffnn.predict(X_test).astype(int)  # predicted class labels (0 or 1)

# Evaluate accuracy
acc_lr = logreg.score(X_test, y_test)
acc_nn = ffnn.score(X_test, y_test)
print("acc_lr", acc_lr)
print("acc_nn", acc_nn)

# Plot boundaries
plot_decision_surface(model=logreg, x=X_test, y=y_test, title="Logistic Regression")
plot_decision_surface(model=ffnn, x=X_test, y=y_test, title="Feed Forward Neural Network")

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

print(f'accuracy score: {accuracy_score(y_test, y_pred)}')
cf_mat = confusion_matrix(y_test, y_pred)
print('Confusion matrix')
print(cf_mat)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

Note that accuracy is calculated as:

$$accuracy = \frac{TP+TN}{TP+TN+FP+FN} = \frac{TP+TN}{N}$$

## Let's look at some performance metrics

Let's look at the [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) for our task.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

This shows the precision, recall, F1-score and support for every class, together with the overall accuracy and the macro/weighted averages:

$$precision = \frac{TP}{TP+FP}$$

Out of all the samples predicted as positive for a class, how many are actually positive?

$$recall = TPR = \frac{TP}{TP+FN}$$

Out of all the actual positives, how many did the model correctly identify? (Recall is also called true positive rate, TPR, or sensitivity.)

$$F1score = \frac{2 \cdot precision \cdot recall}{precision + recall}$$

Macro averages don't consider the support for each class when averaging metrics, weighted averages do.

Another metric, not shown in the report, is the false positive rate:

$$FPR = \frac{FP}{FP+TN}$$
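
The formulas above can be checked numerically. The following sketch uses hypothetical label vectors, derives the metrics from the confusion-matrix counts, and compares them with the corresponding functions in `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# hypothetical true labels and predictions
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)            # true positive rate
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)               # false positive rate

print(precision, recall, f1, fpr)
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      f1_score(y_true_toy, y_pred_toy))
```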

ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classification model.

The ROC curve plots the TPR against the FPR at various classification thresholds.

The area under the ROC curve (AUC) is a scalar value that represents the performance of the binary classification model.
AUC ranges between 0 and 1, where 0 corresponds to a model that ranks every negative instance above every positive one, and 1 corresponds to a perfect model that ranks every positive instance above every negative one.

Interpretation of the ROC curve and AUC:

- The closer the ROC curve is to the upper-left corner of the plot (TPR=1, FPR=0), the better the performance of the binary classification model. A common choice for the classification threshold is the one whose point on the curve is closest to that corner.
- If the ROC curve is a diagonal line, it means that the model performs no better than random chance.
- An AUC of 0.5 indicates that the model performs no better than random chance, while an AUC of 1 indicates perfect classification.

AUC can be interpreted as the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance.

In summary, the ROC curve and AUC are useful tools for evaluating the performance of binary classification models.
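
The ranking interpretation of AUC can be verified on a small example. The sketch below, with made-up labels and scores, compares every positive score with every negative score (ties counted as one half) and checks the result against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical labels and predicted scores for the positive class
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.5, 0.9])

pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]

# all positive-negative score differences, shape (n_pos, n_neg)
diff = pos[:, None] - neg[None, :]
# fraction of pairs where the positive is ranked higher (ties count 0.5)
auc_pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

auc_sklearn = roc_auc_score(y_true_toy, scores_toy)
print(auc_pairwise, auc_sklearn)
```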

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve

def plot_roc_curve(true_y, y_prob, name):
    """
    plots the roc curve based of the probabilities
    """
    fpr, tpr, thresholds = roc_curve(true_y, y_prob)
    plt.plot(fpr, tpr, label=name)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(y_test, ffnn.predict_proba(X_test)[:,1], 'FFNN')
# the AUC must be computed from predicted probabilities, not from hard class predictions
print(f'FFNN AUC score: {roc_auc_score(y_test, ffnn.predict_proba(X_test)[:, 1])}')


plot_roc_curve(y_test, logreg.predict_proba(X_test)[:, 1], 'Logistic Regression')
print(f'Logistic regression AUC score: {roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])}')

plt.legend()
plt.show()

Given the prediction of one of your models and the true y for the test set, test out some of the functions for the [sklearn metrics module](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [None]:
y_pred = ffnn.predict(X_test)
y_proba = ffnn.predict_proba(X_test)[:,1] # notice that we get the column relative to the "positive" class

from sklearn.metrics import ...

# your tests here

#### Solution

In [None]:
y_pred = ffnn.predict(X_test)
y_proba = ffnn.predict_proba(X_test)[:,1] # notice that we get the column relative to the "positive" class

from sklearn.metrics import average_precision_score, balanced_accuracy_score, log_loss

print(f"Average precision: {average_precision_score(y_test, y_proba)}")  # expects scores, not hard labels
print(f"Balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}")
print(f"Log loss: {log_loss(y_test, y_proba)}")

## Can we say which model is the best?

What is a T-test?

The null hypothesis of a T-test states that there is no significant difference between the means of two groups.

In the classification case, the test is built from the sample mean and the sample variance of the per-sample correctness indicator $e_i$ defined below.

There are two types of T-tests:
- unpaired samples T-test: used when the two groups being compared are independent of each other
- paired samples T-test: used when the two groups are dependent or related

In our case both models are evaluated on the same test set, so we investigate the application of the paired T-test.




$$
e_i =
\begin{cases}
1, & \text{if } y_i =   f(x_i;\hat \theta)\\
0, & \text{if } y_i \ne f(x_i;\hat \theta)
\end{cases}
$$

$$\overline e = \frac{1}{l}\sum_{i=1}^l e_i ;\qquad s^2 = \overline e (1 - \overline e)$$

Under [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem) we compute:

$$T = \frac{\overline e_{nn} - \overline e_{lr}}
           {\sqrt{ \frac{s^2_{nn}}{l} + \frac{s^2_{lr}}{l}}}$$


We want to check whether the null hypothesis can be rejected. At the 95% confidence level, we cannot reject it if the value of the statistic falls inside the interval $(-1.96, 1.96)$; if it falls outside, the difference between the two models is statistically significant.
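
A minimal sketch of this computation, using invented accuracies (0.95 and 0.85) on a test set of $l = 80$ points: the helper `accuracy_z_test` is hypothetical, and the two-sided p-value is obtained from the standard normal distribution, as suggested by the Central Limit Theorem argument. The statistic is undefined when both accuracies are exactly 1 (zero variance).

```python
import numpy as np
from scipy.stats import norm

def accuracy_z_test(acc_a, acc_b, l):
    """Statistic T and two-sided p-value for two accuracies measured on l test points."""
    s2_a = acc_a * (1 - acc_a)     # sample variance of a 0/1 correctness indicator
    s2_b = acc_b * (1 - acc_b)
    T = (acc_a - acc_b) / np.sqrt(s2_a / l + s2_b / l)
    p_value = 2 * norm.sf(abs(T))  # two-sided tail probability under N(0, 1)
    return T, p_value

# hypothetical accuracies, not results from this lab
T_demo, p_demo = accuracy_z_test(0.95, 0.85, l=80)
print("T = {:.3f}, p-value = {:.4f}".format(T_demo, p_demo))
print("reject null hypothesis at 5%:", abs(T_demo) > 1.96)
```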

In [None]:
# Logistic Regression part
e_lr = (y_test == logreg.predict(X_test)).astype(int)
mean_e_lr = e_lr.mean()
s2_lr = mean_e_lr * (1 - mean_e_lr)
print("mean: {} -- s2: {}".format(mean_e_lr, s2_lr))

# Neural Net part
y_pred = (ffnn.predict(X_test)).astype(int)
e_nn = (y_test == y_pred).astype(int)
mean_e_nn = e_nn.mean()
s2_nn = mean_e_nn * (1 - mean_e_nn)
print("mean: {} -- s2: {}".format(mean_e_nn, s2_nn))

# Test statistics
T = (mean_e_nn - mean_e_lr)
T /= np.sqrt( s2_nn / l + s2_lr / l )
print("is T={} in 95% confidence interval (-1.96, 1.96) ?".format(T))

# t-test
from scipy.stats import ttest_ind
tt, p_val = ttest_ind(e_lr, e_nn, equal_var=False)
print('t-test: T={:.2f}, p-value={:.4f}'.format(tt, p_val))

# paired t-test
from scipy.stats import ttest_rel
tt, p_val = ttest_rel(e_lr, e_nn)
print('paired t-test: T={:.2f}, p-value={:.4f}'.format(tt, p_val))

## Did we finish?

- The best model was the neural net,
- We estimated its performance,
- ...

We retrain the best model on the entire dataset.

And [save it](https://scikit-learn.org/stable/model_persistence.html).


In [None]:
ffnn_final = MLPClassifier(**nn_kwargs)
ffnn_final.fit(X, y)

from joblib import dump, load
dump(ffnn_final, 'my_neural_net.joblib')
loaded_model = load('my_neural_net.joblib')

# Check they are actually the same
print(ffnn_final.score(X, y))
print(loaded_model.score(X, y))

---


## K-fold cross-validation

Say we have a single model and we want to identify a confidence interval for its accuracy.

### Split the data: cross-validation

Cross-validation is a technique used in machine learning to assess the performance of a model. It also reduces the risk that our estimate is biased by one particular split of the data, and it helps us check that what has been learnt on the training set generalizes to unseen data.

It involves splitting the available data into training and validation sets, training the model on the training set, and then evaluating its performance on the validation set. This process is repeated several times with different splits of the data to obtain a more robust estimate of the model's performance.


There are [several types of cross-validation techniques](https://scikit-learn.org/stable/modules/cross_validation.html), including:

- k-fold cross-validation: This involves dividing the data into k equally sized folds, training the model k times, and using a different fold as the validation set each time.
- Stratified cross-validation: This is used when the data is imbalanced, and ensures that the class distribution is preserved in the training and validation sets.

The main idea behind cross-validation is that each observation in our dataset has the opportunity of being tested. We split the dataset into $k$ parts; in each round, one part is used for validation and the remaining $k-1$ parts are merged into a training subset used to fit the model. The figure below illustrates the process of 5-fold cross-validation:


<div style="text-align:center;">
    <img src="https://www.mltut.com/wp-content/uploads/2020/05/cross-validation.png" width="30%">
</div>

$$ \text{Fold: } k_i $$

$$ \text{Train: } D_{-k_i} $$

$$ \text{Evaluate: } \{x_j, y_j\} \in D_{k_i} $$

$$ \text{Performance} = \frac{1}{k} \sum_{i=1}^{k} \text{Performance}_i $$


where:

$D_{-k_i}$ is the training set with the $i$-th fold removed

$D_{k_i}$ is the validation set consisting of the $i$-th fold

$\text{Performance}_i$ is the performance metric (e.g. accuracy) on the $i$-th fold
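
For reference, sklearn can run this whole procedure in one call. The sketch below is an illustration on synthetic data generated with `make_blobs` (not the lab dataset) and uses a logistic regression to keep it fast; `cross_val_score` returns one score per fold, and their mean is the cross-validated performance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic two-class data, used only for this illustration
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# a fresh copy of the model is fitted on each training part and scored on the held-out fold
fold_acc = cross_val_score(LogisticRegression(), X_blobs, y_blobs, cv=cv, scoring="accuracy")

cv_performance = fold_acc.mean()
print("per-fold accuracy:", fold_acc)
print("cross-validated accuracy: {:.3f} +- {:.3f}".format(cv_performance, fold_acc.std()))
```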



In [None]:
from sklearn.model_selection import StratifiedKFold

# Utility to split the data
kfcv = StratifiedKFold(n_splits=10, shuffle=True)
fold_iterator = kfcv.split(X, y)

# Accuracy on each validation fold
acc_nn = []

for idx_train, idx_val in fold_iterator:

    # split data
    X_train, y_train = X[idx_train], y[idx_train]
    X_val, y_val = X[idx_val], y[idx_val]

    # train model
    ffnn = MLPClassifier(**nn_kwargs)  # Remember: train the model from scratch.
    ffnn.fit(X_train, y_train)

    # evaluate model
    current_acc = ffnn.score(X_val, y_val)
    acc_nn.append(current_acc)

print("Acc list:", acc_nn)
print("This is our estimated accuracy:  {:.3f} +- {:.3f}".format(np.mean(acc_nn), np.std(acc_nn)))

Here we end up with a set of ten models, one per fold. Which one should we select?

We have two additional alternatives:

- Use all the data for training:
    - build a new model M on the entire dataset. This model is richer in terms of information and is likely to perform better than any of the fold models, simply because it is trained on more data.

- Ensemble of models:
    - We have k models, and we can use all of them through a voting mechanism: each model in the ensemble makes its own prediction on the input data, and the final prediction is obtained by taking a vote among the individual predictions.
    - Ensemble models are more resilient since they are less affected by individual data points, but the downside is the computational expense.
<div style="text-align:center;">
    <img src="https://raw.githubusercontent.com/Project-MONAI/tutorials/9b796e43e527f29c6b8563b573513dff0fd86d98/figures/models_ensemble.png" width="50%">
</div>
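
A hard-voting ensemble over the fold models can be sketched as follows. This is an illustration on synthetic `make_blobs` data with logistic regressions; the helper `majority_vote` and the query points `X_new` are hypothetical, and an odd number of models is used so that ties cannot occur.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# synthetic two-class data with labels in {0, 1}, used only for this illustration
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# train one model per fold, each on a different training subset
fold_models = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for idx_tr, _ in skf.split(X_blobs, y_blobs):
    fold_models.append(LogisticRegression().fit(X_blobs[idx_tr], y_blobs[idx_tr]))

def majority_vote(models, X):
    """Predict class 1 where more than half of the models vote for class 1."""
    votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
    return (votes.mean(axis=0) > 0.5).astype(int)

X_new = np.array([[0.0, 0.0], [2.0, 1.0]])  # hypothetical query points
y_vote = majority_vote(fold_models, X_new)
print(y_vote)
```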

We could have also compared the two models.

In [None]:
# Utility to split the data
kfcv = StratifiedKFold(n_splits=10, shuffle=True)
fold_iterator = kfcv.split(X, y)

# Accuracies on each validation fold, one list per model
acc_nn = []
acc_lr = []

for idx_train, idx_val in fold_iterator:

    X_train, y_train = X[idx_train], y[idx_train]
    X_val, y_val = X[idx_val], y[idx_val]

    ffnn = MLPClassifier(**nn_kwargs)  # Remember: train the model from scratch.
    ffnn.fit(X_train, y_train)

    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)

    current_nn_acc = ffnn.score(X_val, y_val)
    acc_nn.append(current_nn_acc)
    current_lr_acc = logreg.score(X_val, y_val)
    acc_lr.append(current_lr_acc)

print("LogReg list:   ", acc_lr)
print("NeuralNet list:", acc_nn)

print("LogReg:     {:.3f} +- {:.3f}".format(np.mean(acc_lr), np.std(acc_lr)))
print("NeuralNet:  {:.3f} +- {:.3f}".format(np.mean(acc_nn), np.std(acc_nn)))

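# Paired two sample test
T, p_val = ttest_rel(acc_lr, acc_nn)
print('t-test: T={:.2f}, p-value={:.4f}'.format(T, p_val))
# With only 10 folds the statistic follows a t-distribution with 9 degrees of
# freedom, whose two-sided 95% critical value is about 2.262 (not 1.96).
print("is T={:.2f} in 95% confidence interval (-2.26, 2.26) ?".format(T))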

## More than two models and hyper-parameter tuning

In the context of neural networks, parameters refer to the set of weights and biases that are learned during the training process.
A hyperparameter, on the other hand, is a configuration variable that is set before the training process begins. These variables affect the behavior of the training algorithm itself, and can have a significant impact on the performance of the resulting model.

Examples of hyperparameters include:

- learning rate
- the number of hidden layers
- the number of neurons in each layer
- the activation function used in each layer

The performance of a neural network model heavily depends on the choice of hyperparameters, which can be adjusted in order to find the best combination for a given problem. This process is often referred to as hyperparameter tuning, and can be done using various techniques such as [grid search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), [random search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), etc.

It can be a time-consuming process as it requires training multiple models with different hyperparameter configurations. However, it's a crucial step in building a successful machine learning model as it can significantly improve its performance and generalization ability.

<div style="text-align:center;">
    <img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*5mStLTnIxsANpOHSwAFJhg.png" width="40%">
</div>


We will learn how to apply the grid search hyper-parameter tuning on the number of neurons of the hidden layer and the activation function.
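
For comparison, the same grid can also be explored with sklearn's `GridSearchCV`, which evaluates each combination by cross-validation. This is only a sketch on synthetic `make_blobs` data; the values of `max_iter`, `cv` and `random_state` are illustrative choices, not settings taken from this lab.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# synthetic two-class data, used only for this illustration
X_blobs, y_blobs = make_blobs(n_samples=120, centers=2, cluster_std=2.0, random_state=0)

# every combination of these values is tried (2 x 2 = 4 candidates)
param_grid = {
    "hidden_layer_sizes": [(50,), (150,)],
    "activation": ["tanh", "relu"],
}

search = GridSearchCV(
    MLPClassifier(max_iter=300, solver="adam", random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation for each candidate
    scoring="accuracy",
)
search.fit(X_blobs, y_blobs)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: {:.3f}".format(search.best_score_))
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.predict` can be used directly afterwards.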

In [None]:
# neurons - activation
model_parameters = [(50, "tanh"),
                    (50, "relu"),
                    (150, "tanh"),
                    (150, "relu")]

<div style="text-align:center;">
    <img src="https://studymachinelearning.com/wp-content/uploads/2019/10/summary_activation_fn.png" width="50%">
</div>

### Data split

The model is trained on the training set and evaluated on the test set to assess its performance on unseen data.
The validation set is used to tune hyperparameters such as the learning rate and the number of hidden units.

<div style="text-align:center;">
    <img src="https://b1739487.smushcdn.com/1739487/wp-content/uploads/2021/04/train-and-test-1-min-1.png?lossy=0&strip=1&webp=1" width="35%">
</div>

In [None]:
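# Hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# Split the remaining 80% into training (60% of all data) and validation (20% of all data)
X_atr, X_val, y_atr, y_val = train_test_split(X_train, y_train, test_size=0.25, shuffle=True)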

# Model selection
acc_list = []
for (neurons, activation) in model_parameters:
    print("Training NN with {} neurons and {} activation".format(neurons, activation))

    model = MLPClassifier(hidden_layer_sizes=(neurons,),
                          activation=activation,
                          max_iter=100,
                          solver="adam")
    model.fit(X_atr, y_atr)

    acc = model.score(X_val, y_val)
    acc_list.append(acc)

imax = np.argmax(acc_list)
print("Best model parameters:", model_parameters[imax])

# Performance of best model
(neurons, activation) = model_parameters[imax]
best_model = MLPClassifier(hidden_layer_sizes=(neurons,),
                           activation=activation,
                           max_iter=100,
                           solver="adam")
best_model.fit(X_train, y_train)
final_acc = best_model.score(X_test, y_test)
print("Best model accuracy:", final_acc)

# Final trained model
final_model = MLPClassifier(hidden_layer_sizes=(neurons,),
                           activation=activation,
                           max_iter=100,
                           solver="adam")
final_model.fit(X, y);
dump(final_model, 'my_best_neural_net.joblib')
loaded_final_model = load('my_best_neural_net.joblib')

**Note**: advanced topic for those curious: [nested cross validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
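
A rough sketch of the idea, under the assumption that a small grid over the regularization strength `C` of a logistic regression is a reasonable stand-in for the model being tuned: an inner `GridSearchCV` selects the hyperparameter, and an outer `cross_val_score` estimates the performance of the whole selection procedure. The data come from `make_blobs` and are used for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# synthetic two-class data, used only for this illustration
X_blobs, y_blobs = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # used for model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # used for performance estimation

# inner loop: choose C by cross-validation on the outer training part only
inner_search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=inner_cv)

# outer loop: each outer fold runs the full search, then scores on held-out data
nested_scores = cross_val_score(inner_search, X_blobs, y_blobs, cv=outer_cv)
print("nested CV accuracy: {:.3f} +- {:.3f}".format(nested_scores.mean(), nested_scores.std()))
```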