# Notebook 2: Model selection and evaluation

Notebook prepared by [Chloé-Agathe Azencott](http://cazencott.info) with the help of [Arthur Imbert](https://github.com/Henley13) and contributions from [Giann Karlo](https://www.giannkarlo.info/).

In this notebook, we will learn:
* to evaluate a model on a test set
* to choose the value of a hyperparameter of a learning algorithm
* to understand the benefits of polynomial regression and regularization

In [None]:
# load numpy as np, matplotlib as plt
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [None]:
plt.rc('font', **{'size': 12}) # sets the font size globally for the plots (in pt)

In [None]:
import pandas as pd

## 1. Loading data

We will work with a dataset containing physicochemical information on a number of Portuguese wines (vinho verde), as well as the ratings given to them by people who tasted them. Our goal is to automate this process: we want to directly predict the rating of wines based on their physicochemical characteristics, in order to assist oenologists, improve wine production, and target the tastes of niche consumers.

This dataset is available on the UCI Machine Learning Repository, where you will find many classic datasets: https://archive.ics.uci.edu/dataset/186/wine+quality. There is no need to download it: it is already in your directory, in the `data/winequality-white.csv` file. We will load it with pandas:

In [None]:
df = pd.read_csv('data/winequality-white.csv', # file name
                   sep=";" # column separator
                   )

**Alternatively:** If you need to download the file (for example on colab):

In [None]:
!wget https://raw.githubusercontent.com/CBIO-mines/fml-dassault-systems/main/data/winequality-white.csv

df = pd.read_csv('winequality-white.csv', # filename
                   sep=";" # column separator
                   )

We can now examine this file directly in our notebook, for example by looking at the first lines:

In [None]:
df.head()

### Creation of X and y data matrices

In [None]:
X = np.array(df.drop(columns=['quality']))

In [None]:
y = np.array(df['quality'])

In [None]:
print(X.shape, y.shape)

**Question:** How many training examples are in the data? How many variables?

**Question:** What do you think about using linear regression to solve this problem?

### Transformation into a binary classification problem

In [None]:
y = np.where(y >= 6, 1, 0)

## 2. Separation of data into a training set and a test set

To be able to evaluate a learning model in an unbiased way, we need to create a test set containing data on which the model has not been trained. This test set corresponds to “new” data.

To do this, we will use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function of the scikit-learn `model_selection` module:

In [None]:
from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
                                    test_size=0.3, # 30% of data in the test set
                                    random_state=42 # random generator seed
                                    )

Fixing the random generator seed allows us to get the same training and test sets by rerunning the command.
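
As a quick check of the statement above, here is a minimal sketch on made-up toy data (the arrays `X_demo` and `y_demo` are hypothetical and unrelated to the wine data). It calls `train_test_split` twice with the same `random_state` and compares the results.

```python
import numpy as np
from sklearn import model_selection

# Toy data, made up for this illustration only
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed
split_a = model_selection.train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
split_b = model_selection.train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Same split with the same seed:", same_split)
```

Changing the seed in one of the two calls would, in general, give a different partition.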

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

**Question:** How many samples does the training set (X_train, y_train) contain? And the test set (X_test, y_test)?

### Transformation of variables
We saw in Notebook 1 that it is more reasonable to standardize the variables (center them and scale them to unit variance) before proceeding.

Let's not forget that the test set is supposedly unknown at the time of training: we must use **only the training set** to compute the means and standard deviations used for standardization.

In [None]:
from sklearn import preprocessing

In [None]:
# Create a "standardizer" and calibrate it to the training data
std_scaler = preprocessing.StandardScaler().fit(X_train)

# Apply standardization to training data
X_train_scaled = std_scaler.transform(X_train)

# Apply standardization to test data
X_test_scaled = std_scaler.transform(X_test)
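
To make explicit what the scaler does, the following sketch reproduces the transformation by hand on synthetic data. The arrays `X_tr_demo` and `X_te_demo` are hypothetical stand-ins for a training and a test set; the point is that the test data are transformed with the mean and standard deviation of the training data.

```python
import numpy as np
from sklearn import preprocessing

# Synthetic stand-ins for a training set and a test set (assumption, not the wine data)
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_te_demo = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

# Statistics are estimated on the training data only
scaler_demo = preprocessing.StandardScaler().fit(X_tr_demo)

# Manual equivalent: subtract the training mean, divide by the training standard deviation
mu_train = X_tr_demo.mean(axis=0)
sigma_train = X_tr_demo.std(axis=0)  # population standard deviation (ddof=0)
X_te_manual = (X_te_demo - mu_train) / sigma_train

print(np.allclose(scaler_demo.transform(X_te_demo), X_te_manual))

# The training data are exactly centered; the test data only approximately
print(scaler_demo.transform(X_tr_demo).mean(axis=0).round(6))
print(scaler_demo.transform(X_te_demo).mean(axis=0).round(3))
```

The transformed test set does not have exactly zero mean and unit variance, which is expected: its own statistics were never used.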

## 3. Nearest neighbors

We will now evaluate the ability of a nearest neighbors algorithm to classify wines.

To do this, we use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) class of the scikit-learn `neighbors` module.

In [None]:
from sklearn import neighbors

### Training on the training set

As in Notebook 1, we start by instantiating an object of the class that interests us:

In [None]:
model_knn = neighbors.KNeighborsClassifier()

We can then train it on the standardized training data:

In [None]:
model_knn.fit(X_train_scaled, y_train)

### Predictions on the test set

We can now apply the trained classifier to the test data, which have also been standardized:

In [None]:
y_pred_knn = model_knn.predict(X_test_scaled)

### Performance evaluation

Many metrics make it possible to evaluate the performance of a classification algorithm (see [the scikit-learn doc on this subject](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)). The **confusion matrix** in particular allows you to visualize how many examples of each class receive each label:

In [None]:
from sklearn import metrics

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred_knn)

**Question:** How many true positives are there? False negatives?

The confusion matrix can be summarized by the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).

The **F1 score** is useful for evaluating the performance of a classification model. It combines two other important metrics: **precision** and **recall**. The F1 score is the harmonic mean of precision and recall, providing a single score that balances both.

The F1 score is especially valuable when the dataset is imbalanced. Consider, for example, a **misinformation** dataset in which 99% of posts are not fake and only 1% are. A model that predicts _not fake_ every time would be 99% _accurate_, but it would be useless because it never finds the fake news. The F1 score provides a much better assessment in these cases.

In [None]:
print("F1 of kNN on the test set : %.2f" % metrics.f1_score(y_test, y_pred_knn))
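
To see how the F1 score relates to the confusion matrix, here is a small worked example on made-up labels (100 posts, 5 of them fake; these numbers are an assumption chosen for readability, not taken from a real dataset). It compares a trivial classifier that always predicts "not fake" with one that finds most of the fake posts at the cost of a few false alarms.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 5 fake posts (1) followed by 95 genuine posts (0)
y_true_demo = np.array([1] * 5 + [0] * 95)

# Trivial classifier: always predicts "not fake"
y_pred_trivial = np.zeros(100, dtype=int)

# Hypothetical useful classifier: finds 4 of the 5 fake posts, raises 6 false alarms
y_pred_useful = np.array([1, 1, 1, 1, 0] + [1] * 6 + [0] * 89)

# For binary labels {0, 1}, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true_demo, y_pred_useful).ravel()

precision = tp / (tp + fp)  # fraction of predicted positives that are truly positive
recall = tp / (tp + fn)     # fraction of true positives that are found
f1_manual = 2 * precision * recall / (precision + recall)  # harmonic mean

print("precision = %.2f, recall = %.2f, F1 = %.3f" % (precision, recall, f1_manual))
print("F1 from scikit-learn: %.3f" % metrics.f1_score(y_true_demo, y_pred_useful))

# The trivial classifier has a higher accuracy but an F1 score of zero
print("accuracy (trivial): %.2f" % metrics.accuracy_score(y_true_demo, y_pred_trivial))
print("accuracy (useful):  %.2f" % metrics.accuracy_score(y_true_demo, y_pred_useful))
print("F1 (trivial): %.2f" % metrics.f1_score(y_true_demo, y_pred_trivial, zero_division=0))
```

Here `zero_division=0` only silences the warning raised when no positive is predicted; the resulting score is 0 either way.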

## 4. Selecting the number of nearest neighbors

**Question:** How many nearest neighbors were used in the previous section? Rely on the documentation, for example by typing:

In [None]:
neighbors.KNeighborsClassifier?

### Setting up cross validation

The number of nearest neighbors (`n_neighbors`) is a **hyperparameter** of the nearest neighbors algorithm: it is not part of the parameters of the model learned by the algorithm, but we must set it ourselves before training.

We will now *choose* this number of nearest neighbors by a **grid search** procedure, which consists of *comparing* the performances of models trained using predefined values (the grid) of the hyperparameter.

Heads up! If we want to use the test set to estimate the generalization error of the model with the optimal number of nearest neighbors, we cannot also use it for this selection step: the hyperparameter would then be tuned to the test set, and the resulting performance estimate would be optimistically biased.

To compare our models **on the training set**, we will use **cross-validation**, once again thanks to the [model_selection](http://scikit-learn.org/stable/model_selection.html#model-selection) module of scikit-learn.

In [None]:
n_folds = 10

# Create a KFold object which will allow cross-validation in n_folds folds
kf = model_selection.KFold(n_splits=n_folds,
                           shuffle=True, # shuffle the samples before creating the folds
                           random_state=42
                          )

# Use kf to split the training set into n_folds folds.
# kf.split returns a generator, which is exhausted after one pass.
# To be able to reuse the same folds several times, we store its output as a list of index pairs:
kf_indices = list(kf.split(X_train))

`kf_indices` contains 10 pairs of index vectors.

Each of these pairs corresponds to a fold.

The first vector gives the indices of the samples forming the training part of this fold. The second gives the indices of the samples forming the test part of this fold.

In [None]:
for (idx, fold) in enumerate(kf_indices):
    print("The fold %d contains %d observations for training and %d observations for testing" % (idx, len(fold[0]), len(fold[1])))
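
The index vectors of a fold can be used directly to slice the data. The sketch below illustrates this on a made-up array (`X_demo`, `y_demo`, and the 4-fold split are assumptions for illustration, not the folds defined above).

```python
import numpy as np
from sklearn import model_selection

# Toy data, made up for this illustration only
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.arange(12) % 2

# Same construction as above, with fewer folds to keep the output small
kf_demo = model_selection.KFold(n_splits=4, shuffle=True, random_state=0)
folds_demo = list(kf_demo.split(X_demo))

# Each fold is a pair (training indices, test indices)
train_idx, val_idx = folds_demo[0]

# NumPy integer indexing selects the corresponding rows
X_fold_train, y_fold_train = X_demo[train_idx], y_demo[train_idx]
X_fold_val, y_fold_val = X_demo[val_idx], y_demo[val_idx]
print(X_fold_train.shape, X_fold_val.shape)

# Within a fold, the two index vectors never overlap
print("Shared indices:", np.intersect1d(train_idx, val_idx).size)
```

This is the slicing that scikit-learn performs internally when a list of folds is passed through the `cv` argument.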

**Question:** How many times does each sample appear in the training portion of a fold? In the test part? (There is no need to write any code to answer.)

### Grid search

In [None]:
k_values = np.arange(3, 50, step=2)

In [None]:
k_values

**Question:** Why select only odd numbers of neighbors?

We will now use the [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class of the scikit-learn `model_selection` module to determine the optimal value of the number of nearest neighbors by grid search.

In [None]:
# Instantiating a GridSearchCV object
grid = model_selection.GridSearchCV(neighbors.KNeighborsClassifier(), # predictor to evaluate
                                    {'n_neighbors':k_values}, # hyperparameter value dictionary
                                    cv=kf_indices, # cross-validation to use
                                    scoring='f1' # performance evaluation metric
                                   )

We will also use the [time magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) to measure the calculation time of a cell in our notebook.

In [None]:
%%time
# Fit this object on the (standardized) training data
grid.fit(X_train_scaled, y_train)

The optimal value of the hyperparameter is given by:

In [None]:
print(grid.best_params_)
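
To demystify what the grid search computes, the following sketch reproduces its mean cross-validation scores with an explicit loop. It runs on synthetic data generated with `make_classification`; the data, the 5-fold split, and the three-value grid are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn import datasets, model_selection, neighbors

# Synthetic binary classification data (assumption, not the wine data)
X_demo, y_demo = datasets.make_classification(n_samples=200, n_features=5, random_state=0)

# Fixed folds, reused by both approaches
cv_demo = list(model_selection.KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo))
k_grid_demo = [3, 5, 7]

# Manual grid search: one cross-validation per candidate value
manual_means = []
for k in k_grid_demo:
    fold_scores = model_selection.cross_val_score(
        neighbors.KNeighborsClassifier(n_neighbors=k),
        X_demo, y_demo, cv=cv_demo, scoring='f1')
    manual_means.append(fold_scores.mean())

# Same search with GridSearchCV
grid_demo = model_selection.GridSearchCV(
    neighbors.KNeighborsClassifier(), {'n_neighbors': k_grid_demo},
    cv=cv_demo, scoring='f1')
grid_demo.fit(X_demo, y_demo)

print(np.allclose(manual_means, grid_demo.cv_results_['mean_test_score']))
print("Best k (manual):", k_grid_demo[int(np.argmax(manual_means))])
```

Beyond the loop, `GridSearchCV` also refits the best model on all the data it was given, which is what `best_estimator_` exposes.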

The following code displays the performance of the model according to the value of the hyperparameter:

In [None]:
mean_test_score = grid.cv_results_['mean_test_score']
stde_test_score = grid.cv_results_['std_test_score'] / np.sqrt(n_folds) # standard error

p = plt.plot(k_values, mean_test_score)
plt.plot(k_values, (mean_test_score + stde_test_score), '--', color=p[0].get_color())
plt.plot(k_values, (mean_test_score - stde_test_score), '--', color=p[0].get_color())
plt.fill_between(k_values, (mean_test_score + stde_test_score),
                 (mean_test_score - stde_test_score), alpha=0.2)

best_index = np.where(k_values == grid.best_params_['n_neighbors'])[0][0]
plt.scatter(k_values[best_index], mean_test_score[best_index])


plt.xlabel('number of nearest neighbors')
plt.ylabel('F1')
plt.title("Performance (in cross-validation) along the grid")

### Optimal nearest neighbor model

In [None]:
print("Best F1 in cross-validation: %.3f" % grid.best_score_)

The model trained on the entire data provided to `grid.fit` with the best hyperparameter value(s) is given by `grid.best_estimator_`.

In [None]:
y_pred_knn_opt = grid.best_estimator_.predict(X_test_scaled)

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred_knn_opt)

In [None]:
print("F1 of kNN (optimal k) on the test set : %.3f" % metrics.f1_score(y_test, y_pred_knn_opt))

## 5. Regularized logistic regression

### Performance of an unregularized logistic regression

We will now train a **logistic** regression (because we have a classification problem) *on the training set* and evaluate it *on the test set*.

We use the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) class from the `linear_model` module of scikit-learn.

In [None]:
from sklearn import linear_model

In [None]:
# Create a logistic regression model
model_rlog = linear_model.LogisticRegression(penalty=None) # model not regularized for the moment

# Train this model on (X_train_scaled, y_train)
model_rlog.fit(X_train_scaled, y_train)

In [None]:
# Predict test set labels
y_pred_rlog = model_rlog.predict(X_test_scaled)

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred_rlog)

In [None]:
print("F1 score of a logistic regression on the test set: %.3f" % metrics.f1_score(y_test, y_pred_rlog))

**Question:** What do you think of the quality of the model?

### Model coefficients

In [None]:
# Calculate the number of variables
num_features = X_train.shape[1]

# Display for each variable its coefficient in the model
plt.scatter(range(num_features), # on the abscissa: indices of the variables
            model_rlog.coef_[0] # on the ordinate: their weight in the model (coef_ has shape (1, num_features))
           )

# Label the x-axis tick marks
tmp = plt.xticks(range(num_features), # one mark per variable
                 list(df.columns[:-1]),  # display variable name
                 rotation=90, # turn labels 90 degrees
                 fontsize=14)

# Label the axes
tmp = plt.xlabel('Variable', fontsize=14)
tmp = plt.ylabel('Coefficient', fontsize=14)

In [None]:
# Display the coefficients with their corresponding variable names
for i, col_name in enumerate(df.columns[:-1]):
    print(f"{col_name}: {model_rlog.coef_[0][i]:.4f}")

### Ridge regularization

We will now add an l2 (or ridge) regularization to this logistic regression.

Here there are few variables and their coefficients take small values, so it is not certain that regularization is necessary. However, precisely because this dataset has few variables, it is convenient for observing the effect of regularization on the coefficients of the learned model.

Let's start by defining a grid of values for the regularization parameter `C`.

Watch out! `C` is the *inverse* of the regularization strength: the larger `C` is, *the weaker* the regularization.
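
The direction of this effect can be checked on a small example. The sketch below uses synthetic data from `make_classification` (an assumption, not the wine data) and three arbitrary values of `C`, and prints the Euclidean norm of the learned coefficients for each.

```python
import numpy as np
from sklearn import datasets, linear_model, preprocessing

# Synthetic, standardized binary classification data (assumption)
X_demo, y_demo = datasets.make_classification(n_samples=300, n_features=6, random_state=0)
X_demo = preprocessing.StandardScaler().fit_transform(X_demo)

coef_norms = {}
for c_val in [1e-3, 1.0, 1e3]:
    # The default penalty of LogisticRegression is l2 (ridge)
    clf_demo = linear_model.LogisticRegression(C=c_val, max_iter=1000)
    clf_demo.fit(X_demo, y_demo)
    coef_norms[c_val] = np.linalg.norm(clf_demo.coef_)
    print("C = %.0e  ->  norm of the coefficients = %.3f" % (c_val, coef_norms[c_val]))
```

A small `C` pulls the coefficients strongly towards zero, whereas a large `C` leaves them close to the unregularized solution.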

In [None]:
c_values = np.logspace(-6, 3, 50)
c_values

This time we will not use `GridSearchCV`; instead, we implement the grid search ourselves in order to have access to the coefficients of each of the models:

In [None]:
%%time

f1_per_c = [] # to record the F1 score values for each of the 50 values of C
weights_per_c = [] # to record the coefficients associated with each variable,
                   # for the 50 values of C
for c_val in c_values:
    # Create a logistic regression model regularized by the c_val parameter
    model_ridge = linear_model.LogisticRegression(penalty='l2', C=c_val)

    # Calculate the cross-validation performance of the model
    f1 = model_selection.cross_val_score(model_ridge, # predictor to evaluate
                                         X_train_scaled, y_train, # training data
                                         cv=kf_indices, # cross-validation to use
                                         scoring='f1' # performance evaluation metric
                                         )
    f1_per_c.append(f1)

    # Train the model on the total training set
    model_ridge.fit(X_train_scaled, y_train)

    # Save regression coefficients
    weights_per_c.append(model_ridge.coef_[0])

### Evolution of performance according to the regularization coefficient

In [None]:
mean_test_score = np.mean(np.array(f1_per_c), axis=1)
stde_test_score = np.std(np.array(f1_per_c), axis=1) / np.sqrt(n_folds) # standard error

p = plt.plot(c_values, mean_test_score)
plt.plot(c_values, (mean_test_score + stde_test_score), '--',
         color=p[0].get_color()) # reuse the color of the mean curve instead of advancing the color cycle
plt.plot(c_values, (mean_test_score - stde_test_score), '--', color=p[0].get_color())
plt.fill_between(c_values, (mean_test_score + stde_test_score),
                 (mean_test_score - stde_test_score), alpha=0.2)


plt.xscale('log') # use a logarithmic abscissa scale

# Label the axes
tmp = plt.xlabel('Value of C', fontsize=14)
tmp = plt.ylabel('Average F1', fontsize=14)

# Title
tmp = plt.title("Performance (cross-validation) of logistic regression", fontsize=14)

**Question:** How does the model performance (in cross-validation) change with the amount of regularization?

### Optimal ridge-regularized logistic regression model

In [None]:
# Find the index of the optimal value of C
best_C_idx = np.argmax(np.mean(f1_per_c, axis=1))

# Optimal C value
c_opt = c_values[best_C_idx]
print("Optimal C value (ridge regression): %.3e" % c_opt)

# Corresponding F1 score
print("F1 score (cross-validation) of the optimal regularized logistic regression model: %.2f +/- %.2f" %
      (np.mean(np.array(f1_per_c)[best_C_idx]), # average value
       np.std(np.array(f1_per_c)[best_C_idx]) # standard deviation
      ))

### Evolution of regression coefficients as a function of regularization

In [None]:
# Create a figure
fig = plt.figure(figsize=(8, 5))

# Changing colors for better visualization
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=['blue', 'green', 'red', 'cyan', 'magenta', 'yellow', 'black', 'purple', 'pink', 'brown', 'orange', 'teal', 'coral', 'lightblue', 'lime', 'lavender', 'turquoise', 'darkgreen', 'tan', 'salmon', 'gold'])

lines = plt.plot(c_values,
                 weights_per_c # ordinate = values of regression coefficients
                )
plt.xscale('log') # logarithmic scale in abscissa

# Display again (at the abscissa 2x1e3) the regression coefficients obtained without regularization
for coeff in model_rlog.coef_[0]:
    plt.scatter([2e3], [coeff])

# Mark the optimal value of C with a vertical bar
plt.plot([c_opt, c_opt], [-0.75, 1.25], 'k--')

# Show legend
tmp = plt.legend(lines, # line handles returned by plt.plot
                 list(df.columns[:-1]), # name of each variable, excluding 'quality'
                 frameon=False, # no frame around the legend
                 loc=(1, 0),  # place the legend to the right of the plot
                 fontsize=14)

tmp = plt.xlabel('Value of C', fontsize=14)
tmp = plt.ylabel('Regression coefficient', fontsize=14)

tmp = plt.title('Logistic regression', fontsize=16)

In [None]:
# Get the coefficients of the best estimator from the grid search
optimal_coefficients = weights_per_c[best_C_idx]

# Display the coefficients with their corresponding variable names
for i, col_name in enumerate(df.columns[:-1]):
    print(f"{col_name}: {optimal_coefficients[i]:.4f}")

**Question:** How do the model coefficients change depending on the amount of regularization?

**Question:** Do these coefficients seem consistent with those obtained for non-regularized logistic regression?

## 6. Ridge regularization on a textbook case

To better understand ridge regularization, we will simulate a non-linear data set which will take the form of a sinusoidal curve.

### Data simulation

In [None]:
nb_samples = 30

np.random.seed(13)

# real model
def true_model(X):
    return np.cos(1.5 * np.pi * X) * 5

# "ground truth" samples taken from the real model
X_ground_truth = np.linspace(0, 1, 100).reshape(-1, 1)
y_ground_truth = true_model(X_ground_truth)

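# data = observations drawn from the real model, with noise added below
X = np.sort(np.random.rand(nb_samples, 1), axis=0) # sort along the sample axis (the default axis=-1 would leave a column vector unchanged)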
y = true_model(X)
# adding noise
y += np.random.randn(nb_samples, 1) * 0.3

print(X.shape, y.shape)

In [None]:
# Draw the real model
plt.plot(X_ground_truth, y_ground_truth, label="Real model", linewidth=2)

# View simulated data
plt.scatter(X, y, label="Simulated data", marker="o")

plt.xlabel("X")
plt.ylabel("y")
plt.legend(loc="best")
plt.tight_layout()

### Training/test split

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

### Linear regression

**Question:** How many variables do we have in our problem?

Let's train a “classic” linear regression (like the one seen in Notebook 1) on `(X_train, y_train)` and evaluate its performance on the training set and on the test set.

**Question:** Why compare these two performances?

In [None]:
# Training - initialize linear regression
reg = linear_model.LinearRegression()
# Train the linear model
reg.fit(X_train, y_train)

#### Performance

In [None]:
# RMSE
print("RMSE of a linear regression:")
# On the training set
rmse_reg_train = metrics.root_mean_squared_error(y_train, reg.predict(X_train))
print("\r train: {0:0.2f}".format(rmse_reg_train))
# On the test set
rmse_reg_test = metrics.root_mean_squared_error(y_test, reg.predict(X_test))
print("\r test: {0:0.2f}".format(rmse_reg_test))

We will now display the model learned on the previous graph.

In [None]:
# Draw the real model
plt.plot(X_ground_truth, y_ground_truth, label="Real model", linewidth=2)

# Show learned model
y_model = reg.predict(X_ground_truth)
plt.plot(X_ground_truth, y_model, label="Model learned", linewidth=2)

# View simulated data
plt.scatter(X_train, y_train, label="Simulated data (train)", marker="o")
plt.scatter(X_test, y_test, label="Simulated data (test)", marker="D")

# plot format
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear regression")
plt.legend(loc="best")
plt.tight_layout()

**Question:** What do you think of the performance of linear regression here?

### Polynomial regression

Polynomial regression consists of learning a non-linear model by learning a linear model on a new set of variables, formed from monomials of the variables describing our data.

Generally speaking, for a problem described by $p$ variables $(X_1, X_2, \dots, X_p)$, a polynomial regression of degree $d$ is a linear regression on the variables $(X_1, X_2, \dots, X_p, X_1^2, X_2^2, X_3^2, \dots, X_p^2, \dots, X_p^d)$. Note that we are thus creating a large number of variables that are correlated with each other: we gain modeling flexibility, but at the cost of a more complex model, a higher risk of overfitting, and exposure to the curse of dimensionality.

Such a transformation is possible with the `PolynomialFeatures` class of `sklearn.preprocessing`.

Here, we fit a linear model on the powers of $X$ rather than on $X$ alone: we are approximating the true model using a polynomial.

In [None]:
# calculation of powers of x, up to degree 15
polynomial_features = preprocessing.PolynomialFeatures(degree=15)# , include_bias=False)

# creation of corresponding datasets
X_train_poly = polynomial_features.fit_transform(X_train)
X_test_poly = polynomial_features.transform(X_test)
X_ground_truth_poly = polynomial_features.transform(X_ground_truth)

print(X_train_poly.shape)
print(X_test_poly.shape)
print(X_ground_truth_poly.shape)

**Question:** How many variables do we have now?

In [None]:
# Training
reg_poly = linear_model.LinearRegression()
reg_poly.fit(X_train_poly, y_train)

#### Performance

In [None]:
# RMSE
print("RMSE of a polynomial regression:")
# On the training set
rmse_reg_poly_train = metrics.root_mean_squared_error(y_train, reg_poly.predict(X_train_poly))
print("\r train: {0:0.2f}".format(rmse_reg_poly_train))
# On the test set
rmse_reg_poly_test = metrics.root_mean_squared_error(y_test, reg_poly.predict(X_test_poly))
print("\r test: {0:0.2f}".format(rmse_reg_poly_test))

**Question:** Compare the performance of the model on the training set and the test set. What do you conclude?

We will now display the model learned on the previous graph.

In [None]:
# Draw the real model
plt.plot(X_ground_truth, y_ground_truth, label="Real model", linewidth=2)

# Show learned model
plt.plot(X_ground_truth, reg_poly.predict(X_ground_truth_poly), label="Model learned", linewidth=2)

# View simulated data
plt.scatter(X_train, y_train, label="Simulated data (train)", marker="o")
plt.scatter(X_test, y_test, label="Simulated data (test)", marker="D")

# plot format
plt.xlabel("X")
plt.ylabel("y")
plt.title("Polynomial regression")
plt.legend(loc="best")
plt.tight_layout()
plt.ylim([-6, 6])

**Question:** What can you conclude about choosing polynomial regression?

#### Model coefficients

In [None]:
# Calculate the number of variables
num_features = X_train_poly.shape[1]

# Display for each variable its coefficient in the model
plt.scatter(range(num_features), # on the abscissa: indices of the variables
            reg_poly.coef_ # on the ordinate: their weight in the model
           )

# Label the axes
tmp = plt.xlabel('Variable', fontsize=14)
tmp = plt.ylabel('Coefficient', fontsize=14)
plt.yscale("symlog") # symmetric log scale: a plain log scale cannot display negative coefficients

**Question:** What do you notice? Pay close attention to the scale of the coefficients.

### Ridge regularized polynomial regression

As polynomial regression overfits, we will now apply a ridge regularization term to it to try to compensate for this effect.

In [None]:
# Training
ridge_poly = linear_model.Ridge(alpha=0.01, random_state=13)
ridge_poly.fit(X_train_poly, y_train)
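
For readers who want to see what the ridge penalty does numerically, here is a minimal sketch on synthetic data (the arrays, `w_true`, and the values of `alpha` are assumptions). Without an intercept, ridge regression minimizes $\|y - Xw\|^2 + \alpha \|w\|^2$, whose solution is $w = (X^\top X + \alpha I)^{-1} X^\top y$; the code compares this formula with `linear_model.Ridge`.

```python
import numpy as np
from sklearn import linear_model

# Synthetic regression data (assumption, unrelated to the simulated cosine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
w_true = np.array([1.0, -2.0, 0.5, 0.0])
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=50)

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha I) w = X^T y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Same alpha as in the cell above
w_closed = ridge_closed_form(X_demo, y_demo, alpha=0.01)
ridge_demo = linear_model.Ridge(alpha=0.01, fit_intercept=False).fit(X_demo, y_demo)
print(np.allclose(w_closed, ridge_demo.coef_))

# A much larger alpha shrinks the coefficients towards zero
w_strong = ridge_closed_form(X_demo, y_demo, alpha=100.0)
print(np.linalg.norm(w_closed), np.linalg.norm(w_strong))
```

Note the opposite convention compared with `C` in `LogisticRegression`: for `Ridge`, a larger `alpha` means more regularization.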

#### Performance

In [None]:
# RMSE
print("RMSE of a regularized polynomial regression:")
# On the training set
rmse_ridge_poly_train = metrics.root_mean_squared_error(y_train, ridge_poly.predict(X_train_poly))
print("\r train: {0:0.2f}".format(rmse_ridge_poly_train))
# On the test set
rmse_ridge_poly_test = metrics.root_mean_squared_error(y_test, ridge_poly.predict(X_test_poly))
print("\r test: {0:0.2f}".format(rmse_ridge_poly_test))

**Question:** Compare the performance of the model on the training set and the test set. What do you conclude?

We will now display the model learned on the previous graph.

In [None]:
# Draw the real model
plt.plot(X_ground_truth, y_ground_truth, label="Real model", linewidth=2)

# Show learned model
plt.plot(X_ground_truth, ridge_poly.predict(X_ground_truth_poly), label="Model learned", linewidth=2)

# View simulated data
plt.scatter(X_train, y_train, label="Simulated data (train)", marker="o")
plt.scatter(X_test, y_test, label="Simulated data (test)", marker="D")

# plot format
plt.xlabel("X")
plt.ylabel("y")
plt.title("Regularized polynomial regression")
plt.legend(loc="best")
plt.tight_layout()

**Question:** What can you conclude about choosing Ridge regularization?

#### Model coefficients

In [None]:
# Calculate the number of variables
num_features = X_train_poly.shape[1]

# Display for each variable its coefficient in the model
plt.scatter(range(num_features), # on the abscissa: indices of the variables
            ridge_poly.coef_ # on the ordinate: their weight in the model
           )

# Label the axes
tmp = plt.xlabel('Variable', fontsize=14)
tmp = plt.ylabel('Coefficient', fontsize=14)

**Question:** What do you notice now? What is the effect of regularization on the model coefficients?

### Lasso regularized polynomial regression

In [None]:
# Training
lasso_poly = linear_model.Lasso(alpha=0.01, random_state=13)
lasso_poly.fit(X_train_poly, y_train)
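
The lasso (l1) penalty differs from ridge in that it tends to set some coefficients exactly to zero, whereas ridge only shrinks them. The sketch below illustrates this on synthetic data in which only the first two of ten variables matter (the data and the value `alpha=0.1` are assumptions for illustration).

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: y depends on the first two variables only (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=100)

ridge_demo = linear_model.Ridge(alpha=0.1).fit(X_demo, y_demo)
lasso_demo = linear_model.Lasso(alpha=0.1).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print("Exactly-zero coefficients: ridge = %d, lasso = %d" % (n_zero_ridge, n_zero_lasso))
print("Lasso coefficients:", lasso_demo.coef_.round(2))
```

This sparsity is why the lasso is often used as a variable selection tool; the same counting can be applied to `lasso_poly.coef_` and `ridge_poly.coef_`.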

#### Performance

In [None]:
# RMSE
print("RMSE of a lasso-regularized polynomial regression:")
# On the training set
rmse_lasso_poly_train = metrics.root_mean_squared_error(y_train, lasso_poly.predict(X_train_poly))
print("\r train: {0:0.2f}".format(rmse_lasso_poly_train))
# On the test set
rmse_lasso_poly_test = metrics.root_mean_squared_error(y_test, lasso_poly.predict(X_test_poly))
print("\r test: {0:0.2f}".format(rmse_lasso_poly_test))

We will now display the model learned on the previous graph.

In [None]:
# Draw the real model
plt.plot(X_ground_truth, y_ground_truth, label="Real model", linewidth=2)

# Show learned model
plt.plot(X_ground_truth, lasso_poly.predict(X_ground_truth_poly), label="Model learned", linewidth=2)

# View simulated data
plt.scatter(X_train, y_train, label="Simulated data (train)", marker="o")
plt.scatter(X_test, y_test, label="Simulated data (test)", marker="D")

# plot format
plt.xlabel("X")
plt.ylabel("y")
plt.title("l1 regularized polynomial regression")
plt.legend(loc="best")
plt.tight_layout()

#### Model coefficients

In [None]:
# Calculate the number of variables
num_features = X_train_poly.shape[1]

# Display for each variable its coefficient in the model
plt.scatter(range(num_features), # on the abscissa: indices of the variables
            lasso_poly.coef_ # on the ordinate: their weight in the model
           )

# Label the axes
tmp = plt.xlabel('Variable', fontsize=14)
tmp = plt.ylabel('Coefficient', fontsize=14)

## Conclusion
We have reached the end of this notebook. Here is a summary of what we have covered, with the key takeaways:
- We used the `scikit-learn` library to classify wine quality from continuous variables (wine features).
- We tried a first classifier: [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (after data scaling).
- We evaluated the model performance with a confusion matrix and with the F1 score, which balances **precision** and **recall**.
- We used cross-validation and grid search to find the number of neighbors giving the best F1 score. **Cross-validation** assesses how well a model generalizes to unseen data by splitting the dataset into multiple subsets. **Grid search** optimizes hyperparameters by systematically testing a range (or combinations) of values, such as the number of neighbors.

We then performed a grid search by hand to study the effect of regularization through the hyperparameter `C`. Remember: the larger `C` is, the weaker the regularization. Finally, we explored regularization on polynomial regression. We saw how a polynomial of degree 15 can overfit the training data and stray from the _ground truth_, and how applying a regularization technique (ridge or lasso) yields a better fit.