# Lab 7. Learning curves

#### Table of contents

1. Overview
2. Equation of state
3. Linear regression with sklearn
4. Prepare the data & linear regression
5. Learning curves
6. Model selection

## 1. Overview

In this lab session we will learn how to use Python's optimized libraries to perform linear regression, and we will explore learning curves based on simulated equation-of-state data.

## 2. Equation of state

In physics and thermodynamics, an equation of state is a thermodynamic equation relating state variables which describe the state of matter under a given set of physical conditions, such as pressure, volume, temperature (PVT), or internal energy [Wikipedia].
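
As a simple illustration (unrelated to the crystal data used later in this lab), the ideal gas law $PV = nRT$ is a textbook equation of state relating pressure, volume, and temperature. The sketch below evaluates it; the function name `ideal_gas_pressure` is a hypothetical helper introduced only for this example.

```python
# Ideal gas law P V = n R T, a textbook equation of state.
R = 8.314462618  # molar gas constant in J/(mol K)

def ideal_gas_pressure(n_mol, temperature_k, volume_m3):
    """Return the pressure in pascals of an ideal gas (hypothetical helper)."""
    return n_mol * R * temperature_k / volume_m3

# One mole at 273.15 K in 22.414 litres sits close to one atmosphere.
pressure = ideal_gas_pressure(1.0, 273.15, 22.414e-3)
print(pressure)
```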

## 3. Linear regression with sklearn

In the previous labs, we wrote various functions from scratch to perform pre-processing of the data, regression, regularization, post-processing, etc. The goal of these labs was to understand the principles behind the main machine learning algorithms and techniques. We will now use Python libraries to perform these tasks, because their functions have been highly optimized to work with large datasets and large numbers of features. Sklearn (scikit-learn) is an efficient Python library for machine learning.

### 3.1. Linear regression

Linear regression can be achieved with sklearn on the dataset `{X,y}` simply with the commands:

`model = sklearn.linear_model.LinearRegression()`<br>
`model.fit(X,y)`

Note that `X` must be a 2D ndarray of shape (m, n), with `m` the number of samples and `n` the number of features (`y` can have shape (m,) or (m, 1)). For example, if the data `X0` is a pandas Series, you can convert it to a 2D ndarray with:

`X = np.c_[X0]`

or alternatively,

`X = X0.to_numpy().reshape(-1,1)`
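
A quick check, on a made-up Series, that both conversions produce the same `(m, 1)` array. The values in `X0` below are placeholders, not the lab data.

```python
import numpy as np
import pandas as pd

# Made-up volumes standing in for the real Series.
X0 = pd.Series([10.0, 11.0, 12.0], name="Volume")

X_a = np.c_[X0]                      # column-stack the Series into a 2D array
X_b = X0.to_numpy().reshape(-1, 1)   # explicit reshape to a single column

print(X0.shape, X_a.shape, X_b.shape)
print(np.array_equal(X_a, X_b))
```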

To predict the output values of the model based on an input array `Z` (of the same shape as `X`) you can use the following command:

`model.predict(Z)`

This returns one prediction per row of `Z`, i.e. an array whose first dimension is `Z.shape[0]`.
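
A minimal end-to-end sketch on synthetic data, assuming a noiseless line $y = 2x + 1$ chosen only for illustration. It shows the shapes that `fit` and `predict` take in and return.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: m = 5 samples, n = 1 feature, exactly y = 2x + 1.
X = np.arange(5, dtype=float).reshape(-1, 1)
y = 2.0 * X + 1.0                      # shape (5, 1)

model = LinearRegression()
model.fit(X, y)

Z = np.array([[10.0], [20.0]])         # two new inputs, shape (2, 1)
y_new = model.predict(Z)

print(model.coef_, model.intercept_)   # slope and intercept recovered by the fit
print(y_new.shape)                     # one row per row of Z
```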

### 3.2. Linear regression with polynomial features

With sklearn you can also easily transform features. Starting from the feature `X`, you can create polynomial features as:

`poly_features = sklearn.preprocessing.PolynomialFeatures(degree = degree)`<br>
`X_poly = poly_features.fit_transform(X)`

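with `degree` the degree of the polynomial. For example, if your feature is `X = [[1],[2],[3]]`, the polynomial features of degree 2 will be `X_poly = [[1,1,1],[1,2,4],[1,3,9]]`: by default, `PolynomialFeatures` also includes a first column of ones (the degree-0 term). Pass `include_bias=False` to drop that column.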

Because sklearn's `LinearRegression` model takes in an ndarray, you can directly provide a design matrix with polynomial features to perform linear regression.
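
The sketch below prints the transformed array for a three-sample feature so the column layout can be inspected. With the default settings, `PolynomialFeatures` prepends a constant column of ones; the `include_bias=False` variant is shown as an option for when that column is not wanted, since `LinearRegression` fits an intercept by default.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [2], [3]])

# Default settings: the columns are [1, x, x^2].
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
print(X_poly)

# Without the constant column: the columns are [x, x^2].
X_poly_nobias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly_nobias)
```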

### 3.3. Metrics

Sklearn also provides a wide variety of pre-built performance measures. For example, you can compute the mean square error between predicted values `y_predict` and actual data `y` as:

`mse = sklearn.metrics.mean_squared_error(y,y_predict)`
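
A small worked example with made-up numbers, so the returned value can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions.
y = np.array([[1.0], [2.0], [3.0]])
y_predict = np.array([[1.0], [2.0], [5.0]])

# The residuals are 0, 0 and -2, so the MSE is (0 + 0 + 4) / 3.
mse = mean_squared_error(y, y_predict)
print(mse)
```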

You can learn more about sklearn on the official [website](https://scikit-learn.org/stable/).

## 4. Prepare the data & linear regression

Let's first load some noisy equation-of-state data representing the energy of a crystal as a function of its volume.

In [None]:
import pandas as pd
e0 = pd.read_csv('elastic.csv')
e0.info()

In [None]:
e0.head()

__Q.1.__ Define the series `X0` and `y0` containing the data for the `Volume` and the `Energy`, respectively. The data is clean and there is no need to check for NaN values (1 mark).

In [None]:
### BEGIN SOLUTION
### END SOLUTION

We can now visualize the data.

In [None]:
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

plt.plot(X0,y0,marker='.',lw=0,c='b',ms=12)
plt.xlabel('$X_0$',fontsize=22)
plt.ylabel('$y_0$',fontsize=22)
plt.show()

We need to prepare data for sklearn. Let's convert the Series to a numpy array and reshape the array to be 2D. This must be done because sklearn takes in a general ndarray as the design matrix.

In [None]:
X = np.c_[X0]
y = np.c_[y0]
print(X.shape,y.shape)

We can now perform linear regression with sklearn.

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X,y)

__Q.2.__ Define the array `y_predict` that contains the estimates of the linear model for the input `X`. `y_predict` must be of shape (m,1), with `m` the number of examples in the dataset (1 mark).

In [None]:
### BEGIN SOLUTION
### END SOLUTION
print(y_predict.shape)

Let's plot the linear model together with the data points.

In [None]:
plt.plot(X0,y0,marker='.',lw=0,c='b',ms=12,label='data')
plt.plot(X,y_predict,marker='o',ms=0,lw=2,color='r',label='linear regression')
plt.xlabel('$X_0$',fontsize=22)
plt.ylabel('$y_0$',fontsize=22)
plt.legend()
plt.show()

Of course, the fit is very poor because the data is not linear at all.

__Q.3.__ Assign to the variable `rmse` the root mean squared error between the output data and the prediction of the linear model (1 mark).

In [None]:
from sklearn.metrics import mean_squared_error

### BEGIN SOLUTION
### END SOLUTION

print("The RMSE between the data and the linear fit is:", rmse)

## 5. Learning curves

We will now see how the number of examples in the training set affects the mean squared error on the training and validation sets. We first need to divide the data into training and validation sets. One could simply split the dataset array sequentially; however, it is better to select the training and validation examples at random. This can be achieved with sklearn as follows:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.2,random_state=235)
print(X_train.shape,X_val.shape)

The argument `test_size` sets the fraction of the dataset held out as the test set (here, the validation set). We fix the random state for reproducibility; do not change its value.

Let's visualize the training and validation datasets.

In [None]:
plt.plot(X_train,y_train,marker='.',lw=0,c='g',label='Training data',ms=12)
plt.plot(X_val,y_val,marker='.',lw=0,c='b',label='Validation data',ms=12)
plt.xlabel('$X$',fontsize=22)
plt.ylabel('y',fontsize=22)
plt.legend()
plt.show()
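
The behaviour of the random split can be checked on a toy array. The sketch below uses placeholder values (not the lab data) to show the resulting shapes, that each input stays paired with its output after shuffling, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays with 10 samples; the values are placeholders.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = 10 * X_toy

X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=235)
print(X_tr.shape, X_va.shape)          # 8 training rows, 2 validation rows

# Rows stay paired: each y is still 10 times its x after shuffling.
print(np.array_equal(y_tr, 10 * X_tr))

# The same random_state reproduces the same split.
X_tr2, X_va2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=235)
print(np.array_equal(X_va, X_va2))
```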

We would now like to plot the learning curve of the model. This corresponds to training the model on a variable number of examples ($1 \rightarrow m_{train}$) and plotting the corresponding training and validation errors.

__Q.4.__ Complete the function below that takes in a model (for example LinearRegression as in the last line of the code block below) and the full training and validation datasets. The function does the following (you must complete what is highlighted in red):

- loops over the number of training examples (`m` from 1 training example to $m_{train}$)
- fits the model based on the selected training examples
<font color=red>
- predicts the training output values of the selected training set (size `m`)
- predicts the validation output values based on the full validation set
- computes the MSE between the actual training output and the predicted output for the selected training set
- computes the MSE between the actual validation output and the predicted output for the full validation set
</font>
- stores the MSE values in lists
- plots the corresponding RMSE of the training and validation data

(3 marks)

In [None]:
def plot_learning_curves(model,X_train,y_train,X_val,y_val):
    mse_train_list,mse_val_list = [],[]
    mtrain = X_train.shape[0]
    for m in range(1,mtrain+1):
        model.fit(X_train[:m],y_train[:m])
        ### BEGIN SOLUTION
        ### END SOLUTION
        mse_train_list.append(mse_train)
        mse_val_list.append(mse_val)
    sizes = np.arange(1,len(mse_train_list)+1)  # training set sizes start at 1, not 0
    plt.plot(sizes,np.sqrt(mse_train_list),'r-+',lw=2,label='Training error')
    plt.plot(sizes,np.sqrt(mse_val_list),'b-',lw=3,label='Validation error')
    plt.xlabel('Training set size',fontsize=22)
    plt.ylabel('RMSE',fontsize=22)
    plt.legend()
    plt.ylim(0,8)
    plt.show()
    
model = LinearRegression()
plot_learning_curves(model,X_train,y_train,X_val,y_val)
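
One feature of such a curve can be checked by hand: a straight line passes exactly through two points with distinct inputs, so the training error is zero there, whereas a third point that is not aligned makes it positive. The sketch below uses NumPy's `np.polyfit` instead of sklearn, and three made-up points.

```python
import numpy as np

# Three made-up points that do not lie on a single line.
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([0.0, 1.0, 4.0])

def line_fit_mse(x, y):
    """Least-squares straight-line fit, then the MSE on the same points."""
    coeffs = np.polyfit(x, y, deg=1)
    residuals = y - np.polyval(coeffs, x)
    return np.mean(residuals**2)

mse_two = line_fit_mse(x_demo[:2], y_demo[:2])   # two examples: exact fit
mse_three = line_fit_mse(x_demo, y_demo)         # three examples: no exact fit
print(mse_two, mse_three)
```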

Let's discuss the curves. When there are only one or two examples in the training set, the model can fit them perfectly, which is why the training error starts at zero. But as more examples are added, it becomes impossible for the linear model to fit the training data, because the data is noisy and not linear at all. The error on the training data therefore goes up until it reaches a plateau, at which point adding new training examples does not change the training error much.

Concerning the validation error, when there are only a few examples in the training set, the model cannot generalize, which is why the validation error starts high. As we add examples to the training set, the model learns and the validation error goes down. However, once again, a straight line cannot do a good job of fitting non-linear data, hence the validation error also reaches a plateau.

These learning curves are typical of an underfitting model: both curves have reached a plateau, and they are close to each other at a high error value. If a model underfits the training data, adding more examples won't help; you need a more complex model to improve the description. For example, we can add polynomial features to the input data. Let's define polynomial features of degree 20.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree = 20)
X_poly_train = poly_features.fit_transform(X_train)
X_poly_val = poly_features.transform(X_val)  # reuse the transformer fitted on the training set
print(X_poly_train.shape,X_poly_val.shape)

Note that the number of columns is 21: the 20 powers of the volume plus the leading column of ones (the degree-0 term) that `PolynomialFeatures` adds by default. We can now plot the learning curves corresponding to this new model. The only difference is that we pass in the newly created polynomial features.

In [None]:
model = LinearRegression()
plot_learning_curves(model,X_poly_train,y_train,X_poly_val,y_val)

These learning curves look similar to the previous ones, with the following important differences:

- The error on the training data is much lower than with the simple linear model
- There is a gap between the training and validation error curves, especially for small training set sizes, which is the mark of an overfitting model

We note that as more training examples are added, the curves get closer to each other.
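
For reference, sklearn also ships a helper, `sklearn.model_selection.learning_curve`, that computes similar curves using cross-validation rather than a single hold-out split. It is not the method used in this lab; the sketch below only shows how it could be called, on a synthetic noisy parabola chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic noisy parabola; the lab's data file is not used here.
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(60, 1))
y_syn = X_syn[:, 0] ** 2 + rng.normal(scale=0.5, size=60)

sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_syn, y_syn,
    train_sizes=np.linspace(0.1, 1.0, 5),   # fractions of the largest training fold
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_mean_squared_error",
)

# Scores are negated MSEs, one column per fold: average over folds, then take the root.
rmse_train = np.sqrt(-train_scores.mean(axis=1))
rmse_val = np.sqrt(-val_scores.mean(axis=1))
print(sizes)
print(rmse_train)
print(rmse_val)
```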

## 6. Model selection



__Q.5.__ Complete the function below, which returns the best polynomial degree to fit the dataset based on the minimum of the validation error. The function does the following (you must complete what is highlighted in red):

- it loops over degrees of the polynomial (`degree` from 1 to 20)
- it transforms training and validation features into polynomial features of degree `degree`
- fits the model based on the polynomial training features (full training examples)
<font color=red>
- predicts the training output values based on the (full) training set
- predicts the validation output values based on the (full) validation set
- computes the MSE between the actual training data output and the predicted training data
- computes the MSE between the actual validation data output and the predicted validation data
</font>
- stores the MSE values into lists
- returns lists of the training and validation MSEs and the argument of the minimum validation error (adds 1 because the list indices start at 0)

(3 marks)

In [None]:
def find_best_degree(degrees,model,X_train,y_train,X_val,y_val):
    mse_train_list,mse_val_list = [],[]
    for degree in degrees:
        poly_features = PolynomialFeatures(degree = degree)
        X_poly_train = poly_features.fit_transform(X_train)
        X_poly_val = poly_features.transform(X_val)  # reuse the transformer fitted on the training set
        model.fit(X_poly_train, y_train)
        ### BEGIN SOLUTION
        ### END SOLUTION
        mse_train_list.append(mse_train)
        mse_val_list.append(mse_val)
    return mse_train_list, mse_val_list, np.argmin(mse_val_list)+1

model = LinearRegression()
degrees = list(range(1,21))  # degrees 1 to 20 inclusive
mse_train_list, mse_val_list, best_degree = find_best_degree(degrees,model,X_train,y_train,X_val,y_val)
print("The degree of polynomial with lowest validation error is: ", best_degree)

We can now plot the training and validation MSEs as a function of the degree of the polynomial to better appreciate the model selection.

In [None]:
plt.xlabel("Polynomial degree",fontsize=22)
plt.ylabel("Error",fontsize=22)
plt.plot(degrees,mse_train_list,'r-+',lw=2,label='Training error')
plt.plot(degrees,mse_val_list,'b-',lw=3,label='Validation error')
plt.ylim(0,2)
plt.legend()
plt.tight_layout()
plt.savefig('figure.pdf')
plt.show()

We see that the training error decreases as we increase the polynomial degree, because a more flexible model can always fit the training points at least as well. The validation error, however, reaches a minimum and then increases again because of overfitting. Let's plot the data and the best polynomial that fits it.

__Q.6.__ Complete the code below to assign `y_predict` the array of output values predicted by the best polynomial model over the training data set (1 mark).

In [None]:
poly_features = PolynomialFeatures(degree = best_degree)
X_poly_train = poly_features.fit_transform(X_train)
model.fit(X_poly_train, y_train)

### BEGIN SOLUTION
### END SOLUTION

plt.plot(X_train,y_train,marker='.',lw=0,c='g',label='Training data',ms=12)
plt.plot(X_val,y_val,marker='.',lw=0,c='b',label='Validation data',ms=12)
order = np.argsort(X_train[:,0])  # sort by volume so the fitted curve is drawn left to right
X_fit, y_predict = X_train[order], y_predict[order]
plt.plot(X_fit,y_predict,lw=2,c='r',marker=None,label='Polynomial of degree '+str(best_degree))
plt.xlabel('$X$',fontsize=22)
plt.ylabel('y',fontsize=22)
plt.legend()
plt.show()
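
The trend in the training error described in this section can be reproduced on synthetic data. The sketch below uses NumPy's `np.polyfit` rather than sklearn, with a made-up noisy cubic, and checks that the training MSE never increases with the degree.

```python
import numpy as np

# Synthetic noisy cubic on [-1, 1]; placeholder data, not the lab's file.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = x**3 - x + rng.normal(scale=0.1, size=x.size)

train_mse = []
for degree in range(1, 7):
    coeffs = np.polyfit(x, y, deg=degree)      # least-squares polynomial fit
    residuals = y - np.polyval(coeffs, x)
    train_mse.append(np.mean(residuals**2))

print(np.round(train_mse, 4))

# Each larger degree contains the smaller ones as special cases,
# so the training MSE cannot increase (up to round-off).
print(np.all(np.diff(train_mse) <= 1e-12))
```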