# GTI771 - Apprentissage machine avancé

### Created: Thiago M. Paixão <br> Created/Revised: Alessandro L. Koerich <br> Ver 1.0 <br> December 2020

### NB0 - Generalization with Linear/Polynomial Regression

This notebook addresses the regression task, which is also a supervised learning task, using a [Bayesian approach](https://en.wikipedia.org/wiki/Bayesian_linear_regression). The focus is on showing how challenging it is to generalize a model from a set of data points.

The notebook is divided into four parts:

* Setup
* Data generation
* Regression
    * Linear regression
    * $n$-degree polynomial regression
* MSE analysis

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Data generation

We need to generate some data to play with, so we first choose a base function (the true function) from which the training and test data will be derived. The function has the form $r = f(x)$, where $r$ is the output (label) and $x$ is the input (feature). Examples of such functions:

* $r = x + 2$
* $r = x^2 + 2x + 4$
* $r = \sin(2\pi x)$
* ...

In our demonstration, we chose $r = \cos(6\pi x)$.

In [None]:
f_true = lambda X: np.cos(6 * np.pi * X)
# f_true = lambda X: 15*(X-0.5)*(X-0.5) - 1.2

Now, we generate training and test data based on the above function. We assume that features ($x$) are real values sampled from the interval $[0, 1]$ and that the corresponding label is given by the true function plus some random noise, i.e., $r_i = f(x_i) + \delta_i$.

In [None]:
# seed the experiment
np.random.seed(0)

n_samples_train = n_samples_test = 60

# random data points
X_train = np.sort(np.random.rand(n_samples_train))
X_test  = np.sort(np.random.rand(n_samples_test))

# corresponding labels
delta_train = np.random.randn(n_samples_train) * 0.2
delta_test = np.random.randn(n_samples_test) * 0.2

r_train = f_true(X_train) + delta_train
r_test = f_true(X_test) + delta_test
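
Because each label is the true function plus Gaussian noise with standard deviation $0.2$, even the true function itself has an expected MSE of about $0.2^2 = 0.04$ against the noisy labels, so no model should be expected to do much better than that on unseen data. The sketch below estimates this noise floor empirically. It regenerates its own data independently of the cell above, and the names `rng`, `x`, `noise` and `noise_floor` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Illustrative re-creation of the data-generating process (independent of the cell above)
rng = np.random.default_rng(0)
f_true = lambda X: np.cos(6 * np.pi * X)

n = 10_000                        # many samples, to get a stable estimate
x = rng.random(n)                 # features sampled uniformly from [0, 1)
noise = rng.normal(0.0, 0.2, n)   # delta_i with standard deviation 0.2
r = f_true(x) + noise             # noisy labels r_i = f(x_i) + delta_i

# MSE of the true function against the noisy labels approximates the noise variance
noise_floor = np.mean((r - f_true(x)) ** 2)
print(f"empirical noise floor: {noise_floor:.4f} (theoretical: {0.2 ** 2:.4f})")
```

A test MSE far above this value suggests underfitting or overfitting; a training MSE far below it suggests the model is fitting the noise.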

Let's plot the data points and also the true function, i.e., the function used to generate the data points:

In [None]:
# chart setup
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(16, 4))

X_dummy = np.linspace(0, 1., 100)

ax1.plot(X_dummy, f_true(X_dummy), label='True function', color='gray')

ax1.scatter(X_train, r_train, edgecolor='b', s=20, label='Training samples')
ax1.scatter(X_test, r_test, edgecolor='y', s=40, label='Test samples')
ax1.set_ylabel('r')
ax1.set_xlabel('x')
ax1.set_title('Data points (with the true function)')
ax1.set_xlim((0, 1))
ax1.set_ylim((-2, 2))

ax2.scatter(X_train, r_train, edgecolor='b', s=20, label='Training samples')
ax2.scatter(X_test, r_test, edgecolor='y', s=40, label='Test samples')
ax2.set_xlabel('x')
ax2.set_title('Data points (without the true function)')
ax2.set_xlim((0, 1))
ax2.set_ylim((-1.5, 1.5))

plt.show()

## Regression

Some error metrics can be used to evaluate the regression quality, such as

* Mean Square Error ([MSE](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error))
* Mean Absolute Error ([MAE](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html))

We define both in the next cell. Although the examples use only MSE, you can replace it with MAE as an exercise and analyze how the behaviour changes.

In [None]:
# Compute the MSE and MAE metric for Regression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, make_scorer

MSE = lambda y_true, y_pred: mean_squared_error(y_true, y_pred)
MAE = lambda y_true, y_pred: mean_absolute_error(y_true, y_pred)
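
For reference, the two metrics are $\mathrm{MSE} = \frac{1}{N}\sum_i (r_i - \hat{r}_i)^2$ and $\mathrm{MAE} = \frac{1}{N}\sum_i |r_i - \hat{r}_i|$. The following minimal sketch computes them with plain NumPy; the helper names `mse` and `mae` and the toy arrays are hypothetical and only serve to show that MSE penalizes a single large error more heavily than MAE does.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

y_true = [1, 2, 3, 4]
one_large_error = [1, 2, 3, 8]     # a single residual of size 4
four_small_errors = [2, 3, 4, 5]   # four residuals of size 1

# Both predictions have the same MAE, but very different MSE
print(mae(y_true, one_large_error), mse(y_true, one_large_error))      # 1.0 4.0
print(mae(y_true, four_small_errors), mse(y_true, four_small_errors))  # 1.0 1.0
```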

### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

fig, ax = plt.subplots(figsize=(12, 6))
plt.setp(ax, xticks=(), yticks=()) # disable ticks

# linear regression
degrees = 1
linear_regression = LinearRegression()
linear_regression.fit(X_train[:, np.newaxis], r_train)

# evaluate the model on the test and training samples
Y_pred_test = linear_regression.predict(X_test[:, np.newaxis])
Y_pred_train = linear_regression.predict(X_train[:, np.newaxis])

ax.plot(X_test, linear_regression.predict(X_test[:, np.newaxis]), label='Model', color='g')
ax.plot(X_train, f_true(X_train), label='True function', color='gray')

ax.scatter(X_train, r_train, edgecolor='b', s=20, label='Train samples')
ax.scatter(X_test, r_test, edgecolor='y', s=40, label='Test samples')

ax.set_xlabel('x')
ax.set_ylabel('r')
ax.set_xlim((0, 1))
ax.set_ylim((-2, 2))
ax.legend(loc='best')
ax.set_title("Degree {}\nMSE on test set = {:.5f}\nMSE on training set = {:.5f}".format(degrees, MSE(r_test,Y_pred_test), MSE(r_train,Y_pred_train)))
plt.show()
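
`LinearRegression` fits the line by ordinary least squares, i.e., it chooses the intercept and slope that minimize the sum of squared residuals on the training data. As a minimal sketch of the same idea in plain NumPy, the block below solves that least-squares problem directly. It generates its own data, and the names `A`, `w`, `intercept` and `slope` are assumptions introduced for illustration.

```python
import numpy as np

# Illustrative data drawn from the same kind of process as in the notebook
rng = np.random.default_rng(0)
x = np.sort(rng.random(60))
r = np.cos(6 * np.pi * x) + rng.normal(0.0, 0.2, 60)

# Design matrix: a column of ones (intercept) and a column of x (slope)
A = np.column_stack([np.ones_like(x), x])

# Solve min_w ||A w - r||^2 for w = (intercept, slope)
w, *_ = np.linalg.lstsq(A, r, rcond=None)
intercept, slope = w

r_hat = A @ w                         # predictions of the fitted line
train_mse = np.mean((r - r_hat) ** 2)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, train MSE = {train_mse:.3f}")
```

A straight line cannot follow the three oscillations of $\cos(6\pi x)$ on $[0, 1]$, so a high error on both training and test sets is expected here (underfitting).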

### $n$-degree polynomial regression

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score

def plot_regression(degree):

    fig, ax = plt.subplots(figsize=(12, 6))
    plt.setp(ax, xticks=(), yticks=()) # disable ticks

    ax.scatter(X_train, r_train, edgecolor='b', s=20, label='Train samples')
    ax.scatter(X_test, r_test, edgecolor='y', s=40, label='Test samples')

    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([
        ('polynomial_features', polynomial_features),
        ('linear_regression', linear_regression)
    ])
    pipeline.fit(X_train[:, np.newaxis], r_train)

    # evaluate the model on the test and training samples
    Y_pred_test = pipeline.predict(X_test[:, np.newaxis])
    Y_pred_train = pipeline.predict(X_train[:, np.newaxis])

    ax.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label='Model (degree {})'.format(degree), color='g')
    ax.plot(X_train, f_true(X_train), label='True function', color='gray')
    ax.set_xlabel('x')
    ax.set_ylabel('r')
    ax.set_xlim((0, 1))
    ax.set_ylim((-1.5, 1.5))
    ax.legend(loc='best', ncol=2)
    ax.set_title('Degree {}\nMSE on test set = {:.5f}\nMSE on training set = {:.5f}'.format(degree, MSE(r_test, Y_pred_test), MSE(r_train, Y_pred_train)))
    plt.show()
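
The pipeline above turns polynomial regression into ordinary linear regression: with a single input feature, `PolynomialFeatures(degree=d, include_bias=False)` maps each $x$ to the vector $(x, x^2, \dots, x^d)$, and the linear model then learns one weight per power. The sketch below reproduces that expansion with NumPy only; `expand_polynomial` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import numpy as np

def expand_polynomial(x, degree):
    """Hypothetical helper: map a 1-D array x to the columns [x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** d for d in range(1, degree + 1)])

x = np.array([0.0, 0.5, 1.0, 2.0])
Phi = expand_polynomial(x, degree=3)

# One row per sample, one column per power of x
print(Phi)
# [[0.    0.    0.   ]
#  [0.5   0.25  0.125]
#  [1.    1.    1.   ]
#  [2.    4.    8.   ]]
```

The constant term is left out of the expansion (`include_bias=False`) because `LinearRegression` fits its own intercept.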

In [None]:
degrees = [1, 2, 3, 5, 9, 30]
for degree in degrees:
    plot_regression(degree)

## MSE analysis

In [None]:
def plot_error(degrees):
    fig, ax = plt.subplots(figsize=(16, 4))
    mse_train = []
    mse_test = []
    for degree in degrees:
        polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
        linear_regression = LinearRegression()
        pipeline = Pipeline([
            ('polynomial_features', polynomial_features),
            ('linear_regression', linear_regression)
        ])
        pipeline.fit(X_train[:, np.newaxis], r_train)

        # evaluate the model on the test and training samples
        Y_pred_test = pipeline.predict(X_test[:, np.newaxis])
        Y_pred_train = pipeline.predict(X_train[:, np.newaxis])
        
        mse_train.append(MSE(r_train, Y_pred_train))
        mse_test.append(MSE(r_test, Y_pred_test))

    ax.plot(degrees, mse_train, label='MSE (Train)')
    ax.plot(degrees, mse_test, label='MSE (Test)')
    ax.set_xlabel('degree')
    ax.set_xticks(degrees)
    ax.legend(loc='best')
    ax.set_title('MSE vs. polynomial degree')
    plt.show()

### Degree $\in [1, 15]$

In [None]:
degrees = list(range(1, 16))
plot_error(degrees)

### Degree $\in [16, 30]$

In [None]:
degrees = list(range(16, 31))
plot_error(degrees)
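
In an analysis of this kind, the training MSE cannot increase as the degree grows (barring numerical issues), because each higher-degree model contains the lower-degree ones, whereas the error on unseen data typically rises again once the model starts fitting noise. One common way to choose the degree without touching the test set is to hold out part of the training data for validation. The block below is a minimal sketch of that idea under stated assumptions: it uses `numpy.polynomial.Polynomial.fit` instead of the scikit-learn pipeline, generates its own data, and the 40/20 split and the names `fit_idx`, `val_idx` and `best_degree` are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Illustrative data drawn from the same kind of process as in the notebook
rng = np.random.default_rng(0)
f_true = lambda X: np.cos(6 * np.pi * X)
x = rng.random(60)
r = f_true(x) + rng.normal(0.0, 0.2, 60)

# Hold out 20 of the 60 samples for validation (assumed split, for illustration)
idx = rng.permutation(60)
fit_idx, val_idx = idx[:40], idx[40:]

degrees = list(range(1, 13))
fit_mse, val_mse = [], []
for degree in degrees:
    # Least-squares polynomial fit on the fitting subset only
    model = Polynomial.fit(x[fit_idx], r[fit_idx], deg=degree)
    fit_mse.append(np.mean((r[fit_idx] - model(x[fit_idx])) ** 2))
    val_mse.append(np.mean((r[val_idx] - model(x[val_idx])) ** 2))

# Pick the degree with the lowest validation error
best_degree = degrees[int(np.argmin(val_mse))]
print("degree with the lowest validation MSE:", best_degree)
```

The test set then stays untouched until the final evaluation of the chosen degree; k-fold cross-validation (for example with the already imported `cross_val_score`) is a more data-efficient variant of the same idea.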