## 2.72 Machine Learning - Reducible Error

We are going to work with a very small file of training data.

In [1]:
import numpy as np
import pandas as pd

df_train = pd.read_csv('data/ml-data-train.csv')
df_train

Unnamed: 0,x,y
0,18.836133,11.269769
1,12.64668,8.734799
2,9.747432,8.173146
3,2.334745,5.424436
4,0.409672,2.339696
5,6.327346,7.787972
6,9.708542,10.423231
7,14.599289,10.390283
8,3.301732,7.423751
9,3.158246,6.116124


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Create a scatterplot of the data.
Does the relationship between x and y look linear or non-linear?

In [2]:
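from bokeh.plotting import figure, output_notebook, show

output_notebook(hide_banner=True)
# bokeh.charts has been removed from Bokeh, so the scatterplot is built with bokeh.plotting
p = figure(x_axis_label='x', y_axis_label='y')
p.scatter(x=df_train.x, y=df_train.y, size=10)
show(p)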


## Fitting a linear regression model

We need a few imports from sklearn. We are also going to create a scikit-learn pipeline so that we can apply preprocessing to our data before passing it to the linear regression model. This will allow us to create polynomial functions of x of arbitrary order later on.

In [3]:
# Show code to fit model of degree 1
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

Now we define the model. Initially we use degree 1 polynomial features; this is redundant for a straight-line fit but will be helpful in the exercises below.

In [4]:
model = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
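# A pandas Series has no reshape method in current pandas; reshape the underlying array instead
model.fit(df_train.x.values.reshape(-1, 1), df_train.y)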

Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=1, include_bias=True, interaction_only=False)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])
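The step names shown in the pipeline output can be used to reach into the fitted model. The following is a minimal sketch, assuming the ten training rows from the table at the top of this section are typed in by hand (rounded to six decimals) rather than read from the CSV; the names `pipe` and `linreg` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Training values copied from the table above (rounded to six decimals).
x = np.array([18.836133, 12.64668, 9.747432, 2.334745, 0.409672,
              6.327346, 9.708542, 14.599289, 3.301732, 3.158246])
y = np.array([11.269769, 8.734799, 8.173146, 5.424436, 2.339696,
              7.787972, 10.423231, 10.390283, 7.423751, 6.116124])

pipe = make_pipeline(PolynomialFeatures(degree=1), LinearRegression())
pipe.fit(x.reshape(-1, 1), y)

# make_pipeline names each step after its lowercased class name.
linreg = pipe.named_steps['linearregression']
print('intercept    =', linreg.intercept_)
# One coefficient per generated feature: the constant column first, then x.
print('coefficients =', linreg.coef_)

# Prediction for the first training point.
first_prediction = pipe.predict(x[:1].reshape(-1, 1))[0]
print('first prediction =', first_prediction)
```

Because the feature matrix includes a constant column and `LinearRegression` also fits its own intercept, the fitted offset is reported in `intercept_` and the slope is the last entry of `coef_`.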

Now that the model is fit to the data, we can predict. Let's predict the y values for the original x:

In [5]:
y_pred = model.predict(df_train.x.values.reshape(-1, 1))
# zip returns an iterator in Python 3, so wrap it in list() to display the pairs
list(zip(df_train.y, y_pred))

[(11.2697687831, 12.048324408706502),
 (8.7347992045500007, 9.6023431649136821),
 (8.173146226970001, 8.4566026915005992),
 (5.4244362102299997, 5.5272172546817302),
 (2.3396962877000003, 4.7664565027105876),
 (7.7879716006100015, 7.10503474058621),
 (10.4232313916, 8.4412338597456458),
 (10.390283031000001, 10.373985744713735),
 (7.42375101859, 5.9093563701675214),
 (6.11612369783, 5.8526527144537903)]

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Write a function to compute the MSE based on y and y_pred.  Remember that the mean squared error is the average squared difference between y and y_pred.  What are the units of the MSE?  Do you think this is a good model?

In [6]:
def mse(a,b):
    return np.sum(np.power(np.subtract(a,b),2)) / len(a)

print('mse = {0:6.2f}'.format(mse(df_train.y,y_pred)))

mse =   1.41
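As a cross-check of the hand-written function, the sketch below recomputes the MSE with `np.mean` and compares it against `sklearn.metrics.mean_squared_error`. The arrays are the `y` and `y_pred` values printed earlier, typed in by hand and rounded to six decimals, so the result should only be expected to agree with the notebook output to roughly two decimal places; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Observed y and predicted y, copied from the output above (rounded).
y_true = np.array([11.269769, 8.734799, 8.173146, 5.424436, 2.339696,
                   7.787972, 10.423231, 10.390283, 7.423751, 6.116124])
y_hat = np.array([12.048324, 9.602343, 8.456603, 5.527217, 4.766457,
                  7.105035, 8.441234, 10.373986, 5.909356, 5.852653])

# Mean of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# The same quantity computed by scikit-learn.
mse_sklearn = mean_squared_error(y_true, y_hat)

print('manual  mse = {0:6.2f}'.format(mse_manual))
print('sklearn mse = {0:6.2f}'.format(mse_sklearn))
```

Both values should round to the same figure as the notebook's own `mse` function.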


## Increasing the flexibility of the model

By increasing the degree of the polynomial feature generator we can get increasingly flexible models.

For example, here are the degree 4 features generated by the `PolynomialFeatures` transformer:

In [7]:
data = np.array([2,3]).reshape(-1,1)
print(data)
model = PolynomialFeatures(degree=4)
model.fit(data)
model.transform(data)

[[2]
 [3]]


array([[  1.,   2.,   4.,   8.,  16.],
       [  1.,   3.,   9.,  27.,  81.]])

By using the `PolynomialFeatures` transformer, the linear model is given non-linear transformations of the original features.  The resulting model is still linear in the parameters, but it has non-linear features available.
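To illustrate what "linear in the parameters" means in practice, here is a small sketch on made-up data (not the notebook's data set): the target is generated exactly from the quadratic 1 + 2x + 3x², and an ordinary linear regression on the degree 2 polynomial features should recover those three numbers. The names `pipe` and `linreg` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data from a known quadratic (an assumption for illustration).
x = np.linspace(-2, 2, 9)
y = 1 + 2 * x + 3 * x ** 2

pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
pipe.fit(x.reshape(-1, 1), y)

linreg = pipe.named_steps['linearregression']
# coef_ holds one weight per feature column: [constant, x, x^2].
# The constant term of the quadratic is reported in intercept_.
print('intercept    =', linreg.intercept_)
print('coefficients =', linreg.coef_)
```

The regression itself only ever solves for weights on fixed feature columns; the curvature comes entirely from the x² column that the transformer adds.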

<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Do you think that the non-linear transformations will increase or decrease the ability of the model to fit the training data (i.e. reduce the MSE)?  Write a function to fit a degree 'n' polynomial regression model and output the MSE.  Test it by computing the MSE for all degrees up to 10.

In [8]:
def fitPolynomialRegression(x, y, degree=1):    
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    return mse(y, model.predict(x))

for degree in range(0,11):
    print('Degree = {0:2d} MSE = {1:6.2f}'
          .format(degree, fitPolynomialRegression(df_train.x.values.reshape(-1,1), df_train.y, degree)))


Degree =  0 MSE =   6.51
Degree =  1 MSE =   1.41
Degree =  2 MSE =   0.94
Degree =  3 MSE =   0.51
Degree =  4 MSE =   0.46
Degree =  5 MSE =   0.46
Degree =  6 MSE =   0.40
Degree =  7 MSE =   0.32
Degree =  8 MSE =   0.32
Degree =  9 MSE =   0.00
Degree = 10 MSE =   0.00


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Is this what you expected?  Which is the best model?  Which one would you use to predict new data?
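One observation that may help when thinking about the degree 9 result: a polynomial of degree n-1 has n coefficients, so it can pass exactly through n points with distinct x values. The sketch below shows this on four made-up points (an assumption for illustration, not the notebook's data), where a degree 3 fit leaves essentially no training error while a degree 2 fit does. The helper `train_mse` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Four arbitrary points with distinct x values.
x = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([1.0, -2.0, 0.5, 4.0])

def train_mse(degree):
    """Fit a polynomial regression of the given degree and return its training MSE."""
    pipe = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    pipe.fit(x, y)
    return np.mean((y - pipe.predict(x)) ** 2)

mse_degree_2 = train_mse(2)
mse_degree_3 = train_mse(3)  # four coefficients for four points

print('degree 2 train MSE = {0:.6f}'.format(mse_degree_2))
print('degree 3 train MSE = {0:.6f}'.format(mse_degree_3))
```

A training error of zero obtained this way says the curve reproduces the training points, not that it describes the process that generated them.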

## Test error vs train error

We have another data set from the same data generating process.  This data was not used to fit the model but can be used to see how well the models predict future unknown data points.  We call this data the 'test' set.

In [9]:
df_test = pd.read_csv('data/ml-data-test.csv')
df_test

Unnamed: 0,x,y
0,17.878083,10.699816
1,0.255616,1.543354
2,14.498679,8.531225
3,17.015902,9.661305
4,9.264371,7.449388
5,0.389505,2.978019
6,2.174721,7.405458
7,8.404072,9.296855
8,17.143085,10.059268
9,3.605394,7.557129


In [10]:
from bokeh.plotting import figure

p = figure()
p.circle(x=df_train.x, y= df_train.y, color='Orange', size=10)
p.circle(x=df_test.x, y= df_test.y, color='Red', size=10)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)


<img src='files/resources/ic_assignment_black_24dp_2x.png' align='left'>Write a function to fit a degree 'n' polynomial regression model to the training data and return the MSE on both the training and the test data. What do you notice about the test error?  Which model is the best?

In [13]:
def fitPolynomialRegression(x, y, x_test, y_test, degree=1):    
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    return [ mse(y, model.predict(x)), mse(y_test, model.predict(x_test)) ]

for degree in range(1,11):
    fit = fitPolynomialRegression(df_train.x.values.reshape(-1,1), df_train.y, df_test.x.values.reshape(-1,1), df_test.y, degree)
    print('Degree = {0:2d} Train MSE = {1:6.2f} Test MSE = {2:6.2f}'
          .format(degree, fit[0], fit[1]))

Degree =  1 Train MSE =   1.41 Test MSE =   3.05
Degree =  2 Train MSE =   0.94 Test MSE =   2.07
Degree =  3 Train MSE =   0.51 Test MSE =   1.11
Degree =  4 Train MSE =   0.46 Test MSE =   1.18
Degree =  5 Train MSE =   0.46 Test MSE =   1.23
Degree =  6 Train MSE =   0.40 Test MSE =   4.81
Degree =  7 Train MSE =   0.32 Test MSE =  30.61
Degree =  8 Train MSE =   0.32 Test MSE = 2192.29
Degree =  9 Train MSE =   0.00 Test MSE = 5224730.15
Degree = 10 Train MSE =   0.00 Test MSE = 23709708.46
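The printed results can be compared programmatically. The sketch below simply copies the train and test MSE values from the output above into dictionaries (the names `train_mse` and `test_mse` are illustrative) and reports which degree minimises each one.

```python
# MSE values copied from the printed results above, keyed by polynomial degree.
train_mse = {1: 1.41, 2: 0.94, 3: 0.51, 4: 0.46, 5: 0.46,
             6: 0.40, 7: 0.32, 8: 0.32, 9: 0.00, 10: 0.00}
test_mse = {1: 3.05, 2: 2.07, 3: 1.11, 4: 1.18, 5: 1.23,
            6: 4.81, 7: 30.61, 8: 2192.29, 9: 5224730.15, 10: 23709708.46}

# min() over a dict iterates its keys; ties go to the first (lowest) degree.
best_train_degree = min(train_mse, key=train_mse.get)
best_test_degree = min(test_mse, key=test_mse.get)

print('Lowest train MSE at degree', best_train_degree)
print('Lowest test MSE at degree', best_test_degree)
```

On these numbers the training error keeps falling as the degree grows, while the test error is lowest at degree 3 and then rises sharply.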
