## 2.73 Machine Learning - Overfitting Classification Example

Here we are going to assume the classification boundary between two classes is defined by the polynomial:

$y_1 = ax_1^4+bx_1^3+cx_1^2+dx_1+e$

where $a = -0.0086948$, $b = 0.18616$, $c = -0.90713$, $d = -0.46162$ and $e = 7.8933$,

over the region $x \in (0, 15)$, $y \in (0, 15)$.

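Each point is labelled by which side of the boundary it falls on. We then add normally distributed measurement error (zero mean, standard deviation 1 in the code below) to both x and y.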
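To get a feel for how much the measurement error blurs the boundary, the sketch below labels points from their noise-free positions and then counts how many end up on the other side of the curve once noise is added. It is only an illustration: the seed, the sample size and the names `label`, `apparent` and `crossed` are assumptions made here, not part of the notebook.

```python
import numpy as np

def f(x):
    # Fourth-order boundary polynomial with the coefficients listed above
    return -0.0086948*x**4 + 0.18616*x**3 - 0.90713*x**2 - 0.46162*x + 7.8933

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
n = 1000

# Noise-free positions and their true class labels
x = rng.uniform(0, 15, n)
y = rng.uniform(0, 15, n)
label = np.sign(y - f(x))        # +1 above the boundary, -1 below

# Observed positions after N(0, 1) measurement error on both coordinates
x_obs = x + rng.normal(0, 1, n)
y_obs = y + rng.normal(0, 1, n)

# Side of the boundary each point *appears* to be on after the noise
apparent = np.sign(y_obs - f(x_obs))

# Fraction of points whose observed position contradicts their label
crossed = np.mean(apparent != label)
print("fraction appearing on the wrong side:", crossed)
```

Points counted in `crossed` look mislabelled to any classifier that only sees the observed coordinates, which is why a model that fits every training point exactly is fitting noise.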

In [1]:
import numpy as np
import pandas as pd
# train_test_split moved from sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import Range1d

def f(x):
    # Fourth order polynomial
    a = -0.0086948
    b =  0.18616
    c = -0.90713
    d = -0.46162
    e =  7.8933
    return a*x**4 + b*x**3 + c*x**2 + d*x + e

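def sample(n=200):
    # Draw n points uniformly in the square, label each by its side of the
    # boundary, then perturb the coordinates with N(0, 1) measurement noise.
    np.random.seed(seed=1)
    x = np.random.uniform(0, 15, n)
    y = np.random.uniform(0, 15, n)
    y1 = f(x)
    l = np.sign(y - y1)  # +1 above the boundary, -1 below
    x = x + np.random.normal(loc=0, scale=1, size=n)
    y = y + np.random.normal(loc=0, scale=1, size=n)
    # Build from a dict of columns rather than a zip iterator
    return pd.DataFrame({'x': x, 'y': y, 'l': l}, columns=['x', 'y', 'l'])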
    
xb = np.arange(0,15,0.1)
yb = f(xb)

df_grid = pd.DataFrame(data = [ (x, y) for x in np.arange(0,15,0.5) for y in np.arange(0,15,0.5)], columns = ['x','y'])

df = sample(1000)
# train_test_split returns the train split first, then the test split
df_train, df_test = train_test_split(df, test_size=0.5, random_state=72)


output_notebook(hide_banner=True)

p = figure(plot_width=400, plot_height=400)

def plotcolor(l):
    if l == -1:
        return 'Red'
    else:
        return 'Blue'

p.circle(x=df_train.x, y=df_train.y, size=8, color=[ plotcolor(l) for l in df_train.l], alpha = 0.5)
p.line(x=xb, y=yb, line_width=2, color='Grey')
p.y_range = Range1d(0,15)
p.x_range = Range1d(0,15)
show(p)

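With the data split into train and test sets, we fit a KNN classifier for a range of neighbourhood sizes k (1, 6, 11, ..., 96) and predict on the train set, the test set and a regular grid of points. The grid predictions show the shape of each fitted decision boundary; the test error is examined afterwards.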

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

clfs = [KNeighborsClassifier(n_neighbors=i).fit(X = df_train[['x','y']], y = df_train.l) for i in range(1,101,5)]
pred_test = [clf.predict(df_test[['x','y']]) for clf in clfs]
pred_train = [clf.predict(df_train[['x','y']]) for clf in clfs]
fit = [clf.predict(df_grid[['x','y']]) for clf in clfs]

In [3]:
from bokeh.layouts import gridplot

def plt(x,y,l, xb, yb):
    p = figure(plot_width=300, plot_height=300)
    p.circle(x=x, y=y, size=6, color=[ plotcolor(li) for li in l], alpha = 0.5)
    p.line(x=xb, y=yb, line_width=2, color='Grey')
    p.y_range = Range1d(0,15)
    p.x_range = Range1d(0,15)
    return(p)

# fit[i] corresponds to n_neighbors = 1 + 5*i
p1 = plt(df_grid.x, df_grid.y, fit[0], xb, yb)   # k = 1
p2 = plt(df_grid.x, df_grid.y, fit[2], xb, yb)   # k = 11
p3 = plt(df_grid.x, df_grid.y, fit[4], xb, yb)   # k = 21
p4 = plt(df_grid.x, df_grid.y, fit[15], xb, yb)  # k = 76
p5 = plt(df_grid.x, df_grid.y, fit[17], xb, yb)  # k = 86
p6 = plt(df_grid.x, df_grid.y, fit[19], xb, yb)  # k = 96

p = gridplot([[p1, p2, p3] ,[p4, p5, p6]])
              
show(p)

We can look at the test and train accuracy as a function of number of neighbours:

In [29]:
from sklearn.metrics import accuracy_score

# One accuracy value per classifier, in the same order as the k values
ks = list(range(1, 101, 5))
acc_train = [accuracy_score(df_train.l, pred) for pred in pred_train]
acc_test = [accuracy_score(df_test.l, pred) for pred in pred_test]

p = figure(title='Train vs Test Accuracy')
p.line(x=ks, y=acc_train, color='Blue')  # training accuracy
p.line(x=ks, y=acc_test, color='Red')    # test accuracy
p.xaxis.axis_label='k - neighbours'
p.yaxis.axis_label='Classification accuracy'
show(p)

We can see that the training accuracy (blue line) is 100% when the neighbourhood size is set to 1. This is just memorization of the training data, since each training point is its own nearest neighbour. However, the test accuracy (red line) is very low at this point. As we increase the number of neighbours, the training performance drops but the test performance increases. Eventually we increase the number of neighbours too much and the test performance also starts to drop.
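The memorization effect at k = 1 can be reproduced without scikit-learn. The following is a minimal sketch, not the notebook's method: it swaps in a brute-force nearest-neighbour vote written in NumPy, uses its own seed, and restricts k to odd values so that votes cannot tie. The helper names `make_data` and `knn_predict` are hypothetical.

```python
import numpy as np

def f(x):
    # Same fourth-order boundary polynomial as in the section
    return -0.0086948*x**4 + 0.18616*x**3 - 0.90713*x**2 - 0.46162*x + 7.8933

def make_data(n, rng):
    # Label from the noise-free point, then add N(0, 1) measurement noise
    x = rng.uniform(0, 15, n)
    y = rng.uniform(0, 15, n)
    labels = np.where(y > f(x), 1, -1)
    X = np.column_stack([x, y]) + rng.normal(0, 1, size=(n, 2))
    return X, labels

def knn_predict(X_train, l_train, X_query, k):
    # Brute-force k-NN: squared Euclidean distances, then a majority vote
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = l_train[nearest].sum(axis=1)   # labels are +1/-1, so the sign is the majority
    return np.where(votes > 0, 1, -1)

rng = np.random.default_rng(1)             # arbitrary seed for this sketch
X_train, l_train = make_data(500, rng)
X_test, l_test = make_data(500, rng)

ks = [1, 5, 15, 45, 95]                    # odd k only (assumption, avoids tied votes)
acc_train = [np.mean(knn_predict(X_train, l_train, X_train, k) == l_train) for k in ks]
acc_test = [np.mean(knn_predict(X_train, l_train, X_test, k) == l_test) for k in ks]

best_k = ks[int(np.argmax(acc_test))]      # k with the highest test accuracy
for k, a_tr, a_te in zip(ks, acc_train, acc_test):
    print(k, round(float(a_tr), 3), round(float(a_te), 3))
print("best k on the test set:", best_k)
```

With k = 1 every training point is at distance zero from itself, so the training accuracy is exactly 1.0 regardless of how noisy the labels are. Picking `best_k` from the test set is fine for illustration, but a separate validation split would be needed for an unbiased estimate.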

For a KNN classifier, a low number of neighbours is associated with high flexibility and high variance of the model, but also low model bias. As the number of neighbours increases, the flexibility and variance drop and generalisation performance improves. Eventually the gain from reducing the variance is outweighed by the growth in squared bias, and the overall test error starts to rise.
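The trade-off can be written down exactly for the regression analogue with squared-error loss. For the 0-1 classification loss used in this example the decomposition is not additive in the same way, so the formulas below are only a guide. They assume $y = g(x) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma^2$, a k-NN average $\hat{g}_k$, and training inputs held fixed; the symbols $g$, $\hat{g}_k$, $x_0$ and $x_{(\ell)}$ (the $\ell$-th nearest training input to $x_0$) are introduced here and do not appear in the notebook.

$$E\left[(y_0 - \hat{g}_k(x_0))^2\right] = \sigma^2 + \left(g(x_0) - E[\hat{g}_k(x_0)]\right)^2 + \mathrm{Var}\left(\hat{g}_k(x_0)\right)$$

$$E\left[(y_0 - \hat{g}_k(x_0))^2\right] = \sigma^2 + \left(g(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} g(x_{(\ell)})\right)^2 + \frac{\sigma^2}{k}$$

The variance term $\sigma^2/k$ shrinks as k grows, while the squared-bias term typically grows because the average is taken over neighbours that lie further from $x_0$.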