# 4.6.5 K-Nearest Neighbors

Load modules and data

In [2]:
from scipy import stats
import pandas as pd
import seaborn as sns
import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline
plt.style.use('seaborn-white')
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import neighbors

Smarket = pd.read_csv('Data/Smarket.csv', usecols = range(1,10),parse_dates=True)

We will use K-Nearest Neighbors (KNN) to predict Direction using percentage returns from the previous two days (Lag1 and Lag2). The Python library scikit-learn provides a KNN classifier in its neighbors module: KNeighborsClassifier().
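To make the idea concrete, here is a minimal sketch of the KNN rule itself: find the K training points closest to a query and take a majority vote over their labels. This is an illustration only, not scikit-learn's implementation; the helper `knn_predict` and the toy data are hypothetical and are not taken from Smarket.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    def distance(p, q):
        # Euclidean distance between two points of equal dimension.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Indices of training points, ordered from nearest to farthest.
    order = sorted(range(len(train_points)),
                   key=lambda i: distance(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: (Lag1, Lag2) pairs with a direction label.
toy_points = [(0.5, 0.4), (0.6, 0.3), (-0.4, -0.5), (-0.6, -0.2), (0.1, -0.1)]
toy_labels = ['Up', 'Up', 'Down', 'Down', 'Up']

pred_k1 = knn_predict(toy_points, toy_labels, (-0.5, -0.4), k=1)
pred_k3 = knn_predict(toy_points, toy_labels, (0.4, 0.2), k=3)
print(pred_k1, pred_k3)
```

With K = 1 the prediction copies the label of the single closest point, which is why small K gives a very flexible fit.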

Starting with K = 1 we get:

In [54]:
train = Smarket.Year < 2005  # boolean mask: True for 2001-2004
x_train = Smarket.loc[train, ['Lag1', 'Lag2']]   # predictors, 2001-2004
y_train = Smarket.loc[train, 'Direction']        # response, 2001-2004
x_test = Smarket.loc[~train, ['Lag1', 'Lag2']]   # predictors, 2005
y_test = Smarket.loc[~train, 'Direction']        # response, 2005

knn = neighbors.KNeighborsClassifier(n_neighbors=1) # K=1
predict = knn.fit(x_train, y_train).predict(x_test)
pd.DataFrame(confusion_matrix(y_test, predict).T,['Down', 'Up'],['Down','Up'])

      Down  Up
Down    43  58
Up      68  83


In [9]:
(43+83.0)/(43+58+68+83)

0.5
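The arithmetic above is the sum of the diagonal of the confusion matrix divided by the total number of observations. A small sketch of that calculation as a reusable function, using the K = 1 counts from the table; the helper name `accuracy` is an assumption made for illustration:

```python
def accuracy(confusion):
    """Fraction of correct predictions: diagonal sum over the grand total."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / float(total)

# K = 1 counts from the table above.
k1_table = [[43, 58],
            [68, 83]]
acc_k1 = accuracy(k1_table)
print(acc_k1)
```

Because only the diagonal and the total are used, the result is the same whether or not the matrix is transposed.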

The confusion_matrix() function produces a confusion matrix, which shows how many observations were correctly or incorrectly classified. In the tables here the matrix is transposed, so rows are predicted classes and columns are true classes. The classification_report() function gives us some summary statistics on the classifier’s performance:

In [10]:
print(classification_report(y_test, predict, digits=3))

             precision    recall  f1-score   support

       Down      0.426     0.387     0.406       111
         Up      0.550     0.589     0.568       141

avg / total      0.495     0.500     0.497       252
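The per-class figures in this report can be reproduced from the K = 1 confusion matrix counts. The sketch below assumes the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two); the helper `class_metrics` and the variable names are hypothetical.

```python
def class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its raw counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the K = 1 table (rows = predicted, columns = true).
pred_down_true_down, pred_down_true_up = 43, 58
pred_up_true_down, pred_up_true_up = 68, 83

# For "Down": false positives are predicted Down but truly Up,
# false negatives are predicted Up but truly Down. "Up" mirrors this.
down = class_metrics(tp=pred_down_true_down,
                     fp=pred_down_true_up,
                     fn=pred_up_true_down)
up = class_metrics(tp=pred_up_true_up,
                   fp=pred_up_true_down,
                   fn=pred_down_true_up)

print([round(v, 3) for v in down])
print([round(v, 3) for v in up])
```

The support column is the number of true observations in each class: 43 + 68 = 111 for Down and 58 + 83 = 141 for Up.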



The results using K = 1 are not very good, since only 50% of the observations
are correctly predicted. Of course, it may be that K = 1 results in an
overly flexible fit to the data. Below, we repeat the analysis using K = 3.

In [67]:
knn = neighbors.KNeighborsClassifier(n_neighbors=3) # K=3
predict = knn.fit(x_train, y_train).predict(x_test)
pd.DataFrame(confusion_matrix(y_test, predict).T,['Down', 'Up'],['Down','Up'])

      Down  Up
Down    48  55
Up      63  86


In [12]:
(48+86.0)/(48+55+63+86)

0.5317460317460317

In [31]:
print(classification_report(y_test, predict, digits=3))

             precision    recall  f1-score   support

       Down      0.466     0.432     0.449       111
         Up      0.577     0.610     0.593       141

avg / total      0.528     0.532     0.529       252



The results have improved slightly, but increasing K further provides no
meaningful improvement, as the search over K = 1 to 67 below confirms. It
appears that for this data, QDA provides the best results of the methods
that we have examined so far.

In [105]:
best_accuracy = 0
for k in range(1, 68):  # try K = 1, ..., 67
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    predict = knn.fit(x_train, y_train).predict(x_test)
    cm = confusion_matrix(y_test, predict)
    # Accuracy = correct predictions (diagonal) / all predictions
    accuracy = np.trace(cm) / float(cm.sum())
    if accuracy > best_accuracy:
        best_accuracy = accuracy
print(best_accuracy)

0.539682539683


Over K = 1 to 67, the best test accuracy is about 0.540, only marginally above the 0.532 obtained with K = 3.
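The search loop prints only the best accuracy, not the K that achieved it. A minimal sketch of tracking both, assuming a hypothetical `evaluate(k)` callable that returns the test accuracy for a given K (in this notebook it would wrap the fit, predict and accuracy steps). The scores in `demo_scores` are made-up stand-ins, not Smarket results.

```python
def best_k(evaluate, k_values):
    """Return (k, accuracy) for the best-scoring K; the earliest K wins ties."""
    best_k_value, best_accuracy = None, float('-inf')
    for k in k_values:
        acc = evaluate(k)
        if acc > best_accuracy:  # strict '>' keeps the first K on ties
            best_k_value, best_accuracy = k, acc
    return best_k_value, best_accuracy

# Made-up scores for illustration only; replace with a real evaluation.
demo_scores = {1: 0.2, 2: 0.7, 3: 0.7}
k_star, acc_star = best_k(demo_scores.get, [1, 2, 3])
print(k_star, acc_star)
```

Keeping the smallest K on ties is a design choice for this sketch; preferring the larger K (a smoother fit) would be equally defensible.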