# *k*-NN Model Selection
Model selection for a *k*-NN classifier in `scikit-learn` using grid search (`GridSearchCV`).

### First load the `heart` data

Details on the dataset are available here:
http://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

We have separate train and test datasets. 

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
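# Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions
# use ConfusionMatrixDisplay.from_estimator(estimator, X, y) instead.
from sklearn.metrics import plot_confusion_matrix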
from sklearn.metrics import accuracy_score, f1_score, balanced_accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold

In [2]:
train_data = pd.read_csv('heart-train.csv')
print(train_data.shape)
train_data.head()

(199, 13)


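| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60.0 | 0 | 253 | 0 | 35 | 0 | 279000.0 | 1.7 | 140 | 1 | 0 | 250 | 0 |
| 1 | 40.0 | 1 | 129 | 0 | 35 | 0 | 255000.0 | 0.9 | 137 | 1 | 0 | 209 | 0 |
| 2 | 86.0 | 0 | 582 | 0 | 38 | 0 | 263358.03 | 1.83 | 134 | 0 | 0 | 95 | 1 |
| 3 | 45.0 | 0 | 582 | 0 | 35 | 0 | 385000.0 | 1.0 | 145 | 1 | 0 | 61 | 1 |
| 4 | 72.0 | 0 | 127 | 1 | 50 | 1 | 218000.0 | 1.0 | 134 | 1 | 0 | 33 | 0 |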


In [3]:
y_train = train_data.pop('DEATH_EVENT').values
train_data.pop('time')
X_train_raw = train_data.values
feature_names = train_data.columns
len(y_train), y_train.sum()

(199, 64)

Fit a scaler on the training data.

In [4]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
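
As a minimal sketch of what fitting the scaler on the training data means, the toy example below (hypothetical values, not taken from the heart dataset) learns the per-feature mean and standard deviation from the training rows only and then reuses those statistics on an unseen row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not from the heart dataset).
X_tr = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_te = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
# fit_transform learns mean_ and scale_ from the training rows only.
X_tr_scaled = toy_scaler.fit_transform(X_tr)
# transform reuses the training statistics; nothing is refit on the test row.
X_te_scaled = toy_scaler.transform(X_te)

print(toy_scaler.mean_)                                    # training means
print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))   # ~0 and ~1
print(X_te_scaled)                                         # test row in training units
```

The scaled test row need not have zero mean or unit variance, because it is expressed relative to the training distribution.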

In [5]:
test_data = pd.read_csv('heart-test.csv')
print('Test data shape', test_data.shape)
y_test = test_data.pop('DEATH_EVENT').values
test_data.pop('time')
X_test_raw = test_data.values
feature_names = test_data.columns
X_test = scaler.transform(X_test_raw)
len(y_test), y_test.sum()

Test data shape (100, 13)


(100, 32)

### Preliminary results
No model selection yet: a *k*-NN classifier with default parameters.  
The dataset is imbalanced (only 64 of the 199 training cases are deaths), so we focus on balanced accuracy.
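
To illustrate why plain accuracy can mislead on imbalanced data, here is a small sketch with made-up labels (an assumption for illustration, not the heart data). A classifier that always predicts the majority class scores well on accuracy, but balanced accuracy, which is the mean of the per-class recalls, drops to chance level.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels (hypothetical): 8 survivors (0) and 2 deaths (1).
y_true = np.array([0] * 8 + [1] * 2)
# A degenerate classifier that always predicts the majority class.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
bal = balanced_accuracy_score(y_true, y_majority)
# Recall computed separately for class 0 and class 1.
per_class_recall = recall_score(y_true, y_majority, average=None, zero_division=0)

print("accuracy:          {0:4.2f}".format(acc))
print("balanced accuracy: {0:4.2f}".format(bal))
print("per-class recall: ", per_class_recall)
```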

In [6]:
kNN = KNeighborsClassifier()

In [7]:
kf = KFold(n_splits=10, shuffle = True)
scores = cross_val_score(kNN, X_train, y_train, cv=kf, 
                         scoring='balanced_accuracy')
prelim_train = scores.mean()
print("Initial X-val accuracy training: {0:4.2f}".format(prelim_train))

Initial X-val accuracy training: 0.51


In [8]:
kNN.fit(X_train,y_train)

KNeighborsClassifier()

In [9]:
y_pred = kNN.predict(X_test)

In [10]:
prelim_test = balanced_accuracy_score(y_test, y_pred)
print("Initial hold-out accuracy testing: {0:4.2f}".format(prelim_test))

Initial hold-out accuracy testing: 0.58


The confusion matrix shows that the classifier is strongly biased towards the majority class, even on the training data.

In [None]:
plot_confusion_matrix(kNN, X_train, y_train)  

This bias is even worse on the test data. 

In [None]:
plot_confusion_matrix(kNN, X_test, y_test)  
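
The bias towards the majority class can also be read off numerically. The sketch below uses made-up labels and predictions (assumptions for illustration only) to show how the per-class recall falls out of the rows of the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical): 6 survivors (0), 4 deaths (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 0, 0, 0, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
# Recall per class: correct predictions divided by the row total.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print("recall per class:", recall)
print("balanced accuracy: {0:4.2f}".format(recall.mean()))
```

A low recall in the second row is the numeric counterpart of a plot in which most minority cases land in the majority column.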

## Grid Search

In [None]:
param_grid = {'n_neighbors': [1, 3, 5, 7, 10],
              'metric': ['manhattan', 'euclidean', 'correlation'],
              'weights': ['uniform', 'distance']}
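
As a quick sanity check on the cost of this search, the sketch below counts the candidate combinations with `ParameterGrid`. The grid is copied into a local variable named `grid` (a name chosen for this example), and the fold count of 10 is assumed to match the `cv` value used in the search.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, copied so the snippet is self-contained.
grid = {'n_neighbors': [1, 3, 5, 7, 10],
        'metric': ['manhattan', 'euclidean', 'correlation'],
        'weights': ['uniform', 'distance']}

n_candidates = len(ParameterGrid(grid))   # 5 * 3 * 2 combinations
n_fits = n_candidates * 10                # one fit per candidate per fold (cv=10 assumed)
print(n_candidates, n_fits)
```

With the default `refit=True`, the search performs one additional fit of the best candidate on the full training set.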

In [None]:
knn_gs = GridSearchCV(kNN,param_grid,cv=10, scoring = 'balanced_accuracy',
                      verbose = 1, n_jobs = -1)
knn_gs = knn_gs.fit(X_train,y_train)
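
In this notebook the scaler is fit on all training rows before the grid search, so each validation fold has already contributed to the scaling statistics. One common way to avoid that is to put the scaler and the classifier in a `Pipeline`, so the scaler is refit inside every fold. The sketch below is an illustration on synthetic data from `make_classification` (an assumption, not the heart dataset); the step names `scale` and `knn` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (hypothetical, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=11,
                                     weights=[0.68, 0.32], random_state=0)

# The scaler is now part of the estimator, so it is fit on each training fold only.
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
pipe_grid = {'knn__n_neighbors': [1, 3, 5, 7, 10],
             'knn__weights': ['uniform', 'distance']}

pipe_gs = GridSearchCV(pipe, pipe_grid, cv=10, scoring='balanced_accuracy')
pipe_gs.fit(X_demo, y_demo)

print(pipe_gs.best_params_)
print("Best X-val balanced accuracy: {0:4.2f}".format(pipe_gs.best_score_))
```

With a pipeline, the raw (unscaled) features are passed to `fit` and `predict`, since scaling happens inside the estimator.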

These are the parameters selected by the grid search:

In [None]:
knn_gs.best_params_

In [None]:
knn_best = KNeighborsClassifier(**knn_gs.best_params_)

In [None]:
scores = cross_val_score(knn_best, X_train, y_train, 
                         cv=kf, scoring='balanced_accuracy')
ms_train = scores.mean()
print("X-val accuracy training after tuning: {0:4.2f}".format(ms_train))

In [None]:
y_pred_gs = knn_gs.predict(X_test)
ms_test = balanced_accuracy_score(y_test,y_pred_gs)
print("Hold-out accuracy after tuning: {0:4.2f}".format(ms_test))
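
Note that the fitted search object is used directly for prediction here. With the default `refit=True`, `GridSearchCV` retrains the best configuration on the full training set, so its `predict` should agree with a classifier built manually from `best_params_`. The sketch below checks this on synthetic data (an assumption for illustration; the variable names are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=150, n_features=5, random_state=1)
X_fit, y_fit = X_demo[:100], y_demo[:100]
X_new = X_demo[100:]

gs = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]},
                  cv=5, scoring='balanced_accuracy')
gs.fit(X_fit, y_fit)   # refit=True by default: best model is retrained on X_fit

# Rebuild the winning configuration by hand and fit it on the same data.
manual = KNeighborsClassifier(**gs.best_params_).fit(X_fit, y_fit)

same = np.array_equal(gs.predict(X_new), manual.predict(X_new))
print("Predictions identical:", same)
```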

After tuning, performance on the training data is better than with the default parameters.

In [None]:
plot_confusion_matrix(knn_gs, X_train, y_train)  

It is also better on the test data, but still not great.

In [None]:
plot_confusion_matrix(knn_gs, X_test, y_test)  

In [None]:
plot_df = pd.DataFrame()
plot_df['Train'] = [prelim_train,ms_train]
plot_df['Test'] = [prelim_test,ms_test]
plot_df.index = ['Before','After']

In [None]:
ax = plot_df.plot.bar()
ax.set_ylabel("Balanced Accuracy")
ax.set_title('Model Selection')
ax.set_xticklabels(plot_df.index, rotation=0)
