# Model validation

In [None]:
# import modules here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


In this notebook, we will get acquainted with cross-validation and put into practice the cross-validation strategy described in the slides.
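
As a warm-up, here is a minimal sketch of what k-fold cross-validation does under the hood. It is an illustration, not necessarily the exact strategy from the slides: the data is synthetic (generated with `make_classification`, not the flights dataset), and the names `X_demo`, `y_demo`, `fold_scores` and the choice of `n_neighbors=5` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: this is NOT the flights dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Split the data into 5 folds, keeping the class proportions in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in skf.split(X_demo, y_demo):
    # Fit on 4 folds, score on the held-out fold
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

# The cross-validation score is the average over the held-out folds
mean_cv_score = float(np.mean(fold_scores))
print(fold_scores, mean_cv_score)
```

Each observation is used for validation exactly once, so the averaged score uses all of the training data without ever scoring a model on points it was fitted on.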

## Import data

In [None]:
data = pd.read_csv('data/flights08_clean.csv')
data.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('MajorDelay', axis=1), 
                                                    data['MajorDelay'], 
                                                    test_size=0.3, 
                                                    random_state=1337,
                                                    stratify=data['MajorDelay'])

## Sklearn's GridSearchCV

Sklearn makes this very easy for you. [Read the docs](http://scikit-learn.org/stable/modules/grid_search.html#exhaustive-grid-search) and use `GridSearchCV` to run a 5-fold cross-validation over `n_neighbors` in `np.arange(1, 10)`, that is, the values 1 through 9.

In [None]:
# Create the dictionary of given parameters
n_neighb = np.arange(1, 10)  
parameters = [{'n_neighbors': n_neighb}] 

# Pass the dictionary and other parameters to GridSearchCV to create a GridSearchCV object
gridCV = GridSearchCV(KNeighborsClassifier(), parameters, cv=5, return_train_score=True)
gridCV.fit(X_train, y_train) 


Look up the attributes of the [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) object, and print `cv_results_` and the best parameters.

In [None]:
bestNeighb = gridCV.best_params_['n_neighbors']
print("Best parameters: n_neighbors =", bestNeighb)
pd.DataFrame(gridCV.cv_results_)
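
The full `cv_results_` table is wide, so it can help to keep only a few columns. The sketch below shows one way to do that. It runs on synthetic data rather than the flights dataset, and the names `demo_grid`, `results` and `summary` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: this is NOT the flights dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_grid = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': np.arange(1, 10)},
    cv=5,
    return_train_score=True,  # needed for the mean_train_score column
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate value of n_neighbors
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[['param_n_neighbors', 'mean_train_score',
                   'mean_test_score', 'std_test_score', 'rank_test_score']]

# Best candidates first (rank 1 has the highest mean_test_score)
print(summary.sort_values('rank_test_score'))
```

Here `mean_test_score` is the accuracy averaged over the five held-out folds, and `best_params_` corresponds to a row with `rank_test_score` equal to 1.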


Create a plot of `n_neighbors` vs `mean_test_score` of the 5-fold cross-validation.

In [None]:
plt.plot(n_neighb, gridCV.cv_results_['mean_test_score'], marker='o')
plt.xlabel('n_neighbors')
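
A possible extension of this plot, sketched on synthetic data rather than the flights dataset: show the mean training score next to the mean cross-validation score, with error bars taken from `std_test_score`. The names `k_values`, `demo_grid` and `res` are assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: this is NOT the flights dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

k_values = np.arange(1, 10)
demo_grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': k_values},
                         cv=5, return_train_score=True)
demo_grid.fit(X_demo, y_demo)
res = demo_grid.cv_results_

fig, ax = plt.subplots()
# Training score: how well each model fits the data it was trained on
ax.plot(k_values, res['mean_train_score'], marker='o', label='mean train score')
# Cross-validation score with one standard deviation across the folds
ax.errorbar(k_values, res['mean_test_score'], yerr=res['std_test_score'],
            marker='o', capsize=3, label='mean CV score')
ax.set_xlabel('n_neighbors')
ax.set_ylabel('accuracy')
ax.legend()
```

A large gap between the two curves at small `n_neighbors` is a typical sign of overfitting, and the error bars indicate whether differences between neighboring values of k are larger than the fold-to-fold variation.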


Now that we have chosen a value for k using cross-validation, fit the classifier on the full training data and report the test score.

In [None]:
knn = KNeighborsClassifier(n_neighbors=bestNeighb)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
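
Note that `GridSearchCV` uses `refit=True` by default, so after fitting it already holds a copy of the best model retrained on the full training data (`best_estimator_`). The sketch below, on synthetic data rather than the flights dataset, checks that the manual refit and the built-in one give the same test score. The names `demo_grid`, `manual_knn`, `X_tr` and `X_te` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: this is NOT the flights dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# refit=True is the default: the best model is retrained on all of X_tr
demo_grid = GridSearchCV(KNeighborsClassifier(),
                         {'n_neighbors': np.arange(1, 10)}, cv=5)
demo_grid.fit(X_tr, y_tr)

# Manual refit with the selected k, as in the cell above
best_k = demo_grid.best_params_['n_neighbors']
manual_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
manual_score = manual_knn.score(X_te, y_te)

# Built-in shortcut: score() delegates to best_estimator_
refit_score = demo_grid.score(X_te, y_te)
print(best_k, manual_score, refit_score)
```

Either way, the test set is touched only once, after k has been fixed, so the reported score is not biased by the model selection.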
