In [1]:
from sklearn.datasets import load_iris
iris = load_iris()

In [8]:
X = iris.data
y = iris.target

In [4]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1) 

In [5]:
# train the model, and use it to predict labels for data we already know
model.fit(X, y) 
y_model = model.predict(X) 

In [6]:
# compute the fraction of correctly labeled points
from sklearn.metrics import accuracy_score
accuracy_score(y, y_model)

1.0

This approach contains a fundamental flaw: it trains and evaluates the model on the same data. Furthermore, the nearest-neighbor model is an instance-based estimator that simply stores the training data and predicts labels by comparing new data to these stored points; except in contrived cases, it will get 100% accuracy on its own training data every time!
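To see why the score is trivially perfect, the sketch below asks the fitted model for the nearest stored neighbor of each training point. It assumes the same iris data and 1-neighbor classifier as above; the names `distances`, `indices`, `all_zero`, and `train_accuracy` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, y)

# Query the fitted model with the very points it has stored
distances, indices = model.kneighbors(X, n_neighbors=1)

# Each training point finds a stored point at distance zero
# (itself, or an exact duplicate row), so its own label is returned
all_zero = bool(np.allclose(distances, 0.0))
train_accuracy = model.score(X, y)
print(all_zero, train_accuracy)
```

Because every query matches a stored point exactly, the accuracy measured this way says nothing about how the model handles unseen data.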


THE RIGHT WAY

In [11]:
#Holdout sets
from sklearn.model_selection import train_test_split
# split the data with 50% in each set
X1, X2, y1, y2 = train_test_split(X, y, random_state=0,
                                  train_size=0.5)
# fit the model on one set of data
model.fit(X1, y1)
# evaluate the model on the second set of data
y2_model = model.predict(X2) 
accuracy_score(y2, y2_model)



0.9066666666666666

One disadvantage of using a holdout set for model validation is that a portion of our data
is no longer available for training the model. In the previous case, half the dataset does not
contribute to the training of the model! This is not optimal, and can cause problems, especially
if the initial set of training data is small.

In [None]:
#############Model validation via cross-validation##############

One way to address this is to use cross-validation—that is, to do a sequence of fits
where each subset of the data is used both as a training set and as a validation set. 

In [12]:
y2_model = model.fit(X1, y1).predict(X2)
y1_model = model.fit(X2, y2).predict(X1)
accuracy_score(y1, y1_model), accuracy_score(y2, y2_model)

(0.96, 0.9066666666666666)
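The two scores can be combined (for example, by averaging) into a single estimate of model performance. The following is a minimal sketch of that step under the same split as above; `acc_1`, `acc_2`, and `two_fold_mean` are names chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, train_size=0.5)

model = KNeighborsClassifier(n_neighbors=1)

# Trial 1: train on the first half, validate on the second half
acc_2 = accuracy_score(y2, model.fit(X1, y1).predict(X2))
# Trial 2: swap the roles of the two halves
acc_1 = accuracy_score(y1, model.fit(X2, y2).predict(X1))

# Average the two validation scores into one two-fold estimate
two_fold_mean = np.mean([acc_1, acc_2])
print(acc_1, acc_2, two_fold_mean)
```

With the two scores shown above (0.96 and roughly 0.907), the average works out to about 0.933.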

In [None]:
#################split the data into 5 subsets

In [14]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, cv=5)

array([0.96666667, 0.96666667, 0.93333333, 0.93333333, 1.        ])
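To make explicit what the five numbers represent, here is a sketch that runs the five folds by hand. It assumes that, for a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so `StratifiedKFold(n_splits=5)` should reproduce the same splits; `manual_scores` and `library_scores` are illustrative names.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit a fresh copy on four folds, score it on the held-out fold
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper performs the same loop in one call
library_scores = cross_val_score(model, X, y, cv=5)
print(manual_scores, library_scores, manual_scores.mean())
```

Each sample is used for validation exactly once and for training in the other four fits, which is why no data is permanently lost to the holdout set.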

In [24]:
# Leave-one-out: each trial holds out a single sample for validation.
# LeaveOneOut() takes no arguments; the number of splits is inferred from the data.
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
scores

Because we have 150 samples, the leave-one-out cross-validation yields scores for 150 trials,
and each score indicates either a successful (1.0) or an unsuccessful (0.0) prediction. Taking the
mean of these gives an estimate of the accuracy; one minus that mean is the estimated error rate.
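A minimal sketch of that final step, assuming the same iris data and 1-neighbor classifier as in the cells above; `loo_accuracy` and `loo_error_rate` are names introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# One trial per sample: train on 149 points, validate on the remaining one
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# The mean of the 0/1 scores is the fraction of correct predictions
loo_accuracy = scores.mean()
loo_error_rate = 1.0 - loo_accuracy
print(len(scores), loo_accuracy, loo_error_rate)
```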