# Lecture 10: Cross-validation
We have discussed a simple validation method for evaluating the performance of a machine learning (ML) model.
- Randomly split a data set into two parts: one is used as the *training set* to train a machine learning model. The other, the *test set*, is used to calculate the test error (an estimate of the **generalization error**).
- The ultimate goal is to train an ML model that has the minimum generalization error.
- A common methodological mistake is "testing on the training set": evaluating the model on samples that were also used for training.


## Agenda
1. One issue of the simple validation approach 
2. K-fold cross-validation
    1. model performance evaluation
    2. parameter searching for small datasets
3. Other evaluation metrics

### 1. One issue of the simple validation approach

In [1]:
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

ds = datasets.load_iris()
X = ds.data
y = ds.target

for i in range(10): # run the training and test of the KNN model 10 times
    # no random_state is set, so each iteration generates a different split
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size = 0.3)

    k = 5
    # KNeighborsClassifier is a class --> see the documentation:
    # http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train) # store the training data (may build a BallTree or KDTree)
    y_test_pred = knn.predict(X_test)
    mean_acc = knn.score(X_test, y_test) # accuracy = 1 - error rate
    print(i, 'mean test accuracy of {} nearest neighbor(s) is {} '.format(k, round(mean_acc, 4)))

0 mean test accuracy of 5 nearest neighbor(s) is 0.9556 
1 mean test accuracy of 5 nearest neighbor(s) is 0.9778 
2 mean test accuracy of 5 nearest neighbor(s) is 0.9778 
3 mean test accuracy of 5 nearest neighbor(s) is 0.9778 
4 mean test accuracy of 5 nearest neighbor(s) is 1.0 
5 mean test accuracy of 5 nearest neighbor(s) is 1.0 
6 mean test accuracy of 5 nearest neighbor(s) is 0.9556 
7 mean test accuracy of 5 nearest neighbor(s) is 0.9778 
8 mean test accuracy of 5 nearest neighbor(s) is 1.0 
9 mean test accuracy of 5 nearest neighbor(s) is 0.9778 


For a small dataset, holding out a test set reduces the number of samples available for training, and the measured performance depends on the particular random split into training and test sets: in the ten runs above, the test accuracy ranges from 0.9556 to 1.0.
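To put a number on this variability, the repeated splits can be summarized by their mean and standard deviation. The following is a minimal sketch, not part of the original notebook; the seeds 0 to 9 are an arbitrary assumption made only so that the run is reproducible.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

accs = []
for seed in range(10):  # assumed seeds; each one yields a different split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    accs.append(knn.score(X_test, y_test))  # test accuracy for this split

accs = np.array(accs)
print('min %0.4f, max %0.4f' % (accs.min(), accs.max()))
print('mean %0.4f +/- %0.4f' % (accs.mean(), accs.std()))
```

A non-zero spread means that a single split gives only a noisy estimate of the generalization error.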

### 1.1. What is overfitting?
Overfitting occurs when a model fits the training dataset too closely and therefore fails to perform well on new data.
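One way to see this is to compare the accuracy on the training set with the accuracy on held-out data. The sketch below assumes a 1-nearest-neighbor model, which effectively memorizes its training samples, and an arbitrary `random_state = 0`; both choices are illustrative only.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # assumed seed, for reproducibility

# with k = 1 every training sample is its own nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = knn.score(X_test, y_test)     # accuracy on unseen data
print('training accuracy: %0.4f' % train_acc)
print('test accuracy:     %0.4f' % test_acc)
```

The training accuracy is optimistic because the model is scored on samples it has stored; only the test accuracy says something about new data. Iris is an easy dataset, so the gap here may be small.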

### 1.2. Address the overfitting problem by splitting the dataset into training and test sets

In [1]:
# train a KNN model (including parameter searching) using the training and test sets

import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

ds = datasets.load_iris()
X = ds.data
y = ds.target

X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size = 0.3, random_state = 2) # 0, 5


# KNeighborsClassifier is a class --> see the documentation:
# http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train) # store the training data (may build a BallTree or KDTree)
    y_test_pred = knn.predict(X_test)
    cfm = confusion_matrix(y_test, y_test_pred)
    print('KNN confusion matrix for test set\n', cfm)

    mean_acc = knn.score(X_test, y_test)
    print('mean accuracy of {} nearest neighbor(s) is {} '.format(k, round(mean_acc, 4)))

# k = 1, 3, and 5 all produce the highest accuracy (1.0) on this test set

KNN confusion matrix for test set
 [[17  0  0]
 [ 0 15  0]
 [ 0  0 13]]
mean accuracy of 1 nearest neighbor(s) is 1.0 
KNN confusion matrix for test set
 [[17  0  0]
 [ 0 15  0]
 [ 0  0 13]]
mean accuracy of 3 nearest neighbor(s) is 1.0 
KNN confusion matrix for test set
 [[17  0  0]
 [ 0 15  0]
 [ 0  0 13]]
mean accuracy of 5 nearest neighbor(s) is 1.0 
KNN confusion matrix for test set
 [[17  0  0]
 [ 0 14  1]
 [ 0  0 13]]
mean accuracy of 7 nearest neighbor(s) is 0.9778 
KNN confusion matrix for test set
 [[17  0  0]
 [ 0 14  1]
 [ 0  0 13]]
mean accuracy of 9 nearest neighbor(s) is 0.9778 


### 2. k-fold cross-validation (CV)
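K-fold CV addresses the issue above: every sample is used for both training and testing, so the performance estimate no longer depends on a single random split.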

1. Randomly divide a dataset of n samples into k disjoint subsets (folds) of (approximately) equal size n/k
2. Use k-1 of the folds as the training set, and the remaining fold as the test set
3. Train and test the model
4. Repeat steps 2 and 3 until every fold has been used as the test set once
5. The final performance is the average of the k test measures

![k-fold cv](kfcv.png)
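The five steps above can be written out by hand. The following is a minimal sketch that uses scikit-learn's `KFold` to generate the folds; the variable name `n_folds` (to avoid a clash with the k of KNN), the choice of 5 neighbors, and `random_state = 0` are assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

n_folds = 5
# step 1: randomly divide the samples into n_folds disjoint folds
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # step 2: n_folds - 1 folds for training, the remaining fold for testing
    knn = KNeighborsClassifier(n_neighbors=5)
    # step 3: train and test the model
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[test_idx], y[test_idx]))
    # step 4: the loop repeats until every fold has been the test set once

# step 5: average the per-fold measures
fold_scores = np.array(fold_scores)
print(fold_scores)
print('Accuracy: %0.2f +/- %0.2f' % (fold_scores.mean(), fold_scores.std()))
```

Note that `cross_val_score` with an integer `cv` and a classifier uses stratified folds without shuffling, so its numbers can differ slightly from this hand-written loop.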

In [6]:
# use the cross_val_score function for k-fold cross-validation
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score

k = 1
knn = KNeighborsClassifier(n_neighbors = k)
scores = cross_val_score(knn, X, y, cv = 5, scoring = 'accuracy')
print(scores)
print('Accuracy: %0.2f +/- %0.2f'%(scores.mean(), scores.std()))

[0.96666667 0.96666667 0.93333333 0.93333333 1.        ]
Accuracy: 0.96 +/- 0.02


In [2]:
# use the cross_val_score function for parameter searching
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score

for k in [1, 3, 4, 5, 6, 7, 8, 9]:
    knn = KNeighborsClassifier(n_neighbors = k) # an unfitted model
    scores = cross_val_score(knn, X, y, cv = 5, scoring = 'accuracy')
    #print(scores)
    print('k: %d. Accuracy: %0.2f +/- %0.2f'%(k, scores.mean(), scores.std()))


k: 1. Accuracy: 0.96 +/- 0.02
k: 3. Accuracy: 0.97 +/- 0.02
k: 4. Accuracy: 0.97 +/- 0.01
k: 5. Accuracy: 0.97 +/- 0.02
k: 6. Accuracy: 0.98 +/- 0.02
k: 7. Accuracy: 0.98 +/- 0.02
k: 8. Accuracy: 0.97 +/- 0.03
k: 9. Accuracy: 0.97 +/- 0.02
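The same search can also be delegated to scikit-learn's `GridSearchCV`, which runs the cross-validation for every candidate value and records the best one. This is a sketch of an alternative to the loop above rather than the method used in this lecture; the variable names `param_grid` and `search` are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# candidate values of k, the same as in the loop above
param_grid = {'n_neighbors': [1, 3, 4, 5, 6, 7, 8, 9]}

# 5-fold cross-validation for each candidate, scored by accuracy
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print('best k:', search.best_params_['n_neighbors'])
print('best mean CV accuracy: %0.2f' % search.best_score_)
```

When several values of k tie on the mean score, `GridSearchCV` reports only one of them, so it is still worth inspecting the full table of results.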


### 3. Other evaluation metrics

- error rate
- accuracy
- precision (PR): the number of true positives over the number of predicted positives (true positives plus false positives)
- recall rate (RR): the number of true positives over the number of real positives (true positives plus false negatives)
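As a worked example of these definitions, the per-class precision and recall can be computed directly from a confusion matrix. The sketch below reuses the confusion matrix printed for k = 7 in Section 1.2 and assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# confusion matrix of the k = 7 run in Section 1.2
# rows: true class, columns: predicted class
cfm = np.array([[17, 0, 0],
                [0, 14, 1],
                [0, 0, 13]])

tp = np.diag(cfm)                 # correctly classified samples per class
precision = tp / cfm.sum(axis=0)  # tp / (tp + fp): column sums = predicted positives
recall = tp / cfm.sum(axis=1)     # tp / (tp + fn): row sums = real positives
accuracy = tp.sum() / cfm.sum()   # 44 / 45
error_rate = 1 - accuracy

print('precision per class:', np.round(precision, 4))
print('recall per class:   ', np.round(recall, 4))
print('macro precision: %0.4f, macro recall: %0.4f' % (precision.mean(), recall.mean()))
print('accuracy: %0.4f, error rate: %0.4f' % (accuracy, error_rate))
```

The single misclassified sample lowers the recall of class 1 (14/15) and the precision of class 2 (13/14). The macro averages are the unweighted means over the classes.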

In [25]:
# Other metrics
# accuracy

# precision: tp / (tp + fp) for each class, where tp is the number of correctly classified samples of that class
# recall: tp / (tp + fn) for each class, where tp + fn is the total number of samples in that class
# fn = false negatives, e.g. the true label is 0 but the predicted label is 1 or 2

import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(knn, X, y, cv=5, scoring = 'precision_macro')
    #print(scores)
    print("k: %d. Precision: %0.2f (+/- %0.2f)" % (k, scores.mean(), scores.std()))

k: 1. Recall: 0.96 (+/- 0.02)
k: 3. Recall: 0.97 (+/- 0.02)
k: 5. Recall: 0.98 (+/- 0.02)
k: 7. Recall: 0.98 (+/- 0.01)
k: 9. Recall: 0.97 (+/- 0.02)
