### Cross-validation: evaluating estimator performance

### Reference
- http://scikit-learn.org/stable/modules/cross_validation.html

### Why should we use Cross-validation?
- merge the `training set` and `validation set` instead of partitioning the available data into three sets, which leaves much more data for training;
- evaluate our methods using only the training set, which avoids 'leaking' knowledge about the test set into the model;
- select hyperparameters.

When evaluating different settings (“hyperparameters”) for estimators, such as the `C` setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.

### Basic steps
A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
- A model is trained using `k - 1` of the folds as training data;
- the resulting model is validated on the remaining part of the data (i.e. it is used as a test set to compute a performance measure such as accuracy).
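
The procedure above can be sketched by hand with `KFold`. This is a minimal illustration rather than the only way to do it: the 60/40 split, `n_splits=5`, and the names `fold_scores` and `mean_score` are assumptions chosen for the example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Hold out a test set for the final evaluation (the 40% size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Split the training set into k = 5 folds.
kf = KFold(n_splits=5)
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    # Train on k - 1 folds ...
    model = svm.SVC(kernel="linear", C=1)
    model.fit(X_train[train_idx], y_train[train_idx])
    # ... and validate on the remaining fold.
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

# Summarize the k validation scores by their mean.
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

The held-out `X_test` and `y_test` are deliberately untouched inside the loop; they are reserved for the final evaluation.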

### Deal with raw data

In scikit-learn a random split into training and test sets can be quickly computed with the train_test_split helper function. Let’s load the iris data set to fit a linear support vector machine on it:

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape


((150, 4), (150,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating) our classifier:

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape

X_test.shape, y_test.shape


clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)  # classification accuracy on the test set

0.96666666666666667

### 1. Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

- The score can be computed in several ways; see the `scoring` parameter: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

In [16]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
# Basic usage: estimator + X + y + number of CV folds + scoring metric
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='accuracy')
scores  # 5-fold CV yields 5 scores (one per held-out fold); average them at the end

array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])

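The mean score and an approximate 95% interval of the score estimate (the mean plus or minus two standard deviations, assuming the scores are roughly normally distributed) are hence given by:
- NOTE: the interval follows the empirical rule: about 95% of normally distributed values lie within two standard deviations of the mean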

<img src="https://mathbitsnotebook.com/Algebra2/Statistics/normalEmp2.jpg">

In [9]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)


##### Data transformation with held out data

- At prediction time, the data must be preprocessed (e.g. standardized or scaled) in the same way as the training data
- Dataset transformations, a big topic: http://scikit-learn.org/stable/data_transforms.html#data-transforms

Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations should be learnt from a training set and applied to held-out data for prediction:

In [10]:
from sklearn import preprocessing

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test)
clf.score(X_test_transformed, y_test)

0.93333333333333335
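
The same rule applies inside cross-validation: the scaler should be re-fit on the training folds of each split. One way to get this, sketched below, is to wrap the scaler and the classifier in a pipeline with `make_pipeline` and pass the pipeline to `cross_val_score`. The names `pipe` and `pipe_scores` and the choice of `cv=5` are assumptions for the example.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)

# The pipeline is cloned and fit once per split, so the scaler only ever
# sees the training folds; the held-out fold is transformed, never fit on.
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
pipe_scores = cross_val_score(pipe, X, y, cv=5)
print(pipe_scores, pipe_scores.mean())
```

Scaling the whole dataset once before calling `cross_val_score` would let statistics from the held-out folds influence training, which is the leak this section warns about.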

#### 1.1 The cross_validate function and multiple metric evaluation

The `cross_validate` function differs from `cross_val_score` in two ways:

- It allows specifying multiple metrics for evaluation.
- It returns a dict containing training scores, fit-times and score-times in addition to the test score.

The multiple metrics can be specified either as a list, tuple or set of predefined scorer names:

In [11]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score

scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, iris.data, iris.target, scoring=scoring, cv=5, return_train_score=False)

print(sorted(scores.keys()))

print(scores['test_recall_macro'], scores['test_precision_macro'])

print(scores['fit_time'], scores['score_time'])

['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']
[ 0.96666667  1.          0.96666667  0.96666667  1.        ] [ 0.96969697  1.          0.96969697  0.96969697  1.        ]
[ 0.00057602  0.00033641  0.0002985   0.00032592  0.00029635] [ 0.00103164  0.00073767  0.00072742  0.00075459  0.00073671]
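
Besides a list of names, `scoring` also accepts a dict that maps a result name to either a predefined scorer name or a scorer object built with `make_scorer`. The sketch below is illustrative: the keys `prec_macro` and `rec_macro` and the variable `dict_scores` are names chosen for this example, not anything required by the library.

```python
from sklearn import datasets, svm
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)

# Dict keys are arbitrary labels (assumed names); they become the
# "test_<key>" entries of the result.
scoring = {
    "prec_macro": "precision_macro",
    "rec_macro": make_scorer(recall_score, average="macro"),
}
clf = svm.SVC(kernel="linear", C=1, random_state=0)
dict_scores = cross_validate(clf, X, y, scoring=scoring, cv=5,
                             return_train_score=False)
print(sorted(dict_scores.keys()))
print(dict_scores["test_rec_macro"])
```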


Here is an example of `cross_validate` using a single metric:

In [14]:
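# cv=3 was the default in older scikit-learn releases; it is set explicitly
# here so the call yields the three scores shown in the output below
scores = cross_validate(clf, iris.data, iris.target, scoring='precision_macro', cv=3, return_train_score=True)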
print(sorted(scores.keys()))

print(scores['test_score'], scores['train_score'])

['fit_time', 'score_time', 'test_score', 'train_score']
[ 1.          0.96491228  0.98039216] [ 0.98095238  1.          0.99047619]


#### 1.2 Obtaining predictions by cross-validation

The function `cross_val_predict` has a similar interface to `cross_val_score`, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).

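These predictions can then be used to evaluate the classifier:

- Q: Are the predictions made with an already-trained model? If so, what role does CV play here? A: No. For each split a fresh copy of the estimator is fit on the training folds and predicts only the held-out fold, so every prediction comes from a model that never saw that sample.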

In [12]:
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

print(clf)
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False)


0.97333333333333338
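
To make the mechanics concrete, here is a sketch that rebuilds the predictions by hand and compares them with `cross_val_predict`. It assumes a `StratifiedKFold` splitter is passed explicitly so both paths use identical splits; the names `manual` and `fold_model` are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1, random_state=0)
cv = StratifiedKFold(n_splits=10)  # deterministic because shuffle is off

# Manual version: fit a fresh clone per split, predict only the held-out fold.
manual = np.empty_like(y)
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    manual[test_idx] = fold_model.predict(X[test_idx])

# Library version using the same splitter.
predicted = cross_val_predict(clf, X, y, cv=cv)
print(np.array_equal(manual, predicted))
```

Every entry of `predicted` is therefore an out-of-sample prediction, which is why it can be compared against `iris.target` as above.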

### 2. Cross-validation iterators for i.i.d. data

Assuming that data is independent and identically distributed (i.i.d.) means assuming that all samples stem from the same generative process and that this process has no memory of past generated samples.

The following cross-validators can be used in such cases.

##### NOTE

While i.i.d. data is a common assumption in machine learning theory, it rarely holds in practice. If one knows that the samples have been generated using a time-dependent process, it is safer to use a time-series aware cross-validation scheme. Similarly, if we know that the generative process has a group structure (samples collected from different subjects, experiments, or measurement devices), it is safer to use group-wise cross-validation.
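
As a rough illustration of the two alternatives named in the note, the sketch below uses scikit-learn's `TimeSeriesSplit` and `GroupKFold`. The toy array and the `groups` labels (standing in for subject ids) are made up for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(6, 2)  # 6 toy samples, assumed to be in time order

# Time-series aware: every training index precedes every test index.
tscv = TimeSeriesSplit(n_splits=3)
ts_splits = list(tscv.split(X))
for train, test in ts_splits:
    print("time series:", train, test)

# Group-wise: all samples of a group stay on the same side of the split.
groups = np.array([0, 0, 1, 1, 2, 2])  # hypothetical subject ids
gkf = GroupKFold(n_splits=3)
group_splits = list(gkf.split(X, groups=groups))
for train, test in group_splits:
    print("grouped:", train, test)
```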

#### 2.1 K-fold

KFold divides all the samples into k groups of samples, called folds (if k = n, this is equivalent to the Leave One Out strategy), of equal sizes (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for testing.

Example of 2-fold cross-validation on a dataset with 4 samples:

In [13]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]


Each fold is constituted by two arrays: the first one is related to the training set, and the second one to the test set. Thus, one can create the training/test sets using numpy indexing:

In [14]:
X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
y = np.array([0, 1, 0, 1])
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
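
The "equal sizes (if possible)" caveat in the KFold description can be checked with a sample count that k does not divide evenly. This small sketch uses 10 toy samples and 3 folds; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)  # 10 toy samples
kf = KFold(n_splits=3)

test_folds = [test for _, test in kf.split(X)]
test_sizes = [len(test) for test in test_folds]
print(test_sizes)  # [4, 3, 3]: the first fold absorbs the remainder

# Each sample lands in exactly one test fold.
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```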

#### 2.2 Repeated K-Fold

RepeatedKFold repeats K-Fold n times. It is useful when KFold needs to be run n times with different splits in each repetition.

Example of 2-fold K-Fold repeated 2 times:

In [15]:
import numpy as np
from sklearn.model_selection import RepeatedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)
for train, test in rkf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]
[0 2] [1 3]
[1 3] [0 2]
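
A splitter such as `RepeatedKFold` can also be passed directly as the `cv` argument of `cross_val_score`, which then returns one score per split (`n_splits * n_repeats` in total). The sketch below assumes 5 folds repeated 3 times on the iris data; `rep_scores` is an illustrative name.

```python
from sklearn import datasets, svm
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# 5 folds repeated 3 times gives 15 train/test splits.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
print(rkf.get_n_splits())

rep_scores = cross_val_score(svm.SVC(kernel="linear", C=1), X, y, cv=rkf)
print(len(rep_scores), rep_scores.mean())
```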


#### 2.3 Leave One Out (LOO)

- LOO is also a form of cross-validation

LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data as only one sample is removed from the training set:

In [16]:
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


In terms of accuracy, LOO often results in high variance as an estimator for the test error. Intuitively, since n - 1 of the n samples are used to build each model, models constructed from folds are virtually identical to each other and to the model built from the entire training set.

However, if the learning curve is steep for the training size in question, then 5- or 10- fold cross validation can overestimate the generalization error.

As a general rule, most authors, and empirical evidence, suggest that 5- or 10- fold cross validation should be preferred to LOO.
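
For a practical feel of the difference, the sketch below runs both strategies on the iris data with the linear SVM used earlier. It only illustrates the mechanics (number of fits and what a single score looks like) and is not meant as evidence for the variance claim; `loo_scores` and `kfold_scores` are illustrative names.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# LOO fits one model per sample: 150 fits for iris.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
# 5-fold fits only 5 models.
kfold_scores = cross_val_score(clf, X, y, cv=5)

# Each LOO score is the accuracy on a single sample, so it is 0 or 1.
print(len(loo_scores), np.unique(loo_scores))
print(loo_scores.mean(), kfold_scores.mean())
```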

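- Q: If 5- or 10-fold cross-validation can overestimate the generalization error, shouldn't LOO overestimate it even more?
- A: Not necessarily. With a steep learning curve, k-fold overestimates the error because each model is trained on only (k - 1)/k of the data and is therefore weaker than a model trained on all of it. LOO trains on n - 1 samples, so this bias is smallest for LOO; its drawbacks are the high variance described above and the cost of n fits.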

#### 2.4 Leave P Out (LPO)

`LeavePOut` is very similar to `LeaveOneOut` as it creates all the possible training/test sets by removing p samples from the complete set. For n samples, this produces ${n \choose p}$ train-test pairs. Unlike `LeaveOneOut` and `KFold`, the test sets will overlap for p > 1.

Example of Leave-2-Out on a dataset with 5 samples:

In [17]:
from sklearn.model_selection import LeavePOut

X = np.ones(5)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[2 3 4] [0 1]
[1 3 4] [0 2]
[1 2 4] [0 3]
[1 2 3] [0 4]
[0 3 4] [1 2]
[0 2 4] [1 3]
[0 2 3] [1 4]
[0 1 4] [2 3]
[0 1 3] [2 4]
[0 1 2] [3 4]
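
The count of train-test pairs and the overlap of the test sets can be verified directly. This sketch reuses the 5-sample, p = 2 setting from the example and compares `get_n_splits` with `math.comb` (Python 3.8 or later); `n_pairs` and `times_in_test` are illustrative names.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(5)
lpo = LeavePOut(p=2)

# Number of train-test pairs equals "5 choose 2".
n_pairs = lpo.get_n_splits(X)
print(n_pairs, comb(5, 2))

# Test sets overlap: count how often each sample index is used for testing.
times_in_test = np.zeros(len(X), dtype=int)
for _, test in lpo.split(X):
    times_in_test[test] += 1
print(times_in_test)  # each sample is tested comb(4, 1) = 4 times
```

The number of pairs grows combinatorially with n, so this strategy is usually practical only for small datasets.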


### 3. Cross-validation iterators for grouped data

#### Leave One Group Out