In [1]:
%matplotlib inline

## [Cross Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
- Learning & testing a model on the same data is a mistake. The model will simply repeat the labels it has already seen, but fail to predict anything new. (This is "overfitting").
- It is a common practice to reserve a subset of data (a "test set") to avoid this problem.

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape

((150, 4), (150,))

- Now split the data into training and test subsets, holding out 40% for testing.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.4, 
    random_state=0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))

(90, 4) (90,) (60, 4) (60,)
0.9666666666666667


- There is still a risk of overfitting to the test data: hyperparameters (such as `C` for an SVM) can be tweaked until the model scores well on it, so test data "knowledge" can leak into the model.
- Reserving yet another subset for *validation* solves this problem but introduces another: it reduces the number of samples available for training.
- Cross-validation (CV) addresses this. A test subset is still held out for the final evaluation, but a separate validation subset is no longer needed. Instead, the training data is split into _k_ smaller sets ("folds"). For each fold, a model is trained on the other _k-1_ folds and validated on the held-out fold.
- The performance reported by k-fold cross-validation is the average of the scores computed in the loop.
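The loop described above can be sketched by hand. This is a minimal illustration rather than what `cross_val_score` does internally; the `*_demo` names, `fold_scores` and `mean_score` are made up for the example, and the folds are shuffled because the iris samples are ordered by class.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, train_test_split

X_demo, y_demo = datasets.load_iris(return_X_y=True)

# keep a final test subset aside; cross-validate on the training part only
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X_train_demo):
    # train on k-1 folds, validate on the held-out fold
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X_train_demo[train_idx], y_train_demo[train_idx])
    fold_scores.append(
        model.score(X_train_demo[val_idx], y_train_demo[val_idx]))

# the CV estimate is the average of the per-fold scores
mean_score = np.mean(fold_scores)
print(fold_scores, mean_score)
```

The held-out `X_test_demo` / `y_test_demo` are not touched inside the loop; they stay available for a single final evaluation.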

![illustration](px/grid_search_cross_validation.png)

## [CV metrics](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) using cross_val_score


In [4]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [5]:
# mean score & an approximate 95% interval of the score estimate (mean +/- 2 standard deviations)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)


In [6]:
# changing the scoring method:
from sklearn import metrics
scores = cross_val_score(
    clf, X, y, cv=5, scoring='f1_macro')
scores

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])

In [7]:
# changing the cross-validation iterator method:
from sklearn.model_selection import ShuffleSplit
n_samples = X.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
cross_val_score(clf, X, y, cv=cv)

array([0.97777778, 0.97777778, 1.        , 0.95555556, 1.        ])

In [8]:
# using an iterable which yields (train, test) splits as arrays of indices
# note: this toy generator yields the same indices for train and test
def custom_cv_2folds(X):
    n = X.shape[0]
    i = 1
    while i <= 2:
        idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
        yield idx, idx
        i += 1

custom_cv = custom_cv_2folds(X)
cross_val_score(clf, X, y, cv=custom_cv)

array([1.        , 0.97333333])
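The generator above trains and tests on the same indices, which is exactly what the introduction warns against. A variant with disjoint halves might look like the sketch below. `disjoint_cv_2folds`, its `seed` parameter and the `*_iris` names are hypothetical, and the indices are permuted first because the iris samples are ordered by class.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

def disjoint_cv_2folds(n_samples, seed=0):
    """Yield two (train, test) index pairs built from two disjoint halves."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(n_samples)
    half = n_samples // 2
    first, second = perm[:half], perm[half:]
    yield second, first   # train on the second half, test on the first
    yield first, second   # then swap the roles

X_iris, y_iris = datasets.load_iris(return_X_y=True)
disjoint_scores = cross_val_score(
    svm.SVC(kernel='linear', C=1), X_iris, y_iris,
    cv=disjoint_cv_2folds(len(y_iris)))
print(disjoint_scores)
```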

## [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) vs cross_val_score:
- cross_validate allows using multiple metrics
- cross_validate returns a dict with fit-times, score-times & test score.
- metrics can be specified with a list, tuple or set of predefined scorer names.

In [9]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score

scoring = ['precision_macro', 
           'recall_macro']

clf = svm.SVC(kernel='linear', 
              C=1, 
              random_state=0)

scores = cross_validate(clf, X, y, scoring=scoring)

sorted(scores.keys())
scores['test_recall_macro']

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

- or with a dict that maps a scorer name to a predefined or custom scoring function.

In [10]:
from sklearn.metrics import make_scorer

scoring = {'prec_macro': 'precision_macro',
           'rec_macro':  make_scorer(recall_score, 
                                     average='macro')}

scores = cross_validate(clf, X, y, scoring=scoring,
                        cv=5, return_train_score=True)

sorted(scores.keys())
scores['train_rec_macro']

array([0.975     , 0.975     , 0.99166667, 0.98333333, 0.98333333])

## [cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict) vs cross_val_score
- cross_val_predict returns, for each input element, that element's prediction when it was in the test dataset.
- This is usable only with cross-validation strategies that assign every element to a test set exactly once. Otherwise scikit-learn raises an exception.
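A short sketch of both points, assuming the iris data and linear SVC used earlier (the `*_iris`, `model` and `raised` names are illustrative): with 5-fold CV every sample is predicted exactly once, while `ShuffleSplit` does not partition the data and is expected to be rejected with a `ValueError`.

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, cross_val_predict

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel='linear', C=1)

# one out-of-fold prediction per sample
y_pred = cross_val_predict(model, X_iris, y_iris, cv=5)
print(y_pred.shape)
print(accuracy_score(y_iris, y_pred))

# ShuffleSplit test sets overlap and miss samples, so it is not a partition
try:
    cross_val_predict(
        model, X_iris, y_iris,
        cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0))
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

The accuracy of the pooled predictions need not equal the mean of the per-fold scores from `cross_val_score`, since the samples are grouped differently when the folds differ in size.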

## IID (independent, identically distributed) data splits:

#### [K-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)
- divides all samples into k groups ("folds") of equal size, if possible.
- the prediction function is learned using k-1 folds; the remaining fold is used for testing.

In [11]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
kf = KFold(n_splits=4)
for train, test in kf.split(X):
    print("%s %s" % (train, test))


[3 4 5 6 7 8 9] [0 1 2]
[0 1 2 6 7 8 9] [3 4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]


#### [Repeated K-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedKFold.html#sklearn.model_selection.RepeatedKFold)
- repeats K-fold n times, producing different splits in each repetition.

In [12]:
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1,2], [3,4], [5,6], [1,2], [3,4], [5,6]])
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, 
                    n_repeats=2, 
                    random_state=random_state)

for train, test in rkf.split(X):
    print("%s %s" % (train, test))

[0 4 5] [1 2 3]
[1 2 3] [0 4 5]
[0 1 3] [2 4 5]
[2 4 5] [0 1 3]


#### [Leave one out](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut) (LOO)
- Each training set is created by using all samples except one, which is used for testing. For n samples, we get n training sets and n test sets.

In [13]:
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4, 5, 6]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3 4 5] [0]
[0 2 3 4 5] [1]
[0 1 3 4 5] [2]
[0 1 2 4 5] [3]
[0 1 2 3 5] [4]
[0 1 2 3 4] [5]
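Passed as `cv`, leave-one-out produces one score per sample, each either 0 or 1 for accuracy. A minimal sketch on the iris data (the `*_iris` and `loo_scores` names are illustrative):

```python
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# 150 samples -> 150 fits, each tested on a single held-out sample
loo_scores = cross_val_score(
    svm.SVC(kernel='linear', C=1), X_iris, y_iris, cv=LeaveOneOut())
print(len(loo_scores), loo_scores.mean())
```

The number of fits grows with the number of samples, so this is only practical for small datasets or cheap models.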


#### [Leave P out](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePOut.html#sklearn.model_selection.LeavePOut) (LPO)
- creates all possible training & test subsets by removing P samples from the complete set. For n samples this generates ${n \choose p}$ train-test pairs.

In [14]:
from sklearn.model_selection import LeavePOut

X = np.ones(6)
lpo = LeavePOut(p=2)
for train, test in lpo.split(X):
    print("%s %s" % (train, test))

[2 3 4 5] [0 1]
[1 3 4 5] [0 2]
[1 2 4 5] [0 3]
[1 2 3 5] [0 4]
[1 2 3 4] [0 5]
[0 3 4 5] [1 2]
[0 2 4 5] [1 3]
[0 2 3 5] [1 4]
[0 2 3 4] [1 5]
[0 1 4 5] [2 3]
[0 1 3 5] [2 4]
[0 1 3 4] [2 5]
[0 1 2 5] [3 4]
[0 1 2 4] [3 5]
[0 1 2 3] [4 5]
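The 15 pairs above match ${6 \choose 2}$. The count can be checked directly, and it also shows how quickly it grows. This is a small sketch; `math.comb` needs Python 3.8 or later, and the variable names are illustrative.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

n_samples, p = 6, 2
lpo_demo = LeavePOut(p=p)

# count the generated train/test pairs and compare with n choose p
n_pairs = sum(1 for _ in lpo_demo.split(np.ones(n_samples)))
print(n_pairs, comb(n_samples, p))   # 15 15

# the count grows combinatorially with the number of samples
print(comb(100, 2))                  # 4950
```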


#### Random permutations, aka [shuffle split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit)
- Generates a user-defined number of independent train & test subsets.
- Randomness can be controlled via ```random_state```.

In [15]:
from sklearn.model_selection import ShuffleSplit
X = np.arange(20)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    print("%s %s" % (train_index, test_index))

[17  6 13  4  2  5 14  9  7 16 11  3  0 15 12] [18  1 19  8 10]
[12 19 16 10  0  3  4 15  8 13  9  5 14  7  6] [11  1 18 17  2]
[ 2  8  6  3 17  4 10 16 18  9  1  0  7 14 19] [15 13 12  5 11]
[17  7 12 14 16 11 10  9 15  1 19  8  6  5  4] [18  0 13  2  3]
[18  8 17 15 16  6 13 11  4 10  9 12  3 14  0] [ 7  1  2 19  5]


## class label splits
- some classification problems suffer from large imbalances of class distributions. stratified sampling helps ensure that relative class frequencies are preserved.

#### [stratified K-fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold)
- each fold contains approx the same percentage of samples of each class as the complete set.

In [16]:
# compare: stratified 3-fold CV to std k-fold CV.
# dataset has 50 samples from 2 unbalanced classes
# show #samples in each class

from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

skf = StratifiedKFold(n_splits=3)
kf = KFold(n_splits=3)

for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))

for train, test in kf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))

train -  [30  3]   |   test -  [15  2]
train -  [30  3]   |   test -  [15  2]
train -  [30  4]   |   test -  [15  1]
train -  [28  5]   |   test -  [17]
train -  [28  5]   |   test -  [17]
train -  [34]   |   test -  [11  5]
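Stratification is also what an integer `cv` gives you for classifiers: when the estimator is a classifier and `y` is binary or multiclass, `cross_val_score(..., cv=5)` uses stratified k-fold. The sketch below compares the three options on the iris data; the `scores_*` names are illustrative.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel='linear', C=1)

scores_int = cross_val_score(model, X_iris, y_iris, cv=5)
scores_strat = cross_val_score(
    model, X_iris, y_iris, cv=StratifiedKFold(n_splits=5))
# plain KFold without shuffling follows the class-sorted order of the data
scores_plain = cross_val_score(
    model, X_iris, y_iris, cv=KFold(n_splits=5))

print(np.allclose(scores_int, scores_strat))
print(scores_plain)
```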


#### [stratified shuffle split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit)
- creates splits by preserving the same percentage for each class as in the complete set.

In [17]:
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

sss = StratifiedShuffleSplit(n_splits = 5, 
                             test_size = 0.5, 
                             random_state = 0)
sss.get_n_splits(X, y)
print(sss)

for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

StratifiedShuffleSplit(n_splits=5, random_state=0, test_size=0.5,
            train_size=None)
TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]


## grouped splits
- The IID assumption breaks if the underlying process yields groups of dependent samples.
- These groupings are domain-specific, for example multiple medical samples taken from each of several patients.

#### [Group K-Fold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold)
- variation of k-fold; ensures the same group is not represented in both the test and training subsets.

In [18]:
from sklearn.model_selection import GroupKFold

X      = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9,   10]
y      = ["a", "b", "b", "b", "c", "c",  "c", "d", "d", "d"]
groups = [ 1,   1,   1,   2,   2,   2,    3,   3,   3,   3]
gkf = GroupKFold(n_splits=3)

for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
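The "no shared group" property can be checked directly. This is a small sketch with made-up data; `X_grp`, `y_grp`, `groups_demo` and `overlaps` are illustrative names, and the groups mirror the example above.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X_grp = np.arange(10).reshape(-1, 1)
y_grp = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
groups_demo = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(
        X_grp, y_grp, groups=groups_demo):
    train_groups = set(groups_demo[train_idx].tolist())
    test_groups = set(groups_demo[test_idx].tolist())
    # groups present on both sides of the split (expected to be empty)
    overlaps.append(train_groups & test_groups)
    print(sorted(train_groups), sorted(test_groups))
```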


#### [Leave one group out](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut)
- holds out samples according to a user-supplied array of integer group labels, leaving one group out per split. This allows you to encode arbitrary domain information.

In [19]:
from sklearn.model_selection import LeaveOneGroupOut

X      = [1, 5, 10, 50, 60, 70, 80]
y      = [0, 1, 1,  2,  2,  2,  2]
groups = [1, 1, 2,  2,  3,  3,  3]
logo = LeaveOneGroupOut()

for train, test in logo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]


#### [Leave p groups out](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePGroupsOut.html#sklearn.model_selection.LeavePGroupsOut)
- holds out the samples belonging to P groups for each training/test split.

In [20]:
from sklearn.model_selection import LeavePGroupsOut

X      = np.arange(6)
y      = [1, 1, 1, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3]
lpgo   = LeavePGroupsOut(n_groups = 2)

for train, test in lpgo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[4 5] [0 1 2 3]
[2 3] [0 1 4 5]
[0 1] [2 3 4 5]


#### [Group shuffle split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html#sklearn.model_selection.GroupShuffleSplit)
- generates a sequence of random partitions, in which a subset of groups is reserved for each split.
- this strategy is useful when the behavior of **LeavePGroupsOut** is desired, but the #groups is too large for reasonable compute time.

In [21]:
from sklearn.model_selection import GroupShuffleSplit

X      = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
y      = ["a", "b", "b", "b", "c", "c",  "c", "a"]
groups = [ 1,   1,   2,   2,   3,   3,    4,   4]
gss    = GroupShuffleSplit(n_splits  = 4, 
                           test_size = 0.5, 
                           random_state = 0)

for train, test in gss.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]


## [predefined splits](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit)
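- in some datasets a split already exists.
- `test_fold[i]` gives the index of the test fold that sample i belongs to. A value of -1 keeps that sample out of every test set, so it is always used for training.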

In [22]:
import numpy as np
from sklearn.model_selection import PredefinedSplit

X         = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y         = np.array([ 0,      0,      1,      1])
test_fold =          [ 0,      1,     -1,      1]
ps        = PredefinedSplit(test_fold)
ps.get_n_splits()

print(ps)

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


PredefinedSplit(test_fold=array([ 0,  1, -1,  1]))
TRAIN: [1 2 3] TEST: [0]
TRAIN: [0 2] TEST: [1 3]


## [time series splits](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit)
- variation of *k-fold*: in the _k_-th split, the first _k_ folds form the training subset and fold _k+1_ is the test subset.
- Note that successive training sets are _supersets_ of those that came before. Any surplus samples are added to the first training partition.

In [23]:
# 3-split time series cross-validation, 6-sample dataset
from sklearn.model_selection import TimeSeriesSplit

X    = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y    = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

for train, test in tscv.split(X):
    print("%s %s" % (train, test))

TimeSeriesSplit(max_train_size=None, n_splits=3)
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
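If growing training sets are not wanted, the `max_train_size` parameter (visible in the repr above) caps the training window, so only the most recent samples are kept. A minimal sketch, with `X_ts`, `tscv_window` and `window_splits` as illustrative names:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_ts = np.arange(12).reshape(6, 2)

# keep at most the 2 most recent samples in each training set
tscv_window = TimeSeriesSplit(n_splits=3, max_train_size=2)
window_splits = [(train.tolist(), test.tolist())
                 for train, test in tscv_window.split(X_ts)]

for train, test in window_splits:
    print(train, test)
```

In every split the training indices still precede the test indices, so the model never sees samples from the future.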


## Shuffling notes
- some splitters, such as `KFold`, have a built-in `shuffle` option that shuffles the sample indices before splitting.
- this consumes less memory than shuffling the data directly.
- default: no shuffling occurs.
- ```random_state=None``` by default, so shuffling will be different on each run. Set ```random_state``` to an integer to get repeatable results.
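These notes can be illustrated with `KFold`. In this sketch, `collect_test_folds` is a hypothetical helper and the data is just the integers 0 to 9.

```python
import numpy as np
from sklearn.model_selection import KFold

X_shuf = np.arange(10)

def collect_test_folds(splitter):
    """Return the test indices of every split as plain lists."""
    return [test.tolist() for _, test in splitter.split(X_shuf)]

# default: no shuffling, folds are contiguous blocks of indices
unshuffled = collect_test_folds(KFold(n_splits=5))

# shuffle=True with a fixed integer seed gives the same folds every time
run_a = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=42))
run_b = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=42))

print(unshuffled)
print(run_a == run_b)
```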