# Cross-validation: evaluating estimator performance

From this guide: https://scikit-learn.org/stable/modules/cross_validation.html

## Overview

Learning the parameters of a prediction function and testing it on the *same data* is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to *hold out* part of the available data as a test set `X_test, y_test`.  

Here is a flowchart of the typical cross-validation workflow in model training. The best parameters can be determined by, e.g., grid search techniques.

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" width=400 />

In [1]:
# Using train_test_split() to split data:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)

print(X.shape, y.shape)
print(X[0])
print(y[0])

(150, 4) (150,)
[5.1 3.5 1.4 0.2]
0


In [2]:
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.4, 
    random_state=0
)

print("X_train.shape:", X_train.shape, "y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape, "y_test.shape:", y_test.shape)

X_train.shape: (90, 4) y_train.shape: (90,)
X_test.shape: (60, 4) y_test.shape: (60,)


In [3]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9666666666666667

When evaluating different settings (“hyperparameters”) for estimators, such as the `C` setting that must be manually set for an SVM, there is still a risk of **overfitting on the *test set*** because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. 

To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
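As a rough sketch of the three-way partition described above, `train_test_split` can be applied twice. The 60/20/20 proportions and the variable names (`X_rest`, `X_val`, `X_final_test`, ...) are assumptions chosen for illustration, not something the guide prescribes:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_iris(return_X_y=True)

# First carve off the final test set (20% of the data, an assumed proportion) ...
X_rest, X_final_test, y_rest, y_final_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)

# ... then split the remainder into training and validation sets.
# 0.25 of the remaining 80% corresponds to 20% of the full dataset.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print("train:", X_tr.shape, "validation:", X_val.shape, "test:", X_final_test.shape)
```

Hyperparameters would be tuned against `X_val, y_val`, and `X_final_test, y_final_test` would only be touched once, at the very end.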

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called **cross-validation** (CV for short). 

> A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. 

In the basic approach, called *k*-fold CV, the training set is split into *k* smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the *k* “folds”:

* A model is trained using $k-1$ of the folds as training data;
* the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by *k*-fold cross-validation **is then the average** of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
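The loop described above can be written out by hand, which may help to see what the helper functions in the next section automate. This is only a sketch: the choice of `KFold` with shuffling, the 60/40 hold-out and the names `fold_scores` and `cv_score` are assumptions made for the example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import KFold, train_test_split

X_all, y_all = datasets.load_iris(return_X_y=True)

# Hold out a final test set; the CV loop only touches the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0
)

model = svm.SVC(kernel="linear", C=1)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X_tr):
    # Train a fresh copy of the model on k-1 folds ...
    fold_model = clone(model).fit(X_tr[train_idx], y_tr[train_idx])
    # ... and validate it on the remaining fold.
    fold_scores.append(fold_model.score(X_tr[val_idx], y_tr[val_idx]))

# The reported CV performance is the average over the k folds.
cv_score = np.mean(fold_scores)
print(fold_scores, cv_score)
```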

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width=500 />

## [Computing cross-validated metrics](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics)

The simplest way to use cross-validation is to call the `cross_val_score` helper function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the *iris dataset* by splitting the data, fitting a model and computing the score **5 consecutive times (with different splits each time)**:

In [4]:
from sklearn.model_selection import cross_val_score  # <-- This helper function.

clf = svm.SVC(kernel='linear', C=1, random_state=42)

scores = cross_val_score(clf, X, y, cv=5, verbose=True)
print(scores)

[0.96666667 1.         0.96666667 0.96666667 1.        ]


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.0s finished


The mean score and the standard deviation are hence given by:

In [5]:
print(f"{scores.mean():0.2f} accuracy with a standard deviation of {scores.std():0.2f}")

0.98 accuracy with a standard deviation of 0.02


* ⚠️ By default, the score computed at each CV iteration is the **score method of the estimator**. 

It is possible to change this by using the scoring parameter:

In [6]:
from sklearn import metrics
scores = cross_val_score(
    clf, 
    X, 
    y, 
    cv=5, 
    scoring='f1_macro'  # <-- This.
)

print(scores)

[0.96658312 1.         0.96658312 0.96658312 1.        ]


* 🔍 See [The scoring parameter: defining model evaluation rules](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for details.

* 💡 When the `cv` argument is an integer, `cross_val_score` uses the `KFold` or `StratifiedKFold` strategies by default, the latter being used if the estimator derives from `ClassifierMixin`.
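A quick way to check the note above is to compare an integer `cv` against an explicit `StratifiedKFold` for a classifier. This is a small sketch; the variable names are chosen for the example:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel="linear", C=1)

# Integer cv: the splitter is chosen for us (stratified, because SVC is a classifier).
scores_int = cross_val_score(model, X_iris, y_iris, cv=5)

# The same strategy spelled out explicitly.
scores_skf = cross_val_score(model, X_iris, y_iris, cv=StratifiedKFold(n_splits=5))

print(np.allclose(scores_int, scores_skf))
```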

It is also possible to use other **cross validation strategies** by passing a `cross validation iterator` instead, for instance:

In [7]:
from sklearn.model_selection import ShuffleSplit

n_samples = X.shape[0]

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)

scores = cross_val_score(
    clf, 
    X, 
    y, 
    cv=cv  # ⚠️ We are passing a `ShuffleSplit` cross validation iterator here (which is a CV strat.), not integer anymore!
)

print(scores)

[0.97777778 0.97777778 1.         0.95555556 1.        ]


Another option is to use an `iterable` yielding `(train, test)` splits as **arrays of indices**, for example:

In [8]:
def custom_cv_2folds(X):
    n = X.shape[0]
    i = 1
    while i <= 2:
        idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int)
        yield idx, idx  # Yield (train, test) index arrays; here both are the SAME indices, for illustration only.
        i += 1

In [9]:
custom_cv = custom_cv_2folds(X)  # Initialise the iterable

In [10]:
cross_val_score(clf, X, y, cv=custom_cv)

array([1.        , 0.97333333])
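The generator above yields the same indices for training and testing, and a generator can only be consumed once. As an alternative sketch, the splits can be passed as a plain list of `(train, test)` index pairs with disjoint halves. The shuffling and the names `half_a`, `half_b` and `splits` are assumptions made for this example:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel="linear", C=1)

# Shuffle the indices so both halves contain a mix of classes
# (the iris samples are stored sorted by class).
rng = np.random.RandomState(0)
shuffled = rng.permutation(X_iris.shape[0])
half_a, half_b = shuffled[:75], shuffled[75:]

# A list, unlike a generator, can be iterated more than once.
splits = [(half_a, half_b), (half_b, half_a)]

two_fold_scores = cross_val_score(model, X_iris, y_iris, cv=splits)
print(two_fold_scores)
```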

---

### ⚠️ Warning on preprocessing / data transformation with held out data

Just as it is important to test a predictor on data **held out** from training, preprocessing (such as standardization, feature selection, etc.) and similar data **transformations** should be *learnt from the training set only* and then applied to held-out data for prediction:

In [11]:
from sklearn import preprocessing
print("X.shape:", X.shape)
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.4, 
    random_state=0
)
print("X_train.shape:", X_train.shape)

X.shape: (150, 4)
X_train.shape: (90, 4)


In [12]:
scaler = preprocessing.StandardScaler().fit(X_train)  # Fit scaler on TRAINING set.
X_train_transformed = scaler.transform(X_train)

In [13]:
clf = svm.SVC(C=1).fit(X_train_transformed, y_train)

In [14]:
X_test_transformed = scaler.transform(X_test)

In [15]:
clf.score(X_test_transformed, y_test)

0.9333333333333333

A `Pipeline` makes it **easier** to compose estimators, *providing this behavior under cross-validation*:

In [16]:
from sklearn.pipeline import make_pipeline
clf = make_pipeline(
    preprocessing.StandardScaler(), 
    svm.SVC(C=1)
)
print(type(clf))
cross_val_score(clf, X, y, cv=cv)

<class 'sklearn.pipeline.Pipeline'>


array([0.97777778, 0.93333333, 0.95555556, 0.93333333, 0.97777778])

---

## [The `cross_validate` function and multiple metric evaluation](https://scikit-learn.org/stable/modules/cross_validation.html#the-cross-validate-function-and-multiple-metric-evaluation)

The `cross_validate` function differs from `cross_val_score` in two ways:
1. It allows specifying **multiple metrics** for evaluation.
2. It *returns a `dict`* containing **fit-times**, **score-times** *(and optionally training scores as well as fitted estimators)* in addition to the test score.

Returns:
* For **single metric** evaluation, where the scoring parameter is a *string*, *callable* or `None`, the keys will be - `['test_score', 'fit_time', 'score_time']`
* And for **multiple metric** evaluation, the return value is a `dict` with the following keys - `['test_<scorer1_name>', 'test_<scorer2_name>', 'test_<scorer...>', 'fit_time', 'score_time']`

`return_train_score` is set to `False` by default to save computation time. To evaluate the scores on the training set as well you need to set it to `True`. You may also retain the estimator fitted on each training set by setting `return_estimator=True`.

* 🤔 I believe it's talking about actual time (wall clock time) taken for fitting and scoring here, [see this](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate)...

Multiple metrics can be specified either as a *list, tuple or set of predefined scorer names*, or as a `dict` mapping a scorer name to a predefined or custom scoring function.

In [28]:
from sklearn.model_selection import cross_validate  # <-- This function.
from sklearn.metrics import recall_score

scoring = ['precision_macro', 'recall_macro', 'f1_micro']  # Choose *three* metrics here.

clf = svm.SVC(kernel='linear', C=1, random_state=0)  # Define estimator.

In [29]:
scores = cross_validate(clf, X, y, scoring=scoring)

# Show all the keys of returned `dict`:
sorted(scores.keys())

['fit_time',
 'score_time',
 'test_f1_micro',
 'test_precision_macro',
 'test_recall_macro']

In [30]:
for idx, (k, v) in enumerate(scores.items()):
    print(f"{k}:")
    print(f"> type: {type(v)}")
    print(v)
    print("=" * 80)
    if idx == 1:
        print("\n\n[Actual Scores]:\n")

fit_time:
> type: <class 'numpy.ndarray'>
[0.00093913 0.0009346  0.00081086 0.00087953 0.00082636]
score_time:
> type: <class 'numpy.ndarray'>
[0.00221133 0.0021584  0.00215769 0.00214863 0.00218296]


[Actual Scores]:

test_precision_macro:
> type: <class 'numpy.ndarray'>
[0.96969697 1.         0.96969697 0.96969697 1.        ]
test_recall_macro:
> type: <class 'numpy.ndarray'>
[0.96666667 1.         0.96666667 0.96666667 1.        ]
test_f1_micro:
> type: <class 'numpy.ndarray'>
[0.96666667 1.         0.96666667 0.96666667 1.        ]


In [31]:
# Similar example, but:
# using a dict mapping scorer name to a predefined or custom scoring function.

from sklearn.metrics import make_scorer  # Helper.

scoring = {
    'prec_macro': 'precision_macro',  # Existing scorer named by a string.
    'rec_macro': make_scorer(recall_score, average='macro')  # Custom scorer, here defined via a helper function `make_scorer`.
}

scores = cross_validate(
    clf, 
    X, 
    y, 
    scoring=scoring,
    cv=5, 
    return_train_score=True  # This time, also return train score...
)
print(sorted(scores.keys()))

['fit_time', 'score_time', 'test_prec_macro', 'test_rec_macro', 'train_prec_macro', 'train_rec_macro']


In [32]:
for idx, (k, v) in enumerate(scores.items()):
    print(f"{k}:")
    print(f"> type: {type(v)}")
    print(v)
    print("=" * 80)
    if idx == 1:
        print("\n\n[Actual Scores]:\n")

fit_time:
> type: <class 'numpy.ndarray'>
[0.00091481 0.0006578  0.00063038 0.0006721  0.00066805]
score_time:
> type: <class 'numpy.ndarray'>
[0.00119829 0.00108337 0.00120997 0.00126553 0.00119758]


[Actual Scores]:

test_prec_macro:
> type: <class 'numpy.ndarray'>
[0.96969697 1.         0.96969697 0.96969697 1.        ]
train_prec_macro:
> type: <class 'numpy.ndarray'>
[0.97674419 0.97674419 0.99186992 0.98412698 0.98333333]
test_rec_macro:
> type: <class 'numpy.ndarray'>
[0.96666667 1.         0.96666667 0.96666667 1.        ]
train_rec_macro:
> type: <class 'numpy.ndarray'>
[0.975      0.975      0.99166667 0.98333333 0.98333333]


In [33]:
# Example of `cross_validate` using only a single metric:
scores = cross_validate(clf, X, y, scoring='precision_macro', cv=5, return_estimator=True)
sorted(scores.keys())

['estimator', 'fit_time', 'score_time', 'test_score']

### [Obtaining predictions by cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#obtaining-predictions-by-cross-validation)

The function `cross_val_predict` has a similar interface to `cross_val_score`, but returns:

* **For each element in the input, the prediction that was obtained for that element when it was in the test set**. \[‼️\]

Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).


⚠️ **Warning** Note on inappropriate usage of `cross_val_predict`:
The result of `cross_val_predict` may be different from those obtained using `cross_val_score` as **the elements are grouped in different ways**. The function `cross_val_score` takes an *average over cross-validation folds*, whereas `cross_val_predict` simply *returns the labels (or probabilities) from several distinct models undistinguished*. Thus, `cross_val_predict` is ***not an appropriate measure of generalisation error***.

The function `cross_val_predict` is appropriate for:
* Visualization of predictions obtained from different models.
* Model blending: When predictions of one supervised estimator are used to train another estimator in ensemble methods.
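A minimal sketch of `cross_val_predict` on the iris data, used here for one of the purposes listed above (inspecting out-of-fold predictions). The confusion matrix is just one possible way to look at them and is an addition for illustration:

```python
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X_iris, y_iris = datasets.load_iris(return_X_y=True)
model = svm.SVC(kernel="linear", C=1)

# One prediction per input sample, each made by the model
# for which that sample was in the test fold.
y_pred = cross_val_predict(model, X_iris, y_iris, cv=5)

print(y_pred.shape)
cm = confusion_matrix(y_iris, y_pred)
print(cm)
```

Note that the predictions come from five different fitted models, which is why the guide warns against reading a score computed on `y_pred` as a generalisation estimate.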

## [Cross validation iterators](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators)

The following sections list utilities to generate indices that can be used to generate dataset splits according to different cross validation strategies.

* [Cross-validation iterators for i.i.d. data](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-i-i-d-data):
    1. `KFold`
    2. `RepeatedKFold`
    3. `LeaveOneOut`
    4. `LeavePOut`
    5. `ShuffleSplit`
* [Cross-validation iterators with stratification based on class labels](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels):
    * Useful when there is a large imbalance in the distribution of the target classes.
    6. `StratifiedKFold` (see also `RepeatedStratifiedKFold`)
    7. `StratifiedShuffleSplit`
* [Cross-validation iterators for grouped data](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-for-grouped-data):
    * The i.i.d. assumption is broken if the underlying generative process yields groups of dependent samples.
    * An example is medical data collected from multiple patients, with multiple samples taken from each patient. Such data is likely to be dependent on the individual group; here, the patient id for each sample is its group identifier.
    * ‼️ **In this case we would like to know if a model trained on a particular set of groups generalizes well to the *unseen* groups.** To measure this, we need to ensure that *all the samples in the validation fold come from groups that are not represented at all in the paired training fold*.
    * The following cross-validation splitters can be used to do that. The grouping identifier for the samples is specified via the `groups` parameter.
    8. `GroupKFold`
    9. `LeaveOneGroupOut`
    10. `LeavePGroupsOut`
    11. `GroupShuffleSplit`
* [Predefined Fold-Splits / Validation-Sets](https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets)
    12. `PredefinedSplit`
* [Cross validation of time series data](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-of-time-series-data):
    13. `TimeSeriesSplit`

In [40]:
# 1. KFold 
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for kth, (train, test) in enumerate(kf.split(X)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print()

0th split:
train idx: [2 3] test idx: [0 1]
train set: ['c', 'd'] test set: ['a', 'b']

1th split:
train idx: [0 1] test idx: [2 3]
train set: ['a', 'b'] test set: ['c', 'd']



In [46]:
# 2. RepeatedKFold
# Repeats K-Fold n times. Use it when KFold must be run n times, producing different splits in each repetition.

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
print("X:\n", X, "\n")
random_state = 12883823
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state)

for kth, (train, test) in enumerate(rkf.split(X)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print()

print("> Note K=2 k-fold ran TWICE ==> Total of 4 runs.")

X:
 [[1 2]
 [3 4]
 [1 2]
 [3 4]] 

0th split:
train idx: [2 3] test idx: [0 1]
train set: [array([1, 2]), array([3, 4])] test set: [array([1, 2]), array([3, 4])]

1th split:
train idx: [0 1] test idx: [2 3]
train set: [array([1, 2]), array([3, 4])] test set: [array([1, 2]), array([3, 4])]

2th split:
train idx: [0 2] test idx: [1 3]
train set: [array([1, 2]), array([1, 2])] test set: [array([3, 4]), array([3, 4])]

3th split:
train idx: [1 3] test idx: [0 2]
train set: [array([3, 4]), array([3, 4])] test set: [array([1, 2]), array([1, 2])]

> Note K=2 k-fold ran TWICE ==> Total of 4 runs.


<br/>

Leave One Out (LOO):
* `LeaveOneOut` (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being **the (single) sample left out**. Thus, for $n$ samples, we have $n$ different training sets and $n$ different test sets.
* This cross-validation procedure does not waste much data as only one sample is removed from the training set:

In [55]:
# 3. LeaveOneOut

from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
print("X:", X, "\n")

loo = LeaveOneOut()

for kth, (train, test) in enumerate(loo.split(X)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print()

X: [1, 2, 3, 4] 

0th split:
train idx: [1 2 3] test idx: [0]
train set: [2, 3, 4] test set: [1]

1th split:
train idx: [0 2 3] test idx: [1]
train set: [1, 3, 4] test set: [2]

2th split:
train idx: [0 1 3] test idx: [2]
train set: [1, 2, 4] test set: [3]

3th split:
train idx: [0 1 2] test idx: [3]
train set: [1, 2, 3] test set: [4]



Leave P Out (LPO):

* `LeavePOut` is very similar to `LeaveOneOut` as it creates all the possible training/test sets by removing $p$ samples from the complete set. 
* For $n$ samples, this produces ${n \choose p}$ train-test pairs. 
* **Unlike `LeaveOneOut` and `KFold`, the test sets will *overlap* for $p>1$.**

In [54]:
# 4. LeavePOut
# Leave-2-Out on a dataset with 4 samples: 

from sklearn.model_selection import LeavePOut

X = (1 + np.arange(4)) * 100
print("X:", X, "\n")

lpo = LeavePOut(p=2)

for kth, (train, test) in enumerate(lpo.split(X)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print()

X: [100 200 300 400] 

0th split:
train idx: [2 3] test idx: [0 1]
train set: [300, 400] test set: [100, 200]

1th split:
train idx: [1 3] test idx: [0 2]
train set: [200, 400] test set: [100, 300]

2th split:
train idx: [1 2] test idx: [0 3]
train set: [200, 300] test set: [100, 400]

3th split:
train idx: [0 3] test idx: [1 2]
train set: [100, 400] test set: [200, 300]

4th split:
train idx: [0 2] test idx: [1 3]
train set: [100, 300] test set: [200, 400]

5th split:
train idx: [0 1] test idx: [2 3]
train set: [100, 200] test set: [300, 400]



Random permutations cross-validation a.k.a. Shuffle & Split:

* The `ShuffleSplit` iterator will generate a user defined number of independent train / test dataset splits. 
* Samples are first shuffled and then split into a pair of train and test sets.
* It is possible to control the randomness for reproducibility of the results by explicitly seeding the `random_state` pseudo random number generator.


* `ShuffleSplit` is thus a good alternative to `KFold` cross validation that allows a finer control on the number of iterations and the proportion of samples on each side of the train / test split.

In [57]:
# 5. ShuffleSplit

from sklearn.model_selection import ShuffleSplit

X = np.arange(10)
print("X:", X, "\n")

ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

for kth, (train, test) in enumerate(ss.split(X)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print()

X: [0 1 2 3 4 5 6 7 8 9] 

0th split:
train idx: [9 1 6 7 3 0 5] test idx: [2 8 4]
train set: [9, 1, 6, 7, 3, 0, 5] test set: [2, 8, 4]

1th split:
train idx: [2 9 8 0 6 7 4] test idx: [3 5 1]
train set: [2, 9, 8, 0, 6, 7, 4] test set: [3, 5, 1]

2th split:
train idx: [4 5 1 0 6 9 7] test idx: [2 3 8]
train set: [4, 5, 1, 0, 6, 9, 7] test set: [2, 3, 8]

3th split:
train idx: [2 7 5 8 0 3 4] test idx: [6 1 9]
train set: [2, 7, 5, 8, 0, 3, 4] test set: [6, 1, 9]

4th split:
train idx: [4 1 0 6 8 9 3] test idx: [5 2 7]
train set: [4, 1, 0, 6, 8, 9, 3] test set: [5, 2, 7]



Stratified k-fold:

* `StratifiedKFold` is a variation of k-fold which returns stratified folds: each set contains *approximately the same percentage of samples of each target class* as the complete set.

In [68]:
# 6. StratifiedKFold

from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
print(f"X [{X.shape}]:\n", X[-7:], "\n ...")
print(f"y [{y.shape}]:\n", y[-7:], "...")

print("\n---3 folds---")
print("Showing [<0 count> <1 count>] among y for each split:")

print("\nStratifiedKFold:")
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))

print("\nKFold (NOT stratified, for comparison):")
kf = KFold(n_splits=3)
for train, test in kf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))

X [(50, 1)]:
 [[1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [1.]] 
 ...
y [(50,)]:
 [0 0 1 1 1 1 1] ...

---3 folds---
Showing [<0 count> <1 count>] among y for each split:

StratifiedKFold:
train -  [30  3]   |   test -  [15  2]
train -  [30  3]   |   test -  [15  2]
train -  [30  4]   |   test -  [15  1]

KFold (NOT stratified, for comparison):
train -  [28  5]   |   test -  [17]
train -  [28  5]   |   test -  [17]
train -  [34]   |   test -  [11  5]


<br/>
Stratified Shuffle Split:

* `StratifiedShuffleSplit` is a variation of `ShuffleSplit` which returns stratified splits, i.e., it creates splits that preserve the same percentage for each target class as in the complete set.
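The notebook has no cell for iterator 7, so here is a small sketch in the style of the `StratifiedKFold` cell, reusing the same imbalanced toy labels (45 zeros, 5 ones). The `test_size=0.2` and `n_splits=3` values are assumptions chosen for the example:

```python
# 7. StratifiedShuffleSplit

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X_imb = np.ones((50, 1))
y_imb = np.hstack(([0] * 45, [1] * 5))

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

class_counts = []
for train, test in sss.split(X_imb, y_imb):
    # Count how many samples of each class land on each side of the split.
    counts = (np.bincount(y_imb[train]), np.bincount(y_imb[test]))
    class_counts.append(counts)
    print("train -  {}   |   test -  {}".format(*counts))
```

Each split should keep roughly the 9:1 class ratio of the full label vector on both the train and the test side.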

Group k-fold:

* `GroupKFold` is a variation of k-fold which **ensures that the same group is *not* represented in both testing and training sets**. 
* For example, if the data is obtained from different subjects with several samples per subject, and the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects.
* `GroupKFold` makes it possible to detect this kind of overfitting situation.

In [71]:
# 8. GroupKFold
# Imagine you have three subjects, each with an associated number from 1 to 3:

from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

print("X:     ", X)
print("y:     ", y)
print("groups:", groups)
print()

gkf = GroupKFold(n_splits=3)

for kth, (train, test) in enumerate(
    gkf.split(
        X, 
        y, 
        groups=groups  # Note the specification of `groups` parameter in `.split()` call!
    )
):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print(f"train grp: {[groups[i_] for i_ in train]} test grp: {[groups[i_] for i_ in test]}")
    print()

# Each subject is in a different testing fold, and the same subject is never in both testing and training. 
# Notice that the folds do not have exactly the same size due to the imbalance in the data.

X:      [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y:      ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd']
groups: [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

0th split:
train idx: [0 1 2 3 4 5] test idx: [6 7 8 9]
train set: [0.1, 0.2, 2.2, 2.4, 2.3, 4.55] test set: [5.8, 8.8, 9, 10]
train grp: [1, 1, 1, 2, 2, 2] test grp: [3, 3, 3, 3]

1th split:
train idx: [0 1 2 6 7 8 9] test idx: [3 4 5]
train set: [0.1, 0.2, 2.2, 5.8, 8.8, 9, 10] test set: [2.4, 2.3, 4.55]
train grp: [1, 1, 1, 3, 3, 3, 3] test grp: [2, 2, 2]

2th split:
train idx: [3 4 5 6 7 8 9] test idx: [0 1 2]
train set: [2.4, 2.3, 4.55, 5.8, 8.8, 9, 10] test set: [0.1, 0.2, 2.2]
train grp: [2, 2, 2, 3, 3, 3, 3] test grp: [1, 1, 1]



Leave One Group Out:

* (Like before) `LeaveOneGroupOut` is a cross-validation scheme which holds out the samples according to a third-party provided array of integer groups. 
* (Like before) This group information can be used to encode arbitrary domain specific pre-defined cross-validation folds.
* Each training set is thus constituted by all the samples *except the ones related to a specific group*.

In [72]:
from sklearn.model_selection import LeaveOneGroupOut

X = [1, 5, 10, 50, 60, 70, 80]
y = [0, 1, 1, 2, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3, 3]
print("X:     ", X)
print("y:     ", y)
print("groups:", groups)
print()

logo = LeaveOneGroupOut()

for kth, (train, test) in enumerate(logo.split(X, y, groups=groups)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print(f"train grp: {[groups[i_] for i_ in train]} test grp: {[groups[i_] for i_ in test]}")
    print()

X:      [1, 5, 10, 50, 60, 70, 80]
y:      [0, 1, 1, 2, 2, 2, 2]
groups: [1, 1, 2, 2, 3, 3, 3]

0th split:
train idx: [2 3 4 5 6] test idx: [0 1]
train set: [10, 50, 60, 70, 80] test set: [1, 5]
train grp: [2, 2, 3, 3, 3] test grp: [1, 1]

1th split:
train idx: [0 1 4 5 6] test idx: [2 3]
train set: [1, 5, 60, 70, 80] test set: [10, 50]
train grp: [1, 1, 3, 3, 3] test grp: [2, 2]

2th split:
train idx: [0 1 2 3] test idx: [4 5 6]
train set: [1, 5, 10, 50] test set: [60, 70, 80]
train grp: [1, 1, 2, 2] test grp: [3, 3, 3]



Leave P Groups Out:
* `LeavePGroupsOut` is similar to `LeaveOneGroupOut`, but removes the samples related to $P$ groups for each training/test set.

In [75]:
from sklearn.model_selection import LeavePGroupsOut

X = np.arange(6)
y = [1, 1, 1, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3]
print("X:     ", X)
print("y:     ", y)
print("groups:", groups)
print()

lpgo = LeavePGroupsOut(n_groups=2)

print("... leave 2 groups out ...\n")

for kth, (train, test) in enumerate(lpgo.split(X, y, groups=groups)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print(f"train grp: {[groups[i_] for i_ in train]} test grp: {[groups[i_] for i_ in test]}")
    print()

X:      [0 1 2 3 4 5]
y:      [1, 1, 1, 2, 2, 2]
groups: [1, 1, 2, 2, 3, 3]

... leave 2 groups out ...

0th split:
train idx: [4 5] test idx: [0 1 2 3]
train set: [4, 5] test set: [0, 1, 2, 3]
train grp: [3, 3] test grp: [1, 1, 2, 2]

1th split:
train idx: [2 3] test idx: [0 1 4 5]
train set: [2, 3] test set: [0, 1, 4, 5]
train grp: [2, 2] test grp: [1, 1, 3, 3]

2th split:
train idx: [0 1] test idx: [2 3 4 5]
train set: [0, 1] test set: [2, 3, 4, 5]
train grp: [1, 1] test grp: [2, 2, 3, 3]



Group Shuffle Split:
 
* The `GroupShuffleSplit` iterator behaves as a combination of `ShuffleSplit` and `LeavePGroupsOut`, and generates a sequence of randomized partitions in which a *subset of groups* is held out for *each split*.


* This class is useful when the behavior of `LeavePGroupsOut` is desired, but the number of groups is large enough that generating all possible partitions with groups withheld would be prohibitively expensive. In such a scenario, `GroupShuffleSplit` provides a random sample (with replacement) of the train / test splits generated by `LeavePGroupsOut`.

In [76]:
from sklearn.model_selection import GroupShuffleSplit

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
y = ["a", "b", "b", "b", "c", "c", "c", "a"]
groups = [1, 1, 2, 2, 3, 3, 4, 4]
print("X:     ", X)
print("y:     ", y)
print("groups:", groups)
print()

gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)

for kth, (train, test) in enumerate(gss.split(X, y, groups=groups)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print(f"train grp: {[groups[i_] for i_ in train]} test grp: {[groups[i_] for i_ in test]}")
    print()

X:      [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
y:      ['a', 'b', 'b', 'b', 'c', 'c', 'c', 'a']
groups: [1, 1, 2, 2, 3, 3, 4, 4]

0th split:
train idx: [0 1 2 3] test idx: [4 5 6 7]
train set: [0.1, 0.2, 2.2, 2.4] test set: [2.3, 4.55, 5.8, 0.001]
train grp: [1, 1, 2, 2] test grp: [3, 3, 4, 4]

1th split:
train idx: [2 3 6 7] test idx: [0 1 4 5]
train set: [2.2, 2.4, 5.8, 0.001] test set: [0.1, 0.2, 2.3, 4.55]
train grp: [2, 2, 4, 4] test grp: [1, 1, 3, 3]

2th split:
train idx: [2 3 4 5] test idx: [0 1 6 7]
train set: [2.2, 2.4, 2.3, 4.55] test set: [0.1, 0.2, 5.8, 0.001]
train grp: [2, 2, 3, 3] test grp: [1, 1, 4, 4]

3th split:
train idx: [4 5 6 7] test idx: [0 1 2 3]
train set: [2.3, 4.55, 5.8, 0.001] test set: [0.1, 0.2, 2.2, 2.4]
train grp: [3, 3, 4, 4] test grp: [1, 1, 2, 2]



<br/>
Predefined Fold-Splits / Validation-Sets:

* For some datasets, a pre-defined split of the data into training and validation folds, or into several cross-validation folds, already exists.
* Using [`PredefinedSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit) it is possible to use these folds e.g. when searching for hyperparameters.


* For example, when using a validation set, set the `test_fold` to 0 for all samples that are part of the validation set, and to -1 for all other samples.
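The notebook has no cell for iterator 12, so the following sketch shows the validation-set case described above on five hypothetical samples. The `test_fold` values are made up for the example:

```python
# 12. PredefinedSplit

import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1: sample is always in the training set; 0: sample belongs to the (single) validation fold.
test_fold = np.array([-1, -1, -1, 0, 0])
ps = PredefinedSplit(test_fold=test_fold)

print("number of splits:", ps.get_n_splits())

predefined_splits = list(ps.split())
for train, test in predefined_splits:
    print(f"train idx: {train} test idx: {test}")
```

With a single validation fold there is exactly one split; using several distinct non-negative values in `test_fold` would yield one split per value.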

#### [Using cross-validation iterators to split train and test](https://scikit-learn.org/stable/modules/cross_validation.html#using-cross-validation-iterators-to-split-train-and-test):

* The above group cross-validation functions may also be useful for splitting a dataset into training and testing subsets. 
* Note that the convenience function `train_test_split` is *a wrapper around `ShuffleSplit`* and thus only allows for stratified splitting (using the class labels) and *cannot account for groups*.
* To perform the train and test split, use the indices for the train and test subsets yielded by the generator output by the `split()` method of the cross-validation splitter. 

For example:

In [79]:
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001])
y = np.array(["a", "b", "b", "b", "c", "c", "c", "a"])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
print("X:     ", X)
print("y:     ", y)
print("groups:", groups)
print()

# Get indices from a .split() call of GroupShuffleSplit, use next() as it's a generator. 
train_indx, test_indx = next(
    GroupShuffleSplit(random_state=7).split(X, y, groups)
)

# Apply indices to data:
X_train, X_test, y_train, y_test = \
    X[train_indx], X[test_indx], y[train_indx], y[test_indx]
print("Train and test shapes:")
print(X_train.shape, X_test.shape)

print("Train and test unique groups:")
print(np.unique(groups[train_indx]), np.unique(groups[test_indx]))

X:      [1.00e-01 2.00e-01 2.20e+00 2.40e+00 2.30e+00 4.55e+00 5.80e+00 1.00e-03]
y:      ['a' 'b' 'b' 'b' 'c' 'c' 'c' 'a']
groups: [1 1 2 2 3 3 4 4]

Train and test shapes:
(6,) (2,)
Train and test unique groups:
[1 2 4] [3]


#### Cross validation of time series data

1. Time series data is characterised by the **correlation between observations that are near in time (autocorrelation)**. 
2. However, classical cross-validation techniques such as `KFold` and `ShuffleSplit` assume the samples are *independent and identically distributed*, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. 
3. Therefore, it is very important to evaluate our model **for time series data on the “future” observations least like those that are used to train the model**. 
4. To achieve this, one solution is provided by `TimeSeriesSplit`.

Time Series Split:

* `TimeSeriesSplit` is a variation of k-fold which **returns the first $k$ folds as the train set and the $(k+1)$th fold as the test set**. 
* Note that unlike standard cross-validation methods, **successive training sets are *supersets* of those that come before them**. 
* Also, it adds all surplus data to the first training partition, which is always used to train the model.
* This class can be used to cross-validate time series data samples that are observed at fixed time intervals.

In [82]:
# 13. TimeSeriesSplit
# Example of 3-split time series cross-validation on a dataset with 6 samples:

from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
print("X:\n", X)
print("y:\n", y)
print()

tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

for kth, (train, test) in enumerate(tscv.split(X)):
    print(f"{kth}th split:")
    print(f"train idx: {train} test idx: {test}")
    print(f"train set: {[X[i_] for i_ in train]} test set: {[X[i_] for i_ in test]}")
    print()

X:
 [[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]]
y:
 [1 2 3 4 5 6]

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None)
0th split:
train idx: [0 1 2] test idx: [3]
train set: [array([1, 2]), array([3, 4]), array([5, 6])] test set: [array([7, 8])]

1th split:
train idx: [0 1 2 3] test idx: [4]
train set: [array([1, 2]), array([3, 4]), array([5, 6]), array([7, 8])] test set: [array([ 9, 10])]

2th split:
train idx: [0 1 2 3 4] test idx: [5]
train set: [array([1, 2]), array([3, 4]), array([5, 6]), array([7, 8]), array([ 9, 10])] test set: [array([11, 12])]
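
The properties listed above can also be checked programmatically. The sketch below uses a hypothetical 10-sample array (`X_ts` is not part of the original example) so that the surplus samples absorbed by the first training partition are visible:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series: 10 samples observed at fixed intervals.
X_ts = np.arange(10).reshape(-1, 1)

ts_splits = list(TimeSeriesSplit(n_splits=3).split(X_ts))

# Every training index precedes every test index (no look-ahead).
train_before_test = all(tr.max() < te.min() for tr, te in ts_splits)

# Each training set is a strict superset of the previous one.
nested_train_sets = all(
    set(prev_tr) < set(next_tr)
    for (prev_tr, _), (next_tr, _) in zip(ts_splits, ts_splits[1:])
)

train_sizes = [len(tr) for tr, _ in ts_splits]
test_indices = [te.tolist() for _, te in ts_splits]

print("train before test:", train_before_test)  # True
print("nested train sets:", nested_train_sets)  # True
print("train sizes:", train_sizes)              # [4, 6, 8]
print("test indices:", test_indices)            # [[4, 5], [6, 7], [8, 9]]
```

With 10 samples and 3 splits, each test fold holds 2 samples, and the first training set holds 4 samples rather than 2 because it receives the surplus.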



### [A note on shuffling](https://scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling)

* If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result.
* However, the opposite may be true if the samples are *not* independently and identically distributed. 
    * For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.


* Some cross validation iterators, such as `KFold`, have an *inbuilt option to shuffle the data indices before splitting them*. 

**` ` 💡 Note that: ` `**
* This consumes less memory than shuffling the data directly.
* *By default no shuffling occurs*, including for the (stratified) K fold cross-validation performed by specifying `cv=some_integer` to `cross_val_score`, grid search, etc. (Keep in mind that `train_test_split` still returns a random split.)
* The `random_state` parameter defaults to `None`, meaning that the shuffling will be different every time `KFold(..., shuffle=True)` is iterated. 
    * However, `GridSearchCV` will use the same shuffling for each set of parameters validated by a single call to its fit method.
    * To get identical results for each split, set `random_state` to an integer.
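
A small sketch of these points, assuming a toy array of 10 samples (the variable names are illustrative only): without shuffling the first test fold is simply the first block of indices, while two shuffled splitters built with the same integer `random_state` produce identical folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10)

# Default: no shuffling, so folds are contiguous blocks of indices.
first_test_default = next(KFold(n_splits=5).split(X_toy))[1].tolist()
print("first test fold without shuffling:", first_test_default)  # [0, 1]

# Shuffling with a fixed integer seed is reproducible across splitters.
kf_a = KFold(n_splits=5, shuffle=True, random_state=0)
kf_b = KFold(n_splits=5, shuffle=True, random_state=0)
folds_a = [test.tolist() for _, test in kf_a.split(X_toy)]
folds_b = [test.tolist() for _, test in kf_b.split(X_toy)]
print("same folds with same random_state:", folds_a == folds_b)  # True

# Shuffling permutes indices only; every sample is still tested exactly once.
all_tested = sorted(i for fold in folds_a for i in fold)
print("each sample tested once:", all_tested == X_toy.tolist())  # True
```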

### [Cross validation and model selection](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-and-model-selection)

Cross validation iterators can also be used to directly perform model selection using **Grid Search** for the optimal hyperparameters of the model. 

This is the topic of the next section of the User Guide: [Tuning the hyper-parameters of an estimator](https://scikit-learn.org/stable/modules/grid_search.html#grid-search).
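
As a brief illustration of passing a cross-validation iterator to a grid search, the sketch below reuses the iris data and the linear SVM from the start of this notebook. The candidate values for `C` and the choice of `StratifiedKFold` with 5 shuffled folds are assumptions made for the example, not recommendations:

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X_iris, y_iris = datasets.load_iris(return_X_y=True)

# The splitter object is passed directly as `cv`.
cv_iter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative grid; these values are arbitrary.
param_grid = {"C": [0.1, 1, 10]}

search = GridSearchCV(svm.SVC(kernel="linear"), param_grid, cv=cv_iter)
search.fit(X_iris, y_iris)

print("best params:", search.best_params_)
print("best mean CV accuracy:", search.best_score_)
```

Every candidate in `param_grid` is scored on the same folds, so the mean scores are directly comparable.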

### [Permutation test score](https://scikit-learn.org/stable/modules/cross_validation.html#permutation-test-score)

[`permutation_test_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.permutation_test_score.html#sklearn.model_selection.permutation_test_score) offers another way to evaluate the performance of classifiers.

It provides a permutation-based p-value, which estimates how likely it is that the observed performance of the classifier would be obtained by chance. 

The null hypothesis in this test is that the classifier fails to leverage any statistical dependency between the features and the labels to make correct predictions on left out data. 

* `permutation_test_score` generates a null distribution by calculating `n_permutations` different permutations of the data. 
* In each permutation the labels are randomly shuffled, thereby removing any dependency between the features and the labels. 
* The p-value output is the fraction of permutations for which the cross-validation score obtained on the shuffled labels is at least as good as the score obtained on the original data. More precisely, scikit-learn computes it as `(C + 1) / (n_permutations + 1)`, where `C` is the number of such permutations.

**For reliable results `n_permutations` should typically be larger than 100 and `cv` between 3 and 10 folds.**
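
A minimal sketch of the procedure on the iris data with a linear SVM follows. The choices of 200 permutations and 5 stratified folds are assumptions made for the example. The last lines recompute the p-value from the returned permutation scores using the formula given in the scikit-learn API reference, `(C + 1) / (n_permutations + 1)`, where `C` counts the permutations scoring at least as well as the original data:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, permutation_test_score

X_iris, y_iris = datasets.load_iris(return_X_y=True)
clf_perm = svm.SVC(kernel="linear", C=1)
cv_perm = StratifiedKFold(n_splits=5)

score, perm_scores, pvalue = permutation_test_score(
    clf_perm,
    X_iris,
    y_iris,
    cv=cv_perm,
    n_permutations=200,
    scoring="accuracy",
    random_state=0,
)

print("score on original labels:", score)
print("mean score on shuffled labels:", perm_scores.mean())
print("p-value:", pvalue)

# Recompute the p-value by hand from the null distribution.
n_at_least_as_good = np.sum(perm_scores >= score)
manual_pvalue = (n_at_least_as_good + 1) / (len(perm_scores) + 1)
print("manual p-value:", manual_pvalue)
```

A small p-value suggests that the classifier is picking up a real dependency between features and labels. A large one suggests that its score could plausibly have arisen by chance.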

For more details, see the link in the section title.