#### The following examples are taken from the official scikit-learn documentation

In [1]:
import numpy as np;
from sklearn.model_selection import train_test_split;
from sklearn import datasets;
from sklearn import svm;

In [2]:
X, y = datasets.load_iris(return_X_y=True);
print(f"X.shape = {X.shape}, y.shape = {y.shape}");

X.shape = (150, 4), y.shape = (150,)


In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0);
print(f"X_train.shape = {X_train.shape}, y_train.shape = {y_train.shape}");
print(f"X_test.shape = {X_test.shape}, y_test.shape = {y_test.shape}");

X_train.shape = (90, 4), y_train.shape = (90,)
X_test.shape = (60, 4), y_test.shape = (60,)


In [4]:
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train);
print(f"score = {clf.score(X_test, y_test)}");

score = 0.9666666666666667


In k-fold cross-validation (CV), the training set is split into k smaller sets. The following procedure is followed for each of the k "folds":

* A model is trained using `k - 1` of the folds as training data.
* The resulting model is validated on the remaining fold, which serves as a test set to compute a performance measure such as accuracy.

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop.
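
The loop described above can be written out by hand. The following is a minimal sketch, assuming the iris data and a linear `SVC`; the names `kf` and `fold_scores` are illustrative. It uses a shuffled `KFold`, so the numbers need not match those of `cross_val_score`, which applies stratified folds to classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

X, y = datasets.load_iris(return_X_y=True)

# Shuffle before splitting because the iris samples are ordered by class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k - 1 folds ...
    clf = svm.SVC(kernel='linear', C=1).fit(X[train_idx], y[train_idx])
    # ... and validate on the single held-out fold.
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
print(f"per-fold accuracy: {fold_scores}")
print(f"mean accuracy: {fold_scores.mean():.3f}")
```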

In [5]:
# use cross_val_score
from sklearn.model_selection import cross_val_score;
clf = svm.SVC(kernel='linear', C=1, random_state=42);
scores = cross_val_score(clf, X, y, cv=5);
print(f"scores: {scores}");
print(f"{scores.mean()} accuracy with a standard deviation of {scores.std()}");

scores: [0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001 accuracy with a standard deviation of 0.016329931618554516


In [6]:
# The score computed at each CV iteration is the score method of the estimator.
# It is possible to change this by using the scoring parameter:
from sklearn import metrics
scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro');
print(f"scores: {scores}");
print(f"{scores.mean()} f1_macro with a standard deviation of {scores.std()}");

scores: [0.96658312 1.         0.96658312 0.96658312 1.        ]
0.9799498746867169 f1_macro with a standard deviation of 0.016370858765468226


In [7]:
# to use other cross validation strategies, pass a cross validation iterator
from sklearn.model_selection import ShuffleSplit;
#n_samples = X.shape[0];
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0);
print(f"score: {cross_val_score(clf, X, y, cv=cv)}");

score: [0.97777778 0.97777778 1.         0.95555556 1.        ]


In [8]:
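# another option is to pass an iterable yielding (train, test) index arrays.
# Note: this toy generator yields the same indices for train and test, so each
# score is measured on the very samples the model was trained on.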
def custom_cv_2folds(X: np.ndarray):
    n = X.shape[0];
    i = 1;
    while i <= 2:
        idx = np.arange(n * (i - 1) / 2, n * i / 2, dtype=int);
        yield idx, idx;
        i += 1;

custom_cv = custom_cv_2folds(X);
print(f"scores: {cross_val_score(clf, X, y, cv=custom_cv)}");

scores: [1.         0.97333333]


The `cross_validate` function differs from `cross_val_score` in two ways:

* It allows specifying multiple metrics for evaluation.
* It returns a dict containing fit times, score times, and test scores (and optionally training scores, fitted estimators, and train-test split indices).


In [9]:
# multiple metrics
from sklearn.model_selection import cross_validate;
from sklearn.metrics import recall_score;
scoring = ['precision_macro', 'recall_macro'];
clf = svm.SVC(kernel='linear', C=1, random_state=0);
scores = cross_validate(clf, X, y, scoring=scoring);
sort = sorted(scores.keys());
print(f"calculated scores: {sort}");

calculated scores: ['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro']


In [10]:
for key in scores.keys():
    print(f"{key}: {scores[key]}");

fit_time: [0.0025897  0.00168729 0.00174332 0.0015254  0.00142384]
score_time: [0.00719547 0.00505853 0.00430751 0.00493288 0.00613737]
test_precision_macro: [0.96969697 1.         0.96969697 0.96969697 1.        ]
test_recall_macro: [0.96666667 1.         0.96666667 0.96666667 1.        ]


In [11]:
from sklearn.metrics import make_scorer;
scoring = {'prec_macro': 'precision_macro',
           'rec_macro': make_scorer(recall_score, average='macro')};
scores = cross_validate(clf, X, y, scoring=scoring, cv=5, return_train_score=True, return_indices=True);
for key in scores.keys():
    print(f"{key}: {scores[key]}");

fit_time: [0.00212765 0.001472   0.00154805 0.00139213 0.00135851]
score_time: [0.00445461 0.0047586  0.00423384 0.00459552 0.00437593]
indices: {'train': (array([ 10,  11,  12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,
        23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,
        49,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,
        72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,
        85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
        98,  99, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120,
       121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133,
       134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
       147, 148, 149]), array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  20,  21,  22,
        23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  

**Obtaining predictions by cross-validation**

The function `cross_val_predict` has a similar interface to `cross_val_score`, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used.

The result of `cross_val_predict` is not an appropriate measure of generalization error, but the function is appropriate for:

* Visualization of predictions obtained from different models.
* Model blending: When predictions of one supervised estimator are used to train another estimator in ensemble methods.
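
A minimal sketch of the first use case, assuming the iris data and a linear `SVC`: the out-of-fold predictions are collected with `cross_val_predict` and summarized in a confusion matrix for inspection. The variable name `y_pred` is illustrative.

```python
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Each sample is predicted exactly once, by a model fitted on the folds
# that do not contain that sample.
y_pred = cross_val_predict(clf, X, y, cv=5)

print(f"y_pred.shape = {y_pred.shape}")
print(confusion_matrix(y, y_pred))
```

The confusion matrix is shown here for visualization only; it mixes predictions from five different fitted models.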

In [16]:
# KFold
# Example of 2-fold cross-validation on a dataset with 4 samples
from sklearn.model_selection import KFold;

X = ["a", "b", "c", "d"];
kf = KFold(n_splits=2);
for train, test in kf.split(X):
    print("%s %s" % (train, test));

# Each fold consists of two arrays: the first one holds the indices of the training set,
# and the second one those of the test set. Thus, one can create the training/test sets
# using numpy indexing (here `train` and `test` still hold the indices of the last fold):
X = np.array([[0, 0], [1, 1], [-1, -1], [2, 2]]);
y = np.array([0, 1, 0, 1]);
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test];
print(f"X_train: \n{X_train}");
print(f"X_test: \n{X_test}");

[2 3] [0 1]
[0 1] [2 3]
X_train: 
[[0 0]
 [1 1]]
X_test: 
[[-1 -1]
 [ 2  2]]


In [18]:
# Repeated K-Fold
# It repeats KFold n times, producing different splits in each repetition
from sklearn.model_selection import RepeatedKFold;

X = np.array([[1, 2], [3, 2], [1, 2], [3, 4]]);
random_state = 1288;
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=random_state);
for train, test in rkf.split(X):
    print("%s %s" % (train, test));

[0 1] [2 3]
[2 3] [0 1]
[1 3] [0 2]
[0 2] [1 3]


In [20]:
# Leave One Out (LOO)
# LOO is a simple cross-validation scheme. Each learning set is created by taking all the samples except one,
# the test set being the sample left out. Thus, for n samples, we have n different training sets
# and n different test sets. This cross-validation procedure does not waste much data as only one
# sample is removed from the training set:

from sklearn.model_selection import LeaveOneOut;

X = [1, 2, 3, 4];
loo = LeaveOneOut();
for train, test in loo.split(X):
    print("%s %s" % (train, test));

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]


In [21]:
# Leave P Out (LPO)
# It creates all the possible training/test sets by removing `p` samples from the complete set.
# For `n` samples, this produces C(n, p) = n! / (p! * (n - p)!) train-test pairs. Unlike
# LeaveOneOut and KFold, the test sets will overlap for p > 1.
from sklearn.model_selection import LeavePOut;

X = np.ones(4);
lpo = LeavePOut(p=2);
for train, test in lpo.split(X):
    print("%s %s" % (train, test));

[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]


**Stratification**

In the case of rare classes, cross-validation splitting can generate train or validation folds without any occurrence of a particular class. This typically leads to undefined classification metrics (e.g. `ROC AUC`), exceptions raised when attempting to call `fit`, or missing columns in the output of the `predict_proba` or `decision_function` methods of multiclass classifiers trained on different folds.

To mitigate such problems, splitters such as `StratifiedKFold` and `StratifiedShuffleSplit` implement stratified sampling to ensure that relative class frequencies are approximately preserved in each fold.

In [27]:
# StratifiedKFold is a variation of K-fold which returns stratified folds:
# each set contains approximately the same percentage of samples of each target class
# as the complete set.
from sklearn.model_selection import StratifiedKFold;

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5));
skf = StratifiedKFold(n_splits=3);
print("StratifiedKFold");
for train, test in skf.split(X, y):
    print('train - {} | test - {}'.format(np.bincount(y[train]), np.bincount(y[test])));

print("\nKFold");
kf = KFold(n_splits=3);
for train, test in kf.split(X, y):
    print('train - {} | test - {}'.format(np.bincount(y[train]), np.bincount(y[test])));

StratifiedKFold
train - [30  3] | test - [15  2]
train - [30  3] | test - [15  2]
train - [30  4] | test - [15  1]

KFold
train - [28  5] | test - [17]
train - [28  5] | test - [17]
train - [34] | test - [11  5]


---------------------------------------------------------------

#### Tuning the hyper-parameters of an estimator

Two generic approaches to parameter search are provided in scikit-learn: for given values, `GridSearchCV` exhaustively considers all parameter combinations, while `RandomizedSearchCV` can sample a given number of candidates from a parameter space with a specified distribution. Both tools have successive halving counterparts, `HalvingGridSearchCV` and `HalvingRandomSearchCV`, which can be much faster at finding a good parameter combination.


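**Exhaustive Grid Search**

`GridSearchCV` exhaustively generates candidates from a grid of parameter values specified with the `param_grid` parameter. For example, the grid below defines two sub-grids to explore: one with a linear kernel, and one with an RBF kernel whose `C` values are crossed with two `gamma` values.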

```py
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
```

The `GridSearchCV` instance implements the usual estimator API: when "fitting" it on a dataset,
all the possible combinations of parameter values are evaluated and the best combination is retained.
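
As a rough sketch of that fit-and-retain behaviour, the grid from this section can be passed to `GridSearchCV` with an `SVC` on the iris data. The choice of dataset, `cv=5`, and the variable name `search` are assumptions made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Fitting evaluates every combination in the grid with 5-fold CV.
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)

# 4 linear candidates + 4 * 2 RBF candidates
print(f"candidates evaluated: {len(search.cv_results_['params'])}")
print(f"best params: {search.best_params_}")
print(f"best mean CV score: {search.best_score_:.3f}")
```

After fitting, `search` behaves like the best estimator refitted on the whole dataset (the default `refit=True`), so `search.predict` can be called directly.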

**Randomized Parameter Optimization**

`RandomizedSearchCV` implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values. This has two main benefits over an exhaustive search:

* A budget can be chosen independently of the number of parameters and possible values.
* Adding parameters that do not influence the performance does not decrease efficiency.

Specifying how parameters should be sampled is done using a dictionary, very similar to specifying parameters for `GridSearchCV`. Additionally, a computation budget, being the number of sampled candidates or sampling iterations, is specified using the `n_iter` parameter. For each parameter, either a distribution over possible values or a list of discrete choices (which will be sampled uniformly) can be specified:

```py
{'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
    'kernel': ['rbf'], 'class_weight': ['balanced', None]}
```
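
A minimal sketch of how such a dictionary might be used, assuming the iris data, an `SVC` estimator, and a budget of `n_iter=10`. The name `param_distributions` matches the corresponding `RandomizedSearchCV` argument; the other choices are illustrative.

```python
import scipy.stats
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

X, y = datasets.load_iris(return_X_y=True)

param_distributions = {
    'C': scipy.stats.expon(scale=100),     # continuous distribution
    'gamma': scipy.stats.expon(scale=.1),  # continuous distribution
    'kernel': ['rbf'],                     # discrete choice, sampled uniformly
    'class_weight': ['balanced', None],    # discrete choice, sampled uniformly
}

# n_iter fixes the budget: exactly 10 candidates are sampled and evaluated.
search = RandomizedSearchCV(
    svm.SVC(), param_distributions, n_iter=10, cv=5, random_state=0
)
search.fit(X, y)

print(f"candidates evaluated: {len(search.cv_results_['params'])}")
print(f"best params: {search.best_params_}")
```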


**Successive Halving (SH)**

Successive halving is like a tournament among candidate parameter combinations. It is an iterative selection process where all candidates (the parameter combinations) are evaluated with a small amount of resources at the first iteration. Only some of these candidates are selected for the next iteration, where they are allocated more resources. For parameter tuning, the resource is typically the number of training samples, but it can also be an arbitrary numeric parameter such as `n_estimators` in a random forest.
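
The tournament can be observed with `HalvingGridSearchCV`, which is still experimental in scikit-learn and has to be enabled through an explicit import. The sketch below assumes the iris data, an `SVC`, a hypothetical 3 x 3 grid, and `factor=3` (roughly one third of the candidates survive each round); the resource is the default, the number of training samples.

```python
# The halving estimators are experimental and must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn import datasets, svm
from sklearn.model_selection import HalvingGridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Illustrative grid: 3 * 3 = 9 candidates.
param_grid = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1], 'kernel': ['rbf']}

search = HalvingGridSearchCV(
    svm.SVC(), param_grid, factor=3, cv=5, random_state=0
)
search.fit(X, y)

# Fewer candidates and more samples at each successive iteration.
print(f"candidates per iteration: {search.n_candidates_}")
print(f"samples per iteration:    {search.n_resources_}")
print(f"best params: {search.best_params_}")
```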