# Implementing Cross-Validation in Python

![train_test](/Users/marioht/Dropbox/EDHEC/2022/COURSES/S2/Prediction_&_Sequential_Investment_Strategies/Presentations/2nd_FINAL/train_test.png)

![overfitting](/Users/marioht/Dropbox/EDHEC/2022/COURSES/S2/Prediction_&_Sequential_Investment_Strategies/Presentations/2nd_FINAL/train_validation_holdout.png)

Mock dataset

In [1]:
data = list(range(1, 11))
data

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Scikit-learn's CV functionality can be imported from `sklearn.model_selection`.
For a single split of your data into a training and a test set, use `train_test_split`. Its `shuffle` parameter, which defaults to `True`, randomizes the selection of observations. You can make the split reproducible by seeding the random number generator through `random_state`. There is also a `stratify` parameter, which ensures for a classification problem that the train and test sets contain approximately the same proportion of each class. The imports and a single split look as follows:

In [2]:
from sklearn.model_selection import (train_test_split, 
                                     KFold, 
                                     LeaveOneOut,
                                     LeavePOut, 
                                     ShuffleSplit,
                                     TimeSeriesSplit)

In [3]:
train_test_split(data, train_size = 0.8)

[[8, 4, 6, 7, 2, 3, 9, 1], [10, 5]]

In the run shown above, we train a model using all data except the values 10 and 5 (row indices 9 and 4), which are held out to generate predictions and measure the errors against the known labels. Because no `random_state` is set, the selection changes from run to run. This method is useful for quick evaluation but is sensitive to the particular split, so the standard error of the performance estimate will be higher than with methods that average over several splits.
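
The `random_state` and `stratify` parameters described above can be illustrated with a minimal sketch. The `labels` list is a hypothetical binary target invented for this example, and the 50/50 split is chosen only so that the class proportions divide evenly:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten observations with an imbalanced binary label.
data = list(range(1, 11))
labels = [0] * 6 + [1] * 4

# random_state fixes the shuffle; stratify keeps the 60/40 class mix in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, train_size=0.5, random_state=42, stratify=labels
)
print(sorted(y_train), sorted(y_test))

# Repeating the call with the same seed reproduces the same split.
repeat = train_test_split(
    data, labels, train_size=0.5, random_state=42, stratify=labels
)
print(repeat[1] == X_test)
```

Both halves should contain three observations of class 0 and two of class 1, and the repeated call should return an identical test set.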

## KFold iterator

The `KFold` iterator produces several disjoint splits and assigns each of these splits once to the validation set, as shown in the following code:

In [4]:
kf = KFold(n_splits = 5)

In [5]:
kf

KFold(n_splits=5, random_state=None, shuffle=False)

In [6]:
for train, validate in kf.split(data):
    print(train, validate)

[2 3 4 5 6 7 8 9] [0 1]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 6 7] [8 9]
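
To make the mechanics of the unshuffled case concrete, the following pure-Python sketch generates contiguous folds by hand. The helper name `kfold_indices` is hypothetical and this is not scikit-learn's implementation, only an illustration that should yield the same index pattern as the output above:

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, validate) index lists for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        validate = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validate
        start = stop


for train, validate in kfold_indices(n_samples=10, n_splits=5):
    print(train, validate)
```

Every index appears in exactly one validation fold, which is what makes the splits disjoint.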


In addition to the number of splits, most CV objects take a `shuffle` argument that ensures randomization. To render results reproducible, set `random_state` as follows:

In [7]:
kf = KFold(n_splits = 5, shuffle = True, random_state = 42)
for train, validate in kf.split(data):
    print(train, validate)

[0 2 3 4 5 6 7 9] [1 8]
[1 2 3 4 6 7 8 9] [0 5]
[0 1 3 4 5 6 8 9] [2 7]
[0 1 2 3 5 6 7 8] [4 9]
[0 1 2 4 5 7 8 9] [3 6]
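
In practice a splitter like this is usually handed to an evaluation routine rather than iterated over by hand. The sketch below passes a shuffled `KFold` to `cross_val_score`; the synthetic linear data and the choice of `LinearRegression` are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration): a noisy linear relationship.
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(scale=1.0, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One R^2 score per fold; the mean and spread summarize out-of-sample fit.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores.mean(), scores.std())
```

Averaging over the five folds gives a more stable performance estimate than a single train-test split.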


## Leave-one-out CV

The original CV implementation used a __leave-one-out method__ that used each observation once as the validation set, as shown in the following code:

In [8]:
loo = LeaveOneOut()
for train, validate in loo.split(data):
    print(train, validate)

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]


This maximizes the number of models that are trained, which increases computational costs. While the validation sets do not overlap, the overlap of training sets is maximized, driving up the correlation of models and their prediction errors. As a result, the variance of the estimated prediction error tends to be higher when the number of folds is larger.
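
The overlap argument can be checked numerically. The sketch below uses a hypothetical helper, `mean_train_overlap`, to compute the average share of training indices that two splits have in common, and compares leave-one-out with 5-fold CV on the ten mock observations:

```python
from itertools import combinations

from sklearn.model_selection import KFold, LeaveOneOut

data = list(range(1, 11))


def mean_train_overlap(cv, samples):
    """Average share of training indices that two splits have in common."""
    train_sets = [set(train.tolist()) for train, _ in cv.split(samples)]
    shares = [len(a & b) / len(a) for a, b in combinations(train_sets, 2)]
    return sum(shares) / len(shares)


loo_overlap = mean_train_overlap(LeaveOneOut(), data)
kfold_overlap = mean_train_overlap(KFold(n_splits=5), data)
print(loo_overlap, kfold_overlap)
```

Any two leave-one-out training sets share 8 of their 9 indices, whereas two 5-fold training sets share 6 of 8, so the leave-one-out models are fitted on far more similar data.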

## Leave-P-Out CV

A close relative of leave-one-out CV is leave-P-out CV, which uses every possible combination of `p` data rows as a validation set, as shown in the following code:

In [9]:
lpo = LeavePOut(p = 2)
for train, validate in lpo.split(data):
    print(train, validate)


[2 3 4 5 6 7 8 9] [0 1]
[1 3 4 5 6 7 8 9] [0 2]
[1 2 4 5 6 7 8 9] [0 3]
[1 2 3 5 6 7 8 9] [0 4]
[1 2 3 4 6 7 8 9] [0 5]
[1 2 3 4 5 7 8 9] [0 6]
[1 2 3 4 5 6 8 9] [0 7]
[1 2 3 4 5 6 7 9] [0 8]
[1 2 3 4 5 6 7 8] [0 9]
[0 3 4 5 6 7 8 9] [1 2]
[0 2 4 5 6 7 8 9] [1 3]
[0 2 3 5 6 7 8 9] [1 4]
[0 2 3 4 6 7 8 9] [1 5]
[0 2 3 4 5 7 8 9] [1 6]
[0 2 3 4 5 6 8 9] [1 7]
[0 2 3 4 5 6 7 9] [1 8]
[0 2 3 4 5 6 7 8] [1 9]
[0 1 4 5 6 7 8 9] [2 3]
[0 1 3 5 6 7 8 9] [2 4]
[0 1 3 4 6 7 8 9] [2 5]
[0 1 3 4 5 7 8 9] [2 6]
[0 1 3 4 5 6 8 9] [2 7]
[0 1 3 4 5 6 7 9] [2 8]
[0 1 3 4 5 6 7 8] [2 9]
[0 1 2 5 6 7 8 9] [3 4]
[0 1 2 4 6 7 8 9] [3 5]
[0 1 2 4 5 7 8 9] [3 6]
[0 1 2 4 5 6 8 9] [3 7]
[0 1 2 4 5 6 7 9] [3 8]
[0 1 2 4 5 6 7 8] [3 9]
[0 1 2 3 6 7 8 9] [4 5]
[0 1 2 3 5 7 8 9] [4 6]
[0 1 2 3 5 6 8 9] [4 7]
[0 1 2 3 5 6 7 9] [4 8]
[0 1 2 3 5 6 7 8] [4 9]
[0 1 2 3 4 7 8 9] [5 6]
[0 1 2 3 4 6 8 9] [5 7]
[0 1 2 3 4 6 7 9] [5 8]
[0 1 2 3 4 6 7 8] [5 9]
[0 1 2 3 4 5 8 9] [6 7]
[0 1 2 3 4 5 7 9] [6 8]
[0 1 2 3 4 5 7 8] [6 9]
[0 1 2 3 4 5 6 9] [7 8]
[0 1 2 3 4 5 6 8] [7 9]
[0 1 2 3 4 5 6 7] [8 9]
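
Because every combination of `p` rows serves as a validation set, the number of splits equals the binomial coefficient C(n, p). A short standard-library sketch, independent of scikit-learn, shows how quickly this grows for the ten mock observations:

```python
from itertools import combinations
from math import comb

n_samples = 10

# Each validation set is one combination of p row indices.
pairs = list(combinations(range(n_samples), 2))
print(len(pairs), pairs[0], pairs[-1])

# The number of splits, and hence of models to fit, is C(n, p).
for p in (1, 2, 3, 5):
    print(p, comb(n_samples, p))
```

With `p = 2` there are 45 splits, and the count rises to 252 for `p = 5`, which is why leave-P-out quickly becomes impractical on larger datasets.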

## ShuffleSplit

The `ShuffleSplit` class creates independent splits with potentially overlapping validation sets, as shown in the following code:

In [10]:
ss = ShuffleSplit(n_splits = 3, test_size = 2, random_state = 42)
for train, validate in ss.split(data):
    print(train, validate)

[5 0 7 2 9 4 3 6] [8 1]
[8 5 3 4 7 9 6 2] [0 1]
[0 6 8 5 3 7 1 4] [9 2]
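
Since each split is drawn independently, some observations can be validated more than once while others are never validated at all. In the output above, index 1 appears in two validation sets. The following sketch tallies this for the same configuration; the variable names are illustrative only:

```python
from collections import Counter

from sklearn.model_selection import ShuffleSplit

data = list(range(1, 11))
ss = ShuffleSplit(n_splits=3, test_size=2, random_state=42)

# Count how often each index lands in a validation set across the splits.
validation_counts = Counter()
for _, validate in ss.split(data):
    validation_counts.update(validate.tolist())

never_validated = sorted(set(range(len(data))) - set(validation_counts))
print(validation_counts, never_validated)
```

With three splits of two observations each, at most six of the ten indices can ever be validated, in contrast to `KFold`, where every observation is validated exactly once.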


# Challenges with cross-validation in finance

A key assumption for the cross-validation methods discussed so far is that the samples available for training are independently and identically distributed (IID).
For financial data, this is often not the case: serial correlation violates independence, and time-varying standard deviation, also known as heteroskedasticity, violates identical distribution. `TimeSeriesSplit` in the `sklearn.model_selection` module aims to address the temporal ordering of time-series data.

## Time Series Cross Validation with scikit-learn

The time-series nature of the data implies that standard cross-validation produces situations where data from the future is used to predict data from the past. This is unrealistic at best and data snooping at worst, to the extent that future data reflects past events.
To address time dependency, the `TimeSeriesSplit` object implements a walk-forward test with an expanding training set, where subsequent training sets are supersets of past training sets, as shown in the following code:

In [11]:
tscv = TimeSeriesSplit(n_splits = 5)
for train, validate in tscv.split(data):
    print(train, validate)

[0 1 2 3 4] [5]
[0 1 2 3 4 5] [6]
[0 1 2 3 4 5 6] [7]
[0 1 2 3 4 5 6 7] [8]
[0 1 2 3 4 5 6 7 8] [9]
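
The difference from standard k-fold CV can be verified directly. The sketch below uses a hypothetical helper, `respects_time_order`, to check whether every training index precedes every validation index in each split:

```python
from sklearn.model_selection import KFold, TimeSeriesSplit

data = list(range(1, 11))


def respects_time_order(cv, samples):
    """True if, in every split, all training indices precede the validation indices."""
    return all(train.max() < validate.min() for train, validate in cv.split(samples))


print(respects_time_order(TimeSeriesSplit(n_splits=5), data))
print(respects_time_order(KFold(n_splits=5), data))
```

The check passes for `TimeSeriesSplit` and fails for `KFold`, whose first fold already trains on indices 2 to 9 in order to validate on indices 0 and 1.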


You can use the `max_train_size` parameter to implement walk-forward cross-validation, where the size of the training set remains constant over time.

In [12]:
tscv = TimeSeriesSplit(n_splits = 5, max_train_size = 4)
for train, validate in tscv.split(data):
    print(train, validate)

[1 2 3 4] [5]
[2 3 4 5] [6]
[3 4 5 6] [7]
[4 5 6 7] [8]
[5 6 7 8] [9]


In [13]:
tscv = TimeSeriesSplit(n_splits = 3, max_train_size = 4, test_size = 2)
for train, validate in tscv.split(data):
    print(train, validate)

[0 1 2 3] [4 5]
[2 3 4 5] [6 7]
[4 5 6 7] [8 9]
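
The fixed-length walk-forward pattern above can also be written out by hand, which may help when a custom scheme is needed. The helper `rolling_window_splits` below is a hypothetical pure-Python sketch, not scikit-learn's implementation, that should yield the same indices as the last two cells:

```python
def rolling_window_splits(n_samples, n_splits, train_size, test_size):
    """Yield (train, validate) index lists for a fixed-length walk-forward scheme."""
    # Validation blocks are laid out back to back at the end of the sample.
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Cap the training window at train_size observations before the block.
        train_start = max(0, test_start - train_size)
        train = list(range(train_start, test_start))
        validate = list(range(test_start, test_start + test_size))
        yield train, validate


for train, validate in rolling_window_splits(10, n_splits=3, train_size=4, test_size=2):
    print(train, validate)
```

Each training window ends immediately before its validation block, so no future observation is used to fit the model.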

