# **Cross-validation in Finance**


One purpose of ML is to learn the general structure of the data, so that we can produce predictions on future, unseen features.

It is well known that when an ML algorithm is tested on the same dataset that was used for training, it unsurprisingly achieves spectacular results.

**Cross-validation** (CV) splits observations drawn from an IID process into two sets: the training set and the testing set. This is done to prevent leakage from one set into the other, since leakage would defeat the purpose of testing on unseen data.

**K-fold cross-validation** is the most popular method. The algorithm is:


1.   The dataset is partitioned into k subsets.
2.   For each subset i = 1, ..., k:

      a. The ML algorithm is trained on all subsets excluding i.

      b. The fitted ML algorithm is tested on i.
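
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for a library splitter: the helper name `k_fold_indices` is hypothetical, and the folds are assumed to be contiguous and unshuffled.

```python
import numpy as np

def k_fold_indices(n_obs, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    folds = np.array_split(np.arange(n_obs), k)
    for i, test_idx in enumerate(folds):
        # Train on every subset except subset i, test on subset i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_obs=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each observation lands in exactly one testing subset, and no index appears in both the training and the testing set of the same split.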


In finance, CV is used in two settings: model development (e.g., hyperparameter tuning) and backtesting. However, **k-fold CV fails in finance** because observations cannot be assumed to be drawn from an IID process. Moreover, the testing set is used multiple times while a model is being developed, which leads to multiple testing and selection bias.

Let's focus on the first reason!

Leakage takes place when the training set contains information that also appears in the testing set.
Consider a serially correlated feature X that is associated with labels Y that are formed on overlapping data:

* Because of the serial correlation, $X_{t} \approx X_{t+1}$
* Because labels are derived from overlapping datapoints, $Y_{t} \approx Y_{t+1}$

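If $t$ and $t+1$ are placed in different sets, information leaks: a model trained on $(X_{t}, Y_{t})$ is more likely to predict $Y_{t+1}$ correctly from $X_{t+1}$, even when X is an irrelevant feature. The problem is that leakage in the presence of irrelevant features leads to false discoveries.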
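
To make the overlapping-label effect concrete, here is a small simulation sketch. It assumes IID Gaussian one-period returns and labels defined as the sum of the next `horizon` returns; both choices are illustrative assumptions, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, horizon = 5000, 10

# IID one-period returns: by construction there is nothing to predict.
r = rng.standard_normal(n_obs)

# y[t] = r[t] + ... + r[t + horizon - 1], so y[t] and y[t + 1]
# share horizon - 1 of their horizon terms.
csum = np.concatenate([[0.0], np.cumsum(r)])
y = csum[horizon:] - csum[:-horizon]

# Adjacent labels overlap heavily; labels `horizon` bars apart do not overlap.
corr_adjacent = np.corrcoef(y[:-1], y[1:])[0, 1]
corr_disjoint = np.corrcoef(y[:-horizon], y[horizon:])[0, 1]

print(f"correlation of overlapping labels:     {corr_adjacent:.2f}")
print(f"correlation of non-overlapping labels: {corr_disjoint:.2f}")
```

In theory the first correlation is (horizon - 1) / horizon and the second is zero. Even though the underlying returns are pure noise, neighbouring labels are almost identical, which is exactly the information that leaks when neighbours end up on opposite sides of a train/test split.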






In [3]:
import pandas as pd
import numpy as np

## Purged K-fold cross-validation:

One way to reduce leakage is to remove from the training set all observations whose labels overlap in time with the labels included in the testing set (purging). Another is to remove from the training set the observations that immediately follow an observation in the testing set (embargo).



In [1]:
## This function is credited to M. Lopez de Prado

def getTrainTimes(t1, testTimes):

  '''
  Given testTimes, find the times of the training observations.
  t1.index: Time when the observation started.
  t1.value: Time when the observation ended.
  testTimes: Times of the testing observations (index: start, value: end).
  '''

  trn=t1.copy(deep=True)
  for i,j in testTimes.items(): # Series.iteritems() was removed in pandas 2.0
    df0 = trn[(i <= trn.index)&(trn.index <= j)].index # train starts within test
    df1 = trn[(i <= trn)&(trn <= j)].index # train ends within test
    df2 = trn[(trn.index <= i)&(j <= trn)].index # train envelops test
    trn = trn.drop(df0.union(df1).union(df2)) # purge every overlapping observation
  
  return trn
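
The three conditions in the function above (starts within, ends within, envelops) together amount to a single interval-overlap test: a training span [start, end] overlaps a test span [i, j] exactly when start <= j and end >= i. The sketch below applies that test to toy data for one test interval. The helper name `purge_overlapping` and the integer "bar" timestamps are assumptions made for illustration.

```python
import pandas as pd

def purge_overlapping(t1, test_start, test_end):
    """Drop observations whose [start, end] span overlaps [test_start, test_end]."""
    overlaps = (t1 >= test_start) & (t1.index <= test_end)
    return t1[~overlaps]

# Toy data: the observation starting at bar t has a label that ends at bar t + 3.
t1 = pd.Series(range(3, 13), index=range(10))

train = purge_overlapping(t1, test_start=4, test_end=6)
print(train.index.tolist())  # [0, 7, 8, 9]
```

Observations starting at bars 1 to 3 are purged because their labels end inside the test interval, bars 4 to 6 are the test observations themselves, and only bars 0, 7, 8 and 9 remain available for training.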

Note that a larger number of testing splits is not necessarily a good idea, since it leads to a greater number of overlapping observations in the training set.

In many cases, purging suffices to prevent leakage. But if it doesn't, we can impose an embargo on training observations after every test set.

![purging and Embargo by de Prado](https://www.quantresearch.org/Purge.png)



In [4]:
def getEmbargoTimes(times, pctEmbargo):
  # Get embargo time for each bar

  step = int(times.shape[0]*pctEmbargo)

  if step == 0:
    mbrg = pd.Series(times, index=times)
  else:
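    mbrg = pd.Series(times[step:], index=times[:-step])
    # Series.append() was removed in pandas 2.0; pd.concat is the replacement.
    # The last `step` bars are mapped to the final timestamp.
    mbrg = pd.concat([mbrg, pd.Series(times[-1], index=times[-step:])])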
  
  return mbrg
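
The effect of the embargo can also be illustrated with integer positions instead of timestamps. This is only a sketch under stated assumptions: the helper name `train_indices_with_embargo` is hypothetical, the test block is the half-open range [test_start, test_end), and purging of overlapping labels is deliberately left out to isolate the embargo.

```python
import numpy as np

def train_indices_with_embargo(n_obs, test_start, test_end, pct_embargo):
    """Positions usable for training around the test block [test_start, test_end)."""
    step = int(n_obs * pct_embargo)        # embargo width in bars
    idx = np.arange(n_obs)
    before = idx[idx < test_start]         # everything before the test block
    after = idx[idx >= test_end + step]    # skip `step` bars right after the test block
    return np.concatenate([before, after])

train = train_indices_with_embargo(n_obs=20, test_start=8, test_end=12, pct_embargo=0.1)
print(train)
```

With 20 bars and a 10% embargo, the two bars immediately after the test block (positions 12 and 13) are excluded, so training uses positions 0 to 7 and 14 to 19.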

## **Remarks on scikit-learn's cross-validation:**

When working with open-source libraries, you should always verify their behavior and adjust them to your needs.

## **Example:**