# Purged KFold as a method

The issue below was reported in mlfinlab and aroused my curiosity, so this notebook tests how PurgedKFold behaves under different parameters.

[https://github.com/hudson-and-thames/mlfinlab/issues/295#](https://github.com/hudson-and-thames/mlfinlab/issues/295#)

At the same time, avoid numpy/pandas `searchsorted` if possible. In my experience, `searchsorted` can return unsorted or duplicated dates, or otherwise complicated values.

[A quick look at the current searchsorted issues on github.](https://github.com/numpy/numpy/issues?q=is%3Aissue+is%3Aopen+searchsorted)
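
As background for the caution above: NumPy documents that `searchsorted` expects its input to be sorted in ascending order, so its result on an unsorted index is not meaningful. Below is a minimal sketch of finding the same position with a boolean mask instead. The helper name `position_at_or_after` is hypothetical and is not part of `research` or mlfinlab.

```python
import numpy as np
import pandas as pd


def position_at_or_after(index: pd.DatetimeIndex, when: pd.Timestamp) -> int:
    """Hypothetical helper: position of the first timestamp >= `when`.

    Uses a boolean mask, so it does not rely on `searchsorted`.
    Returns len(index) when no timestamp qualifies.
    """
    hits = np.flatnonzero(index >= when)
    return int(hits[0]) if hits.size else len(index)


sorted_idx = pd.date_range("2020-01-01", periods=10, freq="D")
when = pd.Timestamp("2020-01-06")

# On a sorted, unique index both approaches agree.
mask_pos = position_at_or_after(sorted_idx, when)
search_pos = int(sorted_idx.searchsorted(when))
print(mask_pos, search_pos)

# On an unsorted index the searchsorted precondition is violated,
# so check monotonicity before trusting its result.
shuffled_idx = sorted_idx[[3, 0, 7, 1, 9, 2, 8, 4, 6, 5]]
print(shuffled_idx.is_monotonic_increasing)
```

If `searchsorted` cannot be avoided, checking `index.is_monotonic_increasing` (and `index.is_unique`) first is a cheap guard.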

In [1]:
import datetime
import pandas as pd
import research as rs

Num of CPU core:  4
Machine info:  Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Numpy 1.18.5
Pandas 1.0.4



In [2]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(10)] #sample =10 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=2, pct_embargo = 0.0) #split = 2 only with no embargo

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])
    
# There is leakage in the 1-th fold below: the last training row (2020-01-05 to 2020-01-07) runs past the first test row (2020-01-06).

------ 0-th fold ------
>> Training events:
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]
>> Test events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
dtype: datetime64[ns]
------ 1-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
dtype: datetime64[ns]
>> Test events:
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]
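
The leakage noted in the comment above can also be checked programmatically rather than by eye. The sketch below is a minimal one with a hypothetical helper, `leaking_train_events`. It assumes that a training event ending on the same day the test set starts does not count as leaking, which is how the folds are read in this notebook.

```python
import pandas as pd


def leaking_train_events(events: pd.Series, train_ind, test_ind) -> pd.Series:
    """Hypothetical helper: training events whose span overlaps the test span.

    `events` follows the notebook's convention: index = event start,
    values = event end. Overlap is strict, so touching boundaries pass.
    """
    train = events.iloc[train_ind]
    test = events.iloc[test_ind]
    test_start = test.index.min()
    test_end = test.max()
    starts_before_test_ends = train.index < test_end
    ends_after_test_starts = (train > test_start).to_numpy()
    return train[starts_before_test_ends & ends_after_test_starts]


start_dates = pd.date_range("2020-01-01", periods=10, freq="D")
events = pd.Series((start_dates + pd.Timedelta(days=2)).to_numpy(),
                   index=start_dates)

# Positions copied by hand from the printed 1-th fold (pct_embargo = 0.0).
leaks = leaking_train_events(events, list(range(0, 5)), list(range(5, 10)))
print(leaks)

# Same fold with the last training row dropped.
leaks_trimmed = leaking_train_events(events, list(range(0, 4)), list(range(5, 10)))
print(leaks_trimmed.empty)
```

Running a check like this inside the fold loop turns "eyeballing the printout" into an assertion that fails as soon as any fold leaks.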


In [3]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(10)] #sample =10 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=2, pct_embargo = 0.1) #split = 2 only, embargo = 10%

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])
    
# No more leakage below: in the 1-th fold the last training row (2020-01-04 to 2020-01-06) no longer runs past the first test row (2020-01-06).

------ 0-th fold ------
>> Training events:
2020-01-08   2020-01-10
2020-01-09   2020-01-11
dtype: datetime64[ns]
>> Test events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
dtype: datetime64[ns]
------ 1-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
dtype: datetime64[ns]
>> Test events:
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]


**Note**

Notice that the only change is `pct_embargo`, from 0.0 to 0.1.

With `pct_embargo = 0.0` there was leakage (cross-contamination) between the training and test sets. Once the embargo was set to 0.1, the algo started to place a "barrier" between the training and test sets.

Conceptually, this is what the book teaches. As long as your algo reflects this concept, it should be fine.

At the same time, please shuffle your test set after splitting into training and test sets. This further reduces possible selection bias. [AFML 7.1](https://github.com/boyboi86/AFML/blob/master/AFML%207.1.ipynb)
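
To make the "barrier" idea concrete, here is a minimal sketch of purging plus embargo written with boolean masks. It is not the `rs.PurgedKFold` used in this notebook: the function name `purged_kfold_indices`, the `int(n * pct_embargo)` embargo width, and the choice not to treat touching boundaries as overlap are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def purged_kfold_indices(events: pd.Series, n_splits: int = 2,
                         pct_embargo: float = 0.0):
    """Yield (train_ind, test_ind) position arrays for contiguous test folds.

    Hypothetical sketch: `events` has index = event start, values = event end,
    and the index must be sorted by start time.
    """
    if not events.index.is_monotonic_increasing:
        raise ValueError("events index must be sorted by start time")
    n = len(events)
    positions = np.arange(n)
    starts = events.index.to_numpy()
    ends = events.to_numpy()
    embargo = int(n * pct_embargo)  # embargo width in rows (assumed rule)

    for test_ind in np.array_split(positions, n_splits):
        test_start = starts[test_ind[0]]
        test_end = ends[test_ind].max()
        # Purge: keep training events that end no later than the test start ...
        before = ends <= test_start
        # ... or that start no earlier than the last test event ends.
        first_clear = int(np.sum(starts < test_end))
        # Embargo: skip a further `embargo` rows after the purged zone.
        after = positions >= first_clear + embargo
        train_mask = before | after
        train_mask[test_ind] = False  # never train on test rows
        yield positions[train_mask], test_ind


start_dates = pd.date_range("2020-01-01", periods=10, freq="D")
events = pd.Series((start_dates + pd.Timedelta(days=2)).to_numpy(),
                   index=start_dates)

for pct in (0.0, 0.1):
    for k, (train_ind, test_ind) in enumerate(
            purged_kfold_indices(events, n_splits=2, pct_embargo=pct)):
        print(pct, k, train_ind.tolist(), test_ind.tolist())
```

Because this is a different implementation, its folds need not match the ones printed by `rs.PurgedKFold` above. Comparing the two is one way to see which rows each implementation purges and which it embargoes.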

In [4]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(100)] #sample =100 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=2, pct_embargo = 0.01) #split = 2 only

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])

------ 0-th fold ------
>> Training events:
2020-02-22   2020-02-24
2020-02-23   2020-02-25
2020-02-24   2020-02-26
2020-02-25   2020-02-27
2020-02-26   2020-02-28
2020-02-27   2020-02-29
2020-02-28   2020-03-01
2020-02-29   2020-03-02
2020-03-01   2020-03-03
2020-03-02   2020-03-04
2020-03-03   2020-03-05
2020-03-04   2020-03-06
2020-03-05   2020-03-07
2020-03-06   2020-03-08
2020-03-07   2020-03-09
2020-03-08   2020-03-10
2020-03-09   2020-03-11
2020-03-10   2020-03-12
2020-03-11   2020-03-13
2020-03-12   2020-03-14
2020-03-13   2020-03-15
2020-03-14   2020-03-16
2020-03-15   2020-03-17
2020-03-16   2020-03-18
2020-03-17   2020-03-19
2020-03-18   2020-03-20
2020-03-19   2020-03-21
2020-03-20   2020-03-22
2020-03-21   2020-03-23
2020-03-22   2020-03-24
2020-03-23   2020-03-25
2020-03-24   2020-03-26
2020-03-25   2020-03-27
2020-03-26   2020-03-28
2020-03-27   2020-03-29
2020-03-28   2020-03-30
2020-03-29   2020-03-31
2020-03-30   2020-04-01
2020-03-31   2020-04-02
2020-04-01   2020-04

In [5]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(80)] #sample = 80 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=5, pct_embargo = 0.2) #split = 5 only

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])
    
# No leakage

------ 0-th fold ------
>> Training events:
2020-01-19   2020-01-21
2020-01-20   2020-01-22
2020-01-21   2020-01-23
2020-01-22   2020-01-24
2020-01-23   2020-01-25
2020-01-24   2020-01-26
2020-01-25   2020-01-27
2020-01-26   2020-01-28
2020-01-27   2020-01-29
2020-01-28   2020-01-30
2020-01-29   2020-01-31
2020-01-30   2020-02-01
2020-01-31   2020-02-02
2020-02-01   2020-02-03
2020-02-02   2020-02-04
2020-02-03   2020-02-05
2020-02-04   2020-02-06
2020-02-05   2020-02-07
2020-02-06   2020-02-08
2020-02-07   2020-02-09
2020-02-08   2020-02-10
2020-02-09   2020-02-11
2020-02-10   2020-02-12
2020-02-11   2020-02-13
2020-02-12   2020-02-14
2020-02-13   2020-02-15
2020-02-14   2020-02-16
2020-02-15   2020-02-17
2020-02-16   2020-02-18
2020-02-17   2020-02-19
2020-02-18   2020-02-20
2020-02-19   2020-02-21
2020-02-20   2020-02-22
2020-02-21   2020-02-23
2020-02-22   2020-02-24
2020-02-23   2020-02-25
2020-02-24   2020-02-26
2020-02-25   2020-02-27
2020-02-26   2020-02-28
2020-02-27   2020-02

In [6]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(100)] #sample =100 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=5, pct_embargo = 0.2) #split = 5 only

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])
    
#No leakage

------ 0-th fold ------
>> Training events:
2020-01-23   2020-01-25
2020-01-24   2020-01-26
2020-01-25   2020-01-27
2020-01-26   2020-01-28
2020-01-27   2020-01-29
2020-01-28   2020-01-30
2020-01-29   2020-01-31
2020-01-30   2020-02-01
2020-01-31   2020-02-02
2020-02-01   2020-02-03
2020-02-02   2020-02-04
2020-02-03   2020-02-05
2020-02-04   2020-02-06
2020-02-05   2020-02-07
2020-02-06   2020-02-08
2020-02-07   2020-02-09
2020-02-08   2020-02-10
2020-02-09   2020-02-11
2020-02-10   2020-02-12
2020-02-11   2020-02-13
2020-02-12   2020-02-14
2020-02-13   2020-02-15
2020-02-14   2020-02-16
2020-02-15   2020-02-17
2020-02-16   2020-02-18
2020-02-17   2020-02-19
2020-02-18   2020-02-20
2020-02-19   2020-02-21
2020-02-20   2020-02-22
2020-02-21   2020-02-23
2020-02-22   2020-02-24
2020-02-23   2020-02-25
2020-02-24   2020-02-26
2020-02-25   2020-02-27
2020-02-26   2020-02-28
2020-02-27   2020-02-29
2020-02-28   2020-03-01
2020-02-29   2020-03-02
2020-03-01   2020-03-03
2020-03-02   2020-03

In [7]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(80)] #sample = 80 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=10, pct_embargo = 0.2) #split = 10 only

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])
    
# Leakage! 5th fold

------ 0-th fold ------
>> Training events:
2020-01-11   2020-01-13
2020-01-12   2020-01-14
2020-01-13   2020-01-15
2020-01-14   2020-01-16
2020-01-15   2020-01-17
2020-01-16   2020-01-18
2020-01-17   2020-01-19
2020-01-18   2020-01-20
2020-01-19   2020-01-21
2020-01-20   2020-01-22
2020-01-21   2020-01-23
2020-01-22   2020-01-24
2020-01-23   2020-01-25
2020-01-24   2020-01-26
2020-01-25   2020-01-27
2020-01-26   2020-01-28
2020-01-27   2020-01-29
2020-01-28   2020-01-30
2020-01-29   2020-01-31
2020-01-30   2020-02-01
2020-01-31   2020-02-02
2020-02-01   2020-02-03
2020-02-02   2020-02-04
2020-02-03   2020-02-05
2020-02-04   2020-02-06
2020-02-05   2020-02-07
2020-02-06   2020-02-08
2020-02-07   2020-02-09
2020-02-08   2020-02-10
2020-02-09   2020-02-11
2020-02-10   2020-02-12
2020-02-11   2020-02-13
2020-02-12   2020-02-14
2020-02-13   2020-02-15
2020-02-14   2020-02-16
2020-02-15   2020-02-17
2020-02-16   2020-02-18
2020-02-17   2020-02-19
2020-02-18   2020-02-20
2020-02-19   2020-02

------ 9-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
2020-01-11   2020-01-13
2020-01-12   2020-01-14
2020-01-13   2020-01-15
2020-01-14   2020-01-16
2020-01-15   2020-01-17
2020-01-16   2020-01-18
2020-01-17   2020-01-19
2020-01-18   2020-01-20
2020-01-19   2020-01-21
2020-01-20   2020-01-22
2020-01-21   2020-01-23
2020-01-22   2020-01-24
2020-01-23   2020-01-25
2020-01-24   2020-01-26
2020-01-25   2020-01-27
2020-01-26   2020-01-28
2020-01-27   2020-01-29
2020-01-28   2020-01-30
2020-01-29   2020-01-31
2020-01-30   2020-02-01
2020-01-31   2020-02-02
2020-02-01   2020-02-03
2020-02-02   2020-02-04
2020-02-03   2020-02-05
2020-02-04   2020-02-06
2020-02-05   2020-02-07
2020-02-06   2020-02-08
2020-02-07   2020-02-09
2020-02-08   2020-02-10
2020-02-09   2020-02

In [8]:
start = datetime.datetime.strptime("01-01-2020", "%d-%m-%Y")
start_dates = [start + datetime.timedelta(days=x) for x in range(100)] #sample =100 only
end_dates = [start + datetime.timedelta(days=2) for start in start_dates]
df = pd.Series(index=start_dates, data=end_dates)

# splitter
splitter = rs.PurgedKFold(events=df, n_splits=10, pct_embargo = 0.2) #split = 10 only

# print off folds
for k, (train_ind, test_ind) in enumerate(splitter.split(df)):
    print("------ %d-th fold ------" % (k))
    print(">> Training events:")
    print(df.iloc[train_ind])
    print(">> Test events:")
    print(df.iloc[test_ind])
    
#Seems like no leakage

------ 0-th fold ------
>> Training events:
2020-01-13   2020-01-15
2020-01-14   2020-01-16
2020-01-15   2020-01-17
2020-01-16   2020-01-18
2020-01-17   2020-01-19
                ...    
2020-03-16   2020-03-18
2020-03-17   2020-03-19
2020-03-18   2020-03-20
2020-03-19   2020-03-21
2020-03-20   2020-03-22
Length: 68, dtype: datetime64[ns]
>> Test events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
2020-01-06   2020-01-08
2020-01-07   2020-01-09
2020-01-08   2020-01-10
2020-01-09   2020-01-11
2020-01-10   2020-01-12
dtype: datetime64[ns]
------ 1-th fold ------
>> Training events:
2020-01-01   2020-01-03
2020-01-02   2020-01-04
2020-01-03   2020-01-05
2020-01-04   2020-01-06
2020-01-05   2020-01-07
                ...    
2020-03-16   2020-03-18
2020-03-17   2020-03-19
2020-03-18   2020-03-20
2020-03-19   2020-03-21
2020-03-20   2020-03-22
Length: 68, dtype: datetime64[ns]
>> Test events:
2020-01-11   2020-01-1

### Conclusion

When using PurgedKFold with either 5 or 10 splits, the embargo percentage has to be around 20% to be safe.

As for the reason for 5 or 10 splits:

* [Cross Validated (stats.stackexchange.com): choice-of-k-in-k-fold-cross-validation](https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation)
* [researchgate.net](https://www.researchgate.net/post/What_does_ten_times_ten-fold_cross_validation_of_data_set_mean_and_its_importance)

But with only 2 splits, a `pct_embargo` of 0.01 was enough to avoid data leakage.

In short, when the sample size is large (>= 100) with either 5 or 10 splits, a good embargo estimate to start from is 20%, which can then be reduced.

As such, another function may be required to check large samples so that we do not "over-embargo" the training set.
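
To get a feel for what "over-embargo" costs, the embargo can be expressed in rows. The sketch below assumes the width is computed as `int(n_samples * pct_embargo)`, the rule used in AFML. Whether `rs.PurgedKFold` does exactly this is an assumption, and `embargo_rows` is a hypothetical helper.

```python
def embargo_rows(n_samples: int, pct_embargo: float) -> int:
    """Embargo width in rows, assuming the int(n * pct) rule from AFML."""
    return int(n_samples * pct_embargo)


# (n_samples, n_splits, pct_embargo) combinations tried in this notebook.
settings = [(10, 2, 0.1), (100, 2, 0.01), (80, 5, 0.2),
            (100, 5, 0.2), (80, 10, 0.2), (100, 10, 0.2)]

for n_samples, n_splits, pct in settings:
    test_rows = n_samples // n_splits
    width = embargo_rows(n_samples, pct)
    print(f"n={n_samples:3d} splits={n_splits:2d} pct={pct:<4} "
          f"test rows={test_rows:2d} embargo rows={width:2d}")

# Worked check for n=100, 10 splits, pct_embargo=0.2, with 2 rows purged
# (two rows are missing right after the test block in the printed 0-th fold).
remaining = 100 - 100 // 10 - 2 - embargo_rows(100, 0.2)
print(remaining)
```

The worked check is consistent with the `Length: 68` printed for the 0-th fold with 100 samples and 10 splits. It also shows that a 20% embargo removes twice as many rows as the test fold itself in that setting.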

PurgedKFold is commonly used with cv_score, since we are dealing with non-IID series, so the implementation of this particular algo is very important.

I strongly encourage every quant to build this algo in their own fashion (with additional features if desired).