Q1: Why is shuffling a dataset before conducting k-fold CV generally a bad idea in
finance? What is the purpose of shuffling? Why does shuffling defeat the purpose
of k-fold CV in financial datasets?


A: Shuffling a dataset before cross-validation is a bad idea in finance because the observations are not independent: they are serially correlated, and labels built over overlapping time windows share information. If the dataset is shuffled, each validation observation ends up surrounded by train observations from neighbouring timestamps, so there is data leakage. The usual purpose of shuffling is to make every fold a representative sample of the whole dataset. The purpose of k-fold CV is to estimate out-of-sample performance, and in financial datasets shuffling defeats it: the train set learns too much about the validation set, so the CV score overstates how the model would perform on unseen data.
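To make the leakage concrete, the sketch below counts how many train observations sit within a few bars of a validation observation, first with contiguous folds and then with shuffled folds. It uses synthetic positions only; the sample size, fold count, label horizon and the helper name `near_test_fraction` are illustrative assumptions, not values from this notebook.

```python
import numpy as np

def near_test_fraction(test_idx, n_obs, horizon):
    """Share of train observations lying within `horizon` bars of a test observation."""
    is_test = np.zeros(n_obs, dtype=bool)
    is_test[test_idx] = True
    # Count test observations in the window [i - horizon, i + horizon] around each i.
    window = np.ones(2 * horizon + 1)
    test_nearby = np.convolve(is_test.astype(float), window, mode="same") > 0.5
    is_train = ~is_test
    return (test_nearby & is_train).sum() / is_train.sum()

n_obs, n_folds, horizon = 1000, 10, 5  # hypothetical sizes
positions = np.arange(n_obs)

# Contiguous folds: each test fold is one unbroken block of time.
contiguous_folds = np.array_split(positions, n_folds)
# Shuffled folds: each test fold is scattered across the whole sample.
rng = np.random.default_rng(0)
shuffled_folds = np.array_split(rng.permutation(n_obs), n_folds)

contiguous_share = np.mean([near_test_fraction(f, n_obs, horizon) for f in contiguous_folds])
shuffled_share = np.mean([near_test_fraction(f, n_obs, horizon) for f in shuffled_folds])

print(f"contiguous folds: {contiguous_share:.3f}")
print(f"shuffled folds:   {shuffled_share:.3f}")
```

With contiguous folds only the observations next to the fold boundaries are affected, which is the band that purging and an embargo are meant to remove. With shuffled folds a large share of the train set lies right next to validation observations.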

In [5]:
import numpy as np
import pandas as pd
import sys
import os
from fml_lib import getIndMatrix, getAvgUniqueness, PurgedKFold, cvScore
from sklearn.ensemble import RandomForestClassifier

Loading in the label data from exercise 3

In [6]:
# Events: label start times in the index, label end times in t1
events = pd.read_csv("events.csv", index_col=['Unnamed: 0'])
events = events.drop(['trgt'], axis=1)
events = events.dropna()
events.index = pd.to_datetime(events.index, format='%Y-%m-%d %H:%M:%S.%f')
events['t1'] = pd.to_datetime(events['t1'], format='%Y-%m-%d %H:%M:%S.%f')

# Labels
bins = pd.read_csv("bins.csv", index_col=['Unnamed: 0'])
bins.index = pd.to_datetime(bins.index, format='%Y-%m-%d %H:%M:%S.%f')

# Sample weights; positive infinite weights are replaced with zero
weights = pd.read_csv("out.csv", index_col=['Unnamed: 0'])
weights.index = pd.to_datetime(weights.index, format='%Y-%m-%d %H:%M:%S.%f')
weights = weights.replace([np.inf], 0)

# Dollar bars; keep only the closes at the labelled timestamps
bars = pd.read_csv("july_2023_dollar_bars.csv")
bars['Timestamp'] = pd.to_datetime(bars['Timestamp'], format='%Y%m%d %H:%M:%S:%f')
bars.index = bars['Timestamp']
close = bars['close']
close = close.loc[bins.index]


Building the indicator matrix for average uniqueness (the getAvgUniqueness call itself is commented out)

In [7]:

ind_matrix = getIndMatrix(bars.index, events)
#print(ind_matrix[(ind_matrix.select_dtypes(include=['number']) != 0).any(1)])
#avgU = getAvgUniqueness(ind_matrix)
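For reference, average uniqueness can be sketched on a toy example. The code below does not call `getIndMatrix` or `getAvgUniqueness` from `fml_lib`, whose implementations are not shown here; it is a stand-alone illustration of the usual definition, where a label's uniqueness at a bar is one over the number of labels active at that bar, averaged over the bars in which the label is active. The three label spans and both helper names are made up for the illustration.

```python
import numpy as np

def indicator_matrix(n_bars, spans):
    """Rows are bars, columns are labels; an entry is 1 when the label is active at the bar."""
    ind = np.zeros((n_bars, len(spans)))
    for col, (start, end) in enumerate(spans):
        ind[start:end + 1, col] = 1.0  # the end bar is inclusive
    return ind

def average_uniqueness(ind):
    """Mean of 1/concurrency over the bars where each label is active."""
    concurrency = ind.sum(axis=1, keepdims=True)  # number of labels active at each bar
    uniqueness = np.divide(ind, concurrency, out=np.zeros_like(ind), where=concurrency > 0)
    return uniqueness.sum(axis=0) / ind.sum(axis=0)

spans = [(0, 2), (2, 3), (4, 5)]  # hypothetical (first bar, last bar) for each label
ind = indicator_matrix(6, spans)
avg_u = average_uniqueness(ind)
print(avg_u)
```

Labels 0 and 1 overlap at bar 2, so their average uniqueness falls below 1 (5/6 and 3/4), while label 2 overlaps nothing and stays at 1.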

Random Forest Initialization for no shuffling case

In [8]:
# One classifier, reused for both the unshuffled and the shuffled run
clf0 = RandomForestClassifier(n_estimators=1000, class_weight='balanced_subsample', criterion='entropy')
x = close  # single feature: close price at each labelled timestamp
print(close.dtypes)
y = bins['bin']
print(y.dtypes)

# Label end times, passed to cvScore together with the embargo fraction
t1 = pd.Series(events.t1.values, index=events.index)
first_KFold = cvScore(clf0, X=x, y=y, sample_weight=weights, t1=t1, cv=10, shuffle=False, pctEmbargo=0.01)

float64
float64
init started


  maxT1Idx=self.t1.index.searchsorted(self.t1[test_indices].max())


In [9]:
print(first_KFold)

[-2.15704649 -1.59315106 -1.68475095 -4.13137252 -1.7780026  -1.50757641
 -1.57965077 -1.4650449  -1.67974621 -1.61783655]
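The `t1` and `pctEmbargo` arguments passed to `cvScore` above relate to purging and embargoing. As a rough illustration of that idea, and not of the actual `PurgedKFold` in `fml_lib` (whose code is not shown here), the sketch below builds contiguous test folds, drops train labels whose time span overlaps the test fold, and skips a further fraction of observations after it. The function name, the toy data and the 2% embargo are assumptions chosen so the effect is visible on 100 points.

```python
import numpy as np

def purged_embargo_splits(t0, t1, n_splits, pct_embargo):
    """Yield (train, test) positions for contiguous test folds.

    t0[i] and t1[i] are the start and end times of label i, with t0 sorted ascending.
    """
    n_obs = len(t0)
    embargo = int(n_obs * pct_embargo)  # observations skipped after each test fold
    positions = np.arange(n_obs)
    for test in np.array_split(positions, n_splits):
        test_start = t0[test[0]]
        test_end = t1[test].max()
        # Purge: keep labels that have ended before the test fold starts ...
        train_before = positions[t1 < test_start]
        # ... or that start after every test label has ended, plus the embargo.
        first_after = np.searchsorted(t0, test_end, side="right")
        train_after = positions[first_after + embargo:]
        yield np.concatenate([train_before, train_after]), test

# Hypothetical data: 100 labels, each spanning 3 bars, times given as integer bar numbers.
t0 = np.arange(100)
t1 = np.minimum(t0 + 3, 99)
splits = list(purged_embargo_splits(t0, t1, n_splits=10, pct_embargo=0.02))

train, test = splits[4]  # test fold covers positions 40..49
print(len(train), train[train < 40].max(), train[train > 49].min())  # prints: 82 36 55
```

In this toy fold, positions 37 to 39 are purged because their labels run into the test fold, positions 50 to 52 because they start no later than the end of the last test label, and positions 53 and 54 are embargoed.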


In [10]:
# Same classifier and settings, now with shuffle=True
second_KFold = cvScore(clf0, X=x, y=y, sample_weight=weights, t1=t1, cv=10, shuffle=True, pctEmbargo=0.01)

init started


  maxT1Idx=self.t1.index.searchsorted(self.t1[test_indices].max())


In [11]:
print(second_KFold)

[-2.28601826 -1.58333518 -1.7686587  -3.25738851 -1.66854633 -1.47492554
 -1.46808772 -1.35358164 -1.55935982 -1.56132081]


Q: Why are both results so different?

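A: The printed scores appear to be negative log loss (they are all below zero), so values closer to zero are better. The CV with shuffling scores better overall, with a mean of about -1.80 against about -1.92 without shuffling, because shuffling has leaked information about the validation set into the train set.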
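The comparison can be checked directly from the two printed score arrays. The snippet below copies those values and compares their means and medians; reading the scores as negative log loss (closer to zero is better) is an assumption based on their sign.

```python
import numpy as np

# Scores copied from the two printed arrays above.
no_shuffle = np.array([-2.15704649, -1.59315106, -1.68475095, -4.13137252, -1.7780026,
                       -1.50757641, -1.57965077, -1.4650449, -1.67974621, -1.61783655])
shuffled = np.array([-2.28601826, -1.58333518, -1.7686587, -3.25738851, -1.66854633,
                     -1.47492554, -1.46808772, -1.35358164, -1.55935982, -1.56132081])

mean_no_shuffle = no_shuffle.mean()
mean_shuffled = shuffled.mean()

print(f"no shuffle: mean={mean_no_shuffle:.3f}, median={np.median(no_shuffle):.3f}")
print(f"shuffled:   mean={mean_shuffled:.3f}, median={np.median(shuffled):.3f}")
```

The median is shown alongside the mean because one fold in each run is far worse than the rest and pulls the mean down.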


Q: How does shuffling leak information?

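A: As in the answer to Q1: financial observations are serially correlated and their labels span overlapping time intervals. Shuffling scatters each validation fold across the whole sample, so nearly every validation observation has train observations immediately before and after it. Those neighbours carry almost the same feature values and label information, so the model is effectively scored on data it has already seen.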