### Cross-Validation in Finance

This notebook covers the answers to the following exercises:

* Exercise 7.1
* Exercise 7.2
* Exercise 7.3
* Exercise 7.4
* Exercise 7.5

Explanations are provided along the way.

Most of the functions below can be found under Tool/cross_validate.

Contact: boyboi86@gmail.com

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import research as rs

%matplotlib inline

Num of CPU core:  4
Machine info:  Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Numpy 1.17.3
Pandas 1.0.3



### KFold CV

As demonstrated in [notebook 6.1](https://github.com/boyboi86/AFML/blob/master/AFML%206.1.ipynb), shuffling a dataset before splitting almost guarantees an overfit score. The leakage is more severe than with the OOB method (subsampling), because the entire ensemble is used to evaluate each test fold.

However, shuffling is not without merit. If the observations were truly IID, shuffling would add the necessary randomness to the training set and make the outcome more reliable.

The problem is that financial series are not IID. When a financial dataset is shuffled before it is partitioned, each test observation ends up surrounded in time by training observations that carry nearly the same information, so the model is effectively evaluated on data it has already seen. The more splits there are, the larger the share of data used for training, since only one fold is held out at a time.

For example, with 5 splits, 80% of the shuffled data becomes the training set. The remaining 20% (the test set) is scattered across the same time span, so most test observations overlap with training observations. As a result, the CV score is heavily inflated.

If the dataset is not shuffled, this is much less of a problem: each test fold is a contiguous block of observations, so the training set can only touch it at the edges of that block.

**Conclusion**

KFold is only reliable for non-IID samples when the samples are not shuffled.
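
The contiguity argument above can be checked directly on the indices that `KFold` produces. The sketch below is only an illustration on a toy index (`X_demo`, `contiguous_folds` and `is_contiguous` are names made up for this example, not part of `research`): without shuffling every test fold is one unbroken block in time, while with shuffling the test observations are scattered among the training observations.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for 20 time-ordered observations (not the notebook's data).
X_demo = np.arange(20).reshape(-1, 1)

def is_contiguous(idx):
    """True when the sorted indices form one unbroken block in time."""
    return bool(np.all(np.diff(np.sort(idx)) == 1))

def contiguous_folds(shuffle):
    """For each of the 5 folds, report whether its test set is contiguous."""
    # random_state may only be set when shuffle=True in recent sklearn versions
    cv = KFold(n_splits=5, shuffle=shuffle, random_state=42 if shuffle else None)
    return [is_contiguous(test_idx) for _, test_idx in cv.split(X_demo)]

print("unshuffled folds contiguous:", contiguous_folds(shuffle=False))
print("shuffled folds contiguous:  ", contiguous_folds(shuffle=True))
```

With contiguous test blocks, any overlap between training and testing labels is confined to the block boundaries, which is exactly where purging and embargo are applied later in this notebook.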

In [2]:
# exercise 7.2

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, KFold

dollar = pd.read_csv('./research/Sample_data/dollar_bars.txt', 
                 sep=',', 
                 header=0, 
                 parse_dates = True, 
                 index_col=['date_time'])

# Most of the functions below only need the 'close' column

close = dollar['close'].to_frame()

ffd_series = close.apply(np.log).cumsum()
ffd_series = rs.fracDiff_FFD(ffd_series, 
                    d = 1.99999889, 
                    thres=1e-5
                   ).dropna()

cs_event = rs.cs_filter(data = ffd_series, limit=(ffd_series.std() * 0.2))

df_mtx = pd.DataFrame(index = cs_event).assign(close = close,
                                                ffd_series = ffd_series).drop_duplicates().dropna()
df_mtx

| | close | ffd_series |
|---|---|---|
| 2015-01-04 23:20:12.567 | 2040.75 | -0.003825 |
| 2015-01-07 10:55:23.194 | 2008.00 | 0.004166 |
| 2015-01-07 15:10:50.900 | 2014.50 | -0.002360 |
| 2015-01-08 01:48:57.964 | 2037.25 | 0.006857 |
| 2015-01-08 05:47:32.006 | 2032.75 | -0.003119 |
| ... | ... | ... |
| 2016-11-28 01:31:48.252 | 2205.75 | -0.099969 |
| 2016-12-04 23:32:49.403 | 2183.50 | -0.102717 |
| 2016-12-05 02:06:52.025 | 2189.00 | -0.096321 |
| 2016-12-08 12:46:05.346 | 2233.00 | -0.101778 |


In [3]:
df_mtx['volatility'] = rs.vol(df_mtx.close, span0 = 50) #one of our features, since we do not have a side

df_mtx.dropna(inplace = True)

vb = rs.vert_barrier(data = df_mtx.close, events = cs_event, period = 'days', freq = 5)

# triple barrier events based on filter while data is also based on filtered index
tb = rs.tri_barrier(data = df_mtx.close, 
                    events = cs_event, 
                    trgt = df_mtx['volatility'], 
                    min_req = 0.0002, 
                    num_threads = 3, 
                    ptSl= [2,2], #2x barriers
                    t1 = vb, 
                    side = None)

mlabel = rs.meta_label(data = df_mtx.close, 
                       events = tb, 
                       drop = 0.05) # because we do not have side, we need to drop rare labels




[                                             t1                      sl  \
2015-01-07 15:10:50.900 2015-01-12 16:02:08.112                     NaT   
2015-01-08 01:48:57.964 2015-01-13 09:38:58.103                     NaT   
2015-01-08 05:47:32.006 2015-01-13 09:38:58.103                     NaT   
2015-01-09 14:48:46.704 2015-01-14 19:14:20.771 2015-01-14 04:31:40.468   
2015-01-12 14:36:34.243 2015-01-19 09:36:49.301 2015-01-14 19:14:20.771   
...                                         ...                     ...   
2015-09-01 19:46:17.742 2015-09-07 01:34:00.944                     NaT   
2015-09-01 22:05:09.069 2015-09-07 01:34:00.944                     NaT   
2015-09-02 00:20:04.277 2015-09-07 01:34:00.944                     NaT   
2015-09-02 02:09:17.333 2015-09-07 04:40:25.376                     NaT   
2015-09-02 08:08:50.931 2015-09-08 01:27:51.915                     NaT   

                                             pt  
2015-01-07 15:10:50.900 2015-01-08 01:48:57.964 

2020-05-31 21:42:02.557476 33.33% _pt_sl_t1 done after 0.11 minutes. Remaining 0.21 minutes.
2020-05-31 21:42:02.588716 66.67% _pt_sl_t1 done after 0.11 minutes. Remaining 0.05 minutes.
2020-05-31 21:42:02.666865 100.0% _pt_sl_t1 done after 0.11 minutes. Remaining 0.0 minutes.


In [4]:
mlabel['bin'].value_counts() #834

 1.0    489
-1.0    345
Name: bin, dtype: int64

In [5]:
X = df_mtx.reindex(mlabel.index)
Z = tb.reindex(mlabel.index)
y = mlabel['bin']

idx_Mat0 = rs.mp_idx_matrix(data = X.close, events = Z)

avgU = rs.av_unique(idx_Mat0).mean() #get ave uniqueness
print("Ave Uniqueness of Observations", avgU)

Ave Uniqueness of Observations 0.1201464235230114
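
For readers unfamiliar with the quantity printed above, here is a minimal sketch of how average uniqueness can be computed from a 0/1 indicator matrix (rows are bars, columns are labels). It follows the usual definition: the uniqueness of a label at a bar is one divided by the number of labels alive at that bar, averaged over the label's lifespan. The function `average_uniqueness` and the `toy` matrix are hypothetical and are not the implementation behind `rs.mp_idx_matrix` or `rs.av_unique`.

```python
import numpy as np

def average_uniqueness(ind_matrix):
    """Average uniqueness per label from a (bars x labels) 0/1 indicator matrix."""
    ind_matrix = np.asarray(ind_matrix, dtype=float)
    # number of labels that are alive at each bar
    concurrency = ind_matrix.sum(axis=1, keepdims=True)
    # uniqueness of label i at bar t is 1 / concurrency wherever the label is alive
    uniqueness = np.divide(ind_matrix, concurrency,
                           out=np.zeros_like(ind_matrix),
                           where=concurrency > 0)
    # average over the bars spanned by each label
    return uniqueness.sum(axis=0) / ind_matrix.sum(axis=0)

# Toy example: labels 0 and 1 overlap on the second bar, label 2 stands alone.
toy = np.array([[1, 0, 0],
                [1, 1, 0],
                [0, 1, 0],
                [0, 0, 1]])

per_label = average_uniqueness(toy)
print("per-label uniqueness:", per_label)
print("mean uniqueness:", per_label.mean())
```

A low mean, such as the value printed above, indicates heavily overlapping labels, which is why it is passed to `max_samples` in the next cell.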


In [6]:
# based on the book's recommendation
rf_clf = RandomForestClassifier(n_estimators = 1000,
                                criterion = "entropy",
                                max_samples=avgU, # note: average uniqueness is used here
                                bootstrap=True, 
                                n_jobs=1,
                                random_state=42,
                                class_weight="balanced_subsample",
                                oob_score=False) # use only one of OOB or CV

cv_gen = KFold(n_splits=10, 
               #random_state=42, 
               shuffle=False)

score = rs.cv_score(classifier = rf_clf,
                     X = X,
                     y = y,
                     events = None,
                     pct_embargo = .0,
                     cv_gen = cv_gen,
                     sample_weight = None,
                     scoring = "neg_log_loss")

print('rf_clf Mean CV score: {0:.6f}\nCV Variance: {1:.6f}'.format(score.mean(), score.var()))

rf_clf Mean CV score: -0.626236
CV Variance: 0.011432


In [7]:
cv_gen0 = KFold(n_splits=10, 
                random_state=42, 
                shuffle=True) # shuffle is on!

score = rs.cv_score(classifier = rf_clf,
                     X = X,
                     y = y,
                     events = None,
                     pct_embargo = .0,
                     cv_gen = cv_gen0,
                     sample_weight = None,
                     scoring = "neg_log_loss")

print('rf_clf Mean CV score: {0:.6f}\nCV Variance: {1:.6f}'.format(score.mean(), score.var()))

rf_clf Mean CV score: -0.485730
CV Variance: 0.002437


### To shuffle or not... That is the question

Please note that we are not using the OOB score here. All previous scoring was based on OOB.

**Based on KFold CV**
    
Shuffled:
* Mean score: -0.485087
* Variance: 0.002315

Not Shuffled:
* Mean score: -0.625331
* CV Variance: 0.011186
    
**Based on Stratified KFold CV**
    
Shuffled:
* Mean score: -0.485636
* Variance: 0.001371

Not Shuffled:
* Mean score: -0.711593
* CV Variance: 0.014060

Based on the above, we can conclude that shuffling gives a higher but inflated score. In both cases, the shuffled datasets also show lower variance.

This points to information leakage; kindly refer to [notebook 6.1](https://github.com/boyboi86/AFML/blob/master/AFML%206.1.ipynb).

**Note**

If your mean score is positive, please go back and review your own code. We are only using negative log loss.

[sklearn.Metrics: log_loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html)

[issue 9144](https://github.com/scikit-learn/scikit-learn/issues/9144)

Issue 9144, which concerns cross_val_score, has not been fixed yet (as of this writing).

[negative_log_loss](https://stackoverflow.com/questions/43081251/sklearn-metrics-log-loss-is-positive-vs-scoring-neg-log-loss-is-negative)

The magnitude of the score is correct; simply flip the sign to recover the log loss.
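
The sign convention can be confirmed on a small synthetic problem. The sketch below uses made-up data and a plain `LogisticRegression` (both are assumptions for illustration, unrelated to the notebook's features or model) and compares `sklearn.metrics.log_loss` with the built-in `"neg_log_loss"` scorer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss

# Synthetic binary classification data (not the notebook's data).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# log_loss is a loss: it is positive and lower is better.
loss = log_loss(y_toy, clf.predict_proba(X_toy), labels=clf.classes_)

# The "neg_log_loss" scorer negates it so that higher is better.
score = get_scorer("neg_log_loss")(clf, X_toy, y_toy)

print("log_loss:     {:.6f}".format(loss))
print("neg_log_loss: {:.6f}".format(score))
```

The two values have the same magnitude and opposite signs, so a positive mean from a `"neg_log_loss"` scoring run signals a bug rather than a good model.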

In [8]:
# use the built-in PurgedKFold

cv_gen = rs.PurgedKFold(n_splits = 10,
                        events = Z, # tb reindexed to the label index
                        pct_embargo = 0.01) # 1% embargo as protection

score = rs.cv_score(classifier = rf_clf,
                     X = X,
                     y = y,
                     events = None,
                     pct_embargo = None, # the 1% embargo is set on cv_gen above
                     cv_gen = cv_gen,
                     sample_weight = None,
                     scoring = "neg_log_loss",
                     shuffle_after_split = False)

print('rf_clf Mean CV score: {0:.6f}\nCV Variance: {1:.6f}'.format(score.mean(), score.var()))

rf_clf Mean CV score: -0.651812
CV Variance: 0.005689


In [9]:
#shuffle after splitting using PurgedKFold

score = rs.cv_score(classifier = rf_clf,
                     X = X,
                     y = y,
                     events = None,
                     pct_embargo = None, # the 1% embargo is set on cv_gen above
                     cv_gen = cv_gen,
                     sample_weight = None,
                     scoring = "neg_log_loss",
                     shuffle_after_split = True) # added feature, not in the book

print('rf_clf Mean CV score: {0:.6f}\nCV Variance: {1:.6f}'.format(score.mean(), score.var()))

rf_clf Mean CV score: -0.653224
CV Variance: 0.006174


### Additional Layer: Embargo

Neither KFold run used shuffling. The embargo run uses PurgedKFold with a 1% embargo; the run without embargo uses plain KFold.

**Embargo:**

* Mean CV score: -0.651812
* CV Variance: 0.005689
    
**Without Embargo:**

* Mean score: -0.625331
* CV Variance: 0.011186
    
With the embargo acting as an additional barrier, the possibility of leakage is further reduced, since there is an extra layer of separation between the training and testing sets.

This is reflected in a lower mean score and a lower variance (more realistic).
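
To make the separation concrete, here is a minimal sketch of how purging and an embargo trim the training set around one test block. It is written from the general description of the technique and is not the code behind `rs.PurgedKFold`; the function `purged_train_positions`, the toy series `t1_demo` and the 3-day label span are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def purged_train_positions(t1, test_start, test_end, pct_embargo=0.0):
    """Training positions around the test block [test_start, test_end).

    t1: Series sorted by index; index = label start time, value = label end time.
    """
    n_obs = len(t1)
    positions = np.arange(n_obs)
    embargo = int(n_obs * pct_embargo)           # number of bars to skip after the test block

    test_t0 = t1.index[test_start]               # start of the first test label
    test_t1 = t1.iloc[test_start:test_end].max() # latest end time among test labels

    # Left side: keep only labels that have ended before the test block starts (purging).
    ended_before_test = (t1 < test_t0).to_numpy()
    left = positions[(positions < test_start) & ended_before_test]

    # Right side: start after the last test label has ended, then skip the embargo.
    right_start = t1.index.searchsorted(test_t1, side="right") + embargo
    right = positions[right_start:]
    return np.concatenate([left, right])

# Toy data: 20 daily observations, each label spanning 3 days.
idx = pd.date_range("2020-01-01", periods=20, freq="D")
t1_demo = pd.Series(idx + pd.Timedelta(days=2), index=idx)

print("purged only:      ", purged_train_positions(t1_demo, 8, 12))
print("purged + embargo: ", purged_train_positions(t1_demo, 8, 12, pct_embargo=0.1))
```

In this toy case purging drops the two labels just before the test block and the two just after it, and the 10% embargo removes two more observations on the right-hand side.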

### Conclusion

Model development is an iterative process: ideas move from paper to reality through repeated testing and are eventually built upon.

No matter which method you use to separate the train, test and validation samples (train_test_split, PurgedKFold, StratifiedKFold), any method should be fine as long as the three key samples do not cross-contaminate (leakage).

For IID samples, shuffling the data provides an additional layer of randomness.

For non-IID samples, however, splitting first and then shuffling only the training data could be an alternative (provided the validation and test samples are not contaminated).

One of the samples can be shuffled to add that layer of randomness, preferably the training set. For additional security, consider using an embargo.
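
The split-first, shuffle-later idea can be sketched as a thin wrapper around any splitter. This is only an illustration of the principle; how `shuffle_after_split` is implemented inside `rs.cv_score` is not shown in this notebook, and the helper `split_then_shuffle_train` and the toy array `X_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

def split_then_shuffle_train(cv, X, seed=42):
    """Yield (train_idx, test_idx) pairs, shuffling only the training indices."""
    rng = np.random.RandomState(seed)
    for train_idx, test_idx in cv.split(X):
        # The split is made on ordered data first; only afterwards is the
        # training side permuted. The test block is left untouched.
        yield rng.permutation(train_idx), test_idx

# Toy stand-in for 30 time-ordered observations (not the notebook's data).
X_demo = np.arange(30).reshape(-1, 1)
cv_demo = KFold(n_splits=3, shuffle=False)

for train_idx, test_idx in split_then_shuffle_train(cv_demo, X_demo):
    print("test block:", test_idx[0], "to", test_idx[-1],
          "| first training indices:", train_idx[:5])
```

Because the permutation happens after the split, the membership of each fold is unchanged, so no test observation can migrate into the training set.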

The result below was obtained with the training set shuffled after the split, with no cross-contamination of the test/validation data.

**Embargo:**

* Mean CV score: -0.653224
* CV Variance: 0.006174

Overall, the CV variance is slightly higher (increased randomness), while the mean CV score is slightly lower (more realistic, less biased).

So the results above show a bias-variance trade-off.

Given that financial series are non-IID, the order of the steps in model development should change slightly (split first, then shuffle), so that each dataset stays contained.

The additional layer of randomness may also deter selection bias. (If you think about it, selection bias can only exist if you get to select. If one of the three samples is random, selection bias seems unlikely, since random configurations are beyond your control.)

**Note**

This shuffle-after-split approach is not from the book; it is something I hypothesized based on the question posted by Dr Marco.

I believe that if the datasets are not shuffled, ML models will still be overfit (marginally, even with all precautions in place).

Therefore, as long as the datasets do not suffer from leakage, some level of randomness should always be introduced.

Update: In a later chapter, the author starts shuffling the test set after the split. So my guess is that the correct answer is to shuffle the test set to avoid selection bias.