### Ensemble Methods

This notebook covers the answers to the following exercises:

* Exercise 6.1
* Exercise 6.2
* Exercise 6.3
* Exercise 6.4
* Exercise 6.5

Explanations are provided along the way.

ML models are generally capable of deducing key features and producing forecasts; the key, however, is training them to produce effective and reliable outcomes.

Most of the functions below can be found under research/Ensemble

Contact: boyboi86@gmail.com

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import research as rs

%matplotlib inline

Num of CPU core:  4
Machine info:  Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Numpy 1.17.3
Pandas 1.0.3



In [2]:
from scipy.special import comb

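def cls_accuracy(N = 100, p = 1./3, k = 3.):
    """Compare individual accuracy p with bagged accuracy P[X > N/k], X ~ Binomial(N, p)."""
    # p_ accumulates P[X <= N/k]
    p_ = 0

    for i in np.arange(0, int(N/k)+1):
        # note: comb(N, i) overflows float64 for large N (e.g. N = 1200, k = 3)
        p_ += comb(N,i)*p**i*(1-p)**(N-i)
    if p > 1-p_ : 
        print("individual learners are considered poor classifier")
    else: 
        print("individual learners are not considered poor classifier")
    print(p, 1-p_) # individual accuracy, bagged accuracy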

cls_accuracy(N = 10, p = 1./3, k = 3.) # cls = 3, N estimate = 10
cls_accuracy(N = 100, p = 1./3, k = 3.) # cls = 3, N estimate = 100
cls_accuracy(N = 1001, p = 1./3, k = 3.) # cls = 3, N estimate = 1001

individual learners are not considered poor classifier
0.3333333333333333 0.4407356602143977
individual learners are not considered poor classifier
0.3333333333333333 0.4811966952738904
individual learners are not considered poor classifier
0.3333333333333333 0.5029710233411802


In [3]:
#optional comparison with 2 class only

cls_accuracy(N = 100, p = 1./2, k = 2.) # cls = 2, N estimate = 100
cls_accuracy(N = 1001, p = 1./2, k = 2.) # cls = 2, N estimate = 1001

individual learners are considered poor classifier
0.5 0.46020538130641064
individual learners are not considered poor classifier
0.5 0.5000000000001502


### Accuracy improvement

Based on the two-class results above, when the individual classifiers are poor (p = 0.5), bagging is unlikely to improve accuracy unless a sufficiently large number of estimators is used (N = 1001).

On the other hand, when the classifiers are considered good, a small number of estimators already provides a good outcome (see the k = 3 case).

In short, identifying relevant features improves the overall accuracy of the individual classifiers, and this work comes before employing any ML ensemble method.

### Variance reduction

Bagging samples with replacement, so each observation can be drawn multiple times. This may introduce more randomness at the cost of slightly higher bias (if the learner was already biased, the result will probably be worse).

Pasting samples without replacement, so each observation can be drawn at most once (this requires a large dataset and more computing power).
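
The difference between the two sampling schemes can be sketched with the standard library alone. The helper `draw_indices` and the sample sizes are illustrative assumptions, not part of the `research` package. For a large dataset, a bootstrap of the same size as the data is expected to contain roughly 63% of the distinct observations, while pasting never repeats one.

```python
import random

def draw_indices(n_obs, n_draws, replace, seed=0):
    """Return row indices drawn with (bagging) or without (pasting) replacement."""
    rng = random.Random(seed)
    population = range(n_obs)
    if replace:
        return rng.choices(population, k=n_draws)  # bagging: repeats allowed
    return rng.sample(population, n_draws)         # pasting: no repeats

n_obs = 10_000
bag_idx = draw_indices(n_obs, n_obs, replace=True)
paste_idx = draw_indices(n_obs, n_obs // 2, replace=False)

bag_unique = len(set(bag_idx)) / n_obs
print(f"bagging: {bag_unique:.3f} of the observations appear at least once")
print(f"pasting: {len(set(paste_idx))} unique indices out of {len(paste_idx)} draws")
```

The observations a bootstrap leaves out are the ones that become out-of-bag for that estimator.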

In financial applications, samples drawn with replacement are likely to be almost perfectly correlated (correlation close to 1.0), in which case bagging will not reduce variance. (Variance reduction through bagging assumes that observations are IID, which is not true in financial applications.)

> "In chapter 4 we studied why financial observations cannot be assumed to be IID..
> 
> ..and Bagging will not reduce variance regardless of number of N."
>
> AFML chapter 6, page 97, section 6.3.3
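
A small sketch of why correlation matters, using the standard expression for the variance of an average of N equally correlated predictions, sigma^2 * (rho + (1 - rho) / N). The function name and the example values of rho and N are assumptions chosen for illustration only.

```python
def bagged_variance(sigma2, rho, n_estimators):
    """Variance of the average of n equally correlated predictions.

    sigma2: variance of a single prediction
    rho:    average pairwise correlation between predictions
    """
    return sigma2 * (rho + (1.0 - rho) / n_estimators)

for rho in (0.0, 0.5, 0.99):
    row = [round(bagged_variance(1.0, rho, n), 4) for n in (1, 10, 1000)]
    print(f"rho={rho}: variance for N=1, 10, 1000 -> {row}")
```

As rho approaches 1 the second term vanishes and the variance stays at sigma^2 however large N becomes, which is the situation described in the quote.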

As a result, the OOB score will be inflated, as tested in a previous notebook: [Jupyter notebook](https://github.com/boyboi86/AFML/blob/master/AFML%204.1.ipynb)

**Note**

As the binomial expansion of accuracy for classes versus number of estimators shows, the most effective way to reduce bias is to act before running any ML ensemble method.

As for variance, a pasting ensemble does provide a better solution, but it is considered expensive. The alternative introduced by Dr. Marcos López de Prado is the sequential bootstrap ensemble method. [Jupyter notebook 5.4](http://localhost:8888/notebooks/AFML%205.4.ipynb#)

### Bagging as a method

As long as samples are redundant (non-IID) or almost perfectly correlated (correlation close to 1.0), bagging will be ineffective and still prone to overfitting.

When samples have low uniqueness, observations are most likely virtually identical to each other. As a result, bagging will still be ineffective.

Since bagging on such low-uniqueness samples leads to the overfitting problem discussed above, the out-of-bag (OOB) score will naturally be inflated and hence unreliable.

In [4]:
# exercise 6.3

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

#from sklearn.datasets import make_classification # create dataset
from sklearn.model_selection import train_test_split

dollar = pd.read_csv('./research/Sample_data/dollar_bars.txt', 
                 sep=',', 
                 header=0, 
                 parse_dates = True, 
                 index_col=['date_time'])

# Most of the functions below only use 'close'

close = dollar['close'].to_frame()

ffd_series = close.apply(np.log).cumsum()
ffd_series = rs.fracDiff_FFD(ffd_series, 
                    d = 1.99999889, 
                    thres=1e-5
                   ).dropna()

cs_event = rs.cs_filter(data = ffd_series, limit=(ffd_series.std() * 0.2))

df_mtx = pd.DataFrame(index = cs_event).assign(close = close,
                                                ffd_series = ffd_series).drop_duplicates().dropna()
df_mtx

Unnamed: 0,close,ffd_series
2015-01-04 23:20:12.567,2040.75,-0.003825
2015-01-07 10:55:23.194,2008.00,0.004166
2015-01-07 15:10:50.900,2014.50,-0.002360
2015-01-08 01:48:57.964,2037.25,0.006857
2015-01-08 05:47:32.006,2032.75,-0.003119
...,...,...
2016-11-28 01:31:48.252,2205.75,-0.099969
2016-12-04 23:32:49.403,2183.50,-0.102717
2016-12-05 02:06:52.025,2189.00,-0.096321
2016-12-08 12:46:05.346,2233.00,-0.101778


In [5]:
df_mtx['volatility'] = rs.vol(df_mtx.close, span0 = 50) #one of our features, since we do not have a side

df_mtx.dropna(inplace = True)

vb = rs.vert_barrier(data = df_mtx.close, events = cs_event, period = 'days', freq = 5)

# triple barrier events based on filter while data is also based on filtered index
tb = rs.tri_barrier(data = df_mtx.close, 
                    events = cs_event, 
                    trgt = df_mtx['volatility'], 
                    min_req = 0.0002, 
                    num_threads = 3, 
                    ptSl= [2,2], #2x barriers
                    t1 = vb, 
                    side = None)

mlabel = rs.meta_label(data = df_mtx.close, 
                       events = tb, 
                       drop = 0.05) # because we do not have side, we need to drop rare labels




[                                             t1                      sl  \
2015-01-07 15:10:50.900 2015-01-12 16:02:08.112                     NaT   
2015-01-08 01:48:57.964 2015-01-13 09:38:58.103                     NaT   
2015-01-08 05:47:32.006 2015-01-13 09:38:58.103                     NaT   
2015-01-09 14:48:46.704 2015-01-14 19:14:20.771 2015-01-14 04:31:40.468   
2015-01-12 14:36:34.243 2015-01-19 09:36:49.301 2015-01-14 19:14:20.771   
...                                         ...                     ...   
2015-09-01 19:46:17.742 2015-09-07 01:34:00.944                     NaT   
2015-09-01 22:05:09.069 2015-09-07 01:34:00.944                     NaT   
2015-09-02 00:20:04.277 2015-09-07 01:34:00.944                     NaT   
2015-09-02 02:09:17.333 2015-09-07 04:40:25.376                     NaT   
2015-09-02 08:08:50.931 2015-09-08 01:27:51.915                     NaT   

                                             pt  
2015-01-07 15:10:50.900 2015-01-08 01:48:57.964 

2020-05-31 12:45:26.856440 33.33% _pt_sl_t1 done after 0.11 minutes. Remaining 0.21 minutes.
2020-05-31 12:45:26.864467 66.67% _pt_sl_t1 done after 0.11 minutes. Remaining 0.05 minutes.

2020-05-31 12:45:27.126982 100.0% _pt_sl_t1 done after 0.11 minutes. Remaining 0.0 minutes.


In [6]:
mlabel['bin'].value_counts() #834

 1.0    489
-1.0    345
Name: bin, dtype: int64

In [7]:
X = df_mtx.reindex(mlabel.index)
Z = tb.reindex(mlabel.index)
y = mlabel['bin']

idx_Mat0 = rs.mp_idx_matrix(data = X.close, events = Z)

avgU = rs.av_unique(idx_Mat0).mean() #get ave uniqueness
print("Ave Uniqueness of Observations", avgU)

Ave Uniqueness of Observations 0.1201464235230114
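
The internals of `rs.av_unique` are not shown here, so the following is only a dependency-free sketch of the average-uniqueness idea from AFML chapter 4, under the assumption that the indicator matrix has one row per bar and one column per label. The function name and the toy matrix are hypothetical.

```python
def average_uniqueness(ind_matrix):
    """ind_matrix[t][i] is 1 if label i spans bar t, else 0.

    Returns one average-uniqueness value per label.
    Assumes every label spans at least one bar.
    """
    n_labels = len(ind_matrix[0])
    # number of labels that are concurrently alive at each bar
    concurrency = [sum(row) for row in ind_matrix]
    avg_u = []
    for i in range(n_labels):
        # uniqueness of label i at each bar it spans: 1 / concurrency
        u = [row[i] / c for row, c in zip(ind_matrix, concurrency) if row[i]]
        avg_u.append(sum(u) / len(u))
    return avg_u

# two labels over three bars, overlapping on the middle bar
toy = [
    [1, 0],
    [1, 1],
    [0, 1],
]
print(average_uniqueness(toy))
```

A value near 1 means a label rarely overlaps with others; the value of about 0.12 printed above indicates heavy overlap between labels.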


In [8]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, shuffle=True, stratify=None)

# Benchmarks
rf_clf0 = RandomForestClassifier(n_estimators = 1000, 
                                bootstrap=True, 
                                n_jobs=1,
                                random_state=42,
                                oob_score=True)

base_estimate0 = DecisionTreeClassifier()


bag_clf0 = BaggingClassifier(base_estimator = base_estimate0,
                                n_estimators = 1000,
                                bootstrap=True, 
                                n_jobs=1, 
                                random_state=42,
                                oob_score=True)


rf_clf0.fit(X_train, y_train)
bag_clf0.fit(X_train, y_train)

print('Default rf Out-of-bag score: {}\n'.format(rf_clf0.oob_score_))
print('Default dt Out-of-bag score: {}\n'.format(bag_clf0.oob_score_))

Default rf Out-of-bag score: 0.8027444253859348

Default dt Out-of-bag score: 0.7873070325900514



In [9]:
#based on book recommendation
rf_clf = RandomForestClassifier(n_estimators = 1000,
                                criterion = "entropy",
                                max_samples=avgU, #note average uniqueness used
                                bootstrap=True, 
                                n_jobs=1,
                                random_state=42,
                                class_weight="balanced_subsample",
                                oob_score=True)

clf = DecisionTreeClassifier(criterion = "entropy", 
                             max_features="auto", 
                             class_weight="balanced")

bag_clf = BaggingClassifier(base_estimator = clf,
                            n_estimators = 1000, 
                            max_samples=avgU, #note average uniqueness used
                            bootstrap=True, 
                            n_jobs=1,
                            random_state=42,
                            oob_score=True)

rf_clf.fit(X_train, y_train)
bag_clf.fit(X_train, y_train)

print('rf_clf rf Out-of-bag score: {}\n'.format(rf_clf.oob_score_))
print('bag_clf Out-of-bag score: {}\n'.format(bag_clf.oob_score_))

rf_clf rf Out-of-bag score: 0.7701543739279588

bag_clf Out-of-bag score: 0.758147512864494



### Random Forest Classifier vs Decision Tree Classifier (bagging)

With the book-recommended settings, both classifiers show a lower OOB score than with the defaults (0.770 vs 0.803 for the random forest, 0.758 vs 0.787 for the bagged decision tree).

* criterion = "entropy" for the information gain
* max_samples = average uniqueness of observations
* class_weight (depending on the type of classifier)
* max_features should be "auto" (Decision Tree Ensemble) and 1.0 (Random Forest Ensemble)
* n_estimators has to be large enough (refer to "Accuracy improvement" at the top)

The parameters above affect the overall OOB score, making it less inflated.

In [10]:
#based on book recommendation
rf_clf1 = RandomForestClassifier(n_estimators = 1000,
                                criterion = "entropy",
                                bootstrap=True, 
                                n_jobs=1,
                                random_state=42,
                                class_weight="balanced_subsample",
                                oob_score=True)

clf1 = DecisionTreeClassifier(criterion = "entropy",
                             splitter="random", #added random as splitter, which was in rf but not in dt
                             max_features=None, 
                             class_weight="balanced")

bag_clf1 = BaggingClassifier(base_estimator = clf1,
                            n_estimators = 850, 
                            max_samples=avgU, #note average uniqueness used
                            bootstrap=True, 
                            n_jobs=1,
                            random_state=42,
                            oob_score=True)

rf_clf1.fit(X_train, y_train)
bag_clf1.fit(X_train, y_train)

print('rf_clf rf Out-of-bag score: {}\n'.format(rf_clf1.oob_score_))
print('bag_clf Out-of-bag score: {}\n'.format(bag_clf1.oob_score_))

rf_clf rf Out-of-bag score: 0.79073756432247

bag_clf Out-of-bag score: 0.79073756432247



### Random Forest vs Decision Tree Ensemble

**Key difference**

`max_samples` for the decision tree ensemble was set to the average uniqueness of observations (using only a fraction of the dataset), while the random forest used the default (the full training set size).

**The changes made to Decision tree**

* splitter = "random"
* max_features = None (which will affect split)

**The changes made to Bagging Classifier**

* n_estimators = 850

After the changes, the OOB scores for the decision tree ensemble and the random forest became identical: 0.79073756432247

**Note**

Intuitively, this is my guess.

The overall uniqueness of the samples affects the number of estimators required: the relationship is inverse.

If a random forest classifier is employed instead of a decision tree classifier, randomness is incorporated into the algorithm and fewer samples are required (reducing variance without overfitting).

More importantly, fewer estimators are required (less expensive).

Hence, if possible, we should modify the random forest classifier to work with sequential bootstrap and use it with bagging for optimal results.

In [11]:
# if N is too small
cls_accuracy(N = 1, p = 1./3, k = 3.)
cls_accuracy(N = 1, p = 1./5, k = 5.)
cls_accuracy(N = 1, p = 1./10, k = 10.)

individual learners are considered poor classifier
0.3333333333333333 0.33333333333333326
individual learners are considered poor classifier
0.2 0.19999999999999996
individual learners are considered poor classifier
0.1 0.09999999999999998


In [12]:
# if N is too large
cls_accuracy(N = 1200, p = 1./3, k = 3.)
cls_accuracy(N = 1200, p = 1./5, k = 5.)
cls_accuracy(N = 1200, p = 1./10, k = 10.)

individual learners are considered poor classifier
0.3333333333333333 -inf
individual learners are not considered poor classifier
0.2 0.48273446631083894
individual learners are not considered poor classifier
0.1 0.475713305287882


### Number of Trees vs Number of Features

With reference to the above.

We used the binomial expansion on N (number of trees) against k (number of classes).

More classes means fewer trees are required. Likewise, if there are more relevant features, fewer trees will be required.

However, this assumes the features are relevant, that is, elements which provide information gain.

The results below were derived using the same binomial formula, under the same condition that every feature is equally weighted, with binary labels only.

**If the number of trees is too small**

The number of trees will never be too small for the classes available. (See the calculation above.)

But if N <= 1, the outcome will be a poor classifier even with relevant features available.

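**If the number of trees is too large**

However, N can be too large for this implementation: with N = 1200 and k = 3 the function returns -inf. This is a floating-point overflow in `comb(N, i)` for large N, not a genuine probability, so on its own it says nothing about the ensemble.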
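
One way to avoid the overflow is to accumulate the binomial terms in log space, so that no intermediate value leaves the float64 range. This is a minimal standard-library sketch and not the `cls_accuracy` function itself; the name `bagged_accuracy` is an assumption, and p must lie strictly between 0 and 1.

```python
import math

def bagged_accuracy(n_estimators, p, k):
    """P[X > N/k] for X ~ Binomial(N, p), with each term computed in log space."""
    threshold = int(n_estimators / k)
    cdf = 0.0
    for i in range(threshold + 1):
        # log of C(N, i) * p**i * (1 - p)**(N - i)
        log_pmf = (math.lgamma(n_estimators + 1)
                   - math.lgamma(i + 1)
                   - math.lgamma(n_estimators - i + 1)
                   + i * math.log(p)
                   + (n_estimators - i) * math.log1p(-p))
        cdf += math.exp(log_pmf)
    return 1.0 - cdf

print(bagged_accuracy(100, 1 / 3, 3.0))   # same case as cls_accuracy(N = 100, ...)
print(bagged_accuracy(1200, 1 / 3, 3.0))  # finite, unlike the -inf above
```

Each term is a probability below 1, so the sum stays finite for any N that fits in memory.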

#### Conclusion

In order to attain high accuracy, the number of relevant classes must be in line with the number of trees generated.

However, the number of estimators does seem to be a matter of debate within the ML community:

[Computational power vs Estimators](https://www.researchgate.net/publication/230766603_How_Many_Trees_in_a_Random_Forest)

In view of Condorcet's jury theorem, it seems that regardless of the number of classes k or n_features, what ultimately matters for a correct prediction is the probability that each individual learner is correct (p > 0.5).
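
The jury-theorem intuition can be checked numerically for a binary majority vote. The sketch below assumes independent voters and an odd number of them, which real ensemble members rarely satisfy. `majority_vote_accuracy` is a hypothetical helper, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def majority_vote_accuracy(n_voters, p):
    """Probability that more than half of n independent voters are correct."""
    need = n_voters // 2 + 1  # smallest strict majority
    return sum(comb(n_voters, i) * p**i * (1 - p)**(n_voters - i)
               for i in range(need, n_voters + 1))

for p in (0.45, 0.55):
    row = [round(majority_vote_accuracy(n, p), 4) for n in (1, 11, 101)]
    print(f"p={p}: majority accuracy for N=1, 11, 101 -> {row}")
```

With p above 0.5 the majority becomes more accurate as voters are added, and with p below 0.5 it becomes less accurate. N is kept small here so that the plain `comb` product stays within float range.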

**Note**

AFML page 101, section 6.7 has a short discussion on Support Vector Machine (SVM) scalability.

This may not be important now, since we are focused on a mean-reversion strategy.

However, for a trend strategy, this might be useful.

[SVM Trend Strategy](https://www.cs.princeton.edu/sites/default/files/uploads/saahil_madge.pdf)

In [13]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

accuracy_array = np.zeros(5)

skf = StratifiedKFold(n_splits=5, 
                      shuffle=True, #shuffle = True
                      random_state = 42)

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    rf_clf0.fit(X_train, y_train.values.ravel()) #use the original rf cls
    y_pred_rf = rf_clf0.predict_proba(X_test)[:, 1] #probability of the positive class (not used below)
    y_pred = rf_clf0.predict(X_test)
    accuracy_array[i] = accuracy_score(y_test, y_pred)

print("Mean Strat KFold accuracy with shuffle: {:.8f}".format(np.mean(accuracy_array)))

Mean Strat KFold accuracy with shuffle: 0.80450905


### OOB score vs Stratified KFold accuracy

**Based on initial random forest input: OOB score: 0.8027444253859348**

The OOB accuracy is computed on the shuffled training data (`train_test_split` defaults). Each observation is scored only by the trees whose bootstrap sample did not contain it, so only part of the ensemble votes rather than the full forest, and which observations are left out of each tree is random.

Using only a subsample of the forest may introduce more randomness.

Moreover, if observations are redundant (non-IID), repeated sampling on such data will inflate the OOB score.

**Based on Shuffled Stratified KFold: Mean KFold accuracy: 0.80450905**

Stratified KFold uses the entire ensemble (full forest) to evaluate each held-out fold, hence accuracy should be better. Stratification preserves the class proportions in every fold, which makes the comparison fair.

With shuffle, however, the dataset cannot preserve its order dependency.

Observations were shuffled and then split, so the same problem as with the OOB method remains. It is in fact even worse, since the entire ensemble is used for evaluation. As such, the accuracy will also be more inflated.

**Note**

Without shuffle, the KFold score is not inflated; refer to [Notebook 4.1](http://localhost:8888/notebooks/AFML%204.1.ipynb).
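
To make the "without shuffle" case concrete, here is a minimal sketch of how an unshuffled K-fold split lays out its test folds as contiguous blocks, so that time-ordered observations are not interleaved between train and test. `contiguous_folds` is a hypothetical helper written for illustration, not a function from scikit-learn or the `research` package.

```python
def contiguous_folds(n_obs, n_splits):
    """Yield (train_idx, test_idx) pairs where each test fold is a contiguous block."""
    base, extra = divmod(n_obs, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional observation
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_obs))
        yield train_idx, test_idx
        start = stop

for train_idx, test_idx in contiguous_folds(10, 3):
    print("test:", test_idx, "train:", train_idx)
```

With shuffling, neighbouring (and therefore overlapping) observations can land on both sides of the split, which is the leakage described above.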