* By: Jacques Joubert
* Email: jacques@quantsportal.com
* Reference: Advances in Financial Machine Learning, Marcos Lopez De Prado, pg 101


# Chapter 6 Ensemble Methods

## Question 1:
**Why is bagging based on random sampling with replacement? Would bagging still reduce a forecast’s variance if sampling were without replacement?**

Sampling with replacement is called bagging; sampling without replacement is called pasting. Pasting suits very large data sets, where subsets drawn without replacement are still plentiful and diverse. Both serve the same purpose: to train a diverse set of models on different random subsets of the training set. Aggregating these models yields an ensemble with a similar bias but a lower variance than a single predictor trained on the original training set.

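Sampling with replacement introduces more diversity into the subsets each predictor is trained on. As a result, bagging ends up with a slightly higher bias than pasting, but its predictors are less correlated, so the ensemble's variance is reduced. Sampling without replacement would still reduce variance, provided each subset is smaller than the training set; otherwise every predictor would be trained on identical data. Overall, bagging is preferred as it often leads to better models.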
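To make the comparison concrete, here is a minimal sketch that fits the same tree ensemble with and without replacement. The iris data, the subset size of 50, the fixed seeds, and the helper name `fit_ensemble` are illustrative assumptions, not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): iris, split 70/30
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


def fit_ensemble(bootstrap):
    """Fit 100 trees, each on 50 observations drawn with (True) or without (False) replacement."""
    clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, max_samples=50,
                            bootstrap=bootstrap, random_state=0)
    return clf.fit(X_train, y_train)


bagging = fit_ensemble(bootstrap=True)    # bagging: sampling with replacement
pasting = fit_ensemble(bootstrap=False)   # pasting: sampling without replacement

# Number of distinct observations seen by the first tree of each ensemble
bagging_unique = len(np.unique(bagging.estimators_samples_[0]))
pasting_unique = len(np.unique(pasting.estimators_samples_[0]))

bagging_score = bagging.score(X_test, y_test)
pasting_score = pasting.score(X_test, y_test)

print('Bagging: {} distinct observations in first tree, test accuracy {:.3f}'.format(bagging_unique, bagging_score))
print('Pasting: {} distinct observations in first tree, test accuracy {:.3f}'.format(pasting_unique, pasting_score))
```

On a small, clean data set like this one, the two test accuracies are likely to be close; the sketch only shows how the two sampling schemes are configured and how they differ in the observations each tree sees.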

## Question 2:
**Suppose that your training set is based on highly overlapping labels (i.e., with low uniqueness, as defined in Chapter 4).**

**(a) Does this make bagging prone to overfitting, or just ineffective? Why?**

“Redundant observations have two detrimental effects on bagging. First, the samples drawn with replacement are more likely to be virtually identical, even if they do not share the same observations. This makes $\bar{\rho} \approx 1$, and bagging will not reduce variance, regardless of the number of estimators N.”

The advantage of bagging lies in its ability to reduce forecast variance, which helps prevent overfitting. The variance of the bagged prediction is a function of the number of bagged estimators, the average variance of a single estimator’s prediction, and the average correlation among their forecasts.
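Written out in simplified notation (an assumption of this note: $N$ estimators with forecasts $\varphi_i$, average single-estimator variance $\bar{\sigma}^2$, and average pairwise correlation $\bar{\rho}$ between forecasts), the relationship is:

$$
V\left[\frac{1}{N}\sum_{i=1}^{N}\varphi_i\right] = \bar{\sigma}^2\left(\bar{\rho} + \frac{1-\bar{\rho}}{N}\right)
$$

As $N$ grows, the second term vanishes and the variance tends to $\bar{\sigma}^2\bar{\rho}$, so the average correlation sets a floor that adding more estimators cannot lower.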

Models trained on the same type of data are likely to make the same type of errors. When many samples overlap (low uniqueness), the resulting models have poor diversity (high correlation).

Bagging is only effective to the extent that the average correlation among forecasts is less than 1. One of the goals of sequential bootstrapping (Chapter 4) is to produce samples that are as independent as possible, thereby reducing the average correlation, which should lower the variance of bagging classifiers.

<div align="center">
  <img src="https://raw.githubusercontent.com/hudson-and-thames/research/ch6_questions/Chapter6/images/ch6_bagged_prediction.png" width="500"><br>
</div>


The plot above shows that highly correlated models (high $\bar{\rho}$) fail to reduce the variance, no matter how many are combined. It is better to have a few diverse models than a large number of correlated ones.
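The same behaviour can be checked numerically. The sketch below evaluates the standard deviation of the bagged prediction, $\bar{\sigma}\sqrt{\bar{\rho} + (1-\bar{\rho})/N}$, for a few values of $N$ and $\bar{\rho}$. The function name `bagged_std` and the grid of values are illustrative assumptions.

```python
import numpy as np


def bagged_std(n_estimators, avg_corr, avg_std=1.0):
    """Standard deviation of the average of N forecasts.

    Assumes every forecast has standard deviation avg_std and every pair
    of forecasts has correlation avg_corr.
    """
    variance = avg_std ** 2 * (avg_corr + (1.0 - avg_corr) / n_estimators)
    return np.sqrt(variance)


# Rows: average correlation; columns: number of estimators
estimator_counts = (1, 10, 100)
for avg_corr in (0.0, 0.5, 0.9, 1.0):
    stds = [bagged_std(n, avg_corr) for n in estimator_counts]
    row = ', '.join('N={}: {:.3f}'.format(n, s) for n, s in zip(estimator_counts, stds))
    print('avg_corr={:.1f} -> {}'.format(avg_corr, row))
```

With an average correlation of 1 the standard deviation never moves from that of a single estimator, whereas five uncorrelated estimators already beat five hundred estimators with an average correlation of 0.9.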

Thus, training models on data with low average uniqueness makes the bagging ensemble less effective at reducing variance. It also inflates the out-of-bag accuracy.

**(b) Is out-of-bag accuracy generally reliable in financial applications? Why?**

“The second detrimental effect from observation redundancy is that out-of-bag accuracy will be inflated. This happens because random sampling with replacement places in the training set samples that are very similar to those out-of-bag. In such a case, a proper stratified k-fold cross-validation without shuffling before partitioning will show a much lower testing-set accuracy than the one estimated out-of-bag. For this reason, it is advisable to set StratifiedKFold(n_splits=k, shuffle=False) when using that sklearn class, cross-validate the bagging classifier, and ignore the out-of-bag accuracy results. A low number k is preferred to a high one, as excessive partitioning would again place in the testing set samples too similar to those used in the training set.”
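A minimal sketch of the recommended procedure follows: cross-validate the bagging classifier with `StratifiedKFold(n_splits=k, shuffle=False)` and compare the result with the out-of-bag score. The iris data, `k = 3`, and the ensemble size are placeholder assumptions. Iris observations do not overlap, so the two numbers should be close here; on financial data with overlapping labels the out-of-bag figure would be expected to be the higher of the two.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): iris has no overlapping labels
X, y = load_iris(return_X_y=True)

clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        oob_score=True, random_state=0)

# Low k and no shuffling, as the quoted passage advises
cv = StratifiedKFold(n_splits=3, shuffle=False)
cv_scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

# Out-of-bag accuracy from a fit on the full data set, for comparison only
clf.fit(X, y)

print('Cross-validated accuracy: {:.3f} (+/- {:.3f})'.format(cv_scores.mean(), cv_scores.std()))
print('Out-of-bag accuracy: {:.3f}'.format(clf.oob_score_))
```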


## Question 3:

**Build an ensemble of estimators, where the base estimator is a decision tree.**

In [54]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_iris 
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score 

In [55]:
# Load the data set
iris = load_iris()
X_data = iris['data']
y_data = iris['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=42, shuffle=True, stratify=None)

In [71]:
# Create model
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter="best", max_leaf_nodes=None),
                            n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1, oob_score=True)

# Fit and score
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

print('Out-of-bag score on training: {}'.format(bag_clf.oob_score_))
print('Test Score: {}'.format(accuracy_score(y_test, y_pred)))

**(a) How is this ensemble different from an RF?**

A random forest also uses bagging, so it shares most of the hyperparameters of a decision tree and a bagging classifier. The key difference is that it introduces extra randomness when growing trees: at each node it searches for the best split among a random subset of the features rather than among all of them. This results in greater tree diversity, which lowers the ensemble's variance.

**(b) Using sklearn, produce a bagging classifier that behaves like an RF. What parameters did you have to set up, and how?**

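First we build a random forest classifier, and then we fit a bagging classifier whose parameters are adjusted to approximate the random forest. Here we do this by setting the splitter hyperparameter of the decision trees to "random", which randomizes the split thresholds and so adds tree diversity. This is only a rough equivalent: to restrict each split to a random subset of features, as a random forest does, set max_features (for example "sqrt") on the decision tree instead.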

In [81]:
# Random Forest
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1, oob_score=True)
rnd_clf.fit(X_train, y_train) 
y_pred_rf = rnd_clf.predict(X_test)

# Bagging -> RF
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
                            n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1, oob_score=True)
bag_clf.fit(X_train, y_train)
y_pred_bagging = bag_clf.predict(X_test)

In [82]:
print('Random Forest Classifier')
print('Out-of-bag score on training: {}'.format(rnd_clf.oob_score_))
print('Test Score: {}'.format(accuracy_score(y_test, y_pred_rf)))
print('')

print('Bagging Classifier')
print('Out-of-bag score on training: {}'.format(bag_clf.oob_score_))
print('Test Score: {}'.format(accuracy_score(y_test, y_pred_bagging)))

Random Forest Classifier
Out-of-bag score on training: 0.9238095238095239
Test Score: 1.0

Bagging Classifier
Out-of-bag score on training: 0.9333333333333333
Test Score: 1.0
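As a variation on the comparison above, the sketch below sets `max_features="sqrt"` on the base decision tree and keeps `splitter="best"`, so that each split considers a random subset of features in the way a random forest does. The fixed seeds and variable names are assumptions added for illustration, so the scores will not necessarily match the ones printed above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Reference random forest: a random subset of features is considered at each split
rf = RandomForestClassifier(n_estimators=500, max_features='sqrt', max_leaf_nodes=16,
                            oob_score=True, random_state=0)
rf.fit(X_train, y_train)

# Bagging over trees that also subsample features at each split.
# Note: max_features is set on the tree (per split), not on BaggingClassifier
# (where it would select one feature subset per estimator).
tree = DecisionTreeClassifier(splitter='best', max_features='sqrt', max_leaf_nodes=16)
bag = BaggingClassifier(tree, n_estimators=500, max_samples=1.0, bootstrap=True,
                        oob_score=True, random_state=0)
bag.fit(X_train, y_train)

rf_test = accuracy_score(y_test, rf.predict(X_test))
bag_test = accuracy_score(y_test, bag.predict(X_test))

print('Random forest  OOB: {:.3f}  test: {:.3f}'.format(rf.oob_score_, rf_test))
print('Bagged trees   OOB: {:.3f}  test: {:.3f}'.format(bag.oob_score_, bag_test))
```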
