In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.stattools import kpss, adfuller
from statsmodels.stats.stattools import jarque_bera
from statsmodels.tsa.stattools import coint

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report, accuracy_score

from numpy.random import normal

In [28]:
from AFML_module.dataset_utilities import (get_instrument_attributes, 
                                           form_dollar_bars, 
                                           form_time_bars,
                                           form_vol_bars,
                                           reduce_to_active_symbols, 
                                           apply_roll_factors)

In [2]:
# Credit risk data from Kaggle.
# Ground truth is whether the individual has defaulted.
# https://www.kaggle.com/datasets/adilshamim8/credit-risk-benchmark-dataset

from AFML_module.config import RAW_DATA_DIR

data = pd.read_csv(RAW_DATA_DIR/"Credit Risk Benchmark Dataset.csv")


2025-09-01 23:53:17.854 | INFO     | AFML_module.config:<module>:11 - PROJ_ROOT path is: /Users/noahbittermann/PycharmProjects/advances-in-financial-machine-learning


In [3]:
data.head(5)

Unnamed: 0,rev_util,age,late_30_59,debt_ratio,monthly_inc,open_credit,late_90,real_estate,late_60_89,dependents,dlq_2yrs
0,0.006999,38.0,0.0,0.30215,5440.0,4.0,0.0,1.0,0.0,3.0,0
1,0.704592,63.0,0.0,0.471441,8000.0,9.0,0.0,1.0,0.0,0.0,0
2,0.063113,57.0,0.0,0.068586,5000.0,17.0,0.0,0.0,0.0,0.0,0
3,0.368397,68.0,0.0,0.296273,6250.0,16.0,0.0,2.0,0.0,0.0,0
4,1.0,34.0,1.0,0.0,3500.0,0.0,0.0,0.0,0.0,1.0,0


# Exercises

### 6.1

Why is bagging based on random sampling with replacement? Would bagging
still reduce a forecast’s variance if sampling were without replacement?

The goal of bagging is to reduce variance by averaging forecasts from models trained on bootstrapped datasets. However, it is not effective if the forecasts from the different models are highly correlated.

Suppose we sample $m$ datapoints without replacement from an original sample of $n$. If $m=n$, then all our resampled datasets are identical, and the model forecasts will have a correlation of 1 (up to any randomness internal to the learner). Moreover, the datasets remain very similar as we decrease $m$: for example, with 10 observations to draw from, any two draws of 7 without replacement share at least 4 observations, so the forecasts will be highly correlated. If we continue to decrease $m$ to lessen this effect, then we begin to sacrifice accuracy, since each model is trained on only a small fraction of the original dataset. Bagging without replacement can therefore still reduce variance when $m < n$, but only by trading away accuracy. Sampling with replacement lets each dataset keep the full size $n$ while still differing from the others, which is why it is the default.
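
The overlap argument above can be checked numerically. The sketch below is an illustration only: the helper `mean_overlap` is a hypothetical name, and the Jaccard overlap of distinct observations is an assumed proxy for how similar two resampled training sets are (it ignores how often each observation is repeated in a bootstrap sample).

```python
import numpy as np


def mean_overlap(n, m, replace, n_pairs=500, seed=0):
    """Average Jaccard overlap of the distinct observations in two resampled datasets."""
    rng = np.random.default_rng(seed)
    overlaps = []
    for _ in range(n_pairs):
        a = set(rng.choice(n, size=m, replace=replace).tolist())
        b = set(rng.choice(n, size=m, replace=replace).tolist())
        overlaps.append(len(a & b) / len(a | b))
    return float(np.mean(overlaps))


n = 100
# (m, replace): full-size without replacement, subsample without replacement, bootstrap
for m, replace in [(100, False), (70, False), (100, True)]:
    overlap = mean_overlap(n, m, replace)
    print(f"m={m:3d}, replace={replace!s:5}: mean Jaccard overlap = {overlap:.2f}")
```

Under these assumptions, full-size sampling without replacement always gives an overlap of exactly 1, while the full-size bootstrap gives a lower overlap than drawing 70 of 100 without replacement.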

### 6.2

Suppose that your training set is based on highly overlapping labels (i.e., with low
uniqueness, as defined in Chapter 4).

(a) Does this make bagging prone to overfitting, or just ineffective? Why?

(b) Is out-of-bag accuracy generally reliable in financial applications? Why?

(a) As shown in the text, the bagged estimator cannot increase the variance _compared to a single estimator_. Therefore, bagging by itself will not lead to overfitting. However, it can be ineffective if the correlation between forecasts is near 1, which is the case if the estimators are trained on very similar resampled datasets. This occurs if there is a high degree of observational redundancy in the original dataset.

One should also note that the variance of a single estimator will increase if the samples are not IID.
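
The dependence on forecast correlation can be made concrete with the standard expression for the variance of an average of $N$ equally correlated forecasts, $\bar\sigma^2\big(\bar\rho + \frac{1-\bar\rho}{N}\big)$. The function name `bagged_variance` below is a hypothetical helper, and the inputs are illustrative values rather than estimates from the credit dataset.

```python
def bagged_variance(sigma2, rho, n_estimators):
    """Variance of the average of n_estimators forecasts.

    Assumes every forecast has variance sigma2 and every pair has correlation rho.
    """
    return sigma2 * (rho + (1.0 - rho) / n_estimators)


# With rho = 1 the average is no better than one estimator;
# with rho = 0 the variance shrinks like 1 / n_estimators.
for rho in (0.0, 0.5, 0.9, 1.0):
    print(f"rho={rho:.1f}: variance of bagged forecast = {bagged_variance(1.0, rho, 100):.3f}")
```

The bagged variance never exceeds `sigma2` for correlations between 0 and 1, which is the sense in which bagging is at worst ineffective rather than harmful.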

(b) OOB accuracy is not reliable in finance, because samples are not typically IID. Because we cannot resample independently, the data in the bag will be very similar to the data out of the bag. Specifically, the same information partially determines both the in-bag and the out-of-bag data, which constitutes data leakage.
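
A small synthetic experiment illustrates the leakage. This is a sketch under strong assumptions, not a test on the credit data: labels are pure noise, and redundancy is modelled in the crudest possible way by repeating every observation `n_copies` times. All variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Pure-noise problem: labels are independent of the features,
# so an honest accuracy estimate should be close to 0.5.
n_unique, n_features, n_copies = 200, 5, 10
X_unique = rng.normal(size=(n_unique, n_features))
y_unique = rng.integers(0, 2, size=n_unique)

# Extreme redundancy: every observation appears n_copies times.
X_redundant = np.repeat(X_unique, n_copies, axis=0)
y_redundant = np.repeat(y_unique, n_copies)

clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50, oob_score=True, random_state=0
)
clf.fit(X_redundant, y_redundant)

# Fresh noise drawn from the same process plays the role of true out-of-sample data.
X_fresh = rng.normal(size=(2000, n_features))
y_fresh = rng.integers(0, 2, size=2000)

oob_accuracy = clf.oob_score_
fresh_accuracy = clf.score(X_fresh, y_fresh)
print(f"OOB accuracy:   {oob_accuracy:.2f}")
print(f"Fresh accuracy: {fresh_accuracy:.2f}")
```

Because a copy of almost every out-of-bag row is also in the bag, the trees can simply recall its label, so the OOB score looks excellent even though there is nothing to learn.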

### 6.3

Build an ensemble of estimators, where the base estimator is a decision tree.

(a) How is this ensemble different from an RF?

(b) Using sklearn, produce a bagging classifier that behaves like an RF. What
parameters did you have to set up, and how?

In [19]:
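# build an ensemble of estimators where the base is a decision tree
# features: every column except the last; target: the default flag dlq_2yrs
X = data[data.columns[:-1]]
y = data["dlq_2yrs"]

# note: test_size=0.7 holds out 70% of the rows, so only 30% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)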

clf_tree = DecisionTreeClassifier(criterion="entropy", max_depth=10)
clf_bagged = BaggingClassifier(estimator=clf_tree, n_estimators=100, max_samples=1.0)

clf_bagged.fit(X_train, y_train)

print("In Sample Results")
print(classification_report(y_train, clf_bagged.predict(X_train)))

print("Out of Sample Results")
print(classification_report(y_test, clf_bagged.predict(X_test)))

In Sample Results
              precision    recall  f1-score   support

           0       0.85      0.91      0.87      2536
           1       0.90      0.83      0.86      2478

    accuracy                           0.87      5014
   macro avg       0.87      0.87      0.87      5014
weighted avg       0.87      0.87      0.87      5014

Out of Sample Results
              precision    recall  f1-score   support

           0       0.74      0.80      0.77      5821
           1       0.79      0.73      0.76      5879

    accuracy                           0.76     11700
   macro avg       0.77      0.76      0.76     11700
weighted avg       0.77      0.76      0.76     11700



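(a) This bagged classifier is not a random forest. Both methods train each tree on a bootstrapped sample, but an RF adds a second level of randomness: at each node, only a random subset of the features is evaluated for the split.

(b) In sklearn this is controlled by max_features. Setting max_features on the DecisionTreeClassifier base estimator redraws the feature subset at every split, which is what an RF does. Setting max_features on the BaggingClassifier, as in the next cell, instead draws one feature subset per tree, so it only approximates an RF.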

In [20]:
# max_features=0.7 on BaggingClassifier draws one random 70% feature subset per tree
# (this is bagging with feature subsampling, not boosting)
clf_tree = DecisionTreeClassifier(criterion="entropy", max_depth=10)
clf_subspace = BaggingClassifier(estimator=clf_tree, n_estimators=100, max_samples=1.0, max_features=0.7)

clf_subspace.fit(X_train, y_train)

print("In Sample Results")
print(classification_report(y_train, clf_subspace.predict(X_train)))

print("Out of Sample Results")
print(classification_report(y_test, clf_subspace.predict(X_test)))

In Sample Results
              precision    recall  f1-score   support

           0       0.85      0.92      0.89      2536
           1       0.91      0.83      0.87      2478

    accuracy                           0.88      5014
   macro avg       0.88      0.88      0.88      5014
weighted avg       0.88      0.88      0.88      5014

Out of Sample Results
              precision    recall  f1-score   support

           0       0.75      0.81      0.78      5821
           1       0.80      0.73      0.76      5879

    accuracy                           0.77     11700
   macro avg       0.77      0.77      0.77     11700
weighted avg       0.77      0.77      0.77     11700
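
For comparison, a bagging classifier that samples features at every split could look like the sketch below. It uses a synthetic dataset from `make_classification` rather than the credit data, so its scores are not comparable to the results above. The names `X_demo`, `bag_rf_like`, and `rf` are illustrative, and `max_features="sqrt"` is an assumed choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the credit dataset used above).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Feature subsampling happens inside each tree, at every split.
tree = DecisionTreeClassifier(criterion="entropy", max_features="sqrt")

# Bagging itself keeps all features (max_features=1.0) and bootstraps the rows.
bag_rf_like = BaggingClassifier(
    tree, n_estimators=100, max_samples=1.0, max_features=1.0,
    bootstrap=True, random_state=0,
)
rf = RandomForestClassifier(
    n_estimators=100, criterion="entropy", max_features="sqrt", random_state=0
)

scores = {}
for name, model in [("bagged trees (RF-like)", bag_rf_like), ("random forest", rf)]:
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

The two models should behave similarly, since both bootstrap the rows and redraw the candidate features at each split. They will not match exactly because their random streams differ.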



### 6.4

Consider the relation between an RF, the number of trees it is composed of, and
the number of features utilized:

(a) Could you envision a relation between the minimum number of trees needed
in an RF and the number of features utilized?

(b) Could the number of trees be too small for the number of features used?

(c) Could the number of trees be too high for the number of observations available?

(a), (b) If there are too few trees, then it is possible that certain features will never be considered for splitting. For simplicity, we focus on trees that have only one split each. If there are $N$ total features and we consider $M$ of them in a split, then the probability of a given feature being considered by a tree is $$ P(\text{Considered by Tree}) =  \frac{M}{N}.$$
If there are $T$ trees, then the probability of never considering, say, feature $i$ is
$$ P(\text{Feature i Never Considered}) =  \Big(1 - \frac{M}{N}\Big)^T.$$
Now we want the probability that all features are considered at least once, and we require it to exceed some target level $p$. This is complicated by the fact that feature $i$ being considered is not independent of feature $j$ being considered. As a first approximation we neglect this dependence, so that the probability factorizes over features, and then expand to first order in the small quantity $(1 - M/N)^T$:

$$ P(\text{All Features Considered at Least Once}) \approx  \Big(1- \Big(1 - \frac{M}{N}\Big)^T\Big)^N \approx 1 - N\Big(1 - \frac{M}{N}\Big)^T > p.$$

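Solving for $T$ (the inequality flips because $\log(1 - \frac{M}{N}) < 0$), we find $$ T > \frac{\log\big(\frac{1 - p}{N}\big)}{\log\big(1 - \frac{M}{N}\big)}.$$

For e.g. $N = 20$, $M = 10$, $p=0.99$, we find $T \gtrsim 11$. Because $1 - N\big(1 - \frac{M}{N}\big)^T$ is also a union-bound lower bound on the probability that all features are considered, this value of $T$ remains sufficient once the dependence between features is accounted for. Real trees make many splits, each of which considers a new random subset of features, so in practice fewer trees are needed.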
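
The bound can be evaluated and checked by simulation. The sketch below keeps the same simplifying assumption of one split per tree and requires $M < N$; `min_trees` and `coverage_probability` are hypothetical helper names.

```python
import math

import numpy as np


def min_trees(n_features, n_considered, p):
    """Smallest integer T with T > log((1 - p) / N) / log(1 - M / N); requires M < N."""
    ratio = math.log((1.0 - p) / n_features) / math.log(1.0 - n_considered / n_features)
    return math.floor(ratio) + 1


def coverage_probability(n_features, n_considered, n_trees, n_trials=5000, seed=0):
    """Monte Carlo estimate of P(every feature is considered by at least one tree)."""
    rng = np.random.default_rng(seed)
    # One random key per (trial, tree, feature); the M smallest keys are the chosen features.
    keys = rng.random((n_trials, n_trees, n_features))
    kth = np.partition(keys, n_considered - 1, axis=-1)[..., n_considered - 1 : n_considered]
    chosen = keys <= kth
    covered = chosen.any(axis=1)  # feature seen by at least one tree in the trial
    return float(covered.all(axis=1).mean())


T = min_trees(20, 10, 0.99)
print(f"minimum number of trees: {T}")
print(f"simulated coverage probability: {coverage_probability(20, 10, T):.4f}")
```

For $N = 20$, $M = 10$, $p = 0.99$ the formula gives 11 trees, and the simulated coverage should sit at or slightly above the target level.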

(c) Increasing the number of trees relative to the number of observations does not cause overfitting. However, there are diminishing returns to doing so, especially if the correlation between the trees' predictions is high.

### 6.5

How is out-of-bag accuracy different from stratified k-fold (with shuffling) cross-validation accuracy?

Actually, they are similar in that both artificially inflate the estimated OOS accuracy of the model. In the case of OOB accuracy, if the samples are not IID, then the bootstrapped training sets will have information overlap with the OOB test set. This constitutes data leakage.

The same thing happens for k-fold cross-validation __with__ shuffling. If the samples are not IID, then labels that are the result of overlapping information will be spread across the training sets and the validation sets, again leading to data leakage.
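
The effect of shuffling can be illustrated on synthetic data. The sketch below again assumes pure-noise labels and models overlapping information by repeating each observation. `GroupKFold` is used only as a simple stand-in for a split that keeps related observations together; it is not the purged cross-validation discussed in the book.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Noise labels, with each observation repeated n_copies times (assumed form of redundancy).
n_unique, n_features, n_copies = 300, 5, 10
X_unique = rng.normal(size=(n_unique, n_features))
y_unique = rng.integers(0, 2, size=n_unique)
X_cv = np.repeat(X_unique, n_copies, axis=0)
y_cv = np.repeat(y_unique, n_copies)
groups = np.repeat(np.arange(n_unique), n_copies)  # copies share a group id

tree = DecisionTreeClassifier(random_state=0)

# Shuffling scatters copies of the same observation across train and validation folds.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
acc_shuffled = cross_val_score(tree, X_cv, y_cv, cv=shuffled_cv).mean()

# Keeping all copies of an observation in the same fold removes the leakage.
acc_grouped = cross_val_score(
    tree, X_cv, y_cv, cv=GroupKFold(n_splits=5), groups=groups
).mean()

print(f"stratified k-fold with shuffling: {acc_shuffled:.2f}")
print(f"grouped k-fold:                   {acc_grouped:.2f}")
```

With shuffling, the validation folds contain copies of training rows and the score is close to perfect. With grouping, the score falls back to roughly chance, which is the honest answer for noise labels.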