# Lecture 6 - Classification & Regression II (Tree Ensembles)

The goal of this notebook is to go through the most popular ensemble methods for Decision Trees. Compared to the previous notebook, which looked at individual Decision Trees, visualizing the results (i.e., the trees) is no longer meaningfully possible. Hence we focus on the results (f1 scores) for different hyperparameter settings.
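
As a reminder of what the f1 score measures, here is a minimal sketch that computes it by hand as the harmonic mean of precision and recall. The helper `f1_from_labels` and the toy labels are made up for illustration; the notebook itself uses scikit-learn's `f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the F1 score for one positive class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: treat the score as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical, not taken from the cardio dataset)
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]
f1_toy = f1_from_labels(y_true_toy, y_pred_toy)
print(f1_toy)
```

Here 2 of the 3 predicted positives are correct and 2 of the 3 actual positives are found, so precision, recall, and f1 all equal 2/3.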

Let's get started...


## Setting up the notebook

Specify how plots get rendered

In [None]:
%matplotlib notebook

Make all required imports.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score, mean_squared_error

## Prepare Training & Test

We have already done these steps many times, so there's no need for any details. As Decision Trees do not require normalized data, there's also not much to do in terms of data preprocessing.

### Load Data

In [None]:
df = pd.read_csv('data/cardio_train.csv', sep=';')

# Drop "artificial" feature id
df.drop(columns=['id'], inplace=True)

# Show the first 5 rows
df.head()

### Generate Training and Test Data

In [None]:
# Convert data to numpy arrays
X = df[['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc', 'smoke', 'alco', 'active']].to_numpy()
y = df[['cardio']].to_numpy().squeeze()

# Split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Size of training set: {}".format(len(X_train)))
print("Size of test: {}".format(len(X_test)))

## Basic Decision Tree Classifier

For comparison, we train an individual Decision Tree for different values of `max_depth`; this is the same as in the previous notebook, just with different data.

In [None]:
%%time

max_depth = 20

# Keep track of depth and f1 scores for plotting
ds, f1s = [], []

# Loop over all values for max_depth
for d in range(1, max_depth+1):
    ds.append(d)
    # Train Decision Tree classifier for current value of max_depth
    clf = DecisionTreeClassifier(max_depth=d, criterion='gini', random_state=10).fit(X_train, y_train)
    # Predict class labels for test set
    y_pred = clf.predict(X_test)
    # Calculate f1 score between predictions and ground truth
    f1 = f1_score(y_test, y_pred)
    f1s.append(f1)
    
print('A maximum depth of {} yields the best f1 score of {:.3f}'.format(ds[np.argmax(f1s)], np.max(f1s), ))        
    
# Plot the results (max_depth vs. f1.score)
plt.figure()
plt.plot(ds, f1s)
plt.show()

## Bagging Classifier

We introduced Bagging as a simple way to train multiple models on different datasets, where each dataset is a random sample (with replacement) of the original dataset and has the same size. scikit-learn's [BaggingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) implements this idea. As Bagging is a general concept and not limited to Decision Trees, `BaggingClassifier` gets as input a base estimator, which is a Decision Tree in our case. (The corresponding parameter was called `base_estimator` and was renamed to `estimator` in scikit-learn 1.2.)
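
To make the sampling step concrete, here is a minimal sketch of drawing bootstrap samples with the standard library only. The helper `bootstrap_sample` and the toy `data` list are assumptions for illustration; `BaggingClassifier` handles this internally.

```python
import random

def bootstrap_sample(items, rng):
    """Draw len(items) items uniformly at random, with replacement."""
    return rng.choices(items, k=len(items))

rng = random.Random(0)      # fixed seed so the sketch is repeatable
data = list(range(10))      # stand-in for the row indices of a dataset

samples = [bootstrap_sample(data, rng) for _ in range(3)]
for s in samples:
    # Each sample has the original size; some rows repeat, others are left out
    print(sorted(s), "distinct rows:", len(set(s)))
```

For large datasets, roughly 63% of the distinct rows (1 - 1/e) end up in each bootstrap sample, which is why the individual models see noticeably different training data.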

Note that we now have 2 parameters: 
 * `max_depth` of Decision Tree base estimator
 * `n_estimators` as the number of models

(well, there are more parameters but we just focus on these 2 here)

Since we now have 2 parameters to tune, we can implement this as a nested loop to go over all combinations.

In [None]:
%%time

max_depth = 20

ds, ns, f1s = [], [], []

# Loop over all values for max_depth
for d in range(1, max_depth+1):
    for n in [10, 25, 50, 100]:
        ds.append(d)
        ns.append(n)
        
        # Decision Tree base estimator for the current value of max_depth
        base_estimator = DecisionTreeClassifier(max_depth=d, random_state=10)

        # Pass the base estimator positionally: the keyword was renamed from
        # `base_estimator` to `estimator` in scikit-learn 1.2
        clf = BaggingClassifier(base_estimator, n_estimators=n, max_features=1.0).fit(X_train, y_train)
        # Predict class labels for test set
        y_pred = clf.predict(X_test)
        # Calculate f1 score between predictions and ground truth
        f1 = f1_score(y_test, y_pred)
        f1s.append(f1)


We can visualize the result, i.e., the f1 scores for each parameter combination using a 3d plot.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel(r'max_depth', fontsize=16)
ax.set_ylabel(r'n_estimators', fontsize=16)
ax.set_zlabel('f1 score', fontsize=16)

surf = ax.plot_trisurf(ds, ns, f1s, cmap=plt.cm.coolwarm, linewidth=0, antialiased=False)
plt.tight_layout()
plt.show()

We can also extract the best f1 score and the parameter combination that resulted in the score.

In [None]:
f1_max = np.max(f1s)

print('The highest f1 score across all runs: {:.3f}'.format(f1_max))

In [None]:
# Convert the list to an array so the comparison is explicitly element-wise
best_runs = np.where(np.asarray(f1s) == f1_max)[0]

print('The following runs resulted in the highest f1 score of {:.3f}'.format(f1_max))
for run in best_runs:
    print('* max_depth = {}, n_estimators = {}'.format(ds[run], ns[run]))

## Random Forest Classifier

A [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) goes further than Bagging: the data is sampled not only with respect to the data items but also with respect to the features. In scikit-learn, each tree is trained on a bootstrap sample, and only a random subset of the features is considered at each split. The code for the evaluation -- considering that we again only look at the 2 parameters `max_depth` and `n_estimators` -- is essentially the same.
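
As a rough illustration of the feature sampling, here is a minimal sketch that draws a random subset of about the square root of the number of features. The helper `candidate_features` is hypothetical; the square-root rule is an assumption that mirrors a common default, and scikit-learn controls this via the `max_features` parameter.

```python
import math
import random

feature_names = ['age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
                 'cholesterol', 'gluc', 'smoke', 'alco', 'active']

def candidate_features(names, rng):
    """Pick about sqrt(len(names)) features at random, without replacement."""
    k = max(1, int(math.sqrt(len(names))))
    return rng.sample(names, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# A fresh subset is drawn each time a tree node searches for its best split
for split in range(3):
    print(candidate_features(feature_names, rng))
```

With the 11 features used here, each split would only compare 3 candidate features, which decorrelates the individual trees.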

In [None]:
%%time

max_depth = 20

ds, ns, f1s = [], [], []

# Loop over all values for max_depth
for d in range(1, max_depth+1):
    for n in [10, 25, 50, 100]:
        ds.append(d)
        ns.append(n)
        # Train Random Forest classifier for current values of max_depth and n_estimators
        clf = RandomForestClassifier(max_depth=d, criterion='gini', n_estimators=n).fit(X_train, y_train)
        # Predict class labels for test set
        y_pred = clf.predict(X_test)
        # Calculate f1 score between predictions and ground truth
        f1 = f1_score(y_test, y_pred)
        f1s.append(f1)


And we can plot the scores for the different parameter combinations again...

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel(r'max_depth', fontsize=16)
ax.set_ylabel(r'n_estimators', fontsize=16)
ax.set_zlabel('f1 score', fontsize=16)

surf = ax.plot_trisurf(ds, ns, f1s, cmap=plt.cm.coolwarm, linewidth=0, antialiased=False)
plt.tight_layout()
plt.show()

...as well as extracting the best score and respective parameter values.

In [None]:
f1_max = np.max(f1s)

print('The highest f1 score across all runs: {:.3f}'.format(f1_max))

In [None]:
# Convert the list to an array so the comparison is explicitly element-wise
best_runs = np.where(np.asarray(f1s) == f1_max)[0]

print('The following runs resulted in the highest f1 score of {:.3f}'.format(f1_max))
for run in best_runs:
    print('* max_depth = {}, n_estimators = {}'.format(ds[run], ns[run]))

## AdaBoost Classifier

Similar to Bagging, AdaBoost is a general concept and not limited to Decision Trees. The basic idea of AdaBoost is to train a series of classifiers, where each classifier tries to correct the errors of the previous ones. For this, scikit-learn provides its [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier). Similar to `BaggingClassifier`, we have to specify a Decision Tree as the base estimator (parameter `base_estimator`, renamed to `estimator` in scikit-learn 1.2).
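
To illustrate how "correcting the errors of the previous classifier" works, here is a minimal sketch of one reweighting round of textbook discrete AdaBoost for labels in {-1, +1}. The function `adaboost_round` and the toy labels are assumptions for illustration; this is not necessarily the exact variant that `AdaBoostClassifier` implements.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting step of discrete AdaBoost for labels in {-1, +1}.

    Returns the weak learner's vote weight (alpha) and the new sample weights.
    """
    total = sum(weights)
    # Weighted error rate of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    # Assumes 0 < err < 1; a real implementation has to handle the edge cases
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones scaled down
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new_w)
    return alpha, [w / norm for w in new_w]

y_true_toy = [+1, +1, -1, -1, +1]
y_pred_toy = [+1, -1, -1, -1, +1]   # the weak learner gets the second sample wrong
weights = [1 / 5] * 5               # start with uniform weights

alpha, new_weights = adaboost_round(weights, y_true_toy, y_pred_toy)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.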

The code for trying different parameter combinations should look familiar by now.

In [None]:
%%time

max_depth = 10

ds, ns, f1s = [], [], []

# Loop over all values for max_depth
for d in range(1, max_depth+1):
    for n in [10, 25, 50, 100]:
        ds.append(d)
        ns.append(n)
        # Decision Tree base estimator for the current value of max_depth
        base_estimator = DecisionTreeClassifier(max_depth=d, random_state=100)

        # Pass the base estimator positionally: the keyword was renamed from
        # `base_estimator` to `estimator` in scikit-learn 1.2
        clf = AdaBoostClassifier(base_estimator, n_estimators=n).fit(X_train, y_train)
        # Predict class labels for test set
        y_pred = clf.predict(X_test)
        # Calculate f1 score between predictions and ground truth
        f1 = f1_score(y_test, y_pred)
        f1s.append(f1)

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel(r'max_depth', fontsize=16)
ax.set_ylabel(r'n_estimators', fontsize=16)
ax.set_zlabel('f1 score', fontsize=16)

surf = ax.plot_trisurf(ds, ns, f1s, cmap=plt.cm.coolwarm, linewidth=0, antialiased=False)
plt.tight_layout()
plt.show()

In [None]:
f1_max = np.max(f1s)

print('The highest f1 score across all runs: {:.3f}'.format(f1_max))

In [None]:
# Convert the list to an array so the comparison is explicitly element-wise
best_runs = np.where(np.asarray(f1s) == f1_max)[0]

print('The following runs resulted in the highest f1 score of {:.3f}'.format(f1_max))
for run in best_runs:
    print('* max_depth = {}, n_estimators = {}'.format(ds[run], ns[run]))

## Gradient Boosting Classifier

Lastly, we can look at Gradient Boosting using the [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html), again experimenting with different values for `max_depth` and `n_estimators`.
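
The core idea of Gradient Boosting can be sketched in a few lines. The example below is a simplification under stated assumptions: it uses squared loss on a toy 1-D regression problem with one-split stumps, so each new stump is fit to the current residuals. The classifier applies the same idea to a classification loss. All names (`fit_stump`, `gradient_boost`, the toy data) are made up for illustration.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump on 1-D inputs by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_estimators=20, learning_rate=0.5):
    """Boosting for squared loss: each stump is fit to the current residuals."""
    init = sum(ys) / len(ys)                 # start from the mean prediction
    preds = [init] * len(xs)
    stumps, losses = [], []
    for _ in range(n_estimators):
        # Residuals are the negative gradient of 0.5 * (y - p)^2 w.r.t. p
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return init, stumps, losses

# Toy data (hypothetical): a noisy step function
xs_toy = [1, 2, 3, 4, 5, 6, 7, 8]
ys_toy = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

init, stumps, losses = gradient_boost(xs_toy, ys_toy)
print(losses[0], losses[-1])   # training error after the first and the last round
```

The training error shrinks as stumps are added, which is also why `n_estimators` and `max_depth` interact: deeper trees need fewer rounds but overfit more easily.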

In [None]:
%%time

max_depth = 10

ds, ns, f1s = [], [], []

# Loop over all values for max_depth
for d in range(1, max_depth+1):
    for n in [10, 25, 50, 100]:
        ds.append(d)
        ns.append(n)    
        # Train Gradient Boosting classifier for current values of max_depth and n_estimators
        clf = GradientBoostingClassifier(max_depth=d, n_estimators=n).fit(X_train, y_train)
        # Predict class labels for test set
        y_pred = clf.predict(X_test)
        # Calculate f1 score between predictions and ground truth
        f1 = f1_score(y_test, y_pred)
        f1s.append(f1)


In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel(r'max_depth', fontsize=16)
ax.set_ylabel(r'n_estimators', fontsize=16)
ax.set_zlabel('f1 score', fontsize=16)

surf = ax.plot_trisurf(ds, ns, f1s, cmap=plt.cm.coolwarm, linewidth=0, antialiased=False)
plt.tight_layout()
plt.show()

In [None]:
f1_max = np.max(f1s)

print('The highest f1 score across all runs: {:.3f}'.format(f1_max))

In [None]:
# Convert the list to an array so the comparison is explicitly element-wise
best_runs = np.where(np.asarray(f1s) == f1_max)[0]

print('The following runs resulted in the highest f1 score of {:.4f}'.format(f1_max))
for run in best_runs:
    print('* max_depth = {}, n_estimators = {}'.format(ds[run], ns[run]))

## Summary

On this dataset and with the parameter values given, all classifiers show very comparable results. So (single) Decision Trees are not bad per se :). Of course, tree ensemble methods offer additional hyperparameter settings worth tuning. Also, this dataset is not overly large and does not have a lot of features. Both of these can help a Decision Tree keep up with the ensemble methods.

We can certainly see that ensemble methods take much more time to evaluate. Firstly, each training run takes more time as multiple models are built, and secondly, ensemble methods offer more hyperparameters that can potentially affect the results, so more combinations have to be tried.