# Decision Tree and Random Forest Models

Based on [DecisionTrees.ipynb from the LMU block course](https://github.com/fuenfundachtzig/LMU_DA_ML_Basic/blob/main/notebooks/DecisionTrees.ipynb)

### Decision Trees

Decision trees are a further important model category. The basic principle is easy to understand:  
a hierarchical series of **if/else questions**.

*Example:* Game where you need to distinguish four kinds of animals:  
* *Bear, Dolphin, Penguin, Hawk*

Goal is to use as few questions as possible.

One possible solution:

![](figures/DT_animals.png)
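To make the "series of if/else questions" concrete, here is a minimal hand-written sketch of such a tree. The three questions (`has_feathers`, `can_fly`, `has_fins`) are assumptions chosen for illustration; the figure above may use different ones.

```python
def classify_animal(has_feathers: bool, can_fly: bool, has_fins: bool) -> str:
    """Walk a hand-written decision tree; each `if` is one question."""
    if has_feathers:
        # Birds: one more question separates hawk and penguin.
        return "Hawk" if can_fly else "Penguin"
    # No feathers: one more question separates dolphin and bear.
    return "Dolphin" if has_fins else "Bear"


# Hypothetical feature table for the four animals of the game.
animals = {
    "Bear": dict(has_feathers=False, can_fly=False, has_fins=False),
    "Dolphin": dict(has_feathers=False, can_fly=False, has_fins=True),
    "Penguin": dict(has_feathers=True, can_fly=False, has_fins=False),
    "Hawk": dict(has_feathers=True, can_fly=True, has_fins=False),
}

for name, features in animals.items():
    print(name, "->", classify_animal(**features))
```

Every animal is identified after exactly two questions, which is the minimum for four classes with yes/no questions.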

#### Simple example 
Illustrate DT with half-moon data, a simple dataset with half-moon shaped data distributions:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

In [None]:
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

In [None]:
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

**Try previous models first**

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y).score(X, y)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X, y).score(X, y)

**Now the decision tree**

In [None]:
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
tree.score(X, y)

In [None]:
def visualize_classifier(predict, xmin, xmax, ymin, ymax, **kwargs):
    xx, yy = np.meshgrid(
        np.linspace(xmin, xmax, 100),
        np.linspace(ymin, ymax, 100),
    )
    X = np.stack([xx, yy], axis=-1).reshape(-1, 2)
    zz = predict(X).reshape(xx.shape)
    plt.pcolormesh(xx, yy, zz, **kwargs)

Visualization with different depths:

In [None]:
max_depth = 1
model = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)
print(f"{max_depth=}, {model.score(X, y)=}")
visualize_classifier(model.predict, -1.5, 2.5, -1.5, 2, cmap="RdBu")
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

For high depth, the tree clearly goes into over-training: it follows individual noisy points instead of the overall shape of the two classes.
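One way to see the over-training in numbers is to hold out part of the half-moon data and compare training and test accuracy for several depths. The following is a minimal sketch; the split, the list of depths, and the names `X_demo`, `y_demo`, and `scores` are assumptions chosen so that the notebook variables `X` and `y` are left untouched.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same kind of data as above, but with a held-out test part.
X_demo, y_demo = make_moons(n_samples=100, noise=0.25, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for depth in (1, 2, 3, 5, None):  # None means: no depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, test={scores[depth][1]:.2f}")
```

The unlimited tree separates the training points perfectly, while the test accuracy typically stops improving, or gets worse, well before that.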

## Decision tree example with real data

A frequently used ML data set is the one for *breast cancer diagnosis*.

In [None]:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

print(cancer.feature_names)
print(cancer.DESCR)

In [None]:
# apply decision-tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=42
)
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

Without limiting the depth, the DT is grown until it reaches perfect accuracy on the training set.

But this is not really useful &rarr; over-training

Better approach:

In [None]:
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))

Note that the performance on the test set has improved by introducing a maximum depth of the tree. (The fact that we no longer get perfect classification on the training sample is not relevant.)
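A depth-limited tree is also small enough to read. As a sketch, the size of both trees can be compared with `get_depth()` and `get_n_leaves()`, and the rules of the small tree printed with `sklearn.tree.export_text`. The names `data`, `full_tree`, `small_tree`, and `rules` are chosen here for illustration; the split and settings repeat the cells above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=42)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Compare the size of the unrestricted and the depth-limited tree.
print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("small tree: depth", small_tree.get_depth(), "leaves", small_tree.get_n_leaves())

# Print the learned if/else questions of the small tree as text.
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

Each line of the printed rules is one threshold question on a single feature, which is the same structure as in the animal example.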

### Random Forests

Decision trees are potentially very powerful models, but they are also very sensitive to overtraining (overfitting); therefore they are normally not used directly in practice.

However, one can mitigate or solve this problem by using an ensemble of decision trees and not just a single DT.  
The main trick is randomization:
* train many DTs, but
    * each DT sees a different part of the data
    * or a different set of features

This approach is called **Random Forest**:  
Many randomized trees contribute and the final decision is made by some sort of majority voting.
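The idea can be sketched by hand with plain decision trees. This is an illustration under simplifying assumptions (25 trees, bootstrap resampling, hard majority vote), not how `RandomForestClassifier` is implemented; scikit-learn's forest averages the predicted class probabilities of its trees instead of counting votes. All variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=200, noise=0.25, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):  # odd number of trees, so a vote on two classes cannot tie
    # Bootstrap sample: draw n rows with replacement, so each tree sees different data.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt": each split only considers a random subset of the features.
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(t.fit(X_demo[idx], y_demo[idx]))

# Majority vote over the 0/1 predictions of all trees.
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy on the training data:", (majority == y_demo).mean())
```

Because every tree is trained on a different resampling, their individual over-training patterns differ and partly cancel in the vote.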

Test with half moon data:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y
)

In [None]:
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

In [None]:
visualize_classifier(lambda X: forest.predict_proba(X)[:, 1], -1.5, 2.5, -1.5, 2.5, cmap="RdBu")
plt.scatter(*X[y==0].T, color="red")
plt.scatter(*X[y==1].T, color="blue")

#### Random Forest for Cancer Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))

Out of the box, the random forest already gives better accuracy on the test set.

## Feature importance

A very useful additional result of DT classification is the *feature importance*.
This gives each feature a rating between 0 and 1 for how important it is for the classification; in scikit-learn the ratings of all features sum to 1:
* 0 means the feature is not used at all
* 1 means the classification relies on this feature alone
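In scikit-learn, the impurity-based `feature_importances_` of tree models are normalized so that they sum to 1. A quick self-contained check, with a forest trained on the full cancer data set for simplicity (the names `data`, `rf`, `importances`, and `top` are chosen for this sketch):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

importances = rf.feature_importances_
print("sum of all importances:", importances.sum())

# Indices of the five most important features, largest first.
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:25s} {importances[i]:.3f}")
```

Since the values share a total of 1, a rating has to be read relative to the other features, not as an absolute measure of separation power.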

In [None]:
forest.feature_importances_

In [None]:
pd.Series(forest.feature_importances_, index=cancer.feature_names).sort_values().plot(kind="barh")

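Other tree ensembles provide feature importances as well, for example gradient-boosted decision trees (BDT):

In [None]:
from sklearn.ensemble import GradientBoostingClassifier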

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0
)
bdt = GradientBoostingClassifier(n_estimators=1000, max_depth=3, learning_rate=0.01, verbose=True, subsample=0.5, max_features="sqrt")
bdt.fit(X_train, y_train)

bdt.score(X_train, y_train), bdt.score(X_test, y_test)

In [None]:
pd.Series(bdt.feature_importances_, index=cancer.feature_names).sort_values().plot(kind="barh")

## Further reading
There is a nice interactive tool that helps to understand how decision trees work:

[![Screenshot](figures/screenshot_BDT_playground.png)](https://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html)

This tool also lets you try rotated decision trees, originally proposed in [2006](https://ieeexplore.ieee.org/document/1677518). You can read more about them, e.g., [here](https://jmlr.csail.mit.edu/papers/volume17/blaser16a/blaser16a.pdf).