# Random forests and gradient boosting

AI Black Belt - Yellow (June 2019).

---

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd 

In [None]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()

In [None]:
print(data.DESCR)

In [None]:
X, y = data.data, data.target

In [None]:
from sklearn.model_selection import train_test_split
# By default, 25% of the samples are held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Decision trees

In [None]:
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

In [None]:
# For regressors, score() returns the coefficient of determination R^2
print(reg.score(X_train, y_train))
print(reg.score(X_test, y_test))

<div class="alert alert-success">
    <b>EXERCISE</b>:

Why is the training score so good?
</div>

Decision trees are among the few models that can be interpreted directly:

In [None]:
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(15, 10))
plot_tree(reg, max_depth=5, feature_names=data.feature_names, ax=ax, fontsize=8)
plt.show()

By default, a decision tree is grown until all leaves are pure. This typically leads to overfitting and to trees that are difficult to interpret.

Instead, the structure can be regularized through hyper-parameters:

In [None]:
# max_depth: limit the depth of the tree
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(15, 10))
plot_tree(reg, max_depth=5, feature_names=data.feature_names, ax=ax, fontsize=8)
plt.show()

In [None]:
# min_samples_split: a node is split only if it holds at least this many samples
reg = DecisionTreeRegressor(min_samples_split=1000)
reg.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(15, 15))
plot_tree(reg, max_depth=10, feature_names=data.feature_names, ax=ax, fontsize=8)
plt.show()

In [None]:
# max_leaf_nodes: grow the tree best-first until it has at most this many leaves
reg = DecisionTreeRegressor(max_leaf_nodes=20)
reg.fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(15, 15))
plot_tree(reg, feature_names=data.feature_names, ax=ax, fontsize=8)
plt.show()

In [None]:
# Impurity-based feature importances of the last tree fitted above
imp = pd.DataFrame({"importances": reg.feature_importances_}, index=data.feature_names)
imp = imp.sort_values(by=["importances"], ascending=False)
imp
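One way to check how these hyper-parameters affect overfitting is to compare the size of each tree with its train and test scores. The sketch below is only illustrative. It runs on synthetic data from `make_friedman1` so that it stays self-contained, and the names `X_demo`, `candidates` and `summary`, as well as the specific settings, are assumptions rather than part of the notebook. The same loop should work with `X_train`, `y_train`, `X_test`, `y_test` from above.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); swap in the housing split if preferred.
X_demo, y_demo = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

# Hypothetical settings, chosen only to illustrate each constraint.
candidates = {
    "unconstrained": DecisionTreeRegressor(random_state=0),
    "max_depth=3": DecisionTreeRegressor(max_depth=3, random_state=0),
    "min_samples_split=100": DecisionTreeRegressor(min_samples_split=100, random_state=0),
    "max_leaf_nodes=20": DecisionTreeRegressor(max_leaf_nodes=20, random_state=0),
}

summary = {}
for name, tree in candidates.items():
    tree.fit(Xd_train, yd_train)
    summary[name] = {
        "depth": tree.get_depth(),
        "leaves": tree.get_n_leaves(),
        "train_r2": tree.score(Xd_train, yd_train),
        "test_r2": tree.score(Xd_test, yd_test),
    }
    print(name, summary[name])
```

A large gap between `train_r2` and `test_r2` is the usual sign of overfitting. The constrained trees are expected to narrow that gap, although by how much depends on the data.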

<div class="alert alert-success">
    <b>EXERCISE</b>:

Grid search for the best value of <code>max_depth</code>.
</div>
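As a starting point for this exercise, a cross-validated search over `max_depth` could look like the sketch below. It is one possible approach, not a reference solution. The data comes from `make_friedman1` to keep the block self-contained, and the candidate depths, the name `depth_search` and the choice of 5 folds are assumptions.

```python
from sklearn.datasets import make_friedman1
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); replace with X_train, y_train.
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# Hypothetical candidate depths; None means "grow without a depth limit".
depth_grid = {"max_depth": [1, 2, 3, 5, 8, 12, None]}

depth_search = GridSearchCV(DecisionTreeRegressor(random_state=0), depth_grid, cv=5)
depth_search.fit(X_demo, y_demo)

print("best max_depth:", depth_search.best_params_["max_depth"])
print("mean CV R^2:   ", depth_search.best_score_)
```

Because `GridSearchCV` refits the best model on all the data it was given, `depth_search.score(X_test, y_test)` can then be used for a final check on held-out data.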

## Random forests

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=50)
forest.fit(X_train, y_train)
forest.score(X_test, y_test)

In [None]:
imp = pd.DataFrame({"importances": forest.feature_importances_}, index=data.feature_names)
imp = imp.sort_values(by=["importances"], ascending=False)
imp

The two most important parameters are `n_estimators` and `max_features`.

In [None]:
# n_estimators: more trees reduce variance, with diminishing returns and a higher training cost
forest = RandomForestRegressor(n_estimators=250)
forest.fit(X_train, y_train)
forest.score(X_test, y_test)

In [None]:
# max_features: number of features considered when looking for the best split
forest = RandomForestRegressor(n_estimators=50, max_features=1)
forest.fit(X_train, y_train)
forest.score(X_test, y_test)

<div class="alert alert-success">
    <b>EXERCISE</b>:

Evaluate the train and test error as a function of the number of trees in the forest.
</div>
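One possible way to approach this exercise is to grow a single forest incrementally with `warm_start=True` and record the error after each increase. The sketch below assumes synthetic data from `make_friedman1`, mean squared error as the error measure, and a hypothetical list `tree_counts`. None of these choices come from the notebook.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the housing split.
X_demo, y_demo = make_friedman1(n_samples=600, noise=1.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0
)

tree_counts = [1, 5, 10, 25, 50, 100]  # hypothetical forest sizes
train_mse, test_mse = [], []

# warm_start=True keeps the trees already fitted and only adds the missing ones.
forest_demo = RandomForestRegressor(warm_start=True, random_state=0)
for n in tree_counts:
    forest_demo.set_params(n_estimators=n)
    forest_demo.fit(Xd_train, yd_train)
    train_mse.append(mean_squared_error(yd_train, forest_demo.predict(Xd_train)))
    test_mse.append(mean_squared_error(yd_test, forest_demo.predict(Xd_test)))

for n, tr, te in zip(tree_counts, train_mse, test_mse):
    print(f"{n:4d} trees  train MSE={tr:.3f}  test MSE={te:.3f}")
```

The two lists can be passed to `plt.plot(tree_counts, ...)` to draw the curves. The test error typically flattens out as trees are added rather than rising again.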

<div class="alert alert-success">
    <b>EXERCISE</b>:

Grid search for the best value of <code>max_features</code>. Have feature importances changed?
</div>

<div class="alert alert-success">
    <b>EXERCISE</b> (optional):

Evaluate the performance of <code>ExtraTreesRegressor</code>.
</div>
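A quick comparison of `ExtraTreesRegressor` against a random forest can be set up with cross-validation, as in the sketch below. The synthetic data, the 50 trees per ensemble and the names `models` and `cv_scores` are assumptions made for illustration. Results on the housing data may differ.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption); replace with X, y from the notebook.
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)

models = {
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 folds for each ensemble.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: mean CV R^2 = {score:.3f}")
```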

## Gradient boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
boosting = GradientBoostingRegressor()
boosting.fit(X_train, y_train)
boosting.score(X_test, y_test)

<div class="alert alert-success">
    <b>EXERCISE</b>:

Grid search simultaneously for the best value of <code>n_estimators</code> and the best value of <code>learning_rate</code>.
</div>
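Since `n_estimators` and `learning_rate` interact (a smaller learning rate usually needs more trees), it can help to search over both at once and view the scores as a table. The following is a minimal sketch under assumed settings: synthetic data from `make_friedman1`, a small hypothetical grid, 3-fold cross-validation, and the names `boost_search` and `table`.

```python
import pandas as pd
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with X_train, y_train.
X_demo, y_demo = make_friedman1(n_samples=400, noise=1.0, random_state=0)

# Hypothetical grid; every combination of the two parameters is evaluated.
boost_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5],
}

boost_search = GridSearchCV(GradientBoostingRegressor(random_state=0), boost_grid, cv=3)
boost_search.fit(X_demo, y_demo)
print("best parameters:", boost_search.best_params_)

# Arrange the mean CV scores with one row per learning rate
# and one column per number of trees.
rows = [
    dict(params, mean_test_score=score)
    for params, score in zip(
        boost_search.cv_results_["params"],
        boost_search.cv_results_["mean_test_score"],
    )
]
table = pd.DataFrame(rows).pivot(
    index="learning_rate", columns="n_estimators", values="mean_test_score"
)
print(table)
```

Reading the table row by row shows how many trees each learning rate needs before the score levels off.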