# In Depth - Decision Trees and Forests

In [1]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt

Here we'll explore a class of algorithms based on decision trees.
Decision trees at their root are extremely intuitive.  They
encode a series of "if" and "else" choices, similar to how a person might make a decision.
However, which questions to ask and how to proceed for each answer are learned entirely from the data.

For example, if you wanted to create a guide to identifying an animal found in nature, you
might ask the following series of questions:

- Is the animal bigger or smaller than a meter long?
    + *bigger*: does the animal have horns?
        - *yes*: are the horns longer than ten centimeters?
        - *no*: is the animal wearing a collar?
    + *smaller*: does the animal have two or four legs?
        - *two*: does the animal have wings?
        - *four*: does the animal have a bushy tail?

and so on.  This binary splitting of questions is the essence of a decision tree.
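
The animal guide above can be written down directly as nested `if`/`else` statements. The sketch below is hand-written for illustration only; the function name `next_question` and its arguments are hypothetical, and a real decision tree would learn both the questions and the thresholds from data.

```python
def next_question(length_m, has_horns, n_legs):
    """Return the follow-up question from the animal guide (hand-written rules)."""
    if length_m > 1.0:                      # root split: bigger than a meter?
        if has_horns:                       # second-level split for big animals
            return "Are the horns longer than ten centimeters?"
        else:
            return "Is the animal wearing a collar?"
    else:
        if n_legs == 2:                     # second-level split for small animals
            return "Does the animal have wings?"
        else:
            return "Does the animal have a bushy tail?"


print(next_question(length_m=2.0, has_horns=True, n_legs=4))
print(next_question(length_m=0.3, has_horns=False, n_legs=2))
```

Each call follows exactly one path from the root to a leaf, which is also how a fitted tree makes a prediction.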

One of the main benefits of tree-based models is that they require little preprocessing of the data.
They can work with variables of different types (continuous and discrete) and are invariant to scaling of the features.
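
As a rough check of the scaling claim, the sketch below fits the same tree on raw and on rescaled features and compares the predictions. The synthetic data and the scale factors are assumptions made for this example (powers of two are used so the rescaling is exact in floating point).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature data (an assumption for this illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stretch each feature by a different positive constant.
scale = np.array([1024.0, 8.0])
X_scaled = X * scale

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Splits depend only on the ordering of values within a feature,
# so both trees should partition the samples in the same way.
same_predictions = np.array_equal(tree_raw.predict(X),
                                  tree_scaled.predict(X_scaled))
print("Identical predictions after rescaling:", same_predictions)
```

A distance-based model such as nearest neighbors would generally not pass this check without standardizing the features first.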

Another benefit is that tree-based models are "nonparametric", which means they don't have a fixed set of parameters to learn. Instead, a tree model can become more and more flexible when given more data.
In other words, the number of free parameters grows with the number of samples and is not fixed, as it is, for example, in linear models.
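
One way to see this growth is to count the nodes of an unrestricted tree as the training set gets larger. This is only a sketch on synthetic noisy data; the helper `fitted_node_count` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)


def fitted_node_count(n_samples):
    """Fit a tree without depth limit on noisy 1-D data and count its nodes."""
    X = rng.uniform(-3, 3, size=(n_samples, 1))
    y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=n_samples)
    tree = DecisionTreeRegressor(random_state=0).fit(X, y)
    return tree.tree_.node_count


node_counts = {n: fitted_node_count(n) for n in (50, 500, 5000)}
for n, count in node_counts.items():
    print(f"{n:5d} samples: {count:5d} nodes")
```

A linear model on the same data would have two parameters (slope and intercept) regardless of how many samples are used.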


## Decision Tree Regression

A decision tree for regression is a simple binary tree. Similar to nearest neighbor methods,
it predicts locally: each leaf predicts the mean target value of the training points that fall into it.  It can be used as follows:

In [2]:
from figures import make_dataset
x, y = make_dataset()
X = x.reshape(-1, 1)

plt.figure()
plt.xlabel('Feature X')
plt.ylabel('Target y')
plt.scatter(X, y);


In [3]:
from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor(max_depth=5)
reg.fit(X, y)

X_fit = np.linspace(-3, 3, 1000).reshape((-1, 1))
y_fit_1 = reg.predict(X_fit)

plt.figure()
plt.plot(X_fit.ravel(), y_fit_1, color='blue', label="prediction")
plt.plot(X.ravel(), y, '.k', label="training data")
plt.legend(loc="best");


A single decision tree allows us to estimate the signal in a non-parametric way,
but clearly has some issues.  In some regions, the model shows high bias and
under-fits the data
(seen in the long flat lines which don't follow the contours of the data),
while in other regions the model shows high variance and over-fits the data
(reflected in the narrow spikes which are influenced by noise in single points).
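
The flat segments and the spikes both follow from the fact that a regression tree predicts a piecewise-constant function. The sketch below counts the distinct predicted values for a few depths; the noisy curve is a synthetic stand-in (an assumption), not the output of `make_dataset`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy wavy curve (assumption for this sketch).
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(4 * X.ravel()) + X.ravel() + rng.normal(scale=0.3, size=100)

X_grid = np.linspace(-3, 3, 1000).reshape(-1, 1)

results = {}
for depth in (2, 5, None):
    reg = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    # A tree of depth d has at most 2**d leaves, hence at most 2**d flat levels.
    n_levels = len(np.unique(reg.predict(X_grid)))
    results[depth] = (n_levels, reg.score(X, y))
    print(f"max_depth={depth}: {n_levels} distinct predicted values, "
          f"training R^2 = {results[depth][1]:.3f}")
```

A shallow tree has only a few levels available (high bias), while an unrestricted tree can dedicate a level to almost every training point (high variance).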

## Decision Tree Classification

Decision tree classification works very similarly, by assigning all points within a leaf the majority class in that leaf:


In [4]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from figures import plot_2d_separator


X, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

plt.figure()
plot_2d_separator(clf, X, fill=True)
plt.scatter(X_train[:, 0], X_train[:, 1], c=np.array(['b', 'r'])[y_train], s=60, alpha=.7, edgecolor='k')
plt.scatter(X_test[:, 0], X_test[:, 1], c=np.array(['b', 'r'])[y_test], s=60, edgecolor='k');
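
The majority-class rule can be checked by hand. The sketch below reuses the blob settings from the cell above, but fits a shallower tree on all points (both choices are assumptions made to keep the output short), and uses `apply` to find the leaf each sample lands in.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=100)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of the leaf it falls into.
leaf_ids = clf.apply(X)

majority_matches = True
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    class_counts = np.bincount(y[in_leaf], minlength=2)
    majority = class_counts.argmax()
    # Every training point in this leaf should receive the majority class.
    leaf_ok = bool(np.all(clf.predict(X[in_leaf]) == majority))
    majority_matches = majority_matches and leaf_ok
    print(f"leaf {leaf}: class counts {class_counts}, predicted class {majority}")

print("Predictions equal the per-leaf majority class:", majority_matches)
```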


There are many parameters that control the complexity of a tree, but the one that might be easiest to understand is the maximum depth. This limits how finely the tree can partition the input space, or how many "if-else" questions can be asked before deciding which class a sample lies in.
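
To make the effect of the depth limit concrete, the sketch below fits trees of increasing `max_depth` and reports how many leaves each one ends up with. It uses the same blob settings as above but trains on all points, which is an assumption made for brevity.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=100)

leaf_counts = {}
for depth in (1, 2, 4, None):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    leaf_counts[depth] = clf.get_n_leaves()
    # With max_depth=d the tree can ask at most d questions per sample,
    # so it has at most 2**d leaves (regions of the input space).
    print(f"max_depth={depth}: depth reached = {clf.get_depth()}, "
          f"leaves = {clf.get_n_leaves()}, "
          f"training accuracy = {clf.score(X, y):.2f}")

full_tree_train_accuracy = clf.score(X, y)  # clf is the unrestricted tree here
```

With `max_depth=None` the tree keeps splitting until every leaf contains a single class, so the training accuracy alone says little about how well the model generalizes.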

This parameter is important to tune for trees and tree-based models. The interactive plot below shows what underfitting and overfitting look like for this model. A ``max_depth`` of 1 clearly underfits, while a depth of 7 or 8 clearly overfits. The maximum depth to which a tree can be grown for this dataset is 8, at which point each leaf only contains samples from a single class. This is known as all leaves being "pure."

In the interactive plot below, the regions are assigned blue and red colors to indicate the predicted class for that region. The shade of the color indicates the predicted probability for that class (darker = higher probability), while yellow regions indicate an equal predicted probability for either class.

In [1]:
from figures import plot_tree_interactive
%matplotlib notebook
plot_tree_interactive()

Decision trees are fast to train, easy to understand, and often lead to interpretable models. However, single trees tend to overfit the training data. Playing with the slider above, you might notice that the model starts to overfit even before it achieves a good separation between the classes.

Therefore, in practice it is more common to combine multiple trees to produce models that generalize better. The most common methods for combining trees are random forests and gradient boosted trees.


## Random Forests

Random forests are simply many trees, built on different random subsets (drawn with replacement) of the data, and using different random subsets (drawn without replacement) of the features for each split.
This makes the trees different from each other, and makes them overfit to different aspects. Then, their predictions are averaged, leading to a smoother estimate that overfits less.
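
The averaging step can be made explicit. The following sketch fits a small forest (the blob data, `n_estimators=25` and `max_features=1` are assumptions chosen for illustration) and compares the mean of the individual trees' class probabilities with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(centers=[[0, 0], [1, 1]], random_state=61526, n_samples=100)

# max_features=1: each split considers one randomly chosen feature.
forest = RandomForestClassifier(n_estimators=25, max_features=1, random_state=0)
forest.fit(X, y)

# The fitted trees are stored in forest.estimators_.
per_tree_proba = np.array([tree.predict_proba(X) for tree in forest.estimators_])
averaged_proba = per_tree_proba.mean(axis=0)

matches_forest = np.allclose(averaged_proba, forest.predict_proba(X))
print("Mean of tree probabilities equals forest output:", matches_forest)
```

Individual trees usually output hard 0/1 probabilities on their own training samples, while the averaged probabilities take intermediate values, which is where the smoother estimate comes from.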


In [None]:
from figures import plot_forest_interactive
plot_forest_interactive()

## Selecting the Optimal Estimator via Cross-Validation

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=200)
parameters = {'max_features':['sqrt', 'log2', 10],
              'max_depth':[5, 7, 9]}

clf_grid = GridSearchCV(rf, parameters, n_jobs=-1)
clf_grid.fit(X_train, y_train)

In [None]:
clf_grid.score(X_train, y_train)

In [None]:
clf_grid.score(X_test, y_test)
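
After fitting, a `GridSearchCV` object also exposes which parameter combination won and how it scored under cross-validation. The sketch below uses a deliberately smaller forest and grid than the cells above (an assumption, so that it runs quickly).

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Small forest and small grid, chosen only to keep the run time short.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_features': ['sqrt', 'log2'], 'max_depth': [5, 9]},
    cv=3,
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("mean cross-validated accuracy of best setting:", search.best_score_)
print("accuracy on held-out test set:", test_accuracy)
```

The cross-validated score is computed on the training portion only, so the held-out test score remains an independent estimate of generalization.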

## Another option: Gradient Boosting

Another ensemble method that can be useful is *boosting*: here, rather than
looking at 200 (say) parallel estimators, we construct a chain of 200 estimators
that iteratively refine the results of the previous estimator.
The idea is that by sequentially applying very fast, simple models, we can get a
total model error that is lower than that of any of the individual pieces.
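
The chain of refinements can be sketched by hand for the squared-error case: each small tree is fit to the residual left by the estimators before it. This is a simplified illustration on synthetic data, not scikit-learn's actual implementation; the data and the values of `learning_rate` and `n_estimators` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption for this sketch).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=200)

learning_rate = 0.2
n_estimators = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(n_estimators):
    residual = y - prediction                 # what is still unexplained
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    small_tree.fit(X, residual)               # fit a simple model to the residual
    prediction += learning_rate * small_tree.predict(X)  # refine the estimate
    mse_per_stage.append(np.mean((y - prediction) ** 2))

print(f"training MSE: start {mse_per_stage[0]:.3f}, end {mse_per_stage[-1]:.3f}")
```

The learning rate shrinks each correction, so more stages are needed, but each single tree has less influence on the final model.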

In [None]:
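from sklearn.ensemble import GradientBoostingRegressor
# Note: the regressor treats the digit labels as numeric targets,
# so score() reports R^2 rather than classification accuracy.
clf = GradientBoostingRegressor(n_estimators=100, max_depth=5, learning_rate=.2)
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))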

<div class="alert alert-success">
    <b>EXERCISE: Cross-validating Gradient Boosting</b>:
     <ul>
      <li>
      Use a grid search to optimize the `learning_rate` and `max_depth` for a Gradient Boosted
Decision tree on the digits data set.
      </li>
    </ul>
</div>

In [None]:
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier

digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# split the dataset, apply grid-search

In [None]:
#%load solutions/19_gbc_grid.py