# Decision Trees and Random Forests

Random forests are an example of an ensemble method: they rely on aggregating the results of an ensemble of simpler estimators, in this case decision trees. The somewhat surprising result with such ensemble methods is that the whole can be greater than the sum of its parts: a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting. We will see examples of this in the following sections.
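
To make the voting claim concrete, here is a minimal sketch under an idealized assumption that the text above does not make: each of `n` estimators solves a binary problem and is correct independently with the same probability `p`. The helper name `majority_vote_accuracy` is hypothetical. Real trees in a forest make correlated errors, so the gain in practice is smaller than this calculation suggests.

```python
from math import comb

def majority_vote_accuracy(n_estimators, p_correct):
    """Probability that a strict majority of independent voters is correct.

    Assumes a binary problem and an odd number of voters, so ties cannot occur.
    """
    k_min = n_estimators // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_estimators, k) * p_correct**k * (1 - p_correct)**(n_estimators - k)
        for k in range(k_min, n_estimators + 1)
    )

# Each voter is right only 60% of the time; compare ensembles of growing size
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises with the number of voters, even though no individual voter improves.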

We begin with the standard imports:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

## Creating a Decision Tree

Consider the following two-dimensional data, which has one of four class labels:



In [None]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=1.0)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

A simple decision tree built on this data will iteratively split the data along one or the other axis according to some quantitative criterion, and at each level assign the label of the new region according to a majority vote of points within it.
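
One common choice for the "quantitative criterion" is Gini impurity, which is the default `criterion` of Scikit-Learn's `DecisionTreeClassifier`. The following is a simplified pure-Python sketch of a single split on a single feature, not the library's implementation; the helper names `gini` and `best_split` and the toy data are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) of the best split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split: impurity of the whole node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            # Place the threshold midway between the two neighboring values
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

# Toy one-dimensional data (hypothetical)
x_toy = [1.0, 1.5, 2.0, 6.0, 6.5, 7.0]
y_toy = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(x_toy, y_toy)
print(threshold, impurity)
```

A full tree repeats this search over every feature, keeps the best split, and then recurses into the two resulting regions.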

This process of fitting a decision tree to our data can be done in Scikit-Learn with the DecisionTreeClassifier estimator:

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier().fit(X, y)

In [None]:
tree.feature_importances_

Let's write a quick utility function to help us visualize the output of the classifier:

In [None]:
def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)

    ax.set(xlim=xlim, ylim=ylim)


Now we can examine what the decision tree classification looks like:

In [None]:
visualize_classifier(DecisionTreeClassifier(), X, y)

Notice that as the depth increases, we tend to get very strangely shaped classification regions; for example, at a depth of five, there is a tall and skinny purple region between the yellow and blue regions. 

It's clear that this is less a result of the true, intrinsic data distribution, and more a result of the particular sampling or noise properties of the data. That is, this decision tree, even at only five levels deep, is clearly **over-fitting** our data.
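
One way to check for over-fitting numerically is to compare accuracy on the training points with accuracy on held-out points. The sketch below regenerates the same blobs under new names (`X_demo`, `y_demo`, chosen here so nothing defined earlier is overwritten) and tries a few values of `max_depth`; the particular depths and the train/test split are illustrative assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data as above, under separate names
X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("max_depth={}: train={:.3f}, test={:.3f}".format(depth, *scores[depth]))
```

A training score that is much higher than the test score is the usual symptom: the tree has memorized details of the sample that do not carry over to new points.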

## Ensembles of Estimators: Random Forests

In Scikit-Learn, an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator, which takes care of all the randomization automatically. All you need to do is select a number of estimators, and it will very quickly (in parallel, if desired) fit the ensemble of trees:

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)
visualize_classifier(model, X, y);

We see that by averaging over 100 randomly perturbed models, we end up with an overall model that is much closer to our intuition about how the parameter space should be split.
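
To show what "randomly perturbed models" can mean, here is a hand-rolled sketch of the idea: each tree is fit on a bootstrap sample of the data and considers a random subset of features at each split, and the ensemble takes a majority vote. This is an illustration only, with hypothetical variable names. `RandomForestClassifier` itself combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            random_state=0, cluster_std=1.0)
rng = np.random.default_rng(0)

n_trees = 25
members = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many points as the data has, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    member = DecisionTreeClassifier(max_features="sqrt",
                                    random_state=int(rng.integers(1_000_000)))
    members.append(member.fit(X_demo[idx], y_demo[idx]))

# One row of predictions per tree: shape (n_trees, n_samples)
votes = np.stack([m.predict(X_demo) for m in members])

# Majority vote for each sample (each column of the votes array)
n_classes = len(np.unique(y_demo))
ensemble_pred = np.array([np.bincount(col, minlength=n_classes).argmax()
                          for col in votes.T])
print("Ensemble accuracy on the training points:",
      (ensemble_pred == y_demo).mean())
```

Because each tree sees a different resample, their individual quirks differ, and the vote smooths them out.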

## Example: Random Forest for Classifying Digits

Previously we worked with the hand-written digits data. Let's use that again here to see how the random forest classifier can be used in this context.

In [None]:
from sklearn.datasets import fetch_openml

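# X contains the data and y contains the labels.
# as_frame=False returns NumPy arrays, so X[i] below indexes a row;
# the labels in y are strings such as '5'.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)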

To remind us what we're looking at, we'll visualize the first few data points:

In [None]:
# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 28x28 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X[i].reshape(28, 28), cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(y[i]))

We can quickly classify the digits using a random forest as follows:



In [None]:
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y,
                                                random_state=0)
model = RandomForestClassifier(n_estimators=10)
model.fit(Xtrain, ytrain)
ypred = model.predict(Xtest)

In [None]:
test_im = Xtest[0].reshape((28,28))

plt.imshow(test_im, cmap='gray')
plt.show()
print('true label: ', ytest[0])

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(ytest, ypred)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label');

In [None]:
from sklearn.metrics import accuracy_score
print("Accuracy = {} %".format(accuracy_score(ytest, ypred)*100))

In [None]:
def train(num_trees):
    """Fit a forest of num_trees trees and print its test-set accuracy."""
    model = RandomForestClassifier(n_estimators=num_trees)
    model.fit(Xtrain, ytrain)
    ypred = model.predict(Xtest)
    print("Accuracy for {} trees = {} %".format(num_trees, accuracy_score(ytest, ypred)*100))

In [None]:
number_of_trees = [1, 5, 50, 200]
for i in number_of_trees:
    train(i)
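
The same comparison can be written so that it returns the accuracies instead of printing them, which makes them easy to compare or plot. The sketch below is a variation under stated assumptions: it uses the small 8x8 `load_digits` dataset bundled with Scikit-Learn instead of MNIST so that it runs quickly, fixes `random_state` so results are repeatable, and introduces the hypothetical helper `forest_accuracy`.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small bundled digits dataset (assumption: a stand-in for MNIST)
digits = load_digits()
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    digits.data, digits.target, random_state=0)

def forest_accuracy(num_trees, seed=0):
    """Fit a forest with num_trees trees and return its test-set accuracy."""
    model = RandomForestClassifier(n_estimators=num_trees, random_state=seed)
    model.fit(Xs_train, ys_train)
    return accuracy_score(ys_test, model.predict(Xs_test))

accuracies = {n: forest_accuracy(n) for n in (1, 5, 50, 200)}
for n, acc in accuracies.items():
    print("Accuracy for {} trees = {:.1f} %".format(n, acc * 100))
```

Accuracy typically climbs quickly over the first few dozen trees and then levels off, so adding more trees beyond that point mostly costs training time.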