# Trees and Forests

We will study some tree-based models using a different dataset, the Wisconsin breast cancer dataset, described below:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(precision=3)
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [None]:
print(cancer.DESCR)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)
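Passing ``stratify=cancer.target`` asks ``train_test_split`` to keep the class proportions roughly the same in both splits. Here is a minimal sketch to check that; the variable names ``full_frac``, ``train_frac`` and ``test_frac`` are mine, not part of the lab:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# fraction of each class (0 = malignant, 1 = benign) in the full data and in each split
full_frac = np.bincount(cancer.target) / len(cancer.target)
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)

print("full: ", full_frac)
print("train:", train_frac)
print("test: ", test_frac)
```

The three printed rows should agree closely; without ``stratify`` they can drift apart, especially on small datasets.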

## Tree visualization
Let's start by building a very small tree (``max_depth=2``) and visualizing it.
The model fitting shouldn't be anything new:

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)

Scikit-learn has a way to export trees to ``dot`` graphs using the ``export_graphviz`` function: 

In [None]:
from sklearn.tree import export_graphviz
tree_dot = export_graphviz(tree, out_file=None, feature_names=cancer.feature_names, filled=True)
print(tree_dot)

If you have graphviz installed, you can plot this using the following code. However, that's often a bit cumbersome, so I give an alternative below.

In [None]:
import graphviz
graphviz.Source(tree_dot)

I included a small function ``plot_tree`` in this lesson that can plot the tree without graphviz. It will hopefully be included in scikit-learn soon.

In [None]:
# import from local file, not in sklearn yet
from tree_plotting import plot_tree

plot_tree(tree, feature_names=cancer.feature_names, filled=True)
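If the local ``tree_plotting`` file is not available, recent scikit-learn releases (0.21 and later) ship a ``sklearn.tree.plot_tree`` function with a similar call signature. The sketch below assumes such a version is installed; I refer to the function by its full module path so it does not shadow the local ``plot_tree``, and ``small_tree`` is just an illustrative name:

```python
import matplotlib.pyplot as plt
import sklearn.tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# same small tree as above, refit here so the example is self-contained
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# draw the tree on a matplotlib axes; no graphviz needed
fig, ax = plt.subplots(figsize=(10, 5))
annotations = sklearn.tree.plot_tree(
    small_tree, feature_names=list(cancer.feature_names), filled=True, ax=ax)
```

Passing an explicit ``ax`` makes it easy to control the figure size, which matters once the trees get deeper.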

### Task 1

Create a plot of the full tree, that is, without limiting the depth.
Then, create visualizations of trees with varying ``max_depth`` and ``max_leaf_nodes``. How are these trees different? Which do you think will generalize best?

## Parameter Tuning
### Task 2
Tune the ``max_leaf_nodes`` parameter using ``GridSearchCV``:

In [None]:
from sklearn.model_selection import GridSearchCV
# param_grid = ...
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid,
                    cv=10, return_train_score=True)
grid.fit(X_train, y_train)
# inspect best parameters, compute test-set accuracy
#....

We can plot the tree that was fitted with the best parameters on the full training data by accessing ``grid.best_estimator_``:

In [None]:
plot_tree(grid.best_estimator_)

It's easy to visualize the mean training-set and validation accuracy, as we did in the last lab:

In [None]:
scores = pd.DataFrame(grid.cv_results_)
scores.plot(x='param_max_leaf_nodes', y=['mean_train_score', 'mean_test_score'], ax=plt.gca())
plt.legend(loc=(1, 0))
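The ``x`` column has to match the parameter that was actually searched: ``cv_results_`` stores each entry of ``param_grid`` under ``param_<name>``. Below is a minimal sketch of that naming, deliberately using a different parameter (``min_samples_leaf``) and an arbitrary grid so it doesn't give away Task 2; ``demo_grid`` and ``demo_scores`` are illustrative names:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=0)

# illustrative grid over a parameter other than the one tuned in Task 2
demo_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         param_grid={'min_samples_leaf': [1, 5, 20]},
                         cv=5, return_train_score=True)
demo_grid.fit(X_train, y_train)

demo_scores = pd.DataFrame(demo_grid.cv_results_)
# one 'param_<name>' column per key in param_grid
param_columns = [c for c in demo_scores.columns if c.startswith('param_')]
print(param_columns)
print(demo_scores[['param_min_samples_leaf', 'mean_train_score', 'mean_test_score']])
```

Note that the ``mean_train_score`` column only exists because ``return_train_score=True`` was passed to ``GridSearchCV``.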

### Task 3
Plot the feature importances of the ``best_estimator_`` using a bar chart.

In [None]:
# solution here ...

## Random Forests
While we could in theory visualize all the trees in a forest, they are random by design, and usually there are too many to look at.
So we'll skip the visualization and go directly to parameter tuning.

### Task 4
Tune the ``max_depth`` parameter of the ``RandomForestClassifier``. Make sure to set ``n_estimators`` to a large enough number (such as 100).

Plot the feature importances of the best random forest side-by-side with the feature importances of the best decision tree.

Finally, compare the precision-recall curve of the best random forest with that of the best tree.