# Working with Decision Trees
*Curtis Miller*

A **decision tree** is a classification algorithm that predicts the value of a target variable by answering a series of true/false questions about the data. It is usually visualized as a tree that one traces from the root down to a leaf to make a prediction. A nice feature of this algorithm is that the result is a heuristic a human can easily interpret and use. However, decision trees are especially prone to overfitting.
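
To make the "series of true/false questions" concrete, here is a minimal sketch of a tree written by hand as nested `if` statements. The function name `predict_survival`, the questions, and the thresholds are all made up for illustration; none of them are learned from the *Titanic* data.

```python
# Hypothetical hand-written tree: questions and thresholds are illustrative
# assumptions, not the result of fitting any model.
def predict_survival(sex, pclass, age):
    """Trace a tiny decision tree and return 1 (survived) or 0 (died)."""
    if sex == "female":      # Question 1: is the passenger female?
        if pclass <= 2:      # Question 2a: first or second class?
            return 1
        return 0
    if age <= 10:            # Question 2b: is the passenger a young child?
        return 1
    return 0


print(predict_survival("female", 1, 30))
print(predict_survival("male", 3, 40))
```

Each call follows exactly one path from the root to a leaf, which is what makes the fitted model easy to read.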

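Decision tree classifiers can be implemented using the **scikit-learn** class `DecisionTreeClassifier`. The training algorithm grows the tree greedily: at each node it picks the question that best separates the classes in the training data (by default, as measured by Gini impurity), so that few questions are needed to reach an accurate decision.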
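
One common way to score a candidate question is Gini impurity, which is the default `criterion` of `DecisionTreeClassifier`. The sketch below shows the idea on made-up toy data; the helper names `gini` and `split_impurity` are my own, and this is not scikit-learn's actual implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity after asking 'value <= threshold?'."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Toy data (made up): a feature that separates the two classes perfectly
sex = [0, 0, 0, 1, 1, 1]
survived = [0, 0, 0, 1, 1, 1]

print(gini(survived))                      # impurity before any split
print(split_impurity(sex, survived, 0.5))  # impurity after the split
```

A split that lowers the weighted impurity is a better question; a greedy tree builder would try many thresholds and keep the best one at each node.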

The hyperparameter I want to draw particular attention to is the maximum depth a decision tree may have. Trees with high depth are more prone to overfitting than trees with low depth, while trees with low depth may underfit.
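
The trade-off can be seen on synthetic data before touching the real dataset. The sketch below is an illustration under assumptions: the data come from `make_classification` with deliberately noisy labels, not from the *Titanic* file, and the `scores` dictionary is just a convenient place to collect results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration); flip_y randomly relabels
# a share of the samples so that a perfect fit must memorize noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in [1, 3, None]:    # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, tree.get_depth(), scores[depth])
```

The unrestricted tree fits the training set perfectly; comparing its training and test accuracy against the shallow trees shows how much of that fit carries over to unseen data.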

We will see how well decision trees predict who survived the *Titanic* disaster. I load in the *Titanic* dataset below.

In [None]:
import pandas as pd
from pandas import DataFrame
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import classification_report

In [None]:
titanic = pd.read_csv("titanic.csv")
titanic.head()

In [None]:
titanic_train, titanic_test = train_test_split(titanic)
titanic_train.head()

## Fitting a Decision Tree

We will fit a decision tree without specifying a maximum depth. We will also visualize the tree. (I grabbed the code for visualizing the tree from a [blog post by "Russel"](https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176).)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from io import StringIO    # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

In [None]:
tree1 = DecisionTreeClassifier()

tree1 = tree1.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                         ).drop(["Survived", "Name"], axis=1),
                  y=titanic_train.Survived)

# Example prediction
tree1.predict([[2, 0, 26, 0, 0, 30]])    # A male in second class age 26 with no spouse or child aboard who paid $30 fare
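
The same `replace(...).drop(...)` expression is repeated in every cell that fits or predicts. One way to avoid the repetition is a small helper; `prepare_features` below is a hypothetical name, and the sketch swaps `replace` for `Series.map`, which does the same encoding here. The demo rows are made up, with the columns implied by the feature list used in this notebook.

```python
import pandas as pd


def prepare_features(df):
    """Return the numeric feature columns used to fit the tree.

    Hypothetical helper; not part of pandas or scikit-learn.
    """
    features = df.drop(["Survived", "Name"], axis=1)
    # Encode Sex as a number (male -> 0, female -> 1)
    return features.assign(Sex=features["Sex"].map({"male": 0, "female": 1}))


# Two made-up rows for illustration
demo = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["A. Example", "B. Example"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Siblings/Spouses Aboard": [1, 1],
    "Parents/Children Aboard": [0, 0],
    "Fare": [7.25, 71.28],
})
print(prepare_features(demo))
```

With such a helper, a fit would read `tree1.fit(X=prepare_features(titanic_train), y=titanic_train.Survived)`.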

In [None]:
pred1 = tree1.predict(titanic_train.replace({'Sex': {'male': 0, 'female': 1}}
                                           ).drop(["Survived", "Name"], axis=1))
print(classification_report(titanic_train.Survived, pred1))

We can see that on the training data the algorithm is highly accurate, but there's a good chance the model overfit the data.
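
For readers unfamiliar with the columns of `classification_report`, the sketch below computes precision, recall, and F1 for one class from raw counts. The function name `binary_report` and the labels are made up for illustration; `classification_report` reports these quantities per class along with averages.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels: 1 = survived, 0 = died
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(binary_report(y_true, y_pred))
```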

We can visualize the resulting tree like so. (You will need to install [Graphviz](http://www.graphviz.org/), an open source software package for visualizing graphs, including decision trees.)

In [None]:
# From here: https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176
dot_data = StringIO()

export_graphviz(tree1,    # Function for exporting a visualization of the tree
                out_file=dot_data,
                # Data controlling the display of the graph
                filled=True, rounded=True,
                special_characters=True,
                feature_names=["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard",
                               "Fare"],    # Use the name of the features
                proportion=True)    # Show proportions for labels

# Display graph in Jupyter notebook
graph1 = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph1.create_png())

A tree this complex is probably overfit. In fact, let's peek to see how it would do on the test data.

In [None]:
pred2 = tree1.predict(titanic_test.replace({'Sex': {'male': 0, 'female': 1}}
                                          ).drop(["Survived", "Name"], axis=1))
print(classification_report(titanic_test.Survived, pred2))

Performance dropped significantly relative to the training data. This is likely no better than the "predict most frequent label" algorithm.
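
The "predict most frequent label" baseline is easy to compute directly. The sketch below uses made-up labels, and `majority_baseline_accuracy` is a hypothetical helper name; scikit-learn offers the same idea as `DummyClassifier(strategy="most_frequent")`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Made-up labels for illustration: 0 = died, 1 = survived
train_labels = [0, 0, 0, 1, 1]
test_labels = [0, 1, 0, 0]
print(majority_baseline_accuracy(train_labels, test_labels))
```

Calling it with `titanic_train.Survived` and `titanic_test.Survived` would give the number to compare against the accuracy in the report above.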

## Restricting Tree Depth

We can control overfitting by restricting the depth of the tree. For example, let's see a tree that does not go deeper than three levels.

In [None]:
tree2 = DecisionTreeClassifier(max_depth=3)

tree2 = tree2.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                         ).drop(["Survived", "Name"], axis=1),
                  y=titanic_train.Survived)

dot_data = StringIO()

export_graphviz(tree2,    # Function for exporting a visualization of the tree
                out_file=dot_data,
                # Data controlling the display of the graph
                filled=True, rounded=True,
                special_characters=True,
                feature_names=["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard",
                               "Fare"],
                proportion=True)

# Display graph in Jupyter notebook
graph2 = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph2.create_png())
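
If Graphviz is not installed, a shallow tree like this one can also be printed as plain text with `sklearn.tree.export_text`. The sketch below is self-contained and therefore uses synthetic stand-in data with placeholder feature names `f0` to `f3`, not the *Titanic* columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are placeholders
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
names = ["f0", "f1", "f2", "f3"]

shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
text_view = export_text(shallow_tree, feature_names=names)
print(text_view)    # one indented line per question and per leaf
```

For the tree fitted above, the equivalent call would pass `tree2` and the same list of feature names given to `export_graphviz`.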

Now let's use cross-validation to decide the appropriate tree depth.

In [None]:
m_candidate = [2, 3, 4, 5, 6, 7, 8, 9, 10]    # Candidate depths
res = dict()

for m in m_candidate:
    pred3 = DecisionTreeClassifier(max_depth=m)
    res[m] = cross_validate(pred3,
                            X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                         ).drop(["Survived", "Name"], axis=1),
                            y=titanic_train.Survived,
                            cv=10,
                            return_train_score=False,
                            scoring='accuracy')

resdf = DataFrame({(i, j): res[i][j]
                             for i in res.keys()
                             for j in res[i].keys()}).T

resdf.loc[(slice(None), 'test_score'), :]

In [None]:
resdf.loc[(slice(None), 'test_score'), :].mean(axis=1)

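Maximum predictive accuracy occurs when the maximum depth is 4, so that is the depth used for the final tree. (The train/test split is random, so the best depth can change from run to run.) Let's see the final result.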
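
The depth search above can also be written with `GridSearchCV`, which runs the same kind of cross-validated comparison and then refits the best model on all of the data it was given. This is a sketch of an alternative, not the method used in this notebook, and it runs on synthetic data so that it is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": list(range(2, 11))},
                      cv=10,                # 10-fold cross-validation
                      scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
best_tree = search.best_estimator_    # already refit with the best depth
```

Reading the chosen depth from `search.best_params_` also avoids copying the number by hand into the next cell.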

In [None]:
tree4 = DecisionTreeClassifier(max_depth=4)

tree4 = tree4.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                         ).drop(["Survived", "Name"], axis=1),
                  y=titanic_train.Survived)

dot_data = StringIO()

export_graphviz(tree4,    # Function for exporting a visualization of the tree
                out_file=dot_data,
                # Data controlling the display of the graph
                filled=True, rounded=True,
                special_characters=True,
                feature_names=["Pclass", "Sex", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard",
                               "Fare"],
                proportion=True)

# Display graph in Jupyter notebook
graph3 = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph3.create_png())

In [None]:
survived_test_predict = tree4.predict(X=titanic_test.replace(
    {'Sex': {'male': 0, 'female': 1}}
).drop(["Survived", "Name"], axis=1))

In [None]:
print(classification_report(titanic_test.Survived, survived_test_predict))

The metrics of the depth-restricted decision tree look good. They are better than the ones we got when we allowed the tree to grow to any depth.