# <font color='#eb3483'> Decision Trees </font>

Decision trees are among the most popular algorithms for classification and regression. One of their advantages is interpretability: you can look at a tree and see exactly how it reaches a decision.

In [None]:
from IPython.display import Image
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")

import seaborn as sns
sns.set(rc={"figure.figsize":(6,6)})

## <font color='#eb3483'> Load the data </font>

We are going to use the [Titanic dataset](https://www.kaggle.com/c/titanic/data), and load it up from seaborn.

In [None]:
passengers = sns.load_dataset("titanic")
passengers.head()

## <font color='#eb3483'> Preprocessing </font>

In [None]:
#remove the alive column

In [None]:
#check for missing values

We can drop the "deck" column, since 688 out of 891 instances are null, and we don't really have a way to fill in the missing values (unlike the "age" column, where we can use the average). We can also drop the "embarked" column since it has the same data as "embark_town". We can also drop the 2 rows with missing "embark_town" values.
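
As a rough illustration of the cleaning steps described above, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its values are hypothetical and only mimic the column names of the Titanic data.

```python
import numpy as np
import pandas as pd

# hypothetical mini dataset, NOT the real Titanic data
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embark_town": ["Southampton", "Cherbourg", None, "Southampton"],
})

# count the missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# drop the mostly-empty column, fill age with its mean, drop remaining null rows
toy = toy.drop(columns=["deck"])
toy["age"] = toy["age"].fillna(toy["age"].mean())
toy = toy.dropna()

# confirm that no missing values remain
print(toy.isnull().sum().sum())
```

Counting nulls before and after makes it easy to confirm that each step did what was intended.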

In [None]:
# drop deck and embarked
dropped_columns = ["deck", "embarked"]
passengers = passengers.drop(dropped_columns, axis = 1)

# fill in null age values with the average
passengers["age"] = passengers["age"].fillna(passengers["age"].mean())

# drop rows with null values
passengers = passengers.dropna()

passengers.head()

In [None]:
#check the missing values again to confirm none remain

Decision trees can handle both categorical and numerical data. However, scikit-learn's implementation does not support categorical features directly. This is a long-standing issue; see https://stackoverflow.com/a/56857255 for more details.

Because of this, categorical columns have to be encoded as numbers before fitting a tree in scikit-learn. One-hot encoding is the usual choice; here we use `pd.get_dummies`, which plays the same role as scikit-learn's `OneHotEncoder`.
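
To make the encoding step concrete, here is a small sketch of what one-hot encoding does to a categorical column. The `toy` frame is a made-up example, not the Titanic data.

```python
import pandas as pd

# hypothetical two-column frame: one numeric and one categorical feature
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "sex": ["male", "female", "female"],
})

# numeric columns are kept as they are; each category becomes its own indicator column
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one active indicator per original categorical column, so the tree can split on questions such as "is `sex_male` true?".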

In [None]:
passengers = pd.get_dummies(passengers)

passengers.head()

## <font color='#eb3483'> Classification with Decision Trees </font>

In scikit-learn, the tree-based algorithms are in the `sklearn.tree` submodule.

Scikit-learn's tree implementation [uses an optimized version of CART](http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart) *(Classification and Regression Trees)*, which allows us to use decision trees for both classification and regression.
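
To give an intuition for how a CART-style tree picks a split, the sketch below searches one numeric feature for the threshold with the lowest weighted Gini impurity. It is an illustration only: the helper names `gini` and `best_threshold` and the toy arrays are hypothetical, and scikit-learn's real implementation is far more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Return (threshold, weighted_gini) of the best binary split x <= threshold."""
    best = (None, np.inf)
    values = np.unique(x)
    # candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# made-up values: low fares did not survive, high fares did
fare = np.array([7.0, 8.0, 13.0, 26.0, 53.0, 80.0])
survived = np.array([0, 0, 0, 1, 1, 1])

threshold, impurity = best_threshold(fare, survived)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node and keeps splitting until a stopping rule applies.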

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import metrics

tree = DecisionTreeClassifier()

In [None]:
features = passengers.drop(columns=["survived"]).columns
X = passengers[features]
y = passengers["survived"]
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

tree.fit(X_train, y_train)

In [None]:
#generate predictions based on your model and check the first 10 outputs of the array

In [None]:
#evaluate your model

One big advantage of decision trees is that they can be visualized, so we can explain why they make a certain decision (we say trees have high **explainability**). Scikit-learn trees can be visualized with a variety of tools; today we'll use `Graphviz`, a graph visualization tool with a Python interface. You can install it using `conda install python-graphviz`.

In [None]:
import graphviz
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.tree import export_graphviz

def draw_tree(tree):
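    # export the fitted tree to Graphviz DOT format
    dot_data = export_graphviz(tree,
                               out_file=None,
                               feature_names=list(features),
                               # class names follow the sorted labels: 0 = died, 1 = survived
                               class_names=["died", "survived"],
                               filled=True,
                               rounded=True,
                               special_characters=True,
                               proportion = True)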

    graph = graphviz.Source(dot_data)
    graph.format = "png"
    graph.render("tree",view=True)
    plt.figure(figsize=(50,30))
    img = mpimg.imread("tree.png")
    imgplot = plt.imshow(img)

    plt.show()

draw_tree(tree)
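
If Graphviz is not installed, a plain-text view of the tree can be a handy fallback. The sketch below uses `sklearn.tree.export_text` on a small made-up dataset; `X_toy`, `y_toy` and the feature names are hypothetical and not the Titanic data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical toy data: columns are [fare, is_male]
X_toy = [[7.0, 1], [8.0, 1], [13.0, 0], [26.0, 0], [53.0, 1], [80.0, 0]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# render the decision rules as indented text
text = export_text(toy_tree, feature_names=["fare", "is_male"])
print(text)
```

Each indented line is one *question* in the tree, and the innermost lines show the predicted class of the leaf.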


Another useful property of decision trees is that they indicate how important each variable is. A fitted tree exposes this through the attribute `feature_importances_`. The importances are computed from the impurity reduction (information gain) contributed by each variable, that is, how well its splits separate the classes.

In [None]:
tree.feature_importances_

In [None]:
dict(zip(
    features,
    tree.feature_importances_
))
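
A sorted view is often easier to read than a raw array or dict. The sketch below assumes a small made-up frame (`toy`, `labels` and the column names are hypothetical) and wraps the importances in a pandas Series.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical data: "fare" separates the classes perfectly, "noise" does not
toy = pd.DataFrame({
    "fare": [7.0, 8.0, 13.0, 26.0, 53.0, 80.0],
    "noise": [1, 0, 1, 0, 1, 0],
})
labels = [0, 0, 0, 1, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy, labels)

# pair each importance with its column name and sort from most to least important
importances = pd.Series(toy_tree.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized, so they sum to 1 whenever the tree makes at least one split.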

For example, we see that on this tree the most important features for predicting whether a passenger survives are "fare", "age" and gender (encoded in the "sex" columns).

## <font color='#eb3483'> Max Depth </font>

In [None]:
DecisionTreeClassifier?

The most important hyperparameters for scikit-learn's `DecisionTreeClassifier` are listed below.

Here is a great article that goes over how to understand them and potentially use them for tuning your model:
https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680


* **criterion**: The partition criterion to use; we can use either `gini` or `entropy`.

* **max_depth** (int or None): The maximum depth the tree can reach. We define depth as the number of decision nodes an observation goes through (how many *questions* are asked). With `None` (the default), the tree keeps growing until its leaves are pure or cannot be split further.

* **max_features** (int or float (fraction)): The number of features considered when looking for the best split at a node.

* **max_leaf_nodes** (int or None): Maximum number of leaves in the tree.

* **min_impurity_decrease** (float): The minimum impurity decrease (information gain) a split must achieve; if no split reaches that minimum, the node becomes a leaf.

* **class_weight**: For imbalanced classes, we can use `class_weight`, which is a dictionary of the form `{class: weight}`, so sklearn takes the class weights into consideration. We can also use `class_weight="balanced"` so sklearn creates the weights automatically based on the class distribution.

Decision trees are prone to overfitting. Some hyperparameters help us control this:

* **min_samples_leaf** (int or float (fraction)): Minimum number of observations required in a leaf; a split is only considered if it leaves at least this many observations in each branch. The default value is 1, which means that by default sklearn can create leaves with a single observation (and thus memorize the dataset).

* **min_samples_split** (int or float (fraction)): Minimum number of observations a node needs in order to be split. The default is 2, which means that by default sklearn will consider splitting any impure node with 2 or more observations.
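
To see how these hyperparameters constrain a tree, the sketch below compares an unconstrained tree with a constrained one. It assumes synthetic data from `make_classification` rather than the Titanic data, and the specific values (`max_depth=3`, `min_samples_leaf=10`) are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic data with some label noise, used only for illustration
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   flip_y=0.1, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
constrained = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                     random_state=0).fit(X_syn, y_syn)

# compare tree size and accuracy on the training data
for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(name, model.get_depth(), model.get_n_leaves(), model.score(X_syn, y_syn))
```

The unconstrained tree fits the training data perfectly, noise included, which is the memorization behavior described above; the constrained tree is much smaller and trades some training accuracy for simplicity.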

For example, we can create a simpler tree by limiting the maximum depth:

In [None]:
simple_tree = DecisionTreeClassifier(max_depth=4)

In [None]:
simple_tree.fit(X_train, y_train)

In [None]:
draw_tree(simple_tree)

In [None]:
simple_predictions = simple_tree.predict(X_test)
print(metrics.confusion_matrix(y_test, simple_predictions))

In [None]:
print(metrics.accuracy_score(y_test, simple_predictions))

In [None]:
# note: for 0/1 labels, MAE and MSE both equal the misclassification rate (1 - accuracy)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, simple_predictions))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, simple_predictions))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, simple_predictions)))

In [None]:
cross_val_score(simple_tree, X, y,
                scoring="roc_auc",
                cv=3).mean()

So we see this simple tree performs much better than the initial tree (which was overfitting), and it is also very simple to explain!

But how do we know what the optimal depth is? Well, this is a balance of practicality and "hyperparameter tuning". Let's test a number of depths.

In [None]:
# let's test depths from 2 to 9 using a for loop

depths = np.arange(2,10) # define the depths (the upper bound 10 is excluded)
results = [] # create an empty list for our results

for depth in depths:
    best_depth_tree = DecisionTreeClassifier(max_depth = depth) # creating an instance of a decision tree
    results.append(cross_val_score(best_depth_tree, X, y, scoring="roc_auc", # getting the mean cross-validated ROC AUC for the tree at each depth
                cv=3).mean())


In [None]:
test = pd.DataFrame({'depths':depths, 'mean_roc_auc':results})
test.sort_values("mean_roc_auc", ascending=False)
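
The same depth search can also be written with scikit-learn's `GridSearchCV`, which handles the loop and the bookkeeping. This is a sketch under assumptions: it uses synthetic data from `make_classification` instead of the Titanic features, so the selected depth says nothing about the dataset above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, used only for illustration
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   random_state=0)

# try depths 2 to 9 with 3-fold cross-validation, scored by ROC AUC
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(2, 10))},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_syn, y_syn)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.cv_results_` holds the per-depth scores, and `search.best_estimator_` is a tree refit on all the data with the best depth.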