# Decision Trees 

A decision tree is like going to a doctor who asks a series of questions to determine the cause of your symptoms.

**Let's Play 20 Questions** 

**What are some other real-life examples (like our doctor) where our society uses a decision tree mentality to come to an answer?** 



## Entropy and Information Gain 
The goal is to have our ultimate classes be fully "ordered" (for a binary dependent variable, we'd have the 1's in one group and the 0's in the other). So one way to assess the value of a split is to measure how disordered our groups are, and there is a notion of entropy that measures precisely this.

The entropy of the whole dataset is given by:

$\large E = -\sum_{i=1}^{n} p_i\log_2(p_i)$,

where $p_i$ is the probability of belonging to the $i$th group and $n$ is the number of groups (i.e. target values).

For a binary target, entropy will always be between 0 and 1; in general it lies between 0 and $\log_2(n)$. The closer to the maximum, the more disordered your group.

Let's use the math library's log() function to look at this:
[Powerpoint presentation](https://docs.google.com/presentation/d/1kXs3Mi9a3w87J6tzs2sWyxW8kq2eaRQTBgUPKvuf8x8/edit#slide=id.p9)


In [5]:
#if we start with a 50/50 distribution 
from math import log
import numpy as np 
entropy = -0.5 * log(0.5, 2) - 0.5 * log(0.5, 2)
entropy

1.0

In [None]:
#tree will detect a feature that can split into 30/70
entropy = -0.3 * log(0.3, 2) - 0.7 * log(0.7, 2)
entropy
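
The two cells above can be generalized to any class distribution. The following is a minimal sketch, not part of the original notebook; the helper name `entropy` and its list-of-probabilities argument are assumptions made for illustration. It applies the formula above directly and skips zero-probability terms, which contribute nothing.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a distribution given as a list of probabilities."""
    # Terms with p == 0 are skipped: p * log2(p) tends to 0 as p tends to 0.
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                   # 1.0, a 50/50 group is maximally disordered
print(entropy([1.0, 0.0]))                   # 0.0, a pure group is fully ordered
print(round(entropy([0.3, 0.7]), 4))         # 0.8813
print(round(entropy([1/3, 1/3, 1/3]), 4))    # 1.585, three balanced classes exceed 1
```

The last line shows that the upper bound of 1 applies only to a two-class target; with three equally likely classes the entropy is $\log_2(3)$.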

For a given split, the information gain is the entropy of the parent group less the weighted average entropy of the child groups, where each child is weighted by its share of the parent's observations.

For a given parent, then, we maximize the information gain by minimizing the split's weighted entropy.

What we'd like to do then is:

- to look at the entropies of all possible splits, and
- to choose the split with the lowest entropy.

In practice there are far too many splits for it to be practical for a person to calculate all these different entropies ...

... but we can make computers do these calculations for us!
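
As a rough illustration of what the computer does, the sketch below scores every candidate threshold on a single numeric feature and keeps the one with the highest information gain. It is a simplified stand-in, not scikit-learn's actual implementation; the function names (`entropy_of_labels`, `information_gain`, `best_threshold`) and the toy data are hypothetical.

```python
from math import log2

def entropy_of_labels(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted average entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy_of_labels(left) \
             + (len(right) / n) * entropy_of_labels(right)
    return entropy_of_labels(parent) - weighted

def best_threshold(values, labels):
    """Try a threshold midway between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    best = (None, 0.0)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = information_gain(labels, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical toy data: a glucose-like reading and a binary outcome.
glucose = [85, 90, 100, 110, 140, 150, 160, 170]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_threshold(glucose, outcome))  # (125.0, 1.0)
```

Here the threshold 125.0 separates the two classes perfectly, so both children have entropy 0 and the gain equals the parent's entropy of 1.0. A real tree repeats this search over every feature at every node.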

**Gini Impurity**

An alternative metric to entropy comes from the work of Corrado Gini. The Gini Impurity is defined as:

$\large G = 1 - \Sigma_ip_i^2$, or, equivalently, $\large G = \Sigma_ip_i(1-p_i)$.

where, again, $p_i$ is the probability of belonging to the $i$th group.

For a binary target, Gini Impurity will always be between 0 and 0.5; in general its maximum is $1 - 1/n$ for $n$ groups. The closer to the maximum, the more disordered your group.

In [None]:
#gini with 30/70 split 
1 - (0.7**2 + 0.3**2)
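
The same calculation can be wrapped in a small helper. This is a minimal sketch; the function name `gini` and its list-of-probabilities argument are assumptions made for illustration. It also checks numerically that the two forms of the formula above agree.

```python
def gini(probs):
    """Gini impurity of a distribution given as a list of probabilities."""
    return 1 - sum(p ** 2 for p in probs)

print(gini([0.5, 0.5]))              # 0.5, the maximum for two classes
print(gini([1.0, 0.0]))              # 0.0, a pure group
print(round(gini([0.3, 0.7]), 2))    # 0.42

# The equivalent form: sum of p_i * (1 - p_i)
alt = sum(p * (1 - p) for p in [0.3, 0.7])
print(abs(alt - gini([0.3, 0.7])) < 1e-12)  # True
```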

In [None]:
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the diabetes dataset (adjust the path to wherever diabetes.csv is saved locally)
df = pd.read_csv('diabetes.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
import seaborn as sns
# Plot the class balance of the target column
sns.countplot(x='Outcome', data=df)

In [None]:
#create numpy arrays for predictors and target variables 
X = df.drop('Outcome',axis=1).values
y = df['Outcome'].values

In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [None]:
print('-'*40)
print('Accuracy Score:')
print(accuracy_score(y_test, y_pred))

print('-'*40)
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

print('-'*40)
print('Classification Report:')
print(classification_report(y_test, y_pred))

In [None]:
#brew install graphviz
#pip install -U pydotplus

#For installing graphviz on windows use the url below 
#https://bobswift.atlassian.net/wiki/spaces/GVIZ/pages/20971549/How+to+install+Graphviz+software

### Visualizing the Tree 

**You will likely need to install the packages listed in the cell above before you can use the following code. `export_graphviz` converts our classifier into a dot file, and pydotplus converts the dot file into a PNG, which can then be displayed in the notebook.** 

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image  
import pydotplus

col_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
             'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = col_names,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())

**This unpruned tree is not very interpretable. Let's prune these trees and see if we get a better output.** 

## Hyperparameter Tuning 

![](https://miro.medium.com/max/1136/1*3MDxpY_pIMs0yb4dc55KpQ.jpeg)

[Great Medium Article that thoroughly goes over each hyperparameter](https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680)

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)


In [None]:
print('-'*40)
print('Accuracy Score:')
print(accuracy_score(y_test, y_pred))

print('-'*40)
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

print('-'*40)
print('Classification Report:')
print(classification_report(y_test, y_pred))

In [None]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = col_names,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())

## Feature Selection 

In [None]:
top_feats = clf.feature_importances_
top_feats


In [None]:
import numpy as np
# creating list of column names
feat_names=list(col_names)

# Sort feature importances in descending order
indices = np.argsort(top_feats)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [feat_names[i] for i in indices]

# Create plot
plt.figure()

# Create plot title
plt.title("Feature Importance")

# Add bars
plt.bar(range(X_train.shape[1]), top_feats[indices])

# Add feature names as x-axis labels
plt.xticks(range(X_train.shape[1]), names, rotation=90)

# Show plot
plt.show()

## Pros & Cons 
[Scikit-Learn](https://scikit-learn.org/stable/modules/tree.html#tree)

**Pros**
- Easy to interpret 
- Easily capture non-linear patterns 
- No need to normalize columns 
- Little feature engineering needed; feature importances come built in 
- Non-parametric/no assumptions 

**Cons**
- Sensitive to noisy data (tends to overfit); can be reduced with tuning 
- High variance (small changes in the data can produce a very different tree); can be reduced by bagging/boosting 
- Biased when you have imbalanced data (can be mitigated with SMOTE) 