# Decision trees: code

In [142]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

%matplotlib inline

## Build a decision tree and visualize the result EXAMPLE

**The iris dataset**  
Measurements of sepal and petal length and width (cm) for three iris species.  

_Can we build a decision tree to model the petal and sepal dimensions associated with each flower species?_

![http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/iris_petal_sepal.png](./images/iris_petal_sepal.png)

In [None]:
# Load the iris dataset (returned as a dict-like "Bunch" object)
iris = load_iris()

In [89]:
# Review the keys available in the dataset
print("\nKeys:")
print(list(iris.keys()))

# Review the target names, and the target values
print("\nTarget names:")
print(iris['target_names'])

print("\nTarget values:")
print(iris.target)

# Review the feature values
print("\nFeature values (first 5 rows):")
print(iris.data[:5])


Keys:
['target_names', 'data', 'target', 'DESCR', 'feature_names']

Target names:
['setosa' 'versicolor' 'virginica']

Target values:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Feature values (first 5 rows):
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
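
The printout above shows raw numbers only. As a small sketch, the four columns can be paired with `iris.feature_names`, and the class balance checked with `np.bincount`. The variable names `first_row` and `class_counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Pair each column of the first row with its feature name
first_row = dict(zip(iris.feature_names, iris.data[0]))
for name, value in first_row.items():
    print("{}: {}".format(name, value))

# Count how many samples belong to each species
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print("{}: {} samples".format(name, count))
```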


In [199]:
# Instantiate a tree object
tree = DecisionTreeClassifier()

# Fit the model
tree.fit(iris.data, iris.target)

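# Check training accuracy the long way: compare each prediction to its label
print(tree.predict(iris.data) == iris.target)

# ...and the short way: the fraction of correct predictions
print("\nAccuracy: {}".format(tree.score(iris.data, iris.target)))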

[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]

Accuracy: 1.0
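
An accuracy of 1.0 here is measured on the same rows the tree was fit to, so it says little about how the tree handles unseen flowers. Below is a minimal sketch of a held-out check. The 70/30 split and the fixed seeds are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hold out 30% of the flowers; the split size and seeds are arbitrary choices
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

holdout_tree = DecisionTreeClassifier(random_state=0)
holdout_tree.fit(X_train, y_train)

train_acc = holdout_tree.score(X_train, y_train)
test_acc = holdout_tree.score(X_test, y_test)
print("Training accuracy: {}".format(train_acc))
print("Held-out accuracy: {}".format(test_acc))
```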


In [42]:
export_graphviz(tree, 
                out_file="./images/iris_tree.dot",
                class_names = iris["target_names"],
                feature_names = iris["feature_names"],
                impurity = False,
                filled = True)

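**Rendering DOT files can be a stumbling block**  
`export_graphviz` writes the tree as a Graphviz DOT file, a plain-text graph description that must be converted to an image before it can be displayed. Example DOT graph:  
```dot
digraph graphname {
     a -> b -> c;
     b -> d;
 }
```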
![dot file](https://upload.wikimedia.org/wikipedia/commons/e/ec/DotLanguageDirected.svg)
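
To confirm that the exported tree is plain DOT text like the example above, `export_graphviz` can return the source as a string instead of writing a file. This is a sketch; `dot_tree` and `dot_source` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
dot_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_source = export_graphviz(dot_tree,
                             out_file=None,
                             class_names=list(iris.target_names),
                             feature_names=iris.feature_names)

# Show the first few lines of the generated graph description
print("\n".join(dot_source.splitlines()[:4]))
```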

In [207]:
# Quick fix: shell command to convert dot to png (requires Graphviz's `dot` on the PATH)
!dot -Tpng ./images/iris_tree.dot -o ./images/iris_tree.png

![iris_tree](./images/iris_tree.png)
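
If Graphviz is not installed, the `dot` command above will fail. One possible fallback, assuming scikit-learn 0.21 or later, is `sklearn.tree.export_text`, which prints the splits as indented rules without any external tools. The names `text_tree` and `rules` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
text_tree = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Render the fitted tree as indented plain-text rules
rules = export_text(text_tree, feature_names=list(iris.feature_names))
print(rules)
```

This loses the colour coding of the Graphviz image, but the thresholds and leaf classes are the same information.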

## Predicting with decision trees EXERCISE

**The breast cancer dataset**  
Measures of tumor cytoarchitecture for two different diagnoses: benign or malignant.  

_Can we build a decision tree to predict the diagnostic outcome of breast cancer screenings based on the cellular dimensions of biopsy tissue?_

In [269]:
# Load the breast cancer dataset
cancer = load_breast_cancer()

In [None]:
# Review the keys available in the dataset
print("\nKeys:")
print()  # CODE

# Review the target names, and the target values
print("\nTarget names:")
print()  # CODE

print("\nTarget values:")
print()  # CODE

# Review the feature names, and feature values
print("\nFeature values (first 5 rows):")
print()  # CODE

In [270]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, 
                                                    train_size = 0.5, 
                                                    random_state = 7)

In [271]:
# Instantiate a tree object
tree = DecisionTreeClassifier(random_state = 42)

In [272]:
# Fit the model to the training set
tree.fit(X_train, y_train)

# Evaluate performance
print("Training accuracy: {}".format(tree.score(X_train, y_train)))
print("Testing accuracy: {}".format(tree.score(X_test, y_test)))

Training accuracy: 1.0
Testing accuracy: 0.908771929825


Our training accuracy is perfect, but our testing accuracy is lower than that reported by others who have worked on
this dataset. Together, these observations suggest that our model is overfitting.
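
One way to make the overfitting argument concrete is to look at the gap between training and testing accuracy alongside the size of the fitted tree. The sketch below reuses the split and seeds from the cells above. `get_depth` and `get_n_leaves` assume scikit-learn 0.21 or later, and `full_tree` and `gap` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, train_size=0.5, random_state=7)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A large gap between training and testing accuracy is a sign of overfitting
gap = full_tree.score(X_train, y_train) - full_tree.score(X_test, y_test)
print("Train/test accuracy gap: {}".format(gap))

# An unrestricted tree keeps splitting until every leaf is pure
print("Depth: {}".format(full_tree.get_depth()))
print("Leaves: {}".format(full_tree.get_n_leaves()))
```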

Read the scikit-learn [documentation](https://scikit-learn.org/stable/modules/tree.html) on decision trees. Which parameters might 
help us pre-prune our decision tree and avoid overfitting?  

What effect does pre-pruning have on predictive accuracy?  

In [277]:
# Instantiate a tree object that includes an argument to pre-prune our model
tree = DecisionTreeClassifier(random_state = 42) #CODE

In [278]:
# Fit the model to the training set
tree.fit(X_train, y_train)

# Evaluate performance
print("Training accuracy: {}".format(tree.score(X_train, y_train)))
print("Testing accuracy: {}".format(tree.score(X_test, y_test)))

Training accuracy: 1.0
Testing accuracy: 0.908771929825


**Note**: your model should be able to achieve a testing accuracy of at least 0.937
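
After attempting the exercise, one way to explore the effect of pre-pruning is to sweep a single parameter and compare training and testing accuracy at each setting. The sketch below uses `max_depth` as an example. The candidate values are arbitrary, other pre-pruning parameters such as `min_samples_leaf` or `max_leaf_nodes` could be swept the same way, and this is not necessarily the intended answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, train_size=0.5, random_state=7)

# Candidate depth limits; None means no limit (the unpruned tree)
candidate_depths = [1, 2, 3, 4, 5, None]

results = {}
for depth in candidate_depths:
    pruned = DecisionTreeClassifier(max_depth=depth, random_state=42)
    pruned.fit(X_train, y_train)
    results[depth] = (pruned.score(X_train, y_train),
                      pruned.score(X_test, y_test))
    print("max_depth={}: train={:.3f}, test={:.3f}".format(
        depth, results[depth][0], results[depth][1]))
```

Shallow trees typically give up some training accuracy; whether testing accuracy improves depends on the setting, which is what the sweep is meant to reveal.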

In [None]:
# BONUS: visualize the final decision tree