# Lecture 5. Decision trees
Mikhail Belyaev and Maxim Panov

## Example: Iris Data Set

#### Task: 
Consider flowers characterized by
* sepal length
* sepal width
* petal length
* petal width

and identify the species based on those measurements alone.

![image](./figures/petal_sepal.jpg)




## Iris setosa
![image](./figures/iris_setosa.jpg)

## Iris virginica
![image](./figures/iris_virginica.jpg)

## Iris versicolor
![image](./figures/iris_versicolor.jpg)

### Decision tree example
![image](./figures/iris_1.png)

### Decision tree example

![image](./figures/iris_2.png)

### How to find informative features?

* We need some notion of information.

### Consider:
* $P$ - total number of positive objects (class 1)
* $N$ - total number of negative objects (class 0)
* $p$ - number of correctly classified objects of class 1
* $n$ - number of incorrectly classified objects of class 0 (classified as class 1)
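
As a minimal sketch of how these four counts arise, the snippet below evaluates one candidate predicate `feature > threshold` on a tiny made-up sample. The function name `split_counts` and the toy arrays are assumptions for illustration only, not part of the lecture material.

```python
import numpy as np

def split_counts(feature, labels, threshold):
    """Return (P, N, p, n) for the predicate `feature > threshold`.

    `labels` holds 0/1 class labels; objects satisfying the predicate
    are treated as "classified as class 1".
    """
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    selected = feature > threshold
    P = int(np.sum(labels == 1))               # all positive objects
    N = int(np.sum(labels == 0))               # all negative objects
    p = int(np.sum(selected & (labels == 1)))  # positives selected by the predicate
    n = int(np.sum(selected & (labels == 0)))  # negatives selected by the predicate
    return P, N, p, n

# Toy data, made up for illustration
petal_length = np.array([1.4, 1.3, 4.7, 4.5, 5.1, 1.5, 6.0, 3.9])
is_class_1 = np.array([0, 0, 1, 1, 1, 0, 1, 0])

P, N, p, n = split_counts(petal_length, is_class_1, threshold=2.5)
print(P, N, p, n)  # prints: 4 4 4 1
```

An informative predicate has $p$ close to $P$ and $n$ close to 0.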

## Different information criteria

![image](./figures/criteria.png)
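
The formulas in the figure are not reproduced here. As an assumption, the sketch below writes two widely used criteria, entropy-based information gain and Gini gain, in terms of $P$, $N$, $p$, $n$. The function names are hypothetical and the criteria may differ from those shown in the figure.

```python
import math

def binary_entropy(q):
    """Entropy (in bits) of a two-class distribution with P(class 1) = q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def gini(q):
    """Gini impurity of a two-class distribution with P(class 1) = q."""
    return 2 * q * (1 - q)

def impurity_gain(P, N, p, n, impurity=binary_entropy):
    """Impurity of the whole sample minus the weighted impurity of the
    two parts: objects selected by the predicate (p + n) and the rest."""
    total = P + N
    inside = p + n
    outside = total - inside
    gain = impurity(P / total)
    if inside:
        gain -= inside / total * impurity(p / inside)
    if outside:
        gain -= outside / total * impurity((P - p) / outside)
    return gain

perfect = impurity_gain(4, 4, 4, 0)   # predicate selects exactly class 1
useless = impurity_gain(4, 4, 2, 2)   # predicate ignores the class
partial = impurity_gain(4, 4, 4, 1)   # one negative object leaks in
print(perfect, useless, round(partial, 3))
print(impurity_gain(4, 4, 4, 0, impurity=gini))
```

Both criteria are largest for a predicate that separates the classes perfectly and zero for one that leaves the class proportions unchanged.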

## Decision tree algorithm
Do iteratively:
1. Find the most informative combination of
  * node of the tree
  * feature
  * split value
2. Split the node if
  * *max_depth* has not been reached
  * the node contains at least *min_samples_split* objects
  * each leaf contains at least *min_samples_leaf* objects after the split
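
The search in step 1, restricted to a single node, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses Gini impurity as the information criterion, tries midpoints between consecutive feature values as split values, and omits recursion and the *max_depth* check. The function names are hypothetical and this is not scikit-learn's implementation.

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def best_split(X, y, min_samples_split=2, min_samples_leaf=1):
    """Return (feature, threshold, gain) of the best split of one node,
    or None if no admissible split reduces the impurity."""
    n_samples, n_features = X.shape
    if n_samples < min_samples_split:
        return None
    parent = gini_impurity(y)
    best, best_gain = None, 0.0
    for feature in range(n_features):
        values = np.unique(X[:, feature])
        # Candidate split values: midpoints between consecutive values
        for threshold in (values[:-1] + values[1:]) / 2:
            left = X[:, feature] <= threshold
            n_left = int(left.sum())
            n_right = n_samples - n_left
            if n_left < min_samples_leaf or n_right < min_samples_leaf:
                continue
            children = (n_left * gini_impurity(y[left])
                        + n_right * gini_impurity(y[~left])) / n_samples
            gain = parent - children
            if gain > best_gain:
                best, best_gain = (feature, float(threshold), gain), gain
    return best

# Toy data, made up for illustration
X_toy = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [4.0, 6.0]])
y_toy = np.array([0, 0, 1, 1])
print(best_split(X_toy, y_toy))
```

A full tree builder would call such a routine on each node in turn and stop when no admissible split remains or the depth limit is hit.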

## Decision tree advantages

* Simple to understand and interpret.
* Able to handle both numerical and categorical data. 
* Possible to validate a model using statistical tests. 
* Robust: performs reasonably well even when its assumptions are somewhat violated by the data.
* Performs well with large datasets.

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
# Default figure size in inches (set via pyplot; the pylab import is not needed)
plt.rcParams['figure.figsize'] = (15, 9)

### Loading & Statistics

In [None]:
iris_df = sns.load_dataset("iris")
iris_df.tail()

### Numpy way

In [None]:
labels = np.unique(iris_df['species'])

for label in labels:
    idx = iris_df['species'] == label

    print(label)
    iris_sub_df = iris_df[idx]
    print(iris_sub_df.describe())
    print()

###  Pandas way

In [None]:
iris_grouped = iris_df.groupby(by='species')
iris_grouped.describe()

### Plots 

In [None]:
# Pairwise scatter plots of all features in iris_df
sns.pairplot(iris_df)

In [None]:
# Same plots, with a different color for each species
sns.pairplot(iris_df, hue='species')

## Sklearn example (pair-wise classification) 

In [None]:
# slightly simplified sklearn example
# see http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3 
plot_colors = "bry"
plot_step = 0.005

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 0.2, X[:, 0].max() + 0.2
    y_min, y_max = X[:, 1].min() - 0.2, X[:, 1].max() + 0.2
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis("tight")

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)[0]
        # cmap has no effect when c is a single color, so it is omitted
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i])

    plt.axis("tight")

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()

## Tree visualization 

In [None]:
# visualisation dependencies:
#  - graphviz (via anaconda)
#  - pydotplus (via pip)
#  - probably, you also have to upgrade pyparsing

In [None]:
!conda install graphviz -y
!pip install pydotplus
!conda update pyparsing -y

In [None]:
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
import pydotplus as pydot
from IPython.display import Image  
from sklearn.tree import export_graphviz

def show_tree(clf):
    dot_data = StringIO()  
    export_graphviz(clf, out_file=dot_data,  
                    feature_names=iris.feature_names,  
                    class_names=iris.target_names,  
                    filled=True, rounded=True,  
                    special_characters=True)  
    graph = pydot.graph_from_dot_data(dot_data.getvalue())  
    return Image(graph.create_png())

In [None]:
clf = DecisionTreeClassifier(max_depth=5)
clf = clf.fit(iris.data, iris.target)
show_tree(clf)

## Check bias-variance tradeoff 

In [None]:
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_informative=5)

In [None]:
acc = [cross_val_score(DecisionTreeClassifier(max_depth=max_depth), X, y, cv=5).mean()
       for max_depth in range(1, 15)]
plt.plot(range(1, len(acc) + 1), acc)
plt.xlabel('max_depth')
plt.ylabel('mean 5-fold CV accuracy')
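
To make the tradeoff more visible, it can help to compare training accuracy with cross-validated accuracy at each depth. The sketch below is one way to do this, assuming a recent scikit-learn where `cross_validate` accepts `return_train_score=True`. The variable names, the smaller sample size, and the fixed `random_state` are choices made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; random_state is fixed only to make the sketch repeatable
X_demo, y_demo = make_classification(n_samples=300, n_informative=5,
                                     random_state=0)

depths = range(1, 11)
train_acc, cv_acc = [], []
for depth in depths:
    scores = cross_validate(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X_demo, y_demo, cv=5, return_train_score=True)
    train_acc.append(scores["train_score"].mean())  # accuracy on training folds
    cv_acc.append(scores["test_score"].mean())      # accuracy on held-out folds

for depth, tr, te in zip(depths, train_acc, cv_acc):
    print(f"max_depth={depth:2d}  train={tr:.3f}  cv={te:.3f}")
```

Training accuracy typically keeps rising with depth, while held-out accuracy tends to level off or drop once the tree starts to overfit.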

# Hands On: Kaggle's 'Forest Cover Type Prediction' competition

Read the data into a pandas DataFrame. The data were downloaded as CSV files from the [Kaggle competition Data page](http://www.kaggle.com/c/forest-cover-type-prediction/data).

In [None]:
train = pd.read_csv('forest_train.csv')

### TODO
 - remove an uninformative feature
 - extract X, y
 - compare DecisionTreeClassifier with KNeighborsClassifier using cross_val_score