# Decision Tree

CART (Classification and Regression Trees)

## What is a Decision Tree?

<img src="img\tree.png" width=50% height=50%>

***
* Supervised Learning

* Works for both classification and regression

* Foundation of Random Forests

* Attractive because of interpretability



***

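A decision tree is built by:

* Recursively splitting the data on the feature and threshold that most reduce a chosen impurity criterion (for example Gini impurity or entropy)
* Stopping when a stopping criterion is met (for example a maximum depth, a minimum number of samples per leaf, or nodes that are already pure)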
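
To make the splitting step concrete, here is a minimal sketch of an impurity-based split search on a single numeric feature. The helper names `gini` and `best_threshold` are hypothetical and written for illustration only; this is not scikit-learn's implementation, which is considerably more optimised.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Return (threshold, weighted impurity) of the best split on one feature.

    Hypothetical helper: tries the midpoints between consecutive distinct
    feature values and keeps the one with the lowest weighted Gini impurity.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    best = (None, np.inf)
    for t in candidates:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

threshold, impurity = best_threshold([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(threshold, impurity)
```

A full tree learner would repeat this search over every feature, split on the best one, and recurse into each child until a stopping criterion applies.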


***

Some **advantages** of decision trees are:
* Simple to understand and to interpret. Trees can be visualised.
* Requires little data preparation. 
* Able to handle both numerical and categorical data.
* Possible to validate a model using statistical tests. 
* Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The **disadvantages** of decision trees include:
* Overfitting. Mechanisms such as pruning (available in scikit-learn 0.22 and later as cost-complexity pruning via `ccp_alpha`), setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.
* Decision trees can be unstable: small variations in the data can produce a very different tree. Mitigation: use decision trees within an ensemble.
* There is no guarantee of finding the globally optimal decision tree, because splits are chosen greedily. Mitigation: train multiple trees in an ensemble learner.
* Decision tree learners create biased trees if some classes dominate. Recommendation: balance the dataset prior to fitting.
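
The overfitting controls listed above can be sketched as follows. The specific values (`max_depth=3`, `min_samples_leaf=5`, `ccp_alpha=0.02`) are arbitrary assumptions chosen for illustration, not recommended settings, and `ccp_alpha` assumes scikit-learn 0.22 or later.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# unconstrained tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=42).fit(X, y)

# pre-pruning: cap the depth and require several samples per leaf
limited = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=42).fit(X, y)

# post-pruning: cost-complexity pruning (illustrative alpha value)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X, y)

for name, model in [("full", full), ("limited", limited), ("pruned", pruned)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```

In practice these values would be tuned, for example with cross-validation, rather than fixed by hand.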


# Classification

## Training a Decision Tree with Scikit-Learn Library

In [None]:
from sklearn import tree

In [None]:
# data section
X = [[0, 0], [1, 2]]
y = [0, 1]

'''

(zero,zero) --> zero
(one,two) --> one

'''

In [None]:
# instantiate the decision tree classifier
clf = tree.DecisionTreeClassifier()

In [None]:
# fitting the model
clf = clf.fit(X, y)

In [None]:
# ask the model to predict the class of a new, unseen sample
clf.predict([[2., 2.]])
# (two,two) --> ?

In [None]:
# (two,two) --> one

In [None]:
# predict the probability of each class for the same sample
clf.predict_proba([[2. , 2.]])

In [None]:
# the model assigns 0% probability to the first class (label 0)
# and 100% probability to the second class (label 1)

In [None]:
clf.predict([[0.4, 1.2]])

In [None]:
clf.predict_proba([[0.4, 1.2]])

In [None]:
clf.predict_proba([[0, 0.2]])

`DecisionTreeClassifier` is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, …, K-1]) classification.
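
As a small multiclass sketch, the toy data below (three made-up points, one per class) shows how the columns of `predict_proba` line up with the fitted `classes_` attribute. The data is an assumption for illustration only.

```python
from sklearn import tree

# three toy samples, one per class (illustrative data only)
X = [[0, 0], [1, 2], [3, 3]]
y = [0, 1, 2]

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# classes_ lists the labels in the order used by predict_proba's columns
print(clf.classes_)

# one row per sample, one column per class; each row sums to 1
proba = clf.predict_proba([[2.0, 2.0]])
print(proba)
```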

## Applying to Iris Dataset

In [None]:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()

In [None]:
# print the first five rows (samples) of the Iris dataset
iris.data[0:5]

In [None]:
# and their feature names
iris.feature_names

In [None]:
# feature selection begins...
# we select the last two features of Iris dataset (petal length and petal width)
X = iris.data[:, 2:]

In [None]:
# and the target which obviously is the type of Iris flower
y = iris.target

In [None]:
y

In [None]:
clf = tree.DecisionTreeClassifier(random_state=42)

In [None]:
clf = clf.fit(X, y)

### Export_graphviz

In [None]:
from sklearn.tree import export_graphviz

In [None]:
export_graphviz(clf,
                out_file="tree.dot",
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True)

Run the following command in your terminal to convert the `.dot` file into a PNG image (this requires Graphviz to be installed):

`$ dot -Tpng tree.dot -o tree.png`

<img src="img\tree.png" width=60% height=60%>
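
If Graphviz is not available, the learned rules can also be inspected as plain text. The sketch below refits the same two-feature Iris model and uses `sklearn.tree.export_text`, assuming scikit-learn 0.21 or later.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target

clf = DecisionTreeClassifier(random_state=42).fit(X, y)

# render the fitted tree as indented if/else style rules
rules = export_text(clf, feature_names=list(iris.feature_names[2:]))
print(rules)
```

Each indented line is a split condition, and lines beginning with `class:` are the leaves.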

## Visualise the Decision Boundary

In [None]:
import numpy as np
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = sns.load_dataset('iris')
df.head()

In [None]:
col = ['petal_length', 'petal_width']
X = df.loc[:, col]

In [None]:
species_to_num = {'setosa': 0,
                  'versicolor': 1,
                  'virginica': 2}
df['tmp'] = df['species'].map(species_to_num)
y = df['tmp']

In [None]:
clf = tree.DecisionTreeClassifier()
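# fit on the raw array so that predicting on the NumPy mesh grid later
# does not trigger a feature-name mismatch warning
clf = clf.fit(X.values, y)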

In [None]:
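# build the mesh over the two features: petal length on the x-axis,
# petal width on the y-axis (not over the target labels)
h = 0.02  # grid step size
x_min, x_max = X['petal_length'].min() - 1, X['petal_length'].max() + 1
y_min, y_max = X['petal_width'].min() - 1, X['petal_width'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))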

In [None]:
z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
fig = plt.figure(figsize=(16,10))
ax = plt.contourf(xx, yy, z, cmap = 'afmhot', alpha=0.3);
plt.scatter(X.values[:, 0], X.values[:, 1], c=y, s=80, 
            alpha=0.9, edgecolors='g');