# Decision Trees

Train, regularize, visualize, and make predictions with Decision Trees on classification and regression tasks.

> Decision Trees require little data preparation: in particular, they don't require feature scaling or centering at all.

Scikit-Learn uses the CART algorithm, which produces only binary trees: every non-leaf node has exactly two children. Other algorithms, such as ID3, can produce Decision Trees with nodes that have more than two children.

> Decision Trees are intuitive, and their decisions are easy to interpret. Such models are often called white box models. In contrast, Random Forests and neural networks are generally considered black box models.

Decision Trees provide simple classification rules that can even be applied manually.

However, they do have a few limitations. First, Decision Trees favor orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. One way to limit this problem is to use PCA, which often results in a better orientation of the training data.
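
As a minimal sketch of that suggestion, a PCA step can be chained in front of the tree with a pipeline. The two-component setting, `max_depth=2`, and the seed are illustrative assumptions rather than tuned choices, and whether the rotation actually helps depends on the dataset.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

iris_bunch = load_iris()
X_petal = iris_bunch.data[:, 2:]  # petal length and width
y_iris = iris_bunch.target

# PCA re-expresses the features along the principal axes of the data,
# so the tree's axis-aligned splits follow those axes instead.
pca_tree = make_pipeline(
    PCA(n_components=2),
    DecisionTreeClassifier(max_depth=2, random_state=42),
)
pca_tree.fit(X_petal, y_iris)
train_accuracy = pca_tree.score(X_petal, y_iris)
print(train_accuracy)
```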

More generally, the main issue with Decision Trees is that they are very sensitive to small variations in the training data. In fact, since the training algorithm used by Scikit-Learn is stochastic (it randomly selects the set of features to evaluate at each node), you may get very different models even on the same training data, unless you set the `random_state` hyperparameter.

Random Forests can limit this instability by averaging predictions over many trees.

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

# Fix the seed so that repeated runs produce the same tree
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

Visualize the trained Decision Tree by using the `export_graphviz()` function to output a graph definition file:

In [2]:
from sklearn.tree import export_graphviz

export_graphviz(tree_clf, 
                out_file="iris_tree.dot",
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True)

Use the `dot` command-line tool from the Graphviz package to convert the .dot file to .png:

In [3]:
! dot -Tpng iris_tree.dot -o iris_tree.png
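
If Graphviz is not available, a plain-text view of the same rules can be printed instead. This is a sketch that assumes a Scikit-Learn version providing `sklearn.tree.export_text` (0.21 or later); it retrains the same depth-2 tree with an arbitrary seed so that the snippet is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_data = load_iris()
petal_features = iris_data.data[:, 2:]  # petal length and width
text_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
text_tree.fit(petal_features, iris_data.target)

# export_text returns the split rules as an indented string, one line per node.
rules = export_text(text_tree, feature_names=list(iris_data.feature_names[2:]))
print(rules)
```

Each indented line is one test on a feature, so the printout can be read top to bottom as a set of rules to apply by hand.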

A node's `gini` attribute measures its impurity:
* if `gini=0`, the node is pure: all the training instances it applies to belong to the same class

To compute the Gini score $G_i$ of the $i^{th}$ node:
$$
G_i = 1 - \sum_{k=1}^n p_{i,k}^2
$$

* $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i^{th}$ node
* $n$ is the number of classes
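
As a worked example of the formula, the helper below computes the Gini score from a node's per-class instance counts. The function name `gini_from_counts` and the counts (0, 49, and 5 instances of three classes) are illustrative assumptions.

```python
import numpy as np

def gini_from_counts(class_counts):
    """Gini impurity G_i = 1 - sum_k p_{i,k}^2 from per-class instance counts."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()  # p_{i,k}: ratio of class k in the node
    return float(1.0 - np.sum(p ** 2))

pure_node = gini_from_counts([50, 0, 0])   # all instances in one class
mixed_node = gini_from_counts([0, 49, 5])  # 1 - (49/54)^2 - (5/54)^2
print(pure_node, round(mixed_node, 3))
```

The pure node scores 0, while the mixed node scores about 0.168.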

# Estimating Class Probabilities

A Decision Tree can estimate the probability that an instance belongs to a particular class $k$.

First it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class $k$ in this node.

The estimated probabilities are identical for every instance that falls into the same leaf node. For example, petals that are 6 cm long and 1.5 cm wide get exactly the same estimates as petals that are 5 cm long and 1.5 cm wide, even though it seems obvious that the former would most likely be an Iris virginica.

In [4]:
tree_clf.predict_proba([[5,1.5]])

tree_clf.predict([[5,1.5]])

array([1])
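
The point about identical estimates inside a leaf can be checked directly. The sketch below retrains the same depth-2 tree (the seed is an assumption) and uses `apply()`, which returns the index of the leaf each instance ends up in, to confirm that two different flowers share a leaf and therefore share class probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_ds = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris_ds.data[:, 2:], iris_ds.target)  # petal length and width

flowers = np.array([[5.0, 1.5], [6.0, 1.5]])  # same width, different length
leaf_ids = clf.apply(flowers)                 # leaf index reached by each flower
probas = clf.predict_proba(flowers)           # one row of class ratios per flower

same_leaf = bool(leaf_ids[0] == leaf_ids[1])
same_probas = bool(np.allclose(probas[0], probas[1]))
print(same_leaf, same_probas)
```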

# CART Training Algorithm
Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train Decision Trees (also called "growing" trees). The algorithm works by first splitting the training set into two subsets using a single feature $k$ and a threshold $t_k$ (e.g., petal length ≤ 2.45 cm). How does it choose $k$ and $t_k$? It searches for the pair $(k, t_k)$ that produces the purest subsets, weighted by their size. The cost function that the algorithm tries to minimize is:

$$
J(k, t_k) = { m_{left} \over m } G_{left} + { m_{right} \over m } G_{right}
$$

where:
* $G_{left/right}$ measures the impurity of the left/right subset,
* $m_{left/right}$ is the number of instances in the left/right subset.
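
The search for $(k, t_k)$ at a single node can be sketched as a brute-force loop. This is not Scikit-Learn's actual implementation: the helper names, the toy data, and the choice of midpoints between consecutive distinct values as candidate thresholds are assumptions made for illustration.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def best_split(X, y):
    """Exhaustively search the (feature k, threshold t_k) pair minimizing J(k, t_k)."""
    m, n_features = X.shape
    best = (None, None, np.inf)
    for k in range(n_features):
        values = np.unique(X[:, k])  # sorted distinct values of feature k
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, k] <= t
            m_left, m_right = left.sum(), (~left).sum()
            cost = (m_left / m) * gini(y[left]) + (m_right / m) * gini(y[~left])
            if cost < best[2]:
                best = (k, float(t), float(cost))
    return best

X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
k_best, t_best, cost_best = best_split(X_toy, y_toy)
print(k_best, t_best, cost_best)
```

On this toy set, splitting feature 0 at 2.5 separates the two classes perfectly, so the cost is 0.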

Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the `max_depth` hyperparameter), or if it cannot find a split that will reduce impurity.

The CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each subsequent level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a solution that's reasonably good but not guaranteed to be optimal.

Finding the optimal tree is known to be an NP-Complete problem: it requires $O(\exp(m))$ time, making the problem intractable even for small training sets.

## Computational Complexity

Making predictions requires traversing the Decision Tree from the root to a leaf. Decision Trees are generally approximately balanced, so traversing the tree requires going through roughly $O(\log(m))$ nodes, where $m$ is the number of training instances. Since each node only requires checking the value of one feature, the overall prediction complexity is $O(\log(m))$, independent of the number of features.
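
To get a feel for the logarithmic bound, here is a small arithmetic sketch. It assumes an idealized, perfectly balanced binary tree with about one leaf per training instance, and the value of `m` is illustrative.

```python
import math

m = 1_000_000  # number of training instances (illustrative)
# A balanced binary tree with about one leaf per instance has depth ~ log2(m).
approx_depth = math.ceil(math.log2(m))
print(approx_depth)
```

Under that assumption, a prediction on a tree trained on a million instances checks only about 20 nodes.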

The training algorithm compares all features (or fewer if `max_features` is set) on all samples at each node, which results in a training complexity of $O(n \times m \log(m))$, where $n$ is here the number of features and $m$ the number of instances. For small training sets (fewer than a few thousand instances), older Scikit-Learn versions could speed up training by presorting the data (`presort=True`), but doing so slows training down considerably for larger training sets. Note that the `presort` parameter was deprecated in Scikit-Learn 0.22 and removed in 0.24.

## Gini Impurity or Entropy?

In Machine Learning, entropy is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class.

$$
H_i = - \sum_{k=1}^n p_{i,k} \log_2 (p_{i,k})
$$

where the sum runs only over the classes with $p_{i,k} \ne 0$ in the $i^{th}$ node.

Gini impurity is slightly faster to compute, so it is a good default. Generally, the two measures lead to similar trees, but when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
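
The entropy formula can be evaluated by hand in the same way as the Gini score. In this sketch the helper name `entropy_from_counts` and the class counts are illustrative assumptions; classes with no instances are dropped, matching the $p_{i,k} \ne 0$ condition.

```python
import numpy as np

def entropy_from_counts(class_counts):
    """Entropy H_i = -sum_k p_{i,k} log2(p_{i,k}), skipping empty classes."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]  # the sum only runs over classes with p_{i,k} != 0
    # -log2(p) is written as log2(1 / p) so that a pure node yields 0.0, not -0.0.
    return float(np.sum(p * np.log2(1.0 / p)))

node_entropy = entropy_from_counts([0, 49, 5])
pure_entropy = entropy_from_counts([50, 0, 0])
print(round(node_entropy, 3), pure_entropy)
```

The mixed node has an entropy of about 0.445, and the pure node has an entropy of 0. In Scikit-Learn, the measure is selected with the `criterion` hyperparameter of `DecisionTreeClassifier` (`"gini"` by default, `"entropy"` as the alternative).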


# Regularization Hyperparameters

Decision Trees make very few assumptions about the training data, as opposed to linear models, which assume, for example, that the data is linear. If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it.

Such a model is called a **nonparametric model**, not because it does not have any parameters, but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data.

In contrast, a **parametric model**, such as a linear model, has a predetermined number of parameters, so its degree of freedom is limited, reducing the risk of overfitting but increasing the risk of underfitting.

To avoid overfitting the training data, you need to restrict the Decision Tree's freedom during training. The regularization hyperparameters depend on the algorithm used, but generally you can at least restrict the maximum depth of the Decision Tree. In Scikit-Learn, this is controlled by the `max_depth` hyperparameter; the default is `None`, which means unlimited. Reducing `max_depth` will regularize the model and thus reduce the risk of overfitting.

The `DecisionTreeClassifier` class has a few other hyperparameters that similarly restrict the shape of the Decision Tree:
* `min_samples_split`: the minimum number of samples a node must have before it can be split
* `min_samples_leaf`: the minimum number of samples a leaf node must have
* `min_weight_fraction_leaf`: the same as `min_samples_leaf`, but expressed as a fraction of the total number of weighted instances
* `max_leaf_nodes`: the maximum number of leaf nodes
* `max_features`: the maximum number of features that are evaluated for splitting at each node

Increasing `min_*` hyperparameters or reducing `max_*` hyperparameters will regularize the model.
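
The effect of one of these hyperparameters can be seen by training an unrestricted tree and a restricted one on the same data. The sketch below assumes a small noisy `make_moons` dataset and `min_samples_leaf=4`; the dataset, the seeds, and the value 4 are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_moons, y_moons = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree = DecisionTreeClassifier(random_state=42)  # no restrictions
reg_tree = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree.fit(X_moons, y_moons)
reg_tree.fit(X_moons, y_moons)

# Leaves are the nodes without children (children_left == -1).
is_leaf = reg_tree.tree_.children_left == -1
smallest_leaf = int(reg_tree.tree_.n_node_samples[is_leaf].min())
deep_train_accuracy = deep_tree.score(X_moons, y_moons)

print("unrestricted:", deep_tree.get_n_leaves(), "leaves,",
      "train accuracy", deep_train_accuracy)
print("min_samples_leaf=4:", reg_tree.get_n_leaves(), "leaves,",
      "smallest leaf holds", smallest_leaf, "samples")
```

The unrestricted tree keeps splitting until every leaf is pure, so it classifies its own training set perfectly, which is a typical sign of overfitting. The restricted tree cannot carve out leaves smaller than four samples.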

> Other algorithms work by first training the Decision Tree without restrictions, then pruning (deleting) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant.
> 
> Standard statistical tests, such as the $\chi^2$ test, are used to estimate the probability that the improvement is purely the result of chance (which is called the null hypothesis). If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by a hyperparameter), then the node is considered unnecessary and its children are deleted. The pruning continues until all unnecessary nodes have been pruned.
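
The following sketch illustrates only the statistical test, not a full pruning procedure. It assumes SciPy is installed; the helper name `split_is_significant` and the class counts of the two children are made up for illustration, and the 5% threshold follows the text.

```python
from scipy.stats import chi2_contingency

def split_is_significant(children_class_counts, threshold=0.05):
    """Return (keep_split, p_value) for a node whose children are all leaves.

    children_class_counts: one row per child, one column per class.
    """
    _, p_value, _, _ = chi2_contingency(children_class_counts)
    # A p-value above the threshold means the purity improvement could
    # plausibly be due to chance, so the node would be pruned.
    return bool(p_value <= threshold), float(p_value)

useful_split = split_is_significant([[49, 5], [1, 45]])    # children differ sharply
useless_split = split_is_significant([[10, 9], [11, 10]])  # children look alike
print(useful_split, useless_split)
```

The first split would be kept because its children have clearly different class distributions. The second would be pruned because its children are nearly indistinguishable.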

# Estimating a Quantitative Variable

Decision Trees are also capable of performing regression tasks.

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.

$$
J(k, t_k) = { m_{left} \over m } MSE_{left} + { m_{right} \over m } MSE_{right}
$$

where:
* $MSE_{node} = { 1 \over m_{node} } \sum_{i \in node} (\hat{y}_{node} - y^{(i)})^2$
* $ \hat{y}_{node} = { 1 \over m_{node} }  \sum_{i \in node} y^{(i)}$
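
As a worked example, the cost of one candidate split can be computed by hand, taking $MSE_{node}$ as the mean squared deviation of a node's targets from the node's mean prediction $\hat{y}_{node}$. The target values and the helper name `node_mse` are illustrative assumptions.

```python
import numpy as np

def node_mse(targets):
    """Mean squared deviation of a node's targets from the node's mean prediction."""
    targets = np.asarray(targets, dtype=float)
    y_hat = targets.mean()  # the node's prediction: the average target
    return float(np.mean((y_hat - targets) ** 2))

y_left = [1.0, 2.0, 3.0]  # targets falling on the left of the threshold
y_right = [10.0, 12.0]    # targets falling on the right
m_total = len(y_left) + len(y_right)

split_cost = ((len(y_left) / m_total) * node_mse(y_left)
              + (len(y_right) / m_total) * node_mse(y_right))
print(split_cost)
```

Here the left node has an MSE of 2/3 and the right node an MSE of 1, so the weighted cost is 3/5 × 2/3 + 2/5 × 1 = 0.8.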

Just as for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks.

In [None]:
from sklearn.tree import DecisionTreeRegressor

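# X and y are still the iris petal features and class indices from above, so this
# only demonstrates the API; a real regression task would use a continuous target.
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)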
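
For a target that is actually continuous, here is a sketch on synthetic data. The noisy quadratic dataset, the seeds, and the variable names are assumptions made for illustration. It checks two properties described above: a depth-2 tree can output at most four distinct values, and each leaf predicts the mean target of the training instances it contains.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_quad = rng.random((200, 1))  # one feature in [0, 1)
y_quad = 4 * (X_quad[:, 0] - 0.5) ** 2 + rng.normal(0, 0.1, 200)

quad_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
quad_reg.fit(X_quad, y_quad)

predictions = quad_reg.predict(X_quad)
n_distinct = len(np.unique(predictions))  # at most 2**2 = 4 leaves

# Each leaf predicts the mean target of the training instances it contains.
reg_leaf_ids = quad_reg.apply(X_quad)
first_leaf = reg_leaf_ids[0]
leaf_mean = y_quad[reg_leaf_ids == first_leaf].mean()
print(n_distinct, predictions[0], leaf_mean)
```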