# Decision Trees

* Decision Trees can perform classification, regression, and even multioutput tasks.

* They are fundamental components of Random Forests.

## Training and Visualizing a Decision Tree



In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [6]:
import numpy as np

from sklearn.tree import export_graphviz

export_graphviz(tree_clf, out_file="iris_tree.dot",
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True)
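
The `.dot` file still has to be rendered with the Graphviz command-line tool (for example `dot -Tpng iris_tree.dot -o iris_tree.png`, assuming Graphviz is installed). As a quick alternative, the sketch below prints the same tree as plain text with `export_text`, which is available in recent Scikit-Learn versions. The fixed `random_state` is an assumption added here only to make the output reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

# random_state is fixed only so that the printed tree is reproducible
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Plain-text rendering of the fitted tree (no Graphviz needed)
rules = export_text(tree_clf, feature_names=list(iris.feature_names[2:]))
print(rules)
```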


## Making Predictions

* One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don't require feature scaling or centering at all.

* A node's value attribute tells us how many training instances of each class this node applies to.

* A node's gini attribute measures its impurity: a node is "pure" (gini=0) if all training instances it applies to belong to the same class.

* Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children.

* Decision Trees provide nice and simple classification rules that can even be applied manually if need be.
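
To make the gini attribute concrete, here is a minimal sketch of the usual Gini impurity formula, G = 1 - sum_k p_k^2, where p_k is the ratio of class k instances in the node. The helper name `gini` and the class counts are illustrative assumptions. The counts `[0, 49, 5]` are chosen to resemble a depth-2 leaf of the iris tree.

```python
def gini(class_counts):
    """Gini impurity G = 1 - sum_k p_k^2 for a node with the given class counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

# A pure node: every instance belongs to the same class
print(gini([50, 0, 0]))

# An impure node (illustrative counts): mostly class 1, a few of class 2
print(round(gini([0, 49, 5]), 3))
```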

## Estimating Class Probabilities

* A Decision Tree can also estimate the probability that an instance belongs to a particular class k.


In [7]:
tree_clf.predict_proba([[5, 1.5]])

array([[ 0.        ,  0.90740741,  0.09259259]])

In [8]:
tree_clf.predict([[5, 1.5]])

array([1])
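
The probabilities above are simply the ratio of training instances of each class in the leaf that the instance falls into, and `predict` returns the class with the highest ratio. The sketch below reproduces that arithmetic by hand. The leaf counts `[0, 49, 5]` are an assumption, picked because they are consistent with the printed probabilities.

```python
import numpy as np

# Assumed leaf class counts, chosen to match the probabilities printed above
leaf_counts = np.array([0, 49, 5])

proba = leaf_counts / leaf_counts.sum()   # per-class ratio in the leaf
predicted_class = int(np.argmax(proba))   # class with the highest ratio

print(proba)
print(predicted_class)
```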

## The CART Training Algorithm

* CART: Classification And Regression Tree

* The CART algorithm is a greedy algorithm. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal solution.

* Finding the optimal tree is known to be an NP-Complete problem.
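
The greedy step can be sketched as follows: at each node, try every feature k and threshold t, and keep the pair that minimizes the weighted impurity of the two children, (m_left / m) * G_left + (m_right / m) * G_right. This is a simplified illustration, not Scikit-Learn's actual implementation. The names `gini` and `best_split` and the toy data are assumptions.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature index, threshold, cost) minimizing the weighted child impurity."""
    m, n = X.shape
    best = (None, None, np.inf)
    for k in range(n):
        values = np.unique(X[:, k])
        # Candidate thresholds: midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2:
            left = X[:, k] <= t
            cost = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / m
            if cost < best[2]:
                best = (k, t, cost)
    return best

# Toy data: feature 0 separates the classes perfectly, feature 1 does not
X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))
```

A full tree builder would apply `best_split` recursively to each child until a stopping condition such as max_depth is reached. Because each split is chosen locally, the resulting tree is not guaranteed to be the best possible one.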

## Computational Complexity

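* The prediction complexity is just O(log_2(m)), where m is the number of training instances: a prediction only traverses the tree from the root to a leaf, and an approximately balanced tree has about log_2(m) levels. It is independent of the number of features.

* The training complexity is O(n × m log(m)), where n is the number of features, because the algorithm compares all features on all samples at each node.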
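
As a rough check of the prediction cost, the sketch below counts how many nodes each prediction visits in an unconstrained tree and prints log_2(m) next to it for comparison. The dataset choice and the fixed `random_state` are assumptions made for illustration. Real trees are only approximately balanced, so the depth need not equal log_2(m).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target

tree_clf = DecisionTreeClassifier(random_state=42).fit(X, y)  # unconstrained tree

# decision_path returns a sparse indicator matrix:
# row i marks the nodes visited when predicting sample i
path_lengths = np.asarray(tree_clf.decision_path(X).sum(axis=1)).ravel()

print("training instances m:", len(X))
print("log2(m):", np.log2(len(X)))
print("tree depth:", tree_clf.get_depth())
print("max nodes visited per prediction:", path_lengths.max())
```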

## Gini Impurity or Entropy?

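* By default, the Gini impurity measure is used, but the entropy impurity measure can be selected instead by setting the criterion hyperparameter to "entropy".

* In information theory, entropy is zero when all messages are identical.

* In Machine Learning, a set's entropy is zero when it contains instances of only one class.

* Most of the time, choosing between Gini impurity and entropy does not make a big difference. Gini impurity is slightly faster to compute.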
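
For reference, a node's entropy is usually computed as H = -sum_k p_k log_2(p_k) over the classes with p_k > 0. The sketch below is a minimal illustration. The helper name `entropy` and the class counts are assumptions, with `[0, 49, 5]` chosen to resemble a depth-2 leaf of the iris tree.

```python
import math

def entropy(class_counts):
    """Shannon entropy H = sum_k p_k * log2(1 / p_k), skipping empty classes."""
    total = sum(class_counts)
    return sum((c / total) * math.log2(total / c) for c in class_counts if c > 0)

# Only one class present: entropy is zero
print(entropy([50, 0, 0]))

# Illustrative impure node
print(round(entropy([0, 49, 5]), 3))
```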

## Regularization Hyperparameters

* Decision Trees make very few assumptions about the training data. 

* If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it.

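* Such a model is often called a nonparametric model, not because it has no parameters but because the number of parameters is not determined prior to training.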

* To avoid overfitting the training data, we need to restrict the Decision Tree's freedom during training.

**Regularization hyperparameters: max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_leaf_nodes, max_features**

**Increasing min hyperparameters or reducing max hyperparameters will regularize the model.**
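
The effect of regularization can be seen by training the same classifier with and without a restriction. The sketch below is an illustration under assumed settings (the moons dataset, `noise=0.25`, `min_samples_leaf=4`, fixed seeds), none of which come from the notes above. It prints the number of leaves plus training and test accuracy for each tree so the two can be compared.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset and settings, chosen only for illustration
X, y = make_moons(n_samples=200, noise=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_train, y_train)

for name, model in [("unrestricted", unrestricted), ("min_samples_leaf=4", regularized)]:
    print(name,
          "| leaves:", model.get_n_leaves(),
          "| train accuracy:", model.score(X_train, y_train),
          "| test accuracy:", model.score(X_test, y_test))
```

The unrestricted tree keeps splitting until every leaf is pure, so it fits the training set perfectly. Whether the regularized tree generalizes better on a given dataset is something the printed test scores let you check.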


## Regression


In [9]:
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
y = y + np.random.randn(m, 1) / 10

In [10]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

* The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.
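
One consequence of minimizing the MSE is that each leaf predicts the average target value of the training instances that fall into it. The sketch below checks this on the same kind of noisy quadratic data. The target is flattened with `ravel()` and `random_state` is fixed, both of which are choices made here for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = (4 * (X - 0.5) ** 2 + np.random.randn(m, 1) / 10).ravel()

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

leaf_ids = tree_reg.apply(X)        # index of the leaf each instance falls into
predictions = tree_reg.predict(X)

# Columns: leaf id, number of instances, mean target in the leaf, predicted value
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    print(leaf, in_leaf.sum(), y[in_leaf].mean(), predictions[in_leaf][0])
```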

## Instability

**Limitations of Decision Trees:**

1. Decision Trees love orthogonal decision boundaries, which makes them sensitive to training set rotation. One way to limit this problem is to use PCA.

2. They are very sensitive to small variations in the training data.

Random Forests can limit this instability by averaging predictions over many trees.
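
The sensitivity to rotation can be illustrated with a small experiment. The sketch below uses assumed synthetic data and fixed seeds. It trains one tree on points separated by a vertical line and another on the same points rotated by 45 degrees, then compares how many leaves each tree needs.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)
X = rng.rand(100, 2) - 0.5            # points in a square centred on the origin
y = (X[:, 0] > 0).astype(int)         # classes separated by a vertical line

angle = np.pi / 4                     # rotate the same points by 45 degrees
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X_rotated = X @ rotation

tree_plain = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_rotated = DecisionTreeClassifier(random_state=42).fit(X_rotated, y)

print("leaves (axis-aligned boundary):", tree_plain.get_n_leaves())
print("leaves (rotated boundary):", tree_rotated.get_n_leaves())
```

The axis-aligned problem is solved with a single split, while the rotated one forces the tree to approximate a diagonal boundary with a staircase of splits.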
