# Decision Trees
See also Random Forest.

Used for classification or regression.  
Used on categorical or numerical data.  
Trees can be binary or not.  

Tree building is inductive reasoning. We infer general rules from specific data.  
TDIDT stands for Top-Down Induction of Decision Trees.
This is a general name for all DT algorithms.  

A DT is a predictive model.  
Prediction with a tree is deductive reasoning.  
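
As a minimal sketch of prediction as deduction, assume a tree stored as nested dicts. The feature names, thresholds, and labels below are invented for illustration and do not come from any real dataset.

```python
# Hypothetical hand-written tree; features and thresholds are illustrative only.
tree = {
    "feature": "humidity",
    "threshold": 70,
    "left": {"label": "play"},            # humidity <= 70
    "right": {                            # humidity > 70
        "feature": "wind",
        "threshold": 20,
        "left": {"label": "play"},        # wind <= 20
        "right": {"label": "stay home"},  # wind > 20
    },
}

def predict(node, example):
    """Walk from the root to a leaf, applying one rule per internal node."""
    while "label" not in node:
        go_left = example[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"humidity": 85, "wind": 30}))  # stay home
```

Each internal node tests a single feature against a threshold, which is why the resulting decision boundary is made of axis-parallel segments.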

Practical DT algorithms are greedy.
It is computationally intractable to examine all possible trees.
Instead, we iteratively add the next best rule.

A DT chooses a rectilinear decision boundary, because each split tests one feature at a time.  
DT is susceptible to feature interaction effects, especially with many irrelevant features.  
A DT may be unusable if unseen data presents missing values.  

DT is related to rules-based learners, 
but those don't necessarily apply rules in any order. 

## Hunt's algorithm
Recursively split the remaining data based on the most discriminative feature.
This is the splitting criterion used by all the other algorithms listed here,
including Quinlan's.

Hunt's algorithm is optimal 
if all feature combinations are present in the training data.
But this is unlikely.

## Quinlan's algorithms
Quinlan invented a series of algorithms, each an improvement on the previous.

#### ID3
ID3 = Iterative Dichotomiser 3.
At each node, review the so-far unused features.
Choose the feature with max information gain, i.e. min weighted entropy after the split.
Split on that feature, using one or more rules to create two or more child nodes.
Stop when out of examples, out of features, or the node is pure.
All the examples in a pure node have the same class.

This is a greedy, non-optimal, recursive algorithm.
It tends to overfit. 
It can be improved by backtracking, at the cost of a longer run.
It is better suited for categorical features & classification than continuous regression.
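
The ID3 steps above can be sketched as follows. This is a simplified illustration for categorical features only, not Quinlan's original code; the function names and the toy weather data are assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Parent entropy minus the weighted entropy of the children."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

def id3(rows, labels, features):
    """Return a nested dict tree; leaves are class labels."""
    if len(set(labels)) == 1:      # node is pure
        return labels[0]
    if not features:               # out of features: predict the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    children = {}
    # Only observed values get a child, so no child is ever empty here.
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = id3([r for r, _ in subset], [l for _, l in subset], remaining)
    return {best: children}

# Toy data, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data the root splits on `outlook`, the `sunny` branch is already pure, and the `rainy` branch needs one more split on `windy`. The sketch never revisits an earlier choice, which is the greedy behavior described above.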

#### FOIL
First Order Inductive Learner algorithm. 
This is a hill climbing algorithm.
It adds one rule at a time.
Uses separate-and-conquer, as opposed to divide-and-conquer.

#### C4.5
C4.5 was published in 1993 and became one of the most widely used machine learning tools.
Weka's J48 is an implementation of it.

Improvements relative to ID3:
* C4.5 is better at selecting thresholds on continuous features.
* C4.5 ignores missing data during entropy calculation.
* In C4.5, the features can be weighted.
* C4.5 prunes the final tree to remove unhelpful branches.
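
A rough sketch of threshold selection on one continuous feature follows. It scores candidate thresholds with plain information gain at midpoints between sorted values; the real C4.5 differs in details (for example, it uses gain ratio), so treat this as an illustration of the idea only. The function names and the temperature data are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    parent = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal values
        threshold = (lo + hi) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent - after
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Toy data, invented for illustration.
temps = [64, 65, 68, 69, 70, 71, 72, 75]
play = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
threshold, gain = best_threshold(temps, play)
print(threshold, gain)
```

Here the classes separate cleanly between 69 and 70, so the chosen threshold is 69.5 and the gain equals the full parent entropy of 1 bit.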

#### See5
See5, also known as C5.0, is commercial software.
Improvements relative to C4.5:
* C5.0 requires less CPU and less RAM.
* C5.0 makes smaller trees.
* C5.0 uses boosting (training sequentially on the residual cases).
* C5.0 uses feature selection (winnowing).
* C5.0 can weight the different types of classification error.

## Breiman's algorithms
#### CART
CART = Classification And Regression Tree.
CART was the acronym used by Breiman, but now it is a generic term.

#### Boosted trees
Boosting trains multiple trees (often CART trees) in series for incremental refinement.
This approach was used by AdaBoost.
Implemented by [XGBoost](https://xgboost.readthedocs.io/en/stable/tutorials/model.html).
Implemented by sklearn as GradientBoostingClassifier.

It is computationally preferable to train many small trees in series rather than one big tree. 
Each tree is trained to classify what the previous tree misclassified.
We say tree_2 classifies the residuals of tree_1.

Unclear to me: how boosted trees are used for prediction,
when you don't know whether the first tree's prediction was wrong.
Documentation says we combine the scores generated by the various models,
possibly using a weighted average, and possibly using trained weights.
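
One way to see how prediction can work is a toy gradient-boosting-style regressor built from one-split trees (stumps). This is a sketch under assumptions, not how XGBoost or sklearn are implemented: the helper names, the learning rate, and the data are all invented. The point is that at prediction time every tree is evaluated and the scaled outputs are simply added, so there is no need to know whether an earlier tree was wrong. AdaBoost, by contrast, combines its trees with a weighted vote.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Each stump is trained on the residuals left by all previous stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def predict(model, x):
    """Evaluate every stump and add the scaled outputs to the base value."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump(x) for stump in stumps)

# Toy data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = fit_boosted(xs, ys)
print(predict(model, 2), predict(model, 5))
```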

#### Random Forests
RF reduces variance and reduces overfitting.
RF trains and predicts with multiple trees in parallel.
Each tree is built on a random sample of the instances,
on a random sample of the features, or both.
Thus, each tree is imperfect, and the collection avoids overfitting.

RF uses bagging = bootstrap aggregation.
Each tree is built by sampling WITH replacement, called bootstrap.
This makes the trees less correlated with each other, though not fully independent.
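
The sampling and voting steps can be sketched with the standard library alone. The helper names, feature names, and sizes below are assumptions for illustration; real libraries handle these steps internally.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices WITH replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def random_feature_subset(features, k, rng):
    """Pick k distinct features for one tree (sampling WITHOUT replacement)."""
    return rng.sample(features, k)

def majority_vote(predictions):
    """Aggregate the per-tree class predictions into one answer."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["age", "income", "height", "weight"]  # hypothetical feature names
for tree_id in range(3):
    in_bag, out_of_bag = bootstrap_sample(10, rng)
    chosen = random_feature_subset(features, 2, rng)
    print(tree_id, in_bag, out_of_bag, chosen)

print(majority_vote(["spam", "ham", "spam"]))  # spam
```

Because sampling is with replacement, some rows appear more than once in a tree's sample and others are left out entirely; the left-out rows are commonly called out-of-bag.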

## Impurity metrics
Each algorithm decides whether to split a node based on its impurity.  
There are several impurity metrics.  
Let prob(x) = proportion of examples in this node belonging to class x.  
Sum over all classes.  
After a split into two or more child nodes, the impurity of the split is the sum of the impurities of the children, each weighted by its portion of the parent's examples.

#### Entropy
Entropy = sum [ -1 * prob(x) * lg (prob(x) ) ] , max = 1.0 for binary, min = 0.0
#### Gini
Gini = 1 - sum [ square (prob(x)) ] , max = 0.5 for binary, min=0.0
#### Classification error
Simple concept: a leaf "predicts" the majority class, so the rest is error.
There are many ways to say this:
* Classification error of node = 1 - (portion in largest class)  
* Classification error of node = (portion outside largest class), max = 0.5 for binary  
* Classification error of node = (total predicted wrong) / (training instances)

#### Information gain
You can measure the information gain of each node in a tree.  
Information gain = reduction of uncertainty = Entropy before split - Entropy after split.
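
As a worked example with made-up counts: a parent node holds 5 positives and 5 negatives, and a split sends (4 positive, 1 negative) to one child and (1 positive, 4 negative) to the other. Taking "entropy after split" to mean the child entropies weighted by each child's share of the examples:

$$H_{\text{parent}} = -\left(0.5 \lg 0.5 + 0.5 \lg 0.5\right) = 1.0$$

$$H_{\text{child}} = -\left(0.8 \lg 0.8 + 0.2 \lg 0.2\right) \approx 0.722$$

$$H_{\text{after}} = \tfrac{5}{10}(0.722) + \tfrac{5}{10}(0.722) = 0.722$$

$$\text{Gain} = H_{\text{parent}} - H_{\text{after}} \approx 1.0 - 0.722 = 0.278$$

Both children have the same entropy here because their class proportions mirror each other.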

#### FOIL information gain
FOIL adds one rule at a time until the information gain is too small.  
Say rule 0 finds $p_0$ positive and $n_0$ negative samples and $t_0=p_0+n_0$.  
Say rule 1 finds $p_1$ positive and $n_1$ negative samples and $t_1=p_1+n_1$.  
Determine whether the extra rule is worth it by:     
Gain = $p_1 \left[ \lg\left(\frac{p_1}{t_1}\right) - \lg\left(\frac{p_0}{t_0}\right) \right]$
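
The formula above translates directly to code. This sketch assumes rule 0 covers at least one positive sample ($p_0 > 0$); the zero-gain convention for $p_1 = 0$ and the example counts are assumptions added for illustration.

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL gain for extending rule 0 into rule 1, per the formula above."""
    if p1 == 0:
        return 0.0  # assumption: a rule covering no positives gains nothing
    t0, t1 = p0 + n0, p1 + n1
    return p1 * (log2(p1 / t1) - log2(p0 / t0))

# Rule 0 covers 10 positives and 10 negatives; rule 1 narrows that to 8 and 2.
print(foil_gain(p0=10, n0=10, p1=8, n1=2))
```

The gain is positive when the extra rule raises the proportion of positives, and it scales with how many positives the new rule still covers.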

#### Collective impurity
The collective impurity of a node is the weighted sum of its children's impurities,
where each child's impurity is weighted by the portion of training instances it covers.  

```python
# This node contains 5 of class1, 1 of class0.
# This node will predict the majority class, class1.
from math import log2

class0 = 1
class1 = 5
total = class0 + class1
prob0 = class0 / total
prob1 = class1 / total
entropy = -(prob0 * log2(prob0) + prob1 * log2(prob1))
gini = 1 - (prob0**2 + prob1**2)
error = 1 - max(prob0, prob1)
print('entropy:', entropy, '\ngini:', gini, '\nerror:', error)
```

```
entropy: 0.6500224216483541 
gini: 0.2777777777777777 
error: 0.16666666666666663
```


## Stopping
Decision trees tend to overfit.  
Use a stopping criterion to stop the splitting early.  
Or prune the tree after the fact.  

Stopping options.  
Instead of splitting till nodes are pure, split till impurity is below threshold.  
Or, stop when splitting doesn't reduce impurity by a threshold amount.  
Or, stop when no remaining feature introduces a statistically significant (t-test or chi-square) change in the class distribution.    

Pruning example.   
A node has a 20:10 mix of 2 classes, error = 10/30.  
The node has 4 leaf nodes with total error = 9/30.  
The node was split because the split reduced error.  
Pessimistic error = add a penalty of e.g. 0.5 per leaf node.   
Then parent error = (10 + 0.5)/30 = 10.5/30 but leaf error = (9 + 4 * 0.5)/30 = 11/30, and this split should be pruned.  
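
The arithmetic of the pruning example can be checked with a few lines. The function name and signature are hypothetical; the 0.5 penalty is the example value used above, not a fixed constant of any algorithm.

```python
def pessimistic_error(errors, n_leaves, n_instances, penalty=0.5):
    """Training errors plus a fixed penalty per leaf, as a fraction of instances."""
    return (errors + penalty * n_leaves) / n_instances

# Unsplit parent counts as one leaf; the split version has four leaves.
parent = pessimistic_error(errors=10, n_leaves=1, n_instances=30)
subtree = pessimistic_error(errors=9, n_leaves=4, n_instances=30)
decision = "prune" if parent <= subtree else "keep"
print(parent, subtree, decision)
```

With the penalty, the unsplit parent scores 10.5/30 against 11/30 for the four leaves, so the decision is to prune.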

