# Decision Trees

## Impurity metrics
Let prob(x) = proportion of examples in this node belonging to class x.
In the formulas below, sum (or max) runs over all classes x.

Entropy = sum [ -1 * prob(x) * lg (prob(x) ) ] , where lg is log base 2, max = 1.0 for binary

Gini = 1 - sum [ square (prob(x)) ] , max = 0.5 for binary

Classification error = 1 - max [ prob(x) ], max = 0.5 for binary

Information gain = reduction of uncertainty = Entropy before split - weighted Entropy after split.

For any split (two or more child nodes), impurity after the split is the sum of the impurities of the children, each weighted by its portion of the parent's examples.
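
The three metrics and the weighted-children rule can be checked with a few lines of Python. This is a minimal sketch; the function names and the toy cancer/normal labels are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def class_probs(labels):
    """prob(x): proportion of examples in the node belonging to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))

def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels))

def classification_error(labels):
    return 1.0 - max(class_probs(labels))

def weighted_impurity(children, impurity=entropy):
    """Impurity after a split: child impurities weighted by child size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * impurity(child) for child in children)

def information_gain(parent, children):
    return entropy(parent) - weighted_impurity(children, entropy)

# Toy node: 5 cancer + 5 normal, split into two children of 5 examples each.
parent = ["cancer"] * 5 + ["normal"] * 5
left = ["normal"] * 4 + ["cancer"] * 1
right = ["cancer"] * 4 + ["normal"] * 1

print(entropy(parent), gini(parent), classification_error(parent))  # 1.0 0.5 0.5
print(information_gain(parent, [left, right]))  # about 0.278
```

A 50/50 binary node hits the maximum of all three metrics, which is why it makes a convenient sanity check.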

## Hunt's algorithm
This is the greedy, top-down recursive partitioning scheme underlying all the other algorithms listed here.
It is greedy, so not guaranteed optimal, and it assumes every combination of feature values is present in the training data (unlikely).

## Quinlan: ID3, C4.5, See5
### ID3
Iterative Dichotomiser 3.
At each node, review the so-far unused features.
Choose the feature with max information gain, i.e. min weighted entropy after the split.
Split on it, creating one child node per value of that feature.
Stop when out of examples, out of features, or node is pure (leaf). 

Greedy, non-optimal, recursive.
Tends to overfit. 
Can be improved (with longer run) by backtracking.
Better suited for categorical features & classification than continuous regression.
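
The ID3 loop can be sketched roughly as below, assuming categorical features stored as dicts and a nested-dict tree. The names `id3` and `info_gain` and the toy data are hypothetical; this is not Quinlan's code.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    # Group labels by the value this feature takes, then weight child entropies.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

def id3(rows, labels, features):
    # Leaf: node is pure, or no unused features remain (use the majority class).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))  # greedy choice
    remaining = [f for f in features if f != best]
    tree = {"feature": best, "children": {}}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree["children"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return tree

rows = [
    {"sex": "f", "smoker": "yes"},
    {"sex": "m", "smoker": "yes"},
    {"sex": "f", "smoker": "no"},
    {"sex": "m", "smoker": "no"},
]
labels = ["cancer", "cancer", "normal", "normal"]
tree = id3(rows, labels, ["sex", "smoker"])
print(tree)
```

Because the sketch only branches on values seen at the node, the "out of examples" stop never triggers here; a fuller version would handle unseen values at prediction time.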

### C4.5
Published by Quinlan in 1993; became one of the most widely used machine learning tools of the 1990s.
Reimplemented in Java as Weka J48.

Improvements relative to ID3:
Better at selecting thresholds on continuous features.
Ignores missing data during entropy calculation.
Features can be weighted.
Prune the final tree to remove unhelpful branches.
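
As an illustration of the missing-data item in the list above, here is a simplified sketch. It computes information gain only on rows where the feature is known, then scales by the fraction of known rows. That scaling follows the usual textbook description of C4.5 and is an assumption here; the function name and the use of `None` as the missing marker are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ignoring_missing(values, labels):
    """Information gain of a categorical feature, skipping rows where it is None."""
    known = [(v, lab) for v, lab in zip(values, labels) if v is not None]
    if not known:
        return 0.0
    known_labels = [lab for _, lab in known]
    groups = {}
    for v, lab in known:
        groups.setdefault(v, []).append(lab)
    after = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    # Scale by the fraction of rows whose value is known (assumed C4.5-style).
    return len(known) / len(values) * (entropy(known_labels) - after)

smoker = ["yes", "yes", "no", "no", None]
labels = ["cancer", "cancer", "normal", "normal", "cancer"]
print(gain_ignoring_missing(smoker, labels))
```

On this toy data the feature separates the four known rows perfectly, and the one missing row reduces the gain to four fifths of what it would otherwise be.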

Improved again in See5 (commercial):
Requires less CPU and less RAM. 
Makes smaller trees.
Uses boosting.
Uses feature selection (winnowing).
Supports different costs for different types of misclassification.

## Breiman: CART, RF
CART stands for Classification And Regression Trees; the name is also used generically for trees of this kind.

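Boosted Trees: incremental boosting as in AdaBoost.
Build tree_2 with extra weight on the cases misclassified by tree_1 (gradient boosting instead fits tree_2 to the residuals of tree_1).
At prediction time there is no before or after: every tree scores the example, and the outputs are combined by a weighted vote or sum.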
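
One way to picture the boosting step is a single round of discrete AdaBoost with labels in {-1, +1}. This is a minimal sketch: the function names are illustrative, and the tree predictions are hypothetical values, not the output of a trained tree.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights) for the current tree."""
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)    # this tree's say in the final vote
    # Misclassified examples gain weight, correctly classified ones lose weight.
    new = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(new)
    return alpha, [w / total for w in new]

def ensemble_predict(alphas, predictions):
    """Weighted vote for one example; every tree votes, in no particular order."""
    score = sum(a * p for a, p in zip(alphas, predictions))
    return 1 if score >= 0 else -1

y_true = [1, 1, 1, -1, -1]
tree_1 = [1, 1, -1, -1, -1]   # hypothetical predictions; the third example is wrong
weights = [0.2] * 5
alpha_1, weights = adaboost_round(weights, y_true, tree_1)
print(alpha_1, weights)
```

After the update, the single misclassified example carries half of the total weight, so tree_2 is pushed to get it right. Both trees are kept, and `ensemble_predict` combines their votes using the alphas.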

RF is one type of bagging (bootstrap aggregation),
i.e. each tree is built from a sample of the training data drawn with replacement.

# Random Forest

The random forest algorithm can be used for classification or regression.
RF handles numeric & categorical features,
and numeric features at different scales.
Robustness to missing values depends on the implementation.
A trained RF is partly explainable;
it can rank the features by their importance to the decision making process.

A decision tree, DT, is an ok classifier. 
Here is how to build a DT. 
Always operate on the most impure node, e.g.
one that still receives 50% cancer & 50% normal instances.
For that node, select a feature & threshold that is optimal for splitting. 
For example, age & 50 would be a good choice 
if age>50 would put most of the cancer cases on the right, 
and age<=50 would put most of the normal cases on the left. 
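
The age example can be sketched as a threshold search on one numeric feature: try the midpoint between each pair of neighboring values and keep the one with the highest information gain. The name `best_threshold` and the toy ages are assumptions for illustration; real implementations differ in detail.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information_gain) for the best split value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        threshold = (lo + hi) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - after
        if gain > best[1]:
            best = (threshold, gain)
    return best

ages = [35, 42, 48, 52, 61, 70]
labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]
print(best_threshold(ages, labels))  # (50.0, 1.0)
```

To choose among several features, run the same search on each one and keep the feature and threshold with the largest gain.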

But a single DT tends to overfit.
That means high variance and poor generalization.

One solution is random forest,
which is an ensemble of many trees.
RF is a strong classifier built from weak trees.
Each tree incorporates randomness so each is slightly different (and wrong).
The ensemble layer aggregates the tree decisions.
The ensemble layer can ask each tree for its best guess (called winner-take-all, or hard voting),
then apply majority rule; this is what the original RF paper suggested.
Or it can ask each tree to assign a probability to each class, and average those (soft voting);
this is what the sklearn RandomForestClassifier does.
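
The two aggregation rules can give different answers. Here is a minimal sketch, assuming each tree reports a dict of class probabilities; the function names and the numbers are made up for illustration.

```python
from collections import Counter

def hard_vote(tree_probs):
    """Each tree casts one vote for its most probable class; majority wins."""
    votes = [max(p, key=p.get) for p in tree_probs]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(tree_probs):
    """Average the per-class probabilities across trees; highest mean wins."""
    classes = tree_probs[0].keys()
    mean = {c: sum(p[c] for p in tree_probs) / len(tree_probs) for c in classes}
    return max(mean, key=mean.get)

tree_probs = [
    {"cancer": 0.90, "normal": 0.10},   # one confident tree
    {"cancer": 0.40, "normal": 0.60},   # two lukewarm trees
    {"cancer": 0.45, "normal": 0.55},
]
print(hard_vote(tree_probs))  # normal
print(soft_vote(tree_probs))  # cancer
```

Averaging lets one confident tree outweigh two lukewarm ones, whereas majority rule counts every tree equally.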

There are several ways to make random trees.
You can build each tree from a random subset of the training data.
Or you can build each tree from a random subset of the features.
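Or you can do both. This is what the sklearn RandomForestClassifier does: a bootstrap sample of instances per tree, plus a fresh random subset of features at each split (max_features).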
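
The "do both" option can be sketched as drawing row indices with replacement and feature indices without replacement for each tree. The helper name and the square-root subset size are assumptions. Many implementations re-draw the feature subset at every split rather than once per tree.

```python
import math
import random

def draw_tree_inputs(n_rows, n_features, rng):
    """Pick the rows (with replacement) and features (without) for one tree."""
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]   # bootstrap sample
    k = max(1, round(math.sqrt(n_features)))                # assumed subset size
    features = sorted(rng.sample(range(n_features), k))
    return rows, features

rng = random.Random(42)  # fixed seed so the draws are repeatable
for tree_id in range(3):
    rows, features = draw_tree_inputs(n_rows=10, n_features=9, rng=rng)
    print(tree_id, rows, features)
```

Each tree sees some rows more than once and misses others entirely, which is what makes the trees differ from one another.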

When using a random subset of instances (called the bag), 
you can estimate the tree's accuracy on the remaining instances 
(called out-of-bag, or OOB)
to arrive at an OOB score per tree.
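In principle, the ensemble layer could weight each tree's vote by its OOB score.
The sklearn RandomForestClassifier does not do this; with oob_score=True it instead reports one OOB accuracy estimate for the whole forest (oob_score_).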
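
The bag / OOB bookkeeping for a single tree can be sketched as follows. The helper names are illustrative, and the "tree" is a hypothetical fixed age threshold standing in for a model trained on the bag.

```python
import random

def split_bag_oob(n_examples, rng):
    """Draw a bootstrap bag; the examples never drawn form the OOB set."""
    bag = [rng.randrange(n_examples) for _ in range(n_examples)]
    oob = sorted(set(range(n_examples)) - set(bag))
    return bag, oob

def oob_accuracy(predict, X, y, oob):
    """Accuracy of one tree on the examples it never saw during training."""
    if not oob:
        return None  # possible, though very unlikely for large n
    correct = sum(predict(X[i]) == y[i] for i in oob)
    return correct / len(oob)

rng = random.Random(0)
X = [[age] for age in (35, 42, 48, 52, 61, 70, 44, 58)]
y = ["normal", "normal", "normal", "cancer", "cancer", "cancer", "normal", "cancer"]
bag, oob = split_bag_oob(len(X), rng)

def stump(row):
    # Hypothetical stand-in for a tree trained on the bag.
    return "cancer" if row[0] > 50 else "normal"

print(oob, oob_accuracy(stump, X, y, oob))
```

With large training sets, roughly a third of the examples end up out of bag for any given tree, so the OOB estimate comes almost for free.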

See sklearn [Ensemble Methods](https://scikit-learn.org/stable/modules/ensemble.html)
section 1.11.2.


