Random Forests (Ensemble Method)

1) **Bagging (Bootstrap Aggregating)** - an alternative to post-pruning for reducing the variance of decision trees: grow many large, unpruned trees and average their predictions so that the variance averages out
* **Bootstrapping** (For Decision Trees) - random sampling with replacement of dataset to create multiple decision tree models
    * **Ensemble Method** - combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator (e.g. several base decision trees to get random forest)
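* **Central Limit Theorem** (For Decision Trees) - averaging a large number of high-variance decision trees reduces the overall variance (strictly, this is the variance-of-a-mean result rather than the CLT itself: the mean of $B$ independent estimates, each with variance $\sigma^2$, has variance $\frac{\sigma^2}{B}$)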
* Making Predictions - create $B$ bootstrapped training datasets
    * For Regression: Average predictions of $B$ decision trees
        * equation: $\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x)$
    * For Classification: Majority vote among $B$ decision trees
* Error Estimation From Remaining Un-Bootstrap Samples
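    * Because each bootstrap sample is drawn with replacement, each decision tree sees only about $\frac{2}{3}$ of the distinct observations
    * The remaining $\frac{1}{3}$ (the out-of-bag observations for that tree) can be used to estimate the **Out of Bag (OOB) Error**, which can be treated as an estimate of test error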
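The two-thirds figure follows from sampling with replacement: the chance that a given observation is never picked in $n$ draws is $(1-\frac{1}{n})^n \approx e^{-1} \approx 0.368$. A minimal simulation of this, using only the Python standard library (the function name `in_bag_fraction` is illustrative, not from any library):

```python
import random

def in_bag_fraction(n, seed=0):
    """Fraction of distinct observations that land in one bootstrap sample of size n."""
    rng = random.Random(seed)
    # Draw n row indices with replacement, as in a bootstrap sample.
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

frac = in_bag_fraction(10_000)
print(f"in-bag: {frac:.3f}, out-of-bag: {1 - frac:.3f}")
```

The in-bag fraction should land near 0.63, leaving a little over a third of the rows out of bag for each tree.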
* Bagging Steps:
    1. Draw a random **bootstrap** sample of size $n$ (randomly choose $n$ samples from the training set with replacement)
    2. Grow a decision tree from the bootstrap sample. At each node:
        * Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain
    3. Repeat steps 1 & 2 *k* times (*k* is the number of trees, written $B$ above)
    4. Aggregate the prediction by each tree to assign the class label by **majority vote** (or average them for regression)
* Problem arising with only Bagging method:
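    * the bagged decision trees can be highly correlated (nearly the same): if one predictor is very strong, most trees split on it first, so averaging them reduces the variance much less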
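The bagging steps can be sketched by hand as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes scikit-learn and NumPy are installed, uses a synthetic binary dataset from `make_classification`, and the helper names `fit_bagged_trees` and `predict_majority` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, k=25, seed=0):
    """Steps 1-3: fit k unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # n row indices drawn with replacement
        # criterion="entropy" picks splits by information gain.
        tree = DecisionTreeClassifier(criterion="entropy",
                                      random_state=int(rng.integers(1_000_000)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Step 4: majority vote over the trees (binary labels 0/1 assumed)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (k, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
trees = fit_bagged_trees(X[:200], y[:200])   # train on the first 200 rows
pred = predict_majority(trees, X[200:])      # vote on the remaining 100 rows
print("held-out accuracy:", (pred == y[200:]).mean())
```

An odd `k` avoids ties in the vote. For regression, the vote would be replaced by the mean of the trees' predictions.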

2) Random Forest Method (using Bagging)
* Uses the bagging method to generate multiple decision tree models and reduces variance by averaging their predictions
* **Random selection of $m$ predictors** - at **each** split, choose a random selection of $m$ predictors (subset of predictors)
    * classification tuning: $(m = \sqrt{p})$
        * example: For $p=100$ predictors in dataset, randomly get $10$ features at **each** split point
    * regression tuning: $(m = \frac{p}{3})$
    * what does choosing a random subset of predictors at each split do? It **decorrelates** the decision trees, thereby leading to improvements in performance over plain vanilla bagging
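To make the per-split sampling concrete, here is a small sketch using only the standard library. The helper `candidate_features` is hypothetical, and rounding $\sqrt{p}$ and $\frac{p}{3}$ down to an integer is an assumption of this example.

```python
import math
import random

def candidate_features(p, task, rng):
    """Pick the m features considered at one split, sampled without replacement."""
    if task == "classification":
        m = max(1, math.isqrt(p))   # m = floor(sqrt(p))
    else:
        m = max(1, p // 3)          # m = floor(p / 3)
    return rng.sample(range(p), m)

rng = random.Random(0)
split_1 = candidate_features(100, "classification", rng)
split_2 = candidate_features(100, "classification", rng)
print(len(split_1), sorted(split_1))
print(len(split_2), sorted(split_2))   # a fresh subset is drawn at the next split
```

Because a new subset is drawn at every split, a single dominant predictor is often unavailable, which is what pushes the trees apart.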
* How to Handle Categorical Data:
    * String values need to be converted into numeric
    * If possible, convert to a continuous variable (e.g. S, M, L to size in actual weight or height)
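    * Sklearn doesn't support splitting on subsets of categories; it treats every feature as numeric and splits with a $\leq$ threshold
    * When one-hot encoding, don't drop one of the categories (unlike in linear models, the extra column does no harm to a tree)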
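One possible way to apply these points with scikit-learn's preprocessing classes is sketched below. The toy `sizes` and `colors` arrays are invented for illustration, and the choice of ordinal codes for sizes assumes the categories have a natural order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data, one column each (invented for this example).
sizes = np.array([["S"], ["L"], ["M"], ["S"]])
colors = np.array([["red"], ["blue"], ["green"], ["red"]])

# Ordered categories: keep the order so that a "<=" threshold split is meaningful.
size_codes = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(sizes)

# Unordered categories: one column per category, with no column dropped.
color_onehot = OneHotEncoder().fit_transform(colors).toarray()

print(size_codes.ravel())    # S -> 0, M -> 1, L -> 2
print(color_onehot.shape)    # one column each for blue, green, red
```

If a real measurement is available (for example an actual weight or height per size), using it directly is usually preferable to arbitrary integer codes.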
* Handling Missing Values:
    * Typical implementation will use the median value
    * Can first use the median values and then use **proximities** to calculate a more accurate missing value later
        * **Proximities** - can be used to see how similar 2 data points are
        * After fitting the random forest, for each tree in random forest count the number of times that 2 data points are in the same leaf node
        * Normalize at the end
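A proximity matrix can be sketched with the `apply` method of a fitted scikit-learn forest, which returns the leaf index each sample reaches in each tree. This example only computes the proximities on synthetic data; the imputation step that would use them is not shown, and counting all samples (not only out-of-bag ones) is a simplification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=60, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# leaves[i, t] is the index of the leaf that sample i reaches in tree t.
leaves = forest.apply(X)                       # shape (n_samples, n_trees)

# Count shared leaves per pair of samples, then normalize by the number of trees.
same_leaf = leaves[:, None, :] == leaves[None, :, :]
proximity = same_leaf.mean(axis=2)             # shape (n_samples, n_samples)

print(proximity.shape, proximity[0, 0])
```

A missing value could then be re-estimated as a proximity-weighted average of the observed values, which is the refinement the bullet above alludes to.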
* Random Forest Steps:
    1. Draw a random **bootstrap** sample of size $n$ (randomly choose $n$ samples from the training set with replacement)
    2. Grow a decision tree from the bootstrap sample. At each node:
        1. **Randomly select $d$ features without replacement** ($d$ is the subset size, written $m$ above)
        2. Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain
    3. Repeat steps 1 & 2 *k* times (*k* is the number of trees, written $B$ above)
    4. Aggregate the prediction by each tree to assign the class label by **majority vote** (or average them for regression)
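In practice these steps are usually delegated to a library. A minimal sketch with scikit-learn's `RandomForestClassifier`, assuming a synthetic dataset and illustrative parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,        # k trees
    max_features="sqrt",     # features tried at each split (sqrt(16) = 4)
    bootstrap=True,          # each tree is fit on its own bootstrap sample
    oob_score=True,          # score each sample using only trees that never saw it
    random_state=0,
)
forest.fit(X, y)
print("OOB accuracy:", round(forest.oob_score_, 3))
```

Note that scikit-learn combines trees by averaging their predicted class probabilities rather than by a hard majority vote, so results can differ slightly from a hand-rolled vote.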
* How many decision trees to use? (n_estimators in sklearn)
    * Variance decreases with more trees (with diminishing returns)
    * Run time scales linearly with more trees
    * More trees rarely hurt, but save very large forests (e.g. 50,000 trees) for the final run, after the other hyperparameters are settled

3) **Feature Importance** - evaluate the importance of features
![feature_importance](feature_importance.png)
* Calculating Feature Importance in Regression vs. Classification RF Models:
    * In Bagging/Random Forest Regression trees - record total amount **RSS decreases** due to splits over given predictor, then average over all $B$ decision trees $\rightarrow$ larger value indicates "importance"
    * In Bagging/Random Forest Classification trees - record total amount **Gini Index decreases** due to splits over given predictor, then average over all $B$ decision trees $\rightarrow$ larger value indicates "importance"
* Alternative Way to Calculate Feature Importance
    * To evaluate importance of $j$th feature:
        1. When the $b$th tree is grown, pass its **OOB** samples down the tree $\rightarrow$ record the accuracy
        2. Randomly permute the values of the $j$th feature in the **OOB** samples and pass them down again $\rightarrow$ record the new (typically lower) accuracy
    * The importance is the **decrease** in accuracy, averaged over all decision trees
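A related computation is available in scikit-learn as `sklearn.inspection.permutation_importance`. Note the swap: it permutes features on a dataset you pass in (a held-out test set here) rather than on each tree's OOB samples, so it is a close cousin of the procedure above, not the same thing. The synthetic data, in which only feature 0 carries signal, is an assumption of this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] > 0).astype(int)          # synthetic: only feature 0 carries signal

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and record the drop in held-out accuracy.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print("permutation importance:", result.importances_mean.round(3))
print("impurity importance:   ", forest.feature_importances_.round(3))
```

Printing both makes it easy to compare the permutation-based ranking with the impurity-based `feature_importances_` attribute.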
* Comparison of Feature Importances By Split Measurement Metric
![gini_rand_feat_impt](gini_rand_feat_impt.png)
* Feature Importance in Sklearn
    * Basically, the *higher in the decision tree the feature is*, the more important it is in determining the result of a data point
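    * The *expected fraction of data points that reach a node*, combined with the impurity decrease from that node's split, is used as an estimate of that feature's importance for that tree
    * Finally, those values are averaged across all decision trees (and normalized to sum to 1) to get the feature's importance (`feature_importances_`)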
* Feature Importance Overall:
    * Feature importances are almost always put forth as normalized values. What is important is that we can compare features to other features
    * Authors of RF state that you should be interested in **rank** only and not magnitude
    * Typically, the more features you have in random forest, the less important any individual feature will be
    * Highly correlated features tend to split importance
    * Some highly correlated but not super important features will look important
* Bias-Variance Trade off For Random Forest
    * **Bias** - by creating full decision trees, we get relatively **low** bias
        * expectation of average of $B$ decision trees same as expectation of any one of the decision trees
    * **Variance** - the average of $B$ identically distributed random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, has variance
        * equation: $\rho\sigma^2+\frac{1-\rho}{B}\sigma^2$
        * decorrelating the decision trees by randomly selecting $m$ features at each split/node lowers $\rho$, and with it the first term, which does not shrink as $B$ grows
        * greater the $B$ (number of decision trees), the more the variance will reduce (although after some point there is diminishing return / computationally expensive)
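Plugging numbers into the variance expression shows both effects. This is a plain evaluation of the formula above with $\sigma^2 = 1$; the function name `ensemble_variance` is illustrative.

```python
def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B identically distributed trees
    with pairwise correlation rho and individual variance sigma2."""
    return rho * sigma2 + (1 - rho) / B * sigma2

for rho in (0.0, 0.3, 0.9):
    row = [round(ensemble_variance(rho, 1.0, B), 3) for B in (1, 10, 100, 1000)]
    print(f"rho={rho}: {row}")
```

With $\rho = 0$ the variance keeps falling as $\frac{\sigma^2}{B}$, while for $\rho > 0$ it levels off at $\rho\sigma^2$ no matter how many trees are added, which is why lowering the correlation matters.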
* Recommended Tuning For Random Forest:
    * Classification: 
        * minimum node size = 1
        * max_features: $m=\sqrt{p}$
    * Regression: 
        * minimum node size = 5
        * max_features: $m=\frac{p}{3}$ or full features
    * `min_samples_leaf` - start with the sklearn default (1) and try larger values
    * `n_jobs` - choosing -1 will make it run on all available processors
    * k-Fold Cross Validation to get optimal hyperparameters
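One way to combine these recommendations is a small grid search with k-fold cross-validation. The grid values, the 3-fold setting, and the synthetic dataset below are illustrative choices, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

param_grid = {
    "max_features": ["sqrt", 0.5, None],   # None means all p features (plain bagging)
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(
    # n_jobs=-1 lets the forest use all available cores while fitting trees.
    RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0),
    param_grid,
    cv=3,                                  # k-fold cross-validation with k = 3
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping `n_estimators` small during the search and raising it for the final fit keeps tuning cheap.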
* Random Forest Pros:
    * (+) For an out of the box model, it has very good accuracy
    * (+) Trees can be trained in parallel to make computations faster
    * (+) OOB estimates allow for an estimate of generalization error without needing CV
    * (+) Can handle thousands of features and be used for feature reduction
* Random Forest Intuition:
    * Cannot extrapolate well for regression trees
    * Trees capture interactions implicitly, but that doesn't mean you'll never want explicit interaction variables
    * Having an OOB error estimate doesn't mean you shouldn't also do CV
    * Start with a small number of trees at first, then increase
    * Pickling makes a giant file (~GB)