# Random Forest

A random forest is an ensemble of decision trees, so it is not fundamentally different from a single decision tree; it simply adds machinery on top. Random forests were created to alleviate problems that lone decision trees tend to suffer from, chiefly high variance and overfitting.

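As an ensemble, a random forest uses bagging combined with random feature selection (sampling both training instances and features is known as random patches). Each tree is trained on a random sample of the training data and considers only a random subset of the features when splitting a node. This means every decision tree will:

1. Randomly sample a subset of the training data (typically a bootstrap sample drawn with replacement).
2. Optionally, randomly select a subset of the features for the whole tree (this per-tree sampling is what random patches adds; the classic random forest skips it).
3. Randomly select a subset of the features to consider at each split.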

As a consequence, each tree is trained on a different sample of the data and sees different features, which makes the trees more diverse and helps counter overfitting. It also decorrelates the trees: strongly correlated trees would cast nearly identical, and therefore redundant, votes. In the end, a random forest averages the predictions of all its trees (regression) or, for classifiers, aggregates them with another statistic such as the most frequent predicted class (a majority vote).
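The sampling and voting steps described above can be sketched by hand. This is a minimal illustration rather than how any library implements it: the synthetic dataset, the tree count of 25, and all variable names are assumptions made for the example, and the per-split feature sampling is delegated to `DecisionTreeClassifier(max_features="sqrt")`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote can never tie
trees = []
for i in range(n_trees):
    # Bootstrap: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes the tree consider a random feature subset at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregate: majority vote over the trees' predicted classes (labels are 0/1 here).
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
forest_acc = np.mean(forest_pred == y_test)
print(f"single tree accuracy:   {single_acc:.3f}")
print(f"majority-vote accuracy: {forest_acc:.3f}")
```

Comparing the two printed accuracies gives a rough feel for how much the vote helps over a single tree on this particular dataset.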

Since each tree can be trained independently, it is also easy to see how the trees can be trained in parallel across CPU or GPU cores.
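In scikit-learn, this parallelism is exposed through the `n_jobs` parameter, which spreads the work across CPU cores. The snippet below is a small sketch on synthetic data (the dataset and the forest size are assumptions for illustration); it fits the same forest serially and in parallel and checks that the predictions match when `random_state` is fixed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# n_jobs=1 trains the trees one after another; n_jobs=-1 uses all available CPU cores.
serial = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=1).fit(X, y)
parallel = RandomForestClassifier(n_estimators=50, random_state=0, n_jobs=-1).fit(X, y)

# With the same random_state, both forests should produce the same probabilities.
same = np.allclose(serial.predict_proba(X), parallel.predict_proba(X))
print(same)
```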

You can approximate a random forest using sklearn's `BaggingClassifier`.

```python

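from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# bagging over randomized trees: each tree is fit on a bootstrap sample of the full training set
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
                            n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)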
```
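To see how close this approximation gets, the `BaggingClassifier` configuration above can be fit side by side with a `RandomForestClassifier`. The following is a sketch under assumptions: the `make_moons` toy dataset, the reduced `n_estimators=100`, and the choice to give the forest the same `max_leaf_nodes` are illustrative, not part of the original setup.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy two-class dataset (assumption for illustration).
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging over randomized trees, as in the configuration above but with fewer trees.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=100, max_samples=1.0, bootstrap=True, n_jobs=-1, random_state=42)

# A random forest with the same tree-size limit.
rnd_clf = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)
rnd_clf.fit(X_train, y_train)

# Fraction of test points on which the two ensembles predict the same class.
agreement = (bag_clf.predict(X_test) == rnd_clf.predict(X_test)).mean()
print(f"bagging accuracy:       {bag_clf.score(X_test, y_test):.3f}")
print(f"random forest accuracy: {rnd_clf.score(X_test, y_test):.3f}")
print(f"prediction agreement:   {agreement:.3f}")
```

The two models are not identical (the split randomization differs), so the agreement is a rough indicator rather than an equivalence check.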

## Example

### Click-Through Prediction with a Random Forest

```python
import pandas as pd
import numpy as np
from os.path import join
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split

# read in data
DATA_DIR = join('..', '..', '..', 'data', 'click-rate-prediction')
click_df = pd.read_csv(join(DATA_DIR, 'train.csv'), nrows=150000)
click_df.drop(['id', 'hour', 'device_id', 'device_ip'], axis=1, inplace=True)

# label-encode every column so all values become integer codes
click_df = click_df.apply(LabelEncoder().fit_transform)

# split X and y into np arrays explicitly (y is kept 1-D, as scikit-learn expects)
X_names = [name for name in click_df.columns if name != 'click']
X, y = np.array(click_df[X_names]), np.array(click_df['click'])

# one-hot encode so the integer codes do not imply an order or distance between categories
X_encoded = OneHotEncoder(categories='auto').fit_transform(X)

# split the encoded X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.33, random_state=42)
```

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# grid search for parameter optimization
# n_estimators = number of trees in the forest used for majority voting
parameters = {'max_depth': np.arange(5, 30, 3),
              'n_estimators': np.arange(100, 120, 5),
              'min_samples_split': np.arange(20, 60, 15)}

# another parameter worth considering is max_features, which specifies the number of
# random features to consider at each split
# verbose controls the progress output (a higher value prints more detail)
rforest = RandomForestClassifier(criterion='entropy')
gsearch = GridSearchCV(rforest, parameters, n_jobs=4, cv=3, scoring='roc_auc', verbose=10)
gsearch.fit(X_train, y_train)
print(gsearch.best_params_)

# evaluate the best model found by the search on the held-out test set
rforest_best = gsearch.best_estimator_
rforest_prob_pred = rforest_best.predict_proba(X_test)[:, 1]
rforest_auc = roc_auc_score(y_test, rforest_prob_pred)
print(f'The ROC AUC on the test set using the optimized rforest classifier is {rforest_auc}')
```

The run recorded here was interrupted by hand (`KeyboardInterrupt`) before the search finished, so no best parameters or test AUC were printed. The log up to that point:

```
Fitting 3 folds for each of 108 candidates, totalling 324 fits

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:   11.5s
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:   16.7s
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:   26.3s
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   33.0s
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:   46.2s
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  43 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  44 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  45 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  46 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  47 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  48 tasks      | elapsed:   58.9s
[Parallel(n_jobs=4)]: Done  49 tasks      | elapsed:   58.9s
KeyboardInterrupt
```
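Because the full search covers 108 candidates with three folds each, a scaled-down version can be handy for checking the pipeline end to end before committing to the long run. The sketch below is an assumption-laden stand-in: it uses synthetic data from `make_classification` instead of the click data, and the reduced grid values in `small_grid` are hypothetical, chosen only to keep the search to 8 candidates.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the click data (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=15, n_informative=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# A much smaller grid over the same three hyperparameters: 2 * 2 * 2 = 8 candidates.
small_grid = {'max_depth': [5, 10],
              'n_estimators': [20, 40],
              'min_samples_split': [20, 35]}

search = GridSearchCV(RandomForestClassifier(criterion='entropy', random_state=42),
                      small_grid, cv=3, scoring='roc_auc', n_jobs=1)
search.fit(X_train, y_train)
print(search.best_params_)

# Score the refit best estimator on the held-out test set.
auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print(f'Test ROC AUC: {auc:.3f}')
```

Once this runs cleanly, the same code can be pointed at the real training matrices and the full grid.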