> To live! like a tree alone and free,
> and like a forest in solidarity...
> Nazim Hikmet

# Random Forests



__Objectives__

- Introduce the 'bagging' procedure.

- Identify why random forests need bootstrapping.

- Compare random forests with bagging.

- Fit and evaluate a random forest model.

## Bootstrapping


<img src= "img/bootstrap1.png" style="height:250px">
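
As a minimal sketch of the idea, a bootstrap sample is drawn from the data with replacement and has the same size as the original data. The array `data` below is a made-up stand-in for row indices, not the Heart data. Rows that are never drawn are usually called "out-of-bag" rows, and for large samples roughly a third of the rows end up out-of-bag.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1000)   # hypothetical stand-in for the row indices of a data set
n = len(data)

# One bootstrap sample: draw n rows *with replacement* from the original n rows.
idx = rng.integers(0, n, size=n)
sample = data[idx]

# Rows that were never drawn are "out-of-bag" for this particular sample.
in_bag = np.unique(idx)
oob = np.setdiff1d(data, in_bag)

frac_unique = len(in_bag) / n
print(f"distinct rows in the sample: {frac_unique:.3f}")
print(f"out-of-bag rows:             {len(oob) / n:.3f}")
```

The distinct fraction should land near $1 - 1/e \approx 0.632$, which is why each tree in a bagged ensemble leaves some data unseen.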


# Bagging (Bootstrapping + Aggregating)


Let us recall once more that if $Z_{1}, \cdots, Z_{n}$ are independent observations, each with variance $\sigma^{2}$, then the variance of their mean $\bar{Z}$ is $\frac{\sigma^{2}}{n}$.
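
A quick simulation can make this concrete. The sketch below assumes normally distributed observations with made-up values for $\sigma$ and $n$; any distribution with finite variance should behave similarly.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n, n_repeats = 2.0, 25, 20_000   # illustrative values, chosen arbitrarily

# Each row is one experiment of n independent draws with variance sigma**2.
Z = rng.normal(loc=0.0, scale=sigma, size=(n_repeats, n))

var_single = Z[:, 0].var()        # variance of a single observation
var_mean = Z.mean(axis=1).var()   # variance of the mean of n observations

print(f"Var(Z_1)  simulated: {var_single:.4f}   theory: {sigma**2:.4f}")
print(f"Var(Zbar) simulated: {var_mean:.4f}   theory: {sigma**2 / n:.4f}")
```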

__How is this relevant now?__



We will use this idea to calculate $$ \hat{f}^{1}(x), \cdots, \hat{f}^{B}(x)$$ where each $\hat{f}^{i}$ represents a decision tree fitted to one bootstrapped sample of the data.

Then we will make a prediction by: 

$$ \hat{f}_{\text{avg}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{b}(x)$$

Note that this formula is for regression; for classification we can take a majority vote over the $B$ trees instead.

_side note: [sklearn averages over probabilities not majority vote](https://scikit-learn.org/stable/modules/ensemble.html#forest)_
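
The two aggregation rules can be sketched by hand. This is an illustration under stated assumptions, not sklearn's internal code: the data are synthetic (`make_classification`), the trees are kept shallow (`max_depth=3`) so that leaf probabilities are not just 0 or 1, and `B = 50` is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption, not the Heart data).
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0, stratify=y_syn)

rng = np.random.default_rng(0)
B = 50
trees = []
for b in range(B):
    # Bootstrap: draw len(X_tr) rows with replacement, then fit one tree on them.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_depth=3, random_state=b)
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# Rule 1: average the predicted class probabilities, then take the argmax.
proba = np.mean([t.predict_proba(X_te) for t in trees], axis=0)
soft_pred = proba.argmax(axis=1)

# Rule 2: majority vote over the hard 0/1 predictions of each tree.
votes = np.stack([t.predict(X_te) for t in trees])   # shape (B, n_test)
hard_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("averaged-probability accuracy:", (soft_pred == y_te).mean())
print("majority-vote accuracy:       ", (hard_pred == y_te).mean())
print("predictions that differ:      ", int((soft_pred != hard_pred).sum()))
```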


## Sklearn for Random Forests

In [1]:
import pandas as pd
import numpy as np

In [2]:
# you can download the data from -- https://www.kaggle.com/ishaanv/ISLR-Auto#Heart.csv

# or http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
heart = pd.read_csv('data/Heart.csv', index_col=0)
heart.head()
print(heart.shape)

(303, 14)


In [3]:
# drop nulls
heart.dropna(axis=0, how='any', inplace=True)
y = heart.AHD
X = heart.drop(columns='AHD')

In [4]:
heart.head()

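|   | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
|---|-----|-----|-----------|--------|------|-----|---------|-------|-------|---------|-------|-----|------|-----|
| 1 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | fixed | No |
| 2 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | normal | Yes |
| 3 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | reversable | Yes |
| 4 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | normal | No |
| 5 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | normal | No |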


In [5]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 297 entries, 1 to 302
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        297 non-null    int64  
 1   Sex        297 non-null    int64  
 2   ChestPain  297 non-null    object 
 3   RestBP     297 non-null    int64  
 4   Chol       297 non-null    int64  
 5   Fbs        297 non-null    int64  
 6   RestECG    297 non-null    int64  
 7   MaxHR      297 non-null    int64  
 8   ExAng      297 non-null    int64  
 9   Oldpeak    297 non-null    float64
 10  Slope      297 non-null    int64  
 11  Ca         297 non-null    float64
 12  Thal       297 non-null    object 
 13  AHD        297 non-null    object 
dtypes: float64(2), int64(9), object(3)
memory usage: 34.8+ KB


In [10]:
display(X.Thal.value_counts(normalize=True))
display(X.ChestPain.value_counts(normalize=True))

normal        0.552189
reversable    0.387205
fixed         0.060606
Name: Thal, dtype: float64

asymptomatic    0.478114
nonanginal      0.279461
nontypical      0.164983
typical         0.077441
Name: ChestPain, dtype: float64

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1, stratify=y)

In [12]:
from sklearn.preprocessing import OneHotEncoder

In [13]:
# shortcut for splitting catvar from numvar:
categorical_variables = list(X_train.select_dtypes(include=['object']).columns)
numerical_variables = list(X_train.select_dtypes(include=['int64', 'float64']).columns)

In [14]:
categorical_variables

['ChestPain', 'Thal']

In [15]:
numerical_variables

['Age',
 'Sex',
 'RestBP',
 'Chol',
 'Fbs',
 'RestECG',
 'MaxHR',
 'ExAng',
 'Oldpeak',
 'Slope',
 'Ca']

In [16]:
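# RestECG (position 5) is stored as an integer but is really a category: move it over
categorical_variables.append(numerical_variables.pop(5))

In [17]:
# same for Slope, which is now second from the end of the numerical list
categorical_variables.append(numerical_variables.pop(-2))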

In [18]:
categorical_variables

['ChestPain', 'Thal', 'RestECG', 'Slope']

In [19]:
ohe = OneHotEncoder(drop='first')
X_categ = ohe.fit_transform(X_train[categorical_variables]).toarray()
X_num = X_train[numerical_variables].values
Xtrain = np.concatenate((X_categ, X_num), axis=-1,)
Xtrain.shape

(237, 18)

In [20]:
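# check out feature names for cat and num:
# (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
#  newer versions provide get_feature_names_out() instead)
ohe.get_feature_names()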

array(['x0_nonanginal', 'x0_nontypical', 'x0_typical', 'x1_normal',
       'x1_reversable', 'x2_1', 'x2_2', 'x3_2', 'x3_3'], dtype=object)
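
To see why some levels (for example `fixed` for `Thal`) do not show up among the encoded names, here is a small sketch of what `drop='first'` does. The toy column below is made up; it only reuses the three `Thal` levels seen earlier.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column with the three Thal levels (hypothetical rows, not the real data).
thal = np.array([["normal"], ["reversable"], ["fixed"], ["normal"]])

ohe_demo = OneHotEncoder(drop="first")
encoded = ohe_demo.fit_transform(thal).toarray()

# Levels are sorted; the first one ('fixed') is dropped and becomes the
# baseline, encoded as all zeros.
print(ohe_demo.categories_[0])
print(encoded)
```

With `drop='first'`, a feature with $k$ levels produces $k - 1$ columns, which is consistent with the 9 encoded columns above.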

In [21]:
numerical_variables

['Age', 'Sex', 'RestBP', 'Chol', 'Fbs', 'MaxHR', 'ExAng', 'Oldpeak', 'Ca']

In [22]:
# now we should transform the test data
# to be able to use it for the prediction

X_test_categ = ohe.transform(X_test[categorical_variables]).toarray()
X_test_num = X_test[numerical_variables].values
Xtest = np.concatenate((X_test_categ, X_test_num), axis=-1,)
Xtest.shape

(60, 18)

In [29]:
from sklearn.ensemble import RandomForestClassifier

In [30]:
clf = RandomForestClassifier(n_estimators=100,
                             criterion='gini',
                             max_features='sqrt',  # same as the older 'auto' for classifiers ('auto' was removed in scikit-learn 1.3)
                             oob_score=True)

In [37]:
clf.fit(Xtrain, y_train)
print(clf.score(Xtrain, y_train))
print(clf.score(Xtest, y_test))

1.0
0.7833333333333333


__Your Turn__

- Use 5-fold cross-validation (`cross_validate`) to fit the random forest classifier we created above.
- Don't forget to return the training scores and the fitted estimators.

In [38]:
from sklearn.model_selection import cross_validate

In [43]:
cv = cross_validate(clf, Xtrain, y_train, return_estimator=True, return_train_score=True)

__Your Turn__

- What is the type of the `cv` object returned above?

- Compare the train scores with the test (validation) scores.

- Print "mean +/- std" for both train and test scores.

- Also print the OOB scores (`oob_score_`) and compare them with the cross-validation scores.

In [44]:
type(cv)

dict

In [45]:
cv

{'fit_time': array([0.19141197, 0.17072487, 0.17394614, 0.17212605, 0.183038  ]),
 'score_time': array([0.01241112, 0.00932598, 0.00928402, 0.00918388, 0.00967503]),
 'estimator': (RandomForestClassifier(oob_score=True),
  RandomForestClassifier(oob_score=True),
  RandomForestClassifier(oob_score=True),
  RandomForestClassifier(oob_score=True),
  RandomForestClassifier(oob_score=True)),
 'test_score': array([0.8125    , 0.8125    , 0.78723404, 0.76595745, 0.80851064]),
 'train_score': array([1., 1., 1., 1., 1.])}

In [46]:
clf.fit(Xtrain, y_train)
clf.oob_score_

0.7932489451476793

In [47]:
# How to interpret: if one fold has a really bad test_score, that fold's held-out data may contain some unusual rows.
# I may want to go back and change my model.
cv['test_score'].mean()

0.797340425531915
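
One possible way to report "mean +/- std" and the per-fold OOB scores is sketched below. To keep the block self-contained it uses synthetic data as a stand-in for `(Xtrain, y_train)`; the names `X_demo`, `y_demo`, `forest`, and `cv_demo` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for (Xtrain, y_train).
X_demo, y_demo = make_classification(n_samples=240, n_features=18, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
cv_demo = cross_validate(forest, X_demo, y_demo, cv=5,
                         return_estimator=True, return_train_score=True)

for key in ("train_score", "test_score"):
    scores = cv_demo[key]
    print(f"{key}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Each fitted estimator carries its own out-of-bag score,
# computed on the training part of its fold.
oob_scores = np.array([est.oob_score_ for est in cv_demo["estimator"]])
print(f"oob_score:  {oob_scores.mean():.3f} +/- {oob_scores.std():.3f}")
```

Both the OOB score and the cross-validation test score estimate performance on unseen data, so they would be expected to be of similar size, as they are in the notebook output above.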

__Your Turn__

- Note that we have an overfitting problem: the training accuracy is 1.0 while the validation accuracy is about 0.80.

- Let's try to reduce the overfitting.

In [49]:
y_train.value_counts()

No     128
Yes    109
Name: AHD, dtype: int64

In [51]:
clf = RandomForestClassifier(max_depth=10,
                             max_features='log2',
                             min_samples_split=4,
                             oob_score=True)

clf.fit(Xtrain, y_train)
print(clf.score(Xtrain, y_train))
print(clf.oob_score_)

0.9873417721518988
0.7848101265822784
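
A simple way to explore this trade-off is to vary one regularizing parameter and watch the gap between the training score and the OOB score. The sketch below is only an illustration: it uses synthetic, deliberately noisy data (`flip_y=0.1`) and picks `min_samples_leaf` as the parameter, which is an assumption rather than the setting used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, noisy stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=240, n_features=18,
                                     flip_y=0.1, random_state=0)

results = {}
for leaf in (1, 5, 20):
    forest = RandomForestClassifier(n_estimators=200, min_samples_leaf=leaf,
                                    oob_score=True, random_state=0)
    forest.fit(X_demo, y_demo)
    # Store (training accuracy, out-of-bag accuracy) for this setting.
    results[leaf] = (forest.score(X_demo, y_demo), forest.oob_score_)
    train_acc, oob_acc = results[leaf]
    print(f"min_samples_leaf={leaf:>2}  train={train_acc:.3f}  oob={oob_acc:.3f}")
```

Larger leaves should pull the training score down toward the OOB score; whether the OOB score itself improves depends on the data.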


### Do it with Pipelines!

In [61]:
# There is an "easier" way to do this
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier

In [62]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
        transformers=[
        ('num', numeric_transformer, numerical_variables),
        ('cat', categorical_transformer, categorical_variables)],)

rf = Pipeline(steps=[
    ('ct', preprocessor),
    ('clf', clf)])

In [63]:
# The ColumnTransformer selects columns by name, so pass the DataFrame X_train,
# not the NumPy array Xtrain (which fails with: 'numpy.ndarray' object has no attribute 'columns').
pipe_validator = cross_validate(rf, X_train, y_train, return_estimator=True, return_train_score=True)

In [64]:
# train scores
print(cv['train_score'])
# validation scores
print(cv['test_score'])

# let's pick one of the estimators for further investigation

est = cv['estimator'][0]

[1. 1. 1. 1. 1.]
[0.8125     0.8125     0.78723404 0.76595745 0.80851064]


In [67]:
est

RandomForestClassifier(oob_score=True)

In [68]:
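# est comes from cv, which was run on the bare classifier, so it is a
# RandomForestClassifier and exposes oob_score_ directly. For a fitted Pipeline
# built like rf above, the forest is the step named 'clf': est['clf'].oob_score_
est.oob_score_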
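
For reference, here is a small self-contained sketch of cross-validating a pipeline and reading the OOB score from one of the returned estimators. The DataFrame, its column names (`age`, `chol`, `pain`), and the target are all made up for illustration. The key points are that the pipeline receives a DataFrame, because the `ColumnTransformer` selects columns by name, and that a fitted `Pipeline` is indexed by step name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame (hypothetical columns, not the Heart data).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(30, 80, size=n),
    "chol": rng.normal(240, 40, size=n),
    "pain": rng.choice(["typical", "nontypical", "asymptomatic"], size=n),
})
target = (df["age"] + rng.normal(0, 10, size=n) > 55).astype(int)

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "chol"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["pain"]),
])
pipe = Pipeline([
    ("ct", pre),
    ("clf", RandomForestClassifier(oob_score=True, random_state=0)),
])

# Pass the DataFrame itself so the column names are available.
cv_pipe = cross_validate(pipe, df, target, cv=5, return_estimator=True)

# Each returned estimator is a fitted Pipeline; index it by step name.
first = cv_pipe["estimator"][0]
print(type(first).__name__)
print(first["clf"].oob_score_)
```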

## Feature Importance

In [52]:
feature_importances = clf.feature_importances_

In [53]:
feature_importances

array([0.03527144, 0.0121948 , 0.02397573, 0.12191526, 0.13940688,
       0.00076297, 0.01469168, 0.02046935, 0.00401169, 0.08384736,
       0.03783794, 0.0755455 , 0.06916895, 0.0089195 , 0.10191607,
       0.05198318, 0.10174049, 0.0963412 ])

In [58]:
ohe.get_feature_names()

array(['x0_nonanginal', 'x0_nontypical', 'x0_typical', 'x1_normal',
       'x1_reversable', 'x2_1', 'x2_2', 'x3_2', 'x3_3'], dtype=object)

In [56]:
# be careful with the order of columns
columns = ohe.get_feature_names().tolist() + numerical_variables

In [57]:
importances = pd.DataFrame(data=feature_importances,
                           index=columns, columns=['feature_importances'])

importances.sort_values(by='feature_importances', ascending=False)

| feature | feature_importances |
|---------|---------------------|
| x1_reversable | 0.139407 |
| x1_normal | 0.121915 |
| MaxHR | 0.101916 |
| Oldpeak | 0.10174 |
| Ca | 0.096341 |
| Age | 0.083847 |
| RestBP | 0.075545 |
| Chol | 0.069169 |
| ExAng | 0.051983 |
| Sex | 0.037838 |


In [60]:
importances.feature_importances.sum()

1.0000000000000002

In [59]:
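# Interpreting feature importance:
# How much impurity reduction (information gain) do splits on x1_reversable contribute?
# Each number is that feature's share of the total impurity reduction across the trees,
# normalized so that all importances sum to 1. It is not a percentage of variance explained.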
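
To make the interpretation above more tangible, the sketch below checks on synthetic data that the forest's `feature_importances_` match the average of the individual trees' impurity-based importances, and that they sum to 1. This reflects my understanding of how current scikit-learn versions compute the attribute; treat it as an illustration rather than a specification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption, not the Heart data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Every tree has its own normalized impurity-based importances...
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
# ...and averaging them over the trees reproduces the forest-level values.
manual = per_tree.mean(axis=0)

print(np.allclose(manual, forest.feature_importances_))
print(forest.feature_importances_.sum())
```

Because these importances are computed from the training data, correlated or high-cardinality features can distort them, which is the concern raised in the "tricky stuff" link below.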

### Extra Material 

- [Sklearn averages probabilities in RF implementation](https://scikit-learn.org/stable/modules/ensemble.html#forest)

- [On the variance](https://newonlinecourses.science.psu.edu/stat414/node/167/)

- [Is RF immune to overfitting?](https://en.wikipedia.org/wiki/Talk%3ARandom_forest)

- [Tricky stuff with respect to feature importance](http://rnowling.github.io/machine/learning/2015/08/10/random-forest-bias.html)

- [An interesting implementation of feature importance](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances_faces.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-faces-py)

- [Different Ensemble Methods in sklearn](https://scikit-learn.org/stable/modules/ensemble.html#forest)

- [ISLR - section 8.2](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf)

- [Another library for RF: H2o](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html)