# Random Forest

**What is a Random Forest?**

- Random Forest is an **ensemble** machine learning algorithm built on Bootstrap Aggregation, or bagging.

**How does it work?**

- **Bootstrapping** is a statistical method for estimating a quantity, e.g. the mean, from a data sample. You draw many samples from your data (with replacement), calculate the mean of each, then average all of those means to get a better estimate of the true mean. In bagging, the same approach is used for estimating entire statistical models, such as decision trees. Multiple samples of your training data are taken and a model is constructed for each sample. When you need to make a prediction for new data, each model makes a prediction and the **predictions are averaged** (or, for classification, put to a vote) to give a better estimate of the true output value.

- Random forest is a tweak on this approach: when each decision tree is built, every split considers only a random subset of the features rather than searching all of them for the optimal split point. Individual splits may therefore be suboptimal, but the models created for each sample of the data are more different than they otherwise would be, while still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.

- In the example below, "Will I exercise", "I" am a single observation. Each person is an observation. The model takes each observation through the forest, every tree votes for a class, and the most frequent class for that observation becomes the final prediction.
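
The bagging and random-split ideas in the bullets above can be sketched by hand. This is a minimal illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: it uses scikit-learn's bundled copy of iris (`load_iris`) rather than the `pydataset` version loaded later, and the names `n_trees`, `trees`, `all_votes`, and `forest_pred` are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(123)

n_trees = 25  # hypothetical ensemble size, chosen only for the sketch
trees = []
for i in range(n_trees):
    # Bootstrap: draw row indices with replacement, same size as the data
    idx = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' makes each split look at a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Each row holds one tree's predictions for every observation
all_votes = np.array([tree.predict(X) for tree in trees])

# Majority vote per observation (one column of all_votes per observation)
forest_pred = np.array([np.bincount(votes).argmax() for votes in all_votes.T])

manual_accuracy = (forest_pred == y).mean()
print('In-sample accuracy of the hand-rolled forest: {:.2f}'.format(manual_accuracy))
```

Note that scikit-learn's own forest combines trees by averaging their predicted class probabilities rather than counting hard votes, so its predictions can differ slightly from this sketch.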


**Pros**

1. Reduction in over-fitting 

2. More accurate than decision trees in most cases

3. Naturally performs feature selection

**Cons**

1. Slow real time prediction

2. Difficult to implement

3. Complex algorithm so difficult to explain

#### Slide Show: https://docs.google.com/presentation/d/1tXxpl8XUzg-nBnlgoE2y-0qR2BMcZJ4o0rCRWxUECUw/edit?usp=sharing

In [None]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


from pydataset import data

In [None]:
# read Iris data from pydataset
df = data('iris')

# convert column names to lowercase, replace '.' in column names with '_'
df.columns = [col.lower().replace('.', '_') for col in df]

df.head()

## Train Validate Test

Now we'll do our train/validate/test split:

- We'll do exploration and train our model on the `train` data

- We tune our model on `validate`, since it will be out-of-sample until we use it.

- We keep `test` safe and separate as our final out-of-sample dataset, to see how well our tuned model performs on new data.

In [None]:
from sklearn.model_selection import train_test_split

def train_validate_test_split(df, target, seed=123):
    '''
    This function takes in a dataframe, the name of the target variable
    (for stratification purposes), and an integer for setting a seed
    and splits the data into train, validate and test. 
    Test is 20% of the original dataset, validate is .30*.80= 24% of the 
    original dataset, and train is .70*.80= 56% of the original dataset. 
    The function returns, in this order, train, validate and test dataframes. 
    '''
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    return train, validate, test

In [None]:
# split into train, validate, test
train, validate, test = train_validate_test_split(df, target='species', seed=123)

# create X & y versions of train, validate, and test, where y is a series with just the target variable and X holds all the features.

X_train = train.drop(columns=['species'])
y_train = train.species

X_validate = validate.drop(columns=['species'])
y_validate = validate.species

X_test = test.drop(columns=['species'])
y_test = test.species
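
As a quick sanity check on a two-step split like the one above, the resulting proportions and class balance can be inspected directly. The sketch below is self-contained and makes some assumptions: it loads iris from scikit-learn (`load_iris(as_frame=True)`, where the target column is named `target`) instead of `pydataset`, and repeats the same 20% / 30% split sizes inline rather than calling `train_validate_test_split`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's iris frame, with the label in a 'target' column
iris = load_iris(as_frame=True).frame

# Step 1: hold out 20% as test, stratified on the target
train_validate, test = train_test_split(
    iris, test_size=0.2, random_state=123, stratify=iris['target'])

# Step 2: split the remaining 80% into 70% train / 30% validate
train, validate = train_test_split(
    train_validate, test_size=0.3, random_state=123,
    stratify=train_validate['target'])

n_rows = len(iris)
for name, part in [('train', train), ('validate', validate), ('test', test)]:
    print('{:>8}: {:3d} rows ({:.0%} of original)'.format(
        name, len(part), len(part) / n_rows))

# Stratification should keep the classes roughly balanced in every split
print(train['target'].value_counts().sort_index())
```

The printed shares should land close to the 56% / 24% / 20% described in the function's docstring.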

## Train Model

**Create the object**

Create the Random Forest object with desired hyper-parameters. 



In [None]:
rf = RandomForestClassifier(max_depth=3, 
                            random_state=123)

In [None]:
rf

**Fit the model**

Fit the random forest algorithm to the training data. 

In [None]:
rf.fit(X_train, y_train)

**Feature Importance**

Evaluate importance, or weight, of each feature. 

In [None]:
print(rf.feature_importances_)

- The higher the feature importance, the more important the feature.
- The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature.
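
The raw array is easier to read when each value is paired with its column name. A minimal sketch, assuming scikit-learn's bundled iris data (whose column names differ from the renamed `pydataset` columns) and a model fit on the full dataset purely for illustration; `model` and `importances` are names introduced here:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Same hyper-parameters as the notebook's model, fit here on all rows
model = RandomForestClassifier(max_depth=3, random_state=123).fit(X, y)

# Label each importance with its feature name and sort, largest first
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Importances are normalized, so they sum to 1
print('Total: {:.2f}'.format(importances.sum()))
```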

**Make Predictions**

Classify each flower by its estimated species. 

In [None]:
y_pred = rf.predict(X_train)
y_pred

**Estimate Probability**

Estimate the probability of each species, using the training data. 

In [None]:
y_pred_proba = rf.predict_proba(X_train)
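
To see how the probabilities relate to the class predictions, the sketch below checks two properties of `predict_proba`: each row sums to 1, and the class with the highest probability is the one `predict` returns. It assumes scikit-learn's bundled iris data and a model fit on the full dataset for illustration; `proba`, `row_sums`, and `labels_from_proba` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(max_depth=3, random_state=123).fit(X, y)

# One row per observation, one column per class (ordered as model.classes_)
proba = model.predict_proba(X)
print(proba[:5].round(2))

# Each row is a probability distribution over the classes
row_sums = proba.sum(axis=1)
print('Rows sum to 1:', np.allclose(row_sums, 1.0))

# The predicted label is the class with the highest estimated probability
labels_from_proba = model.classes_[proba.argmax(axis=1)]
same = np.array_equal(labels_from_proba, model.predict(X))
print('argmax of predict_proba matches predict:', same)
```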

## Evaluate Model

**Compute the Accuracy**

In [None]:
print('Accuracy of random forest classifier on training set: {:.2f}'
     .format(rf.score(X_train, y_train)))

**Create a confusion matrix**

In [None]:
print(confusion_matrix(y_train, y_pred))
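
The printed matrix has no labels, so it helps to remember that `confusion_matrix` puts actual classes in rows and predicted classes in columns. One way to make that explicit is to wrap the result in a DataFrame. This is a sketch assuming scikit-learn's bundled iris data and a model fit on all rows; `cm_df` and the `actual_` / `predicted_` prefixes are naming choices made for the example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris(as_frame=True)
X = iris.data
# Map the integer codes to species names so the labels are readable
y = iris.target_names[iris.target.to_numpy()]

model = RandomForestClassifier(max_depth=3, random_state=123).fit(X, y)
labels = model.classes_

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y, model.predict(X), labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=['actual_' + label for label in labels],
    columns=['predicted_' + label for label in labels],
)
print(cm_df)
```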

**Create a classification report**


**Precision:** $\frac{TP}{(TP + FP)}$

**Recall:** $\frac{TP}{(TP + FN)}$

**F1-Score:** A measure of accuracy. The harmonic mean of precision & recall. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.  

F1 $\in [0, 1]$

F1-score = harmonic mean = $\frac{2}{\frac{1}{precision} + \frac{1}{recall}}$

**Support:** number of occurrences of each class. 
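
The definitions above can be worked through by hand from a confusion matrix. The sketch below uses a small set of labels made up purely for illustration (they are not results from the iris model), computes each metric per class from the formulas, and compares against scikit-learn's `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels, invented for the example
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 0, 1, 1, 2, 2, 2, 0])

cm = confusion_matrix(y_true, y_hat)  # rows: actual, columns: predicted

tp = np.diag(cm)            # correctly predicted, per class
fp = cm.sum(axis=0) - tp    # predicted as the class but actually another
fn = cm.sum(axis=1) - tp    # actually the class but predicted as another

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
support = cm.sum(axis=1)               # number of actual occurrences per class

print('precision:', precision.round(2))
print('recall:   ', recall.round(2))
print('f1-score: ', f1.round(2))
print('support:  ', support)

# Cross-check against scikit-learn's per-class values
sk_precision, sk_recall, sk_f1, sk_support = precision_recall_fscore_support(y_true, y_hat)
```

The harmonic-mean form of F1 divides by precision and recall, so it only works as written when both are nonzero, as they are in this toy data.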

In [None]:
print(classification_report(y_train, y_pred))

## Validate Model

**Evaluate on Out-of-Sample data**

Compute the accuracy of the model when run on the validate dataset. 

In [None]:
print('Accuracy of random forest classifier on validate set: {:.2f}'
     .format(rf.score(X_validate, y_validate)))

## Exercises

Continue working in your `model` file with titanic data to do the following: 

1. Fit the Random Forest classifier to your training sample and make predictions on that same sample, setting `random_state` accordingly and setting `min_samples_leaf = 1` and `max_depth = 10`.

1. Evaluate your results using the model score, confusion matrix, and classification report.

1. Print and clearly label the following:  Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

1. Run through the steps again, increasing your `min_samples_leaf` and decreasing your `max_depth`.

1. What are the differences in the evaluation metrics?  Which performs better on your in-sample data?  Why?

After making a few models, which one has the best performance (or closest metrics) on both train and validate?
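
The comparison loop the exercises describe could look roughly like the sketch below. It runs on scikit-learn's bundled iris data rather than titanic, uses a single train/validate split for brevity, and the hyper-parameter pairs and the `results` structure are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

results = []
# Each pair decreases max_depth and increases min_samples_leaf
for depth, leaf in [(10, 1), (5, 3), (3, 5), (2, 10)]:
    model = RandomForestClassifier(
        max_depth=depth, min_samples_leaf=leaf, random_state=123)
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)
    results.append({
        'max_depth': depth,
        'min_samples_leaf': leaf,
        'train_accuracy': train_acc,
        'validate_accuracy': val_acc,
        'gap': train_acc - val_acc,  # a large gap suggests over-fitting
    })

for row in results:
    print(row)
```

A model with high accuracy on both sets and a small gap between them is usually the better candidate to carry forward to `test`.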