# Random Forest Classification
A random forest is an ensemble learning model: instead of relying on a single model, it combines the predictions of many decision trees.

### Random Forest Algorithm:
- Step 1: Pick K data points at random from the training set
- Step 2: Build a Decision Tree using only these K data points (each tree only sees a subset of the data)
- Step 3: Choose the number of trees (NTrees) you want to build
    - Repeat Steps 1 and 2 until you have built NTrees trees
- Step 4: Finally, use all of the decision trees to predict a data point's y-value
    - Have each of the NTrees trees predict the y-value for the data point
        - The final predicted y-value is the category that wins the majority vote
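
The steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally: the helper names `fit_forest` and `predict_forest` and the toy data are made up for this example, the K points are assumed to be drawn with replacement, class labels are assumed to be non-negative integers, and the extra per-split feature randomness that scikit-learn's forest adds is left out.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(x, y, n_trees=10, k=None, seed=0):
    """Fit n_trees decision trees, each on K randomly drawn data points."""
    rng = np.random.default_rng(seed)
    k = len(x) if k is None else k
    trees = []
    for _ in range(n_trees):
        # Steps 1 and 2: draw K row indices (with replacement), fit a tree on them
        idx = rng.integers(0, len(x), size=k)
        tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
        trees.append(tree.fit(x[idx], y[idx]))
    return trees


def predict_forest(trees, x):
    """Step 4: every tree votes, and the most common class wins for each row."""
    votes = np.stack([tree.predict(x) for tree in trees])  # shape (n_trees, n_rows)
    return np.array([np.bincount(col).argmax() for col in votes.T])


# toy data (hypothetical): two features, class 1 when their sum is positive
rng = np.random.default_rng(42)
x_toy = rng.normal(size=(200, 2))
y_toy = (x_toy[:, 0] + x_toy[:, 1] > 0).astype(int)

trees = fit_forest(x_toy, y_toy, n_trees=10)
y_hat = predict_forest(trees, x_toy)
print("training accuracy:", (y_hat == y_toy).mean())
```

With an even number of trees a vote can tie; in this sketch `argmax` then falls back to the lower class label.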

### Small Data Set?
What if the data set was small and we wanted to use a lot of trees? How would the Random Forest handle that?

Answer: Some of the trees would be built from the same (or nearly the same) data points, so they would be repeated and used redundantly in the algorithm.
- Even if the algorithm uses duplicated trees, combining the trees' predictions could still make a strong final prediction
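
To see why repeats are unavoidable on a tiny data set, here is a small sketch that only simulates the random sampling step. The sizes (4 data points, 50 trees) are hypothetical, and it assumes each tree's sample is drawn with replacement and has the same size as the data set. There are only 35 distinct ways to draw 4 points from 4 when order is ignored, so 50 trees must share some samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_trees = 4, 50  # hypothetical sizes for illustration

# one random sample of row indices per tree, sorted so that order doesn't matter
samples = [
    tuple(sorted(rng.integers(0, n_points, size=n_points)))
    for _ in range(n_trees)
]

n_distinct = len(set(samples))
print(f"{n_trees} trees were built from only {n_distinct} distinct samples")
```

Trees grown from identical samples with identical settings would make identical predictions, which is the redundancy described above.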

### Random Forest Real-Life Application
When Microsoft was developing its Xbox Kinect system, it used the Random Forest classification algorithm to classify each person's body parts.

The Kinect used sensors to detect where a person was located in space and how they were moving, and this sensor data helped determine each body part and its position.

In [1]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# import the data set
ads_df = pd.read_csv("datasets/social_network_ads.csv")

ads_df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [3]:
# x is the Age and Estimated Salary columns
x = ads_df.iloc[:, [2, 3]].values

# y is the Purchased column
y = ads_df.iloc[:, 4].values

In [4]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

### No Need to Feature Scale!
Decision Trees split the data by comparing one feature at a time against a threshold. Unlike many other classification methods, they don't rely on the "Euclidean Distance" (distance formula) between data points, so feature scaling is not needed for decision trees.

We're also not planning on plotting the Random Forest classifier, so that's another reason not to feature scale.
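
A quick way to check the scaling claim is to fit the same tree on raw and on standardized features and compare the predictions. The sketch below uses randomly generated stand-ins for Age and EstimatedSalary (not the real data set), and the labeling rule is made up for illustration. The two trees are expected to agree, apart from possible floating-point edge cases.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# hypothetical stand-ins for the Age and EstimatedSalary columns
age = rng.uniform(18, 60, size=300)
salary = rng.uniform(15_000, 150_000, size=300)
x_demo = np.column_stack([age, salary])
y_demo = ((age > 40) | (salary > 90_000)).astype(int)  # made-up labeling rule

x_fit, x_new = x_demo[:200], x_demo[200:]
y_fit = y_demo[:200]
scaler = StandardScaler().fit(x_fit)

# same tree settings, once on raw features and once on standardized features
raw_tree = DecisionTreeClassifier(random_state=0).fit(x_fit, y_fit)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(x_fit), y_fit)

agreement = np.mean(
    raw_tree.predict(x_new) == scaled_tree.predict(scaler.transform(x_new))
)
print("share of identical predictions:", agreement)
```

Standardizing shifts and rescales each column but keeps the order of its values, so the same threshold splits remain available to the tree.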

# Random Forest Classifier

In [12]:
# import the random forest classifier class
from sklearn.ensemble import RandomForestClassifier

In [13]:
"""
create a random forest classifier, then fit to the training set
- n_estimators is the number of decision trees to use
- set the criterion used to choose each split to information entropy (the default is Gini impurity)
- set the random_state (seed) to 0
"""
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
classifier.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [14]:
# predict the test set results
y_pred = classifier.predict(x_test)

y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1])

# Confusion Matrix

In [15]:
# import the confusion matrix function
from sklearn.metrics import confusion_matrix

In [18]:
# create a confusion matrix that compares the y_test (actual) to the y_pred (prediction)
cm = confusion_matrix(y_test, y_pred)

"""
Read the Confusion Matrix diagonally (rows are the actual classes, columns are the predicted classes):
63 + 29 = 92 correct predictions (main diagonal)
5 + 3 = 8 incorrect predictions (off-diagonal)
"""
cm

array([[63,  5],
       [ 3, 29]])
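
The counts in the confusion matrix can be turned into summary metrics. The short sketch below re-enters the matrix by hand so it runs on its own, and assumes class 1 (Purchased) is the "positive" class. The precision and recall formulas are the standard definitions, not something computed earlier in this notebook.

```python
import numpy as np

# confusion matrix from above: rows = actual class, columns = predicted class
cm = np.array([[63, 5],
               [3, 29]])

# for a 2x2 matrix, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # 92 / 100
precision = tp / (tp + fp)       # 29 / 34
recall = tp / (tp + fn)          # 29 / 32

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# prints: accuracy=0.92, precision=0.85, recall=0.91
```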