# Random Forest
### What is a random forest?
A random forest builds on decision trees. A decision tree is a single tree-shaped graph which takes in inputs and produces an output. A random forest is a collection of decision trees (a forest). Each decision tree is trained independently on a slightly different (random) view of the training data: a bootstrap sample of the rows (drawn with replacement) and a random subset of the features considered at each split. Random forest is a "bagging" (bootstrap aggregating) algorithm, which means that at prediction time every decision tree is given the same input, the trees may produce different outputs, and those outputs are aggregated to produce the model's decision. For categorical (classification) models, the majority decision wins. For quantitative (regression) models, the outputs are averaged.

![A visual representation of a random forest](assets/random_forest.png)
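
The two ingredients of bagging described above (random resampling of the training rows, then aggregating the trees' outputs) can be sketched in plain Python. This is only an illustration under stated assumptions, not how scikit-learn implements it: the helper names `bootstrap_sample`, `majority_vote`, and `average` are hypothetical, and the "tree outputs" are hard-coded stand-ins rather than real trained trees.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return rng.choices(rows, k=len(rows))


def majority_vote(predictions):
    """Classification: the most common label among the trees wins."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Regression: the forest's output is the mean of the trees' outputs."""
    return mean(predictions)


rng = random.Random(0)
rows = list(range(10))  # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)
# Same size as the original; some rows repeat and some are left out.
print(len(sample), sorted(sample))

tree_votes = ["apple", "orange", "apple", "apple", "orange"]  # hypothetical outputs of 5 trees
print(majority_vote(tree_votes))  # apple
print(average([2.0, 3.0, 4.0]))   # 3.0
```

The rows a given tree never sees in its bootstrap sample are its "out of bag" rows, which is where the `oob_score` option used later in the exercise gets its name.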

### Why is random forest useful?
ML models are inherently flawed. They get things wrong. One way they get things wrong is "overfitting". When a model overfits, it gets very good at predicting the data in its training set, but does so via quirks in the training data rather than by learning the patterns that actually matter. For example, if a model is trained to tell apples from oranges, and all apples in the training set have sticker labels on them while none of the oranges do, the model might overfit to detect the sticker label. On the training set, it would classify nearly 100% correctly, but given a test set with apples that don't have stickers or oranges that do have stickers, it would get all of those predictions wrong. A single decision tree is especially prone to this. Because each tree in a random forest sees different data and features, the trees tend to overfit in different ways, and aggregating their outputs cancels out much of that error.
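
The sticker example can be made concrete with a toy sketch. Everything here is hypothetical: the two hand-written "models" (`sticker_rule` and `colour_rule`) stand in for what a trained model might have learned, and the tiny datasets are invented to match the scenario described above.

```python
# Each fruit is a dict of features plus its true label (invented toy data).
# In the training set, every apple has a sticker and no orange does.
train = (
    [{"sticker": True, "orange_coloured": False, "label": "apple"}] * 5
    + [{"sticker": False, "orange_coloured": True, "label": "orange"}] * 5
)
# In the test set, the sticker quirk is reversed.
test = (
    [{"sticker": False, "orange_coloured": False, "label": "apple"}] * 5
    + [{"sticker": True, "orange_coloured": True, "label": "orange"}] * 5
)


def sticker_rule(fruit):
    """An overfit 'model': it learned the sticker quirk, not the fruit."""
    return "apple" if fruit["sticker"] else "orange"


def colour_rule(fruit):
    """A 'model' that learned a feature which actually describes the fruit."""
    return "orange" if fruit["orange_coloured"] else "apple"


def accuracy(rule, data):
    """Fraction of fruits the rule labels correctly."""
    return sum(rule(fruit) == fruit["label"] for fruit in data) / len(data)


print(accuracy(sticker_rule, train), accuracy(sticker_rule, test))  # 1.0 0.0
print(accuracy(colour_rule, train), accuracy(colour_rule, test))    # 1.0 1.0
```

Both rules look perfect on the training set; only the held-out test set reveals which one generalizes. This is why the exercise below splits the data into train and test sets.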

### How to use random forest?
Random forest is used in a very similar way to decision trees. It is trained in the same way, and inference works the same way; it is a black box which takes in input and gives an output. The difference is in initialization. There are more settings to control with random forest: the number of trees, the type of randomization, etc.
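
As a rough sketch of that similarity, the snippet below trains a single decision tree and a random forest with the same `fit` / `score` calls. It assumes scikit-learn is installed and uses the bundled iris dataset (not the digits dataset from the exercise) purely for illustration; the parameter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small bundled dataset, split into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Initialization is the only place the two models differ:
# the forest takes extra settings such as the number of trees.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)

# Training and inference use identical calls for both models.
for name, clf in [("tree", tree), ("forest", forest)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))

# After fitting, the forest exposes its individual trees.
print(len(forest.estimators_))
```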

# Random Forest Exercise

In [None]:
# First, import libraries
# Remember, if any of these fail to import, they need to be
# installed in the terminal via 'pip install PACKAGENAME'
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# XXX Import the load_digits dataset
from XXX import XXX

In [None]:
# Next, load the dataset
#XXX load the digits dataset with load_digits()
digits = XXX

x = digits.data
y = digits.target

# Pandas is a library that makes it easy to display data in
# a readable format. To use it, create a DataFrame. The DataFrame
# can be fed data (numbers) and columns (labels).
data = pd.DataFrame(data=digits.data, columns=digits.feature_names)
display(data)
display(x)
display(y)

In [None]:
# Next, split the data into train and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# .shape shows (rows, columns); np.size would count every element instead.
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
# Next, scale the data. StandardScaler removes the mean and scales to unit variance.
# It is fit on the training data only, then applied unchanged to the test data.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [None]:
# Next, initialize the model.
# There are many parameters which you could pass to RandomForestClassifier.
# You should try messing around with different parameters. Here are some options:
# random_state seeds the randomness used to build the trees, so results are repeatable.
# n_jobs determines how many jobs run in parallel. Default is None (usually 1). -1 means use all processors.
# min_samples_leaf is the minimum number of training samples that must end up in each leaf.
# max_depth determines how many layers of logic an individual decision tree can have.
# n_estimators determines how many decision trees are in the forest.
# oob_score=True scores each tree on the training samples that were left out of its bootstrap sample,
# which gives a validation estimate without setting aside extra data.
# oob stands for out of bag, and is a little complicated. You can read about it at https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710.
# You can read about the other options at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

model = RandomForestClassifier(random_state=2,
                               n_jobs=-1,
                               min_samples_leaf=100,
                               max_depth=5,
                               n_estimators=50)


In [None]:
# Next, train the model.

# Jupyter Notebook has "magic" statements which perform small utility tasks.
# If you start a line with %time, it will measure the time to execute
# that line. You can do the same thing for a whole cell by putting
# %%time at the start of the cell.
%time model = model.fit(x_train, y_train)

In [None]:
# Test the model

# Predict labels for the held-out test set and compare them to the true labels.
y_pred = model.predict(x_test)
print(f"The model has an accuracy of {accuracy_score(y_test, y_pred)*100:.3f}%")

In [None]:
# This accuracy score is really bad. Let's try to make it better.
# Hyperparameter tuning is the process of figuring out better
# ML model parameters by running a lot of models and seeing which
# performs best.

hyperparam_model = RandomForestClassifier(n_jobs=-1)
params = {
    'random_state':[2,40,152,9836],
    'min_samples_leaf':[20,50,100],
    'max_depth':[3,5,10,20],
    'n_estimators':[50,100,500],
}

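# The grid search will try every combination of the parameters and find the best.
# Here that is 4 * 3 * 4 * 3 = 144 combinations, each fit 4 times (cv=4).
# This took 1 minute on my 8 core machine. Yours may take longer.
# If you see a message that says "The least populated class has only 1 members",
# it comes from the stratified cross-validation split (a class has fewer
# samples than there are folds), not from the n_jobs setting.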
grid_search = GridSearchCV(estimator=hyperparam_model,
                           param_grid=params,
                           n_jobs=-1,
                           verbose=5,
                           cv=4,
                           scoring="accuracy")

%time grid_search.fit(x_train, y_train)
print(f"The best model's mean cross-validated accuracy is {grid_search.best_score_}.")
print(f"The best model is {grid_search.best_estimator_}")