# Lesson 36 - Random Forests

## Prepare Environment

In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


spark = SparkSession.builder.getOrCreate()

## Random Forests

An **ensemble** model is one that generates its predictions by combining the predictions from several simpler models. An ensemble often has better predictive performance than any of the individual models from which it is built.

A **random forest** is an ensemble of many decision trees. To ensure that the trees differ from one another, each tree is trained on a different sample of the training set. Assume that our training set contains n observations. For each tree in the forest, a sample of size n is drawn from the training set with replacement. Such a sample is referred to as a **bootstrap** sample. Because sampling is done with replacement, some observations appear more than once in a given sample while others are left out entirely, so each tree sees a somewhat different version of the data. Random forests also typically consider only a random subset of the features at each split, which makes the trees differ further.
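
The bootstrap step can be sketched in plain Python. This is only an illustration of sampling with replacement, not a description of how Spark draws its samples internally; the helper name `bootstrap_sample` and the toy row ids are assumptions made for the example.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(1)        # fixed seed so the sketch is repeatable
rows = list(range(10_000))    # stand-in for the row ids of a training set
sample = bootstrap_sample(rows, rng)

# Fraction of the original rows that appear at least once in the sample.
unique_fraction = len(set(sample)) / len(rows)
print(len(sample), round(unique_fraction, 3))
```

For large n, roughly 1 - 1/e (about 63%) of the distinct observations are expected to appear in any one bootstrap sample, which is why two trees rarely see the same data.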

When generating predictions with a random forest, the observations are provided to each of the individual trees, which will generate their own predictions. The final classifications generated by the forest are then obtained by allowing the trees to vote on the correct classification. 
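
The voting step can be sketched as follows. This shows plain majority voting on hypothetical per-tree predictions; the function `majority_vote` and the toy labels are assumptions for illustration, and a library implementation may instead combine per-tree class probabilities.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return, for each observation, the class predicted by the most trees.

    tree_predictions holds one list per tree, each with one label per observation.
    Ties go to the label that was encountered first among the tied labels.
    """
    votes_per_observation = zip(*tree_predictions)
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_observation]

# Hypothetical predictions from three trees for three observations.
tree_predictions = [
    ['Ideal',   'Good', 'Premium'],
    ['Ideal',   'Fair', 'Premium'],
    ['Premium', 'Good', 'Ideal'],
]

forest_predictions = majority_vote(tree_predictions)
print(forest_predictions)   # ['Ideal', 'Good', 'Premium']
```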

Each tree in the forest will likely overfit the training data, but in its own idiosyncratic way. When the trees' predictions are combined, these idiosyncrasies tend to be outvoted, so the forest usually generates better predictions than its individual trees.
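
A rough calculation illustrates why outvoting helps. The sketch below assumes an idealized setting that real forests do not satisfy exactly: a two-class problem in which every tree is correct with the same probability and the trees err independently. The function name `majority_accuracy` and the value 0.6 are hypothetical.

```python
from math import comb

def majority_accuracy(num_voters, p):
    """Probability that more than half of num_voters independent voters,
    each correct with probability p, are correct (num_voters should be odd)."""
    needed = num_voters // 2 + 1
    return sum(
        comb(num_voters, k) * p**k * (1 - p)**(num_voters - k)
        for k in range(needed, num_voters + 1)
    )

single = majority_accuracy(1, 0.6)    # one tree on its own
forest = majority_accuracy(25, 0.6)   # majority vote of 25 such trees
print(round(single, 3), round(forest, 3))
```

Under these assumptions the majority vote is correct noticeably more often than any single voter. Trees in a real forest are trained on overlapping data and are therefore correlated, so the gain in practice is smaller than this idealized figure.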

## Load and Explore Data

To demonstrate the construction and application of random forest models in PySpark, we will use the [Diamonds dataset](https://ggplot2.tidyverse.org/reference/diamonds.html).

In [0]:
diamonds = (
    spark.read
    .option('delimiter', '\t')
    .option('header', True)
    .schema(
        'carat DOUBLE, cut STRING, color STRING, clarity STRING, depth DOUBLE, '
        'table DOUBLE, price INTEGER, x DOUBLE, y DOUBLE, z DOUBLE'
    )
    .csv('/FileStore/tables/diamonds.txt')
)

diamonds.printSchema()

In [0]:
diamonds.show(10)

In [0]:
N = diamonds.count()
print(N)

In [0]:
diamonds = diamonds.select('*', expr('LOG(carat) AS ln_carat'), expr('LOG(price) AS ln_price'))
diamonds.show()

### Distribution of Label Values

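To serve as a baseline against which we can compare our model, we will check the distribution of the label values. A model that always predicts the most common value of `cut` would achieve an accuracy equal to that value's proportion, so a useful model should score higher than this.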

In [0]:
(
    diamonds
    .select('cut')
    .groupby('cut')
    .agg(
        expr('COUNT(*) as count'), 
        expr(f'ROUND(COUNT(*)/{N},4) as prop')
    )
    .show()
)

### Numerical and Categorical Features

We need to create lists specifying the names of our numerical features and our categorical features.

In [0]:
num_features = ['ln_carat', 'ln_price', 'x', 'y', 'z', 'depth', 'table']
cat_features = ['color', 'clarity']

### Preprocessing Pipeline

We will now create stages to be used in a preprocessing pipeline. Since random forests are constructed from decision trees, which do not require one-hot encoding of categorical variables, we only need to perform an integer encoding of these variables.
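
To make the idea of integer encoding concrete, here is a pure-Python sketch that mimics what I understand to be the default ordering of `StringIndexer`: the most frequent value receives index 0, and values with equal frequency are ordered alphabetically. The function `frequency_index` and the toy column are hypothetical and are not part of PySpark.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to an index, most frequent value first.

    Values with equal frequency are ordered alphabetically. Indices are
    floats because StringIndexer writes its output as doubles.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

colors = ['G', 'E', 'G', 'F', 'E', 'G', 'D']   # hypothetical toy column
mapping = frequency_index(colors)
encoded = [mapping[c] for c in colors]

print(mapping)   # {'G': 0.0, 'E': 1.0, 'D': 2.0, 'F': 3.0}
print(encoded)   # [0.0, 1.0, 0.0, 3.0, 1.0, 0.0, 2.0]
```

The resulting integers carry no meaningful order. A tree only uses them to split observations into groups, which is why this encoding is sufficient here.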

In [0]:
ix_features = [c + '_ix' for c in cat_features]

label_indexer = StringIndexer(inputCol='cut', outputCol='label')

feature_indexer = StringIndexer(inputCols=cat_features, outputCols=ix_features)

assembler = VectorAssembler(inputCols=num_features + ix_features, outputCol='features')

In [0]:
preprocessor = Pipeline(stages=[label_indexer, feature_indexer, assembler]).fit(diamonds)
train = preprocessor.transform(diamonds)
train.persist()
train.select(['features']).show(10, truncate=False)

### Evaluator

We will create an accuracy evaluator for use in scoring our models.

In [0]:
accuracy_eval = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label', metricName='accuracy')

## Grid Search for Random Forest

As with decision trees, we will typically use grid search to tune the hyperparameters `maxDepth` and `minInstancesPerNode`. You could also tune the `numTrees` hyperparameter, but this is not always practical. Since a random forest consists of many tree models, it is expensive to train, and a grid search with cross-validation is more expensive still because it trains one forest for every combination of hyperparameter values and fold. As a result, unless we have significant computing resources available, we will need to pick a single value of `numTrees`. Generally speaking, more trees tend to produce better performance, but take longer to train.

In [0]:
rforest = RandomForestClassifier(featuresCol='features', labelCol='label', numTrees=20, seed=1)

param_grid = (ParamGridBuilder()
              .addGrid(rforest.maxDepth, [14, 16, 18, 20, 22] )
              .addGrid(rforest.minInstancesPerNode, [2, 4, 8, 16])
              ).build()

cv = CrossValidator(estimator=rforest, estimatorParamMaps=param_grid, numFolds=5, 
                    evaluator=accuracy_eval, seed=1, parallelism=6)

cv_model = cv.fit(train)

opt_model = cv_model.bestModel
opt_maxDepth = opt_model.getMaxDepth()
opt_minInstancesPerNode = opt_model.getMinInstancesPerNode()

print('Max CV Score:  ', round(max(cv_model.avgMetrics),4))
print('Optimal Depth:  ', opt_maxDepth)
print('Optimal MinInst:', opt_minInstancesPerNode)
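
The cost of the search above can be estimated with some quick arithmetic. The sketch below reuses the grid values, fold count, and `numTrees` from the cell above, and assumes that the cross-validator fits one forest per parameter combination per fold and then refits the best combination once on the full training set.

```python
from itertools import product

max_depths = [14, 16, 18, 20, 22]
min_instances = [2, 4, 8, 16]
num_folds = 5
num_trees = 20

grid = list(product(max_depths, min_instances))   # every hyperparameter combination
cv_fits = len(grid) * num_folds                   # forests trained during cross-validation
total_fits = cv_fits + 1                          # plus one final refit of the best combination
total_trees = total_fits * num_trees              # individual decision trees trained overall

print(len(grid), cv_fits, total_fits, total_trees)   # 20 100 101 2020
```

Adding even three candidate values for `numTrees` would triple the number of forests to train, which is why the tree count is held fixed here.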

In [0]:
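model_params = cv_model.getEstimatorParamMaps()

# Collect one row per hyperparameter combination: maxDepth, minInst, mean CV accuracy.
# Values are looked up by Param rather than by relying on dictionary order.
cv_summary_list = []
for param_set, acc in zip(model_params, cv_model.avgMetrics):
    cv_summary_list.append(
        [param_set[rforest.maxDepth], param_set[rforest.minInstancesPerNode], acc])

cv_summary = pd.DataFrame(cv_summary_list, columns=['maxDepth', 'minInst', 'acc'])

# Plot mean CV accuracy against maxDepth, with one line per minInstancesPerNode value.
for en in cv_summary.minInst.unique():
    sel = cv_summary.minInst == en
    plt.plot(cv_summary.maxDepth[sel], cv_summary.acc[sel], label=en)
    plt.scatter(cv_summary.maxDepth[sel], cv_summary.acc[sel])
plt.xlabel('maxDepth')
plt.ylabel('Mean CV accuracy')
plt.legend(title='minInstancesPerNode')
plt.show()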