##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Random Forests

In this lab, we use the same dataset as in the last lab, but we train a random forest instead of a single decision tree.

We will also use a parameter grid and cross-validation to perform hyperparameter tuning, as well as export our final model.

The code below is taken from the last lab to set up our data transformations.

[Random Forest Scala Docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor)

[Random Forest Python Docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor)

[Spark ML Guide](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression)

In [2]:
%run "../includes/setup_env"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading the data

Let's load and cache the data once more.

In [5]:
df = spark.read.parquet("dbfs:/FileStore/tables/preprocessed").cache()
display(df)

In [6]:
from datetime import datetime
from pyspark.sql.functions import col, hour

# Split by time: train on rows before 2015-10-01, test on rows after 2015-10-15.
# The commented-out `hour` filter would keep only every third hour of the training data.
df_train = df.filter((col('datetime') < datetime(2015, 10, 1))) # & (hour(col('datetime')) % 3 == 0))
df_test = df.filter(col('datetime') > datetime(2015, 10, 15))

In [7]:
display(df_train)

In [8]:
df_train = df_train.drop("y_1","y_2","y_3","datetime", "machineID")
df_train = df_train.withColumnRenamed("y_0", "error")
df_train.cache()

df_test = df_test.drop("y_1","y_2","y_3","datetime", "machineID")
df_test = df_test.withColumnRenamed("y_0", "error")
df_test.cache()

In [9]:
display(df_train.describe())

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Random Forests

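A random forest is an ensemble of decision trees, each trained on a random sample of the rows and features. Combining many such trees usually generalizes better than a single decision tree, because their individual errors tend to average out.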
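To make the ensemble intuition concrete, here is a minimal plain-Python sketch (not Spark code). It assumes three hypothetical "trees", each wrong on a different example, and combines them with a simple majority vote. The labels, the prediction lists, and the `accuracy` and `majority_vote` helpers are illustrative assumptions, not part of the lab's dataset or of Spark's API.

```python
from collections import Counter

# Hypothetical ground-truth labels for six examples (not the lab's data).
labels = [0, 1, 1, 0, 1, 0]

# Predictions from three hypothetical weak "trees"; each one is wrong on
# a different example, so their mistakes do not overlap.
tree_predictions = [
    [1, 1, 1, 0, 1, 0],  # wrong on example 0
    [0, 1, 0, 0, 1, 0],  # wrong on example 2
    [0, 1, 1, 0, 1, 1],  # wrong on example 5
]

def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_vote(predictions_per_tree):
    """Combine per-tree predictions by taking the most common vote per example."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_tree)]

ensemble = majority_vote(tree_predictions)
single_tree_accuracies = [accuracy(p, labels) for p in tree_predictions]
ensemble_accuracy = accuracy(ensemble, labels)

print(single_tree_accuracies)
print(ensemble_accuracy)
```

Each individual tree gets five of six examples right, while the vote recovers all six because no two trees make the same mistake. Real forests only approximate this ideal: the more correlated the trees' errors are, the less the ensemble helps.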

Let's take a look at all the hyperparameters we could change in [RandomForestClassifier](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestClassifier).

In [11]:
from pyspark.ml.classification import RandomForestClassifier

rf = (RandomForestClassifier()
      .setLabelCol("error")
      .setFeaturesCol("norm_features")
      .setSeed(27))

print(rf.explainParams())

### Hands-on lab
Try changing the values of `numTrees` and `maxDepth` to any values you like.

HINT: Take a look at the docs

In [13]:
# maximize this cell (click the + button on the right) to see the solution:
  
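# Use the setters: assigning to rf.numTrees / rf.maxDepth directly would
# overwrite the Param objects that ParamGridBuilder needs later on.
rf.setNumTrees(100)
rf.setMaxDepth(10)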

### End of lab

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Pipeline

Now that we have all of the feature transformations and estimators set up, let's put all of the stages together in the pipeline.

In [16]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [rf])

pipeline.getStages()

To see which parameters a stage in the pipeline takes, extract its parameter map:

In [18]:
pipeline.getStages()[0].extractParamMap()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) ParamGrid

There are a lot of hyperparameters we could tune, and configuring them manually would take a long time.

Instead of a manual (ad-hoc) approach, let's use Spark's [ParamGridBuilder](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.ParamGridBuilder) to define the candidate hyperparameter values systematically; cross-validation will then pick the best combination.

In this example notebook, we keep these trees shallow and use a relatively small number of trees. Let's define a grid of hyperparameters to test:
  - maxDepth: max depth of each decision tree in the RF ensemble (Use the values `2, 5, 10`)
  - numTrees: number of trees in each RF ensemble (Use the values `10, 50`)

`addGrid()` accepts the name of the parameter (e.g. `rf.maxDepth`), and a list of the possible values (e.g. `[2, 5, 10]`).
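Conceptually, the builder expands the lists into every combination of values. The following plain-Python sketch (an illustration only, not how Spark implements it) reproduces that expansion for the values above; the `grid` dictionary and `param_maps` list are names chosen for this example.

```python
from itertools import product

# The same candidate values as in the grid described above.
grid = {
    "maxDepth": [2, 5, 10],
    "numTrees": [10, 50],
}

# Cartesian product of all candidate values: one dict per combination.
names = list(grid)
param_maps = [dict(zip(names, values)) for values in product(*grid.values())]

for param_map in param_maps:
    print(param_map)

print(len(param_maps))
```

Three depths times two forest sizes gives six combinations, so every value added to a grid multiplies the number of models that have to be trained.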

In [20]:
# maximize this cell (click the + button on the right) to see the solution:

from pyspark.ml.tuning import ParamGridBuilder

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 5, 10])
             .addGrid(rf.numTrees, [10, 50])
             .build())

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Cross Validation

We are also going to use 3-fold cross-validation to identify the optimal `maxDepth` and `numTrees` combination.

![crossValidation](https://files.training.databricks.com/images/301/CrossValidation.png)

With 3-fold cross-validation, we train on 2/3 of the data, and evaluate with the remaining (held-out) 1/3. We repeat this process 3 times, so each fold gets the chance to act as the validation set. We then average the results of the three rounds.
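The procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: `k_fold_indices` is a hypothetical helper, the nine row indices stand in for a dataset, and the per-round scores are made up. Spark assigns rows to folds in its own way, so its fold sizes may differ slightly.

```python
import random

def k_fold_indices(n_rows, k, seed):
    """Shuffle row indices with a fixed seed and deal them into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_rows=9, k=3, seed=27)

splits = []
for held_out in range(len(folds)):
    validation = folds[held_out]
    # Train on every fold except the one held out for validation.
    train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    splits.append((train, validation))
    print("train:", sorted(train), "validation:", sorted(validation))

# Hypothetical per-round scores, only to show the averaging step.
round_scores = [0.80, 0.90, 0.85]
average_score = sum(round_scores) / len(round_scores)
print(average_score)
```

Every row lands in the validation set exactly once. With the grid of three `maxDepth` values and two `numTrees` values, this means 3 × 2 × 3 = 18 model fits during the search.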

We pass in the `estimator` (pipeline), `evaluator`, and `estimatorParamMaps` to [CrossValidator](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator) so that it knows:
- Which model to use
- How to evaluate the model
- What hyperparameters to set for the model

We can also set the number of folds we want to split our data into (3), as well as setting a seed so we all have the same split in the data.

In [23]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

from pyspark.ml.tuning import CrossValidator

evaluator = (BinaryClassificationEvaluator()
             .setLabelCol("error")
             .setRawPredictionCol("rawPrediction")
             .setMetric('accuracy'))

cv = (CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
      .setSeed(27))

### Hands-on lab

Depending on your Runtime version, you may have received an error message when running the previous cell. Can you use the PySpark documentation to fix the code above so that it runs in your Runtime version?

Start with:
> help(BinaryClassificationEvaluator)

### End of lab

In [25]:
cvModel = cv.fit(df_train)

Let's pair each hyperparameter configuration with its average cross-validation metric to see which one performed best.

In [27]:
list(zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics))
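The list above is easier to read once it is ranked. Here is a minimal plain-Python sketch of that ranking step, assuming a metric where larger is better. The `results` values are made up for illustration and are not output from this lab; in the real list the dictionary keys are Spark `Param` objects rather than strings.

```python
# Hypothetical (parameter map, average metric) pairs shaped like the
# output of the cell above; the numbers are made up for illustration.
results = [
    ({"maxDepth": 2, "numTrees": 10}, 0.71),
    ({"maxDepth": 2, "numTrees": 50}, 0.73),
    ({"maxDepth": 5, "numTrees": 10}, 0.78),
    ({"maxDepth": 5, "numTrees": 50}, 0.81),
    ({"maxDepth": 10, "numTrees": 10}, 0.79),
    ({"maxDepth": 10, "numTrees": 50}, 0.80),
]

# Sort from best to worst, assuming a metric where larger is better.
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
best_params, best_metric = ranked[0]

print(best_params, best_metric)
```

The same pattern applies to the real `zip(...)` output; `cvModel.bestModel` already holds the model refit with the winning configuration.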

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Save Model

Let's save our model by writing it out. 

**NOTE:** We cannot save a pipeline model with a cross-validation step in Python. Instead, we have to save the best pipeline model itself.

Also, `save()` does not overwrite an existing directory. Here we recursively delete the directory first; alternatively, `write().overwrite().save(path)` replaces an existing model.

In [29]:
path = "/tmp/random_forest_pipeline"

modelPath = userhome + path
dbutils.fs.rm(modelPath, recurse=True)

cvModel.bestModel.save(modelPath)

Let's load the saved model back in.

In [31]:
from pyspark.ml.pipeline import PipelineModel

savedPipelineModel = PipelineModel.load(modelPath)

Let's apply the trained model to the test data.

In [33]:
df_pred = savedPipelineModel.transform(df_test)
display(df_pred.select("error", "prediction"))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Evaluate

Let's see how well we did on the test set.

In [35]:
def printEval(df, labelCol = "error", rawPredictionCol = "rawPrediction"):
  # AUROC is a ranking metric, so evaluate the model's raw scores
  # rather than the hard 0/1 values in the "prediction" column.
  evaluator = BinaryClassificationEvaluator()
  evaluator.setLabelCol(labelCol)
  evaluator.setRawPredictionCol(rawPredictionCol)

  auroc = evaluator.setMetricName("areaUnderROC").evaluate(df)
  print("AUROC: {}".format(auroc))

In [36]:
printEval(df_pred)
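To interpret the number, it helps to recall that AUROC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The plain-Python sketch below computes it that way for a tiny example. The `auroc` helper, the labels, and the scores are assumptions made for illustration; they are not the lab's data and not Spark's implementation.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and model scores (not taken from the lab's data).
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.6, 0.8, 0.9]

# Hard 0/1 predictions obtained by thresholding the scores at 0.5.
hard_predictions = [1.0 if s >= 0.5 else 0.0 for s in scores]

auroc_from_scores = auroc(labels, scores)
auroc_from_hard = auroc(labels, hard_predictions)
print(auroc_from_scores)
print(auroc_from_hard)
```

In this example the continuous scores give 8/9 while the thresholded predictions give 6/9, because thresholding discards the ordering within each predicted class. An AUROC of 0.5 corresponds to random ranking and 1.0 to a perfect one.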

#### Improving our model

You are not done yet!  There are several ways we could further improve our model:
* **Expert knowledge** 
* **Better tuning** 
* **Feature engineering**

As an exercise: replace the random forest with a gradient-boosted tree model (for example, `GBTClassifier`), and vary the number and depth of the trees. What do you find?

*Good luck!*