Databricks Runtime for Machine Learning includes:
1) [Hyperopt](https://github.com/hyperopt/hyperopt): A Python library for ML hyperparameter tuning
2) [Apache Spark MLlib](https://spark.apache.org/docs/latest/ml-guide.html): A library of distributed algorithms for training ML models (also called "Spark ML")

**In this notebook we will learn to use them together:** The goal is to tune hyperparameters for distributed ML workloads written in Python.

This notebook includes two sections:
* **Part 1: Run distributed training using MLlib:** In this section we will do the MLlib model training without hyperparameter tuning
* **Part 2: Use Hyperopt to tune hyperparameters in the distributed training workflow:** Here we will wrap the MLlib code with Hyperopt for tuning

# Part 1: Run distributed training using MLlib

## Load data
- **MNIST handwritten digit recognition dataset:** A classic dataset of handwritten digits that is commonly used for training and benchmarking ML algorithms
- It consists of 60,000 training images and 10,000 test images, each of which is a 28x28 pixel grayscale image of a handwritten digit
- The digits in the dataset are labeled from 0 to 9, and the task is to classify a given image as one of these 10 classes
- It is stored in the popular LibSVM dataset format, so we will load it using MLlib's LibSVM dataset reader
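
For readers unfamiliar with LibSVM, each line of the text file stores one example as a label followed by sparse `index:value` pairs. The following is a minimal pure-Python sketch of that layout, not the reader Spark uses; the helper name `parse_libsvm_line` and the sample line are made up for illustration.

```python
def parse_libsvm_line(line, num_features):
    """Parse one LibSVM-formatted line into (label, dense_feature_list).

    Hypothetical helper for illustration only; Spark's "libsvm" reader
    produces a sparse vector column instead of a Python list.
    """
    parts = line.split()
    label = float(parts[0])
    features = [0.0] * num_features
    for pair in parts[1:]:
        index, value = pair.split(":")
        # Indices in LibSVM files are 1-based, so shift to 0-based positions.
        features[int(index) - 1] = float(value)
    return label, features


# A made-up example: digit 5 with three non-zero pixels out of 28 * 28 = 784.
sample_line = "5 153:3 154:18 155:126"
label, features = parse_libsvm_line(sample_line, num_features=28 * 28)
nonzero = sum(1 for v in features if v != 0.0)
print(label, len(features), nonzero)
```

Only the non-zero pixels are written out, which keeps the file compact for mostly blank images.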

In [0]:
full_training_data = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-train.txt")
test_data = spark.read.format("libsvm").load("/databricks-datasets/mnist-digits/data-001/mnist-digits-test.txt")

# Cache data for multiple uses
full_training_data.cache()
test_data.cache()

print(f"There are {full_training_data.count()} training images and {test_data.count()} test images.")

In [0]:
# Randomly split full_training_data into training and validation sets for tuning
training_data, validation_data = full_training_data.randomSplit([0.8, 0.2], seed=42)

In [0]:
display(training_data)

In [0]:
display(validation_data)

In [0]:
display(test_data)

## Create a function to train a model

We will define a function to train a decision tree. Wrapping the training code in a function is important for passing the function to Hyperopt for tuning later.

**Details:** The tree algorithm needs to know that the labels are categories 0-9 rather than continuous values. This example uses the `StringIndexer` class to mark them as categorical. A `Pipeline` ties this preprocessing step together with the tree algorithm. ML Pipelines are the tools Spark provides for piecing machine learning algorithms together into workflows.
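
One detail worth knowing: with its default settings, `StringIndexer` assigns indices by descending label frequency, so `indexedLabel` generally does not equal the digit itself. The sketch below imitates that ordering in plain Python on made-up labels; `frequency_index` is a hypothetical helper, not part of Spark.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first.

    Hypothetical stand-in that mimics StringIndexer's default
    "frequencyDesc" ordering; this toy input has no frequency ties.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: -counts[lab])
    return {lab: float(i) for i, lab in enumerate(ordered)}


# Made-up labels: 7 appears three times, 1 twice, 4 once.
toy_labels = [7.0, 1.0, 7.0, 4.0, 1.0, 7.0]
mapping = frequency_index(toy_labels)
indexed = [mapping[lab] for lab in toy_labels]
print(mapping)
print(indexed)
```

Because the same fitted pipeline indexes both the training and the evaluation data, this remapping should not change the F1 score; it matters only if you read the `prediction` column as digits.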

In [0]:
import mlflow

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer

In [0]:
# MLflow autologging for `pyspark.ml` requires MLflow version 1.17.0 or above.
# This try-except logic allows the notebook to run with older versions of MLflow.
try:
  import mlflow.pyspark.ml
  mlflow.pyspark.ml.autolog()
except Exception:
  print(f"Your version of MLflow ({mlflow.__version__}) does not support pyspark.ml for autologging. To use autologging, upgrade your MLflow client version or use Databricks Runtime for ML 8.3 or above.")

In [0]:
def train_tree(minInstancesPerNode, maxBins):
  '''
  This train_tree() function:
   - takes hyperparameters as inputs (for tuning later)
   - returns the fitted pipeline model and its F1 score on the validation dataset

  Wrapping code as a function makes it easier to reuse the code later with Hyperopt.
  '''
  # Use MLflow to track training.
  # Specify "nested=True" since this single model will be logged as a child run of Hyperopt's run.
  with mlflow.start_run(nested=True):
    
    # StringIndexer: Read input column "label" (digits) and annotate them as categorical values.
    indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
    
    # DecisionTreeClassifier: Learn to predict column "indexedLabel" using the "features" column.
    dtc = DecisionTreeClassifier(labelCol="indexedLabel",
                                 minInstancesPerNode=minInstancesPerNode,
                                 maxBins=maxBins)
    
    # Chain indexer and dtc together into a single ML Pipeline.
    pipeline = Pipeline(stages=[indexer, dtc])
    model = pipeline.fit(training_data)

    # Define an evaluation metric and evaluate the model on the validation dataset.
    evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", metricName="f1")
    predictions = model.transform(validation_data)
    validation_metric = evaluator.evaluate(predictions)
    mlflow.log_metric("val_f1_score", validation_metric)

  return model, validation_metric

## Train a decision tree classifier

In [0]:
initial_model, val_metric = train_tree(minInstancesPerNode=200, maxBins=2)
print(f"The trained decision tree achieved an F1 score of {val_metric} on the validation data")

# Part 2: Use Hyperopt to tune hyperparameters

In this section, you create the Hyperopt workflow. 
* Define a function to minimize
* Define a search space over hyperparameters
* Specify the search algorithm and use `fmin()` to tune the model

For more information about the Hyperopt APIs, see the [Hyperopt documentation](http://hyperopt.github.io/hyperopt/).

## Define a function to minimize

* Input: hyperparameters
* Internally: Reuse the training function defined above.
* Output: loss
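
The contract can be tried without Spark. The sketch below is an illustration under stated assumptions: `fake_train` is an invented stand-in for the real training function, and the two status strings are written out to mirror Hyperopt's `STATUS_OK` and `STATUS_FAIL` constants rather than imported.

```python
# String values mirroring Hyperopt's STATUS_OK and STATUS_FAIL constants.
STATUS_OK = "ok"
STATUS_FAIL = "fail"


def fake_train(min_instances_per_node, max_bins):
    """Hypothetical stand-in for the real training function.

    Returns a made-up "higher is better" score so the sketch runs anywhere.
    """
    if max_bins < 2:
        raise ValueError("max_bins must be at least 2")
    return 1.0 - 1.0 / max_bins - min_instances_per_node / 1000.0


def objective(params):
    """Turn a score into the kind of result dict that fmin() consumes."""
    try:
        score = fake_train(
            int(params["minInstancesPerNode"]), int(params["maxBins"])
        )
    except ValueError:
        # Report the failed trial instead of aborting the whole search.
        return {"status": STATUS_FAIL}
    # Negate because the search minimizes loss while the score is maximized.
    return {"loss": -score, "status": STATUS_OK}


good = objective({"minInstancesPerNode": 100.0, "maxBins": 10.7})
bad = objective({"minInstancesPerNode": 100.0, "maxBins": 1.2})
print(good["status"], bad["status"])
```

Reporting a failure status is optional; the notebook's own objective function only ever returns the success status.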

In [0]:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

def train_with_hyperopt(params):
  """
  An example train method that calls into MLlib.
  This method is passed to hyperopt.fmin().
  
  :param params: hyperparameters as a dict. Its structure is consistent with how search space is defined. See below.
  :return: dict with fields 'loss' (scalar loss) and 'status' (success/failure status of run)
  """
  # For integer parameters, make sure to convert them to int type if Hyperopt is searching over a continuous range of values.
  minInstancesPerNode = int(params['minInstancesPerNode'])
  maxBins = int(params['maxBins'])

  model, f1_score = train_tree(minInstancesPerNode, maxBins)
  
  # Hyperopt expects you to return a loss (for which lower is better), so take the negative of the f1_score (for which higher is better).
  loss = - f1_score
  return {'loss': loss, 'status': STATUS_OK}

## Define the search space over hyperparameters

This example tunes two hyperparameters: `minInstancesPerNode` and `maxBins`. See the [Hyperopt documentation](https://github.com/hyperopt/hyperopt/wiki/FMin#21-parameter-expressions) for details on defining a search space and parameter expressions.
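
Because `hp.uniform` samples real numbers, the objective function truncates them with `int()`. The sketch below uses Python's `random.uniform` as a stand-in for a continuous sampler (an assumption; it is not Hyperopt's sampler) to show one consequence: after truncation the upper bound is practically unreachable, so a range of 2 to 32 effectively covers the integers 2 through 31.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Draw continuous values from [2, 32] and truncate, as the objective does.
draws = [int(rng.uniform(2, 32)) for _ in range(10_000)]
counts = Counter(draws)

print(min(counts), max(counts))
print(counts[32])  # draws that landed on the upper bound after truncation
```

If the upper bound should be reachable, Hyperopt's quantized expressions such as `hp.quniform` are one option to consider. They still return floats, so the `int()` conversion remains necessary.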

In [0]:
space = {
  'minInstancesPerNode': hp.uniform('minInstancesPerNode', 10, 200),
  'maxBins': hp.uniform('maxBins', 2, 32),
}

## Select the search algorithm

- You must also specify which search algorithm to use. The two main choices are:
  - `hyperopt.tpe.suggest`: Tree of Parzen Estimators, a Bayesian approach which iteratively and adaptively selects new hyperparameter settings to explore based on previous results
  - `hyperopt.rand.suggest`: Random search, a non-adaptive approach that randomly samples the search space
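
To make the non-adaptive option concrete, here is a minimal random-search loop in plain Python. It is a sketch of the idea behind `hyperopt.rand.suggest`, not Hyperopt's implementation; the `toy_loss` function and its optimum are invented so the example runs without Spark.

```python
import random


def toy_loss(params):
    """Invented loss with its minimum at minInstancesPerNode=50, maxBins=20."""
    return ((params["minInstancesPerNode"] - 50) / 190) ** 2 + (
        (params["maxBins"] - 20) / 30
    ) ** 2


def random_search(loss_fn, bounds, max_evals, seed=0):
    """Sample each hyperparameter uniformly and keep the lowest-loss setting.

    Every draw ignores earlier results, which is what "non-adaptive" means.
    """
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        candidate = {
            name: rng.uniform(low, high) for name, (low, high) in bounds.items()
        }
        loss = loss_fn(candidate)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
    return best_params, best_loss


bounds = {"minInstancesPerNode": (10, 200), "maxBins": (2, 32)}
best_params, best_loss = random_search(toy_loss, bounds, max_evals=8)
print(best_params, best_loss)
```

An adaptive method such as TPE differs in the sampling step: it uses the losses observed so far to decide where to sample next.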

In [0]:
algo = tpe.suggest

## Run the tuning algorithm with Hyperopt fmin()

**Important:**  
When using Hyperopt with MLlib and other distributed training algorithms, do not pass a `trials` argument to `fmin()`. When you do not include the `trials` argument, Hyperopt uses the default `Trials` class, which runs on the cluster driver. Hyperopt needs to evaluate each trial on the driver node so that each trial can initiate distributed training jobs.  

Do not use the `SparkTrials` class with MLlib. `SparkTrials` is designed to distribute trials for algorithms that are not themselves distributed. MLlib uses distributed computing already and is not compatible with `SparkTrials`.

In [0]:
with mlflow.start_run():
  best_params = fmin(
    fn=train_with_hyperopt,
    space=space,
    algo=algo,
    max_evals=8
  )

In [0]:
# Best hyperparameters
print(best_params)

## Retrain the model on the training dataset

In [0]:
best_minInstancesPerNode = int(best_params['minInstancesPerNode'])
best_maxBins = int(best_params['maxBins'])

final_model, val_f1_score = train_tree(best_minInstancesPerNode, best_maxBins)

In [0]:
print(f"The retrained decision tree achieved an F1 score of {val_f1_score} on the validation data")

## Use the test dataset to compare evaluation metrics for the initial and "best" models

In [0]:
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", metricName="f1")

initial_model_test_metric = evaluator.evaluate(initial_model.transform(test_data))
final_model_test_metric = evaluator.evaluate(final_model.transform(test_data))

print(f"On the test data, the initial (untuned) model achieved F1 score {initial_model_test_metric}, and the final (tuned) model achieved {final_model_test_metric}.")