**<center><h1>Introduction</h1></center>**

Hyperparameters are parameters defined before model training that can influence the model's performance. Which hyperparameters are available depends on the algorithm used to train the model, and the process of finding good values for them is called **hyperparameter tuning**.

In this module, you'll learn how to use Azure Databricks with MLflow to do hyperparameter tuning and model selection.


**<h2>Learning Objectives</h2>**

After completing this module, you’ll be able to:

- Understand hyperparameter tuning and its role in machine learning.
- Learn how to use two open-source tools, automated MLflow and Hyperopt, to automate the process of model selection and hyperparameter tuning.

<hr>

**<center><h1>Understand hyperparameter tuning</h1></center>**

Building machine learning solutions involves testing many different models. Let's explore two concepts that can help with finding the optimal model:

- Hyperparameter tuning
- Cross-validation

**<h2>Hyperparameter tuning</h2>**

A **hyperparameter** is a parameter used in a machine learning algorithm that is set before the learning process begins. In other words, a machine learning algorithm can't learn hyperparameters from the data itself. Hyperparameters are tested and validated by training multiple models. Common hyperparameters include the number of iterations and the complexity of the model. **Hyperparameter tuning** is the process of choosing the hyperparameter values that give the best result on the loss function, which is the way we penalize an algorithm for being wrong.
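As a minimal sketch of this idea, the following plain Python example trains one small model per candidate hyperparameter value and keeps the value with the lowest loss on held-out data. The data, the one-parameter model, and the names ```fit_slope```, ```mse```, and ```l2_penalty``` are made up for illustration and are not part of any library.

```python
# Toy data: y is roughly 2 * x plus a little noise (made-up values).
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
validation = [(5.0, 10.1), (6.0, 11.9)]

def fit_slope(data, l2_penalty):
    """Fit y = w * x with an L2 penalty; l2_penalty is the hyperparameter."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + l2_penalty)

def mse(data, w):
    """Mean squared error: the loss used to compare candidate models."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)

candidates = [0.0, 0.1, 1.0, 10.0]  # hyperparameter values to try
results = {}
for penalty in candidates:
    w = fit_slope(train, penalty)          # the parameter w is learned from the data
    results[penalty] = mse(validation, w)  # loss measured on held-out data

# Hyperparameter tuning: keep the value with the lowest loss.
best_penalty = min(results, key=results.get)
print(best_penalty, results[best_penalty])
```

The slope ```w``` is learned from the data, while the penalty is fixed before training, which is what makes it a hyperparameter.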

**<h2>Cross-validation</h2>**

When you train and evaluate a model on the same data, it can lead to **overfitting**. Overfitting is where the model performs well on data it has already seen but fails to predict anything useful on data it has not seen. To avoid overfitting, you can use a train/test split, where the dataset is divided into a training set used to train the model and a test set used to evaluate the model's performance on unseen data.

If you train many different models with different hyperparameters and then evaluate their performance on the test set, you would still risk overfitting because you may choose the hyperparameter values that just so happen to perform the best on the data you have in your dataset. To reduce this risk, you can split your training set into k subsets, or folds, a process called **k-fold cross-validation**. A model is trained on k-1 folds of the training data and the remaining fold is used to evaluate its performance. This is repeated k times so that each fold is used once for evaluation, and the results are averaged.

<img src="images/01-01-01-cross-validation.png" />
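The splitting step of k-fold cross-validation can be sketched in plain Python as follows. The helper name ```k_fold_indices``` is hypothetical, and the sketch assigns rows to folds in order to keep the output easy to read, whereas real implementations usually assign rows to folds randomly.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation_idx = indices[start:start + size]          # the held-out fold
        train_idx = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train_idx, validation_idx
        start += size

folds = list(k_fold_indices(n_rows=10, k=3))
for train_idx, validation_idx in folds:
    print("train:", train_idx, "validation:", validation_idx)
```

Every row appears in exactly one validation fold, so each model is always evaluated on rows it was not trained on.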

Within Azure Databricks, there are two approaches to tune hyperparameters, which will be discussed in the next units:

- Automated MLflow tracking.
- Hyperparameter tuning with Hyperopt.




<hr>

**<center><h1>Automated MLflow for model tuning</h1></center>**

To choose the best model trained during hyperparameter tuning, you want to compare all models by evaluating their metrics. One common and simple approach to track model training in Azure Databricks is by using the open-source platform **MLflow**.

**<h2>Use automated MLflow</h2>**

As you train multiple models with hyperparameter tuning, you want to avoid the need to make explicit API calls to log all necessary information about the different models to MLflow. To make tracking hyperparameter tuning easier, the [Databricks Runtime for Machine Learning](https://docs.databricks.com/runtime/mlruntime.html) also supports automated MLflow Tracking. When you use automated MLflow for model tuning, the hyperparameter values and evaluation metrics are automatically logged in MLflow and a hierarchy will be created for the different runs that represent the distinct models you train.

To use automated MLflow tracking, you have to do the following:

- Use a Python notebook to host your code.
- Attach the notebook to a cluster with Databricks Runtime or Databricks Runtime for Machine Learning.
- Set up the hyperparameter tuning with ```CrossValidator``` or ```TrainValidationSplit```.


MLflow will automatically create a main or parent run that contains the information for the method you chose: ```CrossValidator``` or ```TrainValidationSplit```. MLflow will also create child runs that are nested under the main or parent run. Each child run will represent a trained model and you can see which hyperparameter values were used and the resulting evaluation metrics.


**<h2>Run tuning code</h2>**

When you want to run code that will train multiple models with different hyperparameter settings, you can go through the following steps:

- List the available hyperparameters for a specific algorithm.
- Set up the search space and sampling method.
- Run the code with automated MLflow, using ```CrossValidator``` or ```TrainValidationSplit```.

**<h3>List the available hyperparameters</h3>**

You can explore the hyperparameters of a specific machine learning algorithm by using the ```.explainParams()``` method on a model. For example, if we want to train a linear regression model ```lr```, we can use the following command to view the available hyperparameters:

```
print(lr.explainParams())
```

The ```.explainParams()``` method will return a list of hyperparameters you can choose from, including the name of the hyperparameter, a description, and the default value. Three of the hyperparameters available for the linear regression model are:

- maxIter: max number of iterations (>= 0). (default: 100)
- fitIntercept: whether to fit an intercept term. (default: True)
- standardization: whether to standardize the training features before fitting the model. (default: True)

**<h3>Set up the search space and sampling method</h3>**

After you select the hyperparameters, you can use ```ParamGridBuilder()``` to specify the **search space**. The search space is the range of values of the hyperparameters you want to try out. You can then specify how you want to choose values from that search space to train individual models, which is known as the **sampling method**. The most straightforward sampling method is known as **grid sampling**. The grid sampling method tries all possible combinations of values for the hyperparameters listed.

By default, the individual models are trained serially. It is possible to train models with different hyperparameter values in parallel by setting the ```parallelism``` parameter of ```CrossValidator``` or ```TrainValidationSplit```. You can find more information on setting up the parameter grid in the documentation.

<mark>**Note:** Because grid search exhaustively builds a model for each combination of hyperparameter values, the number of models to train grows quickly as you add hyperparameters or values. Each training run can consume a lot of compute power, so be careful with the configuration you set up.</mark>

If we continue the example with the linear regression model ```lr```, the following code shows how to set up a grid search to try out all possible combinations of parameters:

```
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = (ParamGridBuilder()
  .addGrid(lr.maxIter, [1, 10, 100])
  .addGrid(lr.fitIntercept, [True, False])
  .addGrid(lr.standardization, [True, False])
  .build()
)
```
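To get a feel for how quickly a grid grows, the following plain Python sketch enumerates the same value lists with ```itertools.product``` instead of Spark. The ```grid``` dictionary and the fold count of 3 are assumptions made for illustration only.

```python
from itertools import product

# Same value lists as the ParamGridBuilder example, as plain Python data.
grid = {
    "maxIter": [1, 10, 100],
    "fitIntercept": [True, False],
    "standardization": [True, False],
}

# Grid sampling: the Cartesian product of all value lists.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 3  # assumed number of cross-validation folds, for illustration
fits_during_search = len(combinations) * num_folds

print(len(combinations))   # 3 * 2 * 2 = 12 hyperparameter settings
print(fits_during_search)  # 12 settings * 3 folds = 36 models trained
```

Adding one more hyperparameter with three candidate values would triple both numbers, which is why the size of the grid deserves attention before you start a run.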

**<h3>Run code and invoke automated MLflow</h3>**

To test how the model performs and to generate evaluation metrics, you can use a test dataset. If you want to train multiple models on the same training dataset and the same test dataset, you can use the ```TrainValidationSplit``` method to run your code, build the models, and log them automatically with MLflow.

In case you want to take extra measures to prevent overfitting, you can use the ```CrossValidator``` method to train the models with different training datasets for each model and different test datasets to calculate the evaluation metrics.

To build the models for the linear regression model ```lr``` used in the examples above, you can create a ```RegressionEvaluator()``` to evaluate the grid search experiments, which will help decide which model is best. The settings for the hyperparameter tuning experiment can be set by using the ```CrossValidator()``` method as is done in the example below.

```
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator

# Evaluator used to score each model in the grid search
evaluator = RegressionEvaluator(
  labelCol="medv",
  predictionCol="prediction"
)

# `pipeline` and `trainDF` are assumed to be defined earlier in the notebook
cv = CrossValidator(
  estimator=pipeline,             # Estimator (individual model or pipeline)
  estimatorParamMaps=paramGrid,   # Grid of parameters to try (grid search)
  evaluator=evaluator,            # Evaluator
  numFolds=3,                     # Set k to 3
  seed=42                         # Seed to make sure the results are the same if run again
)

cvModel = cv.fit(trainDF)
```
Once all models have been trained, you can get the best model with the following code:

```
bestModel = cvModel.bestModel
```
Alternatively, you can look at all models you trained through the UI of MLflow. Just remember that there will be a parent run for the complete experiment and child runs for each individual model that has been trained.



<hr>

**<center><h1>Hyperparameter tuning with Hyperopt</h1></center>**

Another open-source tool that allows you to automate the process of hyperparameter tuning and model selection is **Hyperopt**. Hyperopt is simple to use, but using it efficiently requires care. The main advantage to using Hyperopt is that it is flexible and it can optimize any Python model with hyperparameters.

**<h2>Use Hyperopt</h2>**

Hyperopt is already installed if you create a cluster with Databricks Runtime for Machine Learning. To use it when training a Python model, follow these basic steps:

1. Define an objective function to minimize.
2. Define the hyperparameter search space.
3. Specify the search algorithm.
4. Run the Hyperopt function ```fmin()```.

**<h2>Define an objective function to minimize</h2>**

The objective function expresses the goal of training multiple models through hyperparameter tuning as a single value that Hyperopt can minimize. Often, the objective is to minimize training or validation loss.

When defining the objective function, you can make use of any evaluation metric that can be calculated with the algorithm you selected. For example, if we use a [support vector machine classifier from the scikit-learn library](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), you can vary the value for the regularization parameter ```C```. The objective is to have the model with the highest accuracy. Since Hyperopt wants a function that it needs to minimize, you can define the objective function as the negative accuracy so that a lower score actually means a higher accuracy.

In the following example, the regularization parameter ```C``` is defined as the input, a support vector machine classifier model is trained, the accuracy is calculated, and the objective function is defined as the negative accuracy, which is the value Hyperopt will try to minimize.

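```
from hyperopt import STATUS_OK
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X (features) and y (labels) are assumed to be loaded earlier in the notebook
def objective(C):
    # Train a support vector classifier with the given regularization parameter
    clf = SVC(C=C)

    # Mean accuracy across the cross-validation folds
    accuracy = cross_val_score(clf, X, y).mean()

    # Hyperopt minimizes the loss, so return the negative accuracy
    return {'loss': -accuracy, 'status': STATUS_OK}
```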
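The sign flip can be checked without scikit-learn or Hyperopt. The sketch below is a stand-in under stated assumptions: the data is made up, the "model" is a hypothetical threshold rule, and ```STATUS_OK``` is defined locally so the example runs on its own.

```python
STATUS_OK = "ok"  # local stand-in so the sketch runs without Hyperopt installed

# Made-up labeled data: (feature value, true class).
samples = [(0.2, 0), (0.9, 0), (1.4, 0), (1.8, 1), (2.5, 1), (3.1, 1)]

def accuracy(threshold):
    """Fraction of samples classified correctly by 'predict 1 if x > threshold'."""
    correct = sum((x > threshold) == bool(label) for x, label in samples)
    return correct / len(samples)

def objective(threshold):
    # Minimizing the negative accuracy is the same as maximizing the accuracy.
    return {"loss": -accuracy(threshold), "status": STATUS_OK}

for t in [0.5, 1.6, 2.8]:
    print(t, objective(t))
```

The threshold that classifies every sample correctly produces the lowest loss, so a minimizer ends up preferring the most accurate setting.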

**<h2>Define the hyperparameter search space</h2>**


When tuning hyperparameters, you need to define a search space. If you want to make use of Hyperopt's Bayesian approach to sampling, there is a set of expressions you can use to define the search space that is compatible with Hyperopt's approach to sampling.

Some examples of the expressions used to define the search space are:

- ```hp.choice(label, options)```: Returns one of the options you listed.
- ```hp.randint(label, upper)```: Returns a random integer in the range ```[0, upper)```, which excludes ```upper``` itself.
- ```hp.uniform(label, low, high)```: Returns a value uniformly between ```low``` and ```high```.
- ```hp.normal(label, mu, sigma)```: Returns a real value that's normally distributed with mean ```mu``` and standard deviation ```sigma```.
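To illustrate what kind of value each expression draws, here is a sketch that uses rough analogues from Python's standard ```random``` module. It does not use Hyperopt, and the hyperparameter names (```kernel```, ```n_layers```, ```C```, ```shift```) are hypothetical.

```python
import random

rng = random.Random(0)  # seeded so repeated runs draw the same values

def sample_point():
    """Draw one hyperparameter setting using stdlib analogues of the hp expressions."""
    return {
        "kernel": rng.choice(["linear", "rbf"]),  # like hp.choice: one of the options
        "n_layers": rng.randrange(5),             # like hp.randint: integer in [0, 5)
        "C": rng.uniform(0.1, 10.0),              # like hp.uniform: between low and high
        "shift": rng.gauss(0.0, 1.0),             # like hp.normal: mean 0, std dev 1
    }

points = [sample_point() for _ in range(100)]
print(points[0])
```

In Hyperopt, the corresponding expressions are typically collected in a dictionary keyed by hyperparameter name, and that dictionary is what you pass as the search space.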


**<h2>Select the search algorithm</h2>**

There are two main choices in how Hyperopt will sample over the search space:

- ```hyperopt.tpe.suggest```: Tree of Parzen Estimators (TPE), a Bayesian approach, which iteratively and adaptively selects new hyperparameter settings to explore based on past results.
- ```hyperopt.rand.suggest```: Random search, a non-adaptive approach that samples over the search space.

**<h2>Run the Hyperopt function fmin()</h2>**

Finally, to execute a Hyperopt run, you can use the function ```fmin()```. The ```fmin()``` function takes the following arguments:

- ```fn```: The objective function.
- ```space```: The search space.
- ```algo```: The search algorithm you want Hyperopt to use.
- ```max_evals```: The maximum number of models to train.
- ```max_queue_len```: The number of hyperparameter settings generated ahead of time. This can save time when using the TPE algorithm.
- ```trials```: A ```SparkTrials``` or ```Trials``` object. ```SparkTrials``` is used for single-machine algorithms such as scikit-learn. ```Trials``` is used for distributed training algorithms such as MLlib methods or Horovod. When using ```SparkTrials``` or Horovod, automated MLflow tracking is enabled and hyperparameters and evaluation metrics are automatically logged in MLflow.
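The roles of these arguments can be sketched with a small random-search minimizer in plain Python. This is not Hyperopt's implementation. The function ```random_search``` and the helpers ```loss``` and ```space``` are hypothetical names that only mirror the roles of the objective function, the search space, ```max_evals```, and the trials record, using the non-adaptive random strategy.

```python
import random

def random_search(fn, sample, max_evals, seed=0):
    """Evaluate fn on max_evals random settings and return the best one found."""
    rng = random.Random(seed)
    trials = []  # record of every evaluation, playing the role of a trials object
    for _ in range(max_evals):
        params = sample(rng)                 # draw a setting from the search space
        trials.append((params, fn(params)))  # evaluate the objective function
    best_params, best_loss = min(trials, key=lambda t: t[1])
    return best_params, best_loss, trials

# Hypothetical objective with its minimum at x = 3.
def loss(params):
    return (params["x"] - 3.0) ** 2

# Hypothetical search space: x drawn uniformly from [-10, 10].
def space(rng):
    return {"x": rng.uniform(-10.0, 10.0)}

best_params, best_loss, trials = random_search(loss, space, max_evals=200)
print(best_params, best_loss)
```

An adaptive algorithm such as TPE differs in the sampling step: instead of drawing each setting independently, it uses the losses recorded so far to decide where to sample next.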






<hr>

**<center><h1>Exercise</h1></center>**

Now, it's your chance to use Azure Databricks to tune hyperparameters.

In this exercise, you will:

- Explore automated MLflow hyperparameter tuning.
- Explore Hyperopt for hyperparameter tuning.


**<h2>Instructions</h2>**

Follow these instructions to complete the exercise:

1. Open the exercise instructions at https://aka.ms/mslearn-dp090.
2. Complete the **Hyperparameter tuning with Azure Databricks** exercise.



<hr>