# Automated MLflow Hyperparameter Tuning

In this lab, you will learn to tune hyperparameters in Azure Databricks. This lab will cover the following exercise:
- Exercise 1: Using Automated MLflow for hyperparameter tuning.

To upload the necessary data, please follow the instructions in the lab guide.


## Attach notebook to your cluster
Before executing any cells in the notebook, you need to attach it to your cluster. Make sure that the cluster is running.

In the notebook's toolbar, select the drop-down arrow next to Detached, and then select your cluster under Attach to.

Make sure you run each cell in order.

## Exercise 1: Using Automated MLflow for hyperparameter tuning
In this exercise, you will perform hyperparameter tuning by using the automated MLflow library.

### Load the data
In this exercise, you will use a dataset of real estate sales transactions to predict the price-per-unit of a property based on its features. The price-per-unit in this data is based on a unit measurement of 3.3 square meters.

The data consists of the following variables:
- **transaction_date** - the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
- **house_age** - the house age (in years)
- **transit_distance** - the distance to the nearest light rail station (in meters)
- **local_convenience_stores** - the number of convenience stores within walking distance
- **latitude** - the geographic coordinate, latitude
- **longitude** - the geographic coordinate, longitude
- **price_per_unit** - the house price per unit area (one unit is 3.3 square meters)
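
The `transaction_date` encoding can be unpacked with a small helper. This is a minimal sketch that assumes the fractional part is the month divided by 12, which is what the two examples in the list suggest. The function name `decode_transaction_date` is hypothetical, and a fraction of zero is not covered by those examples, so the helper returns month 0 for it rather than guessing.

```python
def decode_transaction_date(value):
    """Split a fractional-year value such as 2013.250 into (year, month).

    Assumption: the fraction is month / 12, as in 2013.250 = March 2013.
    """
    year = int(value)
    # Rounding absorbs truncated fractions such as 0.917 (11 / 12)
    month = round((value - year) * 12)
    return year, month


print(decode_transaction_date(2013.250))
print(decode_transaction_date(2013.500))
```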


Run the following cell to load the table into a Spark dataframe and review the dataframe.

In [None]:
TO-DO


### Train a linear regression model
Start by performing a train/test split on the housing dataset and building a pipeline for linear regression.

In the cell below, a dataframe `housingDF` is created from the table you created earlier. The dataframe is then randomly split into a training set that contains 80% of the data and a test set that contains the remaining 20%. All columns except the last one (`price_per_unit`) are then marked as features so that a linear regression model can be trained on the data.

In [None]:
TO-DO


Take a look at the model parameters using the `.explainParams()` method.

In [None]:
TO-DO


`ParamGridBuilder()` allows us to string together all of the hyperparameter values we would like to test. In this case, we can test the maximum number of iterations, whether to fit an intercept, and whether to standardize our features.

In [None]:
TO-DO


Now `paramGrid` contains all of the combinations we will test in the next step.  Take a look at what it contains.
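
Conceptually, the grid is the Cartesian product of the candidate values for each parameter. The plain-Python sketch below reproduces that idea without Spark; the candidate values are assumptions used only for illustration.

```python
from itertools import product

# Assumed candidate values for the three parameters discussed above
grid_values = {
    "maxIter": [1, 10, 100],
    "fitIntercept": [True, False],
    "standardization": [True, False],
}

# One dictionary per combination: 3 * 2 * 2 combinations in total
combinations = [
    dict(zip(grid_values, combo))
    for combo in product(*grid_values.values())
]

print(len(combinations))
for combo in combinations[:3]:
    print(combo)
```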

In [None]:
TO-DO


### Cross-Validation

There are a number of different ways of conducting cross-validation, allowing us to trade off between computational expense and model performance. An exhaustive approach to cross-validation would include every possible split of the training set. More commonly, _k_-fold cross-validation is used, where the training dataset is divided into _k_ smaller sets, or folds. A model is then trained on _k_-1 folds of the training data and the remaining fold is used to evaluate its performance. This is repeated _k_ times so that each fold serves as the evaluation set exactly once, and the _k_ scores are averaged.
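
The fold mechanics can be sketched in plain Python. The helper `k_fold_indices` is hypothetical and assigns rows to folds in a fixed round-robin order to keep the example readable; Spark's `CrossValidator` assigns rows to folds randomly instead.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Round-robin assignment gives k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        # Training data is every fold except the one held out
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_indices(10, 3):
    print(len(train), len(validation))
```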

Create a `RegressionEvaluator()` to evaluate our grid search experiments and a `CrossValidator()` to build our models.

In [None]:
TO-DO


Fit the `CrossValidator()`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This will train a large number of models.  If your cluster size is too small, it could take a while.
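
To estimate how many models that is: `CrossValidator` fits one model per parameter combination per fold, then refits the best combination once on the full training set. The grid size and fold count below are assumptions used only for the arithmetic.

```python
num_param_combinations = 3 * 2 * 2   # assumed grid: maxIter x fitIntercept x standardization
num_folds = 3                        # assumed; also the CrossValidator default

# One fit per (combination, fold) pair during the search
fits_during_search = num_param_combinations * num_folds

# Plus one final refit of the winning combination on all training data
total_fits = fits_during_search + 1

print(fits_during_search, total_fits)
```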

In [None]:
TO-DO


Take a look at the scores from the different experiments.

In [None]:
TO-DO


You can then access the best model using the `.bestModel` attribute. 

In [None]:
TO-DO


To see the predictions of the best model on the test dataset, execute the code below:

In [None]:
TO-DO
