# BCI AVM Hypertuning
### Training the Machine Learning Models on Tabular Data

This notebook covers the following steps:
- Import data from AWS
- Visualize the data using Seaborn and matplotlib
- Run a parallel hyperparameter sweep to train machine learning models on the dataset
- Explore the results of the hyperparameter sweep with MLflow
- Register the best performing model in MLflow

## Requirements
This notebook requires Databricks Runtime for Machine Learning or a similar Spark/PySpark-enabled environment set up locally or elsewhere.  
If you are using Databricks Runtime 7.3 LTS ML or below, you must update the CloudPickle library using the commands in the following cell.

In [0]:
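# These commands are only required if you are using a cluster running DBR 7.3 LTS ML or below. 
import cloudpickle
# Compare numeric components; a plain string comparison would rank "1.10.0" below "1.4.0".
installed = tuple(int(part) for part in cloudpickle.__version__.split(".")[:2])
assert installed >= (1, 4), "Update the cloudpickle library using `%pip install --upgrade cloudpickle`"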

## Importing Data
  
In this section, you load the dataset from an AWS S3 bucket.

1. Ensure that you have installed `bciavm` with `pip install bciavm` on your local machine or on Databricks.

In [0]:
import io
from bciavm.core.config import your_bucket
from bciavm.utils.bci_utils import ReadParquetFile, get_postcodeOutcode_from_postcode, get_postcodeArea_from_outcode, drop_outliers, preprocess_data
import pandas as pd
import bciavm
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

In [0]:
# Load one Parquet file per year, then concatenate once.
# DataFrame.append was removed in pandas 2.0; pd.concat also avoids repeated copies.
yearArray = ['2020', '2019']
yearlyFrames = [
    pd.DataFrame(ReadParquetFile(your_bucket, 'epc_price_data/byDate/2021-02-04/parquet/' + year))
    for year in yearArray
]
dfPricesEpc = pd.concat(yearlyFrames)

dfPricesEpc['POSTCODE_OUTCODE'] = dfPricesEpc['Postcode'].apply(get_postcodeOutcode_from_postcode)
dfPricesEpc['POSTCODE_AREA'] = dfPricesEpc['POSTCODE_OUTCODE'].apply(get_postcodeArea_from_outcode)
dfPricesEpc.groupby('TypeOfMatching_m').count()['Postcode']
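
The two postcode helpers imported from `bciavm.utils.bci_utils` are not shown in this notebook. As a rough illustration of what the derived columns contain, the sketch below uses hypothetical stand-ins (`outcode_from_postcode` and `area_from_outcode`) based on the standard UK postcode layout, where the outcode is the part before the space and the area is its leading letters. The real `bciavm` implementations may differ, for example in how they treat malformed postcodes.

```python
import re


def outcode_from_postcode(postcode: str) -> str:
    """Hypothetical stand-in: return the outward code, e.g. 'SW1A' for 'SW1A 1AA'."""
    cleaned = postcode.strip().upper()
    if " " in cleaned:
        return cleaned.split()[0]
    # Without a space, the inward code is always the last three characters.
    return cleaned[:-3]


def area_from_outcode(outcode: str) -> str:
    """Hypothetical stand-in: return the leading letters, e.g. 'SW' for 'SW1A'."""
    match = re.match(r"[A-Z]+", outcode.strip().upper())
    return match.group(0) if match else ""


for pc in ["SW1A 1AA", "m1 1ae", "B338TH"]:
    out = outcode_from_postcode(pc)
    print(pc, "->", out, "->", area_from_outcode(out))
```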

## Preprocessing Data
Prior to training a model, check for missing values and split the data into training and validation sets.

In [0]:
X_train, X_test, y_train, y_test = bciavm.preprocess_data(dfPricesEpc)
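
`preprocess_data` is part of `bciavm`, and its internals are not shown here. The sketch below illustrates the two steps named above on a small made-up frame using plain pandas: counting missing values per column and holding out a random fraction of rows. The column names, values, and 25% hold-out fraction are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data; the column names are illustrative, not the real schema.
toy = pd.DataFrame({
    "floor_area": [50.0, 72.5, np.nan, 110.0, 64.0, 95.0, 80.0, 58.0],
    "rooms": [2, 3, 3, 5, np.nan, 4, 3, 2],
    "price": [150e3, 210e3, 205e3, 340e3, 180e3, 290e3, 240e3, 165e3],
})

# 1. Count missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# 2. Random hold-out split with a fixed seed so the split is repeatable.
test_part = toy.sample(frac=0.25, random_state=0)
train_part = toy.drop(test_part.index)

X_tr, y_tr = train_part.drop(columns="price"), train_part["price"]
X_te, y_te = test_part.drop(columns="price"), test_part["price"]
print(len(X_tr), len(X_te))
```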

## Hypertuning the AVM pipeline

The following code uses the `xgboost` and `scikit-learn` libraries to train a valuation model. It uses Hyperopt with SparkTrials to run a hyperparameter sweep that trains
multiple models in parallel, and it tracks the performance of each parameter configuration with MLflow.
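
The search space in this section draws several parameters with `hp.loguniform(label, low, high)`, which Hyperopt documents as returning `exp(uniform(low, high))`, so the bounds are given in log space. The small sketch below converts the bounds used in this notebook into actual value ranges. It uses only the standard library and does not call Hyperopt itself.

```python
from math import exp

# (low, high) bounds copied from the search space in this notebook.
log_bounds = {
    "alpha": (-5, -1),
    "learning_rate_init": (-3, 0),
    "learning_rate": (-3, 0),
    "reg_alpha": (-5, -1),
    "reg_lambda": (-6, -1),
    "min_child_weight": (-1, 3),
}

# Exponentiate each bound to get the range of values that can be sampled.
value_ranges = {name: (exp(lo), exp(hi)) for name, (lo, hi) in log_bounds.items()}

for name, (lo, hi) in value_ranges.items():
    print(f"{name:20s} {lo:.5f} .. {hi:.5f}")
```

For example, `learning_rate` is sampled between roughly 0.05 and 1.0.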

In [0]:
import sys
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope
from math import exp
import mlflow

version_info = sys.version_info
PYTHON_VERSION = "{major}.{minor}.{micro}".format(major=version_info.major,
                                              minor=version_info.minor,
                                              micro=version_info.micro)


conda_env = {'channels': ['defaults','conda-forge'],
            'dependencies': [
                'python={}'.format(PYTHON_VERSION),
                'pip',
                  {'pip': ['bciavm==1.21.2',
                          ],
                  },
            ],
            'name': 'mlflow-env'
}

In [0]:
import gc
from bciavm.pipelines import RegressionPipeline
import numpy as np


search_space = {
  #Transformer tuning
  'numeric_impute_strategy': hp.choice('numeric_impute_strategy', ["mean", "median", "most_frequent"]),
  'top_n': hp.choice('top_n', [1, 5, 10, 20, 30]),
    
  #K nearest neighbor tuning
  'n_neighbors': scope.int(hp.quniform('n_neighbors', 2, 30, 1)),
  'leaf_size': hp.choice('leaf_size', [5, 10, 20, 30]),
  'p': hp.choice('p', [1, 2, 3]),
    
  #MultiLayer Perceptron Regressor tuning
  'activation': 'relu',
  'solver': 'adam',
  'batch_size': scope.int(hp.quniform('batch_size', 200, 1000, 10)),
  'alpha': hp.loguniform('alpha', -5, -1),
  'learning_rate_init': hp.loguniform('learning_rate_init', -3, 0),
  'max_iter': scope.int(hp.quniform('max_iter', 100, 500, 10)),
  'beta_1': hp.choice('beta_1', [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]),
  'epsilon': hp.choice('epsilon', [1e-08, 1e-07, 1e-06, 1e-05, 1e-04, 1e-09, 1e-03]),
    
  #XGBoost Regressor tuning
  'n_estimators': scope.int(hp.quniform('n_estimators', 100, 1000, 10)),
  'max_depth': scope.int(hp.quniform('max_depth', 4, 100, 1)),
  'learning_rate': hp.loguniform('learning_rate', -3, 0),
  'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
  'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
  'min_child_weight': hp.loguniform('min_child_weight', -1, 3),
  'metric':'mae',
  'objective': 'reg:squarederror',
  'seed': 123, # Set a seed for deterministic training
}


def train_model(params):
  
    # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
    mlflow.sklearn.autolog()

    with mlflow.start_run(nested=True):

        parameters = {
              'Imputer': {'categorical_impute_strategy': 'most_frequent',
                'numeric_impute_strategy': params['numeric_impute_strategy'],
                'categorical_fill_value': None,
                'numeric_fill_value': None},
               'One Hot Encoder': {'top_n': params['top_n'],
                'features_to_encode': ['agg_cat'],
                'categories': None,
                'drop': None,
                'handle_unknown': 'ignore',
                'handle_missing': 'error'},
               'MultiLayer Perceptron Regressor': {'activation': 'relu',
                'solver': 'adam',
                'alpha': params['alpha'],
                'batch_size': params['batch_size'],
                'learning_rate': 'constant',
                'learning_rate_init': params['learning_rate_init'],
                'max_iter': params['max_iter'],
                'early_stopping': True,
                'beta_1': params['beta_1'],
                'beta_2': 0.999,
                'epsilon': params['epsilon'],
                'n_iter_no_change': 10},
               'K Nearest Neighbors Regressor': {'n_neighbors': params['n_neighbors'],
                'weights': 'distance',
                'algorithm': 'auto',
                'leaf_size': params['leaf_size'],
                'p': params['p'],
                'n_jobs': 4},
               'XGBoost Regressor': {'learning_rate': params['learning_rate'],
                'max_depth': params['max_depth'],
                'min_child_weight': params['min_child_weight'],
                'reg_alpha': params['reg_alpha'],
                'reg_lambda': params['reg_lambda'],
                'n_estimators': params['n_estimators']},
        }

        class AVMPipeline(RegressionPipeline):
            custom_name = 'AVM Pipeline'
            component_graph = {
                "Preprocess Transformer": ["Preprocess Transformer"],
                'Imputer': ['Imputer', "Preprocess Transformer"],
                'One Hot Encoder': ['One Hot Encoder', "Imputer"],
                'MultiLayer Perceptron Regressor': ['MultiLayer Perceptron Regressor',  
                                                    'One Hot Encoder'],
                'K Nearest Neighbors Regressor': ['K Nearest Neighbors Regressor', 
                                                  'MultiLayer Perceptron Regressor', 
                                                  'One Hot Encoder'],
                'XGBoost Regressor': ["XGBoost Regressor", 
                                    'K Nearest Neighbors Regressor', 
                                    'One Hot Encoder']
            }
            
        for component in parameters:
          for param in parameters[component]:
            _component_param = component + '_' + param
            mlflow.log_param(_component_param, parameters[component][param])

        avm_pipeline = AVMPipeline(parameters=parameters)
        avm_pipeline.fit(X_train, y_train)

        # Compute and return trial error
        scores = avm_pipeline.score(X_test, 
                                         y_test, 
                                         objectives=['MAPE',
                                                   'MdAPE',
                                                   'ExpVariance',
                                                   'MaxError',
                                                   'MedianAE',
                                                   'MSE',
                                                   'MAE',
                                                   'R2',
                                                   'Root Mean Squared Error'])
        MdAPE = scores['MdAPE']

        #Examine the learned feature importances output by the model as a sanity-check.
        xgb_component = avm_pipeline.get_component("XGBoost Regressor")
        fi = pd.DataFrame({
            'feature': xgb_component.input_feature_names,
            'importance': xgb_component.feature_importance,
        }).sort_values(by='importance', ascending=False)

        #Log the feature importances output as an artifact
        artifact_path = '/dbfs/FileStore/tables/XGBoost_importance.csv'
        fi.to_csv(artifact_path,index=False)
        mlflow.log_artifact(artifact_path)

        #Log the scoring metrics
        mlflow.log_metric('MAPE', scores['MAPE'])
        mlflow.log_metric('MdAPE', scores['MdAPE'])
        mlflow.log_metric('ExpVariance', scores['ExpVariance'])
        mlflow.log_metric('MaxError', scores['MaxError'])
        mlflow.log_metric('MedianAE', scores['MedianAE'])
        mlflow.log_metric('MSE', scores['MSE'])
        mlflow.log_metric('MAE', scores['MAE'])
        mlflow.log_metric('R2', scores['R2'])
        mlflow.log_metric('Root Mean Squared Error', scores['Root Mean Squared Error'])

        #Log an input example for future reference
        input_example = X_train.dropna().sample(1)

        #Log the mlflow model, along with the conda environment and input example
        mlflow.sklearn.log_model(
                             avm_pipeline,
                             "avm", 
                             conda_env=conda_env,
                             input_example=input_example
                            )

        # fmin minimizes the MdAPE (median absolute percentage error)
        return {'status': STATUS_OK, 'loss': scores['MdAPE']}

# Greater parallelism will lead to speedups, but a less optimal hyperparameter sweep. 
# A reasonable value for parallelism is the square root of max_evals.
spark_trials = SparkTrials(parallelism=10)

# Run fmin within an MLflow run context so that each hyperparameter configuration is logged as a child run of a parent
with mlflow.start_run(run_name='pipeline_tuning'):
    best_params = fmin(
        fn=train_model, 
        space=search_space, 
        algo=tpe.suggest, 
        max_evals=100,
        trials=spark_trials,
        rstate=np.random.RandomState(123)
        )
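
The comment in the cell above suggests setting `parallelism` near the square root of `max_evals`. A rough way to picture the trade-off: TPE bases new proposals on completed trials, so with `parallelism` trials running at once there are only about `max_evals / parallelism` points at which earlier results can inform later ones. The helper `suggest_parallelism` below is a hypothetical name for illustration and is not part of Hyperopt.

```python
from math import ceil, isqrt


def suggest_parallelism(max_evals: int) -> int:
    """Hypothetical helper: square-root rule of thumb, never below 1."""
    return max(1, isqrt(max_evals))


max_evals = 100
parallelism = suggest_parallelism(max_evals)

# Approximate number of sequential "rounds" in which results can guide new proposals.
approx_rounds = ceil(max_evals / parallelism)
print(parallelism, approx_rounds)
```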

#### Use MLflow to view the results
Open up the Experiment Runs sidebar to see the MLflow runs. Click on Date next to the down arrow to display a menu, and select 'MdAPE' to display the runs sorted by the MdAPE metric. The lowest MdAPE value is ~10%. 
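
For reference, MdAPE is the median absolute percentage error: the median of `|actual - predicted| / |actual|` across all rows, often expressed as a percentage. The function below is a plain NumPy sketch of that standard definition, applied to made-up prices. `bciavm` computes the metric internally, and its exact scaling (fraction versus percent) is not confirmed here.

```python
import numpy as np


def mdape(y_true, y_pred) -> float:
    """Median absolute percentage error, returned as a percentage."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ape = np.abs(y_true - y_pred) / np.abs(y_true)
    return float(np.median(ape) * 100.0)


# Made-up actual and predicted prices.
actual = [200_000, 310_000, 150_000, 420_000, 275_000]
predicted = [210_000, 300_000, 165_000, 400_000, 280_000]
print(mdape(actual, predicted))
```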

MLflow tracks the parameters and performance metrics of each run. Click the External Link icon <img src="https://docs.databricks.com/_static/images/external-link.png"/> at the top of the Experiment Runs sidebar to navigate to the MLflow Runs Table.

Now investigate how the hyperparameter choice correlates with MdAPE. Click the "+" icon to expand the parent run, then select all runs except the parent, and click "Compare". Select the Parallel Coordinates Plot.

The Parallel Coordinates Plot is useful in understanding the impact of parameters on a metric. You can drag the pink slider bar at the upper right corner of the plot to highlight a subset of MdAPE values and the corresponding parameter values. Because MdAPE is an error metric, the top performing runs are those with the lowest values. The plot below shows an example of this view:

<img src="https://docs.databricks.com/_static/images/mlflow/end-to-end-example/parallel-coordinates-plot.png"/>

Notice that all of the top performing runs have a low value for reg_lambda and learning_rate. 

You could run another hyperparameter sweep to explore even lower values for these parameters. For simplicity, that step is not included here.

You used MLflow to log the model produced by each hyperparameter configuration. The following code finds the best performing run and saves the model to the model registry.

In [0]:
# MdAPE is an error metric, so the best run is the one with the lowest value.
best_run = mlflow.search_runs(order_by=['metrics.MdAPE ASC']).iloc[0]
print(f'MdAPE of Best Run: {best_run["metrics.MdAPE"]}')
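
`mlflow.search_runs` returns a pandas DataFrame with one row per run and metric columns named like `metrics.MdAPE`. Because MdAPE is an error, the best run is the one with the smallest value. The toy frame below, with made-up run IDs and values, mimics that layout to show the selection in plain pandas, including a parent run that has no MdAPE of its own.

```python
import numpy as np
import pandas as pd

# Made-up runs table in the same column layout as mlflow.search_runs output.
runs = pd.DataFrame({
    "run_id": ["parent", "child_a", "child_b", "child_c"],
    "metrics.MdAPE": [np.nan, 12.4, 9.8, 15.1],
})

# Drop runs without the metric, then take the smallest error.
scored = runs.dropna(subset=["metrics.MdAPE"])
best = scored.sort_values("metrics.MdAPE", ascending=True).iloc[0]
print(best["run_id"], best["metrics.MdAPE"])
```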

#### Updating the production avm model in the MLflow Model Registry

Earlier, we saved the model to the Model Registry under "avm". Now that we have created a more accurate model, update `avm`.

In [0]:
model_name = 'avm'
# train_model logs the pipeline under the artifact path "avm".
new_model_version = mlflow.register_model(f"runs:/{best_run.run_id}/avm", model_name)

Click **Models** in the left sidebar to see that the `avm` model now has a new version.

The following code promotes the new version to production.

In [0]:
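from mlflow.tracking import MlflowClient

client = MlflowClient()

# Archive any model version currently in Production
for model_version in client.get_latest_versions(model_name, stages=["Production"]):
    client.transition_model_version_stage(
      name=model_name,
      version=model_version.version,
      stage="Archived"
    )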

# Promote the new model version to Production
client.transition_model_version_stage(
  name=model_name,
  version=new_model_version.version,
  stage="Production"
)

Clients that call load_model now receive the new model.

In [0]:
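# No change is required for clients to get the new model: they load whichever version is in Production.
model = mlflow.pyfunc.load_model(f"models:/{model_name}/production")
# Compute MdAPE (as a percentage) on the held-out data.
preds = np.asarray(model.predict(X_test), dtype=float).ravel()
actual = np.asarray(y_test, dtype=float).ravel()
print(f'MdAPE: {np.median(np.abs((actual - preds) / actual)) * 100}')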