# MLflow Training Demo

This notebook trains a model that predicts wine quality using [xgboost.XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html). We perform a naive search of the hyperparameter space to find the best-performing values among those tried.

The dataset contains the chemical properties of many different Vinho Verde wines from the north of Portugal.
We will primarily use the white wine data; an additional MLflow experiment can also be set up with the red wine data.

**Input Variables**:
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)

**Output variable (based on sensory data)**:

12. quality (score between 0 and 10)

The results of the model training runs are tracked in an MLflow experiment. The best-performing model is then registered in the model registry and moved to the `Production` stage for use.

> This notebook is based on `train.py` from the MLflow example [xgboost_sklearn](https://github.com/mlflow/mlflow/tree/master/examples/xgboost/xgboost_sklearn).

**Attribution**:

* The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality.
* P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.


## Tracking Setup

Create the experiment in which all of our model training runs are tracked.

* This experiment is used across runs of the notebook and will not be recreated if it already exists.
* The name of the experiment is defined as an Anaconda Project variable in `anaconda-project.yml`.
    * The variable name is `MLFLOW_EXPERIMENT_NAME`, and the default value is `demo_sklearn_elasticnet_wine`.
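
The "will not be recreated if it already exists" behaviour is a get-or-create pattern. The sketch below is not the notebook's `init()` implementation, which is not shown here. It assumes only the two `MlflowClient` methods `get_experiment_by_name` and `create_experiment`, and uses a hypothetical in-memory stand-in (`InMemoryClient`) so that it runs without a tracking server.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    """Stand-in for mlflow.entities.Experiment (only the fields used here)."""

    experiment_id: str
    name: str


class InMemoryClient:
    """Hypothetical stand-in mimicking two MlflowClient methods."""

    def __init__(self) -> None:
        self._by_name: dict[str, Experiment] = {}

    def get_experiment_by_name(self, name: str) -> Optional[Experiment]:
        # Returns None when no experiment has this name.
        return self._by_name.get(name)

    def create_experiment(self, name: str) -> str:
        # Returns the ID of the newly created experiment.
        experiment_id = str(len(self._by_name) + 1)
        self._by_name[name] = Experiment(experiment_id=experiment_id, name=name)
        return experiment_id


def get_or_create_experiment(client, name: str) -> str:
    """Return the ID of the named experiment, creating it only if absent."""
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name)


# Fall back to the documented default when the variable is unset.
experiment_name = os.environ.get("MLFLOW_EXPERIMENT_NAME", "demo_sklearn_elasticnet_wine")

demo_client = InMemoryClient()
first_id = get_or_create_experiment(demo_client, experiment_name)
second_id = get_or_create_experiment(demo_client, experiment_name)  # reuses the experiment
```

Calling the helper twice yields the same experiment ID, which is why re-running the notebook keeps adding runs to one experiment.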

In [None]:
from src.environment import init
import warnings
import numpy as np

warnings.filterwarnings("ignore")
np.random.seed(42)

experiment_id, client = init()

# Training

In [None]:
"""
Model Training Function
"""

from src.data import DataSet
from pydantic import BaseModel
from mlflow_adsp import create_unique_name
import os

import xgboost as xgb

import mlflow.xgboost


class HyperParameters(BaseModel):
    n_estimators: int
    max_depth: int
    reg_lambda: float
    gamma: float
    early_stopping_rounds: int


def train(ds: DataSet, parameters: HyperParameters) -> str:
    # Start the MLflow run to track the model training.
    with mlflow.start_run(run_name=create_unique_name(name=os.environ["MLFLOW_EXPERIMENT_NAME"])) as run:
        # Enable MLflow logging
        mlflow.xgboost.autolog()

        # https://xgboost.readthedocs.io/en/stable/python/python_api.html
        regressor = xgb.XGBRegressor(
            n_estimators=parameters.n_estimators,
            max_depth=parameters.max_depth,
            reg_lambda=parameters.reg_lambda,
            gamma=parameters.gamma,
            early_stopping_rounds=parameters.early_stopping_rounds,
        )
        regressor.fit(X=ds.X_train, y=ds.y_train, eval_set=[(ds.X_test, ds.y_test)], verbose=False)

        # Return the run_id for training run comparisons.
        return run.info.run_id

# Train a single model

In [None]:
from src.data import prepare_data
from mlflow.entities import Run

DATA_SET_FILENAME: str = "datasets/housing.csv"

data_set: DataSet = prepare_data(csv_url=DATA_SET_FILENAME)
parameters = HyperParameters(n_estimators=18, max_depth=10, reg_lambda=1, gamma=0, early_stopping_rounds=10)

run_id: str = train(ds=data_set, parameters=parameters)
stand_alone_run: Run = client.search_runs([experiment_id], f"attributes.run_id = '{run_id}'")[0]

print(f"Run ID: {run_id}")
print(stand_alone_run.data.metrics)

# Perform a naive search of the hyperparameter space

We will naively review model performance at fixed intervals across the solution space. Many more sophisticated optimization strategies exist and can be substituted depending on business needs.

In [None]:
from typing import Optional
from mlflow import MlflowClient


def get_best_run(client: MlflowClient, experiment_id, runs: list[str]) -> tuple[Optional[Run], dict]:
    _inf = np.finfo(np.float64).max

    best_metrics: dict = {
        "validation_0-rmse": _inf,
    }
    best_run: Optional[Run] = None

    for run_id in runs:
        # Fetch the run and keep it if its validation RMSE beats the best seen so far.
        run: Run = client.search_runs([experiment_id], f"attributes.run_id = '{run_id}'")[0]
        if (
            "validation_0-rmse" in run.data.metrics
            and run.data.metrics["validation_0-rmse"] < best_metrics["validation_0-rmse"]
        ):
            best_metrics = run.data.metrics
            best_run = run

    return best_run, best_metrics
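
The `validation_0-rmse` metric compared above is the root-mean-square error on the first entry of `eval_set`, so lower is better. As a minimal sketch of the comparison, the block below computes RMSE by hand and picks the lowest-scoring run from plain dictionaries. The run IDs and metric values are made up for illustration and do not come from the wine data.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


METRIC = "validation_0-rmse"

# Illustrative values only (hypothetical run IDs and scores).
candidate_metrics: dict[str, dict[str, float]] = {
    "run_a": {METRIC: 0.71},
    "run_b": {METRIC: 0.64},
    "run_c": {},  # a run that logged no RMSE is skipped
}

# Keep only runs that reported the metric, then take the minimum.
scored = {run_id: m[METRIC] for run_id, m in candidate_metrics.items() if METRIC in m}
best_run_id = min(scored, key=scored.get)
```

This mirrors the two conditions in `get_best_run`: the metric must be present, and it must be strictly lower than the best value seen so far.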

In [None]:
from tqdm import trange

runs: list[str] = []

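# Sweep n_estimators over 7, 9, ..., 17 and max_depth over 6, 7, ..., 11 (36 runs in total).
for i in trange(3, 9):
    n_estimators: int = i * 2 + 1
    for j in range(3, 9):
        max_depth: int = j + 3
        # Re-prepare the data for every run.
        data_set: DataSet = prepare_data(csv_url=DATA_SET_FILENAME)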
        parameters = HyperParameters(
            n_estimators=n_estimators,
            max_depth=max_depth,
            reg_lambda=1,
            gamma=0,
            early_stopping_rounds=10,
        )
        run_id: str = train(ds=data_set, parameters=parameters)
        runs.append(run_id)
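
The nested loops above visit a fixed grid of `n_estimators` and `max_depth` values. As a minimal sketch, the same grid can be enumerated up front with `itertools.product`, which makes the size of the search explicit before any training starts. The names `N_ESTIMATORS`, `MAX_DEPTHS`, and `grid` are introduced here for illustration only.

```python
from itertools import product

# The same values the nested loops generate.
N_ESTIMATORS = [i * 2 + 1 for i in range(3, 9)]  # 7, 9, 11, 13, 15, 17
MAX_DEPTHS = [j + 3 for j in range(3, 9)]  # 6, 7, 8, 9, 10, 11

# Every (n_estimators, max_depth) pair, in the order the loops visit them.
grid = list(product(N_ESTIMATORS, MAX_DEPTHS))

print(f"{len(grid)} combinations, from {grid[0]} to {grid[-1]}")
```

Each pair would then be passed to `HyperParameters` and `train` exactly as in the loop body; the cost of this kind of search grows multiplicatively with every additional hyperparameter.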

# Find and register the best model

## Review the runs for the best-performing model and add it to the model registry

In [None]:
from mlflow.entities.model_registry import ModelVersion

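(best_run, metrics) = get_best_run(client=client, experiment_id=experiment_id, runs=runs)

# get_best_run returns None when no run logged "validation_0-rmse".
if best_run is None:
    raise RuntimeError("No run reported a 'validation_0-rmse' metric.")

print(f"Run ID: {best_run.info.run_id}")
print(f"Report: {metrics}")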

In [None]:
from wine_quality.mlflow_helpers import register_best_model

model_version: ModelVersion = register_best_model(client=client, run=best_run)

## Promote the latest model to the `Production` stage for usage

In [None]:
model_version: ModelVersion = client.transition_model_version_stage(
    name=os.environ["MLFLOW_EXPERIMENT_NAME"],
    version=model_version.version,
    stage="Production",
    archive_existing_versions=True,
)

We are now ready to create an API deployment of our Production model:

1. Select "Deploy" in the top right of this AE5 screen.
2. Name the deployment. We will create two deployments in this demo, so it can be helpful to include "API" in the name.
3. Set the deployment command to `host-production-model-rest-api`.
4. Set the URL to "Static" and ensure the URL matches the `endpoint-url` in the `consume-rest-api.ipynb` notebook.
5. Set the privacy to "Public".
6. Deploy!

In a few moments we will have an API that we can use to make predictions.
