# Training models in Azure Databricks and deploying them on Azure ML

This notebook demonstrates how to train models in Azure Databricks (or any Databricks implementation) and deploy them on Azure ML. Two workflows are shown, depending on the level of integration you want to keep and on where you want tracking to happen:

1. **Scenario 1: Training on Azure Databricks while tracking experiments and models in Azure ML:** This example shows how to train models in Azure Databricks while tracking all experiments in Azure ML (instead of in the MLflow instance running on Azure Databricks). It is also the easiest way to deploy models seamlessly to Azure ML deployment targets.
2. **Scenario 2: Training and tracking experiments in Azure Databricks with Model Registries in Azure ML:** This example shows how to train models and track experiments in Azure Databricks. Experiment tracking happens in the MLflow instance running on Azure Databricks. However, models are registered in Azure ML, so they can be deployed quickly from a centralized registry.

Read each scenario to learn about the advantages and disadvantages of each approach.

## Before starting

To run this notebook, ensure you have:
- A Databricks workspace with a cluster that has the following libraries installed:
  - xgboost
  - scikit-learn==1.1.1
  - pandas
  - numpy
  - mlflow
  - azureml-mlflow

Also, configure the following variables:

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"
adb_user_id = "<ADB_USER_ID>"

You will need to connect MLflow to the Azure Machine Learning workspace you want to work on. MLflow uses the tracking URI to identify the MLflow server to connect to. There are multiple ways to get the Azure Machine Learning MLflow tracking URI. In this tutorial, we will use the Azure ML SDK for Python, but see [Set up tracking environment - Azure Machine Learning Docs](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-cli-runs#set-up-tracking-environment) for alternatives.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

You can use the workspace object to get the tracking URI:

In [None]:
azureml_tracking_uri = ml_client.workspaces.get(
    ml_client.workspace_name
).mlflow_tracking_uri
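
The documentation linked above also describes building the tracking URI by hand instead of querying the workspace. The following is a minimal sketch of that alternative, assuming the URI layout described there still applies. The helper `build_azureml_tracking_uri` and its `region` parameter are hypothetical names that are not part of this notebook, and the values passed in are placeholders:

```python
def build_azureml_tracking_uri(region, subscription_id, resource_group, workspace_name):
    """Assemble an Azure ML MLflow tracking URI from its parts.

    Assumption: the URI follows the layout documented for Azure ML at the
    time of writing. Prefer the value returned by the SDK when in doubt.
    """
    return (
        f"azureml://{region}.api.azureml.ms/mlflow/v1.0"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices"
        f"/workspaces/{workspace_name}"
    )


# Placeholder values, for illustration only.
example_uri = build_azureml_tracking_uri(
    region="eastus",
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="my-resource-group",
    workspace_name="my-workspace",
)
print(example_uri)
```

Comparing a hand-built URI with the one returned by `ml_client` can be a quick way to confirm that the variables configured above point to the workspace you expect.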

## Scenario 1: Training on Azure Databricks while tracking experiments and models in Azure ML

In this scenario, you want tracking and model registry to happen in Azure Machine Learning while you keep training models in Azure Databricks. To do that, we need to set the tracking URI on each instance of Databricks:

In [None]:
import mlflow

mlflow.set_tracking_uri(azureml_tracking_uri)

### Training a heart condition classifier

#### Configuring the experiment

Tracking of experiments will happen in Azure ML, so we need to use the naming convention we generally use with MLflow.

>Note that naming in Azure Databricks is different, as you have to use the path where the experiment will be saved. This is not the case in Azure ML or in MLflow in general.

In [None]:
mlflow.set_experiment(experiment_name="heart-condition-classifier")

> **About authentication:** Interactive Authentication or Device Authentication will be triggered when you call `set_experiment`. This is used to authenticate against Azure Machine Learning and to be able to call the tracking API. If you are executing the code in the context of a job where interactive authentication is not possible, see `notebooks/using-mlflow/train-with-mlflow/xgboost_service_principal.ipynb` for an example of how to use a Service Principal to authenticate against Azure Machine Learning and MLflow.

Since all the tracking is happening in Azure ML, you can train and register models the same way you usually do with MLflow.

#### Exploring the data

In [None]:
import pandas as pd

file_url = "https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv"
df = pd.read_csv(file_url)
display(df)

As we can see, some of the variables are categorical. To make it simpler for our model to handle these values, let's use their encoded values:

In [None]:
df["thal"] = df["thal"].astype("category").cat.codes

Let's split our dataset into train and test sets, so we can assess the performance of the model on data it was not trained on:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)

#### Training a model

We are going to use autologging capabilities in MLflow to track parameters and metrics:

In [None]:
mlflow.xgboost.autolog()

Let's create a simple classifier and train it:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")

In [None]:
with mlflow.start_run() as run:
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    print("Recall: %.2f%%" % (recall * 100.0))

### Registering the model in Azure ML

Since our experiments are being tracked in Azure ML, we can simply register models in the registry like this:

In [None]:
mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model", name="databricks-heart-classifier"
)

## Scenario 2: Training and tracking experiments in Azure Databricks with Model Registries in Azure ML

In some cases, you may want to keep tracking experiments in the MLflow instance that comes with Azure Databricks. This is the case, for instance, for customers who were already using MLflow in Azure Databricks and want to keep their existing experiments there. However, they may want to take advantage of the deployment capabilities of Azure ML, including managed inference solutions, no-code deployments, etc.

In these cases, it is possible to keep tracking experiments on Azure Databricks while keeping your models registered and deployed in Azure ML. This example shows you how to achieve this configuration:

### Configuring the model registry

MLflow allows you to separate the instance where experiments are tracked from the instance where models are registered. The first is referred to as the **Tracking URI**, while the second is referred to as the **Registry URI**. By default, both are set to the same value. In Azure Databricks, both are set to "databricks", meaning that tracking and model registration happen inside the MLflow instance that Databricks runs for you.

We are going to track the experiments in Azure Databricks, but models will be registered in Azure ML. This will allow us to manage the model's lifecycle, including deployments, in Azure ML.

In [None]:
import mlflow

mlflow.set_registry_uri(azureml_tracking_uri)

#### Configuring the experiment

Tracking of experiments will happen in Azure Databricks, so we need to use the naming convention used there.

>Note that naming in Azure Databricks is different, as you have to use the path where the experiment will be saved.

In [None]:
mlflow.set_experiment(
    experiment_name=f"/Users/{adb_user_id}/heart-condition-classifier"
)
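
The two naming conventions used in this notebook can be contrasted in a small helper. This is only an illustrative sketch: the function `experiment_name_for` and its `backend` argument are hypothetical names, and the `/Users/<user>/<name>` layout is simply the one used in the cell above.

```python
def experiment_name_for(backend, name, user_id=None):
    """Return the experiment name in the convention of the tracking backend.

    backend: "databricks" for the MLflow instance in Azure Databricks,
             anything else for Azure ML or a generic MLflow server.
    """
    if backend == "databricks":
        # Databricks expects a workspace path rather than a bare name.
        if not user_id:
            raise ValueError("user_id is required for Databricks experiment paths")
        return f"/Users/{user_id}/{name}"
    # Azure ML and MLflow in general accept the plain experiment name.
    return name


# Placeholder user, for illustration only.
print(experiment_name_for("databricks", "heart-condition-classifier", "someone@example.com"))
print(experiment_name_for("azureml", "heart-condition-classifier"))
```

The first call yields a path such as the one passed to `mlflow.set_experiment` in this scenario, while the second yields the plain name used in Scenario 1.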

#### Exploring the data

In [None]:
import pandas as pd

file_url = "https://azuremlexampledata.blob.core.windows.net/data/heart-disease-uci/data/heart.csv"
df = pd.read_csv(file_url)
display(df)

As we can see, some of the variables are categorical. To make it simpler for our model to handle these values, let's use their encoded values:

In [None]:
df["thal"] = df["thal"].astype("category").cat.codes

Let's split our dataset into train and test sets, so we can assess the performance of the model on data it was not trained on:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("target", axis=1), df["target"], test_size=0.3
)

#### Training a model

We are going to use autologging capabilities in MLflow to track parameters and metrics:

In [None]:
mlflow.xgboost.autolog()

Let's create a simple classifier and train it:

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score

model = XGBClassifier(use_label_encoder=False, eval_metric="logloss")

In [None]:
with mlflow.start_run() as run:
    model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    print("Accuracy: %.2f%%" % (accuracy * 100.0))
    print("Recall: %.2f%%" % (recall * 100.0))

### Registering the model in Azure ML

So far, our model is trained and tracked inside the MLflow instance in Azure Databricks. Now we want to register this model in Azure ML to manage its lifecycle there. However, if we try to register the model as we usually do, with the syntax `mlflow.register_model(model_uri=f"runs:/{run.info.run_id}/model")`, we will get an error. The reason is related to where runs are stored.

Right now, runs are stored in Azure Databricks and models in Azure ML. If you try to create a registered model from a run, Azure ML has no way to access the run, which is stored in a different service. Because of that, you can't use a `runs:/` URI to register models.
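
The rule can be summarized as a small check. This is a simplified sketch of the situation described in this notebook, not a description of MLflow internals: the helper `can_register_from_run` is a hypothetical name, and it assumes that a `runs:/` URI is only usable when the registry and the tracking server are the same instance.

```python
from typing import Optional


def can_register_from_run(tracking_uri: str, registry_uri: Optional[str]) -> bool:
    """Tell whether a `runs:/` URI can be used to register a model.

    Simplifying assumption: the registry can only resolve a run when it
    lives in the same instance that tracks the experiments.
    """
    # When no registry URI is set, it defaults to the tracking URI.
    effective_registry = registry_uri or tracking_uri
    return effective_registry.rstrip("/") == tracking_uri.rstrip("/")


# Scenario 1: tracking and registry both live in Azure ML (placeholder URI).
print(can_register_from_run("azureml://example", None))
# Scenario 2: tracking in Databricks, registry in Azure ML (placeholder URI).
print(can_register_from_run("databricks", "azureml://example"))
```

In Scenario 1 the check passes, which is why `runs:/` worked there. In Scenario 2 it fails, so the model has to reach the registry some other way.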

To overcome this limitation, you have to register the model from the artifacts themselves, which you can do by first downloading them:

In [None]:
client = mlflow.tracking.MlflowClient()
model_path = client.download_artifacts(run.info.run_id, path="model")

`model_path` is a local path to the artifacts representing the MLflow model that was created. We can now use these artifacts to register the model:

> **Important:** Note that doing this has some implications. Since Azure ML knows nothing about the run that generated this model, lineage is lost from this point on. You can, however, store the ID of the run that generated this model in a tag in the registry for your reference.

In [None]:
mlflow.register_model(
    model_uri=f"file://{model_path}", name="databricks-heart-classifier"
)

Notice in the instruction above how the protocol is now `file://` instead of `runs:/`.
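
Because lineage is lost when registering from local files, it may help to record where the model came from. The sketch below only builds a dictionary of tags; the helper `lineage_tags` and the tag keys `source_run_id` and `source_tracking_uri` are hypothetical names chosen for illustration, and the run ID shown is a placeholder.

```python
def lineage_tags(run_id, tracking_uri):
    """Build registry tags that point back to the run that produced a model."""
    if not run_id:
        raise ValueError("run_id must be a non-empty string")
    return {
        # Tag keys are arbitrary; pick names that suit your team.
        "source_run_id": run_id,
        "source_tracking_uri": tracking_uri,
    }


# Placeholder run ID; in this notebook it would come from run.info.run_id.
tags = lineage_tags("0123456789abcdef", "databricks")
print(tags)
```

Recent MLflow releases accept a `tags` argument in `mlflow.register_model`, so a dictionary like this one could be passed along when registering. If your installed version does not, `MlflowClient.set_model_version_tag` can attach each tag to the model version after registration. Check the signatures in your environment before relying on either.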

## Deploying models registered in Azure ML

Once a model is registered in Azure ML, you can deploy it using the Azure ML Studio UI, the Azure ML CLI v2 from a console, or the azureml-mlflow plugin for MLflow. Use the approach that best suits your needs. Here we will demonstrate how to do it using the MLflow deployment plugin.

### Deploying models registered in Azure ML to Managed Inference

To make the deployment happen, you will need a deployment client. Deployments can be created using either the MLflow Python API or the MLflow CLI. In both cases, you need to provide a JSON configuration file with the details of the deployment you want. The full specification of this configuration can be found at [Managed online deployment schema (v2)](https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online).

In [None]:
import json
from mlflow.deployments import get_deploy_client

# Create the deployment configuration.
deploy_config = {
    "instance_type": "Standard_DS2_v2",
    "instance_count": 1,
}

Write the deployment configuration into a file.

In [None]:
deployment_config_path = "deployment_config.json"
with open(deployment_config_path, "w") as outfile:
    outfile.write(json.dumps(deploy_config))

#### Configuring the deployment client

Indicate to MLflow where we want to deploy:

In [None]:
client = get_deploy_client(azureml_tracking_uri)

Indicate to MLflow where the models need to be pulled from. Currently, the source and target URIs need to be the same:

In [None]:
mlflow.set_tracking_uri(azureml_tracking_uri)

#### Deploying the model

MLflow requires the deployment configuration to be passed as a dictionary. Here, the `deploy-config-file` key points to the JSON file written earlier.

In [None]:
config = {"deploy-config-file": deployment_config_path}
model_name = "databricks-heart-classifier"
model_version = 1
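
Both scenarios in this notebook register a model named `databricks-heart-classifier`, so after running them in sequence, version 1 may not be the version you intend to deploy. The following sketch picks the highest version number from a list of registry entries. The helper `latest_version` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for the entries a registry search would return, assuming each entry exposes a `version` attribute.

```python
from types import SimpleNamespace


def latest_version(model_versions):
    """Return the highest version number among registry entries."""
    # Versions may arrive as strings, so compare them as integers.
    versions = [int(mv.version) for mv in model_versions]
    if not versions:
        raise ValueError("no versions found for this model")
    return max(versions)


# Stand-ins for registry entries, for illustration only.
found = [SimpleNamespace(version="1"), SimpleNamespace(version="2")]
print(latest_version(found))
```

In the notebook, the entries could come from MLflow's `MlflowClient.search_model_versions` with a filter such as `name='databricks-heart-classifier'`, and the result could replace the hard-coded `model_version`.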

In [None]:
# `model_uri` identifies the model to deploy and `name` is the name of the service.
# If the model is not registered, it gets registered automatically and a name is autogenerated using the `name` parameter below.
client.create_deployment(
    model_uri=f"models:/{model_name}/{model_version}",
    config=config,
    name="mymodel-mir-deployment",
)