# Track model training in notebooks with MLflow

You can use MLflow in a notebook to track any models you train. As you'll run this notebook with an Azure Machine Learning compute instance, you don't need to set up MLflow: it's already installed and integrated. 

You'll prepare some data and train a model to predict whether a person survived an accident. You'll use both autologging and custom logging to explore how you can use MLflow in notebooks.

## Before you start

You'll need the latest version of the **azure-ai-ml** package to run the code in this notebook. Run the cells below to install it (along with **mlflow**) and to verify which version is installed.

> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [1]:
!pip install azure-ai-ml
!pip install mlflow



In [2]:
pip show azure-ai-ml

Name: azure-ai-ml
Version: 1.24.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /anaconda/envs/azureml_py38/lib/python3.10/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-monitor-opentelemetry, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Connect to your workspace

With the required SDK packages installed, you're now ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. Since you're working with a compute instance managed by Azure Machine Learning, you can use the default values to connect to the workspace.

In [3]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

In [4]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json
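The `Found the config file` message above means that `MLClient.from_config` located a `config.json` file. As a rough sketch, that file holds the three workspace identifiers mentioned earlier. The values below are placeholders rather than real identifiers, and the temporary folder is only there to keep the example free of side effects.

```python
import json
import tempfile
from pathlib import Path

# Placeholder identifiers (hypothetical values, replace with your own).
workspace_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the file to a temporary folder so the sketch has no side effects.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

# Read it back to confirm the three keys are present.
loaded_config = json.loads(config_path.read_text())
print(sorted(loaded_config))
```

Outside a compute instance, you could instead pass the same three values directly, for example `MLClient(credential, subscription_id, resource_group_name, workspace_name)`.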


## Configure MLflow

As you're running this notebook on a compute instance in the Azure Machine Learning studio, you don't need to configure MLflow. 

Still, it's good to verify that the necessary library is indeed installed.

> **Note**:
> If the **mlflow** library is not installed, run `pip install mlflow` to install it.

In [5]:
pip show mlflow

Name: mlflow
Version: 2.20.1
Summary: MLflow is an open source platform for the complete machine learning lifecycle
Home-page: 
Author: 
Author-email: 
License: Copyright 2018 Databricks, Inc.  All rights reserved.
        
                                        Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, 

## Prepare the data

You'll train a classification model that predicts whether a person survived an accident. The training data is stored in the **data** folder as **accident.csv**.

First, let's read the data:

In [6]:
import pandas as pd
import numpy as np

print("Reading data...")
accident = pd.read_csv('./data/accident.csv') 
accident

Reading data...


Unnamed: 0,Age,Gender,Speed_of_Impact,Helmet_Used,Seatbelt_Used,Survived
0,56,Female,27.0,No,No,1
1,69,Female,46.0,No,Yes,1
2,46,Male,46.0,Yes,Yes,0
3,32,Male,117.0,No,Yes,0
4,60,Female,40.0,Yes,Yes,0
...,...,...,...,...,...,...
195,69,Female,111.0,No,Yes,1
196,30,Female,51.0,No,Yes,1
197,58,Male,110.0,No,Yes,1
198,20,Male,103.0,No,Yes,1


In [7]:
df = accident.copy().dropna()
df

Unnamed: 0,Age,Gender,Speed_of_Impact,Helmet_Used,Seatbelt_Used,Survived
0,56,Female,27.0,No,No,1
1,69,Female,46.0,No,Yes,1
2,46,Male,46.0,Yes,Yes,0
3,32,Male,117.0,No,Yes,0
4,60,Female,40.0,Yes,Yes,0
...,...,...,...,...,...,...
195,69,Female,111.0,No,Yes,1
196,30,Female,51.0,No,Yes,1
197,58,Male,110.0,No,Yes,1
198,20,Male,103.0,No,Yes,1


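Next, you'll convert the `object` columns to the categorical data type and replace each category with its integer code (this is label encoding rather than one-hot encoding). After that, you'll split the data into features and the label (`Survived`):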

In [8]:
df.dtypes

Age                  int64
Gender              object
Speed_of_Impact    float64
Helmet_Used         object
Seatbelt_Used       object
Survived             int64
dtype: object

In [9]:
cat_columns = ['Gender','Helmet_Used','Seatbelt_Used']
cat_columns

['Gender', 'Helmet_Used', 'Seatbelt_Used']

In [10]:
for cat in cat_columns:
    df[cat] = df[cat].astype('category')

In [11]:
df.dtypes

Age                   int64
Gender             category
Speed_of_Impact     float64
Helmet_Used        category
Seatbelt_Used      category
Survived              int64
dtype: object

In [12]:
df['Gender'].cat.codes.values

array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
      dtype=int8)

In [13]:
# Replace each categorical column with its integer codes
df['Gender'] = df['Gender'].cat.codes.values
df['Helmet_Used'] = df['Helmet_Used'].cat.codes.values
df['Seatbelt_Used'] = df['Seatbelt_Used'].cat.codes.values

df

Unnamed: 0,Age,Gender,Speed_of_Impact,Helmet_Used,Seatbelt_Used,Survived
0,56,0,27.0,0,0,1
1,69,0,46.0,0,1,1
2,46,1,46.0,1,1,0
3,32,1,117.0,0,1,0
4,60,0,40.0,1,1,0
...,...,...,...,...,...,...
195,69,0,111.0,0,1,1
196,30,0,51.0,0,1,1
197,58,1,110.0,0,1,1
198,20,1,103.0,0,1,1
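The integer codes in the table above come from the pandas categorical type. The following minimal sketch uses a made-up toy frame (not the accident data) to show how you might inspect which code stands for which label. It assumes the columns hold plain strings, in which case pandas sorts the categories alphabetically.

```python
import pandas as pd

# Toy frame with the same kind of text columns as the accident data.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Helmet_Used": ["No", "Yes", "No", "No"],
})

code_maps = {}
for col in ["Gender", "Helmet_Used"]:
    as_category = toy[col].astype("category")
    # Categories are sorted, so code 0 is the first label alphabetically.
    code_maps[col] = dict(enumerate(as_category.cat.categories))
    toy[col] = as_category.cat.codes

print(code_maps)
print(toy)
```

Keeping a mapping like `code_maps` around makes it easier to interpret the encoded features later on.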


In [14]:
print("Splitting data...")
feature_columns = ['Age', 'Gender', 'Speed_of_Impact', 'Helmet_Used', 'Seatbelt_Used']
X, y = df[feature_columns].values, df['Survived'].values

Splitting data...


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

In [16]:
X_train[:5,]

array([[43.,  1., 95.,  0.,  0.],
       [61.,  0., 29.,  1.,  1.],
       [24.,  0., 52.,  0.,  0.],
       [47.,  0., 66.,  1.,  1.],
       [39.,  1., 68.,  0.,  0.]])

You now have four NumPy arrays:

- `X_train`: The training dataset containing the features.
- `X_test`: The test dataset containing the features.
- `y_train`: The label for the training dataset.
- `y_test`: The label for the test dataset.

You'll use these to train and evaluate the models.
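To get a feel for what the split does, here is a small sketch on made-up toy arrays (`X_toy` and `y_toy` are illustrative names, not part of the notebook). It shows the 70/30 proportions and that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each, and a binary label.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Splitting twice with the same random_state gives identical results.
split_a = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.30, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 7 training rows and 3 test rows
```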

## Create an MLflow experiment

Now that you're ready to train machine learning models, you'll first create an MLflow experiment. By creating the experiment, you can group all runs within one experiment and make it easier to find the runs in the studio.

In [17]:
import mlflow
experiment_name = "mlflow-experiment-accident"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='', creation_time=1739280502819, experiment_id='41daa28d-29aa-44ec-a283-090a391fe200', last_update_time=None, lifecycle_stage='active', name='mlflow-experiment-accident', tags={}>

## Train and track models

To track a model you train, you can use MLflow and enable autologging. The following cell will train a classification model using logistic regression. You'll notice that you don't need to calculate any evaluation metrics because they're automatically created and logged by MLflow.

In [18]:
from sklearn.linear_model import LogisticRegression

with mlflow.start_run():
    mlflow.sklearn.autolog()

    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)


🏃 View run ivory_sail_9dpyfw4z at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200/runs/0b03dbaf-9c7c-4152-a3d9-4e5ffd04adf1
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200


You can also use custom logging with MLflow. You can add custom logging to autologging, or you can use only custom logging.

Let's train a few more models with scikit-learn. Since you ran the `mlflow.sklearn.autolog()` command before, MLflow will now automatically log any model trained with scikit-learn. To disable the autologging, run the following cell:

In [19]:
mlflow.sklearn.autolog(disable=True)

Now, you can train and track models using only custom logging. 

When you run the following cell, you'll only log one parameter and one metric.

In [20]:
from sklearn.linear_model import LogisticRegression
import numpy as np

with mlflow.start_run():
    model = LogisticRegression(C=1/0.1, solver="liblinear").fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    mlflow.log_param("regularization_rate", 0.1)
    mlflow.log_metric("Accuracy", acc)

🏃 View run magenta_ghost_mz5xp61h at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200/runs/d2f05732-ba5d-4d1f-a696-b48168301f75
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200


One reason to track models is to compare the results of models you train with different hyperparameter values.

For example, you just trained a logistic regression model with a regularization rate of 0.1. Now, train another model, but this time with a regularization rate of 0.01. Since you're also tracking the accuracy, you can compare and decide which rate results in a better performing model.

In [21]:
from sklearn.linear_model import LogisticRegression
import numpy as np

with mlflow.start_run():
    model = LogisticRegression(C=1/0.01, solver="liblinear").fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    mlflow.log_param("regularization_rate", 0.01)
    mlflow.log_metric("Accuracy", acc)

🏃 View run green_van_rb8ytkj5 at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200/runs/1ed97f1a-f3c0-4dea-8d7a-fe77d6ed4d5c
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200
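The cells above pass `C=1/rate` to `LogisticRegression` because scikit-learn's `C` is the inverse of the regularization strength, and they compute accuracy as the mean of correct predictions. The sketch below uses synthetic data (an assumption, not the accident dataset) to show both points: how each rate maps to `C`, and that the manual accuracy matches `accuracy_score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

results = {}
for rate in (0.1, 0.01):
    C = 1 / rate  # a smaller rate gives a larger C, so weaker regularization
    model = LogisticRegression(C=C, solver="liblinear").fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    results[rate] = {
        "C": C,
        "manual": np.average(y_hat == y_te),       # fraction of correct predictions
        "sklearn": accuracy_score(y_te, y_hat),    # same quantity via scikit-learn
    }

for rate, values in results.items():
    print(rate, values)
```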


Another reason to track your models' results is to compare different estimators. All the models you've trained so far used the logistic regression estimator.

Run the following cell to train a model with the decision tree classifier estimator and review whether the accuracy is higher compared to the other runs.

In [22]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

with mlflow.start_run():
    model = DecisionTreeClassifier().fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    mlflow.log_param("estimator", "DecisionTreeClassifier")
    mlflow.log_metric("Accuracy", acc)

🏃 View run clever_napkin_dw12376m at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200/runs/46210501-6adb-428e-8aec-7fca19022ba1
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200


Finally, let's try to log an artifact. An artifact can be any file. For example, you can plot the ROC curve and store the plot as an image. The image can be logged as an artifact. 

Run the following cell to log a parameter, a metric, and an artifact.

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
import numpy as np

with mlflow.start_run():
    model = DecisionTreeClassifier().fit(X_train, y_train)

    y_hat = model.predict(X_test)
    acc = np.average(y_hat == y_test)

    # plot ROC curve
    y_scores = model.predict_proba(X_test)

    fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
    fig = plt.figure(figsize=(6, 4))
    # Plot the diagonal 50% line
    plt.plot([0, 1], [0, 1], 'k--')
    # Plot the FPR and TPR achieved by our model
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.savefig("ROC-Curve.png")

    mlflow.log_param("estimator", "DecisionTreeClassifier")
    mlflow.log_metric("Accuracy", acc)
    mlflow.log_artifact("ROC-Curve.png")

🏃 View run helpful_giraffe_3g9jr1g3 at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200/runs/e7abb426-741f-4f7c-afb5-5d86c0eb79d6
🧪 View experiment at: https://eastus.api.azureml.ms/mlflow/v2.0/subscriptions/cda9116f-5326-4a9b-9407-bc3a4391c27c/resourceGroups/rg-dp100/providers/Microsoft.MachineLearningServices/workspaces/gabbyworkout/#/experiments/41daa28d-29aa-44ec-a283-090a391fe200
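Alongside the plot, you might want a single number that summarizes the ROC curve. The sketch below uses four made-up labels and scores to show how `roc_curve` output relates to the area under the curve. The `AUC` metric name mentioned afterwards is a suggestion, not something the notebook logs.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns the points of the curve, one per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# The area can be computed from the curve points or directly from the scores.
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, y_score)
print(area_from_curve, area_direct)
```

Inside a run, you could then record the value with something like `mlflow.log_metric("AUC", area_direct)`.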


Review the model's results on the Jobs page of the Azure Machine Learning studio. 

- You'll find the parameters under **Params** in the **Overview** tab.
- You'll find the metrics under **Metrics** in the **Overview** tab, and in the **Metrics** tab.
- You'll find the artifacts in the **Outputs + logs** tab.

![Screenshot of outputs and logs tab on the Jobs page.](./images/output-logs.png)