## AzureML Model Monitoring through Operationalization

This sample notebook walks through the end-to-end lifecycle of operationalizing a machine learning (ML) model. You will train a model, deploy it to production, and monitor it to ensure it continues to perform well, in the following steps:

1) Set up environment
2) Register data assets
3) Train the model
4) Deploy the model
5) Simulate inference requests
6) Monitor the model

Let's begin. 

## Set up your environment

To start, connect to your project workspace.

In [2]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to the project workspace
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

Found the config file in: /config.json
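
For reference, `MLClient.from_config` reads the workspace details from a `config.json` file. The sketch below writes and reloads such a file with the three keys the SDK expects (`subscription_id`, `resource_group`, `workspace_name`). The values are placeholders, and the file is written to a temporary directory (an assumption made here so that a real config is not overwritten).

```python
import json
import tempfile
from pathlib import Path

# Placeholder values (hypothetical); replace them with your own workspace details.
workspace_config = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
}

# Write to a temporary directory so this sketch never touches a real config.json.
config_dir = Path(tempfile.mkdtemp())
config_path = config_dir / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

# Read the file back to confirm it is valid JSON with the expected keys.
loaded_config = json.loads(config_path.read_text())
print(sorted(loaded_config))
```

On an AzureML compute instance this file is typically already present, which is why the cell above finds it without extra setup.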


Create a compute cluster to train your model on.

In [3]:
from azure.ai.ml.entities import AmlCompute

cluster_basic = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="STANDARD_F2S_V2",  # you can replace it with other supported VM SKUs
    location=ml_client.workspaces.get(ml_client.workspace_name).location,
    min_instances=0,
    max_instances=1,
    idle_time_before_scale_down=360,
)

ml_client.begin_create_or_update(cluster_basic).result()

AmlCompute({'type': 'amlcompute', 'created_on': None, 'provisioning_state': 'Succeeded', 'provisioning_errors': None, 'name': 'cpu-cluster', 'description': None, 'tags': None, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/5f341982-4f40-4ecf-9cee-93ab5e24693f/resourceGroups/rg-sandbox-azureml-01/providers/Microsoft.MachineLearningServices/workspaces/mlw246jkl01/computes/cpu-cluster', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/nimoore-246jkl/code/Users/nimoore/model-monitoring-demo/notebooks', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7fe5dc678bb0>, 'resource_id': None, 'location': 'westus3', 'size': 'STANDARD_F4S_V2', 'min_instances': 0, 'max_instances': 1, 'idle_time_before_scale_down': 360.0, 'identity': None, 'ssh_public_access_enabled': True, 'ssh_settings': None, 'network_settings': <azure.ai.ml.entities._compute.compute.NetworkSettings object at 0x7fe5dc678d00>, 'tier': 'de

## Register data assets

Next, let's use some sample data to train our model. We will split the dataset into a reference set (the first 80% of rows) and a production set (the remaining 20%). We add a timestamp column to simulate "production-like" data, since production data typically comes with timestamps. The dataset used in this example notebook has several columns describing credit card borrowers, plus a column indicating whether they defaulted on their credit card debt. We will train a model to predict `DEFAULT_NEXT_MONTH`, that is, whether a borrower will default on their debt next month.

In [4]:
import pandas as pd
import datetime

# Read the default_of_credit_card_clients dataset into a pandas data frame
data_path = "https://azuremlexamples.blob.core.windows.net/datasets/credit_card/default_of_credit_card_clients.csv"
df = pd.read_csv(data_path, header=1, index_col=0).rename(
    columns={"default payment next month": "DEFAULT_NEXT_MONTH"}
)

# Split the data into production_data_df and reference_data_df
# Use the iloc method to select the first 80% and the last 20% of the rows
reference_data_df = df.iloc[: int(0.8 * len(df))].copy()
production_data_df = df.iloc[int(0.8 * len(df)) :].copy()

# Add a timestamp column in ISO8601 format
timestamp = datetime.datetime.now() - datetime.timedelta(days=45)
reference_data_df["TIMESTAMP"] = timestamp.strftime("%Y-%m-%dT%H:%M:%S")
production_data_df["TIMESTAMP"] = [
    timestamp + datetime.timedelta(minutes=i * 10)
    for i in range(len(production_data_df))
]
production_data_df["TIMESTAMP"] = production_data_df["TIMESTAMP"].apply(
    lambda x: x.strftime("%Y-%m-%dT%H:%M:%S")
)
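
To make the timestamp logic concrete, here is a minimal standard-library sketch of the same scheme: one ISO 8601 timestamp per production row, spaced 10 minutes apart. The helper name `production_timestamps`, the fixed start time, and the row count of 6,000 (20% of an assumed 30,000-row dataset, which this notebook does not state) are illustrative assumptions.

```python
import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def production_timestamps(start, n_rows, step_minutes=10):
    """Return n_rows ISO 8601 strings spaced step_minutes apart, beginning at start."""
    return [
        (start + datetime.timedelta(minutes=i * step_minutes)).strftime(ISO_FORMAT)
        for i in range(n_rows)
    ]


# Fixed start time (hypothetical) so the result is reproducible.
start = datetime.datetime(2024, 1, 1, 0, 0, 0)
stamps = production_timestamps(start, n_rows=4)
print(stamps)

# Time covered by a production set of 6,000 rows (assumed size).
span = datetime.timedelta(minutes=10 * (6000 - 1))
print(span)
```

Under that assumed row count, the production timestamps cover a little under 42 days, so starting 45 days in the past keeps every simulated timestamp earlier than the current time.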

In [23]:
import os


def write_df(df, local_path, file_name):
    # Create directory if it does not exist
    os.makedirs(local_path, exist_ok=True)

    # Write data
    df.to_csv(f"{local_path}/{file_name}", index=False)


# Write data to local directory
reference_data_dir_local_path = "../data/reference"
production_data_dir_local_path = "../data/production"

write_df(reference_data_df, reference_data_dir_local_path, "01.csv")
write_df(production_data_df, production_data_dir_local_path, "01.csv")

In [41]:
import mltable
from mltable import MLTableHeaders, MLTableFileEncoding

from azureml.fsspec import AzureMachineLearningFileSystem
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes


def upload_data_and_create_data_asset(
    local_path, remote_path, datastore_uri, data_name, data_version
):
    # Write MLTable file
    tbl = mltable.from_delimited_files(
        paths=[{"pattern": f"{datastore_uri}{remote_path}*.csv"}],
        delimiter=",",
        header="all_files_same_headers",
        infer_column_types=True,
        include_path_column=False,
        encoding="utf8",
    )

    tbl.save(local_path)

    # Instantiate file system
    fs = AzureMachineLearningFileSystem(datastore_uri)

    # Upload data
    fs.upload(
        lpath=local_path,
        rpath=remote_path,
        recursive=False,
        **{"overwrite": "MERGE_WITH_OVERWRITE"},
    )

    # Define the Data asset object
    data = Data(
        path=f"{datastore_uri}{remote_path}",
        type=AssetTypes.MLTABLE,
        name=data_name,
        version=data_version,
    )

    # Create the data asset in the workspace
    ml_client.data.create_or_update(data)

    return data


# Datastore uri for data
datastore_uri = "azureml://subscriptions/{}/resourcegroups/{}/workspaces/{}/datastores/workspaceblobstore/paths/".format(
    ml_client.subscription_id, ml_client.resource_group_name, ml_client.workspace_name
)

# Define paths
reference_data_dir_remote_path = "data/credit-default/reference/"
production_data_dir_remote_path = "data/credit-default/production/"

# Define data asset names
reference_data_asset_name = "credit-default-reference"
production_data_asset_name = "credit-default-production"

# Write data to remote directory and create data asset
reference_data = upload_data_and_create_data_asset(
    reference_data_dir_local_path,
    reference_data_dir_remote_path,
    datastore_uri,
    reference_data_asset_name,
    "1",
)
production_data = upload_data_and_create_data_asset(
    production_data_dir_local_path,
    production_data_dir_remote_path,
    datastore_uri,
    production_data_asset_name,
    "1",
)
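
The upload helper joins `datastore_uri` and `remote_path` by plain string concatenation, so the slashes on both pieces have to line up. The following sketch shows how the resulting URIs are composed; `build_datastore_uri` and `csv_pattern` are hypothetical helper names, and the IDs are placeholders.

```python
def build_datastore_uri(subscription_id, resource_group, workspace,
                        datastore="workspaceblobstore"):
    """Compose the azureml:// URI prefix used for paths inside a datastore."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}/paths/"
    )


def csv_pattern(datastore_uri, remote_path):
    """Return the glob pattern that matches every CSV file under remote_path."""
    # The two parts are concatenated as-is, so validate the slashes up front.
    if not datastore_uri.endswith("/"):
        raise ValueError("datastore_uri must end with '/'")
    if remote_path.startswith("/") or not remote_path.endswith("/"):
        raise ValueError("remote_path must be relative and end with '/'")
    return f"{datastore_uri}{remote_path}*.csv"


# Placeholder identifiers (hypothetical).
uri = build_datastore_uri("<subscription-id>", "<resource-group>", "<workspace-name>")
pattern = csv_pattern(uri, "data/credit-default/reference/")
print(pattern)
```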

## Train the model

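Submit the training pipeline defined in `../configurations/training_pipeline.yaml` and stream its logs until the job completes.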

In [4]:
from azure.ai.ml import load_job

# Define training pipeline directory
training_pipeline_path = "../configurations/training_pipeline.yaml"

# Trigger training
training_pipeline_definition = load_job(source=training_pipeline_path)
training_pipeline_job = ml_client.jobs.create_or_update(training_pipeline_definition)

ml_client.jobs.stream(training_pipeline_job.name)

Uploading code (0.01 MBs): 100%|██████████| 7445/7445 [00:00<00:00, 82192.33it/s]



RunId: goofy_nutmeg_1yysd7rnvf
Web View: https://ml.azure.com/runs/goofy_nutmeg_1yysd7rnvf?wsid=/subscriptions/5f341982-4f40-4ecf-9cee-93ab5e24693f/resourcegroups/rg-sandbox-azureml-01/workspaces/mlw246jkl01

Streaming logs/azureml/executionlogs.txt

[2024-01-11 09:22:19Z] Submitting 1 runs, first five are: ff07cd7b:b6d9c913-4324-4c2b-be13-cc78221c75df
[2024-01-11 09:23:16Z] Completing processing run id b6d9c913-4324-4c2b-be13-cc78221c75df.

Execution Summary
RunId: goofy_nutmeg_1yysd7rnvf
Web View: https://ml.azure.com/runs/goofy_nutmeg_1yysd7rnvf?wsid=/subscriptions/5f341982-4f40-4ecf-9cee-93ab5e24693f/resourcegroups/rg-sandbox-azureml-01/workspaces/mlw246jkl01



## Deploy the model

Deploy the model with AzureML managed online endpoints.

### Create Endpoint

In [5]:
from azure.ai.ml import load_online_endpoint

# Define endpoint directory
endpoint_path = "../endpoints/endpoint.yaml"

# Trigger endpoint creation
endpoint_definition = load_online_endpoint(source=endpoint_path)
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint_definition)

In [9]:
# Check endpoint status
endpoint = ml_client.online_endpoints.get(name=endpoint_definition.name)
print(
    f'Endpoint "{endpoint.name}" with provisioning state "{endpoint.provisioning_state}" is retrieved'
)

Endpoint "credit-default" with provisioning state "Succeeded" is retrieved


### Create Deployment

As part of the deployment configuration, the Model Data Collector (MDC) is enabled, so that inference data is collected for model monitoring. 

In [13]:
from azure.ai.ml import load_online_deployment

# Define deployment directory
deployment_path = "../endpoints/deployment.yaml"

# Trigger deployment creation
deployment_definition = load_online_deployment(source=deployment_path)
deployment = ml_client.online_deployments.begin_create_or_update(deployment_definition)

Check: endpoint credit-default exists
Uploading code (0.01 MBs): 100%|██████████| 7431/7431 [00:00<00:00, 79754.43it/s]

In [15]:
# Check deployment status
deployment = ml_client.online_deployments.get(
    name=deployment_definition.name, endpoint_name=endpoint_definition.name
)
print(
    f'Deployment "{deployment.name}" with provisioning state "{deployment.provisioning_state}" is retrieved'
)

Deployment "main" with provisioning state "Updating" is retrieved
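
The deployment above is still in the `Updating` state, so it is worth waiting until provisioning finishes before sending requests. One option is to call `.result()` on the poller returned by `begin_create_or_update`. Another is to poll the status, as in the minimal sketch below. The helper `wait_for_terminal_state` is hypothetical, and the set of terminal state names is an assumption; the service is replaced by a stand-in iterator so the sketch runs on its own.

```python
import time

# Assumed terminal provisioning states; adjust if your service reports others.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def wait_for_terminal_state(get_state, timeout_s=1800, poll_interval_s=30,
                            sleep=time.sleep):
    """Call get_state() repeatedly until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout_s} seconds")
        sleep(poll_interval_s)


# Stand-in for the service: reports two intermediate states, then success.
fake_states = iter(["Creating", "Updating", "Succeeded"])
final_state = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda _: None)
print(final_state)
```

In this notebook, `get_state` could be `lambda: ml_client.online_deployments.get(name=deployment_definition.name, endpoint_name=endpoint_definition.name).provisioning_state`.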



## Simulate production inference data

### Generate Sample Data

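We generate sample inference data by sampling records (with replacement) from the production data and adding zero-mean Gaussian noise, scaled to each feature's standard deviation, to the numeric features. Categorical features are kept as sampled.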

In [16]:
import numpy as np

# Define numeric and categorical feature columns
NUMERIC_FEATURES = [
    "LIMIT_BAL",
    "AGE",
    "BILL_AMT1",
    "BILL_AMT2",
    "BILL_AMT3",
    "BILL_AMT4",
    "BILL_AMT5",
    "BILL_AMT6",
    "PAY_AMT1",
    "PAY_AMT2",
    "PAY_AMT3",
    "PAY_AMT4",
    "PAY_AMT5",
    "PAY_AMT6",
]
CATEGORICAL_FEATURES = [
    "SEX",
    "EDUCATION",
    "MARRIAGE",
    "PAY_0",
    "PAY_2",
    "PAY_3",
    "PAY_4",
    "PAY_5",
    "PAY_6",
]


def generate_sample_inference_data(df_production, number_of_records=20):
    # Sample records
    df_sample = df_production.sample(n=number_of_records, replace=True)

    # Generate numeric features with random noise
    df_numeric_generated = pd.DataFrame(
        {
            feature: np.random.normal(
                0, df_production[feature].std(), number_of_records
            ).astype(np.int64)
            for feature in NUMERIC_FEATURES
        }
    ) + df_sample[NUMERIC_FEATURES].reset_index(drop=True)

    # Take categorical columns
    df_categorical = df_sample[CATEGORICAL_FEATURES].reset_index(drop=True)

    # Combine numerical and categorical columns
    df_combined = pd.concat([df_numeric_generated, df_categorical], axis=1)

    return df_combined
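
Because the noise has the same standard deviation as the feature itself, a generated value can fall outside the plausible range (for example, a negative `AGE`). If that matters for your scenario, one possible variation is to clip each noisy value to the range observed in the data. The sketch below illustrates the idea with the standard library only; `add_clipped_noise` and the sample ages are hypothetical and not part of the notebook's function.

```python
import random
import statistics


def add_clipped_noise(values, rng):
    """Add zero-mean Gaussian noise to each value, then clip to the observed range."""
    std = statistics.stdev(values)
    lowest, highest = min(values), max(values)
    noisy = []
    for value in values:
        # int() truncates toward zero, similar to casting the noise to an integer dtype.
        candidate = value + int(rng.gauss(0, std))
        noisy.append(min(highest, max(lowest, candidate)))
    return noisy


rng = random.Random(0)  # seeded for reproducibility
ages = [23, 35, 41, 29, 52, 60, 37]  # made-up sample values
noisy_ages = add_clipped_noise(ages, rng)
print(noisy_ages)
```

Clipping keeps values realistic but piles probability mass at the range boundaries, which itself shifts the distribution slightly; whether that trade-off is acceptable depends on what you want the monitor to detect.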

In [24]:
import mltable
import pandas as pd
from azure.ai.ml import MLClient

# Load production / inference data
data_asset = ml_client.data.get("credit-default-production", version="1")
tbl = mltable.load(data_asset.path)
df_production = tbl.to_pandas_dataframe()

# Generate sample data for inference
number_of_records = 20
df_generated = generate_sample_inference_data(df_production, number_of_records)

### Call Managed Online Endpoint

Call the endpoint with the sample data. Since your deployment was created with the Model Data Collector (MDC) enabled, the inference inputs and outputs will be collected in your workspace blob storage. 

In [25]:
import json
import os

request_file_name = "request.json"

# Request sample data
data = {"data": df_generated.to_dict(orient="records")}

# Write sample data
with open(request_file_name, "w") as f:
    json.dump(data, f)

# Call online endpoint
result = ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_definition.name,
    deployment_name=deployment_definition.name,
    request_file=request_file_name,
)

# Delete sample data
os.remove(request_file_name)
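
The request body is a JSON object with a single `data` key holding one dictionary per record. As a possible variation on the cell above, the sketch below writes the payload to a temporary file and removes it in a `finally` block, so the file is cleaned up even if the endpoint call raises. The helper `write_request_file` and the two sample records are hypothetical; the endpoint call itself is replaced by reading the file back.

```python
import json
import os
import tempfile


def write_request_file(records):
    """Write {"data": records} to a temporary JSON file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump({"data": records}, f)
    return f.name


# Two made-up records using a subset of the feature columns.
records = [
    {"LIMIT_BAL": 20000, "AGE": 24, "SEX": 2},
    {"LIMIT_BAL": 120000, "AGE": 26, "SEX": 2},
]

request_path = write_request_file(records)
try:
    # In the notebook, the invoke call with request_file=request_path would go here.
    with open(request_path) as f:
        payload = json.load(f)
finally:
    os.remove(request_path)

print(payload["data"][0])
```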

## Create model monitor

The following cell creates a basic model monitor for the deployment and schedules it to run once a day at 3:15. Feel free to augment it to meet the needs of your scenario.

In [19]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    AlertNotification,
    MonitoringTarget,
    MonitorDefinition,
    MonitorSchedule,
    RecurrencePattern,
    RecurrenceTrigger,
    ServerlessSparkCompute,
)

# get a handle to the workspace (replace the placeholder values with your own)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="subscription_id",
    resource_group_name="resource_group_name",
    workspace_name="workspace_name",
)

# create the compute
spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3", runtime_version="3.3"
)

# specify your online endpoint deployment
monitoring_target = MonitoringTarget(
    ml_task="classification", endpoint_deployment_id="azureml:credit-default:main"
)


# create alert notification object
alert_notification = AlertNotification(emails=["abc@example.com", "def@example.com"])

# create the monitor definition
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    alert_notification=alert_notification,
)

# specify the schedule frequency
recurrence_trigger = RecurrenceTrigger(
    frequency="day", interval=1, schedule=RecurrencePattern(hours=3, minutes=15)
)

# create the monitor
model_monitor = MonitorSchedule(
    name="credit_default_monitor_basic",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition,
)

poller = ml_client.schedules.begin_create_or_update(model_monitor)
created_monitor = poller.result()
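
The trigger above (`frequency="day"`, `interval=1`, `hours=3`, `minutes=15`) describes a daily run at 03:15. To reason about when the first run will happen, the sketch below computes the next 03:15 after a given moment. It assumes the schedule is evaluated in UTC, which is an assumption to verify against your schedule's time zone setting; `next_daily_run` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now, hour=3, minute=15):
    """Return the first occurrence of hour:minute strictly after now."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


# Example moment (hypothetical), expressed in UTC.
now = datetime(2024, 1, 11, 9, 22, tzinfo=timezone.utc)
next_run = next_daily_run(now)
print(next_run.isoformat())
```

Since inference data is collected continuously but only analyzed when the monitor runs, requests sent shortly after a run will not show up in monitoring results until the following day's run.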


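The following cell creates an advanced model monitor with explicitly configured monitoring signals and thresholds, using the registered reference data as the baseline. Feel free to augment it to meet the needs of your scenario.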

In [13]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import (
    MonitorDatasetContext,
)
from azure.ai.ml.entities import (
    AlertNotification,
    DataDriftSignal,
    DataQualitySignal,
    PredictionDriftSignal,
    DataDriftMetricThreshold,
    DataQualityMetricThreshold,
    PredictionDriftMetricThreshold,
    NumericalDriftMetrics,
    CategoricalDriftMetrics,
    DataQualityMetricsNumerical,
    DataQualityMetricsCategorical,
    MonitorFeatureFilter,
    MonitoringTarget,
    MonitorDefinition,
    MonitorSchedule,
    RecurrencePattern,
    RecurrenceTrigger,
    ServerlessSparkCompute,
    ReferenceData,
)

# get a handle to the workspace (replace the placeholder values with your own)
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="subscription_id",
    resource_group_name="resource_group_name",
    workspace_name="workspace_name",
)

# create your compute
spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3", runtime_version="3.3"
)

# specify the online deployment (if you have one)
monitoring_target = MonitoringTarget(
    ml_task="classification", endpoint_deployment_id="azureml:credit-default:main"
)

# training data to be used as baseline dataset
reference_data_training = ReferenceData(
    input_data=Input(type="mltable", path="azureml:credit-default-reference:1"),
    target_column_name="DEFAULT_NEXT_MONTH",
    data_context=MonitorDatasetContext.TRAINING,
)

# create an advanced data drift signal
features = MonitorFeatureFilter(top_n_feature_importance=10)

metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(jensen_shannon_distance=0.01),
    categorical=CategoricalDriftMetrics(pearsons_chi_squared_test=0.02),
)

advanced_data_drift = DataDriftSignal(
    reference_data=reference_data_training,
    features=features,
    metric_thresholds=metric_thresholds,
)

# create an advanced prediction drift signal
metric_thresholds = PredictionDriftMetricThreshold(
    categorical=CategoricalDriftMetrics(jensen_shannon_distance=0.01)
)

advanced_prediction_drift = PredictionDriftSignal(
    reference_data=reference_data_training, metric_thresholds=metric_thresholds
)

# create an advanced data quality signal
features = ["SEX", "EDUCATION", "AGE"]

metric_thresholds = DataQualityMetricThreshold(
    numerical=DataQualityMetricsNumerical(null_value_rate=0.01),
    categorical=DataQualityMetricsCategorical(out_of_bounds_rate=0.02),
)

advanced_data_quality = DataQualitySignal(
    reference_data=reference_data_training,
    features=features,
    metric_thresholds=metric_thresholds,
    alert_enabled=False,
)

# put all monitoring signals in a dictionary
monitoring_signals = {
    "data_drift_advanced": advanced_data_drift,
    "prediction_drift_advanced": advanced_prediction_drift,
    "data_quality_advanced": advanced_data_quality,
}

# create alert notification object
alert_notification = AlertNotification(emails=["abc@example.com", "def@example.com"])

# create the monitor definition
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals=monitoring_signals,
    alert_notification=alert_notification,
)

# specify the frequency on which to run your monitor
recurrence_trigger = RecurrenceTrigger(
    frequency="day", interval=1, schedule=RecurrencePattern(hours=3, minutes=15)
)

# create your monitor
model_monitor = MonitorSchedule(
    name="credit_default_monitor_advanced",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition,
)

poller = ml_client.schedules.begin_create_or_update(model_monitor)
created_monitor = poller.result()

Class PredictionDriftMetricThreshold: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class PredictionDriftSignal: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class DataQualityMetricsNumerical: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class DataQualityMetricsCategorical: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class DataQualityMetricThreshold: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class DataQualitySignal: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Class AlertNotification: This is an expe
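
To give a feel for the `jensen_shannon_distance` thresholds used above, here is a minimal sketch that computes the Jensen-Shannon distance between two histograms over the same bins. It uses base-2 logarithms, so the result lies between 0 (identical distributions) and 1 (no overlap). How AzureML bins features and computes the metric internally is not shown in this notebook, so treat this as an illustration of the metric rather than the service's implementation; the histogram counts are made up.

```python
import math


def _kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p_counts, q_counts):
    """Jensen-Shannon distance (square root of the divergence) between two histograms."""
    if len(p_counts) != len(q_counts):
        raise ValueError("Both histograms must use the same bins")
    p_total, q_total = sum(p_counts), sum(q_counts)
    p = [x / p_total for x in p_counts]
    q = [x / q_total for x in q_counts]
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


# Made-up bin counts for one feature in the reference and production data.
reference = [50, 30, 20]
production = [20, 30, 50]

distance = jensen_shannon_distance(reference, production)
print(f"JS distance: {distance:.4f}, exceeds 0.01 threshold: {distance > 0.01}")
```

A threshold of 0.01 is strict on this scale, so even modest distribution shifts would be flagged; you may want to tune it once you have seen a few monitoring runs.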
