# Data and Model Drift Detection for Tabular Data

The world our models operate in is constantly changing. For machine learning, this means that deployed models encounter data they have not seen before and can become outdated over time. A proactive approach to drift management is required so that production AI services keep delivering consistent business value in the long term. See our background article [Getting traction on Data and Model Drift with Azure Machine Learning](https://medium.com/p/ebd240176b8b/edit) for an in-depth discussion of the key concepts.

This notebook provides the following mechanisms to detect and mitigate data and model drift:
- Create automated pipelines to identify data drift regularly as part of an MLOps solution using Azure Machine Learning

The notebook was developed and tested using the ``Python 3.8-AzureML`` kernel on an Azure ML Compute Instance.
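
To make the idea of detecting data drift concrete, the sketch below compares a reference sample of one numeric feature with a current sample using a two-sample Kolmogorov-Smirnov test. It is an illustration under assumptions, not the implementation used by the pipeline components in this notebook: the function names (`ks_statistic`, `ks_p_value`, `detect_drift`) are hypothetical, the p-value uses the usual asymptotic approximation with a common small-sample correction, and treating 0.01 as a p-value cutoff is an assumption.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


def ks_p_value(d, n_ref, n_cur, terms=100):
    """Asymptotic two-sided p-value for the KS statistic (approximation)."""
    n_eff = n_ref * n_cur / (n_ref + n_cur)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The alternating series converges slowly here; the p-value is ~1.
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def detect_drift(reference, current, threshold=0.01):
    """Flag drift when the p-value falls below the (assumed) threshold."""
    d = ks_statistic(reference, current)
    p = ks_p_value(d, len(reference), len(current))
    return {"statistic": d, "p_value": p, "drift": p < threshold}


if __name__ == "__main__":
    baseline = list(range(200))
    shifted = [x + 100 for x in baseline]  # same shape, shifted location
    print(detect_drift(baseline, shifted))
    print(detect_drift(baseline, baseline))
```

In practice a test like this would be run per feature, and categorical features would need a different test (for example a chi-squared test).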

## Setup

In [1]:
import pandas as pd
import numpy as np
import os

from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

# Change to the parent directory so that paths to components and data resolve from the project root
os.chdir("../")
print(os.getcwd())

/mnt/batch/tasks/shared/LS_root/mounts/clusters/natashasavic2/code/Users/natashasavic/data-model-drift/tabular-data


In [2]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

In [3]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Name of an Azure Machine Learning compute cluster that is already attached to the workspace
cluster_name = "cpu-cluster"
# print(ml_client.compute.get(cluster_name))

Found the config file in: /mnt/batch/tasks/shared/LS_root/mounts/clusters/natashasavic2/code/Users/natashasavic/.azureml/config.json
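
`MLClient.from_config` looks for a `config.json` file (found above in a `.azureml` folder) that identifies the workspace. On a Compute Instance this file is typically already present. When running elsewhere, a sketch like the following could create it; the helper name `write_workspace_config` is hypothetical and all values are placeholders, not details of this project.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write <directory>/.azureml/config.json with the workspace coordinates."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_dir = Path(directory) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


if __name__ == "__main__":
    # Placeholder values; replace them with your own workspace details.
    target = tempfile.mkdtemp()
    print(write_workspace_config(
        target, "<subscription id>", "<resource group>", "<workspace name>"
    ))
```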


## Build your custom environment

Use this command to build the environment. If you need to make changes, update the environment definition and register it under a new version number.

`az ml environment create --name data-model-drift-env --version 1 --file conda_image_docker.yml --conda-file conda_yamls/env_cli.yml --resource-group <add RG> --workspace-name <add workspace name>`


## Define Azure ML Pipeline

In [4]:
parent_dir = os.getcwd()

# Paths to your custom defined components
prep_yml = "/SDK-V2/prep_data.yml"
drift_yml = "/SDK-V2/data_drift.yml"
drift_db_yml = "/SDK-V2/data_drift_db.yml"


# 1. Load components
prepare_data = load_component(f"{parent_dir}{prep_yml}")
measure_data_drift = load_component(f"{parent_dir}{drift_yml}")
collect_data_drift_values = load_component(f"{parent_dir}{drift_db_yml}")


# 2. Construct pipeline
@pipeline()
def data_drift_preprocess(pipeline_job_input):
    # The parameter names come from the respective component .yml file, e.g. "input_path" is listed under inputs
    transform_data = prepare_data(input_path=pipeline_job_input)
    # The input of this step is the output of the previous step, which is called "output_path"
    drift_detect = measure_data_drift(
        tansformed_data_path=transform_data.outputs.output_path,
        threshold=0.01,
    )
    save_drift_db = collect_data_drift_values(
        tansformed_data_path=transform_data.outputs.output_path,
        threshold=0.01,
    )
    return {
        "pipeline_job_prepped_data": transform_data.outputs.output_path,
        "pipeline_job_detect_data_drift": drift_detect.outputs.drift_plot_path,
        "pipeline_job_store_data_drift": save_drift_db.outputs.drift_db_path,
    }

pipeline_job = data_drift_preprocess(
    Input(type="uri_folder", path=parent_dir + "/data/data_raw/predictive_maintenance")
)
# Demonstrate how to change the pipeline output settings ("rw_mount" is an alternative mode)
pipeline_job.outputs.pipeline_job_prepped_data.mode = "upload"
pipeline_job.outputs.pipeline_job_detect_data_drift.mode = "upload"
pipeline_job.outputs.pipeline_job_store_data_drift.mode = "upload"


# Set the pipeline-level compute (reuses the cluster name defined above)
pipeline_job.settings.default_compute = cluster_name
# Set the pipeline-level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

## Submit pipeline

Now submit your pipeline. To monitor its status, navigate to the `Jobs` pane on the left-hand side of Azure ML studio.

In [5]:
# Submit the pipeline job to the workspace
experiment_name = "data_drift_experiment"

pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name=experiment_name
)
pipeline_job

Uploading data_drift_db_src (0.01 MBs): 100%|██████████| 9649/9649 [00:00<00:00, 174970.45it/s]



| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| data_drift_experiment | dynamic_frog_pp4jpnrj39 | pipeline | Preparing | Link to Azure Machine Learning studio |
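
Besides the studio UI, the job status can also be checked from the notebook. The sketch below is a minimal polling helper, assuming a client that exposes `jobs.get(name)` returning an object with a `status` attribute, as `MLClient` does. The helper name `wait_for_job`, the polling interval, and the set of terminal status strings are assumptions chosen for illustration.

```python
import time

# Status strings treated as final; assumed here, adjust to what your jobs report.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(client, job_name, poll_seconds=30.0, max_polls=120):
    """Poll a job until it reaches a terminal state and return that status."""
    for _ in range(max_polls):
        status = client.jobs.get(job_name).status
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

In this notebook the helper could be called as `wait_for_job(ml_client, pipeline_job.name)`. The SDK also offers `ml_client.jobs.stream(pipeline_job.name)`, which blocks and prints the job logs until the job finishes.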


## Register component

In [None]:
# Register a component in the workspace so that other pipelines can reuse it.
# Any component loaded above works; prepare_data is used here as an example.
registered_component = ml_client.create_or_update(prepare_data)

print(
    f"Component {registered_component.name} with version {registered_component.version} is registered"
)