# Monitoring Data Drift

Over time, models can become less effective at predicting accurately due to changing trends in feature data. This phenomenon is known as *data drift*, and it's important to monitor your machine learning solution to detect it so you can retrain your models if necessary.
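
As a rough, self-contained illustration of the idea (this is not the method Azure Machine Learning uses to quantify drift), the sketch below measures how far the mean of a single feature has moved between a baseline sample and a newer sample, expressed in baseline standard deviations. The function name `mean_shift` and the sample values are hypothetical.

```python
from statistics import mean, stdev

def mean_shift(baseline, current):
    """Return the change in mean between two samples, in baseline standard deviations."""
    return (mean(current) - mean(baseline)) / stdev(baseline)

# Hypothetical ages seen at training time and in newly collected data
baseline_ages = [21, 25, 30, 35, 40, 45, 50]
current_ages = [age * 1.2 for age in baseline_ages]

shift = mean_shift(baseline_ages, current_ages)
print(f"Mean shifted by {shift:.2f} baseline standard deviations")
```

A value near zero suggests the feature looks much as it did at training time; larger values suggest the feature distribution has moved.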

In this lab, you'll configure data drift monitoring for datasets.

## Before you start

In addition to the latest version of the **azureml-sdk** and **azureml-widgets** packages, you'll need the **azureml-datadrift** package to run the code in this notebook. Run the cell below to verify that it is installed.

In [1]:
!pip install azureml-datadrift





## Connect to your workspace

With the required SDK packages installed, you're ready to connect to your workspace.

> **Note**: If you haven't already established an authenticated session with your Azure subscription, you'll be prompted to authenticate by clicking a link, entering an authentication code, and signing into Azure.

In [2]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

Ready to work with mlops-test


## Create a *baseline* dataset

To monitor a dataset for data drift, you must register a *baseline* dataset (usually the dataset used to train your model) to use as a point of comparison with data collected in the future. 

In [6]:
!pip install azureml-dataset-runtime --upgrade

Collecting azureml-dataset-runtime
  Downloading azureml_dataset_runtime-1.49.0-py3-none-any.whl (2.3 kB)
Collecting azureml-dataprep<4.10.0a,>=4.9.0a
  Downloading azureml_dataprep-4.9.3-py3-none-any.whl (38.4 MB)
Collecting azureml-dataprep-rslex~=2.16.0dev0
  Downloading azureml_dataprep_rslex-2.16.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.5 MB)
ERROR: azureml-train-automl 1.48.0 has requirement azureml-dataset-runtime[fuse,pandas]~=1.48.0, but you'll have azureml-dataset-runtime 1.49.0 which is incompatible.
ERROR: azureml-train-automl-runtime 1.48.0 has requirement azureml-dataset-runtime[fuse,pandas]~=1.48.0, but you'll have azureml-dataset-runtime 1.49.0 which is incompatible.
ERROR: azureml-train-automl-client 1.48.0 has requirement azureml-dataset-runtime~=1.48.0, but you'll have azureml-dat

  Attempting uninstall: azureml-dataprep
    Found existing installation: azureml-dataprep 4.8.3
    Uninstalling azureml-dataprep-4.8.3:
      Successfully uninstalled azureml-dataprep-4.8.3
  Attempting uninstall: azureml-dataset-runtime
    Found existing installation: azureml-dataset-runtime 1.48.0
    Uninstalling azureml-dataset-runtime-1.48.0:
      Successfully uninstalled azureml-dataset-runtime-1.48.0
Successfully installed azureml-dataprep-4.9.3 azureml-dataprep-rslex-2.16.3 azureml-dataset-runtime-1.49.0


In [8]:
!/anaconda/envs/jupyter_env/bin/python -m pip install azureml-dataset-runtime --upgrade

Collecting azureml-dataset-runtime
  Using cached azureml_dataset_runtime-1.49.0-py3-none-any.whl (2.3 kB)
Collecting pyarrow<=9.0.0,>=0.17.0
  Downloading pyarrow-9.0.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (35.3 MB)
Collecting azureml-dataprep<4.10.0a,>=4.9.0a
  Using cached azureml_dataprep-4.9.3-py3-none-any.whl (38.4 MB)
Collecting azureml-dataprep-native<39.0.0,>=38.0.0
  Downloading azureml_dataprep_native-38.0.0-cp38-cp38-manylinux1_x86_64.whl (1.4 MB)
Collecting azure-identity>=1.7.0
  Downloading azure_identity-1.12.0-py3-none-any.whl (135 kB)
Collecting azureml-dataprep-rs

In [9]:
from azureml.core import Datastore, Dataset


# Upload the baseline data
default_ds = ws.get_default_datastore()
print(default_ds)
default_ds.upload_files(files=['data/diabetes.csv', 'data/diabetes2.csv'],
                       target_path='diabetes-baseline',
                       overwrite=True, 
                       show_progress=True)

# Create and register the baseline dataset
print('Registering baseline dataset...')
baseline_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-baseline/*.csv'))
baseline_data_set = baseline_data_set.register(workspace=ws, 
                           name='diabetes baseline',
                           description='diabetes baseline data',
                           tags = {'format':'CSV'},
                           create_new_version=True)

print('Baseline dataset registered!')

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-300436b5-8af5-4d92-aa56-905518df1430",
  "account_name": "mlopstest1640674209",
  "protocol": "https",
  "endpoint": "core.windows.net"
}
Uploading an estimated of 2 files
Uploading data/diabetes.csv
Uploaded data/diabetes.csv, 1 files out of an estimated total of 2
Uploading data/diabetes2.csv
Uploaded data/diabetes2.csv, 2 files out of an estimated total of 2
Uploaded 2 files
Registering baseline dataset...
Baseline dataset registered!


## Create a *target* dataset

Over time, you can collect new data with the same features as your baseline training data. To compare this new data to the baseline data, you must define a target dataset that includes the features you want to analyze for data drift, as well as a timestamp field that indicates the point in time when the new data was current. The timestamp enables you to measure data drift over temporal intervals.

The timestamp can either be a field in the dataset itself, or be derived from the folder and filename pattern used to store the data. For example, you might store new data in a folder hierarchy that consists of a folder for the year, containing a folder for the month, which in turn contains a folder for the day. Alternatively, you might encode the year, month, and day in the file name, like this: *data_2020-01-29.csv*. The following code takes the second approach:

In [10]:
import datetime as dt
import pandas as pd

print('Generating simulated data...')

# Load the smaller of the two data files
data = pd.read_csv('data/diabetes2.csv')

# We'll generate data for the past 6 weeks
weeknos = reversed(range(6))

file_paths = []
for weekno in weeknos:
    
    # Get the date X weeks ago
    data_date = dt.date.today() - dt.timedelta(weeks=weekno)
    
    # Modify data to create some drift
    data['Pregnancies'] = data['Pregnancies'] + 1
    data['Age'] = round(data['Age'] * 1.2).astype(int)
    data['BMI'] = data['BMI'] * 1.1
    
    # Save the file with the date encoded in the filename
    file_path = 'data/diabetes_{}.csv'.format(data_date.strftime("%Y-%m-%d"))
    data.to_csv(file_path)
    file_paths.append(file_path)

# Upload the files
path_on_datastore = 'diabetes-target'
default_ds.upload_files(files=file_paths,
                       target_path=path_on_datastore,
                       overwrite=True,
                       show_progress=True)

# Use the folder partition format to define a dataset with a 'date' timestamp column
partition_format = path_on_datastore + '/diabetes_{date:yyyy-MM-dd}.csv'
target_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, path_on_datastore + '/*.csv'),
                                                       partition_format=partition_format)

# Register the target dataset
print('Registering target dataset...')
target_data_set = target_data_set.with_timestamp_columns('date').register(workspace=ws,
                                                                          name='diabetes target',
                                                                          description='diabetes target data',
                                                                          tags = {'format':'CSV'},
                                                                          create_new_version=True)

print('Target dataset registered!')

Generating simulated data...
Uploading an estimated of 6 files
Uploading data/diabetes_2023-02-09.csv
Uploaded data/diabetes_2023-02-09.csv, 1 files out of an estimated total of 6
Uploading data/diabetes_2023-02-16.csv
Uploaded data/diabetes_2023-02-16.csv, 2 files out of an estimated total of 6
Uploading data/diabetes_2023-02-23.csv
Uploaded data/diabetes_2023-02-23.csv, 3 files out of an estimated total of 6
Uploading data/diabetes_2023-03-02.csv
Uploaded data/diabetes_2023-03-02.csv, 4 files out of an estimated total of 6
Uploading data/diabetes_2023-03-09.csv
Uploaded data/diabetes_2023-03-09.csv, 5 files out of an estimated total of 6
Uploading data/diabetes_2023-03-16.csv
Uploaded data/diabetes_2023-03-16.csv, 6 files out of an estimated total of 6
Uploaded 6 files
Registering target dataset...
Target dataset registered!
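
To make the filename convention concrete, here is a minimal plain-Python sketch of recovering a date from file names like those uploaded above. It only illustrates the idea behind deriving a timestamp from a name; it is not how `partition_format` is implemented, and the helper `date_from_filename` is a hypothetical name.

```python
import datetime as dt
import re

# Matches names such as diabetes_2023-02-09.csv and captures the date part
DATE_IN_NAME = re.compile(r"_(\d{4}-\d{2}-\d{2})\.csv$")

def date_from_filename(file_name):
    """Return the date encoded in a file name, or None if there isn't one."""
    match = DATE_IN_NAME.search(file_name)
    if match is None:
        return None
    return dt.datetime.strptime(match.group(1), "%Y-%m-%d").date()

print(date_from_filename("data/diabetes_2023-02-09.csv"))  # prints 2023-02-09
print(date_from_filename("data/diabetes.csv"))             # prints None
```

Files without a date in their name, such as the baseline files, yield no timestamp, which is why only the dated files are suitable for the target dataset.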


## Create a data drift monitor

Now you're ready to create a data drift monitor for the diabetes data. The data drift monitor will run periodically or on demand to compare the baseline dataset with the target dataset, to which new data will be added over time.

### Create a compute target

To run the data drift monitor, you'll need a compute target. Run the following cell to specify a compute cluster (if it doesn't exist, it will be created).

> **Important**: Change the value of `cluster_name` in the code below (*agcluster* in this example) to the name of your compute cluster before running it! Cluster names must be globally unique and between 2 and 16 characters in length. Valid characters are letters, digits, and the - character.

In [11]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "agcluster"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)
    

InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


> **Note**: Compute instances and clusters are based on standard Azure virtual machine images. For this exercise, the *Standard_DS11_v2* image is recommended to achieve the optimal balance of cost and performance. If your subscription has a quota that does not include this image, choose an alternative image; but bear in mind that a larger image may incur higher cost and a smaller image may not be sufficient to complete the tasks. Alternatively, ask your Azure administrator to extend your quota.
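
If you want to check a candidate cluster name before running the cell above, the following sketch encodes only the naming rules stated in this section (2 to 16 characters; letters, digits, and hyphens). Azure may enforce further rules, and uniqueness cannot be checked locally. The helper name `is_valid_cluster_name` is hypothetical.

```python
import re

# Encodes only the rules stated in this lab: 2 to 16 characters,
# each of which is a letter, a digit, or a hyphen
CLUSTER_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{2,16}")

def is_valid_cluster_name(name):
    """Return True if the name satisfies the length and character rules."""
    return CLUSTER_NAME_PATTERN.fullmatch(name) is not None

print(is_valid_cluster_name("agcluster"))             # prints True
print(is_valid_cluster_name("your-compute-cluster"))  # prints False (20 characters)
```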

### Define the data drift monitor

Now you're ready to use the **DataDriftDetector** class to define the data drift monitor for your data. You can specify the features you want to monitor for data drift, the name of the compute target to be used to run the monitoring process, the frequency at which the data should be compared, the data drift threshold above which an alert should be triggered, and the latency (in hours) to allow for data collection.

In [21]:
!/anaconda/envs/jupyter_env/bin/python -m pip install azureml-datadrift

In [22]:
from azureml.datadrift import DataDriftDetector

# set up feature list
features = ['Pregnancies', 'Age', 'BMI']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws, 'mslearn-diabates-drift', baseline_data_set, target_data_set,
                                                      compute_target=cluster_name, 
                                                      frequency='Week', 
                                                      feature_list=features, 
                                                      drift_threshold=.3, 
                                                      latency=24)
monitor

{'_logger': <_TelemetryLoggerContextAdapter azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector (DEBUG)>, '_workspace': Workspace.create(name='mlops-test', subscription_id='1c2fd79b-ad21-4ad0-8d53-12de16650452', resource_group='mlopstest-group'), '_frequency': 'Week', '_schedule_start': None, '_schedule_id': None, '_interval': 1, '_state': 'Disabled', '_alert_config': None, '_type': 'DatasetBased', '_id': 'b9c83997-7b5e-4737-8840-fc99ecd3be01', '_compute_target_name': 'agcluster', '_drift_threshold': 0.3, '_baseline_dataset_id': 'c5509552-330e-4753-8e26-25f14694feff', '_target_dataset_id': '58e69803-0342-4a37-81c1-9f99b3f687f4', '_feature_list': ['Pregnancies', 'Age', 'BMI'], '_latency': 24, '_name': 'mslearn-diabates-drift', '_latest_run_time': None, '_client': <azureml.datadrift._restclient.datadrift_client.DataDriftClient object at 0x7f229c999880>}

## Backfill the data drift monitor

You have a baseline dataset and a target dataset that includes simulated weekly data collection for six weeks. You can use this to backfill the monitor so that it can analyze data drift between the original baseline and the target data.

> **Note**: This may take some time to run, because the compute target must be started to run the backfill analysis. The widget may not always update to show the status, so click the link to observe the experiment status in Azure Machine Learning studio.

In [23]:
from azureml.widgets import RunDetails

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

RunDetails(backfill).show()
backfill.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'mslearn-diabates-drift-Monitor-Runs_1678973755922',
 'target': 'agcluster',
 'status': 'Completed',
 'startTimeUtc': '2023-03-16T13:50:08.089134Z',
 'endTimeUtc': '2023-03-16T13:53:14.608322Z',
 'services': {},
   'message': 'target dataset id:58e69803-0342-4a37-81c1-9f99b3f687f4 do not contain sufficient amount of data after timestamp filteringMinimum needed: 50 rows.Skipping calculation for time slice 2023-01-29 00:00:00 to 2023-02-05 00:00:00.'}],
 'properties': {'_azureml.ComputeTargetType': 'amlctrain',
  'ContentSnapshotId': '578a7864-2c2a-4b39-9059-339880bedc74',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': 'c5509552-330e-4753-8e26-25f14694feff'}, 'consumptionDetails': {'type': 'Reference'}}, {'dataset': {'id': '58e69803-0342-4a37-81c1-9f99b3f687f4'}, 'consumptionDetails': {'type': 'Reference'}}],
 'outputDatasets': [],
 'runDefinition': {'script': '_generate_sc
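
The run message above reports that a time slice was skipped because it held fewer than 50 rows. The sketch below illustrates that idea in plain Python: it splits a window into consecutive 7-day slices and marks slices with too few rows. How the service actually aligns its slices is not stated here, so the simple 7-day stepping, the function names, and the sample timestamps are all assumptions.

```python
import datetime as dt

MIN_ROWS = 50  # minimum rows per slice, as reported in the run message above

def weekly_slices(start, end):
    """Yield consecutive (slice_start, slice_end) pairs covering start..end in 7-day steps."""
    slice_start = start
    while slice_start < end:
        slice_end = min(slice_start + dt.timedelta(weeks=1), end)
        yield slice_start, slice_end
        slice_start = slice_end

def rows_per_slice(timestamps, start, end):
    """Count the timestamps that fall in each half-open slice [slice_start, slice_end)."""
    return [
        (s, e, sum(1 for t in timestamps if s <= t < e))
        for s, e in weekly_slices(start, end)
    ]

# Hypothetical data: 60 rows stamped 2023-02-09 and 10 rows stamped 2023-02-16
timestamps = [dt.datetime(2023, 2, 9)] * 60 + [dt.datetime(2023, 2, 16)] * 10
start, end = dt.datetime(2023, 2, 5), dt.datetime(2023, 2, 19)

for s, e, n in rows_per_slice(timestamps, start, end):
    status = "analyze" if n >= MIN_ROWS else "skip (too few rows)"
    print(s.date(), "to", e.date(), n, "rows:", status)
```

In this hypothetical data the first slice has enough rows to analyze and the second would be skipped.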

## Analyze data drift

You can use the following code to examine data drift for the points in time collected in the backfill run.

In [24]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])

start_date 2023-01-29
end_date 2023-03-19
frequency Week
Datadrift percentage {'days_from_start': [7, 14, 21, 28, 35, 42], 'drift_percentage': [74.19152901127207, 87.23985219136877, 91.74192122865539, 94.96492628559955, 97.58354951107833, 99.23199438682525]}
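
To read these numbers more easily, you could convert each `days_from_start` offset into a calendar date and compare the drift percentage with the monitor's threshold. The sketch below uses values copied (and rounded) from the output above. It assumes that `drift_threshold=.3` is on a 0 to 1 scale and so corresponds to 30%, and it treats each offset simply as a number of days after `start_date`; both are assumptions made for illustration.

```python
import datetime as dt

# Values copied from the output above (percentages rounded for brevity)
start_date = dt.date(2023, 1, 29)
days_from_start = [7, 14, 21, 28, 35, 42]
drift_percentage = [74.19, 87.24, 91.74, 94.96, 97.58, 99.23]

# Assumption: drift_threshold=.3 is on a 0-1 scale, so it corresponds to 30%
threshold_percent = 0.3 * 100

for days, pct in zip(days_from_start, drift_percentage):
    slice_date = start_date + dt.timedelta(days=days)
    flag = "above threshold" if pct > threshold_percent else "ok"
    print(slice_date, f"{pct:.2f}%", flag)
```

Under that assumption, every weekly measurement in this run exceeds the threshold, which is consistent with the drift deliberately introduced into the simulated data.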


You can also visualize the data drift metrics in [Azure Machine Learning studio](https://ml.azure.com) by following these steps:

1. On the **Datasets** page, view the **Dataset monitors** tab.
2. Click the data drift monitor you want to view.
3. Select the date range over which you want to view data drift metrics (if the column chart does not show multiple weeks of data, wait a minute or so and click **Refresh**).
4. Examine the charts in the **Drift overview** section at the top, which show overall drift magnitude and the drift contribution per feature.
5. Explore the charts in the **Feature detail** section at the bottom, which enable you to see various measures of drift for individual features.

> **Note**: For help understanding the data drift metrics, see [How to monitor datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets#understanding-data-drift-results) in the Azure Machine Learning documentation.

## Explore further

This lab is designed to introduce you to the concepts and principles of data drift monitoring. To learn more about monitoring data drift using datasets, see [Detect data drift on datasets](https://docs.microsoft.com/azure/machine-learning/how-to-monitor-datasets) in the Azure Machine Learning documentation.

You can also collect data from published services and use it as a target dataset for data drift monitoring. See [Collect data from models in production](https://docs.microsoft.com/azure/machine-learning/how-to-enable-data-collection) for details.
