### Data drift

You typically train a machine learning model using a historical dataset that is representative of the new data that your model will receive for inferencing. However, over time there may be trends that change the profile of the data, making your model less accurate.

Azure Machine Learning supports data drift monitoring through the use of datasets. You can capture new feature data in a dataset and compare it to the dataset with which the model was trained.

To monitor data drift using registered datasets, you need to register two datasets:

- A baseline dataset - usually the original training data.
- A target dataset that will be compared to the baseline based on time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column so the rate of data drift can be measured.

You can schedule when the data drift monitor runs and configure an alert together with its drift threshold. Data drift is measured as a calculated magnitude of change in the statistical distribution of feature values over time. Some natural random variation between the baseline and target datasets is expected, but large changes may indicate significant data drift and are worth monitoring.
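To make "change in the statistical distribution" more concrete, here is a minimal sketch of one common per-feature distance, the 1-D Wasserstein (earth mover's) distance. This is not the drift magnitude that Azure Machine Learning computes. The function name `wasserstein_1d` and the sample values are illustrative assumptions, and the simplified formula only applies when both samples have the same size.

```python
def wasserstein_1d(baseline, target):
    """Earth mover's distance between two equal-sized 1-D samples."""
    if len(baseline) != len(target):
        raise ValueError("this simplified version needs samples of equal size")
    if not baseline:
        raise ValueError("samples must not be empty")
    # For equal-sized samples the distance is the mean absolute
    # difference between the sorted values.
    pairs = zip(sorted(baseline), sorted(target))
    return sum(abs(b - t) for b, t in pairs) / len(baseline)


# Illustrative values only, not taken from the diabetes data.
baseline_age = [21, 30, 45, 50]
same_profile = [50, 21, 45, 30]               # same values, different order
shifted_age = [a + 5 for a in baseline_age]   # every value 5 years older

print(wasserstein_1d(baseline_age, same_profile))  # 0.0
print(wasserstein_1d(baseline_age, shifted_age))   # 5.0
```

The result is in the feature's own units (years here), so it is not directly comparable to a normalized drift threshold.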

In [5]:
# !pip install azureml-datadrift

In [1]:
from azureml.core import Workspace

ws = Workspace.from_config()

In [2]:
# Baseline dataset.

from azureml.core import Datastore, Dataset
from azureml.data.datapath import DataPath

# Upload the baseline data
default_ds = ws.get_default_datastore()
Dataset.File.upload_directory(src_dir='Script/data/',
                              target=DataPath(default_ds, 'diabetes-baseline/')
                              )

# Create and register the baseline dataset
print('Registering baseline dataset...')

baseline_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-baseline/*.csv'))
baseline_data_set = baseline_data_set.register(workspace=ws,
                           name='diabetes baseline',
                           description='diabetes baseline data',
                           tags = {'format':'CSV'},
                           create_new_version=True)

print('Baseline dataset registered!')

Validating arguments.
Arguments validated.
Uploading file to diabetes-baseline/
Uploading an estimated of 8 files
Target already exists. Skipping upload for diabetes-baseline/diabetes.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes2.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes_2022-12-14.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes_2022-12-21.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes_2022-12-28.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes_2023-01-04.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes_2023-01-11.csv
Target already exists. Skipping upload for diabetes-baseline/diabetes_2023-01-18.csv
Uploaded 0 files
Creating new dataset
Registering baseline dataset...
Baseline dataset registered!


In [3]:
# Target dataset.

import datetime as dt
import pandas as pd

print('Generating simulated data...')

# Load the smaller of the two data files
data = pd.read_csv('./Script/data/diabetes2.csv')

# We'll generate data for the past 6 weeks
weeknos = reversed(range(6))

file_paths = []
for weekno in weeknos:

    # Get the date X weeks ago
    data_date = dt.date.today() - dt.timedelta(weeks=weekno)

    # Modify data to create some drift
    data['Pregnancies'] = data['Pregnancies'] + 1
    data['Age'] = round(data['Age'] * 1.2).astype(int)
    data['BMI'] = data['BMI'] * 1.1

    # Save the file with the date encoded in the filename
    file_path = './Script/data/diabetes_{}.csv'.format(data_date.strftime("%Y-%m-%d"))
    print(f'save {file_path}')
    data.to_csv(file_path)
    file_paths.append(file_path)

# Upload the files
path_on_datastore = 'diabetes-target'

for file in file_paths:
    default_ds.upload_files(files = [file],
                            target_path=path_on_datastore,
                            overwrite=True,
                            show_progress=True)

Generating simulated data...
save ./Script/data/diabetes_2022-12-14.csv
save ./Script/data/diabetes_2022-12-21.csv
save ./Script/data/diabetes_2022-12-28.csv
save ./Script/data/diabetes_2023-01-04.csv
save ./Script/data/diabetes_2023-01-11.csv


"datastore.upload_files" is deprecated after version 1.0.69. Please use "FileDatasetFactory.upload_directory" instead. See Dataset API change notice at https://aka.ms/dataset-deprecation.


save ./Script/data/diabetes_2023-01-18.csv
Uploading an estimated of 1 files
Uploading ./Script/data/diabetes_2022-12-14.csv
Uploaded ./Script/data/diabetes_2022-12-14.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1 files
Uploading ./Script/data/diabetes_2022-12-21.csv
Uploaded ./Script/data/diabetes_2022-12-21.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1 files
Uploading ./Script/data/diabetes_2022-12-28.csv
Uploaded ./Script/data/diabetes_2022-12-28.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1 files
Uploading ./Script/data/diabetes_2023-01-04.csv
Uploaded ./Script/data/diabetes_2023-01-04.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1 files
Uploading ./Script/data/diabetes_2023-01-11.csv
Uploaded ./Script/data/diabetes_2023-01-11.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Uploading an estimated of 1
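The deprecation warning in the output above points to `FileDatasetFactory.upload_directory`, which is the same call the baseline cell uses as `Dataset.File.upload_directory`. A minimal sketch of how the target upload could use it follows. The helper names `stage_dated_files` and `upload_target_files` are hypothetical, and the sketch assumes the simulated files follow the `diabetes_<yyyy-MM-dd>.csv` naming used above. It copies only the dated files to a temporary folder first so the undated baseline files are not uploaded with them.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Matches only the date-stamped files, e.g. diabetes_2023-01-04.csv
DATED_FILE = re.compile(r"^diabetes_\d{4}-\d{2}-\d{2}\.csv$")


def stage_dated_files(src_dir, staging_dir):
    """Copy the date-stamped CSV files into staging_dir and return their names."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and DATED_FILE.match(path.name):
            shutil.copy2(path, staging / path.name)
            staged.append(path.name)
    return staged


def upload_target_files(datastore, src_dir, target_folder="diabetes-target"):
    """Upload the staged files in one call instead of one upload_files call per file."""
    # Imported lazily so stage_dated_files can be used without the Azure ML SDK.
    from azureml.core import Dataset
    from azureml.data.datapath import DataPath

    with tempfile.TemporaryDirectory() as staging_dir:
        stage_dated_files(src_dir, staging_dir)
        Dataset.File.upload_directory(src_dir=staging_dir,
                                      target=DataPath(datastore, target_folder),
                                      overwrite=True,
                                      show_progress=True)
```

With the names used in this notebook, the call would look like `upload_target_files(default_ds, './Script/data/')`.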

In [9]:
# Use the folder partition format to define a dataset with a 'date' timestamp column
# partition_format = path_on_datastore + '/diabetes_{date:yyyy-MM-dd}.csv'

target_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, path_on_datastore + '/*.csv'))
print('Registering target dataset...')
# Register the target dataset

target_data_set = target_data_set.with_timestamp_columns('date').register(workspace=ws,
                                                                          name='diabetes target',
                                                                          description='diabetes target data',
                                                                          tags = {'format':'CSV'},
                                                                          create_new_version=True)

print('Target dataset registered!')

Target dataset registered!


In [10]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "ravazzil-cluster"

try:
    # Check for existing compute target
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If it doesn't already exist, create it
    try:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', max_nodes=2)
        training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
        training_cluster.wait_for_completion(show_output=True)
    except Exception as ex:
        print(ex)


InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


In [11]:
from azureml.datadrift import DataDriftDetector

# set up feature list
features = ['Pregnancies', 'Age', 'BMI']

# set up data drift detector
monitor = DataDriftDetector.create_from_datasets(ws,
                                                 'mslearn-diabates-drift',
                                                 baseline_data_set,
                                                 target_data_set,
                                                 compute_target=cluster_name,
                                                 frequency='Week',
                                                 feature_list=features,
                                                 drift_threshold=.3,
                                                 latency=24)
monitor

{'_logger': <_TelemetryLoggerContextAdapter azureml.datadrift._logging._telemetry_logger.azureml.datadrift.datadriftdetector (DEBUG)>, '_workspace': Workspace.create(name='ravazzil-workspace', subscription_id='d12c1b85-0a70-4232-b483-12d1ffcfc148', resource_group='ResourceGroupRavazzi'), '_frequency': 'Week', '_schedule_start': None, '_schedule_id': None, '_interval': 1, '_state': 'Disabled', '_alert_config': None, '_type': 'DatasetBased', '_id': '46c9ac0a-2041-4837-8338-8e9a014e46c2', '_compute_target_name': 'ravazzil-cluster', '_drift_threshold': 0.3, '_baseline_dataset_id': 'f41c8b60-7daf-488b-b11e-d29819dae3a2', '_target_dataset_id': '3b81a32d-92d8-4398-bd0a-c5eff6c15949', '_feature_list': ['Pregnancies', 'Age', 'BMI'], '_latency': 24, '_name': 'mslearn-diabates-drift', '_latest_run_time': None, '_client': <azureml.datadrift._restclient.datadrift_client.DataDriftClient object at 0x00000298113CD708>}

In [12]:
from azureml.widgets import RunDetails

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())

RunDetails(backfill).show()
backfill.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

ActivityFailedException: ActivityFailedException:
	Message: Activity Failed:
{
    "error": {
        "code": "UserError",
        "message": "Execution failed. User process '/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/bin/python' exited with status code 1. Please check log file 'user_logs/std_log.txt' for error details. Error:     return func(*args, **kwargs)\n  File \"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/tabular_dataset.py\", line 624, in time_between\n    return self._time_filter(self.time_between.__name__,\n  File \"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/tabular_dataset.py\", line 867, in _time_filter\n    self._validate_timestamp_columns([col_fine_timestamp, col_coarse_timestamp])\n  File \"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/tabular_dataset.py\", line 924, in _validate_timestamp_columns\n    _validate_has_columns(self._dataflow, columns, [FieldType.DATE for c in columns])\n  File \"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/dataset_error_handling.py\", line 81, in _validate_has_columns\n    raise DatasetValidationError('The specified columns {} do not exist in the current dataset.'\nazureml.data.dataset_error_handling.DatasetValidationError: DatasetValidationError:\n\tMessage: The specified columns ['date'] do not exist in the current dataset.\n\tInnerException None\n\tErrorResponse \n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"The specified columns ['date'] do not exist in the current dataset.\"\n    }\n}\n\n",
        "messageParameters": {},
        "details": []
    },
    "time": "0001-01-01T00:00:00.000Z",
    "componentName": "CommonRuntime"
}
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Activity Failed:\n{\n    \"error\": {\n        \"code\": \"UserError\",\n        \"message\": \"Execution failed. User process '/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/bin/python' exited with status code 1. Please check log file 'user_logs/std_log.txt' for error details. Error:     return func(*args, **kwargs)\\n  File \\\"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/tabular_dataset.py\\\", line 624, in time_between\\n    return self._time_filter(self.time_between.__name__,\\n  File \\\"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/tabular_dataset.py\\\", line 867, in _time_filter\\n    self._validate_timestamp_columns([col_fine_timestamp, col_coarse_timestamp])\\n  File \\\"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/tabular_dataset.py\\\", line 924, in _validate_timestamp_columns\\n    _validate_has_columns(self._dataflow, columns, [FieldType.DATE for c in columns])\\n  File \\\"/azureml-envs/azureml_b7f96dea2d17a49b2f9af79a608f13c5/lib/python3.8/site-packages/azureml/data/dataset_error_handling.py\\\", line 81, in _validate_has_columns\\n    raise DatasetValidationError('The specified columns {} do not exist in the current dataset.'\\nazureml.data.dataset_error_handling.DatasetValidationError: DatasetValidationError:\\n\\tMessage: The specified columns ['date'] do not exist in the current dataset.\\n\\tInnerException None\\n\\tErrorResponse \\n{\\n    \\\"error\\\": {\\n        \\\"code\\\": \\\"UserError\\\",\\n        \\\"message\\\": \\\"The specified columns ['date'] do not exist in the current dataset.\\\"\\n    }\\n}\\n\\n\",\n        \"messageParameters\": {},\n        \"details\": []\n    },\n    \"time\": \"0001-01-01T00:00:00.000Z\",\n    \"componentName\": \"CommonRuntime\"\n}"
    }
}
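The traceback above ends in a `DatasetValidationError`: the backfill filters the target dataset by its timestamp column, and no `date` column exists in that dataset. A likely cause is that the dataset was created without a `partition_format`, so the date encoded in each filename never became a column. The following is a minimal sketch of one possible fix, not a verified one. The helper names `build_partition_format` and `make_target_dataset` are hypothetical, and it assumes the target folder contains only files named `diabetes_<yyyy-MM-dd>.csv`.

```python
def build_partition_format(folder, prefix="diabetes_"):
    """Return a partition format that maps the date in each filename to a 'date' column."""
    return folder + "/" + prefix + "{date:yyyy-MM-dd}.csv"


def make_target_dataset(datastore, folder="diabetes-target"):
    """Create a tabular dataset whose 'date' column is parsed from the file names."""
    # Imported lazily so build_partition_format can be checked without the SDK.
    from azureml.core import Dataset

    dataset = Dataset.Tabular.from_delimited_files(
        path=(datastore, folder + "/*.csv"),
        partition_format=build_partition_format(folder),
    )
    # Mark the parsed column as the timestamp used for time-based filtering.
    return dataset.with_timestamp_columns("date")


print(build_partition_format("diabetes-target"))
# diabetes-target/diabetes_{date:yyyy-MM-dd}.csv
```

The dataset returned by `make_target_dataset(default_ds)` could then be registered as before. The monitor would probably need to be created again against the newly registered version before the backfill is rerun.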

In [13]:
drift_metrics = backfill.get_metrics()
for metric in drift_metrics:
    print(metric, drift_metrics[metric])

start_date 2022-12-04
end_date 2023-01-22
frequency Week
