# Detecting Data Drift

First, we must import the TabularDrift detector from the alibi-detect package, as well as the relevant packages for loading and splitting the data:

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from alibi_detect.cd import TabularDrift



Next, we must get and split the data:

In [2]:
wine_data = load_wine()
feature_names = wine_data.feature_names
X, y = wine_data.data, wine_data.target
X_ref, X_test, y_ref, y_test = train_test_split(X, y,
                                                test_size=0.50,
                                                random_state=42)

Next, we must initialize our drift detector using the reference data and the p-value threshold that the statistical significance tests should use. If you want the drift detector to trigger on smaller differences in the data distribution, you must select a larger p_val. The cells below first inspect the raw feature array and the detector's docstring, then create the detector:

In [8]:
X

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

In [4]:
TabularDrift?

In [5]:
cd = TabularDrift(x_ref=X_ref, p_val=.05)



In [6]:
preds = cd.predict(X_test)
labels = ['No', 'Yes']
print('Drift: {}'.format(labels[preds['data']['is_drift']]))

Drift: No


This returns 'Drift: No'.

So, we have not detected drift here, as expected (see the following Important note for more on this).

Although there was no drift in this case, we can easily simulate a scenario where the chemical apparatus being used for measuring the chemical properties experienced a calibration error, and all the values are recorded as 10% higher than their true values.

In this case, if we run drift detection again on the same reference dataset, we will get the following output:


In [7]:
X_test_cal_error = 1.1*X_test
preds = cd.predict(X_test_cal_error)
labels = ['No', 'Yes']
print('Drift: {}'.format(labels[preds['data']['is_drift']]))

Drift: Yes



This returns 'Drift: Yes', showing that the drift has been successfully detected.
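To make the role of p_val more concrete, here is a minimal sketch of the kind of check a feature-wise detector can perform on numerical data: a two-sample Kolmogorov-Smirnov test per feature, with the threshold divided by the number of features (a Bonferroni-style correction). This is an illustration only and is not alibi-detect's implementation. The helper names `ks_statistic`, `ks_p_value` and `feature_drift` are hypothetical, the p-value uses a standard asymptotic approximation, and the data is synthetic rather than the Wine dataset.

```python
import math
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Step past every value equal to v in both samples (handles ties).
        while i < len(a) and a[i] == v:
            i += 1
        while j < len(b) and b[j] == v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d, n, m):
    """Approximate two-sided p-value for a KS statistic d (asymptotic series)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is effectively 1.
        return 1.0
    total = 0.0
    for k in range(1, 101):
        total += (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
    return min(1.0, max(0.0, 2.0 * total))


def feature_drift(x_ref, x_test, p_val=0.05):
    """Flag drift if any feature's p-value falls below p_val / n_features."""
    n_features = len(x_ref[0])
    threshold = p_val / n_features  # Bonferroni-style correction
    p_values = []
    for f in range(n_features):
        ref_col = [row[f] for row in x_ref]
        test_col = [row[f] for row in x_test]
        d = ks_statistic(ref_col, test_col)
        p_values.append(ks_p_value(d, len(ref_col), len(test_col)))
    return min(p_values) < threshold, p_values


# Synthetic two-feature data (an assumption for illustration).
rng = random.Random(0)
x_ref = [[rng.gauss(10, 1), rng.gauss(100, 5)] for _ in range(200)]
x_new = [[rng.gauss(10, 1), rng.gauss(100, 5)] for _ in range(200)]
x_shift = [[1.1 * v for v in row] for row in x_new]  # 10% calibration error

drift_same, _ = feature_drift(x_ref, x_ref)
drift_shift, _ = feature_drift(x_ref, x_shift)
print('Identical data drift:', drift_same)
print('Scaled data drift:', drift_shift)
```

Raising `p_val` raises the per-feature threshold, so smaller distribution gaps are enough to trigger the flag, at the cost of more false alarms.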

IMPORTANT NOTE

This example is very artificial but is useful for illustrating the point. In a standard dataset like this, we would not expect data drift between one randomly sampled 50% of the data and the other 50%. This is why we have to artificially shift the points to show that the detector does indeed work. In real-world scenarios, data drift can occur naturally due to everything from updates to the sensors used for measurements, to changes in consumer behavior, all the way through to changes in database software or schemas. So, be on guard, as many drift cases won't be as easy to spot as this one!

This example shows how, with a few simple lines of Python, we can detect a change in our dataset, which means our ML model may start to degrade in performance if we do not retrain to take the new properties of the data into account. We can also use similar techniques to track when the performance metrics of our model, for example accuracy or mean squared error, are drifting as well. In this case we have to make sure we periodically calculate performance on new test or validation datasets.

The first drift detection example was very simple, and showed us how to detect a basic case of one-off data drift, specifically feature drift. We will now show an example of detecting label drift, which is basically the same, except that we now use the labels as the reference and comparison datasets. We will skip the first few steps as they are identical, and resume from the point where we have reference and test datasets available.

As in the feature drift example, we configure the tabular drift detector, but now we use the reference labels as our baseline dataset:

In [11]:
cd = TabularDrift(x_ref=y_ref, p_val=.05, categories_per_feature={})

In [12]:
preds = cd.predict(y_test)
labels = ['No', 'Yes']
print('Drift: {}'.format(labels[preds['data']['is_drift']]))

Drift: No


This returns 'Drift: No'.

So, we have not detected drift here, as expected. Note that this method can also be used as a good sanity check that training and test data labels follow similar distributions and our sampling of test data is representative.
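Because class labels are categorical, one simple way to run this kind of sanity check by hand is a chi-squared test of homogeneity on the class counts. The following is a minimal sketch under stated assumptions and is not how the detector above was configured: the function names `chi2_label_test` and `p_value_df2` are hypothetical, the label lists are made up, and the closed-form p-value only applies when there are exactly three classes (two degrees of freedom), as with the wine labels.

```python
import math
from collections import Counter


def chi2_label_test(y_ref, y_test):
    """Chi-squared statistic and degrees of freedom for two label samples."""
    classes = sorted(set(y_ref) | set(y_test))
    ref_counts, test_counts = Counter(y_ref), Counter(y_test)
    n_ref, n_test = len(y_ref), len(y_test)
    total = n_ref + n_test
    stat = 0.0
    for c in classes:
        class_total = ref_counts[c] + test_counts[c]
        for observed, n in ((ref_counts[c], n_ref), (test_counts[c], n_test)):
            # Expected count if both samples shared the same class proportions.
            expected = class_total * n / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(classes) - 1


def p_value_df2(stat):
    """Exact chi-squared p-value, valid only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# Made-up label samples with three classes (assumption for illustration).
y_ref_demo = [0] * 30 + [1] * 35 + [2] * 24
y_similar = [0] * 29 + [1] * 36 + [2] * 24
y_shifted = [0] * 60 + [1] * 20 + [2] * 9

stat_similar, dof = chi2_label_test(y_ref_demo, y_similar)
stat_shifted, _ = chi2_label_test(y_ref_demo, y_shifted)
p_similar = p_value_df2(stat_similar)
p_shifted = p_value_df2(stat_shifted)
print('Similar labels drift:', p_similar < 0.05)
print('Shifted labels drift:', p_shifted < 0.05)
```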


We will now move on to a far more complex scenario, which is detecting concept drift.

## Detecting concept drift

Concept drift was described in the ‘Retraining Required’ section, and there it was emphasized that this type of drift is really all about a change in the relationships between the variables in our model. This means by definition that it is far more likely that cases of this type will be complex and potentially quite hard to diagnose.

The most common way to catch concept drift is by monitoring the performance of your model through time. For example, if we are working with the wine classification problem again, we can look at metrics that tell us the model's classification performance, plot these through time, and then build logic around the trends and outliers that we might see in these values.
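As a rough illustration of building logic around a performance metric, the sketch below tracks accuracy over a sliding window of recent predictions and records an alert whenever it falls below a chosen floor. The function name `rolling_accuracy_alerts`, the window size, the accuracy floor and the simulated labels are all assumptions made for this example, and it presumes that ground-truth labels eventually become available in production.

```python
from collections import deque


def rolling_accuracy_alerts(y_true, y_pred, window_size=50, min_accuracy=0.9):
    """Return (index, accuracy) pairs where the rolling accuracy is too low."""
    window = deque(maxlen=window_size)
    alerts = []
    for i, (truth, pred) in enumerate(zip(y_true, y_pred)):
        window.append(truth == pred)
        # Only evaluate once the window is full, to avoid noisy early values.
        if len(window) == window_size:
            accuracy = sum(window) / window_size
            if accuracy < min_accuracy:
                alerts.append((i, accuracy))
    return alerts


# Simulated stream: the model is perfect for 150 predictions, then the
# relationship changes and it always predicts class 0.
y_true_demo = [0, 1, 2] * 100
y_pred_demo = y_true_demo[:150] + [0] * 150

alerts = rolling_accuracy_alerts(y_true_demo, y_pred_demo)
print('First alert at index:', alerts[0][0])
```

A fixed floor is the simplest possible rule. In practice you might compare against a baseline measured at training time or alert on the trend instead.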

The alibi_detect package, which we have already been using, has several useful methods for online drift detection that can be used to find concept drift as it happens and impacts model performance. Online here refers to the fact that the drift detection takes place at the level of a single data point, so that it can happen even if data comes in completely sequentially in production. Several of these assume that either PyTorch or TensorFlow is available as a backend, since the methods use Untrained AutoEncoders (UAEs) as out-of-the-box pre-processing methods.

As an example, let us walk through creating and using one of these online detectors, the Online Maximum Mean Discrepancy method. The following example assumes that, in addition to the reference dataset, x_ref, we have also defined variables for the expected run time, ert, and the window size, window_size. The expected run time states the average number of data points the detector should process before it raises a false positive detection. The idea here is that you want the expected run time to be large, but as it gets larger the detector becomes less sensitive to actual drift, so a balance must be struck. The window_size is the size of the sliding window of data used to calculate the appropriate drift test statistic. A smaller window_size means you are tuning the detector to find sharp changes in the data or performance in a small ‘time’ frame, whereas longer window sizes mean you are tuning it to look for more subtle drift effects over longer periods of ‘time’.
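To give some intuition for the statistic involved, here is a toy sketch that computes a squared Maximum Mean Discrepancy (MMD) estimate between a reference sample and a sliding window over a stream. It is not the MMDDriftOnline implementation: it uses a biased estimate with a fixed-bandwidth RBF kernel, it does not calibrate any detection threshold against an expected run time, and the function names and synthetic data are assumptions for illustration.

```python
import math
import random
from collections import deque


def rbf(x, y, sigma=1.0):
    """RBF (Gaussian) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    k_xx = sum(rbf(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


def sliding_mmd2(x_ref, stream, window_size):
    """Squared MMD of each full sliding window against the reference data."""
    window = deque(maxlen=window_size)
    scores = []
    for x in stream:
        window.append(x)
        if len(window) == window_size:
            scores.append(mmd2(x_ref, list(window)))
    return scores


# Synthetic data: 30 in-distribution points followed by 30 shifted points.
rng = random.Random(0)
x_ref_demo = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
stream_demo = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
               + [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)])

scores = sliding_mmd2(x_ref_demo, stream_demo, window_size=10)
print('First window score:', round(scores[0], 3))
print('Last window score:', round(scores[-1], 3))
```

With a small window the score reacts within a few points of the shift, while a larger window would smooth the score and respond more slowly, which mirrors the trade-off described above.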

In [13]:
from alibi_detect.cd import MMDDriftOnline

We then initialise the drift detector with the variable settings discussed above. We also include the number of bootstrapped simulations we want to apply in order for the method to calculate the thresholds for detecting drift. Depending on your hardware settings for the deep learning library used and the size of the data, this may take some time.


In [14]:
ert = 50
window_size = 10
cd = MMDDriftOnline(X_ref, ert, window_size, backend='pytorch', n_bootstraps=2500)

ImportError: `Framework.PYTORCH` not installed. Cannot initialize and run MMDDriftOnline with pytorch backend. The necessary missing dependencies can be installed using `pip install alibi-detect[torch]`.

We can then simulate the drift detection in a production setting by taking the test data from the Wine dataset and feeding it in one feature vector at a time. If the feature vector for any given instance of data is given by x, we can then call the predict method of the drift detector and retrieve the ‘is_drift’ value from the returned metadata like so:

In [None]:
# x is a single feature vector, for example one row of X_test
cd.predict(x)['data']['is_drift']
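The simulation described above can be sketched as a small loop that feeds instances to a detector one at a time and stops at the first drift flag. The helper `first_drift_index` is hypothetical, and `StubDetector` is a stand-in written only so that the example runs without a deep learning backend. It mimics the `predict(x)['data']['is_drift']` return shape used above and performs no real statistical test.

```python
def first_drift_index(detector, stream):
    """Feed instances one at a time; return the index of the first drift flag."""
    for t, x in enumerate(stream):
        result = detector.predict(x)
        if result['data']['is_drift']:
            return t
    return None  # no drift flagged anywhere in the stream


class StubDetector:
    """Hypothetical stand-in with the same predict() return shape."""

    def __init__(self, limit):
        self.limit = limit

    def predict(self, x):
        # Flags "drift" when any feature exceeds a fixed limit (toy rule).
        return {'data': {'is_drift': int(max(x) > self.limit)}}


demo_stream = [[0.1, 0.2], [0.3, 0.1], [5.0, 0.2], [0.1, 0.1]]
idx = first_drift_index(StubDetector(limit=1.0), demo_stream)
print('First drift flagged at index:', idx)
```

With a working online detector, the same helper would be called as `first_drift_index(cd, X_test)`.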