:::{contents}
:depth: 2
:local: true
:::
Although powerful, modern machine learning models can be sensitive.
Seemingly subtle changes in a data distribution can destroy the
performance of otherwise state-of-the-art models, which can be
especially problematic when ML models are deployed in production.
Typically, ML models are tested on held-out data in order to estimate
their future performance. Crucially, this assumes that the process
underlying the input data $\mathbf{X}$ and output data $\mathbf{Y}$ remains constant.
Drift is said to occur when the process underlying $\mathbf{X}$ and $\mathbf{Y}$
at test time differs from the process that generated the reference (training) data,
as illustrated below.
*Figure: drift in deployment.*
To explore the different types of drift, consider the common scenario
where we deploy a model on input data $\mathbf{X}$ to predict outputs $\mathbf{Y}$.
Letting $P_{ref}$ denote the distribution underlying the reference (training) data,
and $P$ the distribution underlying the deployment data,
we can classify drift under a number of types:
- **Covariate drift**: Also referred to as input drift, this occurs
  when the distribution of the input data has shifted,
  $P(\mathbf{X}) \ne P_{ref}(\mathbf{X})$, whilst $P(\mathbf{Y}|\mathbf{X}) = P_{ref}(\mathbf{Y}|\mathbf{X})$.
  This may result in the model giving unreliable predictions.
- **Prior drift**: Also referred to as label drift, this occurs when
  the distribution of the outputs has shifted,
  $P(\mathbf{Y}) \ne P_{ref}(\mathbf{Y})$, whilst $P(\mathbf{X}|\mathbf{Y}) = P_{ref}(\mathbf{X}|\mathbf{Y})$.
  This can affect the model's decision boundary, as well as the model's performance metrics.
- **Concept drift**: This occurs when the process generating $y$ from $\mathbf{x}$ has changed,
  such that $P(\mathbf{Y}|\mathbf{X}) \ne P_{ref}(\mathbf{Y}|\mathbf{X})$.
  It is possible that the model might no longer give a suitable approximation of the true process.
Note that a change in one of the conditional probabilities $P(\mathbf{Y}|\mathbf{X})$ and $P(\mathbf{X}|\mathbf{Y})$ does not necessarily imply a change in the other.
Below, the different types of drift are visualised for a simple two-dimensional classification problem. It is possible for drift to fall under more than one category; for example, the prior drift below also happens to be a case of covariate drift.
*Figure: 2D drift example.*
It is relatively easy to spot drift by eyeballing these figures.
However, the task becomes considerably harder for high-dimensional real
problems, especially since ground truths are not typically available in
real time. Some types of drift, such as prior and concept drift, are
especially difficult to detect without access to ground truths. As a
workaround, proxies are required; for example, a model's predictions can
be monitored to check for prior drift.
Alibi Detect offers a
wide array of methods for detecting drift (see
here), some of which are examined in the
NeurIPS 2019 paper Failing Loudly: An Empirical Study of Methods for
Detecting Dataset Shift.
Generally, these aim to determine whether the distribution $P(\mathbf{z})$
has drifted from a reference distribution $P_{ref}(\mathbf{z})$, where
$\mathbf{z}$ may represent input data $\mathbf{X}$, true output data $\mathbf{Y}$,
or some other quantity, depending on the type of drift we wish to detect.
Due to natural randomness in the process being modelled, we don’t necessarily expect observations $\mathbf{z}_1,\dots,\mathbf{z}_N$ drawn from $P(\mathbf{z})$ to be identical to $\mathbf{z}^{ref}_1,\dots,\mathbf{z}^{ref}_M$ drawn from $P_{ref}(\mathbf{z})$. To decide whether differences between $P(\mathbf{z})$ and $P_{ref}(\mathbf{z})$ are due to drift or just natural randomness in the data, statistical two-sample hypothesis testing is used, with the null hypothesis $P(\mathbf{z})=P_{ref}(\mathbf{z})$. If the $p$-value obtained is below a given threshold, the null is rejected and the alternative hypothesis $P(\mathbf{z}) \ne P_{ref}(\mathbf{z})$ is accepted, suggesting drift is occurring.
Since real-world data is often high-dimensional and unwieldy, drift detection typically proceeds via the pipeline shown below: the data is first preprocessed (for example with dimension reduction), before a two-sample hypothesis test is applied to the reference and test representations.
:::{figure} images/drift_pipeline.png
:align: center
:alt: Drift detection pipeline

Figure inspired by Figure 1 in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.
:::
Hypothesis testing involves first choosing a test statistic $S(\mathbf{z})$, one which is expected to be small if the null hypothesis is true and large if drift has occurred. The test statistic is computed on the observed data and converted into a $p$-value: the probability of observing a statistic at least as extreme under the null hypothesis. The $p$-value is then compared to the chosen significance threshold.
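To make the mechanics concrete, here is a minimal sketch of a two-sample test on one-dimensional data using SciPy's Kolmogorov-Smirnov test (the data and threshold are illustrative; Alibi Detect's detectors wrap this logic, and much more, for you):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
z_ref = rng.normal(loc=0.0, scale=1.0, size=500)  # reference sample from P_ref(z)
z = rng.normal(loc=0.5, scale=1.0, size=500)      # test sample from a shifted P(z)

# Test statistic and p-value for the null hypothesis P(z) = P_ref(z)
stat, p_val = ks_2samp(z_ref, z)
print(f'KS statistic={stat:.3f}, p-value={p_val:.3g}, drift={p_val < 0.05}')
```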
The test statistics available in Alibi Detect can be broadly split into two categories: univariate and multivariate tests.

- Univariate:
  - Chi-Squared (for categorical data)
  - Kolmogorov-Smirnov
  - Cramér-von Mises
  - Fisher's Exact Test (for binary data)
- Multivariate:
  - Maximum Mean Discrepancy (MMD)
  - Least-Squares Density Difference (LSDD)
When applied to multidimensional data with dimension $D$, the univariate tests are applied to each of the $D$ features separately, and the resulting $p$-values are aggregated with a correction, such as the Bonferroni correction, in order to control the overall false positive rate.
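In Alibi Detect this feature-wise testing and aggregation is handled internally. A minimal sketch, assuming `X_ref` and `X_test` are arrays of shape `(n, D)`:

```python
from alibi_detect.cd import KSDrift

# Feature-wise Kolmogorov-Smirnov tests with Bonferroni-corrected p-values
detector = KSDrift(X_ref, p_val=.05, correction='bonferroni')
preds = detector.predict(X_test)
print(preds['data']['is_drift'])
```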
Given an input dataset of dimension $D$, dimension reduction techniques aim to project it onto a lower-dimensional representation on which the two-sample test can be performed more reliably. Broadly speaking, the following categories of techniques can be used:
- Linear projections
- Non-linear projections
- Feature maps (from ML model)
- Model uncertainty
Alibi Detect allows for a
high degree of flexibility here: a user's chosen dimension
reduction technique can be incorporated into their chosen detector
via the `preprocess_fn` argument (and sometimes
`preprocess_batch_fn` and `preprocess_at_init`, depending on the
detector). In the following sections, these categories of techniques
are briefly introduced. Alibi Detect offers this functionality
using either TensorFlow or
PyTorch backends and preprocessing utilities.
For more details, see the examples.
This includes dimension reduction techniques such as principal
component analysis
(PCA)
and sparse random projections
(SRP). These techniques
use a transformation or projection matrix to reduce the dimensionality
of the data, and can be incorporated into a detector via the `preprocess_fn`
argument, for example using the
scikit-learn
library’s `PCA`
class:
from sklearn.decomposition import PCA
from alibi_detect.cd import MMDDrift

pca = PCA(2)
pca.fit(X_train)
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05, preprocess_fn=pca.transform)
:::{admonition} Note 1: Disjoint training and reference data sets
Astute readers may have noticed that in the code snippet above,
the data `X_train` is used to “train” the PCA model, but
the `MMDDrift` detector is initialised with `X_ref`. This is a
subtle yet important point. If a detector’s preprocessor (a
dimension reduction or other input
preprocessing step) is trained on the
reference data (`X_ref`), any over-fitting to this data may make
the resulting detector overly sensitive to differences between the
reference and test data sets.

To avoid an overly discriminative detector, it is customary to draw
two disjoint datasets from the underlying distribution: a training set
(here `X_train`) with which to fit the preprocessing step, and a held-out
reference set (here `X_ref`) with which to initialise the detector.
:::
A common strategy for obtaining non-linear dimension-reducing
representations is to use an autoencoder, but other non-linear
techniques
can also be used. Autoencoders consist of an encoder function, which maps
inputs onto a lower-dimensional latent space, and a decoder function, which
maps back from the latent space to the input space; for drift detection
purposes only the encoder is required. For example, a PyTorch
encoder can be incorporated into a
detector by packaging it as a callable function using {func}`~alibi_detect.cd.pytorch.preprocess.preprocess_drift`
and {func}`~functools.partial`:
from functools import partial
from alibi_detect.cd.pytorch import preprocess_drift

encoder_net = torch.nn.Sequential(...)  # user-defined encoder mapping inputs to the latent space
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)
detector = MMDDrift(X_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)
Following Detecting and Correcting for Label Shift with Black Box Predictors, feature maps can be extracted from existing pre-trained black-box models such as the image classifier shown below. Instead of using the latent space as the dimensionality-reducing representation, other layers of the model such as the softmax outputs or predicted class-labels can also be extracted and monitored. Since different layers yield different output dimensions, different hypothesis tests are required for each.
:::{figure} images/BBSD.png
:align: center
:alt: Black box shift detection

Figure inspired by this MNIST classification example from the timeserio package.
:::
Failing Loudly: An Empirical Study of Methods for Detecting Dataset
Shift shows that extracting
feature maps from existing models can be an effective technique, which
is encouraging since it allows the user to repurpose existing
black-box models as drift detectors. The syntax for
incorporating existing models into drift detectors is similar to the
previous autoencoder example, with the added step of using
{class}`~alibi_detect.cd.tensorflow.preprocess.HiddenOutput`
to select the network layer from which to extract outputs. The code
snippet below is borrowed from the Maximum Mean Discrepancy drift detector
on CIFAR-10 example, where the softmax
layer of the well-known
ResNet-32 model is fed into
an `MMDDrift`
detector.
from functools import partial
from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift
from alibi_detect.utils.fetching import fetch_tf_model

clf = fetch_tf_model('cifar10', 'resnet32')
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
The model uncertainty-based drift detector uses the ML model of interest itself to detect drift. These detectors aim to directly detect drift that’s likely to affect the performance of a model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set the detector obtains the model’s prediction and some associated notion of uncertainty. The same is done for the test set and if significant differences in uncertainty are detected (via a Kolmogorov-Smirnov test) then drift is flagged. The model’s notion of uncertainty depends on the type of model. For a classifier this may be the entropy of the predicted label probabilities. For a regressor with dropout layers, dropout Monte Carlo can be used to provide a notion of uncertainty.
The model uncertainty-based detectors are classed under the dimension
reduction category since a model's uncertainty is by definition one-dimensional.
However, the syntax for the uncertainty-based detectors differs from that of the
other detectors. Instead of passing a preprocessing step to a detector via
a `preprocess_fn` (or similar) argument, the dimension reduction (in this case
computing a notion of uncertainty) is performed internally by these detectors.
from alibi_detect.cd import RegressorUncertaintyDrift

reg = ...  # PyTorch regression model with at least 1 dropout layer
detector = RegressorUncertaintyDrift(x_ref, reg, backend='pytorch',
                                     p_val=.05, uncertainty_type='mc_dropout')
Dimension reduction is a common preprocessing task (e.g. for covariate drift detection on tabular or image data), but some modalities of data (e.g. text and graph data) require other forms of preprocessing in order for drift detection to be performed effectively.
When dealing with text data, performing drift detection on raw strings or tokenized data is not effective since they do not represent the semantics of the input. Instead, we extract contextual embeddings from transformer language models and detect drift on those. This procedure has a significant impact on the type of drift we detect: strictly speaking, we are no longer detecting pure covariate/input drift, since the entire training procedure (objective function, training data, etc.) for the (pre)trained embeddings has an impact on the embeddings we extract.
:::{figure} images/BERT.png
:align: center
:alt: The DistilBERT language representation model

Figure based on Jay Alammar’s excellent visual guide to the BERT model.
:::
Alibi Detect contains functionality to leverage pre-trained embeddings
from HuggingFace’s
transformer package. Popular
models such as BERT or
DistilBERT (shown above) can be
used, but Alibi Detect also allows you to easily use your own embeddings
of choice. A subsequent dimension reduction step can also be applied if
necessary, as is done in the Text drift detection on IMDB movie
reviews example, where the
768-dimensional embeddings from the BERT model are passed through an
untrained AutoEncoder to reduce their dimensionality. Alibi Detect
allows various types of embeddings to be extracted from transformer
models, using {class}`~alibi_detect.models.tensorflow.embedding.TransformerEmbedding`.
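As a rough sketch of how this can look with the TensorFlow backend (the model name, embedding type, and layer choices below are illustrative assumptions, loosely following the IMDB example):

```python
from functools import partial
from transformers import AutoTokenizer
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.tensorflow import preprocess_drift
from alibi_detect.models.tensorflow import TransformerEmbedding

model_name = 'bert-base-cased'  # assumed pre-trained transformer
tokenizer = AutoTokenizer.from_pretrained(model_name)
embedding = TransformerEmbedding(model_name, 'hidden_state', [-1, -2, -3])

# Tokenize the raw strings and embed them before the two-sample test
preprocess_fn = partial(preprocess_drift, model=embedding, tokenizer=tokenizer,
                        max_len=100, batch_size=32)
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
```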
In a similar manner to text data, graph data requires preprocessing before drift detection can be performed. This can be done by extracting graph embeddings from graph neural network (GNN) encoders, as shown below, and demonstrated in the Drift detection on molecular graphs example.
*Figure: a graph embedding extracted from a GNN encoder.*
For a simple example, we’ll use the MMD detector to check for drift on the two-dimensional binary classification problem shown previously (see notebook). The MMD detector is a kernel-based method for multivariate two-sample testing. Since the number of dimensions is already low, a dimension reduction step is not necessary here. For a more advanced example using the MMD detector with dimension reduction, check out the Maximum Mean Discrepancy drift detector on CIFAR-10 example.
The true model/process is defined as:

$$
y = \begin{cases} 1 & \text{if } x_2 > s x_1 \\ 0 & \text{otherwise,} \end{cases}
$$

where the slope $s$ is set to $-1$.
import numpy as np

def true_model(X, slope=-1):
    z = slope*X[:, 0]
    idx = np.argwhere(X[:, 1] > z)
    y = np.zeros(X.shape[0])
    y[idx] = 1
    return y

true_slope = -1
The reference distribution is defined as a mixture of two Normal distributions:

$$
\mathbf{x} \sim \phi_1 \mathcal{N}\!\left([-1,-1]^T,\, \sigma^2 I\right) + \phi_2 \mathcal{N}\!\left([1,1]^T,\, \sigma^2 I\right),
$$

with the standard deviation set at $\sigma = 0.8$ and the mixture weights at $\phi_1 = \phi_2 = 0.5$. Reference and training data are sampled from this distribution, with labels generated by `true_model()`.
from scipy.stats import multivariate_normal

# Reference distribution
sigma = 0.8
phi1 = 0.5
phi2 = 0.5
ref_norm_0 = multivariate_normal([-1, -1], np.eye(2)*sigma**2)
ref_norm_1 = multivariate_normal([ 1,  1], np.eye(2)*sigma**2)

# Reference data (to initialise the detectors)
N_ref = 240
X_0 = ref_norm_0.rvs(size=int(N_ref*phi1), random_state=1)
X_1 = ref_norm_1.rvs(size=int(N_ref*phi2), random_state=1)
X_ref = np.vstack([X_0, X_1])
y_ref = true_model(X_ref, true_slope)

# Training data (to train the classifier)
N_train = 240
X_0 = ref_norm_0.rvs(size=int(N_train*phi1), random_state=0)
X_1 = ref_norm_1.rvs(size=int(N_train*phi2), random_state=0)
X_train = np.vstack([X_0, X_1])
y_train = true_model(X_train, true_slope)
For a model, we choose the well-known decision tree classifier. As well
as training the model, this is a good time to initialise the MMD detector
with the held-out reference data `X_ref`:
detector = MMDDrift(X_ref, backend='pytorch', p_val=.05)
The significance threshold is set at 5% via the `p_val=.05` argument.
# Fit decision tree classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=20)
clf.fit(X_train, y_train)

# Plot with a pre-defined helper function
plot(X_ref, y_ref, true_slope, clf=clf)

# Classifier accuracy
print('Mean training accuracy %.2f%%' %(100*clf.score(X_ref, y_ref)))

# Initialise a drift detector with the held-out reference data
from alibi_detect.cd import MMDDrift
detector = MMDDrift(X_ref, backend='pytorch', p_val=.05)
.. parsed-literal::
Mean training accuracy 99.17%
No GPU detected, fall back on CPU.
Before introducing drift, we first examine the case where no drift is
present. We resample from the same mixture of Gaussian distributions
to generate test data:
N_test = 120
X_0 = ref_norm_0.rvs(size=int(N_test*phi1),random_state=2)
X_1 = ref_norm_1.rvs(size=int(N_test*phi2),random_state=2)
X_test = np.vstack([X_0, X_1])
# Plot
y_test = true_model(X_test,true_slope)
plot(X_test,y_test,true_slope,clf=clf)
# Classifier accuracy
print('Mean test accuracy %.2f%%' %(100*clf.score(X_test,y_test)))
.. parsed-literal::
Mean test accuracy 95.00%
Unsurprisingly, the model’s mean test accuracy is relatively high. To
run the detector on the test data, the `.predict()` method
is used:
detector.predict(X_test)
.. parsed-literal::
{'data': {'is_drift': 0,
'distance': 0.0023595122654528344,
'p_val': 0.30000001192092896,
'threshold': 0.05,
'distance_threshold': 0.008109889},
'meta': {'name': 'MMDDriftTorch',
'detector_type': 'offline',
'data_type': None,
'backend': 'pytorch'}}
For the test statistic, the MMD detector computes the maximum mean
discrepancy between the reference and test samples, returned as
`'distance'`. An `'is_drift': 0` here indicates that drift is not
detected. More specifically, the detector’s $p$-value (`'p_val'`)
is above the significance threshold of 0.05 (`'threshold'`),
indicating that no statistically significant drift has been detected.
The `.predict()` method also returns `'distance_threshold'`, which is
the threshold in terms of the test statistic, i.e. the value the
distance must exceed in order for drift to be flagged.
To impose covariate drift, we apply a shift to the mean of one of the normal distributions:
shift_norm_0 = multivariate_normal([2, -4], np.eye(2)*sigma**2)
X_0 = shift_norm_0.rvs(size=int(N_test*phi1), random_state=2)
X_1 = ref_norm_1.rvs(size=int(N_test*phi2), random_state=2)
X_test = np.vstack([X_0, X_1])

# Plot
y_test = true_model(X_test, true_slope)
plot(X_test, y_test, true_slope, clf=clf)

# Classifier accuracy
print('Mean test accuracy %.2f%%' %(100*clf.score(X_test, y_test)))

# Check for drift in covariates
pred = detector.predict(X_test)
labels = ['No', 'Yes']
print('Is drift? %s!' %labels[pred['data']['is_drift']])
.. parsed-literal::
Mean test accuracy 66.67%
Is drift? Yes!
The test data has drifted into a previously unseen region of feature space, and the model is now misclassifying a number of test observations. If true test labels are available, this is easily detectable by monitoring the test accuracy. However, labels are not always available at test time, in which case a drift detector monitoring the covariates comes in handy. In this case, the MMD detector successfully detects the covariate drift.
In a similar manner, a proxy for prior drift can be monitored by initialising a detector on labels from the reference set, and then feeding it a model’s predicted labels:
label_detector = MMDDrift(y_ref.reshape(-1,1), backend='tensorflow', p_val=.05)
y_pred = clf.predict(X_test)
label_detector.predict(y_pred.reshape(-1,1))
It can often be challenging to specify a test statistic that is sensitive to the types of change we care about whilst remaining insensitive to those we don't. As an alternative, Alibi Detect offers a number of detectors that learn a test statistic (or part of one) from the data itself:
- Learned kernel
- Classifier
- Spot-the-diff(erence)
These detectors can be highly effective, but they require training, potentially increasing data requirements and set-up time. As when training preprocessing steps, it is important that the learned detectors are trained on data held out from the reference data set (see Note 1). A brief overview of these detectors is given below; for more details, see the detectors’ respective pages.
The MMD detector uses a kernel $k(\cdot,\cdot)$ to define its test statistic, an estimate of the (squared) maximum mean discrepancy between the reference and test samples:

$$
\widehat{MMD}^2 = \frac{1}{M(M-1)}\sum_{i \ne j} k(\mathbf{z}^{ref}_i, \mathbf{z}^{ref}_j)
+ \frac{1}{N(N-1)}\sum_{i \ne j} k(\mathbf{z}_i, \mathbf{z}_j)
- \frac{2}{MN}\sum_{i,j} k(\mathbf{z}^{ref}_i, \mathbf{z}_j),
$$

where by default $k$ is a Gaussian RBF kernel with a fixed bandwidth. The learned kernel drift detector (Liu et al., 2020) instead parameterises the kernel with a deep neural network and trains it on a held-out portion of the data to maximise an estimate of the resulting test power, allowing it to pick up on more complex differences between the reference and test distributions.
The figure below compares the use of a Gaussian kernel and a learned deep kernel for
identifying differences between two distributions.
:::{figure} images/deep_kernel.png
:align: center
:alt: Gaussian and deep kernels

Original image source: Liu et al., 2020. Captions modified to match notation used elsewhere on this page.
:::
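A minimal sketch of a learned kernel detector with the PyTorch backend (the projection network `proj` below is an assumed, user-defined architecture):

```python
import torch.nn as nn
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.pytorch import DeepKernel

# Deep kernel: a trainable RBF kernel applied on top of a learned projection
proj = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
kernel = DeepKernel(proj, eps=0.01)

# A portion of the reference data is held out internally to train the kernel
detector = LearnedKernelDrift(X_ref, kernel, backend='pytorch', p_val=.05, epochs=10)
```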
The classifier-based drift detector (Lopez-Paz and Oquab, 2017) attempts to detect drift by explicitly training a classifier to discriminate between data from the reference and test sets. The statistical test used depends on whether the classifier outputs probabilities or binarized (0 or 1) predictions, but the general idea is to determine whether the classifier's performance is statistically different from random chance. If the classifier can learn to discriminate better than randomly (in a generalisable manner) then drift must have occurred.
Liu et al. show that a classifier-based drift detector is actually a special case of the learned kernel. An important difference is that to train a classifier we maximise its accuracy (or a cross-entropy proxy), while for a learned kernel we maximise the test power directly. Liu et al. show that the latter approach is empirically superior.
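A classifier-based detector can be set up in a similar fashion; the sketch below assumes a simple PyTorch classifier outputting two logits:

```python
import torch.nn as nn
from alibi_detect.cd import ClassifierDrift

# Binary classifier trained to distinguish reference instances from test instances
model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
detector = ClassifierDrift(X_ref, model, backend='pytorch', p_val=.05, preds_type='logits')
```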
The spot-the-diff(erence) drift detector is an extension of the Classifier drift detector, where the classifier is specified in a manner that makes detections interpretable at the feature level when they occur. The detector is inspired by the work of Jitkrittum et al. (2016) but various major adaptations have been made.
As with the usual classifier-based approach, a portion of the available data is used to train a classifier that can discriminate reference instances from test instances. However, the spot-the-diff detector is specified such that when drift is detected, we can inspect the weights of the classifier to shine light on exactly which features of the data were used to distinguish reference from test samples, and therefore caused drift to be detected. The Interpretable drift detection with the spot-the-diff detector on MNIST and Wine-Quality datasets example demonstrates this capability.
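As a rough sketch (with illustrative argument values):

```python
from alibi_detect.cd import SpotTheDiffDrift

# Learns a small number of interpretable "diffs" between reference and test data
detector = SpotTheDiffDrift(X_ref, backend='pytorch', p_val=.05, n_diffs=1)
pred = detector.predict(X_test)
print(pred['data']['is_drift'])  # the learned diffs are also returned for inspection
```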
So far, we have discussed drift detection in an offline context, with the entire test set $\{\mathbf{z}_i\}_{i=1}^{N}$ compared to the reference dataset $\{\mathbf{z}^{ref}_i\}_{i=1}^{M}$. However, at test time, data sometimes arrives sequentially. Here it is desirable to detect drift in an online fashion, allowing us to respond as quickly as possible and limit the damage it might cause.
*Figure: online drift detection.*
One approach is to perform a test for drift every $W$ time steps, using the $W$ most recent observations as the test set. This can be done with the offline detectors described above, but choosing the window size $W$ involves a tradeoff: a small window (such as $W=2$ below) responds quickly to drift but gives the test little data to work with, whilst a large window (such as $W=20$) yields a more powerful test but a slower response.
*Figure: an offline detector applied to a sliding window of size $W=2$.*

*Figure: an offline detector applied to a sliding window of size $W=20$.*
An alternative strategy is to perform a test each time data arrives.
However, the usual offline methods are not applicable here, because the process
for computing $p$-values is too expensive to repeat at every time step and does not
account for the correlation between successive tests performed on overlapping windows
of data. For this reason, Alibi Detect offers dedicated online drift detectors:
- Online Maximum Mean Discrepancy
- Online Least-Squares Density Difference
- Online Cramér-von Mises
- Online Fisher's Exact Test
These detectors leverage the calibration method introduced by
Cobb et al. (2021) in order to ensure they are well calibrated when used in a
sequential manner. The detectors compute a test statistic during an initial
configuration phase; at test time, the statistic is then updated incrementally at
low cost as each new observation arrives, and drift is flagged as soon as it exceeds
a preconfigured threshold. An online detector is initialised in a similar way to its
offline counterpart:
from alibi_detect.cd import MMDDriftOnline

online_detector = MMDDriftOnline(X_ref, ert, window_size, backend='tensorflow', n_bootstraps=5000)
In addition to providing the detector with reference data, the expected run-time (ERT, see below) and the size of the sliding window must also be specified. Another important difference is that the online detectors make predictions on single data instances:
result = online_detector.predict(X[i])
This can be seen in the animation below, where the online detector considers each incoming observation/sample individually, instead of considering a batch of observations like the offline detectors.
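For example, a hypothetical deployment loop (with `X_stream` standing in for whatever source the instances arrive from) might look like:

```python
# Feed instances to the online detector one at a time as they arrive
for t, x_t in enumerate(X_stream):
    result = online_detector.predict(x_t)
    if result['data']['is_drift']:
        print(f'Drift detected at time step {t}')
        break
```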
*Figure: an online detector processing each incoming instance individually.*
Unlike offline detectors, which require the specification of a threshold $p$-value (i.e. a false positive rate), the online detectors are configured by their expected run-time (ERT): the expected number of time steps the detector runs for before making a false detection when no drift has actually occurred.
Usually we would like the ERT to be large; however, this results in
insensitive detectors which are slow to respond when drift does occur.
Hence, there is a tradeoff between the expected run-time and the
expected detection delay (the time taken for the detector to respond to
drift in the data). To target the desired ERT, thresholds are configured
during an initial configuration phase via simulation (`n_bootstraps`
sets the number of bootstrap simulations used here). This configuration
process is only suitable when the amount of reference data is relatively
large (ideally around an order of magnitude larger than the desired
ERT). Configuration can be expensive (less so with a GPU), but it allows
the detector to operate at low cost at test time. For a more in-depth
explanation, see Drift Detection: An Introduction with
Seldon.