# Data drift detection with Alibi Detect

In [1]:
from alibi_detect.cd import MMDDrift
import numpy as np
from numpy.random import choice

First we load the data and show the shape of each array.

In [2]:
print("Load data")
data = np.load("/data/drift_test.npz")
for k in data.keys():
    print(f"{k}, with shape: {data[k].shape}")

Load data
x_ref, with shape: (2092, 9216)
x_drift, with shape: (2092, 9216)


```x_ref``` is the reference feature set (i.e. the features extracted from the training set), while ```x_drift``` is the set of features generated from the source images after they have been "dirtied" with random Gaussian noise.

We instantiate the [Maximum Mean Discrepancy drift detector](https://docs.seldon.io/projects/alibi-detect/en/stable/cd/methods/mmddrift.html) (MMD) provided by alibi detect. In short, it is a multivariate two-sample test, which fits our needs: it checks whether two sets of feature vectors are drawn from the same distribution. The drift detector is initialized with ```x_ref``` and a ```p_val``` threshold of $0.05$.

In [3]:
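# Documented device names are lowercase ('cuda', 'gpu' or 'cpu'); 'CPU' is not recognised, hence the fallback message.
dd = MMDDrift(x_ref=data["x_ref"], x_ref_preprocessed=False, p_val=.05, backend='pytorch', device='CPU')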

Requested device not recognised, fall back on CPU.
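
To make the test behind the detector more concrete, here is a minimal NumPy sketch of an MMD two-sample test with a permutation-based p-value. It is an illustration under stated assumptions, not alibi detect's implementation: the function names are hypothetical, the kernel is a Gaussian RBF with a median-heuristic bandwidth, and the library's kernel configuration and estimator details may differ.

```python
import numpy as np


def squared_distances(z):
    """Pairwise squared Euclidean distances between the rows of z."""
    sq_norms = np.sum(z**2, axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by rounding


def mmd2_unbiased(k, n):
    """Unbiased MMD^2 from a joint kernel matrix; the first n rows are the reference."""
    m = k.shape[0] - n
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    # Within-sample terms exclude the diagonal (each point compared with itself).
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


def mmd_permutation_test(x_ref, x, n_permutations=100, seed=0):
    """Return (MMD^2 estimate, permutation p-value) for two samples."""
    rng = np.random.default_rng(seed)
    z = np.concatenate([x_ref, x])
    n = len(x_ref)
    sq = squared_distances(z)
    # Assumption: bandwidth from the median heuristic on off-diagonal distances.
    off_diagonal = ~np.eye(len(z), dtype=bool)
    sigma2 = np.median(sq[off_diagonal])
    k = np.exp(-sq / (2.0 * sigma2))
    observed = mmd2_unbiased(k, n)
    # Null distribution: reshuffle which rows count as "reference".
    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        idx = rng.permutation(len(z))
        permuted[i] = mmd2_unbiased(k[np.ix_(idx, idx)], n)
    p_val = float(np.mean(permuted >= observed))
    return float(observed), p_val


# Synthetic stand-ins for the feature arrays (not the data used in this notebook).
demo_rng = np.random.default_rng(42)
clean = demo_rng.normal(size=(100, 16))
shifted = demo_rng.normal(loc=1.0, size=(100, 16))
print(mmd_permutation_test(clean, shifted))
```

Because the within-sample terms leave out the diagonal, the estimate is unbiased and can come out slightly negative when both samples follow the same distribution. Drift is flagged when the p-value falls below the chosen threshold.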


As a sanity check, we first verify that no drift is reported on a subset of the reference data: we draw a random sample of 150 rows from ```x_ref``` (```choice``` samples with replacement by default, so a row may appear more than once) and run the drift prediction on it. We expect no drift to be detected.

In [4]:
sample_from_xref = data["x_ref"][choice(np.arange(2092), size=150)]
dd.predict(x=sample_from_xref)

{'data': {'is_drift': 0,
  'distance': -0.0007553766636618775,
  'p_val': 1.0,
  'threshold': 0.05,
  'distance_threshold': array(0.00123531, dtype=float32)},
 'meta': {'name': 'MMDDriftTorch',
  'online': False,
  'data_type': None,
  'version': '0.10.4',
  'detector_type': 'drift',
  'backend': 'pytorch'}}

As expected, the ```is_drift``` field is 0: the ```p_val``` of 1.0 is well above the 0.05 threshold, so no drift has been detected. For the next test we use the "corrupted" features and expect drift to be detected. Again, only a subset of 150 rows of the drifted data is used, in order to reduce computation time.

In [5]:
sample_from_drift = data["x_drift"][choice(np.arange(2092), size=150)]
dd.predict(x=sample_from_drift)

{'data': {'is_drift': 1,
  'distance': 0.04072630898517504,
  'p_val': 0.0,
  'threshold': 0.05,
  'distance_threshold': array(0.00120294, dtype=float32)},
 'meta': {'name': 'MMDDriftTorch',
  'online': False,
  'data_type': None,
  'version': '0.10.4',
  'detector_type': 'drift',
  'backend': 'pytorch'}}

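As expected, drift is detected on the corrupted data: ```p_val``` is 0.0 and the ```distance``` (about 0.0407) is far above the ```distance_threshold``` (about 0.0012). In production we will use the "online" flavor of this approach (```MMDDriftOnline``` in alibi detect).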
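
For logging or alerting, the dictionary returned by the prediction call can be reduced to a one-line summary. The helper below is a hypothetical sketch (its name and message format are not part of alibi detect); it relies only on the keys visible in the outputs above.

```python
def summarize_drift(preds):
    """Turn a drift prediction dictionary into a one-line summary (hypothetical helper)."""
    d = preds["data"]
    verdict = "drift detected" if d["is_drift"] else "no drift"
    return (
        f"{verdict}: p_val={d['p_val']:.3f} (threshold {d['threshold']}), "
        f"MMD^2={float(d['distance']):.5f} "
        f"(distance_threshold {float(d['distance_threshold']):.5f})"
    )


# Values copied from the drifted-sample output above; the array is replaced by a plain float.
example = {
    "data": {
        "is_drift": 1,
        "distance": 0.04072630898517504,
        "p_val": 0.0,
        "threshold": 0.05,
        "distance_threshold": 0.00120294,
    }
}
print(summarize_drift(example))
# prints: drift detected: p_val=0.000 (threshold 0.05), MMD^2=0.04073 (distance_threshold 0.00120)
```

The ```float(...)``` conversions are there because ```distance_threshold``` comes back as a zero-dimensional NumPy array rather than a Python float.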