In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.callbacks.batch import PermutationTestOnBatchData
from frouros.detectors.data_drift import MMD

# Data drift - Multivariate detector

The following example shows how to use the MMD (maximum mean discrepancy) {cite:p}`JMLR:v13:gretton12a` multivariate detector on the breast cancer dataset provided by scikit-learn.
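
To give some intuition for what the detector measures, the following is a minimal NumPy sketch of a biased estimate of the squared MMD with a Gaussian (RBF) kernel. It is not the implementation used by the library: the function names `rbf_kernel` and `mmd2_biased`, the bandwidth `sigma`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point error.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, sigma)
    k_yy = rbf_kernel(y, y, sigma)
    k_xy = rbf_kernel(x, y, sigma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


# Synthetic data (hypothetical, not the breast cancer dataset).
rng = np.random.default_rng(31)
x_ref = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
x_same = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
x_shifted = rng.normal(loc=1.0, scale=1.0, size=(200, 5))

mmd_same = mmd2_biased(x_ref, x_same, sigma=2.0)
mmd_shifted = mmd2_biased(x_ref, x_shifted, sigma=2.0)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

The statistic should be close to zero when both samples come from the same distribution and clearly larger when one of them is shifted. On its own the value has no natural threshold, which is why a test such as a permutation test is needed to turn it into a p-value.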

In [2]:
np.random.seed(seed=31)

X, y = load_breast_cancer(return_X_y=True)

Since this is a small dataset and the only objective is to show how to use a multivariate detector, the data is simply split in half: the first half serves as the training/reference data and the second half as the test data.

In [3]:
idx_split = int(X.shape[0] * 0.5)
X_train, y_train, X_test, y_test = X[:idx_split], y[:idx_split], X[idx_split:], y[idx_split:]

The significance level is set to $\alpha = 0.01$.

In [4]:
alpha = 0.01

Create and fit an MMD detector using the training dataset. The `PermutationTestOnBatchData` callback estimates a p-value from 100 permutations.

In [5]:
detector = MMD(
    callbacks=[
        PermutationTestOnBatchData(
            num_permutations=100,
            random_state=31,
            num_jobs=-1,
            name="permutation_test",
            verbose=True,
        ),
    ],
)
_ = detector.fit(X=X_train)
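
As a rough illustration of how a permutation test turns a two-sample statistic into a p-value, the sketch below pools both samples, reshuffles them repeatedly, and counts how often the permuted statistic is at least as large as the observed one. This is a simplified stand-in, not the callback's actual code: `permutation_p_value` and `mean_distance` are hypothetical helpers, the simple mean-distance statistic replaces MMD to keep the example short, and the exact p-value convention used by the library (for instance, whether one is added to the numerator and denominator) is not shown in this example and may differ.

```python
import numpy as np


def permutation_p_value(x, y, statistic, num_permutations=100, seed=31):
    """Fraction of permutations whose statistic is >= the observed one."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y], axis=0)
    n = x.shape[0]
    count = 0
    for _ in range(num_permutations):
        # Randomly reassign the pooled rows to two groups of the original sizes.
        perm = rng.permutation(pooled.shape[0])
        x_perm, y_perm = pooled[perm[:n]], pooled[perm[n:]]
        if statistic(x_perm, y_perm) >= observed:
            count += 1
    return count / num_permutations


def mean_distance(x, y):
    """Illustrative statistic: Euclidean distance between feature means."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))


# Synthetic data (hypothetical, not the breast cancer dataset).
rng = np.random.default_rng(0)
x_ref = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
x_same = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
x_shifted = rng.normal(loc=1.0, scale=1.0, size=(200, 5))

p_same = permutation_p_value(x_ref, x_same, mean_distance)
p_shifted = permutation_p_value(x_ref, x_shifted, mean_distance)
print(f"p-value (same distribution):    {p_same}")
print(f"p-value (shifted distribution): {p_shifted}")
```

Under this convention, 100 permutations yield p-values that are multiples of 0.01, so a p-value strictly below $\alpha = 0.01$ can only be 0.0, meaning that no permuted statistic reached the observed one.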

Fit a logistic regression model on the training/reference dataset.

In [6]:
pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression(random_state=31)),
    ],
)
pipeline.fit(X=X_train, y=y_train)

In addition to obtaining the predictions for the test data by calling the `predict` method, the detector compares the reference data with the test data to determine whether drift is occurring.

In [7]:
y_pred = pipeline.predict(X=X_test)
p_value = detector.compare(X=X_test)[1]["permutation_test"]["p_value"]
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

100%|██████████| 100/100 [00:00<00:00, 456.69it/s]

p-value: 0.0
Data drift detected
Accuracy: 0.9719





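The recorded output above reports a p-value of 0.0, which is below $\alpha = 0.01$, so drift is flagged even on the unmodified test half, although the accuracy remains high (0.9719). One possible explanation is that the dataset is split in order without shuffling, so the two halves need not follow the same distribution. To show the effect of a more pronounced drift on the model, noise is added to the test data, as shown below: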

In [9]:
X_test_noise = X_test + np.random.normal(loc=0, scale=X_test.std(axis=0), size=X_test.shape)
y_pred = pipeline.predict(X=X_test_noise)
p_value = detector.compare(X=X_test_noise)[1]["permutation_test"]["p_value"]
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

100%|██████████| 100/100 [00:00<00:00, 437.38it/s]

p-value: 0.0
Data drift detected
Accuracy: 0.9088





Data drift is detected, and the model's accuracy drops from 0.9719 to 0.9088 on the noisy test data.
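
The noise used above is drawn per feature, with a standard deviation equal to that feature's own standard deviation in the test data. The following sketch, on synthetic data with hypothetical feature scales, illustrates how `scale` broadcasts across columns and why this choice is a strong perturbation: adding independent noise with the same variance doubles each feature's variance, so its standard deviation grows by a factor of about $\sqrt{2}$. The sketch uses `np.random.default_rng` rather than the global seed used in the example.

```python
import numpy as np

rng = np.random.default_rng(31)

# Synthetic stand-in with features on very different scales (assumed values).
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 20.0, 0.1], size=(5000, 3))

# One standard deviation per feature, shape (3,).
feature_std = X.std(axis=0)

# scale has shape (3,) and broadcasts against size (5000, 3),
# so each column receives noise matched to its own spread.
noise = rng.normal(loc=0.0, scale=feature_std, size=X.shape)
X_noise = X + noise

# Ratio of noisy to original standard deviation, per feature.
ratio = X_noise.std(axis=0) / feature_std
print(np.round(ratio, 2))
```

Scaling the noise per feature keeps the perturbation comparable across columns, regardless of their units.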

```{bibliography}
:filter: docname in docnames
```