In [24]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from frouros.unsupervised.distance_based import MMD

## Unsupervised - Pipeline with multivariate detector

Unsupervised multivariate detectors can also be used as a step in a scikit-learn Pipeline object. The following example uses MMD (Maximum Mean Discrepancy) as the detector on the breast cancer dataset provided by scikit-learn.
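As background on what an MMD-based detector measures, the sketch below computes the biased estimate of the squared MMD between two samples with an RBF kernel, using only NumPy. It is an illustration and not the library's implementation; the function names `rbf_kernel` and `mmd2_biased` and the synthetic data are assumptions made for this example.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * length_scale**2))


def mmd2_biased(x, y, length_scale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    k_xx = rbf_kernel(x, x, length_scale)
    k_yy = rbf_kernel(y, y, length_scale)
    k_xy = rbf_kernel(x, y, length_scale)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


rng = np.random.default_rng(31)
same_a = rng.normal(size=(200, 3))
same_b = rng.normal(size=(200, 3))            # same distribution as same_a
shifted = rng.normal(loc=1.0, size=(200, 3))  # mean shifted by 1 in every feature

mmd_same = mmd2_biased(same_a, same_b)
mmd_shifted = mmd2_biased(same_a, shifted)
print(f"same distribution: {mmd_same:.4f}, shifted: {mmd_shifted:.4f}")
```

The statistic stays close to zero when both samples come from the same distribution and grows when one of them is shifted, which is the signal a detector turns into a p-value.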

In [25]:
np.random.seed(seed=31)

X, y = load_breast_cancer(return_X_y=True)

Since this is a small dataset and the only objective is to show the integration of a multivariate detector with the Pipeline, the data is simply split in half: the first half is used for training (and as reference data) and the second half for testing.

In [26]:
idx_split = int(X.shape[0] * 0.5)
X_train, y_train = X[:idx_split], y[:idx_split]
X_test, y_test = X[idx_split:], y[idx_split:]

The significance level is set to $\alpha = 0.01$.

In [27]:
alpha = 0.01
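The significance level is compared against a p-value. For kernel two-sample tests such as MMD, that p-value is commonly estimated with a permutation test, which is presumably what the `num_permutations` argument of the detector in this notebook controls. The following is a minimal sketch under that assumption and not the library's code: `mean_distance` is a stand-in statistic, `permutation_p_value` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np


def mean_distance(x, y):
    """Stand-in statistic: Euclidean distance between the two sample means."""
    return float(np.linalg.norm(x.mean(axis=0) - y.mean(axis=0)))


def permutation_p_value(x, y, statistic, num_permutations=1000, seed=31):
    """Fraction of random re-splits whose statistic is >= the observed one."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    pooled = np.concatenate([x, y])
    n = len(x)
    exceed = 0
    for _ in range(num_permutations):
        shuffled = rng.permutation(pooled)  # shuffles the rows, returns a copy
        if statistic(shuffled[:n], shuffled[n:]) >= observed:
            exceed += 1
    return exceed / num_permutations


rng = np.random.default_rng(0)
reference = rng.normal(size=(150, 4))
same = rng.normal(size=(150, 4))
shifted = rng.normal(loc=1.0, size=(150, 4))

alpha = 0.01
p_same = permutation_p_value(reference, same, mean_distance)
p_shifted = permutation_p_value(reference, shifted, mean_distance)
drift_flagged = p_shifted < alpha
print(f"p_same={p_same}, p_shifted={p_shifted}, drift flagged: {drift_flagged}")
```

With this estimator the p-value is exactly 0 whenever no permutation reaches the observed statistic. Some formulations add 1 to both the numerator and the denominator to avoid reporting a p-value of zero.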

Next, the pipeline is defined: the detector comes first, followed by standardization and a Logistic Regression model. The pipeline is then fitted, which stores the training data as the detector's reference data and trains the model on that same data.

In [28]:
detector = MMD(num_permutations=1000, kernel=RBF(), random_state=31)
pipeline = Pipeline(steps=[("detector", detector),
                           ("scale", StandardScaler()),
                           ("model", LogisticRegression(random_state=31))])
pipeline.fit(X=X_train, y=y_train)
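To make the mechanics concrete, the sketch below shows how a pass-through step can sit in front of a model in a Pipeline: `fit` stores the reference data, `transform` compares incoming data with it and returns the data unchanged. `MeanShiftMonitor` is a hypothetical toy class written for this illustration, with a deliberately crude score; it is not how the library's detectors are implemented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MeanShiftMonitor(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that records a crude drift score."""

    def fit(self, X, y=None):
        # Store the reference data, as a detector step would.
        self.X_ref_ = np.asarray(X)
        return self

    def transform(self, X):
        # Compare incoming data with the reference, then pass X through unchanged.
        X = np.asarray(X)
        self.last_score_ = float(
            np.abs(X.mean(axis=0) - self.X_ref_.mean(axis=0)).max()
        )
        return X


rng = np.random.default_rng(31)
X_ref = rng.normal(size=(100, 2))
y_ref = (X_ref[:, 0] > 0).astype(int)

toy_pipeline = Pipeline(steps=[("monitor", MeanShiftMonitor()),
                               ("model", LogisticRegression())])
toy_pipeline.fit(X_ref, y_ref)

X_new = rng.normal(loc=2.0, size=(50, 2))     # shifted data
y_new_pred = toy_pipeline.predict(X_new)      # calls monitor.transform, then model.predict
score = toy_pipeline["monitor"].last_score_
print(f"largest mean shift: {score:.2f}")
```

Because the step returns its input untouched, the downstream model sees exactly the same data it would see without the monitor.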

Calling the predict method returns the predictions for the test data and also makes the detector compare the reference data with the test data to determine whether drift is occurring.

In [29]:
y_pred = pipeline.predict(X=X_test)
p_value = pipeline["detector"].distance.p_value
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

p-value: 0.343
No data drift detected
Accuracy: 0.9719


As the results above show, no data drift was detected. To simulate data drift, noise is added to the test data:

In [30]:
X_test_noise = X_test + np.random.normal(loc=0, scale=X_test.std(axis=0)/2, size=X_test.shape)
y_pred = pipeline.predict(X=X_test_noise)
p_value = pipeline["detector"].distance.p_value
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

p-value: 0.0
Data drift detected
Accuracy: 0.9579


Data drift is now detected, and the model's accuracy has dropped from 0.9719 to 0.9579.
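To get a sense of how strong the simulated drift is: the noise is zero-mean with a standard deviation of half each feature's standard deviation, so for independent noise the variance of each feature grows by a factor of $1 + 0.5^2 = 1.25$ while its mean stays roughly the same. The sketch below checks this numerically on a single synthetic feature; the data and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(31)
feature = rng.normal(loc=10.0, scale=3.0, size=100_000)

# Independent zero-mean noise with half the feature's standard deviation.
noise = rng.normal(loc=0.0, scale=feature.std() / 2, size=feature.shape)
noisy = feature + noise

std_ratio = noisy.std() / feature.std()   # theory: sqrt(1 + 0.25)
mean_change = noisy.mean() - feature.mean()
print(f"std ratio: {std_ratio:.3f} (theory {np.sqrt(1.25):.3f}), "
      f"mean change: {mean_change:.3f}")
```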

Calling the predict method (which internally calls the detector's transform method) is computationally costly when MMD is used.
One way to reduce the time of each call to the predict method is to use only a sample of the data as reference. The drawback is that the detector can no longer be fitted through the fit method of the Pipeline object, so it has to be fitted outside the Pipeline and then added to it, as shown below:

In [31]:
pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                           ("model", LogisticRegression(random_state=31))])
pipeline.fit(X=X_train, y=y_train)

The detector is fitted with a sample of the training data.

In [32]:
num_ref_samples = 100
idx_samples = np.random.choice(X_train.shape[0], num_ref_samples, replace=False)
X_train_samples = X_train[idx_samples]
detector = MMD(num_permutations=1000, kernel=RBF(), random_state=31)
detector.fit(X=X_train_samples)
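A rough way to see why sampling helps: a kernel two-sample statistic needs kernel values for every pair of rows in the pooled reference and test data, so the work grows quadratically with the pooled size. The back-of-the-envelope sketch below counts those pairs for this notebook's split (284 training rows, 285 test rows). It assumes a full kernel matrix over the pooled sample; how the library actually organises the computation may differ, and `kernel_evaluations` is a hypothetical helper.

```python
def kernel_evaluations(n_ref, n_test):
    """Entries in a kernel matrix over the pooled reference and test rows."""
    return (n_ref + n_test) ** 2


n_test = 285                                  # rows in X_test (569 - 284)
full = kernel_evaluations(284, n_test)        # reference = all of X_train
sampled = kernel_evaluations(100, n_test)     # reference = 100 sampled rows
print(full, sampled, round(full / sampled, 2))  # prints: 323761 148225 2.18
```

Under this assumption, a 100-row reference sample roughly halves the number of kernel entries per call, at the price of a noisier picture of the reference distribution.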

Subsequently, the detector is added as the first step in the pipeline.

In [33]:
pipeline.steps.insert(0, ("detector", detector))
pipeline
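One caveat of this approach is worth illustrating: predict only calls transform on the inserted step, so the sampled reference is preserved, but calling fit on the pipeline again would refit that step on the full training data. The sketch below demonstrates this with `ReferenceRecorder`, a hypothetical pass-through class that only remembers how many rows it was fitted on; it stands in for the detector and is not part of any library.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ReferenceRecorder(TransformerMixin, BaseEstimator):
    """Hypothetical pass-through step that remembers its reference size."""

    def fit(self, X, y=None):
        self.n_ref_ = len(X)
        return self

    def transform(self, X):
        return X


rng = np.random.default_rng(31)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline(steps=[("model", LogisticRegression())])
toy_pipeline.fit(X_toy, y_toy)

# Fit the step outside the pipeline on a 50-row subset, then insert it.
recorder = ReferenceRecorder().fit(X_toy[:50])
toy_pipeline.steps.insert(0, ("recorder", recorder))

toy_pipeline.predict(X_toy)
rows_after_predict = toy_pipeline["recorder"].n_ref_   # predict does not refit

toy_pipeline.fit(X_toy, y_toy)
rows_after_refit = toy_pipeline["recorder"].n_ref_     # fit refits every step
print(rows_after_predict, rows_after_refit)
```

If the model needs retraining later, one option is to refit the model step on its own or to fit the detector on a fresh sample and insert it again afterwards.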

Finally, the predict method of the pipeline is used as usual.

In [34]:
y_pred = pipeline.predict(X=X_test)
p_value = pipeline["detector"].distance.p_value
print(f"p-value: {round(p_value, 4)}")
if p_value < alpha:
    print("Data drift detected")
else:
    print("No data drift detected")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

p-value: 0.153
No data drift detected
Accuracy: 0.9719
