In [7]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

from frouros.unsupervised.statistical_test import KSTest

## Unsupervised - Pipeline with univariate detector

Unsupervised univariate detectors can also be used in a scikit-learn Pipeline object. The following example uses a univariate detector on a synthetic dataset composed of 3 informative features and 2 non-informative features that carry no signal for the model.

In [2]:
np.random.seed(seed=31)

X, y = make_classification(n_samples=10000,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           scale=[10, 0.1, 5, 15, 1],
                           shuffle=False,  # False because shuffling would also permute the feature order (we want to keep it)
                           random_state=31,)

Randomly shuffle the data rows and split them into train (70%) and test (30%) sets.

In [3]:
idxs = np.arange(X.shape[0])
np.random.shuffle(idxs)
X, y = X[idxs], y[idxs]

idx_split = int(X.shape[0] * 0.7)
X_train, y_train, X_test, y_test = X[:idx_split], y[:idx_split], X[idx_split:], y[idx_split:]
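The manual shuffle and split above can also be expressed with scikit-learn's `train_test_split`. The sketch below uses a tiny made-up array (`X_demo`, `y_demo` are names introduced only for this illustration) and is not a drop-in replacement: it draws a different permutation than `np.random.shuffle`, so the outputs shown later in this notebook would change if it were used instead.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny illustrative dataset: 10 rows, 2 features (not the notebook's data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 2

# Shuffle the rows and keep 30% of them for testing
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=31
)

print(X_train_demo.shape, X_test_demo.shape)
```

Note that `train_test_split` returns the arrays in the order `X_train, X_test, y_train, y_test`, which differs from the assignment order used in the manual split.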

The significance level will be $\alpha = 0.01$.

In [4]:
alpha = 0.01
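Because one test is run per feature, the chance of at least one false alarm across all five features is higher than $\alpha$. As a rough sketch, assuming the five tests are independent (an assumption made here for illustration only), the family-wise rate and a Bonferroni-adjusted threshold can be computed as follows. The names `n_features`, `family_wise_rate` and `alpha_bonferroni` are introduced for this example and are not used elsewhere in the notebook.

```python
alpha = 0.01
n_features = 5  # one test per feature

# Probability of at least one false alarm if all tests are independent
family_wise_rate = 1 - (1 - alpha) ** n_features

# Bonferroni correction: per-test threshold that keeps the overall rate at or below alpha
alpha_bonferroni = alpha / n_features

print(f"Family-wise false alarm rate: {family_wise_rate:.4f}")
print(f"Bonferroni per-feature threshold: {alpha_bonferroni:.4f}")
```

The rest of this example keeps the uncorrected threshold $\alpha = 0.01$ for every feature.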

A detector can easily be added to a Pipeline object. The only requirement is that it must be placed before the final estimator.

In [5]:
pipeline = Pipeline(steps=[("detector", KSTest()),
                           ("model", DecisionTreeClassifier(random_state=31))])
pipeline.fit(X=X_train, y=y_train)
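To give an intuition for why the detector has to sit before the final estimator, the sketch below builds a hypothetical pass-through step with the same general shape: it stores the training data as reference in `fit`, runs one two-sample Kolmogorov-Smirnov test per feature in `transform`, and forwards the data unchanged. `PassthroughKSDetector` and its `p_values_` attribute are invented for this illustration using SciPy's `ks_2samp`; this is not the frouros implementation and may differ from it in important ways.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class PassthroughKSDetector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for a drift detector step (not the frouros class)."""

    def fit(self, X, y=None):
        # Keep the training data as the reference sample
        self.X_ref_ = np.asarray(X)
        return self

    def transform(self, X):
        X = np.asarray(X)
        # One two-sample KS test per column against the reference data
        self.p_values_ = [
            ks_2samp(self.X_ref_[:, j], X[:, j]).pvalue
            for j in range(X.shape[1])
        ]
        return X  # the data is forwarded unchanged to the next step


rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
X_ref = rng.normal(size=(500, 2))
y_ref = (X_ref[:, 0] > 0).astype(int)
X_new = rng.normal(size=(500, 2))
X_new[:, 1] += 3.0  # shift the second feature to provoke drift

demo_pipeline = Pipeline(steps=[("detector", PassthroughKSDetector()),
                                ("model", DecisionTreeClassifier(random_state=0))])
demo_pipeline.fit(X_ref, y_ref)
demo_pred = demo_pipeline.predict(X_new)
demo_p_values = demo_pipeline["detector"].p_values_
print(demo_p_values)
```

Intermediate Pipeline steps must expose `fit` and `transform`, while only the last step is asked to `predict`, which is why a step of this kind cannot be the final one.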

In addition to returning the predictions for the test data, calling the predict method makes the detector compare the reference (training) data with the test data to determine whether drift is occurring.

In [6]:
y_pred = pipeline.predict(X=X_test)
for i, feature_test in enumerate(pipeline["detector"].test, start=1):
    print(f"Feature {i}:")
    p_value = feature_test.p_value
    print(f"\tp-value: {round(p_value, 4)}")
    if p_value < alpha:
        print("\tData drift detected\n")
    else:
        print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

Feature 1:
	p-value: 0.1606
	No data drift detected

Feature 2:
	p-value: 0.5984
	No data drift detected

Feature 3:
	p-value: 0.0637
	No data drift detected

Feature 4:
	p-value: 0.2359
	No data drift detected

Feature 5:
	p-value: 0.8064
	No data drift detected

Accuracy: 0.9277


### Concept drift

To simulate how data drift can end up producing concept drift, we add noise to two of the three informative features, as shown below:

In [7]:
X_test_noise = X_test.copy()
# Add Gaussian noise to features 1 and 2 (both informative), scaled by each feature's standard deviation
noise = np.random.normal(loc=0,
                         scale=X_test_noise[:, :2].std(axis=0),
                         size=X_test_noise[:, :2].shape)
X_test_noise[:, :2] = X_test_noise[:, :2] + noise
y_pred = pipeline.predict(X=X_test_noise)
for i, feature_test in enumerate(pipeline["detector"].test, start=1):
    print(f"Feature {i}:")
    p_value = feature_test.p_value
    print(f"\tp-value: {round(p_value, 4)}")
    if p_value < alpha:
        print("\tData drift detected\n")
    else:
        print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

Feature 1:
	p-value: 0.0
	Data drift detected

Feature 2:
	p-value: 0.0
	Data drift detected

Feature 3:
	p-value: 0.0637
	No data drift detected

Feature 4:
	p-value: 0.2359
	No data drift detected

Feature 5:
	p-value: 0.8064
	No data drift detected

Accuracy: 0.6353


Data drift has been detected in two of the three informative features. This has led to a significant drop in accuracy (from 0.9277 to 0.6353), thus producing concept drift.
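The size of the perturbation helps explain why the test reacts so strongly. Adding independent zero-mean noise whose standard deviation equals that of the feature doubles the variance, so the standard deviation grows by a factor of roughly $\sqrt{2} \approx 1.41$. The sketch below checks this on a synthetic Gaussian feature; the feature, the seed and the variable names are assumptions made for this illustration, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Stand-in for a single feature (not taken from the dataset above)
x = rng.normal(loc=0.0, scale=10.0, size=10_000)

# Same recipe as in the notebook: zero-mean noise with the feature's own std
noise = rng.normal(loc=0.0, scale=x.std(), size=x.shape)
x_noisy = x + noise

ratio = x_noisy.std() / x.std()
print(f"std before: {x.std():.2f}, std after: {x_noisy.std():.2f}, ratio: {ratio:.2f}")
```

With thousands of test samples, a change in spread of this size is far larger than what sampling variation alone would produce.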

### Virtual drift

On the other hand, if we add noise to the non-informative features (which should not matter to the model), we expect to see data drift in these features, but the model's performance should not decrease significantly, meaning that virtual drift is occurring.

In [8]:
X_test_noise = X_test.copy()
# Add Gaussian noise to features 4 and 5 (both non-informative), scaled by each feature's standard deviation
noise = np.random.normal(loc=0,
                         scale=X_test_noise[:, 3:].std(axis=0),
                         size=X_test_noise[:, 3:].shape)
X_test_noise[:, 3:] = X_test_noise[:, 3:] + noise
y_pred = pipeline.predict(X=X_test_noise)
for i, feature_test in enumerate(pipeline["detector"].test, start=1):
    print(f"Feature {i}:")
    p_value = feature_test.p_value
    print(f"\tp-value: {round(p_value, 4)}")
    if p_value < alpha:
        print("\tData drift detected\n")
    else:
        print("\tNo data drift detected\n")
print(f"Accuracy: {round(accuracy_score(y_test, y_pred), 4)}")

Feature 1:
	p-value: 0.1606
	No data drift detected

Feature 2:
	p-value: 0.5984
	No data drift detected

Feature 3:
	p-value: 0.0637
	No data drift detected

Feature 4:
	p-value: 0.0
	Data drift detected

Feature 5:
	p-value: 0.0
	Data drift detected

Accuracy: 0.928


We can see that data drift has occurred in the two non-informative features while the model's performance remains essentially unchanged (accuracy 0.928 versus 0.9277), so this is virtual drift.
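One way to check the premise that features 4 and 5 carry little weight in the model is to look at the tree's `feature_importances_`. The sketch below regenerates the dataset with the same `make_classification` arguments and, for brevity, fits the tree on all rows rather than on the shuffled 70% split, so the values will not match the pipeline's model exactly. The names `importances`, `informative_share` and `noise_share` are introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Same generation parameters as in the notebook
X, y = make_classification(n_samples=10000,
                           n_features=5,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           scale=[10, 0.1, 5, 15, 1],
                           shuffle=False,
                           random_state=31)

# Simplification: fit on every row instead of the 70% training split
model = DecisionTreeClassifier(random_state=31).fit(X, y)
importances = model.feature_importances_

for i, importance in enumerate(importances, start=1):
    print(f"Feature {i}: {importance:.4f}")

informative_share = importances[:3].sum()  # features 1-3
noise_share = importances[3:].sum()        # features 4-5
print(f"Informative: {informative_share:.4f}, non-informative: {noise_share:.4f}")
```

A small share for features 4 and 5 is consistent with the model's accuracy staying almost the same when only those features drift.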