In [2]:
%load_ext autoreload
%autoreload 2

import sys
from pathlib import Path

cwd = Path.cwd()
ROOT_PATH = str(cwd.parent.parent.parent.parent)
sys.path.append(ROOT_PATH)
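from simpml.tabular.all import *
from simpml.vision.all import *
# Explicit import for the metric passed to OODCheck below, in case the star imports do not export it.
from sklearn.metrics import accuracy_score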
%matplotlib inline

# Data Drift Detection Checks Documentation

This document describes the data drift detection checks used to monitor data quality and consistency over time. Each check targets a different aspect of data drift, so that changes likely to affect model performance can be caught early.

## Out Of Domain Detection (`OODCheck`)

### What It Does
Detects potential data drift by training a `RandomForestClassifier` to distinguish new data from original data. If the classifier separates the two easily, the data distributions have likely shifted; if it does no better than chance, the two datasets look alike.

### How It Works
- Concatenates the new and original data, assigning labels to differentiate them.
- Splits the combined data into training and testing sets.
- Trains a RandomForestClassifier on the training set.
- Evaluates the classifier's performance on the test set.
- Uses the specified metric (e.g., accuracy) to assess whether the data distributions have drifted apart.
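
The steps above can be sketched with scikit-learn. This is a minimal illustration and not the `OODCheck` implementation itself: the function name `domain_classifier_score`, the 70/30 split, and the synthetic data are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def domain_classifier_score(original, new, metric=accuracy_score, seed=0):
    """Score how well a classifier separates original rows (0) from new rows (1)."""
    combined = pd.concat([original, new], ignore_index=True)
    origin = np.r_[np.zeros(len(original), dtype=int), np.ones(len(new), dtype=int)]
    # Hypothetical 70/30 split; the real check may use a different ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        combined, origin, test_size=0.3, stratify=origin, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train, y_train)
    return metric(y_test, clf.predict(X_test))


# Synthetic data: one sample from the same distribution, one shifted sample.
rng = np.random.default_rng(0)
original = pd.DataFrame(rng.normal(0, 1, size=(300, 3)), columns=list("abc"))
same = pd.DataFrame(rng.normal(0, 1, size=(300, 3)), columns=list("abc"))
shifted = pd.DataFrame(rng.normal(3, 1, size=(300, 3)), columns=list("abc"))

score_same = domain_classifier_score(original, same)
score_shifted = domain_classifier_score(original, shifted)
print(f"same distribution: {score_same:.2f}, shifted: {score_shifted:.2f}")
```

A score close to 0.5 means the classifier cannot tell the datasets apart, while a score above the chosen threshold (0.7 in the example run later in this notebook) would be read as drift.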

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data used for model training.
- `monitoring`: Any additional monitoring information (not used in this check).

### Output
- `TestResult`: An object containing the success status, message, threshold, and metric result. Indicates whether potential data drift was detected.

## KS Test for Feature Distribution (`KSCheck`)

### What It Does
Performs Kolmogorov-Smirnov (KS) tests on the distributions of numerical features between the new and original data to detect shifts.

### How It Works
- Iterates over numerical features in the data.
- Conducts a KS test for each feature to compare its distribution in the new and original datasets.
- Collects features with p-values lower than the specified threshold, indicating distributional differences.
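
A per-feature KS comparison along these lines can be sketched with `scipy.stats.ks_2samp`. The helper name `ks_drifted_features` and the synthetic columns are assumptions for illustration; the actual `KSCheck` may select features or handle missing values differently.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_drifted_features(original, new, threshold=0.05):
    """Return features whose KS p-value falls below the threshold, plus all p-values."""
    numeric_cols = original.select_dtypes(include="number").columns.intersection(new.columns)
    p_values = {}
    for col in numeric_cols:
        # Missing values are dropped here; that handling is an assumption of this sketch.
        result = ks_2samp(original[col].dropna(), new[col].dropna())
        p_values[col] = result.pvalue
    drifted = [col for col, p in p_values.items() if p < threshold]
    return drifted, p_values


rng = np.random.default_rng(0)
original = pd.DataFrame({"stable": rng.normal(0, 1, 500), "moved": rng.normal(0, 1, 500)})
new = pd.DataFrame({
    # A reshuffle of the same values, so the distribution is unchanged by construction.
    "stable": rng.permutation(original["stable"].to_numpy()),
    # A sample whose mean has moved by two standard deviations.
    "moved": rng.normal(2, 1, 500),
})

drifted, p_values = ks_drifted_features(original, new)
print(drifted, min(p_values.values()))
```

Reporting the minimum p-value, as the `TestResult` described below does, gives a single number to compare against the threshold.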

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data.
- `monitoring`: Any additional monitoring information (not used in this check).

### Output
- `TestResult`: Contains the success status, message, threshold, and minimum p-value found. Lists features with significant distribution changes.

## Mean Comparison Check (`MeanComparisonCheck`)

### What It Does
Compares the means of numerical features between the new and original data, identifying significant differences.

### How It Works
- Identifies numerical features present in both datasets.
- Computes and compares the mean of each feature in the new and original datasets.
- Notes features with mean differences exceeding the specified threshold.
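
As a rough sketch of this comparison in pandas: the text does not say whether the difference is absolute or relative, so the example below assumes a relative change with respect to the original mean. The helper name `mean_shifted_features` and the sample values are hypothetical.

```python
import pandas as pd


def mean_shifted_features(original, new, threshold=0.2):
    """Flag numerical features whose mean changed by more than the threshold."""
    shared = original.select_dtypes(include="number").columns.intersection(
        new.select_dtypes(include="number").columns
    )
    orig_mean = original[shared].mean()
    new_mean = new[shared].mean()
    # Assumption: relative change; the clip guards against division by zero.
    rel_diff = (new_mean - orig_mean).abs() / orig_mean.abs().clip(lower=1e-12)
    return rel_diff[rel_diff > threshold].index.tolist(), rel_diff


original = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0], "fare": [10.0, 20.0, 30.0], "name": ["a", "b", "c"]}
)
new = pd.DataFrame(
    {"age": [21.0, 31.0, 41.0], "fare": [20.0, 30.0, 40.0], "name": ["d", "e", "f"]}
)

flagged, rel_diff = mean_shifted_features(original, new)
print(flagged)  # → ['fare']
```

Here the mean of `age` moves from 30 to 31 (about 3%), which stays under the threshold, while the mean of `fare` moves from 20 to 30 (50%) and is flagged. The non-numerical `name` column is ignored.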

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data.
- `monitoring`: Any additional monitoring information (not used in this check).

### Output
- `TestResult`: Indicates whether significant mean differences were detected, listing affected features.

## Variance Comparison Check (`VarianceComparisonCheck`)

### What It Does
Compares the variances of numerical features between new and original data to identify significant changes.

### How It Works
- Computes the variance of each numerical feature in both datasets.
- Identifies features with variance differences exceeding the specified threshold.
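
The variance comparison can be illustrated the same way. As before, measuring the difference relative to the original variance is an assumption of this sketch, and `variance_shifted_features` is a hypothetical helper rather than the library's API.

```python
import pandas as pd


def variance_shifted_features(original, new, threshold=0.2):
    """Flag numerical features whose sample variance changed by more than the threshold."""
    shared = original.select_dtypes(include="number").columns.intersection(
        new.select_dtypes(include="number").columns
    )
    orig_var = original[shared].var()  # pandas default: sample variance (ddof=1)
    new_var = new[shared].var()
    # Assumption: relative change with respect to the original variance.
    rel_diff = (new_var - orig_var).abs() / orig_var.abs().clip(lower=1e-12)
    return rel_diff[rel_diff > threshold].index.tolist(), rel_diff


original = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [1.0, 2.0, 3.0, 4.0, 5.0]})
new = pd.DataFrame({
    "x": [2.0, 4.0, 6.0, 8.0, 10.0],       # spread doubled: variance 2.5 -> 10
    "y": [11.0, 12.0, 13.0, 14.0, 15.0],   # mean moved, spread unchanged
})

flagged, rel_diff = variance_shifted_features(original, new)
print(flagged)  # → ['x']
```

The `y` column shows why this check complements the mean comparison: its mean shifts by 10 while its variance is unchanged, so only a mean-based check would catch it.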

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data.
- `monitoring`: Any additional monitoring information (not used in this check).

### Output
- `TestResult`: Details whether significant variance differences were found, listing affected features.

## Missing Values Check (`MissingValuesCheck`)

### What It Does
Compares the proportion of missing values in features between the new and original data, identifying significant discrepancies.

### How It Works
- Calculates the missing value ratio for each feature in both datasets.
- Highlights features with missing value ratio differences exceeding the specified threshold.
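
A minimal pandas sketch of the ratio comparison, assuming the difference is the absolute gap between the two missing-value ratios; the helper name `missing_ratio_shift` and the sample frames are made up for the example.

```python
import numpy as np
import pandas as pd


def missing_ratio_shift(original, new, threshold=0.1):
    """Flag features whose share of missing values changed by more than the threshold."""
    shared = original.columns.intersection(new.columns)
    # isna().mean() gives the fraction of missing entries per column.
    diff = (new[shared].isna().mean() - original[shared].isna().mean()).abs()
    return diff[diff > threshold].index.tolist(), diff


original = pd.DataFrame(
    {"age": [22.0, np.nan, 35.0, 41.0], "cabin": ["C85", None, "E46", "B42"]}
)
new = pd.DataFrame(
    {"age": [30.0, np.nan, 28.0, 19.0], "cabin": [None, None, None, "D33"]}
)

flagged, diff = missing_ratio_shift(original, new)
print(flagged)  # → ['cabin']
```

In this example `age` is missing in 25% of rows in both frames, while `cabin` goes from 25% to 75% missing, a gap of 0.5 that exceeds the threshold.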

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data.
- `monitoring`: Any additional monitoring information (not used in this check).

### Output
- `TestResult`: Indicates whether significant differences in missing values ratios were detected, listing affected features.

## Prediction Score Distribution Check (`PredictionScoreDistributionCheck`)

### What It Does
Evaluates shifts in the distribution of a model's prediction scores between the new and original data.

### How It Works
- Uses the model's `predict_proba` method to obtain prediction scores for both datasets.
- Conducts a KS test to compare the score distributions.
- Assesses whether the distribution shift is significant based on the p-value.
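
These steps can be sketched with a scikit-learn classifier and `scipy.stats.ks_2samp`. The helper `prediction_score_shift`, the use of the positive-class column of `predict_proba`, and the logistic regression model are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression


def prediction_score_shift(model, original, new, threshold=0.05):
    """Compare positive-class score distributions; returns (success, p_value)."""
    orig_scores = model.predict_proba(original)[:, 1]
    new_scores = model.predict_proba(new)[:, 1]
    p_value = float(ks_2samp(orig_scores, new_scores).pvalue)
    # Success means no significant shift was found at the given threshold.
    return p_value >= threshold, p_value


rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(400, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# Shifting every feature upward pushes the scores towards the positive class.
X_shifted = X_train + 2.0

ok_same, p_same = prediction_score_shift(model, X_train, X_train)
ok_shifted, p_shifted = prediction_score_shift(model, X_train, X_shifted)
print(ok_same, ok_shifted)
```

Because only the model outputs are compared, this check can flag drift that matters to the model even when no single input feature has moved much.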

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data.
- `monitoring`: Includes the model to be used for prediction score comparison.

### Output
- `TestResult`: Details whether a significant shift in prediction score distribution was detected.

## Anomaly Detection Rate Check (`AnomalyDetectionRateCheck`)

### What It Does
Monitors changes in the rate of identified anomalies between new and original data, using a specified anomaly detection method.

### How It Works
- Applies the anomaly detection method to both datasets to identify anomalies.
- Calculates the rate of anomalies in both datasets.
- Compares the anomaly rates, assessing significant changes based on the specified percentage threshold.
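
The rate comparison can be sketched as follows. The anomaly detection method is left open in the description, so this example assumes scikit-learn's `IsolationForest` fitted on the original data; the helper `anomaly_rate_change`, the 5% contamination setting, and the synthetic data are likewise assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def anomaly_rate_change(original, new, threshold=0.1, seed=0):
    """Compare anomaly rates; returns (success, original_rate, new_rate)."""
    # Assumption: the detector is fitted on the original data only.
    detector = IsolationForest(contamination=0.05, random_state=seed).fit(original)
    # IsolationForest.predict returns -1 for anomalies and 1 for inliers.
    orig_rate = float(np.mean(detector.predict(original) == -1))
    new_rate = float(np.mean(detector.predict(new) == -1))
    change = abs(new_rate - orig_rate)
    return change <= threshold, orig_rate, new_rate


rng = np.random.default_rng(0)
original = rng.normal(0, 1, size=(500, 2))
# Half of the new rows come from a far-away cluster the detector never saw.
new = np.vstack([rng.normal(0, 1, size=(250, 2)), rng.normal(6, 1, size=(250, 2))])

ok, orig_rate, new_rate = anomaly_rate_change(original, new)
print(ok, round(orig_rate, 3), round(new_rate, 3))
```

In the notebook run further down, the inputs to `AnomalyDetectionRateCheck` are the predictions of the inference manager's safety filter rather than a detector fitted inside the check.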

### Input
- `new_data`: A pandas DataFrame of new data.
- `original_data`: A pandas DataFrame of original data.
- `monitoring`: Any additional monitoring information (not used in this check).

### Output
- `TestResult`: Indicates whether there was a significant change in the anomaly detection rate, detailing the findings.

---
Together, these checks help maintain data integrity and model performance over time by flagging potential data drift early.


In [2]:
data_manager = SupervisedTabularDataManager(data = DataSet.load_titanic_dataset(),
                                            target = 'Survived',
                                            prediction_type = PredictionType.BinaryClassification,
                                            splitter = RandomSplitter(split_sets = {Dataset.Train: 0.6, Dataset.Valid: 0.2, Dataset.Test: 0.2}, target = 'Survived'))

data_manager.build_pipeline(drop_cols = ['PassengerId'])

Sklearn Pipeline:
MatchVariablesBefore (MatchVariables(missing_values='ignore')) ->
SafeDropFeaturesBefore (SafeDropFeatures(features_to_drop=['PassengerId'])) ->
NanColumnDropper (NanColumnDropper()) ->
Infinity2Nan (Infinity2Nan()) ->
MinMaxScaler (MinMaxScalerWithColumnNames()) ->
HighCardinalityDropper (HighCardinalityDropper()) ->
AddMissingIndicator (AddMissingIndicator()) ->
NumericalImputer (MeanMedianImputer()) ->
SafeCategoricalImputer (SafeCategoricalTransformer(transformer_cls=<class 'feature_engine.imputation.categorical.CategoricalImputer'>)) ->
SafeOneHotEncoder (SafeCategoricalTransformer(transformer_cls=<class 'feature_engine.encoding.one_hot.OneHotEncoder'>)) ->
RemoveSpecialJSONCharacters (RemoveSpecialJSONCharacters()) ->
SafeDropFeaturesAfter (SafeDropFeatures(features_to_drop=['PassengerId'])) ->
MatchVariablesAfter (MatchVariables(missing_values='ignore'))

Target Pipeline:
LabelEncoder (DictLabelEncoder())

In [3]:
exp_mang = ExperimentManager(data_manager, optimize_metric = MetricName.AUC)
exp_mang.run_experiment(metrics_kwargs = {'pos_label': 1})

| | Experiment ID | Experiment Description | Model | Model Description | Data Version | Data Description | Model Params | Metric Params | Accuracy | AUC | Recall | Precision | Balanced Accuracy | F1 | Run Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20240407154400_00f1 | | Baseline Classification | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.505618 | 0.476471 | 0.352941 | 0.352941 | 0.476471 | 0.352941 | 0:00:00 |
| 1 | 20240407154400_00f1 | | Logistic Regression | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.825843 | 0.811364 | 0.75 | 0.784615 | 0.811364 | 0.766917 | 0:00:00 |
| 2 | 20240407154400_00f1 | | Support Vector Classifier | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.808989 | 0.792112 | 0.720588 | 0.765625 | 0.792112 | 0.742424 | 0:00:00 |
| 3 | 20240407154400_00f1 | | AdaBoost Classifier | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.825843 | 0.822594 | 0.808824 | 0.753425 | 0.822594 | 0.780142 | 0:00:00 |
| 4 | 20240407154400_00f1 | | Decision Tree | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.792135 | 0.781283 | 0.735294 | 0.724638 | 0.781283 | 0.729927 | 0:00:00 |
| 5 | 20240407154400_00f1 | | Random Forest | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.808989 | 0.789305 | 0.705882 | 0.774194 | 0.789305 | 0.738462 | 0:00:00 |
| 6 | 20240407154400_00f1 | | Gradient Boosting | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.831461 | 0.804679 | 0.691176 | 0.839286 | 0.804679 | 0.758065 | 0:00:00 |
| 7 | 20240407154400_00f1 | | XGBoost | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.808989 | 0.792112 | 0.720588 | 0.765625 | 0.792112 | 0.742424 | 0:00:00 |
| 8 | 20240407154400_00f1 | | LightGBM | Default settings | fd7897fb | | {'experiment_manager': Prediction Type: PredictionType.BinaryClassification, Metric: Metric: AUC. Description: , Random State: 0} | {'pos_label': 1} | 0.820225 | 0.801203 | 0.720588 | 0.790323 | 0.801203 | 0.753846 | 0:01:00 |


In [4]:
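# The model is selected by name. Note that AdaBoost has the highest AUC in the results
# table above (0.822594, versus 0.792112 for XGBoost), so "best" here is a naming choice.
best_model = exp_mang.get_model('XGBoost', exp_mang.get_current_experiment_id())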
interp = TabularInterpreterBinaryClassification(
    model = best_model,
    data_manager = data_manager,
    opt_metric = exp_mang.opt_metric,
    pos_class = {'pos_class' : 1}
)

In [18]:
X_ref, y_ref = data_manager.get_data(Dataset.Test)
X_raw_ref, y_raw_ref = data_manager.get_dataset_from_indices(Dataset.Test)
X_shap_ref = interp.shap_manager.get_shap_values(X_ref)
y_ref_predict_proba = best_model.model.predict_proba(X_ref)

X_new, y_new = data_manager.get_data(Dataset.Valid)
X_raw_new, y_raw_new = data_manager.get_dataset_from_indices(Dataset.Valid)
X_shap_new = interp.shap_manager.get_shap_values(X_new)
y_new_predict_proba = best_model.model.predict_proba(X_new)

# Safety-filter predictions on both datasets feed the anomaly detection rate check below.
inf_manager = SafeTabularInferenceManager(data_manager, best_model)
fails_sf_ref = inf_manager.safety_filter.predict(X_ref)
fails_sf_new = inf_manager.safety_filter.predict(X_new)

tabular_monitoring = TabularMonitoring()

prep_data_checks = [
    OODCheck(metric=accuracy_score, threshold=0.7),
    KSCheck(threshold=0.05)
]

original_data_checks = [
    MeanComparisonCheck(threshold=0.2),
    MissingValuesCheck(threshold=0.1),
    ColumnChangeCheck(threshold=0.1)
]

shap_data_checks = [
    OODCheck(metric=accuracy_score, threshold=0.7)
]

target_checks = [
    PredictionScoreDistributionCheck(threshold = 0.05)
]

safety_filter_checks = [
    AnomalyDetectionRateCheck(threshold = 0.1)
]

# Each entry maps a report section to (reference data, new data, checks to run).
# The 'Traget' key is kept as spelled because it appears in the recorded results below.
checks_config = {
    'Prep Data': (X_ref, X_new, prep_data_checks),
    'Raw Data': (X_raw_ref, X_raw_new, original_data_checks),
    'Shap Data': (X_shap_ref, X_shap_new, shap_data_checks),
    'Traget': (y_ref_predict_proba, y_new_predict_proba, target_checks),
    'Fails Safety Filter': (fails_sf_ref, fails_sf_new, safety_filter_checks)
}

# Running the checks
checks_res = tabular_monitoring.run_checks(checks_config)

| | Experiment ID | Experiment Description | Model | Model Description | Data Version | Data Description | Model Params | Metric Params | Accuracy | AUC | Recall | Precision | Balanced Accuracy | F1 | Run Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20240407133201_6d6d | | Feature Bagging | Default settings | d1f0dfa6 | | {'contamination': 5e-07, 'experiment_manager': Prediction Type: PredictionType.AnomalyDetection, Metric: Metric: F1. Description: , Random State: 0} | {} | 0.66 | 0.66 | 0.395 | 0.840426 | 0.66 | 0.537415 | 0:00:00 |
| 1 | 20240407133201_6d6d | | LODA | Default settings | d1f0dfa6 | | {'contamination': 5e-07, 'experiment_manager': Prediction Type: PredictionType.AnomalyDetection, Metric: Metric: F1. Description: , Random State: 0} | {} | 0.4775 | 0.4775 | 0.045 | 0.333333 | 0.4775 | 0.079295 | 0:00:00 |
| 2 | 20240407133201_6d6d | | Isolation Forest | Default settings | d1f0dfa6 | | {'contamination': 5e-07, 'experiment_manager': Prediction Type: PredictionType.AnomalyDetection, Metric: Metric: F1. Description: , Random State: 0} | {} | 0.5475 | 0.5475 | 0.165 | 0.702128 | 0.5475 | 0.267206 | 0:00:00 |


In [19]:
tabular_monitoring.export_report('report.json')

In [20]:
tabular_monitoring.get_df_results()

| | type | check | description | success | message | threshold | value |
|---|---|---|---|---|---|---|---|
| 0 | Prep Data | Out Of Domain Detection | Out Of Domain Detection | True | No significant data drift detected. | 0.7 | 0.416667 |
| 1 | Prep Data | KS Test for Feature Distribution | KS Test for Feature Distribution | True | No significant drift detected in feature distributions. | 0.05 | 0.407438 |
| 2 | Raw Data | Mean Comparison Check | Mean Comparison Check | True | No significant difference in means. | 0.2 | 0.070731 |
| 3 | Raw Data | Missing Values Check | Missing Values Check | True | No significant difference in missing values ratio. | 0.1 | 0.00232 |
| 4 | Raw Data | Column Change Check | Column Change Check | True | Column difference within acceptable threshold. | 0.1 | 0.0 |
| 5 | Shap Data | Out Of Domain Detection | Out Of Domain Detection | True | No significant data drift detected. | 0.7 | 0.472222 |
| 6 | Traget | Prediction Score Distribution Check | Prediction Score Distribution Check | True | No significant shift in prediction score distribution. | 0.05 | 0.829045 |
| 7 | Fails Safety Filter | Anomaly Detection Rate Check | Anomaly Detection Rate Check | True | No significant change in anomaly detection rate. | 0.1 | 0.010671 |
