## Model Monitoring and Data Drift Analysis

This section describes our data drift analysis with the Alibi-Detect library. We compare each feature in the training set against the same feature in the production dataset using a Kolmogorov-Smirnov (KS) test, after standardizing the numeric features and one-hot encoding the categorical ones. For every feature, the analysis reports whether drift was detected and the p-value of the test.
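
To make the test concrete, the sketch below computes the two-sample KS statistic for a single feature: the largest gap between the empirical CDFs of the reference and production samples. It is a minimal illustration written with the standard library only, not the routine Alibi-Detect uses internally. The function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, sample):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref_sorted = sorted(reference)
    smp_sorted = sorted(sample)
    n_ref, n_smp = len(ref_sorted), len(smp_sorted)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for value in ref_sorted + smp_sorted:
        cdf_ref = bisect_right(ref_sorted, value) / n_ref
        cdf_smp = bisect_right(smp_sorted, value) / n_smp
        max_gap = max(max_gap, abs(cdf_ref - cdf_smp))
    return max_gap


# Hypothetical values for one feature, for illustration only.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
same_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
shifted_feature = [11.0, 12.0, 13.0, 14.0, 15.0]

stat_same = ks_statistic(train_feature, same_feature)        # identical samples
stat_shifted = ks_statistic(train_feature, shifted_feature)  # no overlap at all
print(stat_same, stat_shifted)
```

A statistic near 0 means the two samples are distributed alike, and a statistic near 1 means they barely overlap. The drift detector turns this statistic into a p-value and compares it with the chosen significance level.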

In [9]:
# !pip install alibi-detect

In [1]:
# import necessary libraries
import pandas as pd
from alibi_detect.cd import KSDrift # for performing Kolmogorov-Smirnov Drift Detection on dataset features.
from sklearn.preprocessing import StandardScaler, OneHotEncoder # for numerical and categorical data preprocessing.
from alibi_detect.utils.data import create_outlier_batch # to generate sample batches for testing outlier detection in the data.
import numpy as np

In [5]:
# Load the training and production datasets from GitHub and store in dataframes
url_train = 'https://github.com/DeshKunal/MLOps_Project/raw/refs/heads/main/Datasets/Processed/credit_data_train.parquet'
url_prod = 'https://github.com/DeshKunal/MLOps_Project/raw/refs/heads/main/Datasets/Processed/credit_data_prod.parquet'

train_data = pd.read_parquet(url_train)
prod_data = pd.read_parquet(url_prod)

In [7]:
# Define numerical and categorical columns
numerical_cols = train_data.select_dtypes(include=['int64', 'float64']).columns.drop('Credit_Risk')
categorical_cols = train_data.select_dtypes(include=['object', 'category']).columns

# Preprocessing for numerical features
scaler = StandardScaler()
train_numerical = scaler.fit_transform(train_data[numerical_cols])
prod_numerical = scaler.transform(prod_data[numerical_cols])

# Preprocessing for categorical features
encoder = OneHotEncoder(handle_unknown='ignore')
train_categorical = encoder.fit_transform(train_data[categorical_cols])
prod_categorical = encoder.transform(prod_data[categorical_cols])

# Combine numerical and categorical data back into arrays
train_preprocessed = np.hstack((train_numerical, train_categorical.toarray()))
prod_preprocessed = np.hstack((prod_numerical, prod_categorical.toarray()))

In [15]:
# Initialize the drift detector on the preprocessed training data
ks_drift = KSDrift(p_val=0.05, x_ref=train_preprocessed)

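# Check for drift on the production data, with one decision per feature.
# The default drift_type='batch' returns a single flag for the whole batch,
# which would be repeated on every row of the per-feature results table.
drift_results = ks_drift.predict(prod_preprocessed, drift_type='feature')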
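
`KSDrift.predict` can report drift per feature (`drift_type='feature'`) or for the batch as a whole (`drift_type='batch'`, the default), and the two can disagree. The sketch below mirrors the two decision rules as we understand them: the feature-level rule compares each p-value with the significance level, while the batch-level rule with the default Bonferroni correction divides the significance level by the number of features. The helper names and the p-values are hypothetical and chosen only to show the difference.

```python
def feature_level_drift(p_values, alpha=0.05):
    """One flag per feature: 1 if that feature's p-value is below alpha."""
    return [int(p < alpha) for p in p_values]


def batch_level_drift(p_values, alpha=0.05):
    """Single flag for the batch, using a Bonferroni-corrected threshold."""
    threshold = alpha / len(p_values)
    is_drift = int(min(p_values) < threshold)
    return is_drift, threshold


# Hypothetical per-feature p-values, for illustration only.
p_values = [0.60, 0.03, 0.45, 0.90]

feature_flags = feature_level_drift(p_values)
batch_flag, threshold = batch_level_drift(p_values)
print(feature_flags, batch_flag, threshold)
```

Here the second feature is flagged on its own, yet the batch as a whole is not, because 0.03 is above the corrected threshold of 0.0125. With many features tested at once, the uncorrected feature-level flags are more likely to include false alarms, so they are best read as pointers for investigation.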

In [31]:
def display_drift_results(drift_results):
    """
    Construct a DataFrame containing the results of the drift detection analysis.

    This function aggregates the drift detection results into a structured format,
    providing a clear summary of which features have exhibited statistical signs of drift,
    based on the specified tests, and includes the p-values for these tests.

    Parameters:
    - drift_results (dict): A dictionary containing the results from the drift detection algorithm.
    
    Returns:
    - pd.DataFrame: A DataFrame where each row corresponds to a feature. The columns include:
        - 'Feature': The name of the feature.
        - 'Drift Detected': 0 - No drift detected, 1 - drift detected
        - 'p-value': The p-value from the drift detection test, indicating the significance of the drift detection.
        
    """
    results_df = pd.DataFrame({
        'Feature': np.concatenate([numerical_cols, encoder.get_feature_names_out(categorical_cols)]),
        'Drift Detected': drift_results['data']['is_drift'],
        'p-value': drift_results['data']['p_val']
    })
    return results_df
    
display_df = display_drift_results(drift_results)

# Display the drift results in a tabular format
display_df

| | Feature | Drift Detected | p-value |
|---|---|---|---|
| 0 | Duration_Months | 0 | 0.102750 |
| 1 | Credit_Amount | 0 | 0.137514 |
| 2 | Installment_Rate | 0 | 0.929149 |
| 3 | Present_Residence_Since | 0 | 0.998722 |
| 4 | Age_Years | 0 | 0.465057 |
| ... | ... | ... | ... |
| 56 | Job_A174 | 0 | 0.989933 |
| 57 | Telephone_A191 | 0 | 0.434555 |
| 58 | Telephone_A192 | 0 | 0.434555 |
| 59 | Foreign_Worker_A201 | 0 | 1.000000 |
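
Because the notebook truncates the 60-row output, it can help to filter the results instead of scanning them. The sketch below shows one way to list only the flagged features and the features closest to the threshold. To stay self-contained it uses a small stand-in DataFrame built from the first three rows shown above. In the notebook, the same two expressions would be applied to `display_df` directly.

```python
import pandas as pd

# Stand-in for display_df, built from the first three rows of the output above.
display_df = pd.DataFrame({
    'Feature': ['Duration_Months', 'Credit_Amount', 'Installment_Rate'],
    'Drift Detected': [0, 0, 0],
    'p-value': [0.102750, 0.137514, 0.929149],
})

# Features flagged as drifted, most significant first.
drifted = display_df[display_df['Drift Detected'] == 1].sort_values('p-value')

# Features with the lowest p-values, whether flagged or not.
lowest_p = display_df.nsmallest(2, 'p-value')

print(drifted)
print(lowest_p)
```

An empty `drifted` frame means no feature fell below the significance level. The `lowest_p` view still shows which features are closest to it and may be worth watching in later production batches.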
