In [5]:
import os
project_dir = r'C:\Users\india\Desktop\Jio_Institute\MLOps\Project\Nagendra\MLOPS'
os.chdir(project_dir)

In [7]:
ls

 Volume in drive C is Windows
 Volume Serial Number is E054-376B

 Directory of C:\Users\india\Desktop\Jio_Institute\MLOps\Project\Nagendra\MLOPS

17-03-2025  00:43    <DIR>          .
09-03-2025  18:17    <DIR>          ..
03-03-2025  17:27             3,586 .gitignore
03-03-2025  17:46    <DIR>          .ipynb_checkpoints
03-03-2025  16:38    <DIR>          Dataset
17-03-2025  00:42         2,218,971 Metro_Interstate_Traffic_Volume_Profile.html
17-03-2025  01:07    <DIR>          mlruns
17-03-2025  02:06    <DIR>          Notebooks
03-03-2025  17:27                 7 README.md
               3 File(s)      2,222,564 bytes
               6 Dir(s)  134,890,029,056 bytes free


## Model Monitoring and Data Drift Analysis
In this section, we use the alibi-detect library to check whether the production data has drifted significantly from the training data. We focus on the numeric features: temperature (`temp`), rainfall (`rain_1h`), snowfall (`snow_1h`), and cloud coverage (`clouds_all`). The analysis reports three metrics:

- Drift Detected: A flag indicating whether drift is present in the production batch as a whole.
- P-Value: The p-value of the drift test for each feature. A small value means the observed difference would be unlikely if there were no drift.
- Distance: The test statistic for each feature, measuring how far apart the training and production distributions are.

Together, these metrics show whether the production data's distribution has changed relative to the training set, which is critical for monitoring model performance over time.
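To make the p-value and distance concrete, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test on a single numeric column. As far as the library documentation describes, this is the kind of per-feature test `TabularDrift` applies to numeric features by default. The data below is synthetic and the names `reference`, `shifted`, and `same` are made up for illustration; none of it comes from the traffic dataset.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for one numeric feature (hypothetical data).
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=2000)  # plays the role of training data
shifted = rng.normal(loc=0.5, scale=1.0, size=2000)    # production data with a mean shift
same = rng.normal(loc=0.0, scale=1.0, size=2000)       # production data without a shift

# Compare each "production" sample against the reference sample.
results = {}
for name, sample in [("shifted", shifted), ("same", same)]:
    res = ks_2samp(reference, sample)
    results[name] = res
    # statistic = largest gap between the two empirical CDFs (the "distance")
    print(f"{name}: distance={res.statistic:.4f}, p-value={res.pvalue:.3g}")
```

The shifted sample should give a much larger distance and a far smaller p-value than the unshifted one.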

In [11]:
import pandas as pd
from alibi_detect.cd import TabularDrift
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

def detect_numeric_drift(train_file: str, prod_file: str, numeric_features: list, p_value_threshold: float = 0.05) -> pd.DataFrame:
    """
    Detects data drift between training and production datasets for numeric features using Alibi Detect's TabularDrift.
    
    Parameters:
        train_file (str): File path to the training dataset (Parquet format).
        prod_file (str): File path to the production dataset (Parquet format).
        numeric_features (list): List of numeric feature column names to be analyzed.
        p_value_threshold (float): Significance threshold for drift detection (default is 0.05).
        
    Returns:
        pd.DataFrame: A DataFrame summarizing drift detection metrics including:
                      - 'Drift Detected': Flag indicating if drift is detected (True/False)
                      - 'P-Value': Statistical significance of the drift test for the features
                      - 'Distance': A measure of the distance between the distributions
    """
    # Load the training and production datasets from Parquet files.
    train_df = pd.read_parquet(train_file)
    prod_df = pd.read_parquet(prod_file)
    
    # Drop the target column 'traffic_volume' to focus on the features.
    X_train = train_df.drop(columns=['traffic_volume'])
    X_prod = prod_df.drop(columns=['traffic_volume'])
    
    # Extract the numeric features as numpy arrays.
    X_train_numeric = X_train[numeric_features].values
    X_prod_numeric = X_prod[numeric_features].values
    
    # Initialize the TabularDrift detector with the training data as the reference.
    cd = TabularDrift(X_train_numeric, p_val=p_value_threshold)
    
    # Run the drift detector on the production data.
    preds = cd.predict(X_prod_numeric)
    
    # Extract drift detection results.
    drift_detected = preds['data']['is_drift']
    p_value = preds['data']['p_val']
    distance = preds['data']['distance']
    
    # Organize the metrics into a DataFrame.
    results = pd.DataFrame({
        'Metric': ['Drift Detected', 'P-Value', 'Distance'],
        'Value': [drift_detected, p_value, distance]
    })
    
    return results

# Define the list of numeric features as per your dataset.
numeric_features = ["temp", "rain_1h", "snow_1h", "clouds_all"]

# Example call:
numeric_results = detect_numeric_drift(
    train_file='Dataset/Parquet/Metro_Interstate_Traffic_Volume_train.parquet',
    prod_file='Dataset/Parquet/Metro_Interstate_Traffic_Volume_prod.parquet',
    numeric_features=numeric_features
)

print("Numeric Drift Detection Results:")
print(numeric_results.to_string())


Numeric Drift Detection Results:
           Metric                                                 Value
0  Drift Detected                                                     1
1         P-Value           [7.4e-44, 0.00014044569, 1.0, 3.451241e-34]
2        Distance  [0.083073415, 0.025691379, 0.0021782727, 0.07325293]


The numeric drift analysis indicates an overall shift in data distribution between the training and production datasets (Drift Detected = 1). The p-values and distances are listed in the same order as `numeric_features`:

Feature 1 (temp):
P-value: 7.4e-44
Distance: 0.083
This feature shows extremely significant drift.  

Feature 2 (rain_1h):
P-value: 0.00014
Distance: 0.026
Drift is significant here as well.  

Feature 3 (snow_1h):
P-value: 1.0
Distance: 0.0022
No drift is detected for this feature.  

Feature 4 (clouds_all):
P-value: 3.45e-34
Distance: 0.073
This feature also exhibits significant drift.  

In summary, three of the four numeric features (temp, rain_1h, and clouds_all) show significant distributional changes in production compared to training, while snow_1h does not. These changes may affect model performance and could call for model retraining or further data preprocessing.
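The printed arrays are easier to read when each value is paired with its feature name. The sketch below rebuilds a per-feature table from the values printed above. It rests on two assumptions: that the array order follows `numeric_features`, and that the per-feature threshold is the Bonferroni-corrected `0.05 / 4`, which is my understanding of the detector's default correction and is not confirmed by the output itself.

```python
import pandas as pd

# Values copied from the printed drift results above.
numeric_features = ["temp", "rain_1h", "snow_1h", "clouds_all"]
p_values = [7.4e-44, 0.00014044569, 1.0, 3.451241e-34]
distances = [0.083073415, 0.025691379, 0.0021782727, 0.07325293]

# Assumed Bonferroni correction: divide the significance level by the number of features.
alpha = 0.05
threshold = alpha / len(numeric_features)

per_feature = pd.DataFrame({
    "Feature": numeric_features,
    "P-Value": p_values,
    "Distance": distances,
})
# Flag a feature as drifted when its p-value falls below the corrected threshold.
per_feature["Drift Detected"] = (per_feature["P-Value"] < threshold).astype(int)

print(per_feature.to_string(index=False))
```

Under these assumptions, the same three features are flagged whether the threshold is 0.05 or the corrected 0.0125.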

In [19]:
import pandas as pd
from scipy.stats import chi2_contingency
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

def detect_categorical_drift(train_file: str, prod_file: str, categorical_features: list, alpha: float = 0.05) -> pd.DataFrame:
    """
    Detects drift in categorical features using the chi-square test.
    
    For each categorical feature, this function computes the frequency distributions in both the training 
    and production datasets, aligns them by their categories, and then applies the chi-square test to determine 
    if the distributions differ significantly.
    
    Parameters:
        train_file (str): File path to the training dataset (Parquet format).
        prod_file (str): File path to the production dataset (Parquet format).
        categorical_features (list): List of categorical feature column names to be analyzed.
        alpha (float): Significance level for the chi-square test (default is 0.05).
        
    Returns:
        pd.DataFrame: A DataFrame summarizing the drift detection results for each feature, including:
                      - Feature: Name of the categorical feature.
                      - Chi2 Statistic: The chi-square test statistic.
                      - p-value: The p-value from the chi-square test.
                      - Drift Detected: 1 if drift is detected (p < alpha), otherwise 0.
    """
    # Load the training and production datasets from Parquet files.
    train_df = pd.read_parquet(train_file)
    prod_df = pd.read_parquet(prod_file)
    
    # Drop the target column 'traffic_volume' from both datasets.
    X_train = train_df.drop(columns=['traffic_volume'])
    X_prod = prod_df.drop(columns=['traffic_volume'])
    
    results = []
    
    # Iterate over each categorical feature for drift detection.
    for col in categorical_features:
        # Compute frequency counts for each category in training and production data.
        train_counts = X_train[col].value_counts().sort_index()
        prod_counts = X_prod[col].value_counts().sort_index()
        
        # Determine the union of categories present in both datasets.
        all_categories = sorted(set(train_counts.index) | set(prod_counts.index))
        
        # Reindex counts to include all categories (fill missing with 0).
        train_counts = train_counts.reindex(all_categories, fill_value=0)
        prod_counts = prod_counts.reindex(all_categories, fill_value=0)
        
        # Build a table with categories as rows and datasets ('train', 'prod') as columns.
        contingency_table = pd.DataFrame({
            'train': train_counts,
            'prod': prod_counts
        })
        
        # Perform chi-square test on the transposed table so that each row represents a dataset.
        chi2, p, dof, expected = chi2_contingency(contingency_table.T)
        
        # Determine drift flag based on the p-value.
        drift_flag = 1 if p < alpha else 0
        
        results.append({
            'Feature': col,
            'Chi2 Statistic': chi2,
            'p-value': p,
            'Drift Detected': drift_flag
        })
    
    return pd.DataFrame(results)

# Define the list of categorical features as per your dataset.
categorical_features = ['holiday', 'weather_main', 'weather_description']

# Call the function with your file paths (adjust these paths as needed).
categorical_results = detect_categorical_drift(
    train_file='Dataset/Parquet/Metro_Interstate_Traffic_Volume_train.parquet',
    prod_file='Dataset/Parquet/Metro_Interstate_Traffic_Volume_prod.parquet',
    categorical_features=categorical_features
)

# Print the drift detection results in a tabular format.
print("Categorical Drift Detection Results:")
print(categorical_results.to_string(index=False))


Categorical Drift Detection Results:
            Feature  Chi2 Statistic      p-value  Drift Detected
            holiday        7.033957 7.222354e-01               0
       weather_main      427.780505 1.141185e-85               1
weather_description     1664.129109 0.000000e+00               1


Summary of Categorical Drift Detection Results:

holiday:  
Chi² Statistic: 7.034  
p-value: 0.722  
Drift Detected: 0  
The test finds no statistically significant difference between the training and production distributions of the "holiday" feature, so no drift is flagged.

weather_main:  
Chi² Statistic: 427.781  
p-value: 1.14e-85   
Drift Detected: 1  
The extremely low p-value shows a highly significant difference between the training and production distributions for "weather_main," signaling that drift has occurred.     

weather_description:  
Chi² Statistic: 1664.129  
p-value: reported as 0.0 (too small to represent in floating point)  
Drift Detected: 1  
Similar to "weather_main," the near-zero p-value for "weather_description" confirms significant drift in its distribution.  

In short, while the "holiday" feature remains consistent between training and production, both "weather_main" and "weather_description" show substantial distributional changes, indicating drift.
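The category-alignment step matters when a category appears in only one dataset. The toy example below uses made-up counts and category labels, not the traffic data, to illustrate the same approach: it builds the contingency table over the union of categories, fills missing counts with zero, and runs `chi2_contingency` on it.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical category samples; "Fog" appears only in the production sample.
train = pd.Series(["Clear"] * 60 + ["Clouds"] * 30 + ["Rain"] * 10)
prod = pd.Series(["Clear"] * 30 + ["Clouds"] * 30 + ["Rain"] * 30 + ["Fog"] * 10)

# Align counts on the union of categories; a category missing from one dataset gets 0.
table = (
    pd.concat({"train": train.value_counts(), "prod": prod.value_counts()}, axis=1)
    .fillna(0)
    .astype(int)
)

# Transpose so each row is a dataset and each column is a category.
chi2, p, dof, expected = chi2_contingency(table.T)

print(table)
print(f"chi2={chi2:.3f}, dof={dof}, p-value={p:.3g}")
```

With four categories and two datasets, the test has three degrees of freedom, and the zero count for "Fog" in the training row is handled without error because its expected count is still positive.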