### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Feature Correlation Drift

**Steps**:
1. Compute the correlation matrix of features in your training dataset.
2. Compute the correlation matrix of the same features in your production data.
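3. Compare the two matrices element-wise, for example by taking the absolute difference for each feature pair, and flag pairs whose correlation changed by more than a chosen threshold. Repeat the comparison on fresh production data to track changes over time.
4. Investigate the flagged pairs: a large shift in correlation may indicate a problem in the data collection process or a violated model assumption.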
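
Whether a change in correlation is more than sampling noise depends on how many rows each dataset has. One possible check, not part of the procedure above, is a Fisher z-test on a single feature pair. The sketch below assumes the two samples are independent and the features are roughly bivariate normal; the function name `correlation_change_pvalue` is hypothetical.

```python
import math

def correlation_change_pvalue(r_train, n_train, r_prod, n_prod):
    """Two-sided p-value for H0: both samples share the same correlation.

    Assumption: independent samples, roughly bivariate-normal features.
    """
    # Clip so atanh stays finite if a correlation is exactly +1 or -1.
    eps = 1e-12
    z_train = math.atanh(max(min(r_train, 1 - eps), -1 + eps))
    z_prod = math.atanh(max(min(r_prod, 1 - eps), -1 + eps))
    # Standard error of the difference of two Fisher-transformed correlations.
    se = math.sqrt(1.0 / (n_train - 3) + 1.0 / (n_prod - 3))
    z_stat = (z_train - z_prod) / se
    # Two-sided tail probability under the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return z_stat, p_value

# The same shift (0.03 -> 0.10) evaluated at two different sample sizes.
print(correlation_change_pvalue(0.03, 1000, 0.10, 1000))
print(correlation_change_pvalue(0.03, 100, 0.10, 100))
```

A small p-value suggests the shift is unlikely to be noise, but with large production volumes even tiny shifts become statistically significant, so it is usually worth pairing such a test with a minimum effect size like the absolute-difference threshold.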

In [2]:
import pandas as pd
import numpy as np

def detect_correlation_drift(train_df, prod_df, threshold=0.2):
    """Return feature pairs whose correlation changed by more than `threshold`."""
    # Steps 1-2: correlation matrices for training and production data
    train_corr = train_df.corr(numeric_only=True)
    prod_corr = prod_df.corr(numeric_only=True)

    # Keep only features present in both matrices, in a stable order
    common_cols = [col for col in train_corr.columns if col in prod_corr.columns]
    train_corr = train_corr.loc[common_cols, common_cols]
    prod_corr = prod_corr.loc[common_cols, common_cols]

    # Step 3: drift is the absolute difference between the two matrices
    drift_matrix = (train_corr - prod_corr).abs()

    # Step 4: collect drifted pairs. The matrix is symmetric, so visit each
    # unordered pair once instead of reporting both (a, b) and (b, a).
    drifted_pairs = []
    for i, col1 in enumerate(common_cols):
        for col2 in common_cols[i + 1:]:
            if drift_matrix.loc[col1, col2] > threshold:
                drifted_pairs.append({
                    "Feature_1": col1,
                    "Feature_2": col2,
                    "Train_Corr": train_corr.loc[col1, col2],
                    "Prod_Corr": prod_corr.loc[col1, col2],
                    "Drift": drift_matrix.loc[col1, col2]
                })

    # Explicit columns keep sort_values working when no pair is flagged
    columns = ["Feature_1", "Feature_2", "Train_Corr", "Prod_Corr", "Drift"]
    result_df = pd.DataFrame(drifted_pairs, columns=columns)

    return result_df.sort_values("Drift", ascending=False).reset_index(drop=True)

# ---------------------- Example with Synthetic Data ----------------------

# Simulated training dataset: three independent, uncorrelated features
np.random.seed(42)
train_data = pd.DataFrame({
    'feature_a': np.random.rand(1000),
    'feature_b': np.random.rand(1000) * 0.5,
    'feature_c': np.random.rand(1000) * 2
})

# Simulated production dataset with altered correlations
prod_data = train_data.copy()
prod_data['feature_b'] = prod_data['feature_a'] * 0.9 + np.random.rand(1000) * 0.1  # Introduce positive correlation with feature_a
prod_data['feature_c'] = prod_data['feature_a'] * -0.8 + np.random.rand(1000) * 0.2  # Introduce negative correlation with feature_a

# Run correlation drift detection
drift_results = detect_correlation_drift(train_data, prod_data, threshold=0.3)

print("🔍 Detected Feature Correlation Drift:\n")
print(drift_results if not drift_results.empty else "✅ No significant correlation drift detected.")

🔍 Detected Feature Correlation Drift:

   Feature_1  Feature_2  Train_Corr  Prod_Corr     Drift
0  feature_b  feature_c    0.027262  -0.966132  0.993395
1  feature_a  feature_c    0.014518  -0.970732  0.985250
2  feature_a  feature_b    0.029310   0.994083  0.964774
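
The example above compares training data against a single production snapshot. To watch correlations over time, one option is to repeat the comparison on consecutive production windows and track a summary number per window. The following is a minimal sketch under stated assumptions: the function name `correlation_drift_by_window`, the row-count windows, the choice of the maximum absolute change as the summary, and the synthetic `x`/`y` data are all illustrative, and at least two shared numeric features are assumed.

```python
import numpy as np
import pandas as pd

def correlation_drift_by_window(train_df, prod_df, window_size):
    """Max absolute correlation change vs. training, per production window."""
    train_corr = train_df.corr(numeric_only=True)
    prod_numeric = prod_df.select_dtypes(include="number").columns
    cols = [c for c in train_corr.columns if c in prod_numeric]
    train_corr = train_corr.loc[cols, cols]
    # Boolean mask that selects each unordered feature pair exactly once.
    upper = np.triu(np.ones((len(cols), len(cols)), dtype=bool), k=1)

    rows = []
    # Non-overlapping windows of `window_size` rows; a trailing partial
    # window is skipped.
    for start in range(0, len(prod_df) - window_size + 1, window_size):
        window = prod_df.iloc[start:start + window_size]
        diff = (window[cols].corr() - train_corr).abs().to_numpy()
        rows.append({"window_start": start,
                     "max_drift": float(np.nanmax(diff[upper]))})
    return pd.DataFrame(rows, columns=["window_start", "max_drift"])

# Hypothetical data: production starts out like training, then from row 500
# onward `y` begins to track `x`.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.random(1000), "y": rng.random(1000)})
prod = pd.DataFrame({"x": rng.random(1000), "y": rng.random(1000)})
prod.loc[500:, "y"] = prod.loc[500:, "x"] * 0.9 + rng.random(500) * 0.1

print(correlation_drift_by_window(train, prod, window_size=250))
```

Smaller windows react faster but give noisier correlation estimates, so the window size and the alert threshold are best tuned together.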
