### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Feature Correlation Drift

**Steps**:
1. Compute the correlation matrix of features in your training dataset.
2. Compute the correlation matrix of the same features in your production data.
3. Compare the two matrices, for example by taking element-wise absolute differences, to identify feature pairs whose correlation has changed significantly.
4. Investigate any significant changes, as they may indicate problems in the data collection process or violated model assumptions.
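
One way to make "significant" in step 3 concrete is a Fisher z-test on a pair of correlation coefficients. The sketch below is an illustration and not part of the original exercise. The helper name `fisher_z_test` is hypothetical, and the test assumes the two samples are independent and roughly bivariate normal.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Two-sided test of H0: both samples share the same population correlation.

    r1, r2 are Pearson correlations; n1, n2 are the sample sizes (each > 3).
    Returns the z statistic and its two-sided p-value.
    """
    # Clip so that atanh stays finite when |r| is exactly 1
    clip = lambda r: max(min(r, 0.999999), -0.999999)
    # Fisher z-transform of each correlation
    z1, z2 = math.atanh(clip(r1)), math.atanh(clip(r2))
    # Standard error of the difference between the transformed values
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Example call: a strong training correlation versus none in production
z_stat, p_value = fisher_z_test(0.9, 100, 0.0, 100)
```

A small p-value suggests the change in correlation is unlikely to be sampling noise. When many feature pairs are tested at once, a multiple-comparison correction is advisable.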

In [1]:
import pandas as pd
import numpy as np

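# Simulated training data (input to Step 1).
# No random seed is set, so exact values vary between runs.
train_data = pd.DataFrame({
    'feature1': np.random.normal(0, 1, 100),
    'feature2': np.random.normal(0, 1, 100),  # placeholder, overwritten below
    'feature3': np.random.normal(0, 1, 100)
})
# Make feature2 strongly correlated with feature1 in the training data
train_data['feature2'] = train_data['feature1'] * 0.8 + np.random.normal(0, 0.2, 100)

# Simulated production data (input to Step 2): all three features are drawn
# independently, so the feature1-feature2 correlation has drifted away
prod_data = pd.DataFrame({
    'feature1': np.random.normal(0, 1, 100),
    'feature2': np.random.normal(0, 1, 100),
    'feature3': np.random.normal(0, 1, 100)
})

# Steps 1 and 2: compute the correlation matrices (Pearson by default)
train_corr = train_data.corr()
prod_corr = prod_data.corr()

# Step 3: compare the matrices via element-wise absolute differences
diff_corr = (train_corr - prod_corr).abs()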

print("Absolute differences in correlation matrices:")
print(diff_corr)


Absolute differences in correlation matrices:
          feature1  feature2  feature3
feature1  0.000000  0.914982  0.109939
feature2  0.914982  0.000000  0.049445
feature3  0.109939  0.049445  0.000000
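
In the output above, the feature1/feature2 entry (about 0.91) stands out, while the other differences are small enough to be sampling noise. For step 4, it can help to list only the pairs whose difference exceeds a cutoff. The following is a minimal sketch: the helper `flag_drifted_pairs` and the threshold of 0.3 are illustrative assumptions, not values prescribed by the exercise, and the demo data is freshly simulated with a fixed seed.

```python
import numpy as np
import pandas as pd

def flag_drifted_pairs(train_corr, prod_corr, threshold=0.3):
    """Return feature pairs whose absolute correlation change exceeds threshold."""
    diff = (train_corr - prod_corr).abs()
    # Keep only the strict upper triangle so each pair is reported once
    mask = np.triu(np.ones(diff.shape, dtype=bool), k=1)
    pairs = diff.where(mask).stack()
    # NaN entries (masked cells) never pass the comparison, so they drop out
    return pairs[pairs > threshold].sort_values(ascending=False)

# Demo on simulated data (hypothetical, seeded for repeatability)
rng = np.random.default_rng(0)
n = 1000
f1 = rng.normal(0, 1, n)
train_demo = pd.DataFrame({
    'feature1': f1,
    'feature2': f1 * 0.8 + rng.normal(0, 0.2, n),  # correlated with feature1
    'feature3': rng.normal(0, 1, n),
})
prod_demo = pd.DataFrame({
    'feature1': rng.normal(0, 1, n),
    'feature2': rng.normal(0, 1, n),  # independent: the correlation is gone
    'feature3': rng.normal(0, 1, n),
})

flagged = flag_drifted_pairs(train_demo.corr(), prod_demo.corr())
print(flagged)
```

A fixed threshold is a simple heuristic. The appropriate cutoff depends on sample size and on how sensitive the model is to the affected features.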
