### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Feature Correlation Drift

**Steps**:
1. Compute the correlation matrix of features in your training dataset.
2. Compute the correlation matrix of the same features in your production data.
3. Compare the two matrices element-wise and flag feature pairs whose correlation changed by more than a chosen threshold.
4. Investigate the flagged pairs, since a large shift in correlation may indicate a problem in the data collection process or a violated model assumption.
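
The comparison in step 3 can be written as a simple rule. As a sketch, let $r^{\text{train}}_{ij}$ and $r^{\text{prod}}_{ij}$ be the Pearson correlations of features $i$ and $j$ in the training and production data, and let $\tau$ be a user-chosen threshold (these symbols are introduced here for illustration only). A pair is flagged when:

$$
\left| r^{\text{train}}_{ij} - r^{\text{prod}}_{ij} \right| > \tau, \qquad i < j
$$

Only pairs with $i < j$ need checking, because a correlation matrix is symmetric and its diagonal is always 1.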

In [1]:
# write your code from here
import pandas as pd
import numpy as np

def detect_correlation_drift(train_df, prod_df, features=None, threshold=0.2):
    """
    Detect drift in feature correlations between training and production datasets.
    
    Parameters:
    - train_df: pd.DataFrame with training data
    - prod_df: pd.DataFrame with production data
    - features: list of features to check (default: all numeric columns)
    - threshold: float, minimum absolute change in correlation to flag drift
    
    Returns:
    - drift_report: pd.DataFrame listing feature pairs with correlation change > threshold
    """
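    if features is None:
        # Default to numeric training columns that also exist in production data
        numeric_cols = train_df.select_dtypes(include=[np.number]).columns
        features = [c for c in numeric_cols if c in prod_df.columns]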
    
    # Compute correlation matrices
    corr_train = train_df[features].corr()
    corr_prod = prod_df[features].corr()
    
    # Calculate absolute difference matrix
    corr_diff = (corr_train - corr_prod).abs()
    
    # Extract pairs with significant changes
    drift_pairs = []
    for i in range(len(features)):
        for j in range(i+1, len(features)):
            f1 = features[i]
            f2 = features[j]
            diff = corr_diff.loc[f1, f2]
            if diff > threshold:
                drift_pairs.append({
                    'feature_1': f1,
                    'feature_2': f2,
                    'train_corr': corr_train.loc[f1, f2],
                    'prod_corr': corr_prod.loc[f1, f2],
                    'corr_change': diff
                })
    
    drift_report = pd.DataFrame(drift_pairs)
    if drift_report.empty:
        print("No significant correlation drift detected.")
    else:
        print(f"Detected {len(drift_report)} feature pairs with correlation drift > {threshold}:")
        print(drift_report)
    
    return drift_report

# Example usage:

np.random.seed(42)

# Training data: correlated features
train_data = pd.DataFrame({
    'f1': np.random.normal(0, 1, 1000),
    'f2': np.random.normal(0, 1, 1000),
    'f3': np.random.normal(0, 1, 1000)
})
train_data['f2'] = train_data['f1'] * 0.8 + np.random.normal(0, 0.2, 1000)  # strong corr with f1

# Production data: correlation between f1 and f2 changes significantly
prod_data = pd.DataFrame({
    'f1': np.random.normal(0, 1, 1000),
    'f2': np.random.normal(0, 1, 1000),
    'f3': np.random.normal(0, 1, 1000)
})
prod_data['f2'] = prod_data['f1'] * 0.2 + np.random.normal(0, 0.8, 1000)  # weaker corr with f1

detect_correlation_drift(train_data, prod_data)



Detected 1 feature pairs with correlation drift > 0.2:
  feature_1 feature_2  train_corr  prod_corr  corr_change
0        f1        f2    0.967081   0.257421      0.70966


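A fixed threshold does not account for sample size: a change of 0.2 means much less with 30 rows than with 1000. One common way to check whether a flagged change is statistically significant is a Fisher z-test on the two correlations. The sketch below is an optional follow-up, not part of the original exercise. The function name `fisher_z_test` is hypothetical, and the test assumes the two samples are independent and the features are roughly bivariate normal.

```python
import numpy as np
from math import erfc, sqrt

def fisher_z_test(r_train, r_prod, n_train, n_prod):
    """
    Two-sided Fisher z-test for the difference between two independent
    Pearson correlations (hypothetical helper, for illustration only).

    Returns (z_statistic, p_value).
    """
    # Clip so that arctanh stays finite when a correlation is exactly +/-1
    r1 = np.clip(r_train, -0.999999, 0.999999)
    r2 = np.clip(r_prod, -0.999999, 0.999999)

    # Fisher transformation: arctanh(r) is approximately normal
    # with variance 1 / (n - 3)
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = sqrt(1.0 / (n_train - 3) + 1.0 / (n_prod - 3))
    z = (z1 - z2) / se

    # Two-sided p-value under the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return float(z), float(p_value)

# Example: the f1/f2 pair reported above, with 1000 rows in each dataset
z_stat, p_val = fisher_z_test(0.967081, 0.257421, 1000, 1000)
print(f"z = {z_stat:.2f}, p = {p_val:.3g}")
```

A small p-value suggests the change is unlikely to be sampling noise alone. With many feature pairs, some correction for multiple comparisons would be needed before treating individual p-values as evidence of drift.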
