### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., gender) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test of independence to compare the category frequencies in the two datasets.
4. If significant drift is detected, investigate the cause and update the model as needed.

import pandas as pd
from scipy.stats import chi2_contingency

# Step 1: Baseline (training) data distribution
train_data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female']
})

# Step 2: Production data distribution
prod_data = pd.DataFrame({
    'gender': ['Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female', 'Male']
})

# Step 3: Create frequency tables
train_counts = train_data['gender'].value_counts().sort_index()
prod_counts = prod_data['gender'].value_counts().sort_index()

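# Combine into a 2 x k contingency table (row 0 = training, row 1 = production);
# a category missing from one dataset gets a count of 0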
categories = sorted(set(train_counts.index).union(set(prod_counts.index)))
contingency = [
    [train_counts.get(cat, 0) for cat in categories],
    [prod_counts.get(cat, 0) for cat in categories]
]

# Chi-squared test of independence
# (SciPy applies Yates' continuity correction to 2x2 tables by default)
chi2, p, dof, expected = chi2_contingency(contingency)

print("Chi-squared test p-value:", p)
if p < 0.05:
    print("Significant drift detected in 'gender' feature.")
else:
    print("No significant drift detected in 'gender' feature.")


Chi-squared test p-value: 0.1432349075246656
No significant drift detected in 'gender' feature.
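
With only 10 rows per dataset, the expected counts in the table above are small, and the chi-squared approximation is generally considered unreliable in that situation. The sketch below is one possible safeguard, not part of the original exercise: it inspects the `expected` array returned by `chi2_contingency` and, for a 2x2 table, falls back to Fisher's exact test. The threshold of 5 is a common rule of thumb and is an assumption here, as are the variable names `small_sample` and `test_name`.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Counts from the example above, columns ordered as [Female, Male]
contingency = np.array([
    [5, 5],  # training
    [1, 9],  # production
])

# The fourth return value holds the expected counts under "no drift"
_, p_chi2, _, expected = chi2_contingency(contingency)

# Rule of thumb (assumption): distrust the chi-squared approximation
# when any expected cell count is below 5
small_sample = bool((expected < 5).any())

if small_sample and contingency.shape == (2, 2):
    # Fisher's exact test is only defined here for 2x2 tables
    _, p_value = fisher_exact(contingency)
    test_name = "Fisher's exact test"
else:
    p_value = p_chi2
    test_name = "Chi-squared test"

print(f"{test_name} p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant drift detected in 'gender' feature.")
else:
    print("No significant drift detected in 'gender' feature.")
```

On this toy data the fallback is triggered, since the expected count for 'Female' is 3 in each row. For features with more than two categories, collecting a larger production sample before testing is probably the simpler remedy.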
