### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., `gender`) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test on the contingency table of category counts to check whether the feature's distribution differs between the baseline and production samples.
4. If significant drift is detected, investigate the cause and update the model as needed.

In [1]:

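import pandas as pd
from scipy.stats import chi2_contingency

# Step 1: baseline (training) sample of the categorical feature
baseline_data = pd.DataFrame({
    'gender': ['Male'] * 60 + ['Female'] * 40
})
# Step 2: the same feature from current production data
production_data = pd.DataFrame({
    'gender': ['Male'] * 40 + ['Female'] * 60
})

# Count each category; sort_index() keeps the category order consistent
baseline_counts = baseline_data['gender'].value_counts().sort_index()
production_counts = production_data['gender'].value_counts().sort_index()

# Step 3: build a 2 x k contingency table (rows = samples, columns = categories).
# fillna(0) covers a category that appears in only one of the two samples.
contingency_table = pd.DataFrame([baseline_counts, production_counts]).fillna(0)
contingency_table.index = ['Baseline', 'Production']

# For a 2 x 2 table, chi2_contingency applies Yates' continuity correction by default
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("Chi-Squared Test Result:")
print(f"Chi2 Statistic: {chi2:.2f}")
print(f"P-value: {p:.4f}")

# Step 4: flag drift at the 5% significance level
if p < 0.05:
    print("🚨 Significant drift detected in 'gender' distribution.")
    print("➡️ Investigate and consider retraining your model.")
else:
    print("✅ No significant drift detected in 'gender' distribution.")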


Chi-Squared Test Result:
Chi2 Statistic: 7.22
P-value: 0.0072
🚨 Significant drift detected in 'gender' distribution.
➡️ Investigate and consider retraining your model.
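
When drift is flagged, a natural first step in the investigation is to see which categories moved and by how much. The sketch below is one possible way to package the steps above into a reusable function. The name `categorical_drift_report`, its parameters, and the returned dictionary keys are hypothetical, chosen only for illustration. It assumes both inputs are pandas Series of category labels and that each contains at least one non-missing value.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def categorical_drift_report(baseline, production, alpha=0.05):
    """Hypothetical helper: chi-squared drift check plus per-category shift."""
    # Use the union of categories so that a category missing from one
    # sample is counted as 0 instead of being dropped.
    categories = sorted(set(baseline.dropna()) | set(production.dropna()))
    base_counts = baseline.value_counts().reindex(categories, fill_value=0)
    prod_counts = production.value_counts().reindex(categories, fill_value=0)

    # Rows = categories, columns = samples. The chi-squared statistic is
    # the same whether or not the table is transposed.
    table = pd.DataFrame({"Baseline": base_counts, "Production": prod_counts})
    chi2, p, dof, _ = chi2_contingency(table)

    # Change in each category's share (production minus baseline).
    shift = prod_counts / prod_counts.sum() - base_counts / base_counts.sum()

    return {
        "chi2": float(chi2),
        "p_value": float(p),
        "dof": int(dof),
        "drift": bool(p < alpha),
        "shift": shift,
    }


if __name__ == "__main__":
    baseline = pd.Series(["Male"] * 60 + ["Female"] * 40)
    production = pd.Series(["Male"] * 40 + ["Female"] * 60)
    report = categorical_drift_report(baseline, production)
    print(f"Drift detected: {report['drift']}")
    print(report["shift"])
```

The `shift` Series reports the change in each category's proportion, which helps separate a statistically significant but small shift from one large enough to matter for the model. With large production samples, even minor shifts can fall below the 0.05 threshold, so the p-value is best read alongside the size of the shift.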
