### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., `gender`) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the two distributions of the categorical feature.
4. If significant drift is detected, investigate the cause and update the model as needed.
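
Step 4 leaves "investigate the cause" open. One possible starting point, sketched here under assumptions, is to compare per-category proportions between the two datasets and see which categories moved most. The helper name `category_shift` and its column names are hypothetical and not part of the exercise; the sample values are the same toy `gender` data used in this notebook.

```python
import pandas as pd


def category_shift(baseline: pd.Series, production: pd.Series) -> pd.DataFrame:
    """Return per-category proportions in each dataset and their difference.

    Hypothetical helper for illustration; not part of the original exercise.
    """
    base = baseline.value_counts(normalize=True)
    prod = production.value_counts(normalize=True)
    # Outer-align on category labels; a category absent from one side gets proportion 0.
    table = pd.concat([base, prod], axis=1, keys=["baseline", "production"]).fillna(0.0)
    table["shift"] = table["production"] - table["baseline"]
    # Put the categories with the largest absolute change first.
    return table.sort_values("shift", key=lambda s: s.abs(), ascending=False)


baseline_gender = pd.Series(
    ["Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female"]
)
production_gender = pd.Series(
    ["Male", "Male", "Male", "Female", "Male", "Male", "Male", "Male"]
)

shift_table = category_shift(baseline_gender, production_gender)
print(shift_table)
```

A table like this does not replace the significance test. It only points to the categories worth inspecting once the test flags drift.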

# write your code from here

import pandas as pd
from scipy.stats import chi2_contingency

# Step 1: Baseline (training) categorical feature distribution
baseline_data = {
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female']
}
baseline_df = pd.DataFrame(baseline_data)

# Step 2: Current production categorical feature distribution
production_data = {
    'gender': ['Male', 'Male', 'Male', 'Female', 'Male', 'Male', 'Male', 'Male']
}
production_df = pd.DataFrame(production_data)

# Get counts of categories in each dataset
baseline_counts = baseline_df['gender'].value_counts().sort_index()
production_counts = production_df['gender'].value_counts().sort_index()

# Align categories (to handle categories missing in one dataset)
all_categories = sorted(set(baseline_counts.index).union(set(production_counts.index)))
baseline_counts = baseline_counts.reindex(all_categories, fill_value=0)
production_counts = production_counts.reindex(all_categories, fill_value=0)

# Create contingency table
contingency_table = pd.DataFrame({
    'baseline': baseline_counts,
    'production': production_counts
})

print("Contingency Table:")
print(contingency_table)

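# Step 3: Perform the chi-squared test of independence
# Note: for a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
# With only 8 samples per dataset, some expected counts fall below 5,
# so treat the p-value from this toy example as illustrative.
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table.T)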

print(f"\nChi-squared Statistic: {chi2_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Step 4: Interpret results
alpha = 0.05  # significance level
if p_value < alpha:
    print("Significant data drift detected in categorical feature 'gender'.")
    print("Investigate the cause and consider updating the ML model.")
else:
    print("No significant data drift detected in 'gender'.")


Contingency Table:
        baseline  production
gender                      
Female         4           1
Male           4           7

Chi-squared Statistic: 1.1636
P-value: 0.2807
No significant data drift detected in 'gender'.
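
With samples this small, the chi-squared approximation can be unreliable. The sketch below is one hedged way to handle that: it inspects the expected counts and, for a 2x2 table, falls back to Fisher's exact test from `scipy.stats`. The threshold of 5 is a common rule of thumb assumed here, not something the exercise specifies, and the counts are copied from the contingency table above.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Counts copied from the contingency table above.
# Rows: baseline, production. Columns: Female, Male.
observed = np.array([[4, 4],
                     [1, 7]])

# Expected counts under the hypothesis that both datasets share one distribution.
_, _, _, expected = chi2_contingency(observed)
min_expected = expected.min()

# Assumed rule of thumb: prefer an exact test when any expected count is below 5.
MIN_EXPECTED_COUNT = 5

if observed.shape == (2, 2) and min_expected < MIN_EXPECTED_COUNT:
    # Fisher's exact test needs no large-sample approximation, but SciPy's version covers 2x2 tables only.
    _, p_value = fisher_exact(observed, alternative="two-sided")
    test_used = "fisher_exact"
else:
    _, p_value, _, _ = chi2_contingency(observed)
    test_used = "chi2_contingency"

print(f"Smallest expected count: {min_expected:.2f}")
print(f"Test used: {test_used}")
print(f"P-value: {p_value:.4f}")
```

For features with more than two categories, the fallback branch does not apply. Collecting more production data or merging rare categories before testing are common alternatives.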
