### Detect Data Drift in ML Models
**Objective**: Monitor and detect changes in data distributions that impact ML model performance.

**Task**: Categorical Feature Drift

**Steps**:
1. Load the baseline distribution for a categorical feature (e.g., `gender`) from your training dataset.
2. Load the same feature from your current production data.
3. Use a chi-squared test to compare the category frequencies of the baseline and production data.
4. If significant drift is detected, investigate the cause and update the model as needed.
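
For step 4, one possible first pass at investigating drift is to compare each category's share in the two datasets and see which categories moved the most. The sketch below is only an illustration: the helper name `category_share_shift` and the sample values are assumptions, not part of the exercise or of any library.

```python
import pandas as pd

def category_share_shift(baseline_series, current_series):
    """Hypothetical helper: per-category share in each dataset and the change."""
    # Relative frequency of each category (NaN values are dropped by default)
    baseline_share = baseline_series.value_counts(normalize=True)
    current_share = current_series.value_counts(normalize=True)

    # Align on the union of categories; a category absent from one side gets 0.0
    categories = baseline_share.index.union(current_share.index)
    shift = pd.DataFrame({
        "baseline_share": baseline_share.reindex(categories, fill_value=0.0),
        "current_share": current_share.reindex(categories, fill_value=0.0),
    })
    shift["difference"] = shift["current_share"] - shift["baseline_share"]

    # Largest absolute change first
    order = shift["difference"].abs().sort_values(ascending=False).index
    return shift.loc[order]

# Illustrative data only
baseline = pd.Series(["Male", "Female", "Female", "Male", "Female", "Male", "Male"])
current = pd.Series(["Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female"])
print(category_share_shift(baseline, current))
```

A table like this does not replace the significance test; it only shows where the difference is concentrated once a test has flagged the feature.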

In [1]:
import pandas as pd
from scipy.stats import chi2_contingency

def detect_categorical_drift_chi2(baseline_series, current_series, alpha=0.05):
    """
    Detect drift in a categorical feature using a chi-squared test.
    
    Parameters:
    - baseline_series: pd.Series with categorical data from baseline (train)
    - current_series: pd.Series with categorical data from production (current)
    - alpha: significance level
    
    Returns:
    - drift_detected: bool
    - p_value: float
    """
    # Calculate frequency counts for both datasets (value_counts drops NaN by default)
    baseline_counts = baseline_series.value_counts().sort_index()
    current_counts = current_series.value_counts().sort_index()

    # Align categories (fill missing categories with zero counts)
    all_categories = baseline_counts.index.union(current_counts.index)
    baseline_counts = baseline_counts.reindex(all_categories, fill_value=0)
    current_counts = current_counts.reindex(all_categories, fill_value=0)
    
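    # Create contingency table: one column per dataset, one row per category
    contingency_table = pd.DataFrame({
        'baseline': baseline_counts,
        'current': current_counts
    })
    
    # Perform the chi-squared test on the transposed table (datasets as rows).
    # Note: for a 2x2 table, chi2_contingency applies Yates' continuity
    # correction by default.
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table.T)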
    
    print(f"Chi-squared statistic: {chi2_stat:.4f}")
    print(f"P-value: {p_value:.4f}")
    
    drift_detected = p_value < alpha
    if drift_detected:
        print("Data drift detected in the categorical feature.")
    else:
        print("No significant data drift detected.")
    
    return drift_detected, p_value

# Example usage:

# Baseline training data
baseline_gender = pd.Series(['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Male'])

# Current production data (shift towards Female)
current_gender = pd.Series(['Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'])

detect_categorical_drift_chi2(baseline_gender, current_gender)


Chi-squared statistic: 0.5469
P-value: 0.4596
No significant data drift detected.


(False, 0.45959738618394197)
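
With only 7 and 8 observations, the result above should be read with caution: a common rule of thumb is that the chi-squared approximation becomes unreliable when expected cell counts fall below about 5. The sketch below checks that condition and, for a 2x2 table, falls back to Fisher's exact test from SciPy. The helper name `small_sample_check` and the threshold `min_expected=5` are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency, fisher_exact

def small_sample_check(baseline_series, current_series, min_expected=5):
    """Hypothetical helper: flag small expected counts; use Fisher's test for 2x2."""
    # Build a (dataset x category) count table; missing categories become 0
    table = pd.concat(
        [baseline_series.value_counts(), current_series.value_counts()],
        axis=1, keys=["baseline", "current"],
    ).fillna(0).astype(int).T

    # Expected counts under the hypothesis that both datasets share one distribution
    _, _, _, expected = chi2_contingency(table)
    too_small = bool((expected < min_expected).any())

    # Fisher's exact test in SciPy handles 2x2 tables
    fisher_p = None
    if too_small and table.shape == (2, 2):
        _, fisher_p = fisher_exact(table.to_numpy())
    return too_small, fisher_p

# Illustrative data only
baseline = pd.Series(["Male", "Female", "Female", "Male", "Female", "Male", "Male"])
current = pd.Series(["Female", "Female", "Female", "Male", "Female", "Female", "Male", "Female"])
too_small, fisher_p = small_sample_check(baseline, current)
print(f"Expected counts below threshold: {too_small}")
print(f"Fisher exact p-value: {fisher_p}")
```

In production monitoring, samples are usually far larger, so the check is mostly relevant for rare categories or short monitoring windows.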