# 🚨 Drift Detection on the Adult Dataset

This notebook demonstrates how to detect **data drift** between a reference dataset and a current dataset using several statistical techniques:

- Kolmogorov–Smirnov (K-S) Test
- KL Divergence
- Jensen-Shannon (JS) Divergence
- Population Stability Index (PSI)
- Wasserstein Distance
- Page-Hinkley Drift Detection

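We'll apply these techniques to the **Adult Income dataset**. Rows whose `education` value is `Some-college`, `HS-grad`, or `Bachelors` form the current dataset; all other rows form the reference dataset. Because `education-num` is a numeric encoding of `education`, this split makes that feature drift by construction.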
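
Before turning to the real data, a small synthetic sketch may help calibrate two of the metrics listed above. It assumes two normal samples with a modest mean shift (the sample sizes, seed, and shift are illustrative choices, not values from this notebook) and shows that the K-S statistic is unaffected by rescaling a feature, while the Wasserstein distance is expressed in the feature's own units.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)                    # fixed seed, illustrative only
ref = rng.normal(loc=0.0, scale=1.0, size=5_000)  # stand-in for a reference feature
cur = rng.normal(loc=0.2, scale=1.0, size=5_000)  # same feature with a small mean shift

ks_stat, ks_p = ks_2samp(ref, cur)
wass = wasserstein_distance(ref, cur)

# Rescale both samples by the same factor (a change of units).
# The K-S statistic depends only on the ordering of values, so it is unchanged;
# the Wasserstein distance is multiplied by the same factor.
ks_scaled, _ = ks_2samp(ref * 1000, cur * 1000)
wass_scaled = wasserstein_distance(ref * 1000, cur * 1000)

print(f"K-S stat: {ks_stat:.4f} (rescaled: {ks_scaled:.4f}), p-value: {ks_p:.3g}")
print(f"Wasserstein: {wass:.4f} (rescaled: {wass_scaled:.1f})")
```

This is why a single numeric threshold is hard to apply to Wasserstein distances across features with very different scales.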

In [5]:


import numpy as np
import pandas as pd
from sklearn import datasets
from scipy.stats import ks_2samp, entropy, wasserstein_distance

# Load the dataset
adult_data = datasets.fetch_openml(name='adult', version=2, as_frame='auto')
adult = adult_data.frame

# Split into reference and current datasets
adult_ref = adult[~adult.education.isin(['Some-college', 'HS-grad', 'Bachelors'])].reset_index(drop=True)
adult_cur = adult[adult.education.isin(['Some-college', 'HS-grad', 'Bachelors'])].reset_index(drop=True)

# Use only numeric columns
numeric_cols = adult.select_dtypes(include=np.number).columns.tolist()

print(f"📊 Numeric features to analyze: {numeric_cols}")
print(f"✅ Reference dataset size: {len(adult_ref)}")
print(f"✅ Current dataset size: {len(adult_cur)}")

📊 Numeric features to analyze: ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
✅ Reference dataset size: 14155
✅ Current dataset size: 34687


## 🔧 Drift Detection Functions

We define helper functions for the metrics that are not available as a single SciPy call (the K-S test and Wasserstein distance come straight from `scipy.stats`):
- KL & JS Divergence
- PSI
- Page-Hinkley

In [6]:
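def compute_kl_divergence(p, q, bins=20):
    """KL(p || q) between histograms of two samples (natural log)."""
    # Note: each sample is binned over its own min-max range, so bin i of
    # p_hist and bin i of q_hist cover different intervals when the ranges differ.
    p_hist, _ = np.histogram(p, bins=bins, density=True)
    q_hist, _ = np.histogram(q, bins=bins, density=True)
    # Small floor avoids log(0) and division by zero for empty bins.
    p_hist += 1e-10
    q_hist += 1e-10
    # scipy's entropy() normalizes both inputs to sum to 1 before computing KL.
    return entropy(p_hist, q_hist)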

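def compute_js_divergence(p, q, bins=20):
    """Jensen-Shannon-style divergence between histograms of two samples."""
    # Same per-sample binning as compute_kl_divergence.
    p_hist, _ = np.histogram(p, bins=bins, density=True)
    q_hist, _ = np.histogram(q, bins=bins, density=True)
    p_hist += 1e-10
    q_hist += 1e-10
    # The mixture is built from unnormalized densities and entropy() rescales
    # each argument separately, so the result can exceed the ln(2) bound of JS.
    m = 0.5 * (p_hist + q_hist)
    return 0.5 * (entropy(p_hist, m) + entropy(q_hist, m))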

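def compute_psi(expected, actual, buckets=10):
    """Population Stability Index over quantile buckets of `expected`."""
    # Bucket edges are evenly spaced percentiles (0th to 100th) of the reference.
    # Heavily tied features produce repeated edges and therefore empty buckets.
    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
    psi_value = 0
    for i in range(buckets):
        # Buckets are half-open [lo, hi): values equal to the top edge, and
        # `actual` values outside the reference range, are not counted.
        e_count = ((expected >= breakpoints[i]) & (expected < breakpoints[i + 1])).sum()
        a_count = ((actual >= breakpoints[i]) & (actual < breakpoints[i + 1])).sum()
        # Empty buckets get a 1e-10 floor so the log stays finite.
        e_pct = e_count / len(expected) if e_count > 0 else 1e-10
        a_pct = a_count / len(actual) if a_count > 0 else 1e-10
        psi_value += (e_pct - a_pct) * np.log(e_pct / a_pct)
    return psi_value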

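def page_hinkley(data, threshold=0.1, alpha=0.99):
    """Simplified Page-Hinkley-style check for an upward shift in the mean."""
    # `threshold` is used both as the per-step tolerance and as the alarm level;
    # `alpha` is the decay factor of the exponential moving average.
    mean = 0  # the moving average starts at 0, not at the first observation
    cumulative_sum = 0
    min_cum_sum = 0
    for value in data:
        mean = alpha * mean + (1 - alpha) * value
        # Accumulate deviations from the moving average, minus the tolerance.
        cumulative_sum += value - mean - threshold
        min_cum_sum = min(min_cum_sum, cumulative_sum)
        # Alarm when the sum has risen more than `threshold` above its minimum.
        if cumulative_sum - min_cum_sum > threshold:
            return True
    return False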
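
`compute_kl_divergence` and `compute_js_divergence` above bin each sample over its own range. A common alternative is to histogram both samples on one shared set of edges and normalize the counts before comparing them. The sketch below illustrates that variant; the helper names (`binned_probs`, `kl_shared`, `js_shared`) and the smoothing constant are assumptions for illustration, not part of the notebook's original code.

```python
import numpy as np
from scipy.stats import entropy

def binned_probs(ref, cur, bins=20, eps=1e-10):
    """Histogram both samples on shared edges; return two probability vectors."""
    # Edges span the pooled data, so bin i means the same interval for both samples.
    edges = np.histogram_bin_edges(np.concatenate([ref, cur]), bins=bins)
    p = np.histogram(ref, bins=edges)[0].astype(float) + eps
    q = np.histogram(cur, bins=edges)[0].astype(float) + eps
    return p / p.sum(), q / q.sum()

def kl_shared(ref, cur, bins=20):
    p, q = binned_probs(ref, cur, bins)
    return entropy(p, q)                  # KL(p || q), natural log

def js_shared(ref, cur, bins=20):
    p, q = binned_probs(ref, cur, bins)
    m = 0.5 * (p + q)                     # mixture of normalized vectors
    return 0.5 * (entropy(p, m) + entropy(q, m))

# Synthetic check: identical samples vs. two well-separated samples.
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 2_000)
b = rng.normal(5, 1, 2_000)
print(f"JS(a, a) = {js_shared(a, a):.4f}")
print(f"JS(a, b) = {js_shared(a, b):.4f}  (upper bound ln 2 = {np.log(2):.4f})")
```

With shared edges and normalized vectors, the JS value stays within its theoretical range of 0 to ln 2, which makes it easier to compare across features.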

## 🚀 Run Drift Detection on Each Numeric Feature

We apply all drift detection metrics per feature and compile results in a table.

In [7]:
results = []

for col in numeric_cols:
    ref_data = adult_ref[col].dropna().values
    cur_data = adult_cur[col].dropna().values
    
    ks_stat, ks_p = ks_2samp(ref_data, cur_data)
    kl = compute_kl_divergence(ref_data, cur_data)
    js = compute_js_divergence(ref_data, cur_data)
    psi = compute_psi(ref_data, cur_data)
    wass = wasserstein_distance(ref_data, cur_data)
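    # Page-Hinkley is sequential: it scans the current data in row order only
    # and does not use the reference data.
    ph = page_hinkley(cur_data)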

    results.append({
        'Feature': col,
        'K-S p-value': ks_p,
        'K-S Stat': ks_stat,
        'KL Divergence': kl,
        'JS Divergence': js,
        'PSI': psi,
        'Wasserstein': wass,
        'Page-Hinkley Drift': '✅' if ph else '—'
    })

# Format results
results_df = pd.DataFrame(results).sort_values(by='K-S Stat', ascending=False)

# Highlight and display nicely
def highlight_drift(val):
    if isinstance(val, float) and val > 0.1:
        return 'color: red; font-weight: bold'
    return ''

# Styler.map replaces the deprecated Styler.applymap (pandas >= 2.1)
styled = (
    results_df.style
    .map(highlight_drift, subset=['KL Divergence', 'JS Divergence', 'PSI', 'Wasserstein'])
    .background_gradient(subset=['K-S Stat'], cmap='Reds')
    .format(precision=4)
)

styled




|   | Feature | K-S p-value | K-S Stat | KL Divergence | JS Divergence | PSI | Wasserstein | Page-Hinkley Drift |
|---|---|---|---|---|---|---|---|---|
| 2 | education-num | 0.0 | 0.4527 | 19.9361 | 0.7191 | 14.9928 | 2.5099 | ✅ |
| 0 | age | 0.0 | 0.0827 | 0.0373 | 0.0095 | 0.0728 | 2.6908 | ✅ |
| 5 | hours-per-week | 0.0 | 0.0313 | 0.0084 | 0.002 | 0.014 | 1.1891 | ✅ |
| 3 | capital-gain | 0.0002 | 0.0216 | 0.0088 | 0.0019 | 0.0 | 810.1876 | ✅ |
| 1 | fnlwgt | 0.0278 | 0.0146 | 0.0026 | 0.0006 | 0.002 | 2509.6368 | ✅ |
| 4 | capital-loss | 0.3464 | 0.0093 | 0.0341 | 0.0064 | 0.0 | 14.5376 | ✅ |
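
The PSI column ranges from 0.0 to 14.9928, and both extremes are sensitive to how buckets are built: repeated percentile edges create empty buckets, and empty buckets are replaced by a very small constant. The sketch below shows one possible variant that de-duplicates the edges, leaves the outer buckets open-ended so no value is dropped, and uses a larger floor. The function names and the `eps=1e-4` floor are illustrative assumptions, not an established standard.

```python
import numpy as np

def bucket_fractions(x, inner_edges):
    """Share of values per bucket; bucket k holds inner_edges[k-1] <= x < inner_edges[k]."""
    idx = np.searchsorted(inner_edges, x, side="right")
    counts = np.bincount(idx, minlength=len(inner_edges) + 1)
    return counts / len(x)

def psi_quantile(expected, actual, buckets=10, eps=1e-4):
    """PSI over quantile buckets of `expected` with unique, open-ended edges."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior percentiles only; np.unique removes edges repeated by heavy ties.
    qs = np.linspace(0, 100, buckets + 1)[1:-1]
    inner = np.unique(np.percentile(expected, qs))
    e_pct = np.clip(bucket_fractions(expected, inner), eps, None)
    a_pct = np.clip(bucket_fractions(actual, inner), eps, None)
    return float(np.sum((e_pct - a_pct) * np.log(e_pct / a_pct)))

# Synthetic check: no shift vs. a one-standard-deviation shift.
rng = np.random.default_rng(0)
base = rng.normal(0, 1, 5_000)
shifted = rng.normal(1, 1, 5_000)
print(f"PSI(base, base)    = {psi_quantile(base, base):.4f}")
print(f"PSI(base, shifted) = {psi_quantile(base, shifted):.4f}")
```

For features where most values are identical (such as the capital columns, which are mostly zero), quantile edges still collapse to very few buckets, so a quantile-based PSI may remain close to zero there whatever the floor.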


## 🧾 Interpretation Guidelines

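- **K-S Test**: a p-value below 0.05 rejects, at the 5% level, the hypothesis that both samples come from the same distribution. With samples this large (14,155 vs. 34,687 rows), even small shifts are significant: `fnlwgt` has p = 0.0278 with a K-S statistic of only 0.0146. Read the statistic alongside the p-value.
- **KL/JS Divergence**: higher values indicate more drift. There is no universal threshold; this notebook highlights values above 0.1. Both depend on the binning. A JS divergence computed on normalized histograms over shared bins cannot exceed ln 2 ≈ 0.693 (natural log), so the 0.7191 for `education-num` shows that the per-sample binning used here distorts the value.
- **PSI**:
  - < 0.1 → No drift
  - 0.1–0.25 → Moderate drift
  - > 0.25 → Significant drift
  - Extreme values such as 14.9928 for `education-num` are dominated by the `1e-10` floor applied to empty buckets. The 0.0 for `capital-gain` and `capital-loss` is consistent with most percentile breakpoints coinciding at 0, which leaves nearly all rows in a single bucket.
- **Wasserstein**: higher means more drift, but the distance is in the feature's own units, so compare it only within a feature or after scaling. `fnlwgt` has the largest distance (2509.6368) and the second-smallest K-S statistic, so the fixed 0.1 highlight is not meaningful for this column.
- **Page-Hinkley**: binary flag for a change in the mean of a sequence. As implemented here it scans only the current data and starts its moving average at 0, so it fires on the first clearly positive value. It flags every feature, including `capital-loss` (K-S p-value 0.3464), and therefore does not discriminate between features in this run.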
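
Page-Hinkley is usually described as a sequential test: it accumulates deviations of each observation from the running mean of everything seen so far and raises an alarm when that sum climbs far enough above its historical minimum. The sketch below follows that description for upward shifts only, with separate `delta` (tolerance) and `lam` (alarm level) parameters and a short warm-up. The function name, parameter values, and synthetic stream are illustrative assumptions and would need tuning to the scale of each real feature.

```python
import numpy as np

def page_hinkley_index(stream, delta=0.05, lam=5.0, min_samples=30):
    """Return the index where an upward mean shift is flagged, or None."""
    mean = 0.0
    cum = 0.0
    cum_min = 0.0
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t      # running mean of all values seen so far
        cum += x - mean - delta     # deviation from the mean, minus tolerance
        cum_min = min(cum_min, cum)
        # Skip the warm-up period, then alarm on a large rise above the minimum.
        if t >= min_samples and cum - cum_min > lam:
            return t - 1
    return None

# Synthetic stream: 1,000 stable values followed by 1,000 values with a higher mean,
# mimicking reference data followed by current data.
rng = np.random.default_rng(0)
stable = rng.normal(0.0, 0.1, 1_000)
shifted = rng.normal(1.0, 0.1, 1_000)
stream = np.concatenate([stable, shifted])

print("alarm on stable data:", page_hinkley_index(stable))
print("alarm index on full stream:", page_hinkley_index(stream))
```

Feeding the reference values followed by the current values gives the detector a baseline to compare against. On tabular data with no natural time order, the alarm position depends on row order and should be read with that in mind.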