# Module 10: Data Drift Monitoring

**Course**: End-to-End Machine Learning (DataCamp)  
**Case Study**: CardioCare Heart Disease Prediction  
**Author**: Seif

---

## Overview

- What is data drift and why it matters
- Kolmogorov-Smirnov (KS) test for detecting drift in a single feature
- Practical example with scipy.stats.ks_2samp
- Remediation strategies (retrain, blend old/new data)
- References to drift-detection libraries (Evidently, NannyML)

## What is data drift?

Data drift occurs when the statistical properties of input features change over time:
- Population changes (e.g., aging demographics, improved healthcare)
- Data collection process shifts (new devices, protocols)
- Seasonal or environmental trends

**Example**: A heart disease model trained decades ago might see fewer young patients with disease today due to better prevention.

Even if model accuracy doesn't drop immediately, drift signals that the training data distribution no longer matches production, risking eventual degradation.

## Kolmogorov-Smirnov (KS) test

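The two-sample KS test compares two samples (e.g., training vs. recent production) of a single continuous feature:
- **Test statistic**: the maximum absolute vertical gap between the two empirical cumulative distribution functions (ECDFs); it ranges from 0 to 1
- **p-value**: probability of observing a gap at least this large if both samples came from the same distribution

**Rule of thumb**: p-value < 0.05 → suspect drift (distributions likely differ). With large samples even tiny, practically irrelevant shifts become significant, so read the statistic alongside the p-value.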
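
To make the test statistic concrete, here is a minimal sketch that computes the largest ECDF gap by hand and compares it with `ks_2samp`. The helper name `ks_statistic_manual` and the simulated samples are assumptions for illustration only, not part of the course material.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_statistic_manual(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The gap between two step functions can only change at observed values,
    # so it is enough to evaluate both ECDFs at every pooled data point.
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side="right") / a.size
    ecdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(ecdf_a - ecdf_b)))


# Simulated samples (illustrative values only)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(0.5, 1.0, size=150)

d_manual = ks_statistic_manual(x, y)
d_scipy = float(ks_2samp(x, y).statistic)
print(f"manual D = {d_manual:.4f}, scipy D = {d_scipy:.4f}")
```

The two values should agree, which shows that the statistic depends only on the ordering of the pooled values and not on any distributional assumption.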

In [None]:
# Example: detect drift in 'cholesterol' between training and recent data
import numpy as np
from scipy.stats import ks_2samp

# Simulate training data (old distribution)
np.random.seed(42)
chol_train = np.random.normal(loc=240, scale=40, size=500)  # mean=240, sd=40

# Simulate recent production data (shifted distribution)
chol_recent = np.random.normal(loc=220, scale=35, size=300)  # mean=220, sd=35

# Perform KS test
statistic, p_value = ks_2samp(chol_train, chol_recent)
print(f"KS statistic: {statistic:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("⚠️  Drift detected (p < 0.05). Distributions likely differ.")
else:
    print("✓ No significant drift detected (p >= 0.05).")

## Visualizing drift

Plotting histograms or CDFs side-by-side helps interpret the KS result.

In [None]:
# Plot histograms to visualize drift
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
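# density=True normalizes each histogram so samples of different sizes (500 vs. 300) are comparable
plt.hist(chol_train, bins=30, alpha=0.6, label='Training (old)', color='blue', density=True)
plt.hist(chol_recent, bins=30, alpha=0.6, label='Recent (new)', color='orange', density=True)
plt.xlabel('Cholesterol')
plt.ylabel('Density')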
plt.title('Histogram comparison')
plt.legend()

plt.subplot(1, 2, 2)
# Empirical CDF plot: the i-th smallest of n values has cumulative probability i / n
sorted_train = np.sort(chol_train)
sorted_recent = np.sort(chol_recent)
plt.plot(sorted_train, np.arange(1, len(sorted_train) + 1) / len(sorted_train), label='Training CDF', color='blue')
plt.plot(sorted_recent, np.arange(1, len(sorted_recent) + 1) / len(sorted_recent), label='Recent CDF', color='orange')
plt.xlabel('Cholesterol')
plt.ylabel('Cumulative Probability')
plt.title('CDF comparison (KS test measures max vertical gap)')
plt.legend()

plt.tight_layout()
plt.show()

## Correcting data drift

Once drift is detected:
1. **Retrain** on fresh data if you have enough recent samples.
2. **Blend old + new** if new data is scarce: train on a mix of historical and recent rows, and raise the share of recent rows as more of them arrive.
3. **Monitor continuously**: set up periodic KS tests (weekly/monthly) and alerts.
4. **Automate retraining**: trigger pipelines when drift crosses a threshold.
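
The blending strategy in step 2 could be sketched as follows. This is one possible approach under stated assumptions, not the course's implementation: the function name `blend_old_new`, the `new_fraction` parameter, and the toy DataFrames are all hypothetical. It keeps every recent row and downsamples the historical rows so that recent data makes up roughly the requested share.

```python
import pandas as pd


def blend_old_new(old_df, new_df, new_fraction=0.3, random_state=0):
    """Keep all new rows; downsample old rows so new rows are ~new_fraction of the result."""
    if not 0 < new_fraction <= 1:
        raise ValueError("new_fraction must be in (0, 1]")
    n_new = len(new_df)
    # Number of old rows needed so that n_new / (n_new + n_old) == new_fraction
    n_old_target = int(round(n_new * (1 - new_fraction) / new_fraction))
    # Cannot sample more old rows than exist
    n_old = min(n_old_target, len(old_df))
    old_sample = old_df.sample(n=n_old, random_state=random_state)
    return pd.concat([old_sample, new_df], ignore_index=True)


# Toy data (illustrative only)
old_df = pd.DataFrame({"chol": range(1000), "source": "old"})
new_df = pd.DataFrame({"chol": range(100), "source": "new"})

blended = blend_old_new(old_df, new_df, new_fraction=0.25)
print(blended["source"].value_counts())
```

Raising `new_fraction` on each retraining cycle is one way to "gradually increase" the weight of recent data; per-row sample weights would be an alternative that discards nothing.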

## Beyond KS: other drift detection methods

- **Population Stability Index (PSI)**: effective for categorical features or binned continuous features.
- **Chi-squared test**: for categorical distributions.
- **Jensen-Shannon divergence**: symmetric measure of distribution similarity.
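
As an illustration of the first method, here is a minimal PSI sketch for a continuous feature. The function name `population_stability_index`, the quantile binning on the reference sample, and the `eps` floor for empty bins are assumptions of this sketch; other binning choices are equally common.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), using bin proportions."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges taken from the reference quantiles; the outer bins are open-ended
    interior_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_bins = np.searchsorted(interior_edges, reference, side="right")
    cur_bins = np.searchsorted(interior_edges, current, side="right")
    ref_frac = np.bincount(ref_bins, minlength=n_bins) / reference.size
    cur_frac = np.bincount(cur_bins, minlength=n_bins) / current.size
    # Floor the proportions to avoid log(0) and division by zero for empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Simulated cholesterol samples (illustrative values only)
rng = np.random.default_rng(1)
chol_reference = rng.normal(240, 40, size=2000)
chol_same = rng.normal(240, 40, size=2000)
chol_shifted = rng.normal(220, 35, size=2000)

psi_same = population_stability_index(chol_reference, chol_same)
psi_shifted = population_stability_index(chol_reference, chol_shifted)
print(f"PSI (same distribution): {psi_same:.4f}")
print(f"PSI (shifted):           {psi_shifted:.4f}")
```

Commonly quoted rules of thumb treat a PSI below about 0.1 as stable and above about 0.25 as a substantial shift, but these cut-offs are conventions rather than statistical guarantees.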

### Dedicated libraries

- [**Evidently**](https://evidentlyai.com/): open-source for drift detection, model performance monitoring, and visualization.
- [**NannyML**](https://www.nannyml.com/): post-deployment performance estimation and drift monitoring without ground truth labels.

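Example (Evidently snippet; these import paths follow the `evidently.report` API of older releases and have changed in newer versions, so check the docs for your installed version. `train_df` and `recent_df` are assumed to be pandas DataFrames with the same columns):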
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=train_df, current_data=recent_df)
report.show()
```

## Practice

- Run a KS test on another feature (e.g., `age`, `thalach`) comparing training vs. recent data.
- Set up a scheduled job (cron/Airflow) to run drift checks weekly and log results to MLflow.
- Integrate Evidently or NannyML into your monitoring dashboard.
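
As a starting point for the first two exercises, a per-feature check could look like the sketch below. The function name `drift_report` and the simulated `age` and `thalach` columns are hypothetical; in practice you would pass your real training and recent DataFrames, and the scheduling and MLflow logging are left out.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def drift_report(reference, current, features, alpha=0.05):
    """Run a two-sample KS test per feature and return one row per feature."""
    rows = []
    for name in features:
        ref = reference[name].dropna().to_numpy(dtype=float)
        cur = current[name].dropna().to_numpy(dtype=float)
        result = ks_2samp(ref, cur)
        rows.append({
            "feature": name,
            "ks_statistic": float(result.statistic),
            "p_value": float(result.pvalue),
            "drift": bool(result.pvalue < alpha),
        })
    return pd.DataFrame(rows)


# Simulated data (illustrative only): 'thalach' is shifted, 'age' is not
rng = np.random.default_rng(7)
train_df = pd.DataFrame({
    "age": rng.normal(54, 9, size=500),
    "thalach": rng.normal(150, 22, size=500),
})
recent_df = pd.DataFrame({
    "age": rng.normal(54, 9, size=300),
    "thalach": rng.normal(135, 22, size=300),
})

report = drift_report(train_df, recent_df, ["age", "thalach"])
print(report)
```

When many features are tested at once, some will fall below 0.05 by chance, so consider a stricter per-feature threshold (for example, dividing `alpha` by the number of features).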

# Feedback loop, re-training, and labeling

In this section, we connect data drift monitoring with a practical feedback loop to keep your model effective over time. You'll see how to:

- Detect drift and decide when to react
- Acquire new labels (manually, crowd-sourcing, or programmatically)
- Retrain periodically or incrementally (online learning)
- Avoid pitfalls of harmful feedback loops


## What is a feedback loop?

A feedback loop is when a system's outputs (predictions, errors, usage stats) are fed back as inputs to guide future behavior. In ML, this means we:

- Observe model performance and data properties over time
- Use those observations to decide when and how to update the model or data pipeline
- Iterate to adapt to changing conditions, trends, or user behavior

Feedback loops enable continuous learning but must be designed carefully.

## Implementing a feedback loop

Common strategies:

1) Acquire new labels for fresh data
- Manual annotation by domain experts
- Crowdsourcing with quality checks
- Programmatic/weak supervision when appropriate

2) Periodic batch retraining
- Detect drift or performance decay
- Sample recent data, merge with historical, retrain, and validate
- Register and deploy only if it passes acceptance thresholds

3) Online/Incremental learning
- Use algorithms that support `partial_fit` to update on small batches
- Suitable when data arrives continuously and you need rapid adaptation
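
The "deploy only if it passes acceptance thresholds" step of strategy 2 can be sketched as a plain comparison function. Everything here is an assumption for illustration: the function name `passes_acceptance_gate`, the metric names, the minimum values, and the allowed regression margin are hypothetical and should be replaced by whatever your team agrees on.

```python
def passes_acceptance_gate(candidate, production, minimums, max_regression=0.01):
    """Return True only if the candidate clears every minimum and does not
    regress by more than max_regression against the production model."""
    for name, floor in minimums.items():
        value = candidate[name]
        # Absolute floor: the candidate must reach the agreed minimum
        if value < floor:
            return False
        # Relative check: the candidate must not be clearly worse than production
        if name in production and value < production[name] - max_regression:
            return False
    return True


# Hypothetical metric values measured on the same held-out validation set
minimums = {"recall": 0.80, "roc_auc": 0.85}
production_metrics = {"recall": 0.84, "roc_auc": 0.905}
good_candidate = {"recall": 0.86, "roc_auc": 0.90}
weak_candidate = {"recall": 0.78, "roc_auc": 0.91}

print(passes_acceptance_gate(good_candidate, production_metrics, minimums))
print(passes_acceptance_gate(weak_candidate, production_metrics, minimums))
```

Keeping the gate as a pure function makes it easy to unit-test and to reuse in both a retraining pipeline and a CI check.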

In [None]:
# KS-based trigger + batch retraining (skeleton)

from typing import Tuple
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Optional integration: MLflow is only needed if you log retraining runs
try:
    import mlflow
except ImportError:
    mlflow = None

ALPHA = 0.05  # significance level for drift detection


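def ks_drift_test(a: pd.Series, b: pd.Series, alpha: float = ALPHA) -> Tuple[float, float, bool]:
    """Return (statistic, p_value, drift_detected)."""
    # Drop missing values and coerce to float arrays before testing
    a = pd.Series(a).dropna().astype(float).to_numpy()
    b = pd.Series(b).dropna().astype(float).to_numpy()
    # The 'mode' keyword was renamed 'method' in newer SciPy releases;
    # omit it and rely on the default ("auto"), which works across versions
    stat, p = ks_2samp(a, b, alternative="two-sided")
    # Cast NumPy scalars to plain Python types to match the annotated return type
    return float(stat), float(p), bool(p < alpha)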


# Example usage (replace with your actual feature slices)
# january_data = df_jan["feature"]
# february_data = df_feb["feature"]
# stat, p, drift = ks_drift_test(january_data, february_data)
# if drift:
#     print(f"Drift detected: D={stat:.3f}, p={p:.3g} < {ALPHA}")
#     # 1) fetch new labeled data
#     # 2) retrain model (log to MLflow if available)
#     # 3) run your validation gate and only deploy if passing thresholds
# else:
#     print(f"No drift: D={stat:.3f}, p={p:.3g} ≥ {ALPHA}")

In [None]:
# Online learning example with partial_fit

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Suppose you stream batches of (X_batch, y_batch)
classes = np.array([0, 1])

# A scikit-learn Pipeline does not expose partial_fit, so keep the scaler and
# the classifier as separate objects and update each one per batch.
scaler = StandardScaler(with_mean=False)
clf = SGDClassifier(loss="log_loss", random_state=42)

# First call to the classifier's partial_fit must include 'classes'
# scaler.partial_fit(X_first_batch)
# clf.partial_fit(scaler.transform(X_first_batch), y_first_batch, classes=classes)

# Then iteratively update as new labeled batches arrive
# for X_batch, y_batch in stream():
#     scaler.partial_fit(X_batch)
#     clf.partial_fit(scaler.transform(X_batch), y_batch)
#     # Optionally evaluate and log to MLflow
#     # Optionally evaluate and log to MLflow


## Dangers of feedback loops

Feedback loops can become harmful when model outputs influence future inputs (e.g., recommender systems reinforcing narrow content). This can cause echo chambers, bias amplification, or unsafe behaviors. Prefer human-in-the-loop controls and guardrails, especially for automated updates.

For our heart disease case, feedback is more reactive: clinicians review, data is labeled carefully, and retraining is scheduled. Still, be cautious: keep a human review step before any retrained model is deployed.

## Let's practice

- Pick 1–3 critical features and wire the KS trigger to raise an alert when drift is detected.
- Define a labeling pathway for a small recent slice (e.g., last 2 weeks) and retrain.
- Add an acceptance test to your CI (see `scripts/validate_model.py`) that blocks deploys unless metrics exceed thresholds.
- Optional: experiment with `partial_fit` on synthetic streams to see how quickly the model adapts.