# Chapter 45: Model Drift Detection

## Learning Objectives

By the end of this chapter, you will be able to:

- Distinguish between data drift, concept drift, and performance drift in the context of a time‑series prediction system
- Choose appropriate statistical tests and machine learning methods to detect drift in your NEPSE prediction pipeline
- Implement drift detection using Python libraries such as `evidently`, `scipy`, and `alibi-detect`
- Monitor feature distributions and model residuals over time
- Set up automated alerts when drift exceeds predefined thresholds
- Quantify the magnitude and impact of drift
- Design retraining triggers and mitigation strategies to keep your model accurate in changing market conditions

---

## Introduction

A machine learning model is a snapshot of the patterns present in the training data. When those patterns change after deployment—because of evolving market dynamics, new regulations, or shifts in investor behaviour—the model’s predictions can become unreliable. This phenomenon is known as **model drift**.

For the NEPSE stock prediction system, drift can manifest in many ways: the average trading volume might increase as more retail investors enter the market, the relationship between technical indicators and future returns might weaken, or a sudden political event could render historical patterns irrelevant. Detecting drift early allows you to retrain or adjust your model before it starts losing money for your users.

In this chapter, we will explore the different types of drift, methods to detect them, and how to integrate drift detection into your production monitoring stack. We will use the NEPSE system as a running example, demonstrating how to monitor the feature distributions and prediction errors of your stock price model over time.

---

## 45.1 Types of Model Drift

Model drift is usually categorised into three interrelated types:

1. **Data Drift (Covariate Shift)**  
   The statistical properties of the input features change over time. For example, the distribution of the `Volume` feature in NEPSE data might shift if the exchange introduces a new trading platform that increases liquidity.

2. **Concept Drift**  
   The relationship between the input features and the target variable changes. For instance, the same level of RSI (Relative Strength Index) might have indicated a buying opportunity during a bull market but becomes less reliable in a sideways market.

3. **Performance Drift**  
   The model’s predictive performance (e.g., accuracy, RMSE) degrades over time. Performance drift is often a consequence of data or concept drift, but it can also be caused by changes in the data quality or the target definition (e.g., a change in the way “close price” is calculated).

In practice, these drifts are interrelated. A shift in feature distribution (data drift) can degrade performance even when the underlying relationship is unchanged, because the model is now asked to predict in a region of the feature space it rarely saw during training. Data drift and concept drift also often occur together, since the events that change market inputs tend to change market behaviour as well. Performance drift is the ultimate signal that something is wrong, but by the time you observe it, your users may already be affected. Therefore, we aim to detect data and concept drift early as leading indicators.
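To make the distinction concrete, here is a minimal sketch on synthetic data (not NEPSE data). A straight line is fitted on a reference sample; in scenario A only the input distribution moves, and in scenario B only the input-output relationship changes. The helper names `make_data` and `rmse`, the slopes, and the noise levels are illustrative assumptions. Because the fitted model is correctly specified in this toy setup, pure data drift leaves the error almost unchanged; with a misspecified model that would generally not hold.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n, x_mean, slope):
    """Synthetic feature x and target y = slope * x + small noise."""
    x = rng.normal(x_mean, 1.0, n)
    y = slope * x + rng.normal(0.0, 0.1, n)
    return x, y

# Reference period: inputs centred at 0, true slope 2
x_ref, y_ref = make_data(5000, x_mean=0.0, slope=2.0)
slope_hat, intercept_hat = np.polyfit(x_ref, y_ref, 1)

def rmse(x, y):
    """Error of the line fitted on the reference period."""
    return float(np.sqrt(np.mean((y - (slope_hat * x + intercept_hat)) ** 2)))

# Scenario A: data drift only (inputs move, relationship unchanged)
x_a, y_a = make_data(5000, x_mean=3.0, slope=2.0)
# Scenario B: concept drift only (inputs unchanged, relationship changes)
x_b, y_b = make_data(5000, x_mean=0.0, slope=1.0)

rmse_ref, rmse_a, rmse_b = rmse(x_ref, y_ref), rmse(x_a, y_a), rmse(x_b, y_b)
print(f"reference RMSE:     {rmse_ref:.3f}")
print(f"data drift RMSE:    {rmse_a:.3f}")
print(f"concept drift RMSE: {rmse_b:.3f}")
```

A feature-level drift test would fire in scenario A but not in scenario B, while only scenario B hurts this model. This is why feature monitoring and error monitoring complement each other.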

---

## 45.2 Detecting Data Drift

Data drift detection compares the distribution of each feature (or a multivariate combination) between a **reference period** (usually the training data) and a **current window** (e.g., the last week of production data). If the difference is statistically significant, we flag a drift.

### 45.2.1 Statistical Tests for Univariate Drift

For numerical features like `Close`, `Volume`, or `RSI`, common statistical tests are:

- **Kolmogorov‑Smirnov (KS) test**: Compares the empirical cumulative distribution functions of two samples. It is sensitive to differences in both location and shape.
- **Population Stability Index (PSI)**: Measures how much a variable has shifted by binning the reference distribution and comparing the proportions in each bin.
- **Wasserstein distance (Earth Mover’s Distance)**: Measures the minimum amount of “work” needed to transform one distribution into another.

For categorical features (e.g., `Sector`), you can use the **chi‑square test**.
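For a categorical feature, a minimal sketch of the chi-square approach could look like the following. It builds a 2 x K contingency table of category counts and passes it to `scipy.stats.chi2_contingency`. The function name `categorical_drift`, the sector labels, and the counts are made up for illustration.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def categorical_drift(reference, current, alpha=0.05):
    """Chi-square test of homogeneity between two samples of category labels."""
    categories = sorted(set(reference) | set(current))
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    # One row per sample, one column per category
    table = [
        [ref_counts.get(c, 0) for c in categories],
        [cur_counts.get(c, 0) for c in categories],
    ]
    chi2, p_value, dof, _ = chi2_contingency(table)
    return {"chi2": chi2, "p_value": p_value, "dof": dof, "drift": bool(p_value < alpha)}

# Hypothetical sector labels (illustrative counts, not real NEPSE data)
reference_sector = ["Banking"] * 500 + ["Hydropower"] * 300 + ["Insurance"] * 200
current_sector = ["Banking"] * 30 + ["Hydropower"] * 50 + ["Insurance"] * 20

result = categorical_drift(reference_sector, current_sector)
print(f"chi2={result['chi2']:.2f}, p-value={result['p_value']:.5f}, drift={result['drift']}")
```

The chi-square approximation becomes unreliable when expected counts per cell are very small, so rare categories are often merged into an "other" bucket before testing.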

**Example using SciPy to perform KS test on the `Volume` feature:**

```python
import numpy as np
import pandas as pd
from scipy import stats

# Assume we have a reference dataset (training data)
reference_volume = pd.read_csv('nepse_training.csv')['Volume']

# Current production data (e.g., last 30 days)
current_volume = pd.read_csv('nepse_production_last_30d.csv')['Volume']

# Perform Kolmogorov-Smirnov test
ks_stat, p_value = stats.ks_2samp(reference_volume, current_volume)

print(f"KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")

if p_value < 0.05:
    print("⚠️  Significant drift detected in Volume feature!")
else:
    print("✅ Volume distribution stable.")
```

**Explanation:**  
The KS test returns a statistic and a p‑value. The null hypothesis is that the two samples come from the same distribution. If the p‑value is below a threshold (commonly 0.05), we reject the null and conclude that drift has occurred. This test is sensitive to any difference, but it may be too sensitive for large sample sizes, where even trivial differences become significant. In practice, you may want to look at the magnitude of the KS statistic itself, not just the p‑value.
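The sample-size caveat can be illustrated with a small sketch on synthetic data. Two large samples differ by a shift of only 0.05 standard deviations; the p-value is tiny, yet the KS statistic shows the difference is small. The function name `ks_drift` and the `min_effect=0.1` cut-off are assumptions chosen for illustration, not established defaults.

```python
import numpy as np
from scipy import stats

def ks_drift(reference, current, alpha=0.05, min_effect=0.1):
    """Flag drift only if the KS test is significant AND the statistic is large enough."""
    ks_stat, p_value = stats.ks_2samp(reference, current)
    significant = bool(p_value < alpha)
    return {
        "ks_stat": float(ks_stat),
        "p_value": float(p_value),
        "significant": significant,
        "drift": significant and ks_stat >= min_effect,
    }

rng = np.random.default_rng(42)
reference = rng.normal(0.00, 1.0, 200_000)
current = rng.normal(0.05, 1.0, 200_000)  # practically negligible shift

result = ks_drift(reference, current)
print(f"KS statistic: {result['ks_stat']:.4f}")
print(f"significant at 5%: {result['significant']}, flagged as drift: {result['drift']}")
```

Gating on the statistic as well as the p-value keeps alerts tied to shifts that are large enough to matter, at the cost of one more threshold to tune.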

### 45.2.2 Population Stability Index (PSI)

PSI is widely used in credit scoring and finance. It splits the reference distribution into bins (typically 10) and compares the proportion of observations falling in each bin between the reference and current data.

```python
def calculate_psi(reference, current, bins=10):
    """
    Calculate Population Stability Index.
    """
    # Create bins based on reference percentiles
    percentiles = np.percentile(reference, np.linspace(0, 100, bins+1))
    # Clip current to avoid out-of-range values
    current_clipped = np.clip(current, percentiles[0], percentiles[-1])
    
    # Count frequencies in each bin
    ref_counts, _ = np.histogram(reference, bins=percentiles)
    curr_counts, _ = np.histogram(current_clipped, bins=percentiles)
    
    # Convert to percentages
    ref_pct = ref_counts / len(reference)
    curr_pct = curr_counts / len(current)
    
    # Avoid division by zero
    ref_pct = np.where(ref_pct == 0, 0.0001, ref_pct)
    curr_pct = np.where(curr_pct == 0, 0.0001, curr_pct)
    
    # Calculate PSI
    psi = np.sum((curr_pct - ref_pct) * np.log(curr_pct / ref_pct))
    return psi

psi_value = calculate_psi(reference_volume, current_volume)
print(f"PSI: {psi_value:.4f}")

if psi_value > 0.25:
    print("⚠️  High drift (PSI > 0.25)")
elif psi_value > 0.1:
    print("⚠️  Moderate drift (PSI between 0.1 and 0.25)")
else:
    print("✅ Low drift (PSI < 0.1)")
```

**Explanation:**  
PSI values below 0.1 are conventionally read as no significant change, 0.1 to 0.25 as a moderate shift, and above 0.25 as a major shift. PSI is popular because it is simple and interpretable, but it has no upper bound, and its value depends on the binning strategy and on how empty bins are handled.
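Written out, the quantity computed by `calculate_psi` is the following, where $r_i$ and $c_i$ denote the proportions of reference and current observations in bin $i$ and $B$ is the number of bins (the symbols are introduced here for illustration):

$$
\mathrm{PSI} = \sum_{i=1}^{B} (c_i - r_i)\,\ln\frac{c_i}{r_i}
$$

As a small worked case with $B = 2$, $r = (0.5, 0.5)$ and $c = (0.6, 0.4)$:

$$
\mathrm{PSI} = 0.1 \ln(1.2) + (-0.1)\ln(0.8) \approx 0.0182 + 0.0223 = 0.0405
$$

Every term is non-negative, because $c_i - r_i$ and $\ln(c_i / r_i)$ always share the same sign, so PSI is zero only when the two binned distributions coincide.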

### 45.2.3 Multivariate Drift Detection

Univariate tests may miss interactions: perhaps each feature individually looks stable, but their joint distribution has changed. For multivariate drift, you can use:

- **Maximum Mean Discrepancy (MMD)**: A kernel‑based test that compares the distance between distributions in a reproducing kernel Hilbert space.
- **PCA reconstruction error**: Fit Principal Component Analysis (PCA) on the reference data, then compare the distribution of the reconstruction error between reference and current data.
- **Domain classifiers**: Train a classifier to distinguish between reference and current data; if it performs well, drift is present.

**Example using `alibi-detect` for MMD:**

```python
from alibi_detect.cd import MMDDrift

# Prepare reference data (training features)
X_ref = np.load('nepse_training_features.npy')

# Initialize detector
cd = MMDDrift(X_ref, backend='pytorch', p_val=0.05)

# On a batch of new data
X_new = np.load('nepse_new_features.npy')
preds = cd.predict(X_new)

print(f"Drift detected: {preds['data']['is_drift']}")
print(f"p-value: {preds['data']['p_val']:.4f}")
```

**Explanation:**  
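`alibi-detect` provides easy-to-use drift detectors. `MMDDrift` uses a kernel two-sample test, with the p-value estimated by permutation, to check whether the new batch comes from the same distribution as the reference. `predict` returns a dictionary whose `data` entry holds the `is_drift` flag (1 or 0) and the p-value.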
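The PCA reconstruction-error idea listed at the start of this section can be sketched with NumPy and SciPy alone. The example below uses synthetic data in which each feature keeps the same variance but the correlation between the two features disappears, which is the kind of change univariate tests tend to miss. The helper names `fit_pca` and `reconstruction_error` and all numbers are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def fit_pca(X, n_components):
    """Return the feature means and the top principal directions of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Distance between each row and its projection onto the PCA subspace."""
    centered = X - mean
    reconstructed = centered @ components.T @ components
    return np.linalg.norm(centered - reconstructed, axis=1)

rng = np.random.default_rng(0)

# Synthetic reference data: two features driven by one common factor
factor = rng.normal(size=(2000, 1))
X_ref = np.hstack([factor, factor]) + 0.1 * rng.normal(size=(2000, 2))

# Synthetic current data: same per-feature variance, but the features are independent
X_cur = np.sqrt(1.01) * rng.normal(size=(500, 2))

# Fit on one half of the reference; take baseline errors from the held-out half
mean, components = fit_pca(X_ref[:1000], n_components=1)
ref_errors = reconstruction_error(X_ref[1000:], mean, components)
cur_errors = reconstruction_error(X_cur, mean, components)

ks_stat, p_value = stats.ks_2samp(ref_errors, cur_errors)
print(f"median error: reference={np.median(ref_errors):.3f}, current={np.median(cur_errors):.3f}")
print(f"KS statistic={ks_stat:.3f}, p-value={p_value:.3g}")
```

Computing the baseline errors on held-out reference rows avoids comparing in-sample errors with out-of-sample ones. In practice the features should also be standardised first so that no single feature dominates the components.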

---

## 45.3 Concept Drift Detection

Concept drift detection requires monitoring the relationship between features and the target. This is more challenging because it often requires ground truth labels, which may be delayed (e.g., you only know the actual price change after the next day). Methods include:

- **Monitoring prediction error** over time (if labels become available).
- **Using a sliding window of recent data to retrain a simple model** and comparing its performance to the original.
- **Statistical tests on model residuals**.

### 45.3.1 Monitoring Residuals

If you have a validation set from the training period, you can establish a baseline distribution of residuals (errors). In production, as labels arrive, you can compute the current residuals and test whether their distribution has shifted.

```python
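import numpy as np
from scipy import stats

# Baseline residuals from training validation set
baseline_residuals = np.load('validation_residuals.npy')

# Recent production residuals (e.g., last 30 predictions with known actuals).
# get_production_residuals is a placeholder for your own data-access helper.
recent_residuals = get_production_residuals(30)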

# KS test on residuals
ks_stat, p_value = stats.ks_2samp(baseline_residuals, recent_residuals)
if p_value < 0.05:
    print("⚠️  Concept drift detected in residuals!")
```

### 45.3.2 ADWIN (Adaptive Windowing)

ADWIN is an algorithm that maintains a sliding window of data and grows or shrinks it based on detecting changes in the mean. It can be applied to any stream of values, such as prediction errors. The `river` library provides an implementation.

```python
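from river import drift

adwin = drift.ADWIN()

# production_error_stream() is a placeholder for your own stream of errors
for t, error in enumerate(production_error_stream()):
    adwin.update(error)
    # Recent river releases expose `drift_detected`;
    # older releases used `change_detected` instead.
    if adwin.drift_detected:
        print(f"⚠️  Drift detected at observation {t}")
        break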
```

**Explanation:**  
ADWIN keeps a window and splits it into two sub‑windows whenever a change is suspected. If the means of the two sub‑windows differ significantly, it declares drift and shrinks the window to the recent sub‑window. This is useful for online detection.

### 45.3.3 Using a Shadow Model

A common approach in production is to deploy a “shadow” model that is periodically retrained on recent data. By comparing the shadow model’s predictions with the production model’s predictions (or with actuals when they arrive), you can detect when the production model is becoming stale.

**Example workflow:**
- Every week, retrain a model on the last 3 months of data.
- Compare its performance on the last week’s data with the production model.
- If the shadow model outperforms production by a significant margin, trigger an alert.
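The workflow above can be sketched as follows, using ordinary least squares as a stand-in for whatever model you run in production. The function names, the synthetic data, the five-row evaluation window (standing in for one trading week), and the 10% improvement margin are all assumptions for illustration. The sketch holds the evaluation rows out of the shadow model's training window so that the comparison is not biased in the shadow model's favour.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def shadow_check(prod_coef, X_recent, y_recent, eval_size=5, min_improvement=0.10):
    """Retrain a shadow model on recent data, holding out the last `eval_size` rows."""
    X_train, y_train = X_recent[:-eval_size], y_recent[:-eval_size]
    X_eval, y_eval = X_recent[-eval_size:], y_recent[-eval_size:]
    shadow_coef = fit_linear(X_train, y_train)
    prod_rmse = rmse(y_eval, predict_linear(prod_coef, X_eval))
    shadow_rmse = rmse(y_eval, predict_linear(shadow_coef, X_eval))
    improvement = (prod_rmse - shadow_rmse) / prod_rmse
    return {"prod_rmse": prod_rmse, "shadow_rmse": shadow_rmse,
            "alert": improvement > min_improvement}

rng = np.random.default_rng(1)

# Synthetic "old regime" used to fit the production model
X_old = rng.normal(size=(500, 2))
y_old = X_old @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.1, 500)
prod_coef = fit_linear(X_old, y_old)

# Synthetic "new regime": the feature-target relationship has changed
X_recent = rng.normal(size=(60, 2))
y_recent = X_recent @ np.array([0.5, 1.0]) + rng.normal(0.0, 0.1, 60)

result = shadow_check(prod_coef, X_recent, y_recent)
print(f"production RMSE={result['prod_rmse']:.3f}, shadow RMSE={result['shadow_rmse']:.3f}")
print(f"alert: {result['alert']}")
```

With only a handful of evaluation rows the comparison is noisy, so a real deployment would likely require the gap to persist over several checks before acting on it.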

---

## 45.4 Monitoring Drift in Production

To operationalise drift detection, you need to:

1. **Store feature distributions and predictions** over time (e.g., in a time‑series database like InfluxDB, or as logs).
2. **Run periodic drift checks** (e.g., every hour, daily) on a sliding window.
3. **Emit metrics** from these checks (e.g., drift p‑value per feature) to your monitoring system.

### 45.4.1 Using Evidently for Automated Reports

Evidently is an open‑source library specifically designed for monitoring ML models in production. It can generate HTML reports or produce JSON that can be ingested by monitoring tools.

```python
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, DatasetDriftMetric

# Reference data (training)
ref = pd.read_csv('nepse_training.csv')
# Current data (recent production)
cur = pd.read_csv('nepse_recent_prod.csv')

column_mapping = ColumnMapping()
column_mapping.target = 'Close'          # target column (optional)
column_mapping.prediction = 'prediction' # prediction column (optional)
column_mapping.numerical_features = ['Open', 'High', 'Low', 'Volume', 'RSI']

# Create a drift report
report = Report(metrics=[
    ColumnDriftMetric(column_name='Volume'),
    ColumnDriftMetric(column_name='RSI'),
    DatasetDriftMetric()
])
report.run(reference_data=ref, current_data=cur, column_mapping=column_mapping)

# Save as HTML, and keep the JSON string for dashboards or alerting
report.save_html('drift_report.html')
report_json = report.json()
```

**Explanation:**  
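Evidently computes drift for each specified column using statistical tests (configurable) and also provides an overall dataset drift score. The HTML report visualises the distributions side by side, and the JSON output can feed a dashboard or trigger alerts. The imports shown follow the `Report` API of the Evidently 0.4.x series; later releases reorganised the package, so pin the version or adapt the imports to the one you have installed.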

### 45.4.2 Integrating with Prometheus

You can run Evidently as a scheduled job and export the drift metrics to Prometheus using a custom exporter, or directly emit the drift p‑values as Prometheus gauges.

```python
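from prometheus_client import Gauge, start_http_server
import time

drift_gauge = Gauge('feature_drift_p_value', 'Drift p-value per feature', ['feature'])

# Expose the metrics over HTTP so Prometheus can scrape them
# (port 8000 is an arbitrary choice)
start_http_server(8000)

while True:
    # Compute drift for each feature.
    # numerical_features, ref, cur and compute_drift_p_value are placeholders
    # for your own feature list, data frames and test (e.g., a KS test).
    for feature in numerical_features:
        p_value = compute_drift_p_value(ref[feature], cur[feature])
        drift_gauge.labels(feature=feature).set(p_value)
    time.sleep(3600)  # update every hour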
```

Then you can create alerts when the p‑value drops below 0.05 for a certain duration.

---

## 45.5 Drift Quantification

Detecting drift is only half the battle; you also need to understand its **impact**. Not all drift is harmful—a feature may shift but the model’s predictions remain accurate. You should quantify:

- **Magnitude of shift**: PSI, KS statistic, or distance metric.
- **Impact on predictions**: How much does the drift affect the model output? You can simulate by passing both reference and current data through the model and comparing the output distributions.

**Example: Measuring shift in prediction distribution:**

```python
# Get predictions on reference data
ref_preds = model.predict(ref_features)
# Get predictions on current data
cur_preds = model.predict(cur_features)

# Compare prediction distributions
psi_preds = calculate_psi(ref_preds, cur_preds)
print(f"PSI of predictions: {psi_preds:.4f}")

# If prediction distribution shifts, users will see different outputs.
```

---

## 45.6 Adaptive Thresholds

Static thresholds for drift detection (e.g., p-value < 0.05) can produce many false alarms: when dozens of features are tested on every run, some p-values fall below 0.05 by chance alone, and with large samples even trivial shifts become significant. Consider using **adaptive thresholds** based on historical variability. For instance, you could compute the rolling mean and standard deviation of the drift metric and flag a value when it exceeds `mean + 3*std`.
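A minimal sketch of the `mean + 3*std` rule, using only the standard library, might look like this. The class name `AdaptiveThreshold`, the window sizes, and the PSI values are hypothetical. One design choice worth noting is that flagged values are not added to the history, so an ongoing drift does not inflate its own baseline; that is an assumption of this sketch, not a requirement of the method.

```python
from collections import deque
from statistics import mean, stdev

class AdaptiveThreshold:
    """Flags a drift metric that exceeds mean + k * std of its recent history."""

    def __init__(self, window=30, k=3.0, min_history=10):
        self.history = deque(maxlen=window)
        self.k = k
        self.min_history = min_history

    def update(self, value):
        """Return True if `value` is anomalous relative to the values seen before it."""
        is_alert = False
        if len(self.history) >= self.min_history:
            threshold = mean(self.history) + self.k * stdev(self.history)
            is_alert = value > threshold
        if not is_alert:
            # Only normal values extend the baseline
            self.history.append(value)
        return is_alert

# Hypothetical daily PSI values for one feature
psi_history = [0.020, 0.030, 0.025, 0.030, 0.020, 0.035,
               0.030, 0.025, 0.020, 0.030, 0.028, 0.150]

detector = AdaptiveThreshold(window=30, k=3.0, min_history=10)
alerts = [detector.update(v) for v in psi_history]
print(alerts)
```

In this made-up series the last value (0.15) is flagged even though it sits below the static 0.25 cut-off, because it is far outside the feature's usual day-to-day range.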

Another approach is to use **control charts** (e.g., CUSUM) commonly used in statistical process control.
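As an illustration, a one-sided (upper) CUSUM accumulates how far each observation exceeds a target level, minus a slack allowance, and raises an alarm once the accumulated sum passes a threshold. The sketch below is a minimal version; the function name `cusum_upper`, the parameter values, and the error series are assumptions chosen for the example.

```python
def cusum_upper(values, target_mean, slack, threshold):
    """Return the index of the first upper-CUSUM alarm, or None if there is none."""
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate only the excess above target_mean + slack; never go below zero
        s = max(0.0, s + (x - target_mean - slack))
        if s > threshold:
            return i
    return None

# Hypothetical daily error rates: stable around 0.10, then a sustained rise to 0.16
daily_error = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.09, 0.11, 0.10, 0.10,
               0.16, 0.16, 0.16, 0.16, 0.16]

alarm_index = cusum_upper(daily_error, target_mean=0.10, slack=0.02, threshold=0.10)
print(f"First alarm at index: {alarm_index}")
```

The alarm fires on the third day of the sustained rise rather than on any single observation, which is what makes CUSUM suited to small but persistent shifts.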

---

## 45.7 Automated Alerts and Actions

When drift is detected, you may want to trigger automated responses:

- **Log an incident** in your monitoring system.
- **Send a notification** (Slack, email, PagerDuty) to the data science team.
- **Automatically retrain the model** if drift exceeds a critical threshold.
- **Roll back to a previous model** if the drift is severe and a retrained model is not yet available.

**Example of a Slack alert using a webhook:**

```python
import requests

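def send_slack_alert(message):
    webhook_url = "https://hooks.slack.com/services/..."
    payload = {"text": message}
    # A timeout keeps the monitoring job from hanging if Slack is unreachable
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()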

if psi_value > 0.25:
    send_slack_alert(f"⚠️ High drift detected in Volume feature: PSI={psi_value:.2f}")
```

---

## 45.8 Drift Mitigation and Retraining Triggers

Drift detection is only useful if it leads to action. The most common mitigation is **retraining** the model on recent data. However, retraining too often can be costly and may introduce instability. Define clear criteria:

- **Time‑based retraining**: e.g., retrain every week regardless of drift (good for predictable seasonal patterns).
- **Performance‑based trigger**: Retrain when prediction error exceeds a threshold.
- **Drift‑based trigger**: Retrain when significant drift is detected in key features or residuals.

For NEPSE, you might retrain your model every month, but also trigger an out-of-cycle retraining if the PSI of a key feature such as `Volume` stays above 0.2 for three consecutive days.
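A persistence rule of this kind is easy to express in code. The helper below is a minimal sketch that could back the `should_retrain()` placeholder used in the Airflow example; the function name `consecutive_breach` and the PSI values are hypothetical.

```python
def consecutive_breach(daily_values, threshold=0.2, days=3):
    """True if the most recent `days` values all exceed `threshold`."""
    if len(daily_values) < days:
        return False
    return all(v > threshold for v in daily_values[-days:])

# Hypothetical daily PSI values for the Volume feature, oldest first
psi_by_day = [0.08, 0.12, 0.22, 0.24, 0.31]

print(consecutive_breach(psi_by_day))
```

Requiring several consecutive breaches trades a slower reaction for fewer retraining runs triggered by a single noisy day.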

**Example of a drift‑triggered retraining pipeline using Airflow:**

```python
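from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def check_drift_and_retrain():
    # 1. Compute drift for all features
    # 2. If any exceeds threshold, trigger retraining
    # should_retrain() and trigger_retraining() are placeholders for your own logic
    if should_retrain():
        # Trigger retraining DAG (e.g., via API or by setting a variable)
        trigger_retraining()

# start_date is required for scheduling (the date here is arbitrary);
# catchup=False avoids backfilling runs for past dates.
with DAG(
    'drift_monitor',
    schedule_interval='@daily',  # newer Airflow versions name this argument `schedule`
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    drift_task = PythonOperator(
        task_id='check_drift',
        python_callable=check_drift_and_retrain
    )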
```

---

## 45.9 Case Study: Monitoring NEPSE Drift

Let’s walk through a concrete example for the NEPSE prediction system.

**Features:**  
- `Close_Lag_1`, `Volume_Lag_1`, `SMA_20`, `RSI`, `Volume_Anomaly`

**Target:**  
- Binary: 1 if next‑day closing price is higher than today, else 0

**Reference:** Training data from 2022–2023.  
**Current:** Last 30 days of production data.

We set up a daily job that:

1. Loads the last 30 days of features from the feature store.
2. For each feature, computes PSI against the reference distribution.
3. For the target, if actuals are available, computes the prediction error rate and compares it to the validation error rate using a binomial test.
4. If any feature PSI > 0.25, or the error rate rises by more than 5 percentage points with p < 0.05, sends a Slack alert and logs a metric in Prometheus.
5. If the error rate increase persists for 3 days, automatically triggers a retraining pipeline.

**Python snippet (simplified):**

```python
def monitor_drift():
    # Load reference stats (precomputed from training)
    ref_stats = load_reference_stats()
    # Load current data
    current = load_production_data(days=30)
    
    alerts = []
    for feature in numerical_features:
        psi = calculate_psi(ref_stats[feature]['values'], current[feature])
        prom_metric.labels(feature=feature).set(psi)
        if psi > 0.25:
            alerts.append(f"{feature} PSI={psi:.2f}")
    
    # Check error rate if labels available
    if 'actual' in current.columns:
        error_rate = (current['prediction'] != current['actual']).mean()
        # Compare to validation error rate (say 0.12)
        if error_rate > 0.12 + 0.05:  # threshold
            alerts.append(f"Error rate increased to {error_rate:.2%}")
    
    if alerts:
        send_slack_alert("Drift detected:\n" + "\n".join(alerts))
```
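The simplified snippet compares the error rate against a fixed cut-off and leaves out the binomial test from step 3. A minimal sketch of that test, using only the standard library, is shown below. It computes the exact one-sided p-value, that is, the probability of seeing at least the observed number of errors if the true error rate were still the validation rate. The function names and the example counts are assumptions; `scipy.stats.binomtest` with `alternative='greater'` should give the same p-value.

```python
from math import comb

def binomial_tail_p_value(errors, n, baseline_rate):
    """Exact one-sided p-value: P(X >= errors) for X ~ Binomial(n, baseline_rate)."""
    return sum(
        comb(n, k) * baseline_rate**k * (1 - baseline_rate) ** (n - k)
        for k in range(errors, n + 1)
    )

def error_rate_alert(errors, n, baseline_rate=0.12, min_increase=0.05, alpha=0.05):
    """Alert only if the increase is both large enough and statistically significant."""
    observed_rate = errors / n
    p_value = binomial_tail_p_value(errors, n, baseline_rate)
    alert = (observed_rate - baseline_rate) > min_increase and p_value < alpha
    return observed_rate, p_value, alert

# Hypothetical example: 9 wrong predictions out of the last 30
rate, p_value, alert = error_rate_alert(errors=9, n=30)
print(f"error rate={rate:.2%}, p-value={p_value:.4f}, alert={alert}")
```

With only 30 predictions the test has limited power, so modest increases in the error rate will often go undetected until more labels accumulate.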

---

## Chapter Summary

In this chapter, we explored the critical topic of model drift detection for time‑series prediction systems like the NEPSE stock predictor. We covered:

- The three types of drift: data drift, concept drift, and performance drift.
- Statistical methods for detecting univariate drift (KS test, PSI) and multivariate drift (MMD, domain classifiers).
- Concept drift detection using residual monitoring, ADWIN, and shadow models.
- Integrating drift detection into production with tools like Evidently and custom Prometheus exporters.
- Quantifying drift magnitude and setting adaptive thresholds.
- Automating alerts and retraining triggers to keep models fresh and accurate.

By implementing drift detection, you ensure that your NEPSE prediction system remains reliable even as market conditions evolve. In the next chapter, we will discuss **Continuous Retraining Strategies**, diving deeper into how to automatically update your models in response to drift or on a schedule.

