
# **CHAPTER 23: MONITORING & MAINTENANCE**

*Ensuring Model Reliability in Production*

## **Chapter Overview**

Deployed models degrade over time due to changing data distributions, shifting user behaviors, and upstream system changes. This chapter establishes the operational practices for detecting degradation, automating retraining, and maintaining model performance through continuous monitoring, alerting, and feedback loops.

**Estimated Time:** 30-40 hours (2-3 weeks)  
**Prerequisites:** Chapter 22 (Deployment), Chapter 20 (Data Engineering), familiarity with Prometheus/Grafana

---

## **23.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement comprehensive ML monitoring covering data drift, concept drift, and model performance degradation
2. Design alerting strategies distinguishing between actionable alerts and noise
3. Automate retraining pipelines triggered by performance thresholds or schedules
4. Conduct A/B tests and statistical validation for model updates
5. Implement graceful degradation strategies and incident response procedures
6. Build feedback loops to capture ground truth and calculate business metrics

---

## **23.1 ML Monitoring Fundamentals**

#### **23.1.1 The Three Pillars of ML Monitoring**

**1. Data Monitoring:**
- **Schema Changes:** Missing columns, type changes, range violations
- **Distribution Drift:** KS-test, PSI (Population Stability Index) for feature drift
- **Volume Anomalies:** Sudden drops/spikes in data (pipeline failures or business events)

**2. Model Performance:**
- **Accuracy Metrics:** Accuracy, F1, RMSE (require ground truth, which often arrives with a delay)
- **Prediction Distribution:** Output class distributions, confidence scores
- **Calibration:** Reliability diagrams (predicted vs. actual probability)

**3. System Performance:**
- **Latency:** P50, P95, P99 response times
- **Throughput:** Requests per second
- **Resource Utilization:** CPU, GPU memory, I/O
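
The distribution-drift check listed under data monitoring can be sketched with the standard library alone. The block below is a minimal illustration of PSI, not a library implementation: the function name `population_stability_index`, the ten quantile bins, and the smoothing constant `eps` are all assumptions made for this example.

```python
import bisect
import math
import random
import statistics


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, using quantile bins from the reference."""
    # Interior cut points: n_bins - 1 quantiles of the reference sample.
    edges = statistics.quantiles(reference, n=n_bins)

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect.bisect_right(edges, v)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_share = bin_shares(reference)
    cur_share = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_share, cur_share))


# Synthetic data: one sample from the same distribution, one with a shifted mean.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI (same distribution):    {psi_same:.4f}")
print(f"PSI (mean shifted by 1 sd): {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions and should be tuned per feature.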

#### **23.1.2 Evidently AI for Drift Detection**

```python
# monitoring/drift_detection.py
# Uses Evidently's legacy Report API; newer releases (around 0.7) reorganized
# these imports, so check them against your installed version.
from datetime import datetime

import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.metrics import ColumnDriftMetric
from evidently.report import Report


class DriftMonitor:
    def __init__(self, reference_data: pd.DataFrame):
        self.reference = reference_data
        self.report = Report(metrics=[
            DataDriftPreset(),
            ColumnDriftMetric(column_name="prediction_confidence"),
        ])

    def check_current_data(self, current_data: pd.DataFrame) -> dict:
        self.report.run(
            reference_data=self.reference,
            current_data=current_data,
        )
        results = self.report.as_dict()

        # DataDriftPreset expands into several metrics; look up the dataset-level
        # one by name instead of relying on its position in the list.
        dataset_drift = next(
            m["result"] for m in results["metrics"]
            if m["metric"] == "DatasetDriftMetric"
        )

        return {
            "timestamp": datetime.now().isoformat(),
            "dataset_drift": dataset_drift["dataset_drift"],
            "drifted_features_count": dataset_drift["number_of_drifted_columns"],
            # Observed share of drifted columns. Evidently's own "drift_share"
            # field is the configured threshold, not the measured value.
            "drift_share": dataset_drift["share_of_drifted_columns"],
        }


# Scheduled monitoring (Airflow task).
# load_training_data, load_production_data, trigger_alert,
# trigger_retraining_pipeline, and log_to_monitoring_db are project helpers
# assumed to be defined elsewhere.
def daily_drift_check():
    reference = load_training_data()
    yesterday = load_production_data(window="1d")

    monitor = DriftMonitor(reference)
    result = monitor.check_current_data(yesterday)

    # Alert and retrain when more than 30% of features have drifted
    if result["drift_share"] > 0.3:
        trigger_alert("DATA_DRIFT", result)
        trigger_retraining_pipeline()

    log_to_monitoring_db(result)

#### **23.1.3 WhyLabs for Statistical Monitoring**

WhyLabs, through its open-source `whylogs` library, logs statistical profiles of the data rather than the raw rows, which keeps monitoring privacy-preserving.

```python
# Assumes the whylogs v1 API; the older v0 Session-based API differs.
# production_df and alert_on_constraint_violation are assumed to be defined elsewhere.
import whylogs as why
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import greater_than_number

# Profile production data: only summary statistics are kept, not raw rows
results = why.log(production_df)
profile_view = results.view()

# Inspect per-column statistics as a DataFrame
summary = profile_view.to_pandas()

# Check constraints against the profile
builder = ConstraintsBuilder(dataset_profile_view=profile_view)
builder.add_constraint(greater_than_number(column_name="age", number=0))
constraints = builder.build()

if not constraints.validate():
    # The report lists each constraint with its pass/fail counts
    report = constraints.generate_constraints_report()
    alert_on_constraint_violation(report)

---

## **23.2 Performance Monitoring & Alerting**

#### **23.2.1 Prometheus Metrics for ML**

```python
# monitoring/metrics.py
import time

from fastapi import FastAPI
from opentelemetry import trace
from prometheus_client import Counter, Gauge, Histogram, make_asgi_app
from pydantic import BaseModel

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # scrape endpoint for Prometheus
tracer = trace.get_tracer(__name__)
MODEL_VERSION = "1.2.0"

# `model` and `log_prediction` are assumed to be defined elsewhere in the service.


class PredictRequest(BaseModel):
    user_id: str
    features: list[float]


# Traffic metrics
prediction_counter = Counter(
    'ml_predictions_total',
    'Total predictions',
    ['model_version', 'endpoint']
)

latency_histogram = Histogram(
    'ml_prediction_duration_seconds',
    'Inference latency',
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

# Model performance (updated when ground truth available)
accuracy_gauge = Gauge(
    'ml_model_accuracy_1h',
    'Rolling accuracy over 1 hour',
    ['model_version']
)

# Data drift
drift_gauge = Gauge(
    'ml_feature_drift_score',
    'Drift score per feature',
    ['feature_name']
)


# Example usage in FastAPI
@app.post("/predict")
async def predict(request: PredictRequest):
    start_time = time.perf_counter()

    with tracer.start_as_current_span("inference"):
        prediction = model.predict(request.features)

    # Record metrics
    prediction_counter.labels(
        model_version=MODEL_VERSION,
        endpoint="fraud"
    ).inc()

    latency_histogram.observe(time.perf_counter() - start_time)

    # Log prediction for later ground truth comparison
    log_prediction(request.user_id, prediction, timestamp=time.time())

    return {"prediction": prediction}
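
The accuracy gauge above can only be updated once labels arrive, so something has to join logged predictions with late ground truth. The following is a rough, library-free sketch of that join. The class name `RollingAccuracy`, its method names, and the `min_samples` guard are assumptions for illustration, and a production system would typically back this with a database rather than in-memory state.

```python
from collections import deque


class RollingAccuracy:
    """Joins logged predictions with late-arriving labels over a sliding window."""

    def __init__(self, window_size=1000, min_samples=50):
        self.pending = {}                          # prediction_id -> predicted label
        self.outcomes = deque(maxlen=window_size)  # 1 = correct, 0 = wrong
        self.min_samples = min_samples

    def record_prediction(self, prediction_id, predicted_label):
        self.pending[prediction_id] = predicted_label

    def record_label(self, prediction_id, true_label):
        predicted = self.pending.pop(prediction_id, None)
        if predicted is None:
            return  # label for an unknown or already-scored prediction
        self.outcomes.append(1 if predicted == true_label else 0)

    def accuracy(self):
        """Return rolling accuracy, or None until enough labels have arrived."""
        if len(self.outcomes) < self.min_samples:
            return None
        return sum(self.outcomes) / len(self.outcomes)


tracker = RollingAccuracy(window_size=100, min_samples=5)
for i in range(10):
    tracker.record_prediction(i, predicted_label=i % 2)
print(tracker.accuracy())  # None: no labels yet

for i in range(10):
    # Hypothetical label feed: the model is wrong on prediction ids 0 and 1.
    true_label = (i % 2) if i >= 2 else 1 - (i % 2)
    tracker.record_label(i, true_label)
print(tracker.accuracy())  # 0.8
```

When `accuracy()` returns a number, it could be published with `accuracy_gauge.labels(model_version=...).set(value)`. Returning `None` below the sample floor avoids reporting a misleading accuracy for a freshly deployed model.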

#### **23.2.2 Alerting Rules**

```yaml
# prometheus/alerts.yml
groups:
- name: ml_alerts
  rules:
  - alert: HighLatency
    expr: histogram_quantile(0.95, sum by (le) (rate(ml_prediction_duration_seconds_bucket[5m]))) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "P95 inference latency above 500 ms"
      
  - alert: ModelAccuracyDrop
    expr: ml_model_accuracy_1h < 0.85
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Model accuracy dropped below 85%"
      
  - alert: DataDriftDetected
    expr: ml_feature_drift_score > 0.5
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Significant drift in feature {{ $labels.feature_name }}"
      
  - alert: PredictionVolumeDrop
    expr: sum(rate(ml_predictions_total[5m])) == 0 or absent(ml_predictions_total)
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "No predictions being served - possible outage"
```

---

## **23.3 Automated Retraining**

#### **23.3.1 Triggering Strategies**

**1. Performance-Based:**
```python
# Check if accuracy below threshold for 3 consecutive windows
def should_retrain(accuracy_history: list[float]) -> bool:
    if len(accuracy_history) < 3:
        return False
    return all(acc < 0.85 for acc in accuracy_history[-3:])
```

**2. Drift-Based:**
```python
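# Retrain if >40% of features drifted
# (drift_report is the summary returned by DriftMonitor.check_current_data)
def should_retrain_drift(drift_report: dict, max_drift_share: float = 0.4) -> bool:
    return drift_report["drift_share"] > max_drift_share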
```

**3. Schedule-Based:** Weekly/monthly regardless of performance (catches subtle drift)

**4. Volume-Based:** Retrain once newly accumulated data reaches 10% of the size of the last training set
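
The schedule-based and volume-based triggers can be written in the same style as the two functions above. This is a minimal sketch: the function names, the 30-day `max_age`, and the row-count inputs are illustrative assumptions rather than fixed recommendations.

```python
from datetime import datetime, timedelta


def should_retrain_schedule(last_trained, now, max_age=timedelta(days=30)):
    """Schedule-based: retrain once the model is older than max_age."""
    return now - last_trained >= max_age


def should_retrain_volume(rows_at_last_training, rows_now, min_growth=0.10):
    """Volume-based: retrain once new rows reach min_growth of the last training set."""
    if rows_at_last_training <= 0:
        return True  # nothing to compare against; train on whatever exists
    new_rows = rows_now - rows_at_last_training
    return new_rows / rows_at_last_training >= min_growth


last_trained = datetime(2024, 1, 1)
print(should_retrain_schedule(last_trained, datetime(2024, 1, 20)))  # False
print(should_retrain_schedule(last_trained, datetime(2024, 2, 5)))   # True
print(should_retrain_volume(1_000_000, 1_050_000))                   # False
print(should_retrain_volume(1_000_000, 1_120_000))                   # True
```

In practice the four triggers are usually combined with a logical OR, so whichever condition fires first starts the pipeline.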

#### **23.3.2 Continuous Training Pipeline**

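```python
# airflow/dags/retrain_pipeline.py
# Assumes Airflow 2.4+ (`schedule=`; older 2.x releases use `schedule_interval=`).
# fetch_recent_metrics, train_and_validate, deploy_shadow_model, and
# update_production_endpoint are project helpers assumed to be importable here.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator


def check_retrain_needed(**context):
    """Return the task_id of the branch to follow."""
    metrics = fetch_recent_metrics(days=7)
    if metrics['accuracy'] < 0.85 or metrics['drift_score'] > 0.5:
        return 'train_new_model'
    return 'skip_retrain'


with DAG(
    'continuous_training',
    schedule='@daily',
    start_date=datetime(2024, 1, 1),  # placeholder; a start_date is required
    catchup=False,                    # do not backfill missed runs
) as dag:
    check = BranchPythonOperator(
        task_id='check_retrain_needed',
        python_callable=check_retrain_needed
    )

    train = PythonOperator(
        task_id='train_new_model',
        python_callable=train_and_validate,
        op_kwargs={'data_window': '30d'}
    )

    evaluate = PythonOperator(
        task_id='shadow_test',
        python_callable=deploy_shadow_model
    )

    # Runs only if the shadow test task succeeds, so deploy_shadow_model
    # should raise when the candidate does not beat production.
    promote = PythonOperator(
        task_id='promote_to_production',
        python_callable=update_production_endpoint,
        trigger_rule='all_success'
    )

    skip = PythonOperator(
        task_id='skip_retrain',
        python_callable=lambda: print("Model performing well, skipping")
    )

    check >> [train, skip]
    train >> evaluate >> promote
```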

---

## **23.4 A/B Testing & Shadow Mode**

#### **23.4.1 Statistical Validation**

```python
# ab_testing.py
import numpy as np
from scipy import stats


def ab_test_metric(control_metrics: list[float],
                   treatment_metrics: list[float],
                   alpha: float = 0.05) -> dict:
    """
    Welch's two-sample t-test for model comparison.

    Both arguments are per-observation samples (e.g. 0/1 correctness per
    request), not single aggregate scores.
    """
    t_stat, p_value = stats.ttest_ind(
        control_metrics,
        treatment_metrics,
        equal_var=False  # Welch's t-test
    )

    # Effect size (Cohen's d) using sample variances
    pooled_std = np.sqrt(
        (np.var(control_metrics, ddof=1) + np.var(treatment_metrics, ddof=1)) / 2
    )
    diff = np.mean(treatment_metrics) - np.mean(control_metrics)
    cohens_d = diff / pooled_std if pooled_std > 0 else 0.0

    return {
        "p_value": p_value,
        "significant": bool(p_value < alpha),
        "effect_size": cohens_d,
        "winner": "treatment" if (p_value < alpha and cohens_d > 0) else "control"
    }


# Usage in shadow mode.
# get_predictions, get_ground_truth, per_example_correct, and promote_model
# are hypothetical project helpers.
def evaluate_shadow_model():
    # Production model predictions (control)
    control_preds = get_predictions(model_version="v1.2.0", n=1000)

    # Shadow model predictions (treatment) - same inputs, not served to users
    shadow_preds = get_predictions(model_version="v1.3.0", n=1000)

    ground_truth = get_ground_truth(control_preds.timestamps)

    # One 0/1 correctness flag per request; a single accuracy number per
    # model would leave the t-test with nothing to compare.
    control_correct = per_example_correct(control_preds, ground_truth)
    shadow_correct = per_example_correct(shadow_preds, ground_truth)

    result = ab_test_metric(control_correct, shadow_correct)

    if result["significant"] and result["winner"] == "treatment":
        promote_model("v1.3.0")
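
Because a shadow model scores the same requests as the production model, the two sets of outcomes are paired rather than independent. A paired test is one reasonable alternative to the two-sample t-test in that setting. The sketch below implements an exact McNemar test with only the standard library; the function name `mcnemar_exact` and the toy data are assumptions for illustration.

```python
from math import comb


def mcnemar_exact(control_correct, treatment_correct):
    """Two-sided exact McNemar test on paired 0/1 correctness flags."""
    if len(control_correct) != len(treatment_correct):
        raise ValueError("paired samples must have the same length")

    # Only discordant pairs carry information about which model is better.
    pairs = list(zip(control_correct, treatment_correct))
    only_control = sum(1 for c, t in pairs if c and not t)
    only_treatment = sum(1 for c, t in pairs if t and not c)
    n = only_control + only_treatment
    if n == 0:
        return {"only_control": 0, "only_treatment": 0, "p_value": 1.0}

    # Under H0 each discordant pair favours either model with probability 0.5.
    k = min(only_control, only_treatment)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return {"only_control": only_control,
            "only_treatment": only_treatment,
            "p_value": p_value}


# Toy data: 20 shared requests; treatment fixes 6 control errors and adds 1 of its own.
control = [1] * 13 + [0] * 6 + [1]
treatment = [1] * 13 + [1] * 6 + [0]
result = mcnemar_exact(control, treatment)
print(result)  # p_value is 0.125: not significant at alpha = 0.05
```

The toy result shows why sample size matters: a 6-to-1 split in favour of the new model still fails to reach significance with only seven discordant pairs.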

#### **23.4.2 Multi-Armed Bandits**

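Multi-armed bandits allocate traffic dynamically, shifting requests toward the better-performing model as evidence accumulates. Compared with a fixed-split A/B test, this usually sends less traffic to the weaker variant before a decision is reached.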

```python
# thompson_sampling.py
import numpy as np


class ThompsonSamplingBandit:
    def __init__(self, n_arms: int, seed=None):
        # Beta(1, 1) prior on each arm's success rate
        self.alpha = np.ones(n_arms)  # 1 + successes
        self.beta = np.ones(n_arms)   # 1 + failures
        self.rng = np.random.default_rng(seed)

    def select_arm(self) -> int:
        # Sample one success rate per arm from its Beta posterior
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: float) -> None:
        # Bayesian update for a binary reward
        if reward > 0:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Usage: Route traffic to model with highest sampled conversion rate
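
To see how the allocation shifts over time, the same Beta-Bernoulli scheme can be simulated against arms with known conversion rates. This is a standard-library sketch under stated assumptions: the function `simulate_thompson`, the two true rates, and the round count are all made up for the example.

```python
import random


def simulate_thompson(true_rates, n_rounds=5000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; returns pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [1] * n_arms  # Beta(1, 1) prior on each arm
    failures = [1] * n_arms
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its posterior.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        arm = samples.index(max(samples))
        reward = rng.random() < true_rates[arm]  # simulated binary outcome
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls


pulls = simulate_thompson(true_rates=[0.05, 0.15])
shares = [round(p / sum(pulls), 3) for p in pulls]
print(f"traffic share per arm: {shares}")
```

Early rounds split traffic roughly evenly; as the posteriors sharpen, most requests go to the arm with the higher true rate while the weaker arm still receives occasional exploratory traffic.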

---

## **23.5 Incident Response & Reliability**

#### **23.5.1 Circuit Breaker Pattern**

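The circuit breaker pattern prevents cascading failures when the model service degrades: after repeated errors, callers fail fast and switch to a fallback instead of piling more requests onto a struggling service.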

```python
# circuit_breaker.py
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing recovery


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.timeout = timeout                      # seconds to wait before a trial call
        self.state = State.CLOSED
        self.failures = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == State.OPEN:
            if time.monotonic() - self.last_failure_time > self.timeout:
                self.state = State.HALF_OPEN  # let one trial call through
            else:
                raise CircuitOpenError("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise  # re-raise with the original traceback
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        self.last_failure_time = time.monotonic()
        # A failed trial call in HALF_OPEN reopens the circuit immediately
        if self.state == State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN


# Usage (`app`, `model`, and `fallback_prediction` are assumed to exist)
breaker = CircuitBreaker(failure_threshold=3, timeout=30)

@app.post("/predict")
def predict(request):
    try:
        return breaker.call(model.predict, request.features)
    except Exception:
        # Fallback to rule-based system or cached response
        return fallback_prediction(request)

#### **23.5.2 Graceful Degradation**

```python
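class ModelTimeout(Exception):
    """Raised by the primary model client when inference exceeds its deadline."""


# ml_model, light_model, and rule_based_heuristic are assumed to be defined elsewhere.
def get_prediction_with_fallback(user_id, features):
    try:
        # Try primary model (complex, accurate)
        return ml_model.predict(features)
    except ModelTimeout:
        try:
            # Fallback to lighter model (cached, faster)
            return light_model.predict(features)
        except Exception:
            # Ultimate fallback: heuristic/rules
            return rule_based_heuristic(user_id)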

---

## **23.6 Workbook Labs**

### **Lab 1: Monitoring Dashboard**
Build a Grafana dashboard for model monitoring:

1. **Data Drift:** Heatmap of feature drift scores over time
2. **Performance:** Accuracy, precision, recall (requires delayed ground truth)
3. **Business Metrics:** Revenue per prediction, conversion rates
4. **System Health:** Latency percentiles, error rates, GPU utilization
5. **Alerting:** Configure alerts for accuracy drop > 5%

**Deliverable:** Dashboard JSON and alert configuration.

### **Lab 2: Automated Retraining**
Implement continuous training:

1. **Trigger:** Detect drift using Evidently on daily batch
2. **Pipeline:** Automatically start training job when drift > threshold
3. **Validation:** Compare new model vs. production on holdout set
4. **Deployment:** Automatic canary deployment if performance improves

**Deliverable:** Airflow DAG with automated retraining logic.

### **Lab 3: A/B Testing Framework**
Set up statistical comparison:

1. **Traffic Split:** Route 10% traffic to challenger model (v2)
2. **Metrics:** Track conversion rate, latency, error rate
3. **Analysis:** Calculate p-values, confidence intervals
4. **Auto-promote:** Promote v2 if statistically significant improvement (p < 0.05)

**Deliverable:** A/B test framework with promotion logic.

### **Lab 4: Chaos Engineering**
Test resilience:

1. **Failure Injection:** Randomly kill model pods, inject latency
2. **Circuit Breaker:** Verify fallback triggers correctly
3. **Recovery:** Measure time to recovery (MTTR)
4. **Data Corruption:** Send malformed inputs, verify error handling

**Deliverable:** Chaos test report with resilience improvements.

---

## **23.7 Common Pitfalls**

1. **Alert Fatigue:** Too many false positives cause engineers to ignore alerts. **Solution:** Use anomaly detection (seasonality-aware), consolidate related alerts, prioritize actionable metrics.

2. **Ground Truth Delay:** For fraud detection, true labels may take weeks. **Solution:** Use proxy metrics (chargeback rate within 24h) or unsupervised drift detection as early warning.

3. **Cold Start Monitoring:** New models start with zero predictions, causing division by zero in accuracy calculations. **Solution:** Minimum sample size thresholds before calculating metrics.

4. **Ignoring Business Metrics:** Model accuracy is up 2% but revenue is down 10%, because the model is optimizing a metric that has diverged from the business goal. **Solution:** Always track business KPIs (conversion, revenue, user satisfaction) alongside ML metrics.

5. **Static Thresholds:** Fixed thresholds (e.g., accuracy < 0.8) don't account for seasonality (holiday shopping patterns). **Solution:** Dynamic thresholds based on historical ranges or forecasting.
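
As a rough illustration of the dynamic thresholds suggested in the last pitfall, the sketch below derives an alert floor from recent history instead of a fixed number. The function names, the three-sigma band, and the `min_samples` floor are assumptions for this example. Comparing like-for-like periods (for instance, the same weekday) is one simple way to account for seasonality.

```python
import statistics


def dynamic_lower_bound(history, k=3.0, min_samples=14):
    """Alert floor = mean - k * stdev of recent history; None if history is too short."""
    if len(history) < min_samples:
        return None
    return statistics.mean(history) - k * statistics.stdev(history)


def is_anomalous(value, history, k=3.0, min_samples=14):
    bound = dynamic_lower_bound(history, k, min_samples)
    return bound is not None and value < bound


# Hypothetical daily accuracy for the same weekday over the past 16 weeks.
history = [0.90, 0.91, 0.89, 0.90, 0.92, 0.91, 0.90, 0.89,
           0.91, 0.90, 0.92, 0.90, 0.89, 0.91, 0.90, 0.91]
print(is_anomalous(0.89, history))      # False: within the usual range
print(is_anomalous(0.80, history))      # True: well below the historical band
print(is_anomalous(0.80, history[:5]))  # False: too little history to judge
```

The `min_samples` guard also covers the cold-start pitfall: no alert is raised until enough history exists to estimate a meaningful band.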

---

## **23.8 Interview Questions**

**Q1:** How do you detect concept drift vs. data drift, and why does the distinction matter?
*A: Data drift (covariate shift) means P(X) changes: input feature distributions shift (e.g., the age range of users changes). It is detected with statistical tests such as KS or PSI. Concept drift means P(Y|X) changes: the relationship between features and target shifts (e.g., fraud patterns evolve). It is detected through performance degradation against ground truth, with shifts in the prediction confidence distribution as an earlier but weaker signal. The distinction matters because: (1) data drift may not affect performance if the decision boundary is unchanged, (2) concept drift generally requires retraining, (3) the mitigation strategies differ (feature adaptation vs. model retraining).*

**Q2:** Describe your strategy for monitoring models when ground truth is delayed (e.g., credit default takes 2 years).
*A: (1) Upstream monitoring: Detect data drift in inputs as early warning, (2) Proxy metrics: Short-term indicators correlated with final outcome (missed payment within 30 days), (3) Prediction distribution monitoring: Sudden shifts in model confidence or output distribution suggest issues, (4) Human-in-the-loop: Sample predictions for manual review to estimate current accuracy, (5) Counterfactual evaluation: How would old model perform on new data? (requires holdout set), (6) Business metric tracking: Default rate trends, even without individual labels.*

**Q3:** How do you decide between scheduled retraining vs. performance-triggered retraining?
*A: Scheduled: Good for stable domains, simpler operations, predictable costs. Triggered: Good for dynamic environments, cost-efficient (don't train if not needed), faster response to drift. Hybrid approach: (1) Minimum scheduled retraining (monthly) to incorporate new data, (2) Triggered retraining for emergency drift detection (>30% features drifted or accuracy drop >5%), (3) Continuous training for high-velocity systems (daily incremental updates).*

**Q4:** What is shadow mode deployment, and when would you use it?
*A: Shadow mode: Route production traffic to new model (challenger) without serving its predictions to users. Log predictions for comparison against current model (control). Use when: (1) High risk of new model failure (safety-critical), (2) Ground truth unavailable immediately—need to accumulate prediction logs before evaluation, (3) Testing infrastructure capacity (can new model handle production load?), (4) Regulatory requirements to validate before serving. Limitation: Doesn't test user reaction to new predictions (only technical correctness).*

**Q5:** Design a circuit breaker for a real-time fraud detection system.
*A: States: Closed (normal), Open (failing fast), Half-Open (testing recovery). Configuration: Failure threshold 5 errors in 1 minute, timeout 30s. Fallback hierarchy: (1) Try cached prediction for user, (2) Use lighter rule-based model (no ML), (3) Approve transaction (business decision—false negatives better than blocking all). Monitoring: Alert when circuit opens, track fallback rate. Recovery: Half-open allows 1 request per minute to test health before closing. Implementation: Use libraries like `pybreaker` or Istio/Envoy outlier detection for service mesh level.*
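
The "5 errors in 1 minute" threshold in the answer above counts failures within a time window, not consecutive failures. One possible way to track that is a sliding window of failure timestamps, sketched below. The class name `SlidingWindowFailureCounter` and the injectable `clock` parameter are assumptions for illustration; libraries such as `pybreaker` have their own mechanisms.

```python
import time
from collections import deque


class SlidingWindowFailureCounter:
    """Counts failures inside a trailing time window (e.g. 5 errors in 60 s)."""

    def __init__(self, threshold=5, window_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock       # injectable so tests can fake time
        self.failures = deque()  # timestamps of recent failures

    def _evict_old(self, now):
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        self._evict_old(now)

    def should_open(self):
        self._evict_old(self.clock())
        return len(self.failures) >= self.threshold


# Fake clock so the example runs instantly.
fake_now = [0.0]
counter = SlidingWindowFailureCounter(threshold=5, window_seconds=60.0,
                                      clock=lambda: fake_now[0])
for t in (0, 10, 20, 30):
    fake_now[0] = t
    counter.record_failure()
print(counter.should_open())  # False: only 4 failures in the window

fake_now[0] = 40
counter.record_failure()
print(counter.should_open())  # True: 5 failures within 60 s

fake_now[0] = 95
print(counter.should_open())  # False: failures at t=0..30 have aged out
```

A breaker built on this counter would open when `should_open()` returns true, which tolerates isolated errors spread over a long period better than a consecutive-failure count.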

---

## **23.9 Further Reading**

**Books:**
- *Practical Machine Learning for Computer Vision* (O'Reilly) - Monitoring CV models
- *Site Reliability Engineering* (Google) - General reliability principles

**Papers:**
- "Monitoring Machine Learning Models in Production" (Sashank, 2021)
- "The ML Test Score: A Rubric for ML Production Readiness"

**Tools:**
- **Evidently AI:** Drift detection and reports
- **WhyLabs:** Statistical monitoring
- **Arize AI:** ML observability platform
- **Fiddler:** Model performance management

---

## **23.10 Checkpoint Project: Production Monitoring System**

Build a complete monitoring and maintenance system for the fraud detection model from Chapter 22.

**Requirements:**

1. **Monitoring Stack:**
   - Prometheus for metrics collection
   - Grafana for visualization
   - Evidently for drift reports
   - PagerDuty/Opsgenie for alerting

2. **Metrics Pipeline:**
   - Log every prediction with features, timestamp, model version
   - Daily batch job calculating drift vs. training set
   - Weekly accuracy calculation (using confirmed fraud labels)
   - Business metrics: Fraud caught ($), false positive rate

3. **Alerting Rules:**
   - Critical: Accuracy < 80%, system down (0 RPS), P99 latency > 500ms
   - Warning: Drift detected in >3 features, error rate > 1%
   - Info: Model version nearing retirement age (30 days old)

4. **Automated Response:**
   - Drift detected → Trigger retraining pipeline → Shadow test → Auto-promote if better
   - Circuit breaker opens → Fallback to rules engine → Page on-call engineer
   - Accuracy drop → Immediate rollback to previous model version

5. **Reporting:**
   - Weekly automated report: Model performance, drift summary, business impact
   - Monthly review: Feature importance shifts, retraining recommendations

**Deliverables:**
- `monitoring/` directory with Prometheus configs, Grafana dashboards
- `maintenance/` with retraining DAGs and rollback scripts
- Runbook: "Incident Response: Model Serving Outage"
- Demo: Simulated drift causing automated retraining

**Success Criteria:**
- Zero false-positive alerts during the demo run (thresholds tuned)
- Automated retraining successfully deployed improved model
- <5 minute time to detection (TTD) for accuracy drop
- <15 minute time to recovery (TTR) using rollback

---

**End of Chapter 23**

*You can now maintain ML systems in production with confidence. Chapter 24 covers Responsible AI & Ethics—ensuring your systems are fair, explainable, and secure.*

---
