# Week 14 Lab: Model Monitoring & Observability

**CS 203: Software Tools and Techniques for AI**

---

## Lab Overview

In this lab, you will learn to:
1. **Detect data drift** using statistical tests
2. **Monitor model performance** over time
3. **Build dashboards** with Evidently AI
4. **Set up alerts** for model degradation

**Goal**: Build a complete monitoring system for an ML model.

---

## Setup

In [None]:
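# Install required packages
# Part 4 imports `evidently.report` and `evidently.metric_preset`. Evidently 0.7
# reorganized its API, so an earlier release is pinned here to keep those imports working.
!pip install "evidently<0.7" pandas scikit-learn scipy matplotlib numpy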

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import ks_2samp
import warnings
warnings.filterwarnings('ignore')

print("All imports successful!")

---

# Part 1: Understanding Model Drift

A deployed model's accuracy tends to degrade over time because the data it sees in production drifts away from the data it was trained on.

```
┌────────────────────────────────────────────────────────────┐
│                     Model Lifecycle                        │
│                                                            │
│  Training    Deploy     Production      Degradation        │
│    Data  ──► Model  ──►   Data     ──►  Detected!         │
│                                                            │
│    2023       v1.0       2024          Accuracy ↓         │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

**Types of Drift**:
- **Data drift**: The distribution of the inputs, P(X), changes
- **Concept drift**: The relationship between inputs and target, P(Y|X), changes
- **Label drift**: The distribution of the target, P(Y), changes
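
The difference between data drift and concept drift is easier to see on a toy problem. The sketch below is illustrative only: the one-dimensional feature, the two labeling rules (`old_rule`, `new_rule`), and the "stale model" that simply applies the old rule are all assumptions made for this example, not part of the lab's dataset.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)  # local generator, independent of np.random.seed

def old_rule(x):
    """Concept at training time: positive when x > 0."""
    return (x > 0).astype(int)

def new_rule(x):
    """Concept after drift: the decision boundary moved to x > 1."""
    return (x > 1).astype(int)

x_ref = rng.normal(0, 1, 2000)

# Data drift: the inputs shift, the labeling rule stays the same
x_data_drift = rng.normal(1, 1, 2000)
y_data_drift = old_rule(x_data_drift)

# Concept drift: the inputs stay the same, the labeling rule changes
x_concept_drift = rng.normal(0, 1, 2000)
y_concept_drift = new_rule(x_concept_drift)

# A "stale" model that learned the old rule perfectly
acc_data_drift = np.mean(old_rule(x_data_drift) == y_data_drift)
acc_concept_drift = np.mean(old_rule(x_concept_drift) == y_concept_drift)

# KS test on the inputs only (no labels needed)
p_data_drift = ks_2samp(x_ref, x_data_drift).pvalue
p_concept_drift = ks_2samp(x_ref, x_concept_drift).pvalue

print(f"data drift:    input p-value={p_data_drift:.4f}, stale accuracy={acc_data_drift:.3f}")
print(f"concept drift: input p-value={p_concept_drift:.4f}, stale accuracy={acc_concept_drift:.3f}")
print(f"positive rate: reference={old_rule(x_ref).mean():.3f}, data drift={y_data_drift.mean():.3f}")
```

In this toy setup the input test flags the data-drift batch while the stale model stays accurate, and the concept-drift batch hurts accuracy even though its inputs come from the same distribution as the reference. That is why the lab monitors both input distributions and model performance. The last line also shows label drift: shifting the inputs changes the share of positive labels.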

### Question 1.1 (Solved): Create Synthetic Dataset with Drift

In [None]:
# SOLVED EXAMPLE

np.random.seed(42)

def create_reference_data(n_samples=1000):
    """Create reference (training) data."""
    data = {
        'feature_1': np.random.normal(0, 1, n_samples),
        'feature_2': np.random.normal(5, 2, n_samples),
        'feature_3': np.random.uniform(0, 10, n_samples),
    }
    df = pd.DataFrame(data)
    # Create target based on features
    df['target'] = (df['feature_1'] + df['feature_2'] > 5).astype(int)
    return df

def create_production_data(n_samples=1000, drift_amount=0.5):
    """Create production data with drift."""
    data = {
        'feature_1': np.random.normal(0 + drift_amount, 1, n_samples),  # Mean shifted
        'feature_2': np.random.normal(5 + drift_amount, 2, n_samples),  # Mean shifted
        'feature_3': np.random.uniform(0, 10, n_samples),               # No drift
    }
    df = pd.DataFrame(data)
    df['target'] = (df['feature_1'] + df['feature_2'] > 5).astype(int)
    return df

# Create datasets
reference_data = create_reference_data(1000)
production_data = create_production_data(1000, drift_amount=1.0)

print(f"Reference data shape: {reference_data.shape}")
print(f"Production data shape: {production_data.shape}")
print(f"\nReference feature_1 mean: {reference_data['feature_1'].mean():.3f}")
print(f"Production feature_1 mean: {production_data['feature_1'].mean():.3f}")

### Question 1.2: Visualize the Drift

In [None]:
# YOUR CODE HERE
# Create histograms comparing reference vs production distributions

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

features = ['feature_1', 'feature_2', 'feature_3']

for i, feature in enumerate(features):
    # Plot reference
    axes[i].hist(reference_data[feature], bins=30, alpha=0.5, label='Reference', density=True)
    # Plot production
    axes[i].hist(production_data[feature], bins=30, alpha=0.5, label='Production', density=True)
    axes[i].set_title(feature)
    axes[i].legend()

plt.tight_layout()
plt.savefig('drift_visualization.png', dpi=150)
plt.show()

---

# Part 2: Statistical Drift Detection

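Histograms make drift visible, but monitoring needs a signal you can act on automatically. Statistical tests compare the reference and production samples of each feature and return a number you can threshold.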

## 2.1 Kolmogorov-Smirnov Test

### Question 2.1 (Solved): KS Test for Drift

In [None]:
# SOLVED EXAMPLE

def detect_drift_ks(reference, current, threshold=0.05):
    """Detect drift per feature with the two-sample Kolmogorov-Smirnov test.

    A feature is flagged when its p-value falls below `threshold`.
    """
    results = {}
    
    for column in reference.columns:
        if column == 'target':
            continue
        
        stat, p_value = ks_2samp(reference[column], current[column])
        drift_detected = p_value < threshold
        
        results[column] = {
            'statistic': stat,
            'p_value': p_value,
            'drift_detected': drift_detected
        }
    
    return results

# Run drift detection
drift_results = detect_drift_ks(reference_data, production_data)

print("Drift Detection Results:")
print("=" * 50)
for feature, result in drift_results.items():
    status = "DRIFT DETECTED" if result['drift_detected'] else "No drift"
    print(f"{feature:15s}: p-value={result['p_value']:.4f} -> {status}")
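
To make the KS statistic less of a black box, the following sketch computes it by hand as the largest vertical gap between the two empirical CDFs and compares the result with `ks_2samp`. The helper name `ks_statistic_manual` and the two synthetic samples are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic_manual(a, b):
    """Largest vertical gap between the empirical CDFs of a and b."""
    a = np.sort(np.asarray(a))
    b = np.sort(np.asarray(b))
    # The gap can only change at observed values, so evaluate both CDFs there
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / len(a)
    cdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(1)
sample_ref = rng.normal(0, 1, 500)
sample_cur = rng.normal(0.5, 1, 500)

manual = ks_statistic_manual(sample_ref, sample_cur)
scipy_stat = ks_2samp(sample_ref, sample_cur).statistic
print(f"manual={manual:.4f}  scipy={scipy_stat:.4f}")
```

The statistic lies between 0 and 1 and acts as an effect size. The p-value also depends on sample size: with very large batches even a tiny shift can fall below 0.05, so it is worth logging the statistic alongside the p-value.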

### Question 2.2: Implement PSI (Population Stability Index)

In [None]:
# YOUR CODE HERE

def calculate_psi(reference, current, bins=10):
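    """Calculate Population Stability Index.

    PSI = sum over bins of (cur_frac - ref_frac) * ln(cur_frac / ref_frac),
    where ref_frac and cur_frac are the fractions of each sample in a bin.

    PSI < 0.1: No significant change
    0.1 <= PSI <= 0.25: Moderate change
    PSI > 0.25: Significant change
    """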
    # Implement PSI calculation
    pass

# Calculate PSI for each feature


---

# Part 3: Training a Model and Monitoring Performance

Train a model and monitor its performance on drifted data.

## 3.1 Train Baseline Model

### Question 3.1 (Solved): Train Model on Reference Data

In [None]:
# SOLVED EXAMPLE

# Prepare data
feature_cols = ['feature_1', 'feature_2', 'feature_3']
X_train = reference_data[feature_cols]
y_train = reference_data['target']

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Baseline performance (measured on the training rows themselves, so it is optimistic)
y_train_pred = model.predict(X_train)
baseline_accuracy = accuracy_score(y_train, y_train_pred)
baseline_f1 = f1_score(y_train, y_train_pred)

print(f"Baseline Accuracy: {baseline_accuracy:.4f}")
print(f"Baseline F1 Score: {baseline_f1:.4f}")
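
The baseline above is computed on the same rows the model was fit on, so part of any later "drop" in production is just the gap between training and unseen data. A more conservative baseline holds out part of the reference data. The sketch below rebuilds a small dataset with the same recipe as `create_reference_data` so that it runs on its own; the 80/20 split and the names `demo`, `holdout_model`, `train_acc`, and `holdout_acc` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'feature_1': rng.normal(0, 1, 1000),
    'feature_2': rng.normal(5, 2, 1000),
    'feature_3': rng.uniform(0, 10, 1000),
})
demo['target'] = (demo['feature_1'] + demo['feature_2'] > 5).astype(int)

X = demo[['feature_1', 'feature_2', 'feature_3']]
y = demo['target']

# Keep 20% of the reference rows out of training to get an honest baseline
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

holdout_model = RandomForestClassifier(n_estimators=100, random_state=42)
holdout_model.fit(X_fit, y_fit)

train_acc = accuracy_score(y_fit, holdout_model.predict(X_fit))
holdout_acc = accuracy_score(y_holdout, holdout_model.predict(X_holdout))
print(f"Training accuracy: {train_acc:.4f}")
print(f"Held-out accuracy: {holdout_acc:.4f}")
```

Comparing production accuracy against the held-out number separates ordinary generalization error from degradation caused by drift.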

### Question 3.2: Evaluate on Production Data

In [None]:
# YOUR CODE HERE
# Evaluate model on production data and compare to baseline

X_prod = production_data[feature_cols]
y_prod = production_data['target']

# Make predictions

# Calculate metrics

# Print comparison


---

# Part 4: Using Evidently for Monitoring

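Evidently is an open-source library that computes drift and performance metrics from a reference DataFrame and a current DataFrame, and renders them as HTML reports.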

## 4.1 Generate Drift Reports

### Question 4.1 (Solved): Create Evidently Report

In [None]:
# SOLVED EXAMPLE
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Create drift report
report = Report(metrics=[DataDriftPreset()])

# Run report
report.run(
    reference_data=reference_data[feature_cols],
    current_data=production_data[feature_cols]
)

# Save as HTML
report.save_html('drift_report.html')
print("Report saved to drift_report.html")

# Get metrics as dict
report_dict = report.as_dict()
print(f"\nDataset drift detected: {report_dict['metrics'][0]['result']['dataset_drift']}")

### Question 4.2: Create Model Performance Report

In [None]:
# YOUR CODE HERE
# Create an Evidently report for model performance
from evidently.metric_preset import ClassificationPreset

# Add predictions to dataframes
reference_data['prediction'] = model.predict(reference_data[feature_cols])
production_data['prediction'] = model.predict(production_data[feature_cols])

# Create and run report


---

# Part 5: Building an Alert System

Turn the accuracy and drift checks into rules that automatically flag when the model should be retrained.

## 5.1 Define Alert Rules

### Question 5.1 (Solved): Alert System

In [None]:
# SOLVED EXAMPLE

class MonitoringAlert:
    """Simple alert system for model monitoring."""
    
    def __init__(self, accuracy_threshold=0.85, drift_threshold=0.05):
        self.accuracy_threshold = accuracy_threshold
        self.drift_threshold = drift_threshold
        self.alerts = []
    
    def check_accuracy(self, current_accuracy, baseline_accuracy):
        """Alert if accuracy is below the absolute threshold (baseline_accuracy is accepted but unused)."""
        if current_accuracy < self.accuracy_threshold:
            self.alerts.append({
                'type': 'ACCURACY_DROP',
                'severity': 'HIGH',
                'message': f'Accuracy dropped to {current_accuracy:.2%} (threshold: {self.accuracy_threshold:.2%})'
            })
            return True
        return False
    
    def check_drift(self, drift_results):
        """Check if data drift detected."""
        drifted_features = [
            f for f, r in drift_results.items() 
            if r['drift_detected']
        ]
        
        if drifted_features:
            self.alerts.append({
                'type': 'DATA_DRIFT',
                'severity': 'MEDIUM',
                'message': f'Drift detected in: {", ".join(drifted_features)}'
            })
            return True
        return False
    
    def should_retrain(self):
        """Determine if model should be retrained."""
        high_severity = any(a['severity'] == 'HIGH' for a in self.alerts)
        return high_severity
    
    def get_alerts(self):
        """Get all alerts."""
        return self.alerts

# Test alert system
alert_system = MonitoringAlert(accuracy_threshold=0.90)

# Check accuracy
current_accuracy = accuracy_score(y_prod, model.predict(X_prod))
alert_system.check_accuracy(current_accuracy, baseline_accuracy)

# Check drift
alert_system.check_drift(drift_results)

# Print alerts
print("\nAlerts:")
for alert in alert_system.get_alerts():
    print(f"  [{alert['severity']}] {alert['type']}: {alert['message']}")

print(f"\nShould retrain: {alert_system.should_retrain()}")

### Question 5.2: Create Notification Function

In [None]:
# YOUR CODE HERE
# Create a function that formats alerts for Slack/email notification

def format_alert_message(alerts, model_name="iris_classifier"):
    """Format alerts for notification."""
    pass

# Test it
# message = format_alert_message(alert_system.get_alerts())
# print(message)
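
One possible shape for such a formatter is sketched below. It is only an example: the function name `format_alert_message_sketch`, the optional `now` argument, the `LOW` severity level, and the sample alerts are assumptions, and the output is plain text rather than a real Slack or email payload.

```python
from datetime import datetime, timezone

def format_alert_message_sketch(alerts, model_name="iris_classifier", now=None):
    """Render a list of alert dicts as a plain-text notification body."""
    now = now or datetime.now(timezone.utc)
    header = f"[{model_name}] monitoring report at {now:%Y-%m-%d %H:%M} UTC"
    if not alerts:
        return f"{header}\nNo alerts."

    # Put HIGH alerts first so the most urgent item leads the message.
    # 'LOW' is a hypothetical level; the lab's alert class only emits HIGH and MEDIUM.
    order = {'HIGH': 0, 'MEDIUM': 1, 'LOW': 2}
    ranked = sorted(alerts, key=lambda a: order.get(a['severity'], len(order)))

    lines = [header, f"{len(alerts)} alert(s):"]
    lines += [f"- [{a['severity']}] {a['type']}: {a['message']}" for a in ranked]
    return "\n".join(lines)

# Illustrative alerts in the same dict layout that MonitoringAlert produces
sample_alerts = [
    {'type': 'DATA_DRIFT', 'severity': 'MEDIUM', 'message': 'Drift detected in: feature_1'},
    {'type': 'ACCURACY_DROP', 'severity': 'HIGH', 'message': 'Accuracy dropped to 82.00%'},
]
message = format_alert_message_sketch(
    sample_alerts, now=datetime(2024, 1, 1, tzinfo=timezone.utc)
)
print(message)
```

Passing `now` explicitly keeps the function easy to test, since the timestamp in the header is then fixed.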

---

# Part 6: Simulating Production Monitoring

Simulate weekly batches with increasing drift.

## 6.1 Weekly Monitoring Simulation

### Question 6.1: Simulate Gradual Drift

In [None]:
# YOUR CODE HERE
# Simulate 10 weeks of production data with increasing drift

weeks = 10
results = []

for week in range(weeks):
    drift_amount = week * 0.2  # Increasing drift
    
    # Create production data for this week
    week_data = create_production_data(500, drift_amount=drift_amount)
    
    # Calculate metrics
    X_week = week_data[feature_cols]
    y_week = week_data['target']
    y_pred = model.predict(X_week)
    
    accuracy = accuracy_score(y_week, y_pred)
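    # Compare feature columns only: reference_data gained a 'prediction' column
    # in Question 4.2 that week_data does not have, which would raise a KeyError
    drift_result = detect_drift_ks(reference_data[feature_cols], week_data[feature_cols])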
    n_drifted = sum(1 for r in drift_result.values() if r['drift_detected'])
    
    results.append({
        'week': week + 1,
        'drift_amount': drift_amount,
        'accuracy': accuracy,
        'n_drifted_features': n_drifted
    })

# Create DataFrame
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

### Question 6.2: Visualize Performance Over Time

In [None]:
# YOUR CODE HERE
# Create a plot showing accuracy and drift over time

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Accuracy over time
axes[0].plot(results_df['week'], results_df['accuracy'], marker='o')
axes[0].axhline(y=0.9, color='r', linestyle='--', label='Threshold')
axes[0].set_xlabel('Week')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Model Accuracy Over Time')
axes[0].legend()

# Drift features over time
axes[1].bar(results_df['week'], results_df['n_drifted_features'])
axes[1].set_xlabel('Week')
axes[1].set_ylabel('Drifted Features')
axes[1].set_title('Data Drift Detection Over Time')

plt.tight_layout()
plt.savefig('monitoring_dashboard.png', dpi=150)
plt.show()

---

# Summary

In this lab, you learned:

1. **Types of drift**: Data, concept, and label drift
2. **Statistical tests**: KS test, PSI for drift detection
3. **Evidently**: Creating drift and performance reports
4. **Alert systems**: Automated monitoring and notifications
5. **Production simulation**: Tracking drift over time

## Key Takeaways

| Detection Method | Use Case | Drift flagged when |
|------------------|----------|--------------------|
| KS Test | Continuous features | p < 0.05 |
| PSI | Binned continuous or categorical features | PSI > 0.25 |
| Chi-squared | Categorical features | p < 0.05 |
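
The chi-squared row is not exercised elsewhere in this lab, so here is a minimal sketch of how it could look for a categorical feature. The helper name `detect_drift_chi2`, the three categories, and their probabilities are assumptions made up for this example. The return value mirrors the dict layout of `detect_drift_ks`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def detect_drift_chi2(reference, current, threshold=0.05):
    """Chi-squared test of homogeneity on the category counts of two samples."""
    ref_counts = pd.Series(reference).value_counts()
    cur_counts = pd.Series(current).value_counts()
    # Align on the union of categories; a category missing from one sample counts as 0
    table = pd.concat([ref_counts, cur_counts], axis=1).fillna(0).to_numpy()
    stat, p_value, dof, _expected = chi2_contingency(table)
    return {
        'statistic': float(stat),
        'p_value': float(p_value),
        'drift_detected': bool(p_value < threshold),
    }

rng = np.random.default_rng(3)
categories = ['a', 'b', 'c']  # hypothetical category labels
ref_cat = rng.choice(categories, size=1000, p=[0.5, 0.3, 0.2])
cur_cat = rng.choice(categories, size=1000, p=[0.2, 0.3, 0.5])

chi2_result = detect_drift_chi2(ref_cat, cur_cat)
print(chi2_result)
```

The KS test assumes an ordering of values, which category labels do not have, so comparing counts per category is the usual substitute.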

**When to retrain**:
- Accuracy drops > 10%
- Multiple features drift
- Critical alerts triggered
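
These three triggers can be combined into a single decision function. The sketch below makes two interpretive assumptions that the list does not settle: "drops > 10%" is read as a drop relative to the baseline accuracy, and "multiple features" is read as two or more. The function name and parameters are hypothetical.

```python
def should_retrain_sketch(baseline_accuracy, current_accuracy, n_drifted_features,
                          has_critical_alert=False,
                          max_relative_drop=0.10, min_drifted_features=2):
    """Return (decision, reasons) for the three retraining triggers."""
    reasons = []

    # Assumption: the 10% rule is relative to the baseline, not percentage points
    relative_drop = (baseline_accuracy - current_accuracy) / baseline_accuracy
    if relative_drop > max_relative_drop:
        reasons.append(f"accuracy fell {relative_drop:.1%} relative to baseline")

    # Assumption: "multiple" means at least two drifted features
    if n_drifted_features >= min_drifted_features:
        reasons.append(f"{n_drifted_features} features drifted")

    if has_critical_alert:
        reasons.append("critical alert triggered")

    return bool(reasons), reasons

decision, reasons = should_retrain_sketch(0.95, 0.80, n_drifted_features=0)
print(decision, reasons)
```

Returning the reasons alongside the decision makes the output usable directly in a notification or in the written report for this lab.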

---

## Submission

Submit:
1. This completed notebook
2. Evidently drift report (HTML)
3. Monitoring dashboard plot
4. Brief report: At what week would you trigger retraining?