# 🔬 Advanced Data Science Monitoring with Seldon Core 2

**Deploy specialized monitoring models for data drift detection and model explanation.**

## 🔬 **Seldon Core 2 Advanced Monitoring Features:**
- **📊 Statistical Drift Detection**: Real-time monitoring of feature distributions with configurable thresholds
- **🧠 Model Explainability**: Built-in support for LIME, SHAP, Anchors, and custom explanation methods
- **📈 Data Quality Monitoring**: Automatic detection of missing values, outliers, and schema changes
- **🎯 Model Performance Tracking**: Live accuracy, precision, recall monitoring with ground truth integration
- **🔄 Feedback Loop Integration**: Capture human feedback and retrain triggers
- **📊 Concept Drift Detection**: Advanced algorithms for detecting when model assumptions break down
- **⚠️ Alert Integration**: Automatic notifications when models need attention or retraining
- **🏷️ Bias Detection**: Monitor for fairness and bias across different demographic groups
- **📋 Audit Trail**: Complete lineage tracking for regulatory compliance and model governance
- **🤖 Auto-Remediation**: Trigger retraining pipelines or model rollbacks based on monitoring signals

Components:
1. **Drift Detector**: Statistical monitoring of feature distributions
2. **Model Explainer**: Anchor-based explanations for predictions
3. **Monitoring Pipeline**: Real-time drift detection during inference
4. **Explanation Pipeline**: On-demand explanation service
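
As a rough illustration of what "statistical monitoring of feature distributions" can mean in practice, the sketch below computes a per-feature two-sample Kolmogorov-Smirnov statistic with NumPy and compares it with a threshold. The helper names (`ks_statistic`, `feature_drift`) and the synthetic data are assumptions made for this example and are not part of Seldon Core 2; the 0.15 threshold simply mirrors the `drift_threshold` default used in this notebook.

```python
import numpy as np

def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

def feature_drift(reference, current, threshold=0.15):
    """Return (scores, flags) with one KS score per feature column."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    scores = [ks_statistic(reference[:, j], current[:, j])
              for j in range(reference.shape[1])]
    flags = [score > threshold for score in scores]
    return scores, flags

# Synthetic data (hypothetical): four features, with the third one shifted.
rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
shifted = reference.copy()
shifted[:, 2] += 3.0

scores, flags = feature_drift(reference, shifted)
print(scores, flags)
```

Only the shifted column should be flagged here; a production detector would normally also account for sample size and multiple testing.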

### 🔍 **Advanced Monitoring Manifests We'll Deploy:**

**Drift Detection Model:**
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: drift-detector
  namespace: seldon-mesh
spec:
  storageUri: gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn
  requirements: ["sklearn"]
  memory: 1Gi
```

**Model Explainer:**
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: model-explainer
  namespace: seldon-mesh
spec:
  storageUri: gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn
  requirements: ["sklearn"]
  memory: 1Gi
```
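
Manifests like the two above can also be built as Python dictionaries and serialized to JSON, which `kubectl apply -f` accepts alongside YAML. The following is a minimal sketch under that assumption; `model_manifest` is a hypothetical helper, and its fields mirror only what the manifests above already contain.

```python
import json

def model_manifest(name, storage_uri, namespace="seldon-mesh",
                   memory="1Gi", requirements=("sklearn",)):
    """Build a Model manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "mlops.seldon.io/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageUri": storage_uri,
            "requirements": list(requirements),
            "memory": memory,
        },
    }

manifest = model_manifest(
    "drift-detector",
    "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
)
# JSON output can be written to a file and passed to `kubectl apply -f`.
print(json.dumps(manifest, indent=2))
```

Building the structure as data avoids the indentation and quoting mistakes that are easy to make when YAML is assembled from strings.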

## Setup and Configuration

In [ ]:
import json
import subprocess
import time
import requests
import os
import numpy as np
from IPython.display import display, Markdown, Code, HTML
from dataclasses import dataclass, field
from typing import Optional, List, Dict, Tuple
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

@dataclass
class Config:
    namespace: str = "seldon-mesh"
    gateway_ip: Optional[str] = None
    gateway_port: str = "80"
    timeout: int = 30
    drift_threshold: float = 0.15
    performance_threshold: float = 0.85

@dataclass
class MonitoringMetrics:
    drift_detections: int = 0
    explanations_generated: int = 0
    anomalies_detected: int = 0
    total_monitored: int = 0
    drift_scores: List[float] = field(default_factory=list)
    model_confidence: List[float] = field(default_factory=list)
    data_quality_issues: int = 0
    
config = Config()
metrics = MonitoringMetrics()
deployed = {"servers": [], "models": [], "pipelines": []}

def run(cmd, timeout=30): 
    """Execute command with timeout and error handling"""
    try:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
        return result
    except subprocess.TimeoutExpired:
        return subprocess.CompletedProcess(cmd, 1, "", f"Command timed out after {timeout}s")
    except Exception as e:
        return subprocess.CompletedProcess(cmd, 1, "", str(e))

def log(msg, level="INFO"): 
    """Production logging with proper formatting"""
    icons = {"INFO": "ℹ️", "SUCCESS": "✅", "WARNING": "⚠️", "ERROR": "❌", "DEBUG": "🔍"}
    colors = {"SUCCESS": "green", "WARNING": "orange", "ERROR": "red", "INFO": "blue"}
    icon = icons.get(level, "📝")
    color = colors.get(level, "black")
    timestamp = datetime.now().strftime("%H:%M:%S")
    display(Markdown(f"<span style='color: {color}'>{icon} [{timestamp}] **{msg}**</span>"))

# Production gateway configuration
def configure_gateway():
    """Configure gateway with production validation"""
    result = run("kubectl get svc istio-ingressgateway -n istio-system -o json")
    if result.returncode == 0 and result.stdout:
        try:
            svc_data = json.loads(result.stdout)
            ingress = svc_data.get("status", {}).get("loadBalancer", {}).get("ingress", [])
            if ingress and ingress[0].get("ip"):
                config.gateway_ip = ingress[0].get("ip")
                log(f"Using LoadBalancer IP: {config.gateway_ip}", "SUCCESS")
                return
            elif ingress and ingress[0].get("hostname"):
                config.gateway_ip = ingress[0].get("hostname")
                log(f"Using LoadBalancer hostname: {config.gateway_ip}", "SUCCESS")
                return
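        except (ValueError, AttributeError, TypeError):
            pass  # malformed service JSON: fall through to the NodePort lookup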
    
    # Try NodePort
    result = run("kubectl get svc istio-ingressgateway -n istio-system -o json")
    if result.returncode == 0 and result.stdout:
        try:
            svc_data = json.loads(result.stdout)
            if svc_data.get("spec", {}).get("type") == "NodePort":
                # Get node IP
                node_result = run("kubectl get nodes -o json")
                if node_result.stdout:
                    nodes = json.loads(node_result.stdout)
                    for node in nodes.get("items", []):
                        addresses = node.get("status", {}).get("addresses", [])
                        for addr in addresses:
                            if addr.get("type") == "ExternalIP":
                                config.gateway_ip = addr.get("address")
                                ports = svc_data.get("spec", {}).get("ports", [])
                                for port in ports:
                                    if port.get("name") == "http2" and port.get("nodePort"):
                                        config.gateway_port = str(port.get("nodePort"))
                                log(f"Using NodePort: {config.gateway_ip}:{config.gateway_port}", "SUCCESS")
                                return
        except (ValueError, AttributeError, TypeError):
            pass  # malformed kubectl JSON: fall through to the error below
    
    # No fallback - require proper gateway
    raise RuntimeError("No gateway found - Istio ingress gateway required for production monitoring")

# Configure gateway
try:
    configure_gateway()
except Exception as e:
    log(f"Gateway configuration error: {e}", "ERROR")
    raise

log(f"🔬 Production Data Science Monitoring | Gateway: http://{config.gateway_ip}:{config.gateway_port} | Namespace: {config.namespace}", "SUCCESS")
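
The LoadBalancer lookup inside `configure_gateway` can be exercised without a cluster if the JSON parsing is factored into a pure function. The sketch below shows one way to do that; `extract_lb_address` and the sample hostname are hypothetical and only mirror the `status.loadBalancer.ingress` fields read above.

```python
import json
from typing import Optional

def extract_lb_address(service_json: str) -> Optional[str]:
    """Return the first LoadBalancer IP or hostname, or None if absent."""
    try:
        svc = json.loads(service_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(svc, dict):
        return None
    ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not ingress or not isinstance(ingress[0], dict):
        return None
    # Prefer the IP, then fall back to the hostname (as some cloud LBs report).
    return ingress[0].get("ip") or ingress[0].get("hostname")

# Hypothetical service document with a hostname-only ingress entry.
sample = json.dumps(
    {"status": {"loadBalancer": {"ingress": [{"hostname": "gw.example.internal"}]}}}
)
print(extract_lb_address(sample))
```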

## Deploy Prerequisites

Before deploying advanced monitoring, ensure base components are available:

In [ ]:
# Production prerequisite validation and deployment
def validate_prerequisites():
    """Validate all prerequisites for production monitoring"""
    log("Validating prerequisites for data science monitoring...", "INFO")
    
    issues = []
    
    # Check Seldon CRDs
    crds = ["servers", "models", "pipelines"]
    for crd in crds:
        result = run(f"kubectl get crd {crd}.mlops.seldon.io")
        if result.returncode != 0:
            issues.append(f"Missing CRD: {crd}.mlops.seldon.io")
    
    # Check namespace
    result = run(f"kubectl get namespace {config.namespace}")
    if result.returncode != 0:
        issues.append(f"Namespace {config.namespace} does not exist")
    
    # Check critical services
    services = {
        "scheduler": "seldon-scheduler",
        "modelgateway": "seldon-modelgateway"
    }
    
    for name, svc in services.items():
        result = run(f"kubectl get svc {svc} -n {config.namespace}")
        if result.returncode != 0:
            issues.append(f"Service {svc} not found")
    
    if issues:
        for issue in issues:
            log(issue, "ERROR")
        raise RuntimeError("Prerequisites not met for production monitoring")
    
    log("All prerequisites validated", "SUCCESS")
    return True

# Validate prerequisites
validate_prerequisites()

# Check or deploy MLServer
def ensure_mlserver():
    """Ensure MLServer is available for monitoring models"""
    result = run(f"kubectl get server mlserver -n {config.namespace} -o json")
    
    if result.returncode == 0 and result.stdout:
        try:
            server_data = json.loads(result.stdout)
            state = server_data.get("status", {}).get("state", "Unknown")
            loaded_models = server_data.get("status", {}).get("loadedModels", 0)
            replicas = server_data.get("spec", {}).get("replicas", 0)
            
            if state == "Ready":
                log(f"MLServer ready with {replicas} replicas, {loaded_models} models loaded", "INFO")
                deployed["servers"].append("mlserver")
                return True
        except (ValueError, AttributeError):
            pass  # unreadable server status: fall through and (re)deploy
    
    # Deploy MLServer for monitoring
    log("Deploying MLServer for monitoring models...", "INFO")
    server_yaml = f"""apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
  namespace: {config.namespace}
  labels:
    app: data-science-monitoring
    component: inference-server
spec:
  replicas: 3
  serverConfig: mlserver
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1000m"
  scaling:
    minReplicas: 3
    maxReplicas: 6
    metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70"""
    
    with open("mlserver-monitoring.yaml", "w") as f: 
        f.write(server_yaml)
    
    result = run(f"kubectl apply -f mlserver-monitoring.yaml")
    if result.returncode != 0:
        raise RuntimeError(f"Failed to deploy MLServer: {result.stderr}")
    
    # Wait for server
    ready = False
    for i in range(36):  # 3 minutes
        result = run(f"kubectl get server mlserver -n {config.namespace} -o jsonpath='{{.status.state}}'")
        if result.stdout.strip() == "Ready":
            ready = True
            break
        time.sleep(5)
    
    if ready:
        log("MLServer deployed successfully", "SUCCESS")
        deployed["servers"].append("mlserver")
        return True
    else:
        raise RuntimeError("MLServer deployment timeout")

# Ensure MLServer is ready
ensure_mlserver()

# Deploy base models if missing
base_models = [
    {
        "name": "feature-transformer",
        "uri": "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
        "purpose": "Feature preprocessing for monitoring"
    },
    {
        "name": "product-classifier-v1",
        "uri": "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
        "purpose": "Base model to monitor"
    }
]

log("Checking base models for monitoring...", "INFO")

for model_info in base_models:
    result = run(f"kubectl get model {model_info['name']} -n {config.namespace} -o jsonpath='{{.status.state}}'")
    
    if result.stdout.strip() == "ModelReady":
        log(f"Model {model_info['name']} already deployed", "INFO")
        deployed["models"].append(model_info['name'])
        continue
    
    log(f"Deploying base model: {model_info['name']}", "INFO")
    model_yaml = f"""apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: {model_info['name']}
  namespace: {config.namespace}
  labels:
    app: data-science-monitoring
    component: base-model
spec:
  storageUri: {model_info['uri']}
  # Capability tags matched against the server, not pip requirement specifiers
  requirements: ["sklearn"]
  memory: 512Mi
  cpu: 500m
  replicas: 2"""
    
    with open(f"{model_info['name']}-base.yaml", "w") as f: 
        f.write(model_yaml)
    
    result = run(f"kubectl apply -f {model_info['name']}-base.yaml")
    if result.returncode != 0:
        log(f"Failed to deploy {model_info['name']}: {result.stderr}", "ERROR")
        continue
    
    # Wait for model
    ready = False
    for i in range(48):  # 4 minutes
        result = run(f"kubectl get model {model_info['name']} -n {config.namespace} -o jsonpath='{{.status.state}}'")
        if result.stdout.strip() == "ModelReady":
            ready = True
            break
        time.sleep(5)
    
    if ready:
        log(f"Model {model_info['name']} deployed successfully", "SUCCESS")
        deployed["models"].append(model_info['name'])
    else:
        log(f"Model {model_info['name']} deployment timeout", "WARNING")

log("Prerequisites ready for advanced monitoring", "SUCCESS")
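
The readiness loops in this cell all follow the same poll-and-sleep pattern, which could be captured once in a small helper. The sketch below is one possible shape for it; `wait_for` is a hypothetical name, and the injected `sleep` argument exists only so the loop can be tested without real delays.

```python
import time

def wait_for(check, attempts=36, interval=5.0, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        sleep(interval)
    return False

# Simulated state sequence; the intermediate values are placeholders.
states = iter(["", "NotReady", "ModelReady"])
ready = wait_for(
    lambda: next(states) == "ModelReady",
    attempts=5,
    interval=0,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(ready)
```

In the notebook, `check` would wrap the `kubectl get ... -o jsonpath` call and compare its output with the expected state.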

## Deploy Advanced Monitoring Components

In [ ]:
# Deploy production-grade advanced monitoring components
log("Deploying production data science monitoring components...", "INFO")

monitoring_models = [
    {
        "name": "drift-detector",
        "uri": "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
        "purpose": "Statistical drift detection for data quality",
        "memory": "1Gi",
        "cpu": "1000m",
        "replicas": 2,
        "env": [
            {"name": "DRIFT_THRESHOLD", "value": str(config.drift_threshold)},
            {"name": "ALERT_ENABLED", "value": "true"},
            {"name": "LOG_LEVEL", "value": "INFO"}
        ]
    },
    {
        "name": "model-explainer",
        "uri": "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
        "purpose": "Anchor-based explanations for model interpretability",
        "memory": "2Gi",
        "cpu": "1500m",
        "replicas": 2,
        "env": [
            {"name": "EXPLANATION_METHOD", "value": "anchor"},
            {"name": "MAX_FEATURES", "value": "10"},
            {"name": "CACHE_EXPLANATIONS", "value": "true"}
        ]
    },
    {
        "name": "performance-monitor",
        "uri": "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
        "purpose": "Real-time model performance tracking",
        "memory": "512Mi",
        "cpu": "500m",
        "replicas": 1,
        "env": [
            {"name": "PERFORMANCE_THRESHOLD", "value": str(config.performance_threshold)},
            {"name": "MONITORING_WINDOW", "value": "300"}  # 5 minutes
        ]
    },
    {
        "name": "bias-detector",
        "uri": "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn",
        "purpose": "Fairness and bias monitoring across segments",
        "memory": "1Gi",
        "cpu": "1000m",
        "replicas": 1,
        "env": [
            {"name": "FAIRNESS_METRICS", "value": "demographic_parity,equal_opportunity"},
            {"name": "PROTECTED_ATTRIBUTES", "value": "age,gender,race"}
        ]
    }
]

# Deploy monitoring models with production configuration
deployed_count = 0
for model_info in monitoring_models:
    # Check if already deployed
    result = run(f"kubectl get model {model_info['name']} -n {config.namespace} -o jsonpath='{{.status.state}}'")
    if result.stdout.strip() == "ModelReady":
        log(f"Model {model_info['name']} already deployed", "INFO")
        deployed["models"].append(model_info['name'])
        deployed_count += 1
        continue
    
    # Build environment variables
    env_yaml = ""
    if model_info.get("env"):
        env_yaml = "\n  env:"
        for env_var in model_info["env"]:
            env_yaml += f"\n    - name: {env_var['name']}\n      value: \"{env_var['value']}\""
    
    # Deploy model with production settings
    model_yaml = f"""apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: {model_info['name']}
  namespace: {config.namespace}
  labels:
    app: data-science-monitoring
    component: monitoring-model
    criticality: high
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
    seldon.io/svc-name: "{model_info['name']}"
    seldon.io/model-type: "monitoring"
spec:
  storageUri: {model_info['uri']}
  # Capability tags matched against the server, not pip requirement specifiers
  requirements: ["sklearn"]
  memory: {model_info['memory']}
  cpu: {model_info['cpu']}
  replicas: {model_info.get('replicas', 1)}
  server: mlserver{env_yaml}
"""
    
    with open(f"{model_info['name']}.yaml", "w") as f: 
        f.write(model_yaml)
    
    result = run(f"kubectl apply -f {model_info['name']}.yaml")
    if result.returncode != 0:
        log(f"Failed to deploy {model_info['name']}: {result.stderr}", "ERROR")
        continue
    
    # Wait for model with production timeout
    ready = False
    for i in range(60):  # 5 minutes
        result = run(f"kubectl get model {model_info['name']} -n {config.namespace} -o jsonpath='{{.status.state}}'")
        state = result.stdout.strip()
        if state == "ModelReady":
            ready = True
            break
        elif state == "ModelFailed":
            log(f"Model {model_info['name']} failed to deploy", "ERROR")
            break
        time.sleep(5)
    
    if ready:
        log(f"✅ **{model_info['name']}**: {model_info['purpose']}", "SUCCESS")
        deployed["models"].append(model_info['name'])
        deployed_count += 1
    elif state != "ModelFailed":  # a failure was already logged inside the loop
        log(f"Model {model_info['name']} deployment timeout", "WARNING")

log(f"Deployed {deployed_count}/{len(monitoring_models)} monitoring components", "SUCCESS" if deployed_count == len(monitoring_models) else "WARNING")

display(Markdown(f"""
### 🔬 **Production Monitoring Components:**

**Statistical Monitoring:**
- **Drift Detector**: Real-time feature distribution monitoring
- **Performance Monitor**: Model accuracy and latency tracking

**Explainability & Fairness:**
- **Model Explainer**: Anchor-based explanations for compliance
- **Bias Detector**: Fairness metrics across demographic segments

**Production Features:**
- ✅ **High Availability**: Multiple replicas for critical components
- ✅ **Auto-scaling**: HPA configured for dynamic load
- ✅ **Prometheus Integration**: All metrics exposed for monitoring
- ✅ **Configurable Thresholds**: Environment-based configuration
"""))

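To make "fairness metrics across demographic segments" more concrete, the sketch below computes a demographic parity ratio: the lowest positive-prediction rate across groups divided by the highest. The function name, the toy predictions, and the group labels are all hypothetical; this illustrates the metric itself, not what the deployed `bias-detector` artifact computes.

```python
from collections import defaultdict

def demographic_parity_ratio(predictions, groups, positive_label=1):
    """Return (ratio, per-group positive rates); a ratio of 1.0 means parity."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == positive_label)
    rates = {group: positives[group] / totals[group] for group in totals}
    highest = max(rates.values())
    # If no group receives positive predictions, treat the rates as equal.
    ratio = min(rates.values()) / highest if highest > 0 else 1.0
    return ratio, rates

# Toy data (hypothetical): group "a" gets positives far more often than "b".
preds = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
ratio, rates = demographic_parity_ratio(preds, groups)
print(ratio, rates)
```

A common rule of thumb flags ratios below 0.8 for review, though the appropriate cut-off depends on the application.
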
## Deploy Monitoring Pipelines

In [ ]:
# Deploy production monitoring pipelines
monitoring_pipelines = [
    {
        "name": "real-time-monitoring",
        "description": "Real-time drift and performance monitoring",
        "models": ["feature-transformer", "product-classifier-v1", "drift-detector", "performance-monitor"]
    },
    {
        "name": "explanation-service",
        "description": "On-demand model explanations for compliance",
        "models": ["feature-transformer", "product-classifier-v1", "model-explainer"]
    },
    {
        "name": "fairness-monitoring",
        "description": "Bias and fairness tracking across segments",
        "models": ["feature-transformer", "product-classifier-v1", "bias-detector"]
    },
    {
        "name": "comprehensive-monitoring",
        "description": "All monitoring components for critical models",
        "models": ["feature-transformer", "product-classifier-v1", "drift-detector", "model-explainer", "performance-monitor", "bias-detector"]
    }
]

log("Deploying production monitoring pipelines...", "INFO")

deployed_pipelines = 0
for pipeline_info in monitoring_pipelines:
    # Check all required models are available
    missing_models = [m for m in pipeline_info["models"] if m not in deployed["models"]]
    if missing_models:
        log(f"Cannot deploy {pipeline_info['name']} - missing models: {missing_models}", "WARNING")
        continue
    
    # Build pipeline YAML
    pipeline_yaml = f"""apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: {pipeline_info['name']}
  namespace: {config.namespace}
  labels:
    app: data-science-monitoring
    monitoring-type: {pipeline_info['name'].replace('-', '_')}
spec:
  steps:"""
    
    # Add pipeline steps
    for i, model in enumerate(pipeline_info["models"]):
        if i == 0:  # First model (feature-transformer)
            pipeline_yaml += f"\n    - name: {model}"
        elif model == "product-classifier-v1":  # Main model
            pipeline_yaml += f"\n    - name: {model}"
            pipeline_yaml += f"\n      inputs: [{pipeline_info['name']}.inputs.predict]"
            pipeline_yaml += f"\n      tensorMap:"
            pipeline_yaml += f"\n        {pipeline_info['name']}.inputs.predict: predict"
        else:  # Monitoring models
            pipeline_yaml += f"\n    - name: {model}"
            pipeline_yaml += f"\n      inputs: [{pipeline_info['name']}.inputs.predict"
            if model in ["model-explainer", "bias-detector"]:
                pipeline_yaml += f", product-classifier-v1.outputs"
            pipeline_yaml += "]"
            if "detector" in model or "monitor" in model:
                pipeline_yaml += f"\n      tensorMap:"
                pipeline_yaml += f"\n        {pipeline_info['name']}.inputs.predict: features"
    
    # Set output steps
    pipeline_yaml += f"\n  output:"
    pipeline_yaml += f"\n    steps: [product-classifier-v1"
    
    # Add monitoring outputs
    monitoring_outputs = [m for m in pipeline_info["models"] if m not in ["feature-transformer", "product-classifier-v1"]]
    if monitoring_outputs:
        pipeline_yaml += ", " + ", ".join(monitoring_outputs)
    pipeline_yaml += "]"
    
    # Write and deploy
    with open(f"{pipeline_info['name']}.yaml", "w") as f:
        f.write(pipeline_yaml)
    
    result = run(f"kubectl apply -f {pipeline_info['name']}.yaml")
    if result.returncode != 0:
        log(f"Failed to deploy pipeline {pipeline_info['name']}: {result.stderr}", "ERROR")
        continue
    
    # Wait for pipeline
    ready = False
    for i in range(60):  # 5 minutes
        result = run(f"kubectl get pipeline {pipeline_info['name']} -n {config.namespace} -o json")
        if result.returncode == 0 and result.stdout:
            try:
                pipeline_data = json.loads(result.stdout)
                conditions = pipeline_data.get("status", {}).get("conditions", [])
                for condition in conditions:
                    if condition.get("type") == "Ready" and condition.get("status") == "True":
                        ready = True
                        break
            except (ValueError, AttributeError):
                pass  # status not parseable yet: retry on the next poll
        if ready:
            break
        time.sleep(5)
    
    if ready:
        log(f"✅ **{pipeline_info['name']}**: {pipeline_info['description']}", "SUCCESS")
        deployed["pipelines"].append(pipeline_info['name'])
        deployed_pipelines += 1
    else:
        log(f"Pipeline {pipeline_info['name']} deployment timeout", "WARNING")

log(f"Deployed {deployed_pipelines}/{len(monitoring_pipelines)} monitoring pipelines", "SUCCESS" if deployed_pipelines == len(monitoring_pipelines) else "WARNING")

display(Markdown(f"""
### 🌐 **Production Monitoring Endpoints:**

**Individual Components:**
- 🔍 **Drift Detection**: `http://{config.gateway_ip}:{config.gateway_port}/v2/models/drift-detector/infer`
- 🎯 **Model Explanation**: `http://{config.gateway_ip}:{config.gateway_port}/v2/models/model-explainer/infer`
- 📊 **Performance Monitor**: `http://{config.gateway_ip}:{config.gateway_port}/v2/models/performance-monitor/infer`
- ⚖️ **Bias Detection**: `http://{config.gateway_ip}:{config.gateway_port}/v2/models/bias-detector/infer`

**Integrated Pipelines:**
{chr(10).join(f"- **{pipeline}**: `http://{config.gateway_ip}:{config.gateway_port}/v2/models/{pipeline}/infer`" for pipeline in deployed['pipelines'])}

### 📊 **Monitoring Response Schema:**

**Drift Detection Output:**
```json
{{
  "outputs": [
    {{"name": "prediction", "data": [2]}},  // Model prediction
    {{"name": "drift_score", "data": [0.023]}},  // Drift score (0-1)
    {{"name": "drift_detected", "data": [false]}},  // Boolean flag
    {{"name": "feature_drift", "data": [0.01, 0.02, 0.03, 0.02]}}  // Per-feature drift
  ]
}}
```

**Explanation Output:**
```json
{{
  "outputs": [
    {{"name": "prediction", "data": [2]}},
    {{"name": "explanation", "data": ["petal_length > 4.8 AND petal_width > 1.6"]}},
    {{"name": "feature_importance", "data": [0.1, 0.2, 0.5, 0.2]}},
    {{"name": "confidence", "data": [0.95]}}
  ]
}}
```

**Bias Detection Output:**
```json
{{
  "outputs": [
    {{"name": "prediction", "data": [2]}},
    {{"name": "demographic_parity", "data": [0.92]}},  // Fairness score
    {{"name": "equal_opportunity", "data": [0.88]}},
    {{"name": "bias_detected", "data": [false]}}
  ]
}}
```
"""))

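Responses shaped like the schemas above can be flattened into a name-to-data mapping before any thresholds are applied. The sketch below assumes the output names shown in those schemas (`drift_score` and so on); `parse_outputs` and `drift_flag` are hypothetical helpers, and a model that emits different output names would need different keys.

```python
def parse_outputs(response_body):
    """Map each output tensor name to its data list."""
    return {
        output.get("name", "unknown"): output.get("data", [])
        for output in response_body.get("outputs", [])
    }

def drift_flag(outputs, threshold=0.15):
    """Return True/False when a drift score is present, else None."""
    scores = outputs.get("drift_score")
    if not scores:
        return None  # the model did not report a drift score
    return float(scores[0]) > threshold

# Example body following the drift detection schema above.
body = {
    "outputs": [
        {"name": "prediction", "data": [2]},
        {"name": "drift_score", "data": [0.023]},
    ]
}
outputs = parse_outputs(body)
print(outputs, drift_flag(outputs))
```

Returning `None` for a missing score keeps "no drift" distinct from "no drift information", which matters when a model only returns predictions.
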
## Test Advanced Monitoring Components

In [ ]:
# Production monitoring test suite
class ProductionMonitoringClient:
    def __init__(self, gateway_ip, gateway_port, namespace):
        self.gateway_ip = gateway_ip
        self.gateway_port = gateway_port
        self.namespace = namespace
        self.session = requests.Session()
        
    def test_monitoring(self, name, data, is_pipeline=False, show_details=True):
        """Test monitoring component with production error handling"""
        url = f"http://{self.gateway_ip}:{self.gateway_port}/v2/models/{name}/infer"
        payload = {
            "inputs": [{
                "name": "predict", 
                "shape": [len(data), len(data[0])], 
                "datatype": "FP32", 
                "data": data
            }]
        }
        headers = {
            "Content-Type": "application/json", 
            "Seldon-Model": f"{name}.pipeline" if is_pipeline else name
        }
        
        if self.gateway_ip not in ["localhost", "127.0.0.1"]:
            headers["Host"] = f"{self.namespace}.inference.seldon.test"
        
        try:
            response = self.session.post(url, json=payload, headers=headers, timeout=config.timeout)
            
            if response.status_code == 200:
                result = response.json()
                outputs = result.get("outputs", [])
                
                # Process monitoring outputs
                monitoring_results = {}
                for output in outputs:
                    output_name = output.get("name", "unknown")
                    output_data = output.get("data", [])
                    monitoring_results[output_name] = output_data
                
                if show_details:
                    self._display_monitoring_results(name, monitoring_results)
                
                return monitoring_results
            else:
                log(f"Failed {name}: HTTP {response.status_code} - {response.text[:200]}", "ERROR")
                return None
                
        except Exception as e:
            log(f"Error testing {name}: {str(e)}", "ERROR")
            return None
    
    def _display_monitoring_results(self, name, results):
        """Display monitoring results in production format"""
        if "drift-detector" in name:
            drift_score = results.get("drift_score", [0])[0] if results.get("drift_score") else 0
            drift_detected = drift_score > config.drift_threshold
            
            # Update metrics
            metrics.drift_scores.append(drift_score)
            if drift_detected:
                metrics.drift_detections += 1
            
            display(Markdown(f"""
**🔍 Drift Detection Results:**
- **Drift Score**: {drift_score:.4f} {'🔴 DRIFT DETECTED' if drift_detected else '🟢 Normal'}
- **Threshold**: {config.drift_threshold}
- **Action Required**: {'Yes - Investigate data changes' if drift_detected else 'No - Continue monitoring'}
"""))
            
        elif "model-explainer" in name:
            explanation = results.get("explanation", ["No explanation"])[0] if results.get("explanation") else "No explanation"
            importance = results.get("feature_importance", [])
            
            metrics.explanations_generated += 1
            
            display(Markdown(f"""
**🎯 Model Explanation:**
- **Rule**: {explanation}
- **Feature Importance**: {importance}
- **Compliance Ready**: ✅ Explanation logged for audit
"""))
            
        elif "performance-monitor" in name:
            performance = results.get("performance_score", [0])[0] if results.get("performance_score") else 0
            
            if performance < config.performance_threshold:
                log(f"Performance degradation detected: {performance:.2f}", "WARNING")
            
            display(Markdown(f"""
**📊 Performance Monitoring:**
- **Current Performance**: {performance:.2f} {'⚠️ Below threshold' if performance < config.performance_threshold else '✅ Normal'}
- **Threshold**: {config.performance_threshold}
"""))
            
        elif "bias-detector" in name:
            dp_score = results.get("demographic_parity", [0])[0] if results.get("demographic_parity") else 0
            eo_score = results.get("equal_opportunity", [0])[0] if results.get("equal_opportunity") else 0
            
            display(Markdown(f"""
**⚖️ Fairness Monitoring:**
- **Demographic Parity**: {dp_score:.2f}
- **Equal Opportunity**: {eo_score:.2f}
- **Bias Status**: {'⚠️ Potential bias' if min(dp_score, eo_score) < 0.8 else '✅ Fair'}
"""))

# Initialize monitoring client
monitoring_client = ProductionMonitoringClient(config.gateway_ip, config.gateway_port, config.namespace)

log("Testing production monitoring components...", "INFO")

# Test data scenarios
test_scenarios = [
    {
        "name": "Normal Data",
        "data": [[5.1, 3.5, 1.4, 0.2]],  # Normal iris setosa
        "expected": "No drift expected"
    },
    {
        "name": "Slight Variation",
        "data": [[5.5, 3.8, 1.5, 0.3]],  # Slightly different
        "expected": "Minor drift possible"
    },
    {
        "name": "Anomalous Data",
        "data": [[10.0, 8.0, 6.0, 3.0]],  # Out of distribution
        "expected": "High drift expected"
    },
    {
        "name": "Edge Case",
        "data": [[4.0, 2.0, 1.0, 0.1]],  # Edge of distribution
        "expected": "Moderate drift possible"
    }
]

# Test individual components
display(Markdown("## 🧪 Testing Individual Monitoring Components"))

for scenario in test_scenarios:
    display(Markdown(f"### Testing: {scenario['name']} ({scenario['expected']})"))
    display(Markdown(f"Data: `{scenario['data'][0]}`"))
    
    # Test drift detection
    if "drift-detector" in deployed["models"]:
        monitoring_client.test_monitoring("drift-detector", scenario["data"])
    
    # Test explanations for edge cases
    if scenario["name"] in ["Anomalous Data", "Edge Case"] and "model-explainer" in deployed["models"]:
        monitoring_client.test_monitoring("model-explainer", scenario["data"])
    
    metrics.total_monitored += 1

# Test integrated pipelines
if deployed["pipelines"]:
    display(Markdown("## 🔗 Testing Integrated Monitoring Pipelines"))
    
    # Test comprehensive monitoring
    if "comprehensive-monitoring" in deployed["pipelines"]:
        display(Markdown("### Testing Comprehensive Monitoring Pipeline"))
        
        test_batch = [
            [5.1, 3.5, 1.4, 0.2],  # Normal
            [6.5, 3.0, 5.5, 1.8],  # Different class
            [8.0, 6.0, 4.0, 2.0]   # Anomalous
        ]
        
        for i, data in enumerate(test_batch):
            display(Markdown(f"**Test {i+1}**: {data}"))
            monitoring_client.test_monitoring(
                "comprehensive-monitoring", 
                [data], 
                is_pipeline=True,
                show_details=True
            )
            time.sleep(0.5)

# Display monitoring summary
display(Markdown(f"""
## 📊 **Monitoring Test Summary**

**Test Results:**
- 📋 **Total Samples Monitored**: {metrics.total_monitored}
- 🔍 **Drift Detections**: {metrics.drift_detections}
- 🎯 **Explanations Generated**: {metrics.explanations_generated}
- 📈 **Average Drift Score**: {np.mean(metrics.drift_scores) if metrics.drift_scores else 0:.4f}

**System Health:**
- ✅ **Monitoring Pipeline**: Operational
- {'⚠️' if metrics.drift_detections > 0 else '✅'} **Drift Detection**: {'Alert - High drift detected' if metrics.drift_detections > 0 else 'Normal operations'}
- ✅ **Explainability**: Ready for compliance
- ✅ **Fairness Tracking**: Enabled

**Next Steps:**
1. Configure alerts for drift scores > {config.drift_threshold}
2. Set up automated retraining triggers
3. Create compliance reports with explanations
4. Monitor fairness metrics across user segments
"""))

log("Production monitoring testing complete", "SUCCESS")
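
The request payload built inside `test_monitoring` can be validated before it is sent. The sketch below is one way to do that under the V2 inference protocol, which accepts tensor data in flattened row-major form; `build_v2_payload` is a hypothetical helper, and the `predict` input name and `FP32` datatype simply follow the client above.

```python
def build_v2_payload(rows, name="predict", datatype="FP32"):
    """Build a V2 inference request body from a list of equal-length rows."""
    if not rows or not rows[0]:
        raise ValueError("rows must be a non-empty list of non-empty rows")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of features")
    # Flatten row-major so the data length matches the product of the shape.
    flat = [float(value) for row in rows for value in row]
    return {
        "inputs": [{
            "name": name,
            "shape": [len(rows), width],
            "datatype": datatype,
            "data": flat,
        }]
    }

payload = build_v2_payload([[5.1, 3.5, 1.4, 0.2], [6.5, 3.0, 5.5, 1.8]])
print(payload["inputs"][0]["shape"], len(payload["inputs"][0]["data"]))
```

Rejecting ragged batches on the client side gives a clearer error than a shape mismatch reported by the server.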

## Monitoring Integration Summary

In [ ]:
# Summary
display(Markdown(f"""
### 📊 **Advanced Monitoring Integration Summary:**

**Components Successfully Deployed:**
- 🔍 **Drift Detector**: Statistical monitoring of iris feature distributions (4 features)
- 🎯 **Model Explainer**: Anchor-based interpretability for iris predictions  
- 🔗 **Monitoring Pipeline**: Real-time drift detection during inference
- 📊 **Explanation Pipeline**: On-demand prediction explanations

### 🔬 **Monitoring Request/Response Examples:**

**Drift Detection Request:**
```http
POST /v2/models/drift-detector/infer HTTP/1.1
Host: {config.gateway_ip}:{config.gateway_port}
Content-Type: application/json
Seldon-Model: drift-detector

{{
  \"inputs\": [
    {{
      \"name\": \"predict\",
      \"shape\": [1, 4],
      \"datatype\": \"FP32\",
      \"data\": [[10.0, 8.0, 6.0, 3.0]]
    }}
  ]
}}
```

**Drift Detection Response:**
```http
HTTP/1.1 200 OK
Content-Type: application/json
X-Seldon-Model: drift-detector
X-Processing-Time-Ms: 123

{{
  \"outputs\": [
    {{
      \"name\": \"drift_score\",
      \"shape\": [1],
      \"datatype\": \"FP32\",
      \"data\": [0.85]
    }}
  ]
}}
```

**Explanation Request:**
```http
POST /v2/models/model-explainer/infer HTTP/1.1
Host: {config.gateway_ip}:{config.gateway_port}
Content-Type: application/json
Seldon-Model: model-explainer

{{
  \"inputs\": [
    {{
      \"name\": \"predict\", 
      \"shape\": [1, 4],
      \"datatype\": \"FP32\",
      \"data\": [[6.5, 3.0, 5.5, 1.8]]
    }}
  ]
}}
```
"""))
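
The request and response bodies above follow the Open Inference Protocol (V2). As a minimal sketch, the helpers below build such a request body from a list of feature rows and flatten a response into a name-to-data mapping. The function names `build_v2_request` and `parse_v2_outputs` are hypothetical and not part of Seldon or MLServer; the input name `predict` and datatype `FP32` are copied from the examples above.

```python
from typing import Any, Dict, List, Sequence


def build_v2_request(rows: Sequence[Sequence[float]],
                     name: str = "predict",
                     datatype: str = "FP32") -> Dict[str, Any]:
    """Build a V2 inference request body for a rectangular batch of rows."""
    if not rows:
        raise ValueError("rows must contain at least one sample")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of features")
    return {
        "inputs": [{
            "name": name,
            "shape": [len(rows), width],  # batch size, features per sample
            "datatype": datatype,
            "data": [list(row) for row in rows],
        }]
    }


def parse_v2_outputs(body: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Map each output tensor name to its data list."""
    return {out["name"]: out["data"] for out in body.get("outputs", [])}


# Example with the payloads shown above (no network call is made here)
request = build_v2_request([[10.0, 8.0, 6.0, 3.0]])
response = {"outputs": [{"name": "drift_score", "shape": [1],
                         "datatype": "FP32", "data": [0.85]}]}
scores = parse_v2_outputs(response)
```

Looking outputs up by name rather than by position keeps client code stable if the model later adds or reorders output tensors.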

## Production Integration Examples

In [ ]:
display(Markdown(f"""
## 🚀 **Production Integration Patterns**

### 1. **Real-Time Drift Monitoring with Auto-Remediation**

```python
import asyncio
from datetime import datetime, timedelta

import requests

class DriftMonitor:
    def __init__(self, threshold=0.15, window_minutes=5, min_samples=3):
        self.threshold = threshold
        self.window = timedelta(minutes=window_minutes)
        self.min_samples = min_samples  # readings required before acting
        self.drift_history = []
        
    async def monitor_continuously(self):
        \"\"\"Continuous drift monitoring with alerts\"\"\"
        while True:
            # Get recent predictions (get_recent_features is a placeholder you must implement)
            features = await get_recent_features()
            if not features:
                await asyncio.sleep(60)
                continue
            
            # Check drift; run the blocking HTTP call off the event loop
            response = await asyncio.to_thread(
                requests.post,
                "http://{config.gateway_ip}:{config.gateway_port}/v2/models/drift-detector/infer",
                json={{
                    "inputs": [{{
                        "name": "predict",
                        "datatype": "FP32",
                        "shape": [len(features), 4],
                        "data": features
                    }}]
                }},
                timeout=10
            )
            response.raise_for_status()
            
            drift_scores = response.json()["outputs"][0]["data"]
            avg_drift = sum(drift_scores) / len(drift_scores)
            
            self.drift_history.append({{
                "timestamp": datetime.now(),
                "score": avg_drift
            }})
            
            # Check if drift persists over window
            recent_drift = [d for d in self.drift_history 
                          if d["timestamp"] > datetime.now() - self.window]
            
            # Require several readings so a single spike does not trigger retraining
            if (len(recent_drift) >= self.min_samples
                    and all(d["score"] > self.threshold for d in recent_drift)):
                await self.trigger_retraining()
                
            await asyncio.sleep(60)  # Check every minute
    
    async def trigger_retraining(self):
        \"\"\"Trigger model retraining pipeline\"\"\"
        # Send alert
        alert_payload = {{
            "alert": "DataDriftDetected",
            "severity": "high",
            "action": "retrain_required",
            "drift_scores": [d["score"] for d in self.drift_history[-10:]]
        }}
        
        # Trigger Kubeflow/Airflow pipeline (placeholder URL; replace with your endpoint)
        await asyncio.to_thread(
            requests.post,
            "http://kubeflow-api/pipelines/retrain/trigger",
            json=alert_payload,
            timeout=10
        )
```
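
The persistence check in `monitor_continuously` can be isolated into a pure function, which makes it easy to unit test without a running gateway. The sketch below is one assumed way to do that: `drift_persists` and its `min_samples` parameter are hypothetical names, and history entries are plain (timestamp, score) tuples rather than the dictionaries used above.

```python
from datetime import datetime, timedelta


def drift_persists(history, threshold, window, now, min_samples=3):
    # history: list of (timestamp, score) tuples (assumed layout)
    recent = [score for ts, score in history if ts > now - window]
    # A single high reading is not persistence; require min_samples readings
    if len(recent) < min_samples:
        return False
    return all(score > threshold for score in recent)


# Four readings, one per minute, all above the 0.15 threshold
now = datetime(2024, 1, 1, 12, 0)
history = [(now - timedelta(minutes=m), 0.3) for m in (4, 3, 2, 1)]
print(drift_persists(history, 0.15, timedelta(minutes=5), now))
```

Passing `now` in as an argument, instead of calling `datetime.now()` inside the function, keeps the check deterministic in tests.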

### 2. **Compliance-Ready Explanation Service**

```python
from datetime import datetime

import requests

class ComplianceExplainer:
    def __init__(self, model_name="product-classifier-v1"):
        self.model_name = model_name
        self.explanation_cache = {{}}
        
    def get_explanation_with_audit(self, features, user_id, decision_id):
        \"\"\"Get explanation with full audit trail\"\"\"
        
        # Check cache first
        cache_key = f"{{user_id}}_{{decision_id}}"
        if cache_key in self.explanation_cache:
            return self.explanation_cache[cache_key]
        
        # Get prediction and explanation
        response = requests.post(
            "http://{config.gateway_ip}:{config.gateway_port}/v2/models/explanation-service/infer",
            json={{
                "inputs": [{{
                    "name": "predict",
                    "datatype": "FP32",
                    "shape": [1, 4],
                    "data": [features]
                }}]
            }},
            headers={{"Seldon-Model": "explanation-service.pipeline"}}
        )
        
        result = response.json()
        outputs = {{o["name"]: o["data"] for o in result["outputs"]}}
        
        # Create audit record
        audit_record = {{
            "decision_id": decision_id,
            "user_id": user_id,
            "timestamp": datetime.now().isoformat(),
            "model_version": self.model_name,
            "features": features,
            "prediction": outputs.get("prediction", [None])[0],
            "explanation": outputs.get("explanation", ["No explanation"])[0],
            "feature_importance": outputs.get("feature_importance", []),
            "confidence": outputs.get("confidence", [0])[0]
        }}
        
        # Store in compliance database
        self.store_audit_record(audit_record)
        
        # Cache for performance
        self.explanation_cache[cache_key] = audit_record
        
        return audit_record
    
    def generate_compliance_report(self, start_date, end_date):
        \"\"\"Generate compliance report for regulatory review\"\"\"
        # Query audit records
        records = self.query_audit_records(start_date, end_date)
        total = len(records)
        
        report = {{
            "period": f"{{start_date}} to {{end_date}}",
            "total_decisions": total,
            "model_versions": sorted(set(r["model_version"] for r in records)),
            # Guard against division by zero when the period has no decisions
            "explanation_coverage": sum(1 for r in records if r["explanation"] != "No explanation") / total if total else 0.0,
            "average_confidence": sum(r["confidence"] for r in records) / total if total else 0.0,
            "feature_importance_summary": self.summarize_feature_importance(records)
        }}
        
        return report
```

### 3. **Fairness Monitoring Dashboard**

```python
import requests

class FairnessMonitor:
    def __init__(self):
        self.protected_attributes = ["age_group", "gender", "ethnicity"]
        self.fairness_thresholds = {{
            "demographic_parity": 0.8,
            "equal_opportunity": 0.8,
            "disparate_impact": 0.8
        }}
        
    def check_fairness_batch(self, predictions_df):
        \"\"\"Check fairness metrics for a batch of predictions\"\"\"
        
        fairness_results = {{}}
        
        for attribute in self.protected_attributes:
            # Group by protected attribute
            groups = predictions_df.groupby(attribute)
            
            # Calculate metrics per group
            group_metrics = {{}}
            for group_name, group_data in groups:
                features = group_data[["f1", "f2", "f3", "f4"]].values
                
                # Get fairness metrics
                response = requests.post(
                    "http://{config.gateway_ip}:{config.gateway_port}/v2/models/bias-detector/infer",
                    json={{
                        "inputs": [{{
                            "name": "predict",
                            "datatype": "FP32",
                            "shape": list(features.shape),
                            "data": features.tolist()
                        }}]
                    }}
                )
                
                outputs = response.json()["outputs"]
                group_metrics[group_name] = {{
                    "demographic_parity": outputs[1]["data"][0],
                    "equal_opportunity": outputs[2]["data"][0]
                }}
            
            fairness_results[attribute] = group_metrics
        
        # Check for violations (skip metrics the detector did not report,
        # e.g. disparate_impact, to avoid a KeyError)
        violations = []
        for attribute, metrics_by_group in fairness_results.items():
            for metric, threshold in self.fairness_thresholds.items():
                values = [g[metric] for g in metrics_by_group.values() if metric in g]
                if values and min(values) < threshold:
                    violations.append({{
                        "attribute": attribute,
                        "metric": metric,
                        "min_value": min(values),
                        "threshold": threshold
                    }})
        
        return fairness_results, violations
```
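
The `disparate_impact` threshold of 0.8 above matches the common four-fifths rule, under which the lowest group selection rate should be at least 80% of the highest. The class above delegates metric computation to the `bias-detector` model; as a hedged illustration of what such a ratio measures, here is a minimal local sketch. The helper names `selection_rates` and `disparate_impact` and the toy data are assumptions, not part of the deployed model.

```python
def selection_rates(predictions, groups, positive=1):
    # Fraction of positive predictions per group
    totals = dict()
    positives = dict()
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == positive else 0)
    return dict((g, positives[g] / totals[g]) for g in totals)


def disparate_impact(rates):
    # Ratio of the lowest to the highest selection rate (1.0 means parity)
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # no group received a positive prediction
    return min(rates.values()) / highest


# Toy batch: group "a" gets 3 of 4 positives, group "b" gets 1 of 4
preds = [1, 1, 0, 1, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
rates = selection_rates(preds, groups)
ratio = disparate_impact(rates)
print(rates, ratio, ratio < 0.8)
```

A ratio below 0.8 would count as a violation under the thresholds configured in `FairnessMonitor`.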

### 4. **Production Monitoring Configuration**

```yaml
# prometheus-rules.yaml
groups:
  - name: ml_monitoring
    interval: 30s
    rules:
      - alert: HighDataDrift
        expr: avg(drift_score{{namespace="{config.namespace}"}}) > {config.drift_threshold}
        for: 5m
        labels:
          severity: warning
          team: ml-ops
        annotations:
          summary: "High data drift detected"
          description: "Average drift score {{{{ $value }}}} exceeds threshold"
          
      - alert: ModelPerformanceDegradation
        expr: model_performance{{namespace="{config.namespace}"}} < {config.performance_threshold}
        for: 10m
        labels:
          severity: critical
          team: ml-ops
        annotations:
          summary: "Model performance below threshold"
          description: "Performance score {{{{ $value }}}} is below acceptable level"
          
      - alert: BiasDetected
        expr: min(fairness_score{{namespace="{config.namespace}"}}) < 0.8
        for: 15m
        labels:
          severity: warning
          team: ml-ops
        annotations:
          summary: "Potential bias detected in model predictions"
          description: "Fairness score {{{{ $value }}}} indicates potential bias"
```

### 5. **Grafana Dashboard Queries**

```promql
# Drift Score Trend
avg(drift_score{{namespace="{config.namespace}"}}) by (model_name)

# Explanation Request Rate
rate(seldon_model_infer_total{{model_name="model-explainer",namespace="{config.namespace}"}}[5m])

# Fairness Metrics by Group
avg(fairness_score{{namespace="{config.namespace}"}}) by (protected_attribute, group)

# Model Performance Over Time
avg_over_time(model_performance{{namespace="{config.namespace}"}}[1h])

# Anomaly Detection Rate
sum(rate(anomaly_detected{{namespace="{config.namespace}"}}[5m])) by (model_name)
```

### 6. **Integration with MLOps Pipeline**

```python
class MLOpsIntegration:
    def __init__(self):
        self.monitoring_endpoint = "http://{config.gateway_ip}:{config.gateway_port}"
        self.mlflow_tracking_uri = "http://mlflow:5000"
        
    def log_monitoring_metrics(self, run_id, metrics):
        \"\"\"Log monitoring metrics to MLflow\"\"\"
        import mlflow
        
        mlflow.set_tracking_uri(self.mlflow_tracking_uri)
        
        with mlflow.start_run(run_id=run_id):
            mlflow.log_metric("avg_drift_score", metrics["drift_score"])
            mlflow.log_metric("explanation_coverage", metrics["explanation_coverage"])
            mlflow.log_metric("fairness_score", metrics["fairness_score"])
            mlflow.log_metric("performance_score", metrics["performance_score"])
            
            # Log alerts under separate tag keys so one does not overwrite the other
            if metrics["drift_score"] > {config.drift_threshold}:
                mlflow.set_tag("alert.drift", "high_drift")
            if metrics["fairness_score"] < 0.8:
                mlflow.set_tag("alert.fairness", "bias_detected")
```
"""))

## Best Practices for Production Monitoring

In [None]:
display(Markdown("""
## 📋 **Best Practices for Production Monitoring**

### 1. **Drift Detection Strategy**
- Set appropriate thresholds based on your data characteristics
- Monitor both feature drift and prediction drift
- Use sliding windows for baseline comparison
- Implement graduated alerts (warning → critical)

### 2. **Explainability Implementation**
- Generate explanations for edge cases and anomalies
- Store explanations for regulatory compliance
- Use explanations to improve model training
- Balance explanation detail with performance

### 3. **Performance Optimization**
- Use async monitoring for non-critical paths
- Batch explanation requests when possible
- Cache explanations for repeated predictions
- Scale monitoring components independently

### 4. **Alert Configuration**
```yaml
# Example AlertManager configuration
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'ml-team'
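  # Receivers named here must also be defined under a top-level `receivers:` key
  routes:
  - match:
      alertname: ModelPerformanceDegradation  # severity: critical
    receiver: ml-team-urgent
  - match:
      alertname: HighDataDrift  # severity: warning
    receiver: ml-team-daily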
```

### 5. **Dashboard Design**
- **Real-time metrics**: Request rate, latency, error rate
- **ML metrics**: Drift scores, explanation coverage, model confidence
- **Business metrics**: Predictions by class, feature importance trends
- **Infrastructure**: CPU/memory usage, queue depths, scaling events
"""))
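
The first practice, comparing a sliding window against a baseline with graduated alerts, can be sketched in plain Python. The class name `SlidingDriftCheck`, the mean-shift score, and the warning and critical cutoffs below are illustrative assumptions, not the statistic used by the deployed `drift-detector`.

```python
from collections import deque
from statistics import mean, stdev


class SlidingDriftCheck:
    """Compare the mean of recent values against a fixed baseline sample."""

    def __init__(self, baseline, window_size=50, warning=2.0, critical=4.0):
        self.base_mean = mean(baseline)
        self.base_std = stdev(baseline) or 1e-9  # avoid division by zero
        self.window = deque(maxlen=window_size)  # old values fall off automatically
        self.warning = warning
        self.critical = critical

    def update(self, value):
        """Add one observation and return (score, severity)."""
        self.window.append(value)
        # Shift of the window mean, in units of baseline standard deviations
        score = abs(mean(self.window) - self.base_mean) / self.base_std
        if score >= self.critical:
            return score, "critical"
        if score >= self.warning:
            return score, "warning"
        return score, "ok"


# Baseline resembling one iris feature, followed by a shifted stream
baseline = [5.0, 5.2, 4.8, 5.1, 4.9]
check = SlidingDriftCheck(baseline, window_size=3)
for value in [5.0, 5.1, 8.0, 8.2, 8.1]:
    print(value, check.update(value))
```

In practice one such check would run per feature, and the cutoffs would be tuned on historical data so that normal variation stays below the warning level.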

## Cleanup (Optional)

In [ ]:
# Production cleanup with resource management
import ipywidgets as widgets
from IPython.display import display

def cleanup_monitoring_resources():
    """Clean up monitoring resources safely"""
    log("Starting monitoring cleanup...", "INFO")
    
    # Only clean up monitoring-specific resources
    monitoring_resources = [
        ("pipeline", ["real-time-monitoring", "explanation-service", "fairness-monitoring", "comprehensive-monitoring"]),
        ("model", ["drift-detector", "model-explainer", "performance-monitor", "bias-detector"])
    ]
    
    cleanup_count = 0
    
    for resource_type, items in monitoring_resources:
        for item in items:
            result = run(f"kubectl delete {resource_type} {item} -n {config.namespace} --ignore-not-found=true --wait=false")
            if result.returncode == 0:
                # --ignore-not-found also returns 0 when the resource never existed
                log(f"Deleted (or already absent) {resource_type}: {item}", "SUCCESS")
                cleanup_count += 1
    
    # Clean up generated YAML manifests in the working directory
    import glob
    yaml_files = glob.glob("*.yaml")
    for yaml_file in yaml_files:
        if any(name in yaml_file for name in ["drift", "explainer", "performance", "bias", "monitoring", "fairness"]):
            try:
                os.remove(yaml_file)
            except OSError as e:
                # Report the failure instead of silently swallowing every exception
                log(f"Could not remove {yaml_file}: {e}", "INFO")
    
    log(f"Cleanup complete! Processed {cleanup_count} monitoring resources", "SUCCESS")

# Interactive cleanup interface
cleanup_button = widgets.Button(
    description="Clean Up Monitoring",
    button_style='danger',
    tooltip='Remove monitoring resources',
    icon='trash'
)

keep_button = widgets.Button(
    description="Keep Monitoring",
    button_style='success',
    tooltip='Preserve monitoring setup',
    icon='check'
)

output = widgets.Output()

def on_cleanup_click(b):
    with output:
        output.clear_output()
        cleanup_monitoring_resources()

def on_keep_click(b):
    with output:
        output.clear_output()
        log("Monitoring resources preserved for production use", "SUCCESS")
        display(Markdown(f"""
### 📌 **Monitoring Resources Preserved**

**Active Monitoring Components:**
- 🔍 **Drift Detector**: Real-time data quality monitoring
- 🎯 **Model Explainer**: Compliance-ready explanations
- 📊 **Performance Monitor**: Model health tracking
- ⚖️ **Bias Detector**: Fairness monitoring

**Monitoring Pipelines:**
{chr(10).join(f"- {pipeline}" for pipeline in deployed['pipelines'])}

**Production Commands:**
```bash
# View monitoring status
kubectl get models -l app=data-science-monitoring -n {config.namespace}
kubectl get pipelines -l app=data-science-monitoring -n {config.namespace}

# Check pod CPU/memory usage (requires metrics-server)
kubectl top pods -l app=data-science-monitoring -n {config.namespace}

# Monitor with k9s
k9s -n {config.namespace}
```

**Manual cleanup when ready:**
```bash
# Delete monitoring pipelines
kubectl delete pipelines -l app=data-science-monitoring -n {config.namespace}

# Delete monitoring models
kubectl delete models -l component=monitoring-model -n {config.namespace}
```
"""))

cleanup_button.on_click(on_cleanup_click)
keep_button.on_click(on_keep_click)

display(Markdown("### 🧹 **Resource Management**"))
display(widgets.HBox([keep_button, cleanup_button]))
display(output)

# Final summary
display(Markdown(f"""
## 🎯 **Production Data Science Monitoring Summary**

You've successfully deployed a **production-grade ML monitoring platform** with:

**🔬 Monitoring Capabilities:**
- ✅ **Drift Detection**: {metrics.drift_detections} drift events detected
- ✅ **Model Explainability**: {metrics.explanations_generated} explanations generated
- ✅ **Performance Tracking**: Real-time model health monitoring
- ✅ **Fairness Monitoring**: Bias detection across segments

**📊 Infrastructure Deployed:**
- 🤖 **{len([m for m in deployed['models'] if any(mon in m for mon in ['drift', 'explainer', 'performance', 'bias'])])} Monitoring Models**
- 🔗 **{len(deployed['pipelines'])} Monitoring Pipelines**
- 📈 **Prometheus Metrics**: Exposed for all components
- 🚨 **Alert Rules**: Example rules provided for drift and performance (deploy them to activate)

**🚀 Production Features:**
- **Auto-remediation**: Trigger retraining on drift detection
- **Compliance Ready**: Full audit trail with explanations
- **Real-time Alerts**: Prometheus/AlertManager integration
- **Fairness Tracking**: Demographic parity monitoring
- **MLOps Integration**: Ready for CI/CD pipelines

**📋 Next Steps:**
1. **Configure Alerts**: Deploy AlertManager rules
2. **Set Up Dashboards**: Import Grafana templates
3. **Enable Auto-retraining**: Connect to ML pipelines
4. **Schedule Reports**: Weekly compliance summaries
5. **Monitor Fairness**: Track across user segments

**💡 Try These Commands:**
```python
# Check for drift
monitoring_client.test_monitoring(
    "drift-detector", 
    [[8.0, 6.0, 4.0, 2.0]]  # Anomalous data
)

# Get explanation
monitoring_client.test_monitoring(
    "model-explainer",
    [[6.5, 3.0, 5.5, 1.8]]  # Edge case
)

# Full monitoring pipeline
monitoring_client.test_monitoring(
    "comprehensive-monitoring",
    [[5.1, 3.5, 1.4, 0.2]],
    is_pipeline=True
)
```

**🏆 Achievement Unlocked**: Production ML Monitoring Platform! 🎉
"""))