# ML Pipeline Platform - Performance Monitoring

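This notebook simulates performance monitoring for the ML Pipeline Platform. It generates synthetic system, model, and data quality metrics, then visualizes them, computes KPIs, evaluates alert rules, and flags optimization opportunities.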

## Contents
1. [System Metrics Collection](#system-metrics)
2. [Model Performance Tracking](#model-performance)
3. [Data Quality Monitoring](#data-quality)
4. [Real-time Dashboard Simulation](#dashboard)
5. [Alert System Analysis](#alerts)
6. [Performance Optimization](#optimization)


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Time and monitoring libraries
from datetime import datetime, timedelta
import time
import random
import json

# Statistical libraries
from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"Notebook started at: {datetime.now()}")

## 1. System Metrics Collection {#system-metrics}

Simulate and analyze system performance metrics.

In [None]:
# Generate synthetic system metrics data
def generate_system_metrics(days=7, interval_minutes=5):
    """Generate synthetic system metrics for monitoring simulation"""
    
    # Calculate number of data points
    total_minutes = days * 24 * 60
    num_points = total_minutes // interval_minutes
    
    # Generate timestamps
    start_time = datetime.now() - timedelta(days=days)
    timestamps = [start_time + timedelta(minutes=i*interval_minutes) for i in range(num_points)]
    
    # Generate realistic metrics with patterns
    metrics = []
    
    for i, ts in enumerate(timestamps):
        # Add daily patterns (higher load during business hours)
        hour = ts.hour
        daily_factor = 1.0 + 0.5 * np.sin((hour - 6) * np.pi / 12) if 6 <= hour <= 18 else 0.3
        
        # Add weekly patterns (lower load on weekends)
        weekly_factor = 0.6 if ts.weekday() >= 5 else 1.0
        
        # Base load with some randomness
        base_factor = daily_factor * weekly_factor
        
        # System metrics
        cpu_usage = max(0, min(100, 30 * base_factor + np.random.normal(0, 10)))
        memory_usage = max(0, min(100, 40 * base_factor + np.random.normal(0, 8)))
        
        # API metrics
        requests_per_second = max(0, 50 * base_factor + np.random.normal(0, 15))
        response_time = max(0, 100 + 50 * base_factor + np.random.exponential(20))
        error_rate = max(0, min(10, 0.5 + np.random.exponential(0.5)))
        
        # Model metrics
        predictions_per_minute = max(0, requests_per_second * 0.8 + np.random.normal(0, 5))
        model_accuracy = max(0.8, min(1.0, 0.95 + np.random.normal(0, 0.02)))
        
        # Storage metrics
        disk_usage = min(100, 60 + i * 0.01 + np.random.normal(0, 2))  # Gradually increasing
        
        metrics.append({
            'timestamp': ts,
            'cpu_usage': cpu_usage,
            'memory_usage': memory_usage,
            'requests_per_second': requests_per_second,
            'response_time_ms': response_time,
            'error_rate': error_rate,
            'predictions_per_minute': predictions_per_minute,
            'model_accuracy': model_accuracy,
            'disk_usage': disk_usage
        })
    
    return pd.DataFrame(metrics)

# Generate metrics data
metrics_df = generate_system_metrics(days=7, interval_minutes=5)
print(f"Generated {len(metrics_df)} system metric records")
print(f"Time range: {metrics_df['timestamp'].min()} to {metrics_df['timestamp'].max()}")
print("\nSample data:")
print(metrics_df.head())
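
The load pattern inside `generate_system_metrics` is easier to reason about in isolation. The sketch below pulls the daily and weekly factors into a standalone helper; the name `load_factor` is hypothetical and the sample dates are arbitrary, chosen only because 2024-01-01 is a Monday and 2024-01-06 a Saturday.

```python
import math
from datetime import datetime


def load_factor(ts: datetime) -> float:
    """Combined daily and weekly load multiplier for a timestamp (illustrative helper)."""
    hour = ts.hour
    # Half sine wave between 06:00 and 18:00 that peaks at noon; flat 0.3 overnight
    daily = 1.0 + 0.5 * math.sin((hour - 6) * math.pi / 12) if 6 <= hour <= 18 else 0.3
    # Saturday (5) and Sunday (6) carry 60% of the weekday load
    weekly = 0.6 if ts.weekday() >= 5 else 1.0
    return daily * weekly


samples = {
    "Monday 03:00": datetime(2024, 1, 1, 3),
    "Monday 12:00": datetime(2024, 1, 1, 12),
    "Saturday 12:00": datetime(2024, 1, 6, 12),
}
for label, ts in samples.items():
    print(f"{label}: {load_factor(ts):.2f}")
```

The factor jumps from 0.3 to 1.0 at 06:00 and back after 18:00, so the synthetic series has a step at the edges of business hours rather than a smooth ramp.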

In [None]:
# System metrics overview
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=('CPU Usage (%)', 'Memory Usage (%)', 
                   'Response Time (ms)', 'Error Rate (%)',
                   'Requests/Second', 'Model Accuracy'),
    vertical_spacing=0.08
)

# CPU Usage
fig.add_trace(go.Scatter(x=metrics_df['timestamp'], y=metrics_df['cpu_usage'],
                        mode='lines', name='CPU Usage', line=dict(color='blue')),
             row=1, col=1)

# Memory Usage
fig.add_trace(go.Scatter(x=metrics_df['timestamp'], y=metrics_df['memory_usage'],
                        mode='lines', name='Memory Usage', line=dict(color='green')),
             row=1, col=2)

# Response Time
fig.add_trace(go.Scatter(x=metrics_df['timestamp'], y=metrics_df['response_time_ms'],
                        mode='lines', name='Response Time', line=dict(color='orange')),
             row=2, col=1)

# Error Rate
fig.add_trace(go.Scatter(x=metrics_df['timestamp'], y=metrics_df['error_rate'],
                        mode='lines', name='Error Rate', line=dict(color='red')),
             row=2, col=2)

# Requests per Second
fig.add_trace(go.Scatter(x=metrics_df['timestamp'], y=metrics_df['requests_per_second'],
                        mode='lines', name='Requests/Second', line=dict(color='purple')),
             row=3, col=1)

# Model Accuracy
fig.add_trace(go.Scatter(x=metrics_df['timestamp'], y=metrics_df['model_accuracy'],
                        mode='lines', name='Model Accuracy', line=dict(color='darkgreen')),
             row=3, col=2)

fig.update_layout(height=1000, title_text="System Performance Metrics Overview", showlegend=False)
fig.show()

In [None]:
# Calculate key performance indicators (KPIs)
current_time = metrics_df['timestamp'].max()
last_hour_data = metrics_df[metrics_df['timestamp'] >= current_time - timedelta(hours=1)]
last_24h_data = metrics_df[metrics_df['timestamp'] >= current_time - timedelta(days=1)]

kpis = {
    'Current Performance': {
        'CPU Usage (%)': metrics_df['cpu_usage'].iloc[-1],
        'Memory Usage (%)': metrics_df['memory_usage'].iloc[-1],
        'Response Time (ms)': metrics_df['response_time_ms'].iloc[-1],
        'Error Rate (%)': metrics_df['error_rate'].iloc[-1],
        'Model Accuracy': metrics_df['model_accuracy'].iloc[-1]
    },
    'Last Hour Averages': {
        'CPU Usage (%)': last_hour_data['cpu_usage'].mean(),
        'Memory Usage (%)': last_hour_data['memory_usage'].mean(),
        'Response Time (ms)': last_hour_data['response_time_ms'].mean(),
        'Error Rate (%)': last_hour_data['error_rate'].mean(),
        'Requests/Second': last_hour_data['requests_per_second'].mean()
    },
    'Last 24h Summary': {
        'Total Requests': int(last_24h_data['requests_per_second'].sum() * 5 * 60),  # Each sample covers a 5-minute (300 s) interval
        'Avg Response Time (ms)': last_24h_data['response_time_ms'].mean(),
        'Max Response Time (ms)': last_24h_data['response_time_ms'].max(),
        'Avg Error Rate (%)': last_24h_data['error_rate'].mean(),
        'Uptime (%)': 100 - (last_24h_data['error_rate'] > 5).mean() * 100
    }
}

print("=" * 60)
print("📊 SYSTEM PERFORMANCE DASHBOARD")
print("=" * 60)

for category, metrics in kpis.items():
    print(f"\n{category}:")
    for metric, value in metrics.items():
        if isinstance(value, float):
            print(f"  {metric}: {value:.2f}")
        else:
            print(f"  {metric}: {value:,}")
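
Turning a sampled rate into a count is simple arithmetic, but the units are easy to get wrong. As a minimal sketch, assuming evenly spaced samples and a rate that stays roughly constant within each interval, each requests-per-second sample is multiplied by the number of seconds it covers. The helper name `total_requests` is hypothetical.

```python
import pandas as pd


def total_requests(rps: pd.Series, interval_minutes: int = 5) -> int:
    """Approximate request count from evenly spaced requests-per-second samples."""
    seconds_per_interval = interval_minutes * 60
    # Each sample is treated as the constant rate over its whole interval
    return int(rps.sum() * seconds_per_interval)


# Three 5-minute samples: (50 + 40 + 60) req/s * 300 s = 45,000 requests
sample = pd.Series([50.0, 40.0, 60.0])
print(f"{total_requests(sample):,}")
```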

## 2. Model Performance Tracking {#model-performance}

Monitor model performance over time and detect drift.

In [None]:
# Generate model performance data over time
def generate_model_performance_data(days=30):
    """Generate synthetic model performance data"""
    
    dates = pd.date_range(start=datetime.now() - timedelta(days=days), 
                         end=datetime.now(), freq='H')
    
    performance_data = []
    
    # Simulate gradual model drift
    base_accuracy = 0.95
    drift_rate = 0.0001  # Small drift per hour
    
    for i, date in enumerate(dates):
        # Add drift and noise
        accuracy = base_accuracy - (i * drift_rate) + np.random.normal(0, 0.01)
        accuracy = max(0.8, min(1.0, accuracy))  # Clamp between 0.8 and 1.0
        
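        # Other metrics (clamped to [0, 1] so the derived error counts stay non-negative)
        precision = min(1.0, max(0.0, accuracy + np.random.normal(0, 0.005)))
        recall = min(1.0, max(0.0, accuracy + np.random.normal(0, 0.005)))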
        f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # Prediction volume
        hour = date.hour
        daily_pattern = 1.0 + 0.5 * np.sin((hour - 6) * np.pi / 12) if 6 <= hour <= 18 else 0.3
        predictions = max(0, int(1000 * daily_pattern + np.random.normal(0, 100)))
        
        performance_data.append({
            'timestamp': date,
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1_score,
            'predictions_count': predictions,
            'false_positives': int(predictions * (1 - precision) * recall),
            'false_negatives': int(predictions * precision * (1 - recall))
        })
    
    return pd.DataFrame(performance_data)

# Generate model performance data
model_perf_df = generate_model_performance_data(days=30)
print(f"Generated {len(model_perf_df)} model performance records")
print("\nSample data:")
print(model_perf_df.head())

In [None]:
# Model performance visualization
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Model Accuracy Over Time', 'Precision vs Recall',
                   'Prediction Volume', 'False Positives/Negatives'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": True}, {"secondary_y": False}]]
)

# Accuracy over time with trend line
fig.add_trace(go.Scatter(x=model_perf_df['timestamp'], y=model_perf_df['accuracy'],
                        mode='lines', name='Accuracy', line=dict(color='blue')),
             row=1, col=1)

# Add trend line
z = np.polyfit(range(len(model_perf_df)), model_perf_df['accuracy'], 1)
trend_line = np.poly1d(z)(range(len(model_perf_df)))
fig.add_trace(go.Scatter(x=model_perf_df['timestamp'], y=trend_line,
                        mode='lines', name='Trend', line=dict(color='red', dash='dash')),
             row=1, col=1)

# Precision vs Recall scatter
fig.add_trace(go.Scatter(x=model_perf_df['recall'], y=model_perf_df['precision'],
                        mode='markers', name='Precision vs Recall',
                        marker=dict(color=model_perf_df['f1_score'], colorscale='Viridis',
                                  colorbar=dict(title="F1 Score", x=0.48))),
             row=1, col=2)

# Prediction volume
fig.add_trace(go.Scatter(x=model_perf_df['timestamp'], y=model_perf_df['predictions_count'],
                        mode='lines', name='Predictions', line=dict(color='green')),
             row=2, col=1)

# False positives and negatives
fig.add_trace(go.Scatter(x=model_perf_df['timestamp'], y=model_perf_df['false_positives'],
                        mode='lines', name='False Positives', line=dict(color='orange')),
             row=2, col=2)
fig.add_trace(go.Scatter(x=model_perf_df['timestamp'], y=model_perf_df['false_negatives'],
                        mode='lines', name='False Negatives', line=dict(color='red')),
             row=2, col=2)

fig.update_layout(height=800, title_text="Model Performance Analysis", showlegend=True)
fig.show()

# Calculate drift statistics
initial_accuracy = model_perf_df['accuracy'].iloc[:24].mean()  # First day
recent_accuracy = model_perf_df['accuracy'].iloc[-24:].mean()  # Last day
drift_magnitude = abs(recent_accuracy - initial_accuracy)

print(f"\n📈 Model Drift Analysis:")
print(f"Initial Accuracy (Day 1): {initial_accuracy:.4f}")
print(f"Recent Accuracy (Last Day): {recent_accuracy:.4f}")
print(f"Drift Magnitude: {drift_magnitude:.4f}")
print(f"Drift Rate: {(recent_accuracy - initial_accuracy)/initial_accuracy*100:.2f}%")

if drift_magnitude > 0.01:
    print("⚠️  ALERT: Significant model drift detected!")
else:
    print("✅ Model performance is stable")
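
Comparing the first and last day uses only two 24-hour windows. One alternative, shown here as a sketch rather than the notebook's own method, is to fit a line through the whole series with `scipy.stats.linregress` and test whether the slope is significantly negative. The function name `accuracy_trend` and the `alpha` cutoff are assumptions, and the series is synthetic with the same drift rate and noise level as above.

```python
import numpy as np
from scipy import stats


def accuracy_trend(accuracy, alpha=0.05):
    """Fit accuracy against sample index and report whether it declines significantly."""
    hours = np.arange(len(accuracy))
    result = stats.linregress(hours, accuracy)
    return {
        "slope_per_hour": result.slope,
        "p_value": result.pvalue,
        # Flag only a negative slope that is unlikely under a zero-slope null
        "significant_decline": bool(result.slope < 0 and result.pvalue < alpha),
    }


rng = np.random.default_rng(0)
hours = np.arange(720)  # 30 days of hourly samples
drifting = 0.95 - 0.0001 * hours + rng.normal(0, 0.01, size=hours.size)

trend = accuracy_trend(drifting)
print(f"Slope per hour: {trend['slope_per_hour']:.6f}")
print(f"Significant decline: {trend['significant_decline']}")
```

The p-value from `linregress` assumes independent residuals, so on autocorrelated production metrics it is likely to be optimistic.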

## 3. Data Quality Monitoring {#data-quality}

Monitor data quality and feature distributions.

In [None]:
# Generate data quality metrics
def generate_data_quality_metrics(days=7):
    """Generate synthetic data quality metrics"""
    
    dates = pd.date_range(start=datetime.now() - timedelta(days=days), 
                         end=datetime.now(), freq='H')
    
    quality_data = []
    
    for date in dates:
        # Data completeness (percentage of non-null values)
        completeness = max(85, min(100, 98 + np.random.normal(0, 2)))
        
        # Data freshness (delay in minutes)
        freshness_delay = max(0, np.random.exponential(5))  # Exponential distribution
        
        # Feature distribution drift (KL divergence simulation)
        feature_drift = abs(np.random.normal(0, 0.1))
        
        # Anomaly detection (percentage of anomalous records)
        anomaly_rate = max(0, min(10, np.random.exponential(0.5)))
        
        # Schema violations
        schema_violations = max(0, int(np.random.poisson(0.1)))
        
        # Record count
        hour = date.hour
        daily_pattern = 1.0 + 0.5 * np.sin((hour - 6) * np.pi / 12) if 6 <= hour <= 18 else 0.3
        record_count = max(0, int(5000 * daily_pattern + np.random.normal(0, 500)))
        
        quality_data.append({
            'timestamp': date,
            'data_completeness': completeness,
            'freshness_delay_minutes': freshness_delay,
            'feature_drift_score': feature_drift,
            'anomaly_rate': anomaly_rate,
            'schema_violations': schema_violations,
            'record_count': record_count
        })
    
    return pd.DataFrame(quality_data)

# Generate data quality metrics
data_quality_df = generate_data_quality_metrics(days=7)
print(f"Generated {len(data_quality_df)} data quality records")
print("\nSample data:")
print(data_quality_df.head())
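
The `feature_drift_score` above is drawn at random to stand in for a KL divergence. With real feature values, one way to compute such a score is to histogram a reference sample and a current sample over shared bins and pass the counts to `scipy.stats.entropy`. This is a sketch under assumptions: the name `kl_drift_score`, the bin count, and the smoothing constant `eps` are illustrative, and the 0.2 alert threshold used in this notebook has not been calibrated against this estimator.

```python
import numpy as np
from scipy.stats import entropy


def kl_drift_score(reference, current, bins=20, eps=1e-9):
    """Estimate KL(current || reference) from histograms over shared bin edges."""
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    p, _ = np.histogram(current, bins=edges)
    q, _ = np.histogram(reference, bins=edges)
    # eps keeps the divergence finite when a bin is empty in the reference sample
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    # entropy(p, q) normalizes both inputs and returns sum(p * log(p / q))
    return float(entropy(p, q))


rng = np.random.default_rng(42)
reference = rng.normal(0.0, 1.0, size=5000)
similar = rng.normal(0.0, 1.0, size=5000)
shifted = rng.normal(1.0, 1.0, size=5000)

print(f"Same distribution:    {kl_drift_score(reference, similar):.4f}")
print(f"Mean shifted by 1 sd: {kl_drift_score(reference, shifted):.4f}")
```

KL divergence is asymmetric, so swapping the two samples gives a different score. The result also depends on the binning, which is why thresholds need tuning per feature.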

In [None]:
# Data quality visualization
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=('Data Completeness (%)', 'Data Freshness (minutes)',
                   'Feature Drift Score', 'Anomaly Rate (%)',
                   'Schema Violations', 'Record Count'),
    vertical_spacing=0.08
)

# Data Completeness
fig.add_trace(go.Scatter(x=data_quality_df['timestamp'], y=data_quality_df['data_completeness'],
                        mode='lines+markers', name='Completeness',
                        line=dict(color='green')),
             row=1, col=1)
fig.add_hline(y=95, line_dash="dash", line_color="red", 
              annotation_text="Threshold", row=1, col=1)

# Data Freshness
fig.add_trace(go.Scatter(x=data_quality_df['timestamp'], y=data_quality_df['freshness_delay_minutes'],
                        mode='lines+markers', name='Freshness Delay',
                        line=dict(color='blue')),
             row=1, col=2)
fig.add_hline(y=10, line_dash="dash", line_color="red", 
              annotation_text="SLA", row=1, col=2)

# Feature Drift
fig.add_trace(go.Scatter(x=data_quality_df['timestamp'], y=data_quality_df['feature_drift_score'],
                        mode='lines+markers', name='Feature Drift',
                        line=dict(color='orange')),
             row=2, col=1)
fig.add_hline(y=0.2, line_dash="dash", line_color="red", 
              annotation_text="Alert Threshold", row=2, col=1)

# Anomaly Rate
fig.add_trace(go.Scatter(x=data_quality_df['timestamp'], y=data_quality_df['anomaly_rate'],
                        mode='lines+markers', name='Anomaly Rate',
                        line=dict(color='red')),
             row=2, col=2)

# Schema Violations
fig.add_trace(go.Bar(x=data_quality_df['timestamp'], y=data_quality_df['schema_violations'],
                    name='Schema Violations', marker_color='purple'),
             row=3, col=1)

# Record Count
fig.add_trace(go.Scatter(x=data_quality_df['timestamp'], y=data_quality_df['record_count'],
                        mode='lines', name='Record Count',
                        line=dict(color='darkgreen')),
             row=3, col=2)

fig.update_layout(height=1000, title_text="Data Quality Monitoring Dashboard", showlegend=False)
fig.show()

In [None]:
# Data quality summary and alerts
current_metrics = data_quality_df.iloc[-1]
recent_24h = data_quality_df.iloc[-24:]

# Define thresholds
thresholds = {
    'data_completeness': {'min': 95, 'direction': 'above'},
    'freshness_delay_minutes': {'max': 10, 'direction': 'below'},
    'feature_drift_score': {'max': 0.2, 'direction': 'below'},
    'anomaly_rate': {'max': 3, 'direction': 'below'},
    'schema_violations': {'max': 0, 'direction': 'below'}
}

print("=" * 60)
print("🔍 DATA QUALITY MONITORING REPORT")
print("=" * 60)

print(f"\n📊 Current Status (as of {current_metrics['timestamp']})")
alerts = []

for metric, threshold in thresholds.items():
    current_value = current_metrics[metric]
    
    # Check if metric violates threshold
    if threshold['direction'] == 'above' and 'min' in threshold:
        violation = current_value < threshold['min']
        status = "⚠️ " if violation else "✅"
        threshold_text = f"(threshold: >={threshold['min']})"
    elif threshold['direction'] == 'below' and 'max' in threshold:
        violation = current_value > threshold['max']
        status = "⚠️ " if violation else "✅"
        threshold_text = f"(threshold: <={threshold['max']})"
    else:
        violation = False
        status = "✅"
        threshold_text = ""
    
    if violation:
        alerts.append(f"{metric}: {current_value:.2f} {threshold_text}")
    
    print(f"  {status} {metric.replace('_', ' ').title()}: {current_value:.2f} {threshold_text}")

print(f"\n📈 24-Hour Trends:")
print(f"  • Average Completeness: {recent_24h['data_completeness'].mean():.2f}%")
print(f"  • Average Freshness Delay: {recent_24h['freshness_delay_minutes'].mean():.2f} minutes")
print(f"  • Max Feature Drift: {recent_24h['feature_drift_score'].max():.3f}")
print(f"  • Total Records Processed: {recent_24h['record_count'].sum():,}")
print(f"  • Total Schema Violations: {recent_24h['schema_violations'].sum()}")

if alerts:
    print(f"\n🚨 ACTIVE ALERTS ({len(alerts)}):")
    for i, alert in enumerate(alerts, 1):
        print(f"  {i}. {alert}")
else:
    print(f"\n✅ No active data quality alerts")

print(f"\n💡 Recommendations:")
if recent_24h['data_completeness'].mean() < 98:
    print(f"  • Investigate data ingestion pipeline for completeness issues")
if recent_24h['freshness_delay_minutes'].mean() > 5:
    print(f"  • Optimize data pipeline for faster processing")
if recent_24h['feature_drift_score'].max() > 0.15:
    print(f"  • Monitor feature distributions for potential model retraining")
if recent_24h['anomaly_rate'].mean() > 2:
    print(f"  • Review anomaly detection rules and investigate root causes")

## 4. Real-time Dashboard Simulation {#dashboard}

Create a simulated real-time monitoring dashboard.

In [None]:
# Create comprehensive monitoring dashboard
def create_monitoring_dashboard():
    """Create a comprehensive monitoring dashboard"""
    
    # Get latest data points
    latest_system = metrics_df.iloc[-1]
    latest_model = model_perf_df.iloc[-1]
    latest_quality = data_quality_df.iloc[-1]
    
    # Create dashboard with subplots
    fig = make_subplots(
        rows=4, cols=4,
        subplot_titles=(
            'CPU Usage', 'Memory Usage', 'Response Time', 'Error Rate',
            'Model Accuracy', 'Predictions/Min', 'Data Completeness', 'Feature Drift',
            'Request Volume (24h)', 'Model Performance Trend', 'Data Quality Score', 'System Health',
            'Alert Summary', 'Performance Distribution', 'Resource Utilization', 'Uptime Status'
        ),
        specs=[
            [{"type": "indicator"}, {"type": "indicator"}, {"type": "indicator"}, {"type": "indicator"}],
            [{"type": "indicator"}, {"type": "indicator"}, {"type": "indicator"}, {"type": "indicator"}],
            [{"type": "scatter"}, {"type": "scatter"}, {"type": "scatter"}, {"type": "scatter"}],
            [{"type": "table"}, {"type": "histogram"}, {"type": "pie"}, {"type": "indicator"}]
        ],
        vertical_spacing=0.08
    )
    
    # Row 1: Key Indicators
    # CPU Usage
    fig.add_trace(go.Indicator(
        mode="gauge+number+delta",
        value=latest_system['cpu_usage'],
        domain={'x': [0, 1], 'y': [0, 1]},
        title={'text': "CPU %"},
        delta={'reference': 50},
        gauge={'axis': {'range': [None, 100]},
               'bar': {'color': "darkblue"},
               'steps': [{'range': [0, 50], 'color': "lightgray"},
                        {'range': [50, 80], 'color': "yellow"}],
               'threshold': {'line': {'color': "red", 'width': 4},
                           'thickness': 0.75, 'value': 90}}
    ), row=1, col=1)
    
    # Memory Usage
    fig.add_trace(go.Indicator(
        mode="gauge+number",
        value=latest_system['memory_usage'],
        title={'text': "Memory %"},
        gauge={'axis': {'range': [None, 100]},
               'bar': {'color': "darkgreen"},
               'threshold': {'line': {'color': "red", 'width': 4},
                           'thickness': 0.75, 'value': 85}}
    ), row=1, col=2)
    
    # Response Time
    fig.add_trace(go.Indicator(
        mode="gauge+number",
        value=latest_system['response_time_ms'],
        title={'text': "Response Time (ms)"},
        gauge={'axis': {'range': [0, 500]},
               'bar': {'color': "orange"},
               'threshold': {'line': {'color': "red", 'width': 4},
                           'thickness': 0.75, 'value': 300}}
    ), row=1, col=3)
    
    # Error Rate
    fig.add_trace(go.Indicator(
        mode="number+delta",
        value=latest_system['error_rate'],
        title={'text': "Error Rate %"},
        delta={'reference': 1, 'increasing': {'color': "red"}}
    ), row=1, col=4)
    
    # Row 2: Model Indicators
    # Model Accuracy
    fig.add_trace(go.Indicator(
        mode="gauge+number",
        value=latest_model['accuracy'],
        title={'text': "Model Accuracy"},
        gauge={'axis': {'range': [0.8, 1.0]},
               'bar': {'color': "purple"},
               'threshold': {'line': {'color': "red", 'width': 4},
                           'thickness': 0.75, 'value': 0.9}}
    ), row=2, col=1)
    
    # Predictions per minute
    fig.add_trace(go.Indicator(
        mode="number+delta",
        value=latest_system['predictions_per_minute'],
        title={'text': "Predictions/Min"},
        delta={'reference': 50}
    ), row=2, col=2)
    
    # Data Completeness
    fig.add_trace(go.Indicator(
        mode="gauge+number",
        value=latest_quality['data_completeness'],
        title={'text': "Data Completeness %"},
        gauge={'axis': {'range': [90, 100]},
               'bar': {'color': "green"},
               'threshold': {'line': {'color': "red", 'width': 4},
                           'thickness': 0.75, 'value': 95}}
    ), row=2, col=3)
    
    # Feature Drift
    fig.add_trace(go.Indicator(
        mode="number+delta",
        value=latest_quality['feature_drift_score'],
        title={'text': "Feature Drift"},
        delta={'reference': 0.1, 'increasing': {'color': "red"}}
    ), row=2, col=4)
    
    # Row 3: Time Series
    # Request volume (last 24h)
    last_24h_metrics = metrics_df.iloc[-288:]  # Last 24 hours (5 min intervals)
    fig.add_trace(go.Scatter(
        x=last_24h_metrics['timestamp'], 
        y=last_24h_metrics['requests_per_second'],
        mode='lines', name='Requests/sec',
        line=dict(color='blue')
    ), row=3, col=1)
    
    # Model performance trend
    recent_model = model_perf_df.iloc[-48:]  # Last 48 hours
    fig.add_trace(go.Scatter(
        x=recent_model['timestamp'], 
        y=recent_model['accuracy'],
        mode='lines', name='Accuracy',
        line=dict(color='purple')
    ), row=3, col=2)
    
    # Data quality score (composite)
    data_quality_df['quality_score'] = (
        data_quality_df['data_completeness'] / 100 * 0.4 +
        np.maximum(0, 1 - data_quality_df['freshness_delay_minutes'] / 60) * 0.3 +
        np.maximum(0, 1 - data_quality_df['feature_drift_score'] / 0.5) * 0.3
    ) * 100
    
    fig.add_trace(go.Scatter(
        x=data_quality_df['timestamp'], 
        y=data_quality_df['quality_score'],
        mode='lines', name='Quality Score',
        line=dict(color='green')
    ), row=3, col=3)
    
    # System health (composite)
    metrics_df['health_score'] = (
        np.maximum(0, 1 - metrics_df['cpu_usage'] / 100) * 0.25 +
        np.maximum(0, 1 - metrics_df['memory_usage'] / 100) * 0.25 +
        np.maximum(0, 1 - metrics_df['response_time_ms'] / 1000) * 0.25 +
        np.maximum(0, 1 - metrics_df['error_rate'] / 10) * 0.25
    ) * 100
    
    recent_health = metrics_df.iloc[-288:]  # Last 24 hours
    fig.add_trace(go.Scatter(
        x=recent_health['timestamp'], 
        y=recent_health['health_score'],
        mode='lines', name='Health Score',
        line=dict(color='red')
    ), row=3, col=4)
    
    fig.update_layout(
        height=1200, 
        title_text="🖥️ ML Pipeline Platform - Real-time Monitoring Dashboard",
        showlegend=False
    )
    
    return fig

# Create and display dashboard
dashboard = create_monitoring_dashboard()
dashboard.show()
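
The composite system health score is an equally weighted average of four "headroom" terms, each clipped at zero. The scalar version below mirrors the formula in `create_monitoring_dashboard` so it can be checked by hand; the function name `health_score` and the sample readings are illustrative only.

```python
def health_score(cpu_pct, memory_pct, response_ms, error_pct):
    """Equal-weight health score in [0, 100] for one set of readings."""
    headroom = [
        max(0.0, 1 - cpu_pct / 100),      # CPU saturates at 100%
        max(0.0, 1 - memory_pct / 100),   # memory saturates at 100%
        max(0.0, 1 - response_ms / 1000), # latency saturates at 1000 ms
        max(0.0, 1 - error_pct / 10),     # error rate saturates at 10%
    ]
    return 100 * sum(0.25 * term for term in headroom)


# 0.25 * (0.60 + 0.50 + 0.85 + 0.90) * 100 = 71.25
print(round(health_score(cpu_pct=40, memory_pct=50, response_ms=150, error_pct=1), 2))
```

Because the terms are averaged, one saturated component can cost at most 25 points. A team that needs any single failure to dominate the score might prefer the minimum of the four terms.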

## 5. Alert System Analysis {#alerts}

Analyze and simulate alert conditions.

In [None]:
# Define alert rules and check conditions
def check_alerts(metrics_df, model_perf_df, data_quality_df):
    """Check all alert conditions and generate alerts"""
    
    alerts = []
    current_time = datetime.now()
    
    # Get latest values
    latest_system = metrics_df.iloc[-1]
    latest_model = model_perf_df.iloc[-1]
    latest_quality = data_quality_df.iloc[-1]
    
    # Recent data for trend analysis
    recent_system = metrics_df.iloc[-12:]  # Last hour (5-min intervals)
    recent_model = model_perf_df.iloc[-24:]  # Last 24 hours
    recent_quality = data_quality_df.iloc[-6:]  # Last 6 hours
    
    # System Performance Alerts
    if latest_system['cpu_usage'] > 90:
        alerts.append({
            'severity': 'CRITICAL',
            'category': 'System',
            'message': f"High CPU usage: {latest_system['cpu_usage']:.1f}%",
            'value': latest_system['cpu_usage'],
            'threshold': 90,
            'timestamp': current_time
        })
    
    if latest_system['memory_usage'] > 85:
        alerts.append({
            'severity': 'WARNING',
            'category': 'System',
            'message': f"High memory usage: {latest_system['memory_usage']:.1f}%",
            'value': latest_system['memory_usage'],
            'threshold': 85,
            'timestamp': current_time
        })
    
    if latest_system['response_time_ms'] > 300:
        alerts.append({
            'severity': 'WARNING',
            'category': 'Performance',
            'message': f"High response time: {latest_system['response_time_ms']:.1f}ms",
            'value': latest_system['response_time_ms'],
            'threshold': 300,
            'timestamp': current_time
        })
    
    if latest_system['error_rate'] > 5:
        alerts.append({
            'severity': 'CRITICAL',
            'category': 'Reliability',
            'message': f"High error rate: {latest_system['error_rate']:.2f}%",
            'value': latest_system['error_rate'],
            'threshold': 5,
            'timestamp': current_time
        })
    
    # Model Performance Alerts
    if latest_model['accuracy'] < 0.90:
        alerts.append({
            'severity': 'WARNING',
            'category': 'Model',
            'message': f"Low model accuracy: {latest_model['accuracy']:.3f}",
            'value': latest_model['accuracy'],
            'threshold': 0.90,
            'timestamp': current_time
        })
    
    # Model drift detection
    if len(recent_model) >= 24:
        accuracy_trend = recent_model['accuracy'].iloc[-1] - recent_model['accuracy'].iloc[0]
        if accuracy_trend < -0.02:  # 2% drop in 24 hours
            alerts.append({
                'severity': 'WARNING',
                'category': 'Model',
                'message': f"Model accuracy declining: {accuracy_trend:.3f} in 24h",
                'value': accuracy_trend,
                'threshold': -0.02,
                'timestamp': current_time
            })
    
    # Data Quality Alerts
    if latest_quality['data_completeness'] < 95:
        alerts.append({
            'severity': 'WARNING',
            'category': 'Data Quality',
            'message': f"Low data completeness: {latest_quality['data_completeness']:.1f}%",
            'value': latest_quality['data_completeness'],
            'threshold': 95,
            'timestamp': current_time
        })
    
    if latest_quality['freshness_delay_minutes'] > 10:
        alerts.append({
            'severity': 'WARNING',
            'category': 'Data Quality',
            'message': f"Data freshness delay: {latest_quality['freshness_delay_minutes']:.1f} minutes",
            'value': latest_quality['freshness_delay_minutes'],
            'threshold': 10,
            'timestamp': current_time
        })
    
    if latest_quality['feature_drift_score'] > 0.2:
        alerts.append({
            'severity': 'CRITICAL',
            'category': 'Data Quality',
            'message': f"High feature drift: {latest_quality['feature_drift_score']:.3f}",
            'value': latest_quality['feature_drift_score'],
            'threshold': 0.2,
            'timestamp': current_time
        })
    
    # Volume anomaly detection
    if len(recent_system) >= 12:
        avg_requests = recent_system['requests_per_second'].mean()
        current_requests = latest_system['requests_per_second']
        if current_requests < avg_requests * 0.3:  # 70% drop
            alerts.append({
                'severity': 'CRITICAL',
                'category': 'Traffic',
                'message': f"Low traffic volume: {current_requests:.1f} req/s (avg: {avg_requests:.1f})",
                'value': current_requests,
                'threshold': avg_requests * 0.3,
                'timestamp': current_time
            })
        elif current_requests > avg_requests * 2:  # 100% increase
            alerts.append({
                'severity': 'WARNING',
                'category': 'Traffic',
                'message': f"High traffic spike: {current_requests:.1f} req/s (avg: {avg_requests:.1f})",
                'value': current_requests,
                'threshold': avg_requests * 2,
                'timestamp': current_time
            })
    
    return alerts

# Check for alerts
current_alerts = check_alerts(metrics_df, model_perf_df, data_quality_df)

print("=" * 60)
print("🚨 ALERT SYSTEM STATUS")
print("=" * 60)

if current_alerts:
    # Sort alerts by severity
    severity_order = {'CRITICAL': 0, 'WARNING': 1, 'INFO': 2}
    current_alerts.sort(key=lambda x: severity_order.get(x['severity'], 3))
    
    print(f"\n🔥 ACTIVE ALERTS ({len(current_alerts)}):")
    
    for i, alert in enumerate(current_alerts, 1):
        severity_emoji = "🔴" if alert['severity'] == 'CRITICAL' else "🟡" if alert['severity'] == 'WARNING' else "🔵"
        print(f"\n{i}. {severity_emoji} {alert['severity']} - {alert['category']}")
        print(f"   {alert['message']}")
        print(f"   Time: {alert['timestamp'].strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Alert summary by category and severity
    alert_df = pd.DataFrame(current_alerts)
    
    print(f"\n📊 Alert Summary:")
    print(alert_df.groupby(['severity', 'category']).size().to_string())
    
else:
    print(f"\n✅ No active alerts - System operating normally")

print(f"\n📋 Alert Configuration:")
print(f"   • CPU Usage: >90% (Critical), >80% (Warning)")
print(f"   • Memory Usage: >85% (Warning)")
print(f"   • Response Time: >300ms (Warning), >500ms (Critical)")
print(f"   • Error Rate: >5% (Critical), >2% (Warning)")
print(f"   • Model Accuracy: <90% (Warning), <85% (Critical)")
print(f"   • Data Completeness: <95% (Warning), <90% (Critical)")
print(f"   • Feature Drift: >0.2 (Critical), >0.1 (Warning)")
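
The alert configuration printed above can also be kept as data, so that the thresholds and the classification logic stay in one place. The following is a minimal sketch, not the notebook's `check_alerts` implementation: `ALERT_THRESHOLDS` and `classify` are hypothetical names, the `completeness` and `feature_drift` keys are assumed, and the units (accuracy as a fraction, completeness as a percentage) are assumptions as well.

```python
# Hypothetical threshold table mirroring the printed alert configuration.
# Each entry: metric -> (direction, warning_threshold, critical_threshold).
# None means that severity is not configured for the metric.
ALERT_THRESHOLDS = {
    "cpu_usage":        ("above", 80.0, 90.0),
    "memory_usage":     ("above", 85.0, None),
    "response_time_ms": ("above", 300.0, 500.0),
    "error_rate":       ("above", 2.0, 5.0),
    "accuracy":         ("below", 0.90, 0.85),  # assumed to be a fraction
    "completeness":     ("below", 95.0, 90.0),  # assumed to be a percentage
    "feature_drift":    ("above", 0.1, 0.2),
}


def classify(metric, value):
    """Return 'CRITICAL', 'WARNING', or None for a single metric reading."""
    direction, warning, critical = ALERT_THRESHOLDS[metric]

    def breached(threshold):
        if threshold is None:
            return False
        if direction == "above":
            return value > threshold
        return value < threshold

    # Check the more severe level first so it takes precedence.
    if breached(critical):
        return "CRITICAL"
    if breached(warning):
        return "WARNING"
    return None


print(classify("cpu_usage", 85.0))
print(classify("accuracy", 0.80))
```

Keeping the thresholds in a table like this means the printed configuration could be generated from the same source as the checks, which avoids the two drifting apart.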

## 6. Performance Optimization {#optimization}

Analyze performance bottlenecks and optimization opportunities.

In [None]:
# Performance optimization analysis
def analyze_performance_bottlenecks(metrics_df, model_perf_df):
    """Identify performance bottlenecks and optimization opportunities"""
    
    optimization_recommendations = []
    
    # Analyze system metrics
    avg_cpu = metrics_df['cpu_usage'].mean()
    avg_memory = metrics_df['memory_usage'].mean()
    avg_response_time = metrics_df['response_time_ms'].mean()
    avg_requests = metrics_df['requests_per_second'].mean()
    
    # CPU Analysis
    if avg_cpu > 70:
        optimization_recommendations.append({
            'category': 'CPU',
            'priority': 'High',
            'issue': f'High average CPU usage: {avg_cpu:.1f}%',
            'recommendation': 'Consider horizontal scaling or CPU optimization',
            'impact': 'Performance degradation under load'
        })
    
    # Memory Analysis
    if avg_memory > 60:
        optimization_recommendations.append({
            'category': 'Memory',
            'priority': 'Medium',
            'issue': f'High average memory usage: {avg_memory:.1f}%',
            'recommendation': 'Optimize memory usage or increase available memory',
            'impact': 'Risk of out-of-memory errors'
        })
    
    # Response Time Analysis
    response_p95 = metrics_df['response_time_ms'].quantile(0.95)
    if response_p95 > 200:
        optimization_recommendations.append({
            'category': 'Latency',
            'priority': 'High',
            'issue': f'High P95 response time: {response_p95:.1f}ms',
            'recommendation': 'Optimize API endpoints and database queries',
            'impact': 'Poor user experience'
        })
    
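    # Model Performance Analysis
    # Compare the last 168 rows with the first 168 (one week each if the data is hourly);
    # the two windows overlap when there are fewer than 336 rows.
    recent_accuracy = model_perf_df['accuracy'].iloc[-168:].mean()  # Most recent window
    initial_accuracy = model_perf_df['accuracy'].iloc[:168].mean()  # Earliest window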
    
    if recent_accuracy < initial_accuracy - 0.01:
        optimization_recommendations.append({
            'category': 'Model',
            'priority': 'High',
            'issue': f'Model accuracy decline: {recent_accuracy:.3f} vs {initial_accuracy:.3f}',
            'recommendation': 'Retrain model with recent data',
            'impact': 'Reduced prediction quality'
        })
    
    # Throughput Analysis: ratio of average traffic to peak observed traffic
    max_throughput = metrics_df['requests_per_second'].max()
    if max_throughput > 0 and avg_requests / max_throughput < 0.3:  # Low utilization
        optimization_recommendations.append({
            'category': 'Capacity',
            'priority': 'Low',
            'issue': f'Low resource utilization: {avg_requests/max_throughput*100:.1f}%',
            'recommendation': 'Consider downsizing resources or handling more traffic',
            'impact': 'Cost optimization opportunity'
        })
    
    # Error Rate Analysis
    # diff().mean() is the average change between consecutive samples, not per hour
    error_trend = metrics_df['error_rate'].diff().mean()
    if error_trend > 0.01:  # Increasing error rate
        optimization_recommendations.append({
            'category': 'Reliability',
            'priority': 'High',
            'issue': f'Increasing error rate trend: +{error_trend:.3f}% per sample',
            'recommendation': 'Investigate root cause and implement error handling',
            'impact': 'Service reliability concerns'
        })
    
    return optimization_recommendations

# Analyze performance
recommendations = analyze_performance_bottlenecks(metrics_df, model_perf_df)

print("=" * 70)
print("⚡ PERFORMANCE OPTIMIZATION ANALYSIS")
print("=" * 70)

if recommendations:
    # Sort by priority
    priority_order = {'High': 0, 'Medium': 1, 'Low': 2}
    recommendations.sort(key=lambda x: priority_order.get(x['priority'], 3))
    
    print(f"\n🔍 Identified {len(recommendations)} optimization opportunities:")
    
    for i, rec in enumerate(recommendations, 1):
        priority_emoji = "🔴" if rec['priority'] == 'High' else "🟡" if rec['priority'] == 'Medium' else "🟢"
        print(f"\n{i}. {priority_emoji} {rec['priority']} Priority - {rec['category']}")
        print(f"   Issue: {rec['issue']}")
        print(f"   Recommendation: {rec['recommendation']}")
        print(f"   Impact: {rec['impact']}")
    
    # Summary by category
    rec_df = pd.DataFrame(recommendations)
    print(f"\n📊 Optimization Summary:")
    print(rec_df.groupby(['priority', 'category']).size().to_string())
    
else:
    print(f"\n✅ No major performance issues identified")
    print(f"   System is operating within optimal parameters")

# Performance metrics summary
print(f"\n📈 Current Performance Metrics:")
print(f"   • Average CPU Usage: {metrics_df['cpu_usage'].mean():.1f}%")
print(f"   • Average Memory Usage: {metrics_df['memory_usage'].mean():.1f}%")
print(f"   • Average Response Time: {metrics_df['response_time_ms'].mean():.1f}ms")
print(f"   • P95 Response Time: {metrics_df['response_time_ms'].quantile(0.95):.1f}ms")
print(f"   • P99 Response Time: {metrics_df['response_time_ms'].quantile(0.99):.1f}ms")
print(f"   • Average Requests/Second: {metrics_df['requests_per_second'].mean():.1f}")
print(f"   • Average Error Rate: {metrics_df['error_rate'].mean():.2f}%")
print(f"   • Model Accuracy (Recent): {model_perf_df['accuracy'].iloc[-24:].mean():.3f}")

print(f"\n🎯 Performance Targets:")
print(f"   • Response Time P95: <200ms")
print(f"   • Response Time P99: <500ms")
print(f"   • Error Rate: <1%")
print(f"   • CPU Usage: <70%")
print(f"   • Model Accuracy: >95%")
print(f"   • Uptime: >99.9%")
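
The error-rate check in the cell above takes the mean of first differences, which reduces to (last value - first value) / (n - 1) and so depends only on the two endpoints of the series. One alternative is a least-squares slope over all samples, scaled to a per-hour rate. The sketch below is an illustration under assumptions: `slope_per_hour` is a hypothetical helper, the samples are assumed to be evenly spaced, and the default `interval_minutes=5` is borrowed from the default of `generate_system_metrics`.

```python
def slope_per_hour(values, interval_minutes=5):
    """Least-squares slope of evenly spaced samples, in value units per hour."""
    n = len(values)
    if n < 2:
        return 0.0

    # Sample times in hours: 0, dt, 2*dt, ...
    dt_hours = interval_minutes / 60.0
    xs = [i * dt_hours for i in range(n)]

    x_mean = sum(xs) / n
    y_mean = sum(values) / n

    # Ordinary least squares: slope = cov(x, y) / var(x)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    return sxy / sxx


# Error rate rising by 0.1 percentage points every 5 minutes
rising = [1.0 + 0.1 * i for i in range(13)]
print(f"{slope_per_hour(rising):.2f} % per hour")
```

Inside the notebook, `np.polyfit(x, y, 1)[0]` should give the same slope for the same `x` and `y`; the pure-Python version is used here only to keep the example free of dependencies.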

In [None]:
# Performance trends and capacity planning
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Resource utilization trend
axes[0, 0].plot(metrics_df.index, metrics_df['cpu_usage'], label='CPU %', alpha=0.7)
axes[0, 0].plot(metrics_df.index, metrics_df['memory_usage'], label='Memory %', alpha=0.7)
axes[0, 0].axhline(y=80, color='r', linestyle='--', alpha=0.5, label='CPU Warning Threshold (80%)')
axes[0, 0].set_title('Resource Utilization Trend')
axes[0, 0].set_ylabel('Usage %')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Response time distribution
axes[0, 1].hist(metrics_df['response_time_ms'], bins=30, alpha=0.7, edgecolor='black')
axes[0, 1].axvline(x=metrics_df['response_time_ms'].mean(), color='r', linestyle='-', label=f"Mean: {metrics_df['response_time_ms'].mean():.0f}ms")
axes[0, 1].axvline(x=metrics_df['response_time_ms'].quantile(0.95), color='orange', linestyle='--', label=f"P95: {metrics_df['response_time_ms'].quantile(0.95):.0f}ms")
axes[0, 1].set_title('Response Time Distribution')
axes[0, 1].set_xlabel('Response Time (ms)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Throughput vs Response Time correlation
scatter = axes[1, 0].scatter(metrics_df['requests_per_second'], metrics_df['response_time_ms'], 
                           c=metrics_df['cpu_usage'], cmap='viridis', alpha=0.6)
axes[1, 0].set_title('Throughput vs Response Time (colored by CPU)')
axes[1, 0].set_xlabel('Requests per Second')
axes[1, 0].set_ylabel('Response Time (ms)')
plt.colorbar(scatter, ax=axes[1, 0], label='CPU Usage %')
axes[1, 0].grid(True, alpha=0.3)

# Model performance stability
daily_accuracy = model_perf_df.set_index('timestamp').resample('D')['accuracy'].mean()
axes[1, 1].plot(daily_accuracy.index, daily_accuracy.values, marker='o', linewidth=2)
axes[1, 1].axhline(y=0.95, color='r', linestyle='--', alpha=0.5, label='Target: 95%')
axes[1, 1].set_title('Daily Model Accuracy Trend')
axes[1, 1].set_ylabel('Accuracy')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Capacity planning recommendations
max_observed_rps = metrics_df['requests_per_second'].max()
avg_rps = metrics_df['requests_per_second'].mean()
growth_capacity = (max_observed_rps - avg_rps) / avg_rps * 100

print(f"\n🔮 CAPACITY PLANNING INSIGHTS:")
print(f"   • Current Average Load: {avg_rps:.1f} req/s")
print(f"   • Peak Load Observed: {max_observed_rps:.1f} req/s")
print(f"   • Growth Capacity: {growth_capacity:.1f}% above average")
print(f"   • Recommended Scaling Threshold: {avg_rps * 1.5:.1f} req/s")
print(f"   • Estimated Breaking Point (heuristic, 1.2x observed peak): {max_observed_rps * 1.2:.1f} req/s")

if growth_capacity < 50:
    print(f"   ⚠️  Consider adding capacity - low headroom for traffic spikes")
else:
    print(f"   ✅ Adequate capacity for handling traffic variations")
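
The headroom figure above measures how far the observed peak sits above the average load; it reflects the spread of past traffic rather than a tested capacity limit. The same arithmetic can be wrapped in a small function that also guards against an empty or all-zero series. This is a sketch: `capacity_headroom` and its `min_headroom_pct` parameter are hypothetical names, with the 50% default taken from the check in the cell above.

```python
def capacity_headroom(rps_samples, min_headroom_pct=50.0):
    """Summarize peak-over-average headroom for a series of req/s samples."""
    samples = list(rps_samples)
    if not samples:
        raise ValueError("need at least one sample")

    avg = sum(samples) / len(samples)
    peak = max(samples)
    if avg <= 0:
        raise ValueError("average load must be positive")

    # Percentage by which the observed peak exceeds the average load
    headroom_pct = (peak - avg) / avg * 100.0
    return {
        "avg_rps": avg,
        "peak_rps": peak,
        "headroom_pct": headroom_pct,
        "needs_capacity": headroom_pct < min_headroom_pct,
    }


summary = capacity_headroom([100, 100, 130, 70])
print(summary)
```

With the notebook's data this could be called as `capacity_headroom(metrics_df['requests_per_second'])`, since a pandas Series can be converted with `list(...)`.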

## 📝 Conclusion

This notebook walked through monitoring the ML Pipeline Platform's operational health across four areas:

### Key Monitoring Areas:
- **System Performance**: CPU, memory, response times, and throughput
- **Model Performance**: Accuracy trends, drift detection, and prediction volume
- **Data Quality**: Completeness, freshness, and feature drift monitoring
- **Real-time Alerts**: Automated detection of performance anomalies

### Monitoring Best Practices:
1. **Proactive Alerting**: Set appropriate thresholds for early warning
2. **Trend Analysis**: Monitor long-term patterns for capacity planning
3. **Multi-dimensional Monitoring**: Combine system, model, and data metrics
4. **Performance Optimization**: Regular analysis for bottleneck identification

### Next Steps:
1. **Implement Dashboard**: Deploy real-time monitoring dashboard
2. **Alert Integration**: Connect alerts to incident management systems
3. **Automated Responses**: Implement auto-scaling and self-healing
4. **Historical Analysis**: Maintain long-term performance baselines

This monitoring framework is intended to help the ML Pipeline Platform operate reliably at scale while maintaining performance and data quality.
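
As a starting point for the alert integration step, the alert dictionaries built earlier in this notebook can be serialized to JSON before being handed to an external system. The sketch below is hypothetical: `alert_to_payload` is not part of the notebook, no particular incident management API is assumed, and the example alert only reuses the keys that the alert dictionaries already carry.

```python
import json
from datetime import datetime


def alert_to_payload(alert):
    """Convert one alert dict into a JSON string with an ISO 8601 timestamp."""
    payload = dict(alert)  # shallow copy so the original alert is untouched
    ts = payload.get("timestamp")
    if isinstance(ts, datetime):
        payload["timestamp"] = ts.isoformat()
    # default=str is a fallback for values the json module cannot encode natively
    return json.dumps(payload, sort_keys=True, default=str)


example_alert = {
    "severity": "WARNING",
    "category": "Traffic",
    "message": "High traffic spike: 250.0 req/s (avg: 100.0)",
    "value": 250.0,
    "threshold": 200.0,
    "timestamp": datetime(2024, 1, 1, 12, 0, 0),
}
print(alert_to_payload(example_alert))
```

How the resulting string is delivered (webhook, message queue, log shipper) depends on the incident management system in use and is left open here.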