# Isolation Forest Anomaly Detection for Self-Healing Platform

## Overview
This notebook trains an Isolation Forest to detect anomalies in OpenShift metrics. Isolation Forest is an unsupervised method: it needs no labeled training data and copes well with high-dimensional inputs.
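
As a minimal illustration of the label-free behaviour described above, the following toy sketch fits an Isolation Forest on made-up 2-D data and counts how many injected outliers it flags. The data, the ring of outliers, and the `contamination=0.02` setting are assumptions chosen for this example, not values taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 500 "normal" points near the origin (illustrative data only).
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# 10 injected outliers spread evenly on a ring of radius 10.
angles = np.linspace(0, 2 * np.pi, 10, endpoint=False)
outliers = 10 * np.column_stack([np.cos(angles), np.sin(angles)])

X_toy = np.vstack([normal, outliers])

# No labels are passed to fit(): the model is unsupervised.
toy_model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
toy_pred = toy_model.fit_predict(X_toy)  # 1 = normal, -1 = anomaly

n_flagged = int((toy_pred == -1).sum())
n_outliers_caught = int((toy_pred[-10:] == -1).sum())
print(f"Flagged {n_flagged} of {len(X_toy)} points; "
      f"{n_outliers_caught} of the 10 injected outliers were caught")
```

Labels appear nowhere in the fit; they are only needed later, when the notebook measures precision and recall.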

## Prerequisites
- Completed: `synthetic-anomaly-generation.ipynb` (Phase 1)
- PyTorch workbench environment with scikit-learn
- Synthetic dataset: `/opt/app-root/src/data/processed/synthetic_anomalies.parquet`

## Why We Use Synthetic Data

### The Problem: Real Anomalies Are Rare
In production OpenShift clusters:
- Anomalies occur <1% of the time
- Collecting 1000 labeled anomalies takes months/years
- Different anomaly types are hard to capture
- Can't deliberately cause failures to collect data

### The Solution: Synthetic Anomalies
We generate synthetic anomalies because:
- ✅ Create 1000+ labeled anomalies in minutes
- ✅ Control anomaly types and severity
- ✅ Ensure balanced training data (50% normal, 50% anomaly)
- ✅ Reproducible and testable
- ✅ Models trained on synthetic data are intended to generalize to real anomalies, which should be confirmed against production incidents

### Machine Learning Best Practice
Isolation Forest is unsupervised and never sees labels during training, but evaluating it (precision, recall, F1) requires labeled data. Synthetic data provides:
1. **Ground Truth**: Known labels for evaluation
2. **Balanced Classes**: Equal normal and anomaly samples
3. **Reproducibility**: Same data for consistent results
4. **Generalization**: Models learn patterns rather than memorizing examples

## Enhanced Metrics (v2.0)
This version includes **30+ metrics** across 6 categories:
- **CPU**: Utilization, saturation, iowait, steal, throttling
- **Memory**: Utilization, pressure, OOM kills, swap
- **Disk I/O**: Latency, IOPS, throughput, utilization
- **Network**: Errors, drops, retransmits, conntrack
- **Stability**: Restarts, crashes, pending pods
- **Kubernetes State**: Deployments, nodes, quotas

## Expected Outcomes
- Train an Isolation Forest model on synthetic metric data with injected anomalies
- Evaluate model performance (Precision, Recall, F1)
- Save trained model for integration with coordination engine
- Generate anomaly detection pipeline for real-time use

## References
- ADR-002: Hybrid Deterministic-AI Self-Healing Approach
- ADR-012: Notebook Architecture for End-to-End Workflows
- [Isolation Forest Paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf) - Liu, Ting & Zhou (2008)
- [Learning from Imbalanced Data](https://ieeexplore.ieee.org/document/5128907) - He & Garcia (2009)
- [Anomaly Detection with Robust Deep Autoencoders](https://arxiv.org/abs/1511.08747) - Goldstein & Uchida (2016)

## Setup and Configuration

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
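        # __file__ is undefined in notebooks; dir() inside a function lists only locals, so check globals()
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,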
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import joblib
import warnings
warnings.filterwarnings('ignore')

# Machine learning libraries
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline  # ✨ Added for KServe compatibility

# Try to import common functions, with fallback
try:
    from common_functions import (
        setup_environment, print_environment_info,
        generate_synthetic_timeseries, validate_data_quality,
        plot_metric_overview, save_processed_data, load_processed_data
    )
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    print("   Using minimal fallback implementations")
    
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models/anomaly-detection', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}
    
    def print_environment_info(env_info):
        print(f"📁 Data dir: {env_info.get('data_dir', 'N/A')}")
    
    def generate_synthetic_timeseries(metric_name, duration_hours=24, interval_minutes=1, 
                                      add_anomalies=True, anomaly_probability=0.02):
        num_points = int(duration_hours * 60 / interval_minutes)
        timestamps = pd.date_range(end=datetime.now(), periods=num_points, freq=f'{interval_minutes}min')
        values = np.random.normal(50, 10, num_points)
        if add_anomalies:
            anomaly_idx = np.random.choice(num_points, int(num_points * anomaly_probability), replace=False)
            values[anomaly_idx] *= np.random.choice([0.3, 3.0], len(anomaly_idx))
        df = pd.DataFrame({'timestamp': timestamps, 'value': values, 'metric': metric_name, 'is_anomaly': False})
        if add_anomalies:
            df.loc[anomaly_idx, 'is_anomaly'] = True
        return df
    
    def save_processed_data(data, filename):
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        filepath = f'/opt/app-root/src/data/processed/{filename}'
        if hasattr(data, 'to_parquet'):
            data.to_parquet(filepath)
        print(f"💾 Saved: {filepath}")

print("✅ Libraries imported successfully")
print(f"🔬 Scikit-learn available with Pipeline support")
print(f"📊 Pandas version: {pd.__version__}")

## Enhanced Metrics Configuration

Import the enhanced metrics configuration module, which provides:
- **30+ metrics** across 6 categories (the original version used 5 metrics); the inline fallback in the cell below defines 25 of them
- **Category-specific** Isolation Forest configurations
- **Pre-defined PromQL queries** for each metric
- **Thresholds** for warning and critical alerts

In [None]:
# ============================================================================
# ENHANCED METRICS CONFIGURATION (v2.0)
# ============================================================================
# Import enhanced metrics module
# If not available, fall back to inline definitions

try:
    from enhanced_metrics_config import (
        ISOLATION_FOREST_CONFIGS,
        AnomalyCategory,
        TARGET_METRICS_ENHANCED,
        STABILITY_METRICS,
        PERFORMANCE_METRICS,
        RESOURCE_EXHAUSTION_METRICS,
        get_prometheus_queries,
        get_thresholds,
    )
    ENHANCED_CONFIG_AVAILABLE = True
    print("✅ Enhanced metrics configuration loaded from module")
except ImportError:
    print("⚠️ Enhanced metrics module not found - using inline configuration")
    ENHANCED_CONFIG_AVAILABLE = False
    
    # Inline fallback: Define enhanced configuration directly
    from enum import Enum
    
    class AnomalyCategory(Enum):
        RESOURCE = "resource"
        STABILITY = "stability"
        PERFORMANCE = "performance"
        NETWORK = "network"
        CONTROL_PLANE = "control_plane"
    
    ISOLATION_FOREST_CONFIGS = {
        AnomalyCategory.RESOURCE: {
            'contamination': 0.05,
            'n_estimators': 200,
            'max_samples': 'auto',
            'max_features': 1.0,
            'random_state': 42,
            'bootstrap': False,
            'n_jobs': -1
        },
        AnomalyCategory.STABILITY: {
            'contamination': 0.03,
            'n_estimators': 150,
            'max_samples': 256,
            'max_features': 0.8,
            'random_state': 42,
            'bootstrap': True,
            'n_jobs': -1
        },
        AnomalyCategory.PERFORMANCE: {
            'contamination': 0.08,
            'n_estimators': 200,
            'max_samples': 'auto',
            'max_features': 1.0,
            'random_state': 42,
            'bootstrap': False,
            'n_jobs': -1
        },
        AnomalyCategory.NETWORK: {
            'contamination': 0.06,
            'n_estimators': 175,
            'max_samples': 'auto',
            'max_features': 0.9,
            'random_state': 42,
            'bootstrap': False,
            'n_jobs': -1
        },
        AnomalyCategory.CONTROL_PLANE: {
            'contamination': 0.02,
            'n_estimators': 250,
            'max_samples': 512,
            'max_features': 1.0,
            'random_state': 42,
            'bootstrap': False,
            'n_jobs': -1
        }
    }
    
    TARGET_METRICS_ENHANCED = [
        # Original metrics
        'node_cpu_utilization',
        'node_memory_utilization',
        'pod_cpu_usage',
        'pod_memory_usage',
        'container_restart_count',
        # CPU enhancements
        'node_cpu_saturation',
        'node_cpu_iowait',
        'pod_cpu_throttled_percent',
        'node_load_per_cpu',
        # Memory enhancements  
        'node_memory_pressure',
        'node_memory_oom_kills',
        'pod_memory_utilization',
        # Disk I/O
        'node_disk_io_utilization',
        'node_disk_read_latency_ms',
        'node_disk_write_latency_ms',
        # Network
        'node_network_errors',
        'node_network_drops',
        'node_tcp_retransmit_rate',
        'pod_network_errors',
        # Kubernetes state
        'pods_pending',
        'pods_not_ready',
        'deployment_replicas_unavailable',
        # Stability
        'container_restart_rate_1h',
        'pod_crash_loop_backoff',
        'pod_oom_killed',
    ]
    
    STABILITY_METRICS = [
        'container_restart_count', 'container_restart_rate_1h',
        'pod_crash_loop_backoff', 'pod_oom_killed',
        'pods_pending', 'pods_not_ready', 'pods_failed',
        'deployment_replicas_unavailable', 'node_memory_oom_kills',
    ]

print(f"📊 Enhanced metrics available: {len(TARGET_METRICS_ENHANCED)}")

In [None]:
# Set up environment
env_info = setup_environment()
print_environment_info(env_info)

# ============================================================================
# DETECTION FOCUS SELECTION
# ============================================================================
# Choose your detection focus based on what anomalies you want to catch:
#
#   AnomalyCategory.RESOURCE      - CPU/memory exhaustion (most common)
#   AnomalyCategory.STABILITY     - Crashes, restarts, OOM kills
#   AnomalyCategory.PERFORMANCE   - Latency spikes, throughput issues
#   AnomalyCategory.NETWORK       - Connectivity, packet loss, errors
#   AnomalyCategory.CONTROL_PLANE - API server, etcd, scheduler issues
#
# Each category has a tuned Isolation Forest configuration optimized for
# that type of anomaly detection.
# ============================================================================

DETECTION_FOCUS = AnomalyCategory.RESOURCE  # 👈 Change this to switch focus

# Get category-specific Isolation Forest configuration
ISOLATION_FOREST_CONFIG = ISOLATION_FOREST_CONFIGS[DETECTION_FOCUS]

print(f"\n🎯 Detection Focus: {DETECTION_FOCUS.value.upper()}")
print(f"   Contamination: {ISOLATION_FOREST_CONFIG['contamination']} ({ISOLATION_FOREST_CONFIG['contamination']*100:.0f}% expected anomalies)")
print(f"   Estimators: {ISOLATION_FOREST_CONFIG['n_estimators']} trees")
print(f"   Max Samples: {ISOLATION_FOREST_CONFIG['max_samples']}")
print(f"   Bootstrap: {ISOLATION_FOREST_CONFIG.get('bootstrap', False)}")

# ============================================================================
# ENHANCED TARGET METRICS
# ============================================================================
# Use all 30+ enhanced metrics for comprehensive anomaly detection
# Or select a subset based on your focus area
# ============================================================================

# Option 1: Use ALL enhanced metrics (recommended for general detection)
TARGET_METRICS = TARGET_METRICS_ENHANCED

# Option 2: Use stability-focused subset
# TARGET_METRICS = STABILITY_METRICS

# Option 3: Use original 5 metrics (for comparison/baseline)
# TARGET_METRICS = [
#     'node_cpu_utilization',
#     'node_memory_utilization', 
#     'pod_cpu_usage',
#     'pod_memory_usage',
#     'container_restart_count'
# ]

print(f"\n📊 Target Metrics: {len(TARGET_METRICS)} metrics")
print(f"🌲 Isolation Forest: {ISOLATION_FOREST_CONFIG['n_estimators']} trees")

# Display metric categories
print(f"\n📋 Metrics by category:")
categories = {
    'CPU': [m for m in TARGET_METRICS if 'cpu' in m.lower()],
    'Memory': [m for m in TARGET_METRICS if 'memory' in m.lower() or 'oom' in m.lower()],
    'Disk': [m for m in TARGET_METRICS if 'disk' in m.lower()],
    'Network': [m for m in TARGET_METRICS if 'network' in m.lower() or 'tcp' in m.lower()],
    'Stability': [m for m in TARGET_METRICS if any(x in m.lower() for x in ['restart', 'crash', 'pending', 'ready', 'failed'])],
}
for cat, metrics in categories.items():
    if metrics:
        print(f"   {cat}: {len(metrics)} metrics")

## Data Preparation

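### Generate Synthetic Anomalies for Training

The cells below call `generate_synthetic_timeseries` to build a synthetic series with injected anomalies for each target metric, continuing the work from Phase 1 (`synthetic-anomaly-generation.ipynb`).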

**Why Synthetic Data?**
- Real anomalies are rare (<1% in production clusters)
- Synthetic data provides labeled training examples
- Models learn general patterns, not memorize specific examples
- Balanced dataset (50% normal, 50% anomaly) improves performance
- Reproducible and testable

**References:**
- He & Garcia (2009): "Learning from Imbalanced Data" - https://ieeexplore.ieee.org/document/5128907
- Nikolenko (2021): "Synthetic Data for Deep Learning" - https://arxiv.org/abs/1909.11373
- Goldstein & Uchida (2016): "Anomaly Detection with Robust Deep Autoencoders" - https://arxiv.org/abs/1511.08747

In [None]:
def prepare_anomaly_detection_data(duration_hours=48):
    """
    Generate and prepare data for anomaly detection training.
    
    Now supports 30+ enhanced metrics with realistic patterns.
    """
    print("🔄 Preparing anomaly detection dataset...")
    print(f"   Using {len(TARGET_METRICS)} enhanced metrics")
    
    # Generate synthetic data for each target metric
    all_data = {}
    
    for i, metric in enumerate(TARGET_METRICS):
        print(f"  📊 [{i+1}/{len(TARGET_METRICS)}] Generating {metric}...")
        
        # Adjust anomaly probability based on metric type
        if any(x in metric for x in ['restart', 'crash', 'oom', 'failed']):
            anomaly_prob = 0.02  # Rare events
        elif any(x in metric for x in ['error', 'drop', 'pending']):
            anomaly_prob = 0.04  # Somewhat rare
        else:
            anomaly_prob = 0.03  # Default
        
        df = generate_synthetic_timeseries(
            metric_name=metric,
            duration_hours=duration_hours,
            interval_minutes=1,
            add_anomalies=True,
            anomaly_probability=anomaly_prob
        )
        all_data[metric] = df
        
        if (i + 1) % 10 == 0:
            print(f"       ✅ Generated {i+1}/{len(TARGET_METRICS)} metrics")
    
    print(f"\n✅ Generated data for {len(all_data)} metrics")
    return all_data

# Generate training data
training_data = prepare_anomaly_detection_data(duration_hours=48)

# Display summary
total_points = sum(len(df) for df in training_data.values())
total_anomalies = sum(df['is_anomaly'].sum() for df in training_data.values())
print(f"\n📈 Dataset Summary:")
print(f"  Total data points: {total_points:,}")
print(f"  Total anomalies: {total_anomalies:,} ({total_anomalies/total_points:.2%})")
print(f"  Metrics: {len(training_data)}")

In [None]:
def create_feature_matrix(data_dict):
    """
    Create feature matrix for anomaly detection.
    
    Engineered features per metric:
    - Raw value
    - Rolling statistics over 5-minute windows (mean, std, min, max)
    - Lag features (1 and 5 steps)
    - Rate of change (diff and percent change)

    Cross-metric correlations are not computed here.
    """
    print("🔧 Creating feature matrix...")
    print(f"   Input: {len(data_dict)} metrics")
    
    # Align all time series to common timestamps
    min_start = max(df['timestamp'].min() for df in data_dict.values())
    max_end = min(df['timestamp'].max() for df in data_dict.values())
    
    print(f"  📅 Time range: {min_start} to {max_end}")
    
    # Create common time index
    time_index = pd.date_range(start=min_start, end=max_end, freq='1min')
    
    # Build feature matrix
    features = pd.DataFrame(index=time_index)
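    # Initialise explicitly to False (all normal) rather than relying on how pandas fills a data-less bool Series
    labels = pd.Series(False, index=time_index, name='is_anomaly')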
    
    for metric_name, df in data_dict.items():
        # Resample to common time index
        df_resampled = df.set_index('timestamp').reindex(time_index, method='nearest')
        
        # Add basic features
        features[f'{metric_name}_value'] = df_resampled['value']
        
        # Add rolling statistics (5-minute windows)
        features[f'{metric_name}_mean_5m'] = df_resampled['value'].rolling('5min').mean()
        features[f'{metric_name}_std_5m'] = df_resampled['value'].rolling('5min').std()
        features[f'{metric_name}_min_5m'] = df_resampled['value'].rolling('5min').min()
        features[f'{metric_name}_max_5m'] = df_resampled['value'].rolling('5min').max()
        
        # Add lag features
        features[f'{metric_name}_lag_1'] = df_resampled['value'].shift(1)
        features[f'{metric_name}_lag_5'] = df_resampled['value'].shift(5)
        
        # Add rate of change
        features[f'{metric_name}_diff'] = df_resampled['value'].diff()
        features[f'{metric_name}_pct_change'] = df_resampled['value'].pct_change()
        
        # Combine anomaly labels (any metric anomaly = overall anomaly)
        metric_anomalies = df_resampled['is_anomaly'].fillna(False)
        labels = labels | metric_anomalies
    
    # Fill missing values
    features = features.ffill().bfill()
    labels = labels.fillna(False)
    
    # Replace infinity values with 0 and remaining NaN with 0
    features = features.replace([np.inf, -np.inf], 0)
    features = features.fillna(0)
    
    print(f"  ✅ Feature matrix: {features.shape}")
    print(f"  🏷️ Anomaly labels: {labels.sum()} anomalies ({labels.mean():.2%})")
    print(f"  📐 Features per metric: ~9 (value + 4 rolling + 2 lag + 2 diff)")
    
    return features, labels

# Create feature matrix
X, y = create_feature_matrix(training_data)

print(f"\n📊 Feature Engineering Complete:")
print(f"  Features: {X.shape[1]} columns")
print(f"  Samples: {X.shape[0]:,} rows")
print(f"  Anomaly rate: {y.mean():.2%}")
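
Because a row is labeled anomalous when any metric is anomalous, the row-level anomaly rate is much higher than the 2-4% injected per metric. The sketch below estimates it for the 25-metric inline fallback list, assuming anomalies are injected independently per metric (as the fallback `generate_synthetic_timeseries` does). The counts per probability tier are derived from the keyword rules in `prepare_anomaly_detection_data`; a real `common_functions` module may behave differently.

```python
import math

# Per-metric anomaly probabilities for the 25-metric fallback list:
# 5 metrics match restart/crash/oom/failed (2%), 4 match error/drop/pending (4%),
# and the remaining 16 use the 3% default.
per_metric_probs = [0.02] * 5 + [0.04] * 4 + [0.03] * 16

# P(row is anomalous) = 1 - P(every metric is normal at that timestamp),
# assuming independent injection across metrics.
p_all_normal = math.prod(1.0 - p for p in per_metric_probs)
expected_row_anomaly_rate = 1.0 - p_all_normal

print(f"Expected row-level anomaly rate: {expected_row_anomaly_rate:.1%}")
```

Under these assumptions about half of all rows carry an anomaly label. The `contamination` values in `ISOLATION_FOREST_CONFIGS` (2-8%) make the model flag far fewer rows than that, which would cap recall at a low value, so it may be worth checking the printed anomaly rate against the chosen `contamination` before interpreting the evaluation.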

## Model Training and Evaluation

Train Isolation Forest model and evaluate its performance.

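**Note:** The enhanced metrics produce about 9 engineered features per metric, so the feature matrix is much wider than in the original 5-metric version. More features can help detection but do not guarantee it, so compare results against the 5-metric baseline. The sklearn Pipeline bundles scaling with the model so that inference applies the same preprocessing as training.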

In [None]:
# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"📊 Data Split:")
print(f"  Training: {X_train.shape[0]:,} samples, {X_train.shape[1]} features")
print(f"  Testing: {X_test.shape[0]:,} samples")
print(f"  Training anomalies: {y_train.sum()} ({y_train.mean():.2%})")
print(f"  Testing anomalies: {y_test.sum()} ({y_test.mean():.2%})")

# ============================================================================
# CREATE SKLEARN PIPELINE (KServe Compatible)
# ============================================================================
# Using Pipeline ensures:
# 1. Scaler and model are saved together in ONE .pkl file
# 2. KServe can load and use directly without manual preprocessing
# 3. Inference is consistent with training
# ============================================================================

print(f"\n🔧 Creating Isolation Forest pipeline...")
print(f"   Detection focus: {DETECTION_FOCUS.value}")
print(f"   Config: {ISOLATION_FOREST_CONFIG['n_estimators']} trees, {ISOLATION_FOREST_CONFIG['contamination']*100:.0f}% contamination")

isolation_forest_pipeline = Pipeline([
    ('scaler', RobustScaler()),  # More robust to outliers than StandardScaler
    ('isolation_forest', IsolationForest(**ISOLATION_FOREST_CONFIG))
])

print("✅ Pipeline created (RobustScaler + Isolation Forest)")
print(f"   Features: {X_train.shape[1]}")
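
To illustrate the comment that `RobustScaler` is more robust to outliers than `StandardScaler`, here is a small sketch on made-up readings (the values are illustrative, not real metrics). `StandardScaler` centres on the mean and divides by the standard deviation, both of which a single spike distorts; `RobustScaler` uses the median and interquartile range by default.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Nine ordinary readings and one extreme spike (illustrative values only).
values = np.array([48, 49, 50, 50, 51, 51, 52, 52, 53, 500], dtype=float).reshape(-1, 1)

standard_scaled = StandardScaler().fit_transform(values).ravel()
robust_scaled = RobustScaler().fit_transform(values).ravel()

# StandardScaler: the spike inflates the standard deviation, so the nine
# ordinary readings are squeezed into a narrow band.
standard_spread = standard_scaled[:9].max() - standard_scaled[:9].min()

# RobustScaler: median and IQR barely react to the spike, so the ordinary
# readings keep a usable spread.
robust_spread = robust_scaled[:9].max() - robust_scaled[:9].min()

print(f"Spread of ordinary readings, StandardScaler: {standard_spread:.3f}")
print(f"Spread of ordinary readings, RobustScaler:   {robust_spread:.3f}")
```

Isolation Forest draws split values between each feature's minimum and maximum, so a per-feature rescaling of this kind should not materially change which points are isolated. The main practical benefit of keeping the scaler in the Pipeline is that preprocessing and model travel together in one artifact.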

In [None]:
# Train Isolation Forest Pipeline
print("🌲 Training Isolation Forest pipeline...")
print(f"   Training on {X_train.shape[0]:,} samples with {X_train.shape[1]} features")
print("   Pipeline automatically handles: scaler.fit_transform() → model.fit()")

import time
start_time = time.time()

# Fit pipeline on training data
isolation_forest_pipeline.fit(X_train)

training_time = time.time() - start_time
print(f"✅ Training complete in {training_time:.2f} seconds")

# Make predictions using pipeline
print("\n🔮 Making predictions...")
y_pred_train = isolation_forest_pipeline.predict(X_train)
y_pred_test = isolation_forest_pipeline.predict(X_test)

# Get anomaly scores
train_scores = isolation_forest_pipeline.decision_function(X_train)
test_scores = isolation_forest_pipeline.decision_function(X_test)

# Convert sklearn's output (1 = normal, -1 = anomaly) to a boolean mask (True = anomaly)
y_pred_train_binary = (y_pred_train == -1)
y_pred_test_binary = (y_pred_test == -1)

print(f"  Training predictions: {y_pred_train_binary.sum()} anomalies detected")
print(f"  Testing predictions: {y_pred_test_binary.sum()} anomalies detected")
print(f"\n✅ Pipeline handles scaling automatically - no separate scaler needed!")
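
The relationship between `predict`, `decision_function`, and `contamination` can be checked on a small standalone example. This is a sketch on random data (not the notebook's features): `predict` returns -1 exactly where `decision_function` is negative, and `contamination` sets the threshold so that roughly that fraction of the training data falls below zero.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))  # random stand-in data

demo_model = IsolationForest(n_estimators=50, contamination=0.05, random_state=0).fit(X_demo)

scores = demo_model.decision_function(X_demo)  # negative = more anomalous
preds = demo_model.predict(X_demo)             # -1 = anomaly, 1 = normal

# predict() thresholds decision_function() at zero.
is_anomaly_from_scores = scores < 0
is_anomaly_from_preds = preds == -1

flagged_fraction = is_anomaly_from_preds.mean()
print(f"Flagged fraction on the training data: {flagged_fraction:.3f}")
```

This is why the score plots later in the notebook draw a reference line at zero.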

In [None]:
# Evaluate model performance
print("📊 Model Evaluation")
print("=" * 60)
print(f"Detection Focus: {DETECTION_FOCUS.value.upper()}")
print(f"Features: {X.shape[1]} | Metrics: {len(TARGET_METRICS)}")
print("=" * 60)

# Training set performance
print("\n🏋️ Training Set Performance:")
print(classification_report(y_train, y_pred_train_binary, 
                          target_names=['Normal', 'Anomaly']))

# Test set performance
print("\n🧪 Test Set Performance:")
print(classification_report(y_test, y_pred_test_binary, 
                          target_names=['Normal', 'Anomaly']))

# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Training confusion matrix
cm_train = confusion_matrix(y_train, y_pred_train_binary)
sns.heatmap(cm_train, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Normal', 'Anomaly'], 
            yticklabels=['Normal', 'Anomaly'], ax=axes[0])
axes[0].set_title(f'Training Set Confusion Matrix\n({len(TARGET_METRICS)} enhanced metrics)')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# Test confusion matrix
cm_test = confusion_matrix(y_test, y_pred_test_binary)
sns.heatmap(cm_test, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Normal', 'Anomaly'], 
            yticklabels=['Normal', 'Anomaly'], ax=axes[1])
axes[1].set_title(f'Test Set Confusion Matrix\n(Focus: {DETECTION_FOCUS.value})')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

## Model Analysis and Visualization

In [None]:
# Analyze anomaly scores distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle(f'Isolation Forest Analysis ({len(TARGET_METRICS)} Enhanced Metrics)', fontsize=16, fontweight='bold')

# Score distribution
axes[0, 0].hist(train_scores[~y_train], bins=50, alpha=0.7, label='Normal', density=True)
axes[0, 0].hist(train_scores[y_train], bins=50, alpha=0.7, label='Anomaly', density=True)
axes[0, 0].set_title('Anomaly Score Distribution (Training)')
axes[0, 0].set_xlabel('Anomaly Score')
axes[0, 0].set_ylabel('Density')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Score vs time (sample)
sample_size = min(1000, len(test_scores))
sample_indices = np.random.choice(len(test_scores), sample_size, replace=False)
sample_indices = np.sort(sample_indices)

axes[0, 1].plot(sample_indices, test_scores[sample_indices], 'b-', alpha=0.7, linewidth=1)
anomaly_indices = sample_indices[y_test.iloc[sample_indices]]
if len(anomaly_indices) > 0:
    axes[0, 1].scatter(anomaly_indices, test_scores[anomaly_indices], 
                      color='red', s=30, alpha=0.8, label='True Anomalies')
axes[0, 1].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
axes[0, 1].set_title('Anomaly Scores Over Time (Test Sample)')
axes[0, 1].set_xlabel('Sample Index')
axes[0, 1].set_ylabel('Anomaly Score')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 2-D PCA projection for visualization only (PCA does not measure feature importance)
pca = PCA(n_components=2)
X_test_pca = pca.fit_transform(X_test)

normal_mask = ~y_test
anomaly_mask = y_test

axes[1, 0].scatter(X_test_pca[normal_mask, 0], X_test_pca[normal_mask, 1], 
                  c='blue', alpha=0.6, s=20, label='Normal')
axes[1, 0].scatter(X_test_pca[anomaly_mask, 0], X_test_pca[anomaly_mask, 1], 
                  c='red', alpha=0.8, s=30, label='Anomaly')
axes[1, 0].set_title('PCA Visualization (Test Set)')
axes[1, 0].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
axes[1, 0].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Model performance metrics
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_test, y_pred_test_binary)
recall = recall_score(y_test, y_pred_test_binary)
f1 = f1_score(y_test, y_pred_test_binary)

# ROC AUC depends only on score ranking, so no normalization is needed.
# decision_function is lower for anomalies; negate it so that higher = more anomalous.
auc = roc_auc_score(y_test, -test_scores)

metrics_text = f"""
Model Performance Metrics:

Precision: {precision:.3f}
Recall: {recall:.3f}
F1-Score: {f1:.3f}
AUC-ROC: {auc:.3f}

Configuration:
Focus: {DETECTION_FOCUS.value}
Trees: {ISOLATION_FOREST_CONFIG['n_estimators']}
Contamination: {ISOLATION_FOREST_CONFIG['contamination']}
Features: {X.shape[1]}
Metrics: {len(TARGET_METRICS)}

Data:
Training: {X_train.shape[0]:,}
Testing: {X_test.shape[0]:,}
"""

axes[1, 1].text(0.05, 0.95, metrics_text, transform=axes[1, 1].transAxes, 
               fontsize=10, verticalalignment='top',
               bbox=dict(boxstyle='round', facecolor='lightgray', alpha=0.8))
axes[1, 1].set_title('Model Summary')
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()

print(f"\n🎯 Model Performance Summary:")
print(f"  Detection Focus: {DETECTION_FOCUS.value}")
print(f"  Metrics Used: {len(TARGET_METRICS)}")
print(f"  Features: {X.shape[1]}")
print(f"  Precision: {precision:.3f}")
print(f"  Recall: {recall:.3f}")
print(f"  F1-Score: {f1:.3f}")
print(f"  AUC-ROC: {auc:.3f}")
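
ROC AUC is computed from the ranking of scores, so any strictly decreasing transform of `decision_function` yields the same value. The sketch below checks this on hypothetical labels and scores (random data, not the notebook's results) by comparing an inverted min-max scaling with a plain negation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

# Hypothetical labels and decision_function-style scores (lower = more anomalous).
y_true = rng.random(500) < 0.1
raw_scores = rng.normal(size=500) - 1.5 * y_true

# Variant 1: min-max normalise to [0, 1], then invert.
scaled = (raw_scores - raw_scores.min()) / (raw_scores.max() - raw_scores.min())
auc_minmax = roc_auc_score(y_true, 1 - scaled)

# Variant 2: negate the raw scores.
auc_negated = roc_auc_score(y_true, -raw_scores)

print(f"AUC via min-max: {auc_minmax:.6f}, AUC via negation: {auc_negated:.6f}")
```

The min-max values are not calibrated probabilities; they only preserve the ordering, which is all AUC needs.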

## Save Model and Upload to S3

Save the trained pipeline model in KServe-compatible format.

In [None]:
# Save pipeline model to persistent storage
# Use /mnt/models for persistent storage (model-storage-pvc)
# Fallback to local for development outside cluster
MODELS_DIR = Path('/mnt/models') if Path('/mnt/models').exists() else Path('/opt/app-root/src/models')

# Create KServe-compatible subdirectory structure
MODEL_NAME = 'anomaly-detector'
MODEL_DIR = MODELS_DIR / MODEL_NAME
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Save with KServe expected filename
model_path = MODEL_DIR / 'model.pkl'

# Migration: Move old flat file if exists
old_path = MODELS_DIR / 'anomaly-detector.pkl'
if old_path.exists() and not model_path.exists():
    import shutil
    shutil.move(str(old_path), str(model_path))
    print(f"🔄 Migrated model from {old_path} to {model_path}")

# ✨ Save SINGLE pipeline file (KServe compatible)
# KServe sklearn server expects model at: /mnt/models/anomaly-detector/model.pkl
joblib.dump(isolation_forest_pipeline, model_path)
print(f"💾 Saved Isolation Forest pipeline to: {model_path}")
print(f"   ✅ KServe-compatible path: {MODEL_NAME}/model.pkl")
print(f"   ✅ Single .pkl file (scaler + model combined)")
print(f"   ✅ Enhanced metrics: {len(TARGET_METRICS)} metrics, {X.shape[1]} features")

# Save model metadata (features list for inference)
import json
metadata = {
    'model_name': MODEL_NAME,
    'detection_focus': DETECTION_FOCUS.value,
    'n_metrics': len(TARGET_METRICS),
    'n_features': X.shape[1],
    'metrics': TARGET_METRICS,
    'feature_names': list(X.columns),
    'config': {
        'contamination': ISOLATION_FOREST_CONFIG['contamination'],
        'n_estimators': ISOLATION_FOREST_CONFIG['n_estimators'],
    },
    'performance': {
        'precision': float(precision),
        'recall': float(recall),
        'f1_score': float(f1),
        'auc_roc': float(auc),
    },
    'created_at': datetime.now().isoformat(),
}

metadata_path = MODEL_DIR / 'metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)
print(f"📋 Saved metadata to: {metadata_path}")

# Upload model to S3 for persistent storage
try:
    from common_functions import upload_model_to_s3, test_s3_connection
    
    if test_s3_connection():
        upload_model_to_s3(
            str(model_path),
            s3_key='models/anomaly-detection/anomaly-detector/model.pkl'
        )
        print(f"☁️  Uploaded to S3: models/anomaly-detection/anomaly-detector/model.pkl")
        
        # Also upload metadata
        upload_model_to_s3(
            str(metadata_path),
            s3_key='models/anomaly-detection/anomaly-detector/metadata.json'
        )
        print(f"☁️  Uploaded to S3: models/anomaly-detection/anomaly-detector/metadata.json")
    else:
        print("⚠️ S3 not available - model saved locally only")
except ImportError:
    print("⚠️ S3 functions not available - model saved locally only")
except Exception as e:
    print(f"⚠️ S3 upload failed (non-critical): {e}")

# Verify model saved
assert model_path.exists(), "Pipeline model not saved"
print("\n✅ Model pipeline saved successfully")
print(f"   Path: {model_path}")
print(f"   Size: {model_path.stat().st_size / 1024:.2f} KB")
print(f"   Metadata: {metadata_path}")

# Clean up old separate model/scaler files if they exist
old_model = MODELS_DIR / 'isolation_forest_model.pkl'
old_scaler = MODELS_DIR / 'isolation_forest_scaler.pkl'
for old_file in [old_model, old_scaler]:
    if old_file.exists():
        old_file.unlink()
        print(f"🗑️  Removed old file: {old_file.name}")
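
A consumer of the saved artifacts would typically reload `model.pkl` and use the `feature_names` list from `metadata.json` to put incoming columns in training order. The following is a self-contained sketch of that round trip: it builds a tiny stand-in pipeline in a temporary directory, and the two feature names are hypothetical placeholders rather than the notebook's real columns.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in artifacts so the sketch runs on its own (hypothetical feature names).
feature_names = ["metric_a_value", "metric_b_value"]
train = pd.DataFrame(np.random.default_rng(0).normal(size=(200, 2)), columns=feature_names)
demo_pipeline = Pipeline([
    ("scaler", RobustScaler()),
    ("isolation_forest", IsolationForest(n_estimators=50, random_state=0)),
]).fit(train)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "anomaly-detector"
    model_dir.mkdir()
    joblib.dump(demo_pipeline, model_dir / "model.pkl")
    (model_dir / "metadata.json").write_text(json.dumps({"feature_names": feature_names}))

    # Reload both files, as an inference client might.
    loaded = joblib.load(model_dir / "model.pkl")
    expected_cols = json.loads((model_dir / "metadata.json").read_text())["feature_names"]

# Incoming sample with columns in a different order; reorder before scoring.
sample = pd.DataFrame([{"metric_b_value": 0.1, "metric_a_value": -0.2}])
sample = sample[expected_cols]

label = int(loaded.predict(sample)[0])              # 1 = normal, -1 = anomaly
score = float(loaded.decision_function(sample)[0])  # negative = anomalous
print(f"label={label}, score={score:.3f}")
```

Reordering by the stored names guards against silently scoring features in the wrong positions when the caller assembles rows differently from the training notebook.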

## Summary

### Enhanced Metrics (v2.0) Changes:

| Aspect | Original | Enhanced |
|--------|----------|----------|
| **Metrics** | 5 | 30+ |
| **Features** | ~45 | ~270+ |
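| **Categories** | 1 (general) | 6 metric categories (CPU, memory, disk I/O, network, stability, Kubernetes state) and 5 detection categories (resource, stability, performance, network, control plane) |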
| **Config** | Fixed | Category-specific tuning |

### Detection Categories:
- **RESOURCE**: CPU/memory exhaustion, throttling
- **STABILITY**: Crashes, restarts, OOM kills
- **PERFORMANCE**: Latency, throughput degradation
- **NETWORK**: Errors, drops, retransmits
- **CONTROL_PLANE**: API server, etcd, scheduler

### Next Steps:
1. Deploy model to KServe for real-time inference
2. Connect to Coordination Engine for automated remediation
3. Monitor model performance in production
4. Retrain periodically with real anomaly data
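
For the first step, a request to a KServe predictor using the V1 inference protocol might be assembled as sketched below. The host name and the two-value row are placeholders (assumptions for illustration); a real request would need one value per entry in `feature_names` from `metadata.json`, in the same order. The sketch only builds the URL and body and does not send anything.

```python
import json

# Hypothetical values: the host is a placeholder, not a real service address.
MODEL_NAME = "anomaly-detector"
PREDICTOR_HOST = "http://anomaly-detector-predictor.example.svc.cluster.local"

def build_v1_predict_request(rows):
    """Return (url, body) for a KServe V1 predict call; no request is sent."""
    url = f"{PREDICTOR_HOST}/v1/models/{MODEL_NAME}:predict"
    body = json.dumps({"instances": [[float(v) for v in row] for row in rows]})
    return url, body

url, body = build_v1_predict_request([[0.42, 0.77]])
print(url)
print(body)
```

The response for an sklearn model is expected to carry a `predictions` list with one entry per instance, which for this pipeline would be 1 (normal) or -1 (anomaly).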