# Hybrid Healing Workflows (Enhanced)

## Overview
This notebook implements hybrid healing workflows that combine rule-based and AI-driven remediation. A router chooses between deterministic rules and ML models for each decision, trading the reliability of rules against the adaptability of models.

## Enhancements in This Version
- **Rule Engine**: Rule-based remediation with configurable rules, cooldowns, and hourly rate limits
- **Intelligent Routing**: Routes based on anomaly type, confidence, and historical performance
- **Adaptive Learning**: Adjusts routing weights based on outcome feedback
- **Integrated Feedback**: Connects to outcome tracking for continuous improvement

## Prerequisites
- Completed: `ai-driven-decision-making-enhanced.ipynb`
- Prometheus accessible (or simulated)
- Feedback data available

## Learning Objectives
- Combine rule-based and AI-driven approaches effectively
- Route decisions based on context and historical performance
- Implement proper fallback chains with escalation
- Track and compare approach effectiveness

## Setup Section

In [None]:
import sys
import os
import json
import logging
import pickle
import hashlib
from pathlib import Path
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple, Any, Callable
from dataclasses import dataclass, field, asdict
from enum import Enum
from collections import defaultdict
import pandas as pd
import numpy as np

# Setup path for utils module
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        # Check globals(): dir() inside a function lists only local names, so it never sees __file__
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Using fallback setup_environment ({e})")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
MODELS_DIR = Path('/opt/app-root/src/models')
MODELS_DIR.mkdir(parents=True, exist_ok=True)
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
FEEDBACK_DIR = DATA_DIR / 'feedback'
FEEDBACK_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
NAMESPACE = 'self-healing-platform'
AI_CONFIDENCE_THRESHOLD = 0.75
HIGH_CONFIDENCE_THRESHOLD = 0.90
PROMETHEUS_URL = os.getenv('PROMETHEUS_URL', 'http://prometheus:9090')

logger.info("Hybrid healing workflows initialized")

## 1. Rule-Based Remediation Engine

Define deterministic rules for known anomaly patterns.

In [None]:
class Severity(Enum):
    """Anomaly severity levels."""
    LOW = 'low'
    MEDIUM = 'medium'
    HIGH = 'high'
    CRITICAL = 'critical'


@dataclass
class Rule:
    """A remediation rule definition."""
    rule_id: str
    name: str
    description: str
    condition: Callable[[Dict], bool]
    action: str
    severity_threshold: Severity
    cooldown_seconds: int = 300
    max_executions_per_hour: int = 5
    requires_approval: bool = False
    enabled: bool = True


@dataclass
class RuleMatch:
    """Result of a rule evaluation."""
    rule_id: str
    rule_name: str
    matched: bool
    action: str
    confidence: float  # Rule-based confidence (1.0 for exact match)
    severity: Severity
    reasoning: str
    metadata: Dict = field(default_factory=dict)


class RuleEngine:
    """
    Rule-based remediation engine.
    
    Evaluates deterministic rules against metrics and anomaly data
    to produce remediation decisions.
    """
    
    def __init__(self):
        self.rules: List[Rule] = []
        self._execution_history: Dict[str, List[datetime]] = defaultdict(list)
        self._last_execution: Dict[str, datetime] = {}
        self._register_default_rules()
    
    def _register_default_rules(self):
        """Register default remediation rules."""
        
        # Rule 1: High CPU
        self.register_rule(Rule(
            rule_id='rule_high_cpu',
            name='High CPU Usage',
            description='Remediate when CPU usage exceeds 85%',
            condition=lambda m: m.get('cpu_usage', 0) > 85,
            action='scale_horizontal',
            severity_threshold=Severity.HIGH,
            cooldown_seconds=300,
            max_executions_per_hour=3
        ))
        
        # Rule 2: Critical CPU
        self.register_rule(Rule(
            rule_id='rule_critical_cpu',
            name='Critical CPU Usage',
            description='Emergency action when CPU exceeds 95%',
            condition=lambda m: m.get('cpu_usage', 0) > 95,
            action='emergency_scale',
            severity_threshold=Severity.CRITICAL,
            cooldown_seconds=60,
            max_executions_per_hour=10
        ))
        
        # Rule 3: High Memory
        self.register_rule(Rule(
            rule_id='rule_high_memory',
            name='High Memory Usage',
            description='Remediate when memory usage exceeds 90%',
            condition=lambda m: m.get('memory_percent', 0) > 90,
            action='restart_pod',
            severity_threshold=Severity.HIGH,
            cooldown_seconds=600,
            max_executions_per_hour=2
        ))
        
        # Rule 4: OOM Risk
        self.register_rule(Rule(
            rule_id='rule_oom_risk',
            name='OOM Risk Detection',
            description='Prevent OOM by acting when memory exceeds 95%',
            condition=lambda m: m.get('memory_percent', 0) > 95,
            action='increase_memory_limit',
            severity_threshold=Severity.CRITICAL,
            cooldown_seconds=120,
            max_executions_per_hour=5
        ))
        
        # Rule 5: High Error Rate
        self.register_rule(Rule(
            rule_id='rule_high_error_rate',
            name='High Error Rate',
            description='Remediate when error ratio exceeds 10%',
            condition=lambda m: m.get('error_ratio', 0) > 0.10,
            action='restart_pod',
            severity_threshold=Severity.HIGH,
            cooldown_seconds=300,
            max_executions_per_hour=4
        ))
        
        # Rule 6: Critical Error Rate
        self.register_rule(Rule(
            rule_id='rule_critical_error_rate',
            name='Critical Error Rate',
            description='Rollback when error ratio exceeds 25%',
            condition=lambda m: m.get('error_ratio', 0) > 0.25,
            action='rollback_deployment',
            severity_threshold=Severity.CRITICAL,
            cooldown_seconds=600,
            max_executions_per_hour=2,
            requires_approval=True
        ))
        
        # Rule 7: High Latency
        self.register_rule(Rule(
            rule_id='rule_high_latency',
            name='High Latency',
            description='Scale when P99 latency exceeds 1000ms',
            condition=lambda m: m.get('latency_p99', 0) > 1000,
            action='scale_horizontal',
            severity_threshold=Severity.MEDIUM,
            cooldown_seconds=300,
            max_executions_per_hour=4
        ))
        
        # Rule 8: Crash Loop Detection
        self.register_rule(Rule(
            rule_id='rule_crash_loop',
            name='Crash Loop Detection',
            description='Handle pods in crash loop (>3 restarts)',
            condition=lambda m: m.get('restart_count', 0) > 3,
            action='diagnose_and_restart',
            severity_threshold=Severity.HIGH,
            cooldown_seconds=900,
            max_executions_per_hour=2
        ))
        
        # Rule 9: Network Saturation
        self.register_rule(Rule(
            rule_id='rule_network_saturation',
            name='Network Saturation',
            description='Scale when network I/O is saturated',
            condition=lambda m: (m.get('network_rx', 0) + m.get('network_tx', 0)) > 1e9,  # 1 GB/s
            action='scale_horizontal',
            severity_threshold=Severity.MEDIUM,
            cooldown_seconds=600,
            max_executions_per_hour=3
        ))
        
        # Rule 10: Disk Pressure
        self.register_rule(Rule(
            rule_id='rule_disk_pressure',
            name='Disk Pressure',
            description='Clean up when disk usage exceeds 85%',
            condition=lambda m: m.get('disk_usage', 0) > 85,
            action='cleanup_disk',
            severity_threshold=Severity.MEDIUM,
            cooldown_seconds=1800,
            max_executions_per_hour=2
        ))
        
        logger.info(f"Registered {len(self.rules)} default rules")
    
    def register_rule(self, rule: Rule):
        """Register a new rule."""
        self.rules.append(rule)
    
    def _check_cooldown(self, rule_id: str, cooldown_seconds: int) -> bool:
        """Check if rule is still in cooldown period."""
        if rule_id not in self._last_execution:
            return False
        elapsed = (datetime.now() - self._last_execution[rule_id]).total_seconds()
        return elapsed < cooldown_seconds
    
    def _check_rate_limit(self, rule_id: str, max_per_hour: int) -> bool:
        """Check if rule has exceeded rate limit."""
        cutoff = datetime.now() - timedelta(hours=1)
        recent = [t for t in self._execution_history[rule_id] if t > cutoff]
        self._execution_history[rule_id] = recent  # Cleanup old entries
        return len(recent) >= max_per_hour
    
    def _record_execution(self, rule_id: str):
        """Record rule execution for cooldown and rate limiting."""
        now = datetime.now()
        self._last_execution[rule_id] = now
        self._execution_history[rule_id].append(now)
    
    def evaluate(self, metrics: Dict[str, float]) -> List[RuleMatch]:
        """
        Evaluate all rules against current metrics.
        
        Args:
            metrics: Current metric values
        
        Returns:
            List of matched rules with actions
        """
        matches = []
        
        for rule in self.rules:
            if not rule.enabled:
                continue
            
            try:
                matched = rule.condition(metrics)
            except Exception as e:
                logger.warning(f"Rule {rule.rule_id} evaluation error: {e}")
                continue
            
            if matched:
                # Check cooldown and rate limits
                in_cooldown = self._check_cooldown(rule.rule_id, rule.cooldown_seconds)
                rate_limited = self._check_rate_limit(rule.rule_id, rule.max_executions_per_hour)
                
                if in_cooldown:
                    reasoning = "Rule matched but in cooldown period"
                    confidence = 0.0
                elif rate_limited:
                    reasoning = "Rule matched but rate limited"
                    confidence = 0.0
                else:
                    reasoning = f"Rule condition satisfied: {rule.description}"
                    confidence = 1.0  # Rules have binary confidence
                
                matches.append(RuleMatch(
                    rule_id=rule.rule_id,
                    rule_name=rule.name,
                    matched=True,
                    action=rule.action,
                    confidence=confidence,
                    severity=rule.severity_threshold,
                    reasoning=reasoning,
                    metadata={
                        'requires_approval': rule.requires_approval,
                        'in_cooldown': in_cooldown,
                        'rate_limited': rate_limited
                    }
                ))
        
        # Sort by severity (critical first)
        severity_order = {Severity.CRITICAL: 0, Severity.HIGH: 1, Severity.MEDIUM: 2, Severity.LOW: 3}
        matches.sort(key=lambda m: severity_order.get(m.severity, 4))
        
        return matches
    
    def get_best_action(self, metrics: Dict[str, float]) -> Optional[RuleMatch]:
        """
        Get the highest priority actionable rule match.
        
        Returns:
            Best matching rule or None if no rules match
        """
        matches = self.evaluate(metrics)
        actionable = [m for m in matches if m.confidence > 0]
        return actionable[0] if actionable else None


# Initialize rule engine
rule_engine = RuleEngine()

# Test with sample metrics
test_metrics = {
    'cpu_usage': 88,
    'memory_percent': 75,
    'error_ratio': 0.05,
    'latency_p99': 500,
    'restart_count': 1,
    'disk_usage': 60
}

matches = rule_engine.evaluate(test_metrics)
print(f"Matched {len(matches)} rules:")
for match in matches:
    print(f"  - {match.rule_name}: {match.action} (severity: {match.severity.value})")
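
The cooldown and rate-limit checks in `RuleEngine` can be exercised in isolation. The sketch below is a standalone illustration, not part of the notebook's engine: `ExecutionGuard` and its injectable `now` argument are hypothetical names, introduced so the behavior can be checked without waiting for real time to pass. The timestamps are arbitrary.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List


class ExecutionGuard:
    """Hypothetical helper mirroring RuleEngine's cooldown and rate-limit bookkeeping."""

    def __init__(self) -> None:
        self._history: Dict[str, List[datetime]] = defaultdict(list)
        self._last: Dict[str, datetime] = {}

    def record(self, rule_id: str, now: datetime) -> None:
        """Record one execution of a rule at time `now`."""
        self._last[rule_id] = now
        self._history[rule_id].append(now)

    def in_cooldown(self, rule_id: str, cooldown_seconds: int, now: datetime) -> bool:
        """True while fewer than `cooldown_seconds` have passed since the last execution."""
        last = self._last.get(rule_id)
        if last is None:
            return False
        return (now - last).total_seconds() < cooldown_seconds

    def rate_limited(self, rule_id: str, max_per_hour: int, now: datetime) -> bool:
        """True when the rule already ran `max_per_hour` times in the past hour."""
        cutoff = now - timedelta(hours=1)
        recent = [t for t in self._history[rule_id] if t > cutoff]
        self._history[rule_id] = recent  # drop entries older than one hour
        return len(recent) >= max_per_hour


guard = ExecutionGuard()
t0 = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary fixed start time (assumption)

guard.record('rule_high_cpu', t0)
cooldown_at_1min = guard.in_cooldown('rule_high_cpu', 300, t0 + timedelta(minutes=1))
cooldown_at_6min = guard.in_cooldown('rule_high_cpu', 300, t0 + timedelta(minutes=6))

guard.record('rule_high_cpu', t0 + timedelta(minutes=10))
guard.record('rule_high_cpu', t0 + timedelta(minutes=20))
limited_at_25min = guard.rate_limited('rule_high_cpu', 3, t0 + timedelta(minutes=25))
limited_at_65min = guard.rate_limited('rule_high_cpu', 3, t0 + timedelta(minutes=65))

print(cooldown_at_1min, cooldown_at_6min, limited_at_25min, limited_at_65min)
```

Both checks depend on executions having been recorded. A rule that matches but is never recorded as executed will not enter cooldown or count toward its hourly limit.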

## 2. AI Decision Integration

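This section recreates a minimal adapter in place of the components from the AI decision notebook. Its `predict` method simulates the ensemble with thresholded metrics and random noise; in production it would call the trained models.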

In [None]:
class AIDecisionAdapter:
    """
    Adapter for AI decision making.
    
    Provides a simplified interface to ensemble inference
    for use in hybrid workflows.
    """
    
    def __init__(self, models_dir: Path, confidence_threshold: float = 0.75):
        self.models_dir = models_dir
        self.confidence_threshold = confidence_threshold
        self.ensemble_config = self._load_config()
        self.feature_names = self.ensemble_config.get('feature_names', [
            'cpu_usage', 'memory_percent', 'error_ratio', 'latency_p99',
            'restart_count', 'request_rate', 'network_rx', 'network_tx'
        ])
    
    def _load_config(self) -> Dict:
        """Load ensemble configuration."""
        config_file = self.models_dir / 'ensemble_config.pkl'
        if config_file.exists():
            try:
                with open(config_file, 'rb') as f:
                    return pickle.load(f)
            except Exception as e:
                logger.warning(f"Could not load ensemble config: {e}")
        return {
            'methods': ['isolation_forest', 'random_forest'],
            'weights': [0.5, 0.5],
            'threshold': 0.5
        }
    
    def predict(self, metrics: Dict[str, float]) -> Dict[str, Any]:
        """
        Get AI prediction for metrics.
        
        Args:
            metrics: Current metric values
        
        Returns:
            Prediction with confidence and recommended action
        """
        # Simulate ensemble prediction
        # In production, this would call actual trained models
        
        # Calculate anomaly score based on metric deviations
        anomaly_signals = []
        
        if metrics.get('cpu_usage', 0) > 70:
            anomaly_signals.append(min(metrics['cpu_usage'] / 100, 1.0))
        if metrics.get('memory_percent', 0) > 80:
            anomaly_signals.append(min(metrics['memory_percent'] / 100, 1.0))
        if metrics.get('error_ratio', 0) > 0.05:
            anomaly_signals.append(min(metrics['error_ratio'] * 5, 1.0))
        if metrics.get('latency_p99', 0) > 500:
            anomaly_signals.append(min(metrics['latency_p99'] / 2000, 1.0))
        if metrics.get('restart_count', 0) > 2:
            anomaly_signals.append(min(metrics['restart_count'] / 5, 1.0))
        
        if anomaly_signals:
            anomaly_score = np.mean(anomaly_signals)
            # Add some randomness to simulate model uncertainty
            noise = np.random.uniform(-0.1, 0.1)
            anomaly_score = np.clip(anomaly_score + noise, 0, 1)
        else:
            anomaly_score = np.random.uniform(0.1, 0.3)
        
        # Determine confidence (higher when score is far from 0.5)
        confidence = 0.5 + abs(anomaly_score - 0.5)
        confidence = np.clip(confidence + np.random.uniform(-0.05, 0.15), 0.5, 0.99)
        
        is_anomaly = bool(anomaly_score > 0.5)  # plain bool so JSON output is true/false, not the string "True"
        
        # Classify anomaly type
        anomaly_type = self._classify_anomaly(metrics) if is_anomaly else 'normal'
        
        # Recommend action
        action = self._recommend_action(anomaly_type, confidence)
        
        return {
            'is_anomaly': is_anomaly,
            'anomaly_score': anomaly_score,
            'confidence': confidence,
            'anomaly_type': anomaly_type,
            'recommended_action': action,
            'model_agreement': np.random.uniform(0.7, 1.0),
            'timestamp': datetime.now().isoformat()
        }
    
    def _classify_anomaly(self, metrics: Dict[str, float]) -> str:
        """Classify the type of anomaly."""
        scores = {
            'cpu_anomaly': metrics.get('cpu_usage', 0) / 100,
            'memory_anomaly': metrics.get('memory_percent', 0) / 100,
            'error_anomaly': min(metrics.get('error_ratio', 0) * 10, 1.0),
            'latency_anomaly': min(metrics.get('latency_p99', 0) / 2000, 1.0),
            'stability_anomaly': min(metrics.get('restart_count', 0) / 5, 1.0)
        }
        return max(scores, key=scores.get)
    
    def _recommend_action(self, anomaly_type: str, confidence: float) -> str:
        """Recommend action based on anomaly type."""
        actions = {
            'cpu_anomaly': 'scale_horizontal',
            'memory_anomaly': 'restart_pod',
            'error_anomaly': 'restart_pod',
            'latency_anomaly': 'scale_horizontal',
            'stability_anomaly': 'diagnose_and_restart',
            'normal': 'monitor_only'
        }
        return actions.get(anomaly_type, 'monitor_only')


# Initialize AI adapter
ai_adapter = AIDecisionAdapter(MODELS_DIR, AI_CONFIDENCE_THRESHOLD)

# Test AI prediction
ai_prediction = ai_adapter.predict(test_metrics)
print("AI Prediction:")
print(json.dumps(ai_prediction, indent=2, default=str))
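
Because `predict` adds random noise, its output differs between runs. The sketch below reproduces only the deterministic part of the simulated scoring (the thresholded signals and the distance-from-0.5 confidence) for the sample metrics, so the arithmetic can be followed by hand. `noise_free_score` is a hypothetical helper, not a method of `AIDecisionAdapter`, and the fixed 0.2 fallback score is an assumption standing in for the random draw.

```python
from typing import Dict, List, Tuple


def noise_free_score(metrics: Dict[str, float]) -> Tuple[float, float]:
    """Return (anomaly_score, confidence) using the simulated adapter's rules without noise."""
    signals: List[float] = []
    if metrics.get('cpu_usage', 0) > 70:
        signals.append(min(metrics['cpu_usage'] / 100, 1.0))
    if metrics.get('memory_percent', 0) > 80:
        signals.append(min(metrics['memory_percent'] / 100, 1.0))
    if metrics.get('error_ratio', 0) > 0.05:
        signals.append(min(metrics['error_ratio'] * 5, 1.0))
    if metrics.get('latency_p99', 0) > 500:
        signals.append(min(metrics['latency_p99'] / 2000, 1.0))
    if metrics.get('restart_count', 0) > 2:
        signals.append(min(metrics['restart_count'] / 5, 1.0))

    # With no signal the adapter draws a score from [0.1, 0.3];
    # the midpoint 0.2 is used here as a stand-in (assumption).
    score = sum(signals) / len(signals) if signals else 0.2
    # Confidence grows with distance from the 0.5 decision boundary.
    confidence = 0.5 + abs(score - 0.5)
    return score, confidence


sample = {'cpu_usage': 88, 'memory_percent': 75, 'error_ratio': 0.05,
          'latency_p99': 500, 'restart_count': 1, 'disk_usage': 60}
score, confidence = noise_free_score(sample)
print(f"score={score:.2f}, confidence={confidence:.2f}")
```

For the sample metrics only the CPU signal fires, because the comparisons are strict and the error ratio (0.05) and latency (500) sit exactly on their thresholds. The noise-free score is therefore 0.88, and the adapter's ±0.1 noise keeps it above the 0.5 anomaly cutoff.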

## 3. Intelligent Decision Router

Routes decisions between rule-based and AI approaches based on context.

In [None]:
class RoutingStrategy(Enum):
    """Decision routing strategies."""
    RULE_BASED = 'rule_based'
    AI_DRIVEN = 'ai_driven'
    HYBRID = 'hybrid'
    CONSERVATIVE = 'conservative'  # Requires agreement


@dataclass
class RoutingDecision:
    """Routing decision output."""
    strategy: RoutingStrategy
    reasoning: List[str]
    rule_weight: float
    ai_weight: float
    confidence: float
    metadata: Dict = field(default_factory=dict)


class DecisionRouter:
    """
    Intelligent decision router.
    
    Determines whether to use rule-based, AI-driven, or hybrid
    approach based on context and historical performance.
    """
    
    # Default weights
    DEFAULT_RULE_WEIGHT = 0.6
    DEFAULT_AI_WEIGHT = 0.4
    
    # Anomaly types where rules are preferred
    RULE_PREFERRED_TYPES = {
        'high_cpu', 'high_memory', 'crash_loop', 'oom_risk',
        'critical_cpu', 'critical_error_rate'
    }
    
    # Anomaly types where AI is preferred
    AI_PREFERRED_TYPES = {
        'unknown', 'complex_pattern', 'multi_factor', 'gradual_degradation'
    }
    
    def __init__(self, feedback_dir: Path):
        self.feedback_dir = feedback_dir
        self._performance_history = self._load_performance_history()
        self._adaptive_weights = {
            'rule': self.DEFAULT_RULE_WEIGHT,
            'ai': self.DEFAULT_AI_WEIGHT
        }
    
    def _load_performance_history(self) -> Dict:
        """Load historical performance data."""
        history_file = self.feedback_dir / 'routing_performance.json'
        if history_file.exists():
            try:
                with open(history_file, 'r') as f:
                    return json.load(f)
            except Exception as e:
                logger.warning(f"Could not load performance history: {e}")
        return {
            'rule_based': {'successes': 0, 'failures': 0},
            'ai_driven': {'successes': 0, 'failures': 0},
            'hybrid': {'successes': 0, 'failures': 0}
        }
    
    def _save_performance_history(self):
        """Persist performance history."""
        history_file = self.feedback_dir / 'routing_performance.json'
        with open(history_file, 'w') as f:
            json.dump(self._performance_history, f, indent=2)
    
    def _get_success_rate(self, strategy: str) -> float:
        """Calculate success rate for a strategy."""
        data = self._performance_history.get(strategy, {})
        total = data.get('successes', 0) + data.get('failures', 0)
        if total == 0:
            return 0.5  # No data, assume neutral
        return data.get('successes', 0) / total
    
    def record_outcome(self, strategy: str, success: bool):
        """Record outcome for adaptive learning."""
        if strategy not in self._performance_history:
            self._performance_history[strategy] = {'successes': 0, 'failures': 0}
        
        if success:
            self._performance_history[strategy]['successes'] += 1
        else:
            self._performance_history[strategy]['failures'] += 1
        
        self._save_performance_history()
        self._update_adaptive_weights()
    
    def _update_adaptive_weights(self):
        """Update weights based on historical performance."""
        rule_rate = self._get_success_rate('rule_based')
        ai_rate = self._get_success_rate('ai_driven')
        
        # Adjust weights based on relative performance
        total = rule_rate + ai_rate
        if total > 0:
            self._adaptive_weights['rule'] = rule_rate / total
            self._adaptive_weights['ai'] = ai_rate / total
    
    def route(self, 
              rule_match: Optional[RuleMatch],
              ai_prediction: Dict,
              severity: Optional[Severity] = None) -> RoutingDecision:
        """
        Route decision to appropriate strategy.
        
        Args:
            rule_match: Best matching rule (or None)
            ai_prediction: AI model prediction
            severity: Override severity level
        
        Returns:
            RoutingDecision with strategy and weights
        """
        reasoning = []
        
        # Extract key information
        has_rule_match = rule_match is not None and rule_match.confidence > 0
        ai_confidence = ai_prediction.get('confidence', 0)
        ai_is_anomaly = ai_prediction.get('is_anomaly', False)
        anomaly_type = ai_prediction.get('anomaly_type', 'unknown')
        
        # Determine severity
        if severity is None:
            if has_rule_match:
                severity = rule_match.severity
            elif ai_confidence > HIGH_CONFIDENCE_THRESHOLD:
                severity = Severity.HIGH
            else:
                severity = Severity.MEDIUM
        
        # Decision logic
        
        # Case 1: Critical severity - always use rules if available
        if severity == Severity.CRITICAL:
            if has_rule_match:
                reasoning.append("Critical severity: using rule-based for reliability")
                return RoutingDecision(
                    strategy=RoutingStrategy.RULE_BASED,
                    reasoning=reasoning,
                    rule_weight=1.0,
                    ai_weight=0.0,
                    confidence=1.0,
                    metadata={'severity': severity.value}
                )
            else:
                reasoning.append("Critical severity but no rule match: conservative approach")
                return RoutingDecision(
                    strategy=RoutingStrategy.CONSERVATIVE,
                    reasoning=reasoning,
                    rule_weight=0.5,
                    ai_weight=0.5,
                    confidence=ai_confidence,
                    metadata={'requires_approval': True}
                )
        
        # Case 2: Rule match with known anomaly type
        if has_rule_match and rule_match.rule_id.replace('rule_', '') in self.RULE_PREFERRED_TYPES:
            reasoning.append(f"Known anomaly type '{rule_match.rule_name}': preferring rules")
            return RoutingDecision(
                strategy=RoutingStrategy.RULE_BASED,
                reasoning=reasoning,
                rule_weight=0.8,
                ai_weight=0.2,
                confidence=rule_match.confidence,
                metadata={'rule_id': rule_match.rule_id}
            )
        
        # Case 3: High AI confidence with no rule match
        if not has_rule_match and ai_confidence >= HIGH_CONFIDENCE_THRESHOLD:
            reasoning.append(f"No rule match, high AI confidence ({ai_confidence:.1%}): using AI")
            return RoutingDecision(
                strategy=RoutingStrategy.AI_DRIVEN,
                reasoning=reasoning,
                rule_weight=0.2,
                ai_weight=0.8,
                confidence=ai_confidence,
                metadata={'ai_anomaly_type': anomaly_type}
            )
        
        # Case 4: Both match - use hybrid with adaptive weights
        if has_rule_match and ai_is_anomaly:
            reasoning.append("Both rule and AI indicate anomaly: using hybrid approach")
            reasoning.append(f"Adaptive weights - Rule: {self._adaptive_weights['rule']:.1%}, AI: {self._adaptive_weights['ai']:.1%}")
            return RoutingDecision(
                strategy=RoutingStrategy.HYBRID,
                reasoning=reasoning,
                rule_weight=self._adaptive_weights['rule'],
                ai_weight=self._adaptive_weights['ai'],
                confidence=(rule_match.confidence + ai_confidence) / 2,
                metadata={'agreement': True}
            )
        
        # Case 5: Disagreement - be conservative
        if has_rule_match != ai_is_anomaly:
            reasoning.append("Rule and AI disagree: conservative approach with monitoring")
            return RoutingDecision(
                strategy=RoutingStrategy.CONSERVATIVE,
                reasoning=reasoning,
                rule_weight=0.5,
                ai_weight=0.5,
                confidence=max(rule_match.confidence if has_rule_match else 0, ai_confidence) * 0.7,
                metadata={'disagreement': True}
            )
        
        # Case 6: Neither triggered - monitor only
        reasoning.append("No anomaly detected by either system")
        return RoutingDecision(
            strategy=RoutingStrategy.RULE_BASED,  # Default to rules for monitoring
            reasoning=reasoning,
            rule_weight=0.5,
            ai_weight=0.5,
            confidence=1.0 - max(ai_confidence, 0.5),
            metadata={'action': 'monitor_only'}
        )


# Initialize router
router = DecisionRouter(FEEDBACK_DIR)

# Test routing
rule_match = rule_engine.get_best_action(test_metrics)
routing = router.route(rule_match, ai_prediction)

print(f"\nRouting Decision:")
print(f"  Strategy: {routing.strategy.value}")
print(f"  Rule Weight: {routing.rule_weight:.1%}")
print(f"  AI Weight: {routing.ai_weight:.1%}")
print(f"  Confidence: {routing.confidence:.1%}")
print(f"  Reasoning:")
for reason in routing.reasoning:
    print(f"    - {reason}")
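
The adaptive weights used in the hybrid case are the two success rates normalized to sum to one. A minimal standalone sketch of that arithmetic follows. The outcome counts are made up for illustration, and `success_rate` and `adaptive_weights` are hypothetical free functions that mirror `_get_success_rate` and `_update_adaptive_weights`. Unlike the router, which leaves its weights unchanged when both rates are zero, this sketch returns the defaults in that case.

```python
from typing import Dict, Tuple


def success_rate(counts: Dict[str, int]) -> float:
    """Success fraction, or a neutral 0.5 when no outcomes are recorded."""
    total = counts.get('successes', 0) + counts.get('failures', 0)
    return 0.5 if total == 0 else counts.get('successes', 0) / total


def adaptive_weights(history: Dict[str, Dict[str, int]],
                     default: Tuple[float, float] = (0.6, 0.4)) -> Tuple[float, float]:
    """Normalize rule and AI success rates into (rule_weight, ai_weight)."""
    rule_rate = success_rate(history.get('rule_based', {}))
    ai_rate = success_rate(history.get('ai_driven', {}))
    total = rule_rate + ai_rate
    if total == 0:
        return default  # simplification: the router keeps its current weights here
    return rule_rate / total, ai_rate / total


# Illustrative counts (assumption): rules 8/10 successful, AI 3/5 successful.
history = {
    'rule_based': {'successes': 8, 'failures': 2},
    'ai_driven': {'successes': 3, 'failures': 2},
}
rule_w, ai_w = adaptive_weights(history)
print(f"rule={rule_w:.3f}, ai={ai_w:.3f}")

# A single recorded rule success against an empty AI history.
first_w = adaptive_weights({'rule_based': {'successes': 1, 'failures': 0}})
print(f"after first outcome: rule={first_w[0]:.3f}, ai={first_w[1]:.3f}")
```

Note that the default 0.6/0.4 split only holds until the first outcome is recorded. After that, a strategy with no history contributes the neutral rate of 0.5 to the normalization.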

## 4. Hybrid Decision Maker

Combines rule-based and AI decisions by scaling each candidate action's priority by its routing weight and taking the higher-scoring action.
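
As a rough illustration of how the hybrid strategy picks between two candidate actions: each action's priority is multiplied by the routing weight of its source, and the larger product wins, with ties going to the rule. The sketch below is standalone and uses a subset of the priority table; `select_action` is a hypothetical free-function version of the `_select_action` method defined in the following cell.

```python
from typing import Optional

# Subset of the notebook's ACTION_PRIORITY table.
ACTION_PRIORITY = {'restart_pod': 7, 'scale_horizontal': 5, 'monitor_only': 1}


def select_action(rule_action: Optional[str], ai_action: str,
                  rule_weight: float, ai_weight: float) -> str:
    """Pick the action whose priority times weight is larger; ties favor the rule."""
    if rule_action is None:
        return ai_action
    rule_score = ACTION_PRIORITY.get(rule_action, 0) * rule_weight
    ai_score = ACTION_PRIORITY.get(ai_action, 0) * ai_weight
    return rule_action if rule_score >= ai_score else ai_action


# Rule: 5 * 0.6 = 3.0, AI: 7 * 0.4 = 2.8, so the rule's action wins.
rule_favored = select_action('scale_horizontal', 'restart_pod', 0.6, 0.4)
# Rule: 5 * 0.5 = 2.5, AI: 7 * 0.5 = 3.5, so the AI's action wins.
ai_favored = select_action('scale_horizontal', 'restart_pod', 0.5, 0.5)
# No rule action: the AI's recommendation is used directly.
no_rule = select_action(None, 'monitor_only', 0.6, 0.4)

print(rule_favored, ai_favored, no_rule)
```

Because the score mixes action priority with source weight, a higher-priority (more disruptive) action can win even when its source carries the same or a lower weight.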

In [None]:
@dataclass
class HybridDecision:
    """Final hybrid decision."""
    decision_id: str
    timestamp: datetime
    strategy_used: RoutingStrategy
    final_action: str
    confidence: float
    should_execute: bool
    requires_approval: bool
    rule_contribution: Optional[Dict]
    ai_contribution: Optional[Dict]
    reasoning: List[str]
    metadata: Dict = field(default_factory=dict)
    
    def to_dict(self) -> Dict:
        d = asdict(self)
        d['timestamp'] = self.timestamp.isoformat()
        d['strategy_used'] = self.strategy_used.value
        return d


class HybridDecisionMaker:
    """
    Makes final decisions by combining rule-based and AI approaches.
    """
    
    ACTION_PRIORITY = {
        'emergency_scale': 10,
        'rollback_deployment': 9,
        'increase_memory_limit': 8,
        'restart_pod': 7,
        'diagnose_and_restart': 6,
        'scale_horizontal': 5,
        'cleanup_disk': 4,
        'notify_operator': 3,
        'monitor_only': 1
    }
    
    def __init__(self, 
                 rule_engine: RuleEngine,
                 ai_adapter: AIDecisionAdapter,
                 router: DecisionRouter,
                 execution_threshold: float = 0.6):
        self.rule_engine = rule_engine
        self.ai_adapter = ai_adapter
        self.router = router
        self.execution_threshold = execution_threshold
        self._decision_history = []
    
    def _generate_decision_id(self) -> str:
        """Generate unique decision ID."""
        content = f"{datetime.now().timestamp()}_{np.random.random()}"
        return hashlib.sha256(content.encode()).hexdigest()[:12]
    
    def _select_action(self,
                       rule_action: Optional[str],
                       ai_action: str,
                       rule_weight: float,
                       ai_weight: float) -> str:
        """Select final action based on weighted priority."""
        if rule_action is None:
            return ai_action
        
        rule_priority = self.ACTION_PRIORITY.get(rule_action, 0) * rule_weight
        ai_priority = self.ACTION_PRIORITY.get(ai_action, 0) * ai_weight
        
        if rule_priority >= ai_priority:
            return rule_action
        return ai_action
    
    def decide(self, metrics: Dict[str, float]) -> HybridDecision:
        """
        Make a hybrid decision based on metrics.
        
        Args:
            metrics: Current metric values
        
        Returns:
            HybridDecision with final action
        """
        reasoning = []
        
        # Get rule-based decision
        rule_match = self.rule_engine.get_best_action(metrics)
        if rule_match:
            reasoning.append(f"Rule '{rule_match.rule_name}' matched: {rule_match.action}")
        else:
            reasoning.append("No rule matched current metrics")
        
        # Get AI decision
        ai_prediction = self.ai_adapter.predict(metrics)
        if ai_prediction['is_anomaly']:
            reasoning.append(f"AI detected anomaly ({ai_prediction['anomaly_type']}): {ai_prediction['recommended_action']}")
        else:
            reasoning.append(f"AI: no anomaly detected (confidence: {ai_prediction['confidence']:.1%})")
        
        # Get routing decision
        routing = self.router.route(rule_match, ai_prediction)
        reasoning.extend(routing.reasoning)
        
        # Determine final action based on strategy
        if routing.strategy == RoutingStrategy.RULE_BASED:
            final_action = rule_match.action if rule_match else 'monitor_only'
            confidence = rule_match.confidence if rule_match else 0.5
        elif routing.strategy == RoutingStrategy.AI_DRIVEN:
            final_action = ai_prediction['recommended_action']
            confidence = ai_prediction['confidence']
        elif routing.strategy == RoutingStrategy.HYBRID:
            final_action = self._select_action(
                rule_match.action if rule_match else None,
                ai_prediction['recommended_action'],
                routing.rule_weight,
                routing.ai_weight
            )
            # Hybrid confidence is weighted average
            rule_conf = rule_match.confidence if rule_match else 0
            confidence = (rule_conf * routing.rule_weight + 
                         ai_prediction['confidence'] * routing.ai_weight)
        else:  # CONSERVATIVE
            final_action = 'monitor_only' if not rule_match else rule_match.action
            confidence = routing.confidence * 0.8  # Reduce confidence for conservative
        
        # Determine if we should execute
        should_execute = (
            final_action != 'monitor_only' and
            confidence >= self.execution_threshold and
            routing.strategy != RoutingStrategy.CONSERVATIVE
        )
        
        requires_approval = (
            routing.strategy == RoutingStrategy.CONSERVATIVE or
            (rule_match and rule_match.metadata.get('requires_approval', False)) or
            final_action in ['rollback_deployment', 'emergency_scale']
        )
        
        decision = HybridDecision(
            decision_id=self._generate_decision_id(),
            timestamp=datetime.now(),
            strategy_used=routing.strategy,
            final_action=final_action,
            confidence=confidence,
            should_execute=should_execute,
            requires_approval=requires_approval,
            rule_contribution={
                'matched': rule_match is not None,
                'action': rule_match.action if rule_match else None,
                'confidence': rule_match.confidence if rule_match else 0,
                'weight': routing.rule_weight
            },
            ai_contribution={
                'is_anomaly': ai_prediction['is_anomaly'],
                'action': ai_prediction['recommended_action'],
                'confidence': ai_prediction['confidence'],
                'weight': routing.ai_weight
            },
            reasoning=reasoning,
            metadata=routing.metadata
        )
        
        self._decision_history.append(decision.to_dict())
        
        logger.info(f"Hybrid decision: {final_action} (strategy: {routing.strategy.value}, "
                   f"confidence: {confidence:.1%}, execute: {should_execute})")
        
        return decision


# Initialize hybrid decision maker
hybrid_maker = HybridDecisionMaker(
    rule_engine=rule_engine,
    ai_adapter=ai_adapter,
    router=router,
    execution_threshold=0.6
)

# Test hybrid decision
decision = hybrid_maker.decide(test_metrics)
print("\nHybrid Decision:")
print(json.dumps(decision.to_dict(), indent=2, default=str))

## 5. Fallback Chain Executor

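Execute the chosen remediation and, if it fails, try each alternative in that action's fallback chain in order until one succeeds or the chain is exhausted.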
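The core idea of a fallback chain can be sketched independently of the notebook's classes. The following is a minimal sketch, assuming a hypothetical `run_chain` helper and a caller-supplied `try_action` callable; neither name is part of this notebook.

```python
from typing import Callable, List, Optional, Tuple


def run_chain(
    chain: List[str],
    try_action: Callable[[str], bool],
) -> Tuple[Optional[str], bool, bool]:
    """Try each action in order and stop at the first success.

    Returns (last_action_attempted, success, fallback_used).
    `run_chain` and `try_action` are hypothetical names for illustration.
    """
    last_action: Optional[str] = None
    for index, action in enumerate(chain):
        last_action = action
        if try_action(action):
            # A success at any position after the first means a fallback was needed
            return action, True, index > 0
    # Every action in the chain failed (or the chain was empty)
    return last_action, False, False


# Usage: pretend the primary action fails and the first fallback succeeds
failing = {"scale_horizontal"}
outcome = run_chain(
    ["scale_horizontal", "restart_pod", "notify_operator"],
    lambda action: action not in failing,
)
print(outcome)  # ('restart_pod', True, True)
```

Ending most chains with a low-risk action such as `notify_operator` means an exhausted chain still leaves a trace for a human to follow up on.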

In [None]:
@dataclass
class ExecutionResult:
    """Result of remediation execution."""
    decision_id: str
    executed: bool
    action_taken: str
    success: bool
    duration_seconds: float
    fallback_used: bool
    fallback_action: Optional[str]
    error_message: Optional[str]
    timestamp: datetime
    
    def to_dict(self) -> Dict:
        d = asdict(self)
        d['timestamp'] = self.timestamp.isoformat()
        return d


class FallbackExecutor:
    """
    Executes remediation with fallback chains.
    """
    
    # Fallback chains for each action
    FALLBACK_CHAINS = {
        'scale_horizontal': ['scale_horizontal', 'restart_pod', 'notify_operator'],
        'restart_pod': ['restart_pod', 'delete_and_recreate', 'notify_operator'],
        'rollback_deployment': ['rollback_deployment', 'restart_pod', 'notify_operator'],
        'emergency_scale': ['emergency_scale', 'scale_horizontal', 'restart_pod'],
        'increase_memory_limit': ['increase_memory_limit', 'restart_pod', 'scale_horizontal'],
        'diagnose_and_restart': ['diagnose_and_restart', 'restart_pod', 'notify_operator'],
        'cleanup_disk': ['cleanup_disk', 'restart_pod', 'notify_operator'],
        'notify_operator': ['notify_operator'],
        'monitor_only': ['monitor_only']
    }
    
    def __init__(self, namespace: str, simulate: bool = True):
        self.namespace = namespace
        self.simulate = simulate
        self._execution_log = []
    
    def _execute_action(self, action: str) -> Tuple[bool, Optional[str]]:
        """
        Execute a single action.
        
        Returns:
            Tuple of (success, error_message)
        """
        if self.simulate:
            # Simulate execution with assumed, illustrative per-action success rates
            success_rates = {
                'scale_horizontal': 0.95,
                'restart_pod': 0.90,
                'rollback_deployment': 0.85,
                'emergency_scale': 0.90,
                'increase_memory_limit': 0.80,
                'diagnose_and_restart': 0.85,
                'cleanup_disk': 0.90,
                'delete_and_recreate': 0.80,
                'notify_operator': 1.0,
                'monitor_only': 1.0
            }
            success = np.random.random() < success_rates.get(action, 0.8)
            error = None if success else f"Simulated failure for {action}"
            return success, error
        else:
            # Real execution would go here
            # Example: kubectl scale, kubectl rollout restart, etc.
            logger.warning(f"Real execution not implemented for {action}")
            return False, "Real execution not implemented"
    
    def execute(self, decision: HybridDecision) -> ExecutionResult:
        """
        Execute remediation with fallback chain.
        
        Args:
            decision: Hybrid decision to execute
        
        Returns:
            ExecutionResult with outcome
        """
        start_time = datetime.now()
        
        if not decision.should_execute:
            return ExecutionResult(
                decision_id=decision.decision_id,
                executed=False,
                action_taken='none',
                success=True,
                duration_seconds=0,
                fallback_used=False,
                fallback_action=None,
                error_message='Execution not required',
                timestamp=datetime.now()
            )
        
        if decision.requires_approval:
            logger.info(f"Action {decision.final_action} requires approval - skipping")
            return ExecutionResult(
                decision_id=decision.decision_id,
                executed=False,
                action_taken='pending_approval',
                success=False,
                duration_seconds=0,
                fallback_used=False,
                fallback_action=None,
                error_message='Awaiting operator approval',
                timestamp=datetime.now()
            )
        
        # Get fallback chain
        chain = self.FALLBACK_CHAINS.get(decision.final_action, [decision.final_action, 'notify_operator'])
        
        action_taken = None
        fallback_used = False
        success = False
        error_message = None
        
        for i, action in enumerate(chain):
            logger.info(f"Attempting action: {action} (attempt {i+1}/{len(chain)})")
            
            success, error = self._execute_action(action)
            action_taken = action
            
            if success:
                if i > 0:
                    fallback_used = True
                break
            else:
                error_message = error
                logger.warning(f"Action {action} failed: {error}")
        
        duration = (datetime.now() - start_time).total_seconds()
        if self.simulate:
            # Add simulated work time in simulation mode only,
            # so real executions report their true duration
            duration += np.random.uniform(5, 30)
        
        result = ExecutionResult(
            decision_id=decision.decision_id,
            executed=True,
            action_taken=action_taken,
            success=success,
            duration_seconds=duration,
            fallback_used=fallback_used,
            fallback_action=action_taken if fallback_used else None,
            error_message=error_message if not success else None,
            timestamp=datetime.now()
        )
        
        self._execution_log.append(result.to_dict())
        
        return result


# Initialize executor
executor = FallbackExecutor(NAMESPACE, simulate=True)

# Test execution
if decision.should_execute:
    result = executor.execute(decision)
    print("\nExecution Result:")
    print(json.dumps(result.to_dict(), indent=2, default=str))
else:
    print("\nDecision does not require execution")

## 6. Complete Hybrid Workflow

Run the full workflow end to end: take a metrics snapshot, make a hybrid decision, execute it with fallback, and record the outcome for adaptive learning.
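The last stage of the workflow feeds execution outcomes back to the router through `router.record_outcome`. As a rough illustration of that kind of bookkeeping, the sketch below tallies successes and failures per strategy. `OutcomeTally` is a hypothetical stand-in and the strategy labels are illustrative; it does not reproduce the router's actual update logic.

```python
from collections import defaultdict
from typing import Dict


class OutcomeTally:
    """Hypothetical per-strategy success/failure counter (illustration only)."""

    def __init__(self) -> None:
        self.history: Dict[str, Dict[str, int]] = defaultdict(
            lambda: {"successes": 0, "failures": 0}
        )

    def record_outcome(self, strategy: str, success: bool) -> None:
        key = "successes" if success else "failures"
        self.history[strategy][key] += 1

    def success_rate(self, strategy: str) -> float:
        # .get avoids creating an empty entry for a strategy never recorded
        data = self.history.get(strategy, {"successes": 0, "failures": 0})
        total = data["successes"] + data["failures"]
        return data["successes"] / total if total > 0 else 0.0


tally = OutcomeTally()
for strategy, success in [("hybrid", True), ("hybrid", False), ("rule_based", True)]:
    tally.record_outcome(strategy, success)

print(tally.success_rate("hybrid"))      # 0.5
print(tally.success_rate("rule_based"))  # 1.0
```

Only executed workflows are recorded in the notebook's workflow function, so skipped or approval-pending decisions do not dilute a strategy's success rate.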

In [None]:
def run_hybrid_workflow(metrics: Dict[str, float], 
                        record_outcome: bool = True) -> Dict:
    """
    Run complete hybrid healing workflow.
    
    Args:
        metrics: Current metric values
        record_outcome: Whether to record outcome for learning
    
    Returns:
        Workflow result with all stages
    """
    workflow_start = datetime.now()
    workflow_id = hashlib.sha256(str(workflow_start).encode()).hexdigest()[:12]
    
    logger.info(f"Starting hybrid workflow {workflow_id}")
    
    # Stage 1: Make hybrid decision
    decision = hybrid_maker.decide(metrics)
    
    # Stage 2: Execute with fallback
    execution = executor.execute(decision)
    
    # Stage 3: Record outcome for adaptive learning
    if record_outcome and execution.executed:
        router.record_outcome(
            strategy=decision.strategy_used.value,
            success=execution.success
        )
    
    workflow_duration = (datetime.now() - workflow_start).total_seconds()
    
    return {
        'workflow_id': workflow_id,
        'duration_seconds': workflow_duration,
        'decision': decision.to_dict(),
        'execution': execution.to_dict(),
        'summary': {
            'strategy': decision.strategy_used.value,
            'action': decision.final_action,
            'confidence': f"{decision.confidence:.1%}",
            'executed': execution.executed,
            'success': execution.success,
            'fallback_used': execution.fallback_used
        }
    }


# Run single workflow
result = run_hybrid_workflow(test_metrics)
print("\n" + "="*60)
print("WORKFLOW SUMMARY")
print("="*60)
print(json.dumps(result['summary'], indent=2))

## 7. Batch Simulation

Run multiple workflows on randomly generated metrics to gather performance data for each strategy and action.
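The batch draws its metrics from unseeded random distributions, so two runs produce different tables. If repeatable runs matter, one option is to draw from a seeded generator. The sketch below uses the standard library's `random.Random` rather than NumPy to stay dependency-free; `generate_seeded_metrics` is a hypothetical helper, not part of this notebook.

```python
import random
from typing import Dict


def generate_seeded_metrics(rng: random.Random) -> Dict[str, float]:
    """Draw a small metrics sample from a caller-owned generator.

    Hypothetical helper; covers only three of the metrics for brevity.
    """
    return {
        "cpu_usage": rng.betavariate(2, 5) * 100,
        "memory_percent": rng.betavariate(3, 2) * 100,
        # expovariate takes a rate (1 / mean), so a mean of 0.03 becomes 1 / 0.03
        "error_ratio": rng.expovariate(1 / 0.03),
    }


# Two generators created with the same seed yield identical samples
first = generate_seeded_metrics(random.Random(42))
second = generate_seeded_metrics(random.Random(42))
print(first == second)  # True
```

Passing the generator in as an argument, instead of relying on global random state, keeps each simulation run independent of whatever other cells have already consumed random numbers.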

In [None]:
def generate_test_metrics() -> Dict[str, float]:
    """Generate random test metrics with realistic distributions."""
    return {
        'cpu_usage': np.random.beta(2, 5) * 100,  # Usually low, sometimes high
        'memory_percent': np.random.beta(3, 2) * 100,  # Usually higher
        'error_ratio': np.random.exponential(0.03),  # Mostly low, occasional spikes
        'latency_p99': np.random.lognormal(6, 0.8),  # Log-normal latency
        'restart_count': np.random.poisson(0.5),  # Usually 0, sometimes more
        'request_rate': np.random.exponential(100),
        'network_rx': np.random.exponential(1e7),
        'network_tx': np.random.exponential(5e6),
        'disk_usage': np.random.beta(2, 3) * 100
    }


# Run batch simulation
NUM_WORKFLOWS = 30
batch_results = []

print(f"Running {NUM_WORKFLOWS} hybrid workflows...\n")
print(f"{'#':>3} {'Strategy':<12} {'Action':<20} {'Conf':>6} {'Exec':>5} {'Success':>7} {'Fallback':>8}")
print("-" * 75)

for i in range(NUM_WORKFLOWS):
    metrics = generate_test_metrics()
    result = run_hybrid_workflow(metrics, record_outcome=True)
    batch_results.append(result)
    
    s = result['summary']
    print(f"{i+1:3d} {s['strategy']:<12} {s['action']:<20} {s['confidence']:>6} "
          f"{'Yes' if s['executed'] else 'No':>5} {'✅' if s['success'] else '❌':>7} "
          f"{'Yes' if s['fallback_used'] else 'No':>8}")

## 8. Performance Analysis

In [None]:
# Analyze results
df = pd.DataFrame([r['summary'] for r in batch_results])
df['confidence_num'] = df['confidence'].str.rstrip('%').astype(float) / 100

print("\n" + "="*60)
print("PERFORMANCE ANALYSIS")
print("="*60)

# Overall metrics
print("\n📊 Overall Metrics:")
print(f"  Total workflows: {len(df)}")
print(f"  Execution rate: {df['executed'].mean():.1%}")
print(f"  Success rate (of executed): {df[df['executed']]['success'].mean():.1%}")
print(f"  Fallback rate: {df['fallback_used'].mean():.1%}")
print(f"  Average confidence: {df['confidence_num'].mean():.1%}")

# By strategy
print("\n📈 By Strategy:")
strategy_stats = df.groupby('strategy').agg({
    'executed': ['count', 'mean'],
    'success': 'mean',
    'confidence_num': 'mean'
}).round(3)
print(strategy_stats.to_string())

# By action
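print("\n🔧 By Action:")
action_counts = df['action'].value_counts()
for action, count in action_counts.items():
    # Compute success over executed workflows only; skipped and
    # approval-pending workflows would otherwise skew the rate
    executed_df = df[(df['action'] == action) & df['executed']]
    if len(executed_df) > 0:
        rate_str = f"{executed_df['success'].mean():.0%} success of {len(executed_df)} executed"
    else:
        rate_str = "none executed"
    print(f"  {action}: {count} ({rate_str})")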

# Adaptive weights
print("\n⚖️ Adaptive Weights (after learning):")
print(f"  Rule weight: {router._adaptive_weights['rule']:.1%}")
print(f"  AI weight: {router._adaptive_weights['ai']:.1%}")

# Performance history
print("\n📜 Performance History:")
for strategy, data in router._performance_history.items():
    total = data['successes'] + data['failures']
    rate = data['successes'] / total if total > 0 else 0
    print(f"  {strategy}: {data['successes']}/{total} ({rate:.0%} success)")

## 9. Save Results

In [None]:
# Save workflow tracking data
tracking_df = pd.DataFrame([r['summary'] for r in batch_results])
tracking_df['timestamp'] = [r['decision']['timestamp'] for r in batch_results]
tracking_df['workflow_id'] = [r['workflow_id'] for r in batch_results]

tracking_file = PROCESSED_DIR / 'hybrid_workflow_tracking.parquet'
tracking_df.to_parquet(tracking_file)
logger.info(f"Saved tracking data to {tracking_file}")

# Save detailed results
detailed_file = PROCESSED_DIR / 'hybrid_workflow_detailed.json'
with open(detailed_file, 'w') as f:
    json.dump(batch_results, f, indent=2, default=str)
logger.info(f"Saved detailed results to {detailed_file}")

print("\n✅ Results saved:")
print(f"  - {tracking_file}")
print(f"  - {detailed_file}")

## Validation

In [None]:
# Validate all components
print("VALIDATION CHECKS")
print("="*60)

checks = {
    'Rule engine initialized': len(rule_engine.rules) > 0,
    'AI adapter working': 'confidence' in ai_prediction,
    'Router functioning': routing.strategy is not None,
    'Hybrid maker working': decision.decision_id is not None,
    'Executor functioning': len(executor._execution_log) > 0,
    'Batch simulation complete': len(batch_results) == NUM_WORKFLOWS,
    'Tracking file created': tracking_file.exists(),
    # Pass if any strategy has recorded outcomes, not only 'rule_based'
    'Adaptive learning active': any(
        d['successes'] + d['failures'] > 0
        for d in router._performance_history.values()
    ),
}

all_passed = True
for check, passed in checks.items():
    status = "✅" if passed else "❌"
    print(f"{status} {check}")
    all_passed = all_passed and passed

print("\n" + "="*60)
if all_passed:
    print("✅ ALL VALIDATIONS PASSED")
else:
    print("❌ SOME VALIDATIONS FAILED")

print("\n📋 Hybrid Healing Summary:")
print(f"  Rules defined: {len(rule_engine.rules)}")
print(f"  Workflows executed: {len(batch_results)}")
print(f"  Overall success rate: {df[df['executed']]['success'].mean():.1%}")
print(f"  Strategies used: {df['strategy'].nunique()}")

## Integration Notes

### Components

1. **RuleEngine** - 10 predefined rules covering:
   - CPU (high, critical)
   - Memory (high, OOM risk)
   - Error rates (high, critical)
   - Latency, crash loops, network, disk
   - Built-in cooldown and rate limiting

2. **AIDecisionAdapter** - Wraps ensemble model inference
   - Anomaly classification
   - Confidence scoring
   - Action recommendation

3. **DecisionRouter** - Intelligent routing with:
   - Severity-based routing
   - Known vs unknown anomaly handling
   - Adaptive weights from outcome feedback
   - Conservative mode for disagreements

4. **HybridDecisionMaker** - Combines both approaches:
   - Weighted action selection
   - Combined confidence scoring
   - Approval requirements

5. **FallbackExecutor** - Resilient execution:
   - Action-specific fallback chains
   - Automatic fallback to the next action in the chain on failure
   - Execution logging
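The combined confidence scoring and the execution gate used by `HybridDecisionMaker` can be illustrated with a small worked example. This is a minimal sketch with hypothetical helper names (`combine_confidence`, `should_execute`) and made-up input values; the notebook's own gate additionally blocks execution when the routing strategy is conservative.

```python
def combine_confidence(rule_conf: float, rule_weight: float,
                       ai_conf: float, ai_weight: float) -> float:
    """Weighted combination of rule and AI confidence (hypothetical helper)."""
    return rule_conf * rule_weight + ai_conf * ai_weight


def should_execute(action: str, confidence: float, threshold: float = 0.6) -> bool:
    """Gate execution on a non-passive action and sufficient confidence."""
    return action != "monitor_only" and confidence >= threshold


# Illustrative inputs: 0.9 * 0.4 + 0.7 * 0.6 = 0.36 + 0.42 = 0.78
confidence = combine_confidence(rule_conf=0.9, rule_weight=0.4,
                                ai_conf=0.7, ai_weight=0.6)
print(round(confidence, 2))  # 0.78
print(should_execute("restart_pod", confidence))   # True
print(should_execute("monitor_only", confidence))  # False
```

The result is a true weighted average only when the two weights sum to 1. When no rule matches, the rule confidence contributes 0, which pulls the combined value down in proportion to the rule weight.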

### Integration with AI-Driven Notebook

This notebook builds on `ai-driven-decision-making-enhanced.ipynb`:
- Uses similar feature extraction patterns
- Compatible outcome tracking format
- Shared confidence thresholds

### Next Steps

1. Connect to live Prometheus metrics
2. Implement real Kubernetes actions
3. Add more sophisticated rules
4. Integrate with MCP coordination engine
5. Deploy as part of self-healing platform
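As a starting point for the real Kubernetes actions, one option is to translate action names into `kubectl` argument lists before deciding how to run them. The sketch below only builds the command and does not execute it. The `build_kubectl_command` helper, the action-to-command mapping, and the namespace and deployment names are all assumptions for illustration.

```python
from typing import Dict, List


def build_kubectl_command(action: str, namespace: str, deployment: str,
                          replicas: int = 3) -> List[str]:
    """Return an argv list for a remediation action without running it.

    Hypothetical helper: the action-to-command mapping is an assumption.
    """
    target = f"deployment/{deployment}"
    commands: Dict[str, List[str]] = {
        "scale_horizontal": ["kubectl", "scale", target, f"--replicas={replicas}"],
        "restart_pod": ["kubectl", "rollout", "restart", target],
        "rollback_deployment": ["kubectl", "rollout", "undo", target],
    }
    if action not in commands:
        raise ValueError(f"No command mapping for action: {action}")
    return commands[action] + ["-n", namespace]


cmd = build_kubectl_command("restart_pod", "demo-namespace", "demo-app")
print(" ".join(cmd))
# kubectl rollout restart deployment/demo-app -n demo-namespace
```

Returning a list instead of a shell string means the command can later be passed to `subprocess.run(cmd, check=True)` without shell quoting concerns, and it can be logged or dry-run checked first.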