# Root Cause Analysis with Controlled Anomaly Injection

## Objective
This notebook demonstrates the root cause analysis capabilities of the Intervention Search system by:

1. **Creating a realistic synthetic dataset** with a known causal structure
2. **Injecting controlled anomalies** at specific nodes in the causal graph
3. **Propagating anomaly effects** through the causal DAG to downstream nodes
4. **Training models using Auto ML mode** - testing multiple model types per node
5. **Testing detection capabilities** - verifying if the system can identify the true root causes

## Causal Structure: E-commerce Platform

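We'll simulate an e-commerce platform with the causal structure sketched below. The sketch is simplified: the edge list in Step 1 also includes `website_traffic → page_load_time`, `website_traffic → orders`, and `stockout_rate → revenue`.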

```
Marketing Spend → Website Traffic → Conversion Rate → Orders → Revenue
                                           ↑            ↑
                                    Page Load Time    |
                                           ↑            |
                                    Server Capacity    |
                                                        |
Inventory Level ----------------------------------------+
                    ↓
               Stock-out Rate
```
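
The diagram can be cross-checked programmatically. The sketch below rebuilds the graph with `networkx` from the edge list defined in Step 1, then lists the exogenous root nodes (the candidates for anomaly injection) and every ancestor of `revenue`. The variable names `graph`, `roots`, and `revenue_ancestors` are illustrative and not part of the notebook.

```python
import networkx as nx

# Edge list copied from Step 1 of this notebook
edges = [
    ("marketing_spend", "website_traffic"),
    ("server_capacity", "page_load_time"),
    ("website_traffic", "page_load_time"),
    ("website_traffic", "conversion_rate"),
    ("page_load_time", "conversion_rate"),
    ("website_traffic", "orders"),
    ("conversion_rate", "orders"),
    ("inventory_level", "orders"),
    ("inventory_level", "stockout_rate"),
    ("orders", "revenue"),
    ("stockout_rate", "revenue"),
]

graph = nx.DiGraph(edges)

# Root nodes have no incoming edges
roots = sorted(n for n in graph.nodes if graph.in_degree(n) == 0)

# Every node that can influence revenue through some directed path
revenue_ancestors = sorted(nx.ancestors(graph, "revenue"))

print("Roots:", roots)
print("Ancestors of revenue:", revenue_ancestors)
```

All three roots are ancestors of `revenue`, which is why a revenue anomaly can in principle be traced back to any of them.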

In [None]:
# ============================================================================
# IMPORTS
# ============================================================================

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
sns.set_style('whitegrid')

# Import HT RCA tools
from ht_categ import HT, HTConfig

print("✅ All imports successful!")

## Step 1: Define Causal Graph Structure

In [None]:
# ============================================================================
# DEFINE CAUSAL DAG
# ============================================================================

edges = [
    # Marketing & Traffic
    ('marketing_spend', 'website_traffic'),
    
    # Infrastructure
    ('server_capacity', 'page_load_time'),
    ('website_traffic', 'page_load_time'),  # More traffic = slower pages
    
    # Conversion funnel
    ('website_traffic', 'conversion_rate'),
    ('page_load_time', 'conversion_rate'),  # Slow pages reduce conversion
    
    # Orders
    ('website_traffic', 'orders'),
    ('conversion_rate', 'orders'),
    ('inventory_level', 'orders'),  # Low inventory limits orders
    
    # Stock management
    ('inventory_level', 'stockout_rate'),
    
    # Revenue
    ('orders', 'revenue'),
    ('stockout_rate', 'revenue'),  # Stock-outs hurt revenue
]

# Create adjacency matrix
all_nodes = sorted(set([node for edge in edges for node in edge]))
adj_matrix = pd.DataFrame(0, index=all_nodes, columns=all_nodes)

for source, target in edges:
    adj_matrix.loc[source, target] = 1

print("Causal Graph:")
print(f"  Nodes: {len(all_nodes)}")
print(f"  Edges: {len(edges)}")
print(f"\nNodes: {all_nodes}")

# Validate DAG
G = nx.from_pandas_adjacency(adj_matrix, create_using=nx.DiGraph())
assert nx.is_directed_acyclic_graph(G), "Graph must be a DAG!"
print("\n✅ Valid DAG structure confirmed")

## Step 2: Generate Realistic Baseline Data

We'll generate data following the causal relationships with realistic noise.
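
The generator in this step repeats the downstream equations once per injected node. As an alternative design, here is a minimal sketch that evaluates structural equations in topological order, so a change to one node reaches its descendants without per-node branches. `sample_scm` and `scale_window` are hypothetical helpers, and only two of the nine equations are included. Unlike the notebook's generator, the sketch reuses the same noise draws for the baseline and the shocked run, rather than redrawing noise during re-propagation.

```python
import numpy as np

def sample_scm(equations, n_samples, seed, overrides=None):
    """Evaluate structural equations in insertion order (must be topological).

    overrides maps a node name to a function applied to that node's values
    before any downstream node is computed, so the change propagates.
    """
    rng = np.random.default_rng(seed)
    overrides = overrides or {}
    data = {}
    for node, equation in equations.items():
        values = equation(data, rng, n_samples)
        if node in overrides:
            values = overrides[node](values)
        data[node] = values
    return data

# Two equations with the coefficients used by the generator in this step
equations = {
    "marketing_spend": lambda d, rng, n: rng.uniform(10_000, 50_000, n),
    "website_traffic": lambda d, rng, n: np.maximum(
        500 + 0.8 * d["marketing_spend"] + rng.normal(0, 2000, n), 100
    ),
}

def scale_window(start, end, effect):
    """Return an override that multiplies values[start:end] by effect."""
    def apply(values):
        out = values.copy()
        out[start:end] *= effect
        return out
    return apply

baseline = sample_scm(equations, n_samples=800, seed=0)
shocked = sample_scm(
    equations, n_samples=800, seed=0,
    overrides={"marketing_spend": scale_window(600, 800, 0.6)},
)

# Traffic outside the window is unchanged; inside it can only go down
print(np.array_equal(baseline["website_traffic"][:600], shocked["website_traffic"][:600]))
print(np.all(shocked["website_traffic"][600:] <= baseline["website_traffic"][600:]))
```

Sharing the noise draws isolates the effect of the intervention; redrawing noise, as the notebook does, is also valid but mixes the shift with fresh sampling variation.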

In [None]:
# ============================================================================
# DATA GENERATION FUNCTION
# ============================================================================

def generate_ecommerce_data(n_samples=1000, inject_anomaly=False, anomaly_config=None):
    """
    Generate realistic e-commerce data following the causal DAG.
    
    Args:
        n_samples: Number of data points to generate
        inject_anomaly: Whether to inject anomalies
        anomaly_config: Dictionary with anomaly specifications:
            {
                'node': str,  # Which node to inject anomaly in
                'start_idx': int,  # When to start anomaly
                'end_idx': int,  # When to end anomaly (exclusive, as in Python slicing)
                'effect': float,  # Multiplicative effect (e.g., 0.5 = 50% reduction)
                'description': str  # What went wrong
            }
    
    Returns:
        Tuple (df, anomaly_info): the generated DataFrame, and a dict
        describing the injected anomaly (None if no anomaly was injected)
    """
    
    data = {}
    
    # ROOT NODES (Exogenous variables)
    # Marketing spend: $10K-$50K per period
    data['marketing_spend'] = np.random.uniform(10000, 50000, n_samples)
    
    # Server capacity: 100-500 units
    data['server_capacity'] = np.random.uniform(100, 500, n_samples)
    
    # Inventory level: 1000-10000 units
    data['inventory_level'] = np.random.uniform(1000, 10000, n_samples)
    
    # INTERMEDIATE NODES
    # Website traffic: Driven by marketing spend
    data['website_traffic'] = (
        500 + 
        0.8 * data['marketing_spend'] + 
        np.random.normal(0, 2000, n_samples)
    )
    data['website_traffic'] = np.maximum(data['website_traffic'], 100)  # Floor at 100
    
    # Page load time: Affected by server capacity and traffic
    # Higher traffic and lower capacity = slower pages
    data['page_load_time'] = (
        0.5 + 
        0.00002 * data['website_traffic'] - 
        0.002 * data['server_capacity'] + 
        np.random.normal(0, 0.1, n_samples)
    )
    data['page_load_time'] = np.maximum(data['page_load_time'], 0.1)  # Min 0.1s
    
    # Conversion rate: Traffic brings visitors, but slow pages hurt conversion
    base_conversion = 0.03  # 3% base conversion
    traffic_effect = 0.000001 * data['website_traffic']  # Slight positive effect
    speed_penalty = -0.01 * data['page_load_time']  # Slow pages hurt
    
    data['conversion_rate'] = (
        base_conversion + 
        traffic_effect + 
        speed_penalty + 
        np.random.normal(0, 0.005, n_samples)
    )
    data['conversion_rate'] = np.clip(data['conversion_rate'], 0.001, 0.1)  # 0.1% - 10%
    
    # Stock-out rate: Inversely related to inventory
    data['stockout_rate'] = (
        0.5 - 
        0.00004 * data['inventory_level'] + 
        np.random.normal(0, 0.05, n_samples)
    )
    data['stockout_rate'] = np.clip(data['stockout_rate'], 0, 0.5)  # 0-50%
    
    # Orders: Traffic * Conversion rate, limited by inventory
    potential_orders = data['website_traffic'] * data['conversion_rate']
    inventory_limit = data['inventory_level'] * 0.1  # Can sell 10% of inventory per period
    
    data['orders'] = np.minimum(
        potential_orders + np.random.normal(0, 10, n_samples),
        inventory_limit
    )
    data['orders'] = np.maximum(data['orders'], 0)
    
    # Revenue: Orders * average order value, reduced by stock-outs
    avg_order_value = 100  # $100 per order
    data['revenue'] = (
        data['orders'] * avg_order_value * (1 - 0.3 * data['stockout_rate']) +
        np.random.normal(0, 1000, n_samples)
    )
    data['revenue'] = np.maximum(data['revenue'], 0)
    
    # INJECT ANOMALY if requested
    anomaly_info = None
    if inject_anomaly and anomaly_config:
        node = anomaly_config['node']
        start_idx = anomaly_config['start_idx']
        end_idx = anomaly_config['end_idx']
        effect = anomaly_config['effect']
        
        if node in data:
            # Record the pre-anomaly mean of the affected window
            original_mean = np.mean(data[node][start_idx:end_idx])
            
            # Apply multiplicative effect
            data[node][start_idx:end_idx] *= effect
            
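            # Re-propagate through the DAG
            # This is crucial: anomalies cascade through causal relationships!
            # Note: only the three root nodes below are handled. For any other
            # node the values are scaled but downstream nodes are left unchanged.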
            
            if node == 'server_capacity':
                # Recalculate downstream effects
                data['page_load_time'][start_idx:end_idx] = (
                    0.5 + 
                    0.00002 * data['website_traffic'][start_idx:end_idx] - 
                    0.002 * data['server_capacity'][start_idx:end_idx] + 
                    np.random.normal(0, 0.1, end_idx - start_idx)
                )
                data['page_load_time'][start_idx:end_idx] = np.maximum(
                    data['page_load_time'][start_idx:end_idx], 0.1
                )
                
                # Recalculate conversion rate
                traffic_effect = 0.000001 * data['website_traffic'][start_idx:end_idx]
                speed_penalty = -0.01 * data['page_load_time'][start_idx:end_idx]
                data['conversion_rate'][start_idx:end_idx] = np.clip(
                    base_conversion + traffic_effect + speed_penalty + 
                    np.random.normal(0, 0.005, end_idx - start_idx),
                    0.001, 0.1
                )
                
                # Recalculate orders
                potential_orders = (
                    data['website_traffic'][start_idx:end_idx] * 
                    data['conversion_rate'][start_idx:end_idx]
                )
                inventory_limit = data['inventory_level'][start_idx:end_idx] * 0.1
                data['orders'][start_idx:end_idx] = np.maximum(
                    np.minimum(
                        potential_orders + np.random.normal(0, 10, end_idx - start_idx),
                        inventory_limit
                    ),
                    0
                )
                
                # Recalculate revenue
                data['revenue'][start_idx:end_idx] = np.maximum(
                    data['orders'][start_idx:end_idx] * avg_order_value * 
                    (1 - 0.3 * data['stockout_rate'][start_idx:end_idx]) +
                    np.random.normal(0, 1000, end_idx - start_idx),
                    0
                )
            
            elif node == 'inventory_level':
                # Recalculate stock-out rate
                data['stockout_rate'][start_idx:end_idx] = np.clip(
                    0.5 - 0.00004 * data['inventory_level'][start_idx:end_idx] + 
                    np.random.normal(0, 0.05, end_idx - start_idx),
                    0, 0.5
                )
                
                # Recalculate orders (limited by inventory)
                potential_orders = (
                    data['website_traffic'][start_idx:end_idx] * 
                    data['conversion_rate'][start_idx:end_idx]
                )
                inventory_limit = data['inventory_level'][start_idx:end_idx] * 0.1
                data['orders'][start_idx:end_idx] = np.maximum(
                    np.minimum(
                        potential_orders + np.random.normal(0, 10, end_idx - start_idx),
                        inventory_limit
                    ),
                    0
                )
                
                # Recalculate revenue
                data['revenue'][start_idx:end_idx] = np.maximum(
                    data['orders'][start_idx:end_idx] * avg_order_value * 
                    (1 - 0.3 * data['stockout_rate'][start_idx:end_idx]) +
                    np.random.normal(0, 1000, end_idx - start_idx),
                    0
                )
            
            elif node == 'marketing_spend':
                # Recalculate website traffic
                data['website_traffic'][start_idx:end_idx] = np.maximum(
                    500 + 0.8 * data['marketing_spend'][start_idx:end_idx] + 
                    np.random.normal(0, 2000, end_idx - start_idx),
                    100
                )
                
                # Cascade through page load time, conversion, orders, revenue
                data['page_load_time'][start_idx:end_idx] = np.maximum(
                    0.5 + 0.00002 * data['website_traffic'][start_idx:end_idx] - 
                    0.002 * data['server_capacity'][start_idx:end_idx] + 
                    np.random.normal(0, 0.1, end_idx - start_idx),
                    0.1
                )
                
                traffic_effect = 0.000001 * data['website_traffic'][start_idx:end_idx]
                speed_penalty = -0.01 * data['page_load_time'][start_idx:end_idx]
                data['conversion_rate'][start_idx:end_idx] = np.clip(
                    base_conversion + traffic_effect + speed_penalty + 
                    np.random.normal(0, 0.005, end_idx - start_idx),
                    0.001, 0.1
                )
                
                potential_orders = (
                    data['website_traffic'][start_idx:end_idx] * 
                    data['conversion_rate'][start_idx:end_idx]
                )
                inventory_limit = data['inventory_level'][start_idx:end_idx] * 0.1
                data['orders'][start_idx:end_idx] = np.maximum(
                    np.minimum(
                        potential_orders + np.random.normal(0, 10, end_idx - start_idx),
                        inventory_limit
                    ),
                    0
                )
                
                data['revenue'][start_idx:end_idx] = np.maximum(
                    data['orders'][start_idx:end_idx] * avg_order_value * 
                    (1 - 0.3 * data['stockout_rate'][start_idx:end_idx]) +
                    np.random.normal(0, 1000, end_idx - start_idx),
                    0
                )
            
            anomaly_mean = np.mean(data[node][start_idx:end_idx])
            
            anomaly_info = {
                'node': node,
                'start_idx': start_idx,
                'end_idx': end_idx,
                'effect': effect,
                'original_mean': original_mean,
                'anomaly_mean': anomaly_mean,
                'pct_change': ((anomaly_mean - original_mean) / original_mean) * 100,
                'description': anomaly_config.get('description', 'Unknown anomaly')
            }
    
    df = pd.DataFrame(data)
    
    # Add time index
    df['time_period'] = range(len(df))
    
    return df, anomaly_info

print("✅ Data generation function defined")

## Step 3: Generate Normal (Baseline) Data
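
Because `orders` is the minimum of demand (`website_traffic * conversion_rate`) and 10% of inventory, it is worth checking how often the inventory limit binds. On rows where it binds, changes in traffic or conversion do not reach `orders` or `revenue`. A minimal sketch, where `inventory_cap_share` is a hypothetical helper that ignores the noise term added to demand:

```python
import pandas as pd

def inventory_cap_share(df, sell_through=0.1):
    """Fraction of rows where traffic * conversion exceeds the inventory limit."""
    potential_orders = df["website_traffic"] * df["conversion_rate"]
    inventory_limit = df["inventory_level"] * sell_through
    return float((potential_orders > inventory_limit).mean())

# Tiny hand-made frame: the first row is demand-limited, the second is capped
demo = pd.DataFrame({
    "website_traffic": [10_000.0, 30_000.0],
    "conversion_rate": [0.03, 0.05],
    "inventory_level": [9_000.0, 2_000.0],
})
print(inventory_cap_share(demo))
```

On the notebook's data this would be called as `inventory_cap_share(df_normal)` once the baseline has been generated.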

In [None]:
# ============================================================================
# GENERATE BASELINE DATA
# ============================================================================

df_normal, _ = generate_ecommerce_data(n_samples=800, inject_anomaly=False)

print(f"Normal data shape: {df_normal.shape}")
print(f"\nBaseline statistics:")
print(df_normal[all_nodes].describe())

# Visualize baseline data
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()

for idx, node in enumerate(all_nodes):
    axes[idx].hist(df_normal[node], bins=30, alpha=0.7, edgecolor='black')
    axes[idx].set_title(f'{node}\n(μ={df_normal[node].mean():.1f}, σ={df_normal[node].std():.1f})')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.suptitle('Baseline Data Distribution', y=1.02, fontsize=14, fontweight='bold')
plt.show()

print("\n✅ Baseline data generated and visualized")

## Step 4: Generate Anomalous Data with Known Root Cause

We'll inject **THREE different anomalies** to test the detection system:

1. **Server Capacity Reduction** (50% reduction) - Simulates infrastructure failure
2. **Inventory Shortage** (30% reduction) - Simulates supply chain issue  
3. **Marketing Budget Cut** (40% reduction) - Simulates business decision
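
For orientation, the first-order shift each anomaly induces in its direct child can be worked out from the generator's coefficients. The sketch below uses the mean of each uniform root distribution and ignores floors, clipping, and noise, so the values are rough expectations rather than results: about +0.3 s in page load time (and about -0.003 in conversion rate) for Scenario 1, about +0.066 in stock-out rate for Scenario 2, and about -9,600 visits for Scenario 3.

```python
# Means of the uniform root distributions used by the generator
mean_capacity = (100 + 500) / 2        # server_capacity
mean_inventory = (1000 + 10000) / 2    # inventory_level
mean_spend = (10000 + 50000) / 2       # marketing_spend

# Scenario 1: capacity scaled by 0.5 -> page_load_time -> conversion_rate
d_page_load = -0.002 * (0.5 - 1) * mean_capacity
d_conversion = -0.01 * d_page_load

# Scenario 2: inventory scaled by 0.7 -> stockout_rate
d_stockout = -0.00004 * (0.7 - 1) * mean_inventory

# Scenario 3: marketing spend scaled by 0.6 -> website_traffic
d_traffic = 0.8 * (0.6 - 1) * mean_spend

print(f"page_load_time shift:  {d_page_load:+.3f} s")
print(f"conversion_rate shift: {d_conversion:+.4f}")
print(f"stockout_rate shift:   {d_stockout:+.3f}")
print(f"website_traffic shift: {d_traffic:+.0f}")
```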

In [None]:
# ============================================================================
# GENERATE ANOMALOUS DATA - SCENARIO 1: Server Capacity Failure
# ============================================================================

anomaly_config_1 = {
    'node': 'server_capacity',
    'start_idx': 600,
    'end_idx': 800,
    'effect': 0.5,  # 50% reduction
    'description': 'Server capacity reduced by 50% due to infrastructure failure'
}

df_anomaly_1, anomaly_info_1 = generate_ecommerce_data(
    n_samples=800,
    inject_anomaly=True,
    anomaly_config=anomaly_config_1
)

print("="*70)
print("SCENARIO 1: SERVER CAPACITY FAILURE")
print("="*70)
print(f"Root cause: {anomaly_info_1['node']}")
print(f"Description: {anomaly_info_1['description']}")
print(f"Anomaly period: Samples {anomaly_info_1['start_idx']}-{anomaly_info_1['end_idx']}")
print(f"Original mean: {anomaly_info_1['original_mean']:.2f}")
print(f"Anomaly mean: {anomaly_info_1['anomaly_mean']:.2f}")
print(f"Percentage change: {anomaly_info_1['pct_change']:.1f}%")

# Extract anomaly period
df_anomaly_period_1 = df_anomaly_1.iloc[600:800].copy()
print(f"\nAnomalous period shape: {df_anomaly_period_1.shape}")

In [None]:
# ============================================================================
# GENERATE ANOMALOUS DATA - SCENARIO 2: Inventory Shortage
# ============================================================================

anomaly_config_2 = {
    'node': 'inventory_level',
    'start_idx': 600,
    'end_idx': 800,
    'effect': 0.7,  # 30% reduction
    'description': 'Inventory reduced by 30% due to supply chain disruption'
}

df_anomaly_2, anomaly_info_2 = generate_ecommerce_data(
    n_samples=800,
    inject_anomaly=True,
    anomaly_config=anomaly_config_2
)

print("\n" + "="*70)
print("SCENARIO 2: INVENTORY SHORTAGE")
print("="*70)
print(f"Root cause: {anomaly_info_2['node']}")
print(f"Description: {anomaly_info_2['description']}")
print(f"Anomaly period: Samples {anomaly_info_2['start_idx']}-{anomaly_info_2['end_idx']}")
print(f"Original mean: {anomaly_info_2['original_mean']:.2f}")
print(f"Anomaly mean: {anomaly_info_2['anomaly_mean']:.2f}")
print(f"Percentage change: {anomaly_info_2['pct_change']:.1f}%")

df_anomaly_period_2 = df_anomaly_2.iloc[600:800].copy()

In [None]:
# ============================================================================
# GENERATE ANOMALOUS DATA - SCENARIO 3: Marketing Budget Cut
# ============================================================================

anomaly_config_3 = {
    'node': 'marketing_spend',
    'start_idx': 600,
    'end_idx': 800,
    'effect': 0.6,  # 40% reduction
    'description': 'Marketing budget cut by 40% due to cost reduction initiative'
}

df_anomaly_3, anomaly_info_3 = generate_ecommerce_data(
    n_samples=800,
    inject_anomaly=True,
    anomaly_config=anomaly_config_3
)

print("\n" + "="*70)
print("SCENARIO 3: MARKETING BUDGET CUT")
print("="*70)
print(f"Root cause: {anomaly_info_3['node']}")
print(f"Description: {anomaly_info_3['description']}")
print(f"Anomaly period: Samples {anomaly_info_3['start_idx']}-{anomaly_info_3['end_idx']}")
print(f"Original mean: {anomaly_info_3['original_mean']:.2f}")
print(f"Anomaly mean: {anomaly_info_3['anomaly_mean']:.2f}")
print(f"Percentage change: {anomaly_info_3['pct_change']:.1f}%")

df_anomaly_period_3 = df_anomaly_3.iloc[600:800].copy()

print("\n✅ All anomaly scenarios generated")

## Step 5: Visualize Anomaly Propagation

Let's visualize how anomalies at root causes cascade through the causal graph.
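
Alongside the plots, a compact table of per-node shifts makes the cascade easier to compare across scenarios. The sketch below computes the percent change of each column's mean inside a window relative to a baseline frame and sorts by magnitude. `window_shift` is a hypothetical helper, and the demo frames are made up for illustration.

```python
import pandas as pd

def window_shift(baseline, observed, start, end):
    """Percent change of each column's mean in observed[start:end] vs the baseline mean."""
    base_mean = baseline.mean()
    window_mean = observed.iloc[start:end].mean()
    pct = (window_mean - base_mean) / base_mean * 100
    # Largest absolute shift first
    return pct.sort_values(key=abs, ascending=False)

baseline = pd.DataFrame({"a": [10.0, 10.0, 10.0, 10.0], "b": [2.0, 2.0, 2.0, 2.0]})
observed = pd.DataFrame({"a": [10.0, 10.0, 5.0, 5.0], "b": [2.0, 2.0, 2.0, 3.0]})

shift = window_shift(baseline, observed, start=2, end=4)
print(shift)
```

With the notebook's frames, the equivalent call would be `window_shift(df_normal[all_nodes], df_anomaly_1[all_nodes], 600, 800)`.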

In [None]:
# ============================================================================
# VISUALIZE ANOMALY IMPACT - SCENARIO 1
# ============================================================================

fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()

for idx, node in enumerate(all_nodes):
    # Plot time series
    axes[idx].plot(df_anomaly_1['time_period'], df_anomaly_1[node], 
                   alpha=0.6, label='With Anomaly', linewidth=1.5)
    
    # Highlight anomaly period
    axes[idx].axvspan(600, 800, alpha=0.2, color='red', label='Anomaly Period')
    
    # Add mean line
    normal_mean = df_normal[node].mean()
    axes[idx].axhline(normal_mean, color='green', linestyle='--', 
                     alpha=0.7, label=f'Normal Mean: {normal_mean:.1f}')
    
    # Calculate percentage change in the anomaly period relative to the baseline mean
    anomaly_mean = df_anomaly_1[node].iloc[600:800].mean()
    pct_change = ((anomaly_mean - normal_mean) / normal_mean) * 100
    
    axes[idx].set_title(f'{node}\n(Anomaly period: {pct_change:+.1f}%)', fontsize=10)
    axes[idx].set_xlabel('Time Period')
    axes[idx].set_ylabel('Value')
    axes[idx].legend(fontsize=8, loc='best')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Scenario 1: Server Capacity Failure - Anomaly Propagation', 
             y=1.02, fontsize=14, fontweight='bold')
plt.show()

print("✅ Anomaly propagation visualized for Scenario 1")

## Step 6: Train Models with Auto ML Mode

Now we'll train the HT model using **Auto ML mode**, which will:
- Test multiple model types for each node (LinearRegression, RandomForest, XGBoost, LightGBM)
- Select the best performing model for each node based on R² or accuracy
- Store performance metrics for inspection
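
The selection step can be illustrated without the HT library. The sketch below is not HT's implementation: it compares two stand-in candidates (a mean predictor and ordinary least squares, rather than the four model types listed above) by cross-validated R² and keeps the better one. The function names `cv_r2` and `select_best` are assumptions made for this example.

```python
import numpy as np

def cv_r2(fit_predict, X, y, n_splits=5, seed=0):
    """Mean out-of-fold R² for a fit_predict(X_train, y_train, X_test) callable."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = fit_predict(X[train_mask], y[train_mask], X[test_idx])
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))

def mean_model(X_train, y_train, X_test):
    """Baseline candidate: always predict the training mean."""
    return np.full(len(X_test), y_train.mean())

def linear_model(X_train, y_train, X_test):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def select_best(candidates, X, y):
    """Score every candidate and return the name with the highest CV R²."""
    scores = {name: cv_r2(fn, X, y) for name, fn in candidates.items()}
    return max(scores, key=scores.get), scores

# Example node: website_traffic as a function of its parent marketing_spend
rng = np.random.default_rng(42)
spend = rng.uniform(10_000, 50_000, 500)
traffic = 500 + 0.8 * spend + rng.normal(0, 2000, 500)

best, scores = select_best(
    {"mean": mean_model, "linear": linear_model},
    spend.reshape(-1, 1),
    traffic,
)
print(best, scores)
```

In a per-node setup, the same selection would run once for each non-root node, with that node's parents in the graph as features.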

In [None]:
# ============================================================================
# TRAIN HT MODEL WITH AUTO ML
# ============================================================================

# Configure HT with Auto ML mode
config = HTConfig(
    graph=adj_matrix,
    aggregator="max",
    root_cause_top_k=5,
    model_type="AutoML",  # This activates Auto ML mode
    auto_ml=True,
    auto_ml_models=["LinearRegression", "RandomForest", "Xgboost", "LightGBM"]
)

# Create HT instance
ht_model = HT(config)

# Train on normal data only
print("Training HT model with Auto ML mode...\n")
ht_model.train(df_normal[all_nodes], perform_cv=True, verbose_automl=True)

print("\n" + "="*70)
print("✅ TRAINING COMPLETE")
print("="*70)

In [None]:
# ============================================================================
# INSPECT AUTO ML RESULTS
# ============================================================================

print("\n" + "="*70)
print("AUTO ML MODEL SELECTION RESULTS")
print("="*70)

if hasattr(ht_model, 'auto_ml_results'):
    for node, results in ht_model.auto_ml_results.items():
        print(f"\n{node}:")
        # Best score among the candidates that trained successfully
        best_score = max(
            (r['score'] for r in results if r.get('model') is not None),
            default=None
        )
        for res in results:
            if res.get('model') is not None:
                status = "✅ SELECTED" if res['score'] == best_score else "  "
                print(f"  {status} {res['model_name']:20s} | {res['metric_str']}")
            elif 'error' in res:
                print(f"     ❌ {res['model_name']:20s} | Failed: {res['error'][:50]}")
else:
    print("Auto ML results not available (not stored during training)")

# Check model quality report
print("\n" + "="*70)
print("MODEL QUALITY REPORT")
print("="*70)
quality_report = ht_model.get_model_quality_report()
print(f"\nOverall Quality Grade: {quality_report['trust_indicators']['quality_grade']}")
print(f"Graph Coverage: {quality_report['trust_indicators']['graph_coverage']}%")
print(f"\nRegression Performance:")
print(f"  Mean R²: {quality_report['overall_summary']['regression_performance']['mean_r2']:.4f}")
print(f"  Median R²: {quality_report['overall_summary']['regression_performance']['median_r2']:.4f}")
print(f"  Min R²: {quality_report['overall_summary']['regression_performance']['min_r2']:.4f}")
print(f"  Max R²: {quality_report['overall_summary']['regression_performance']['max_r2']:.4f}")

## Step 7: Test Root Cause Detection - Scenario 1 (Server Capacity)

**GROUND TRUTH**: The root cause is `server_capacity` (50% reduction)

Let's see if the model can detect it!

In [None]:
# ============================================================================
# ROOT CAUSE ANALYSIS - SCENARIO 1
# ============================================================================

print("="*70)
print("SCENARIO 1: ROOT CAUSE ANALYSIS")
print("="*70)
print(f"Ground Truth: {anomaly_info_1['node']} ({anomaly_info_1['description']})")
print(f"Expected: RCA should identify '{anomaly_info_1['node']}' as top root cause\n")

# Run RCA on anomalous period
rca_results_1 = ht_model.find_root_causes(
    df_anomaly_period_1[all_nodes],
    anomalous_metrics='revenue',  # We observe revenue drop
    return_paths=True,
    adjustment=False
)

print("\n" + "="*70)
print("DETECTED ROOT CAUSES (Top 5)")
print("="*70)

for idx, rc in enumerate(rca_results_1.root_cause_nodes, 1):
    is_correct = "✅ CORRECT!" if rc['root_cause'] == anomaly_info_1['node'] else ""
    print(f"{idx}. {rc['root_cause']:20s} | Score: {rc['score']:8.2f} | Severity: {rc['severity']} {is_correct}")

# Check if ground truth is in top 3
top_3_roots = [rc['root_cause'] for rc in rca_results_1.root_cause_nodes[:3]]
detection_success_1 = anomaly_info_1['node'] in top_3_roots

print("\n" + "="*70)
if detection_success_1:
    rank = top_3_roots.index(anomaly_info_1['node']) + 1
    print(f"✅ SUCCESS: Ground truth '{anomaly_info_1['node']}' detected at rank {rank}")
else:
    print(f"❌ FAILURE: Ground truth '{anomaly_info_1['node']}' NOT in top 3")
print("="*70)

## Step 8: Test Root Cause Detection - Scenario 2 (Inventory)

**GROUND TRUTH**: The root cause is `inventory_level` (30% reduction)

In [None]:
# ============================================================================
# ROOT CAUSE ANALYSIS - SCENARIO 2
# ============================================================================

print("="*70)
print("SCENARIO 2: ROOT CAUSE ANALYSIS")
print("="*70)
print(f"Ground Truth: {anomaly_info_2['node']} ({anomaly_info_2['description']})")
print(f"Expected: RCA should identify '{anomaly_info_2['node']}' as top root cause\n")

# Run RCA
rca_results_2 = ht_model.find_root_causes(
    df_anomaly_period_2[all_nodes],
    anomalous_metrics='revenue',
    return_paths=True,
    adjustment=False
)

print("\n" + "="*70)
print("DETECTED ROOT CAUSES (Top 5)")
print("="*70)

for idx, rc in enumerate(rca_results_2.root_cause_nodes, 1):
    is_correct = "✅ CORRECT!" if rc['root_cause'] == anomaly_info_2['node'] else ""
    print(f"{idx}. {rc['root_cause']:20s} | Score: {rc['score']:8.2f} | Severity: {rc['severity']} {is_correct}")

top_3_roots = [rc['root_cause'] for rc in rca_results_2.root_cause_nodes[:3]]
detection_success_2 = anomaly_info_2['node'] in top_3_roots

print("\n" + "="*70)
if detection_success_2:
    rank = top_3_roots.index(anomaly_info_2['node']) + 1
    print(f"✅ SUCCESS: Ground truth '{anomaly_info_2['node']}' detected at rank {rank}")
else:
    print(f"❌ FAILURE: Ground truth '{anomaly_info_2['node']}' NOT in top 3")
print("="*70)

## Step 9: Test Root Cause Detection - Scenario 3 (Marketing)

**GROUND TRUTH**: The root cause is `marketing_spend` (40% reduction)

In [None]:
# ============================================================================
# ROOT CAUSE ANALYSIS - SCENARIO 3
# ============================================================================

print("="*70)
print("SCENARIO 3: ROOT CAUSE ANALYSIS")
print("="*70)
print(f"Ground Truth: {anomaly_info_3['node']} ({anomaly_info_3['description']})")
print(f"Expected: RCA should identify '{anomaly_info_3['node']}' as top root cause\n")

# Run RCA
rca_results_3 = ht_model.find_root_causes(
    df_anomaly_period_3[all_nodes],
    anomalous_metrics='revenue',
    return_paths=True,
    adjustment=False
)

print("\n" + "="*70)
print("DETECTED ROOT CAUSES (Top 5)")
print("="*70)

for idx, rc in enumerate(rca_results_3.root_cause_nodes, 1):
    is_correct = "✅ CORRECT!" if rc['root_cause'] == anomaly_info_3['node'] else ""
    print(f"{idx}. {rc['root_cause']:20s} | Score: {rc['score']:8.2f} | Severity: {rc['severity']} {is_correct}")

top_3_roots = [rc['root_cause'] for rc in rca_results_3.root_cause_nodes[:3]]
detection_success_3 = anomaly_info_3['node'] in top_3_roots

print("\n" + "="*70)
if detection_success_3:
    rank = top_3_roots.index(anomaly_info_3['node']) + 1
    print(f"✅ SUCCESS: Ground truth '{anomaly_info_3['node']}' detected at rank {rank}")
else:
    print(f"❌ FAILURE: Ground truth '{anomaly_info_3['node']}' NOT in top 3")
print("="*70)

## Step 10: Summary and Evaluation

Let's summarize the detection performance across all scenarios.
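
The per-scenario checks in Steps 7-9 repeat the same rank lookup. A small helper can compute the rank of the ground truth, the number of top-k hits, and the mean reciprocal rank in one place. This is a sketch: `rank_of` and `summarize` are hypothetical names, and the rankings in the example are made up for illustration, not results from this notebook.

```python
def rank_of(ground_truth, ranked_nodes):
    """1-based rank of ground_truth in ranked_nodes, or None if absent."""
    try:
        return ranked_nodes.index(ground_truth) + 1
    except ValueError:
        return None

def summarize(cases, k=3):
    """cases: list of (ground_truth, ranked_nodes) pairs."""
    ranks = [rank_of(truth, ranked) for truth, ranked in cases]
    hits = sum(1 for r in ranks if r is not None and r <= k)
    # Scenarios where the ground truth is missing contribute 0 to the MRR
    mrr = sum(1 / r for r in ranks if r is not None) / len(cases)
    return {"ranks": ranks, f"top_{k}_hits": hits, "mrr": mrr}

# Made-up rankings for illustration only
cases = [
    ("server_capacity", ["page_load_time", "server_capacity", "orders"]),
    ("inventory_level", ["inventory_level", "stockout_rate"]),
    ("marketing_spend", ["orders", "conversion_rate", "website_traffic"]),
]

summary = summarize(cases, k=3)
print(summary)
```

Reporting the rank itself, rather than only a pass/fail flag, keeps top-3 and top-5 criteria from being conflated.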

In [None]:
# ============================================================================
# FINAL EVALUATION SUMMARY
# ============================================================================

print("\n" + "="*70)
print("FINAL EVALUATION: ROOT CAUSE DETECTION PERFORMANCE")
print("="*70)

scenarios = [
    {
        'name': 'Scenario 1: Server Capacity Failure',
        'ground_truth': anomaly_info_1['node'],
        'detected_roots': [rc['root_cause'] for rc in rca_results_1.root_cause_nodes[:5]],
        'success': detection_success_1
    },
    {
        'name': 'Scenario 2: Inventory Shortage',
        'ground_truth': anomaly_info_2['node'],
        'detected_roots': [rc['root_cause'] for rc in rca_results_2.root_cause_nodes[:5]],
        'success': detection_success_2
    },
    {
        'name': 'Scenario 3: Marketing Budget Cut',
        'ground_truth': anomaly_info_3['node'],
        'detected_roots': [rc['root_cause'] for rc in rca_results_3.root_cause_nodes[:5]],
        'success': detection_success_3
    }
]

successes = 0
for scenario in scenarios:
    print(f"\n{scenario['name']}")
    print(f"  Ground Truth: {scenario['ground_truth']}")
    print(f"  Detected (Top 5): {scenario['detected_roots']}")
    
    if scenario['ground_truth'] in scenario['detected_roots']:
        rank = scenario['detected_roots'].index(scenario['ground_truth']) + 1
        print(f"  Result: ✅ DETECTED at rank {rank}")
        successes += 1
    else:
        print(f"  Result: ❌ NOT DETECTED in top 5")

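accuracy = (successes / len(scenarios)) * 100
top3_hits = sum(s['success'] for s in scenarios)  # Top-3 outcomes recorded in Steps 7-9

print("\n" + "="*70)
print(f"OVERALL ACCURACY (Top 5): {successes}/{len(scenarios)} ({accuracy:.1f}%)")
print(f"Top-3 hits (Steps 7-9): {top3_hits}/{len(scenarios)}")
print("="*70)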

if successes / len(scenarios) >= 2 / 3:  # At least 2 out of 3 (66.67% would fail a 66.7 cutoff)
    print("\n✅ SYSTEM VALIDATION: PASSED")
    print("The root cause analysis system successfully identified most injected anomalies!")
else:
    print("\n⚠️ SYSTEM VALIDATION: NEEDS IMPROVEMENT")
    print("The system struggled to identify the injected anomalies.")
    print("Consider: (1) More training data, (2) Better model selection, (3) Reviewing causal structure")

print("\n" + "="*70)
print("NOTEBOOK COMPLETE")
print("="*70)

## Key Takeaways

1. **Auto ML Mode**: The system automatically tested multiple model types (LinearRegression, RandomForest, XGBoost, LightGBM) for each node and selected the best performer

2. **Controlled Testing**: By injecting known anomalies, we can objectively evaluate the detection system's performance

3. **Causal Propagation**: Anomalies cascade through the causal graph and affect downstream nodes; the generator simulates this by recomputing every descendant of the injected root node

4. **Detection Accuracy**: The system's ability to identify root causes depends on:
   - Quality of causal graph structure
   - Model accuracy for each node
   - Strength of the anomaly signal
   - Amount of training data

5. **Practical Application**: This approach can be used to:
   - Validate the RCA system before production deployment
   - Test different causal graph structures
   - Benchmark different model types
   - Understand detection limitations