# Unified Evaluation Framework Demo

This notebook demonstrates how to use the new unified evaluation framework to compare different causal Bayesian optimization methods.

## Key Features:
- Single interface for all evaluation methods (GRPO, BC, baselines)
- Standardized result format for easy comparison
- Built-in visualization and analysis tools
- Parallel execution support

In [None]:
# Setup imports
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

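# Add project root to path (assumes the notebook runs from a direct subdirectory of the project root)
project_root = Path.cwd().parent
if str(project_root) not in sys.path:  # avoid duplicate entries when the cell is re-run
    sys.path.append(str(project_root))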

# Import unified evaluation components
from src.causal_bayes_opt.evaluation import (
    setup_evaluation_runner,
    run_evaluation_comparison,
    results_to_dataframe,
    plot_learning_curves,
    create_summary_report
)

# Import SCM creators
from examples.demo_scms import (
    create_easy_scm_base,
    create_medium_scm,
    create_hard_scm
)

print("Unified evaluation framework loaded successfully!")

## 1. Setup Test SCMs

We'll use three SCMs of varying difficulty for evaluation.

In [None]:
# Create test SCMs
test_scms = [
    create_easy_scm_base(),
    create_medium_scm(),
    create_hard_scm()
]

print(f"Created {len(test_scms)} test SCMs")

# Print SCM info
from src.causal_bayes_opt.data_structures.scm import get_target, get_variables, get_parents

scm_labels = ['easy', 'medium', 'hard']  # same order as test_scms

for i, (scm, label) in enumerate(zip(test_scms, scm_labels)):
    target = get_target(scm)
    variables = list(get_variables(scm))
    parents = list(get_parents(scm, target))
    print(f"\nSCM {i} ({label}):")
    print(f"  Variables: {len(variables)}")
    print(f"  Target: {target}")
    print(f"  True parents: {parents}")
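The helper functions above hide how an SCM stores its graph. As a mental model only, the sketch below represents a graph as a plain dictionary that maps each variable to its direct parents, then reads off the same three quantities the loop prints. The dictionary layout and the names `toy_scm`, `toy_variables`, and `toy_parents` are hypothetical and are not the framework's actual data structure.

```python
# Hypothetical stand-in for an SCM: each variable maps to its direct parents.
# The framework's real SCM type is not shown in this notebook and may differ.
toy_scm = {
    "target": "Y",
    "parents": {
        "X": [],          # root variable, no parents
        "Z": ["X"],       # X -> Z
        "Y": ["X", "Z"],  # X -> Y and Z -> Y
    },
}


def toy_variables(scm):
    """Return all variable names in sorted order."""
    return sorted(scm["parents"])


def toy_parents(scm, variable):
    """Return the direct parents of `variable`."""
    return list(scm["parents"][variable])


toy_target = toy_scm["target"]
print(f"Variables: {len(toy_variables(toy_scm))}")
print(f"Target: {toy_target}")
print(f"True parents: {toy_parents(toy_scm, toy_target)}")
```

The "true parents" of the target are what a structure-learning method tries to recover, which is why the notebook prints them for each test SCM.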

## 2. Configure Evaluation

Set up the evaluation configuration that will be used for all methods.

In [None]:
# Evaluation configuration
eval_config = {
    'experiment': {
        'target': {
            'max_interventions': 15,
            'n_observational_samples': 100,
            'intervention_value_range': (-2.0, 2.0),
            'learning_rate': 1e-3
        }
    }
}

# Number of random seeds per SCM
n_seeds = 5

print("Configuration:")
print(f"  Max interventions: {eval_config['experiment']['target']['max_interventions']}")
print(f"  Observational samples: {eval_config['experiment']['target']['n_observational_samples']}")
print(f"  Seeds per SCM: {n_seeds}")
print(f"  Runs per method: {len(test_scms) * n_seeds}")
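Because the same configuration is applied to every method, the compute budget grows multiplicatively with the number of SCMs, seeds, and methods. The back-of-the-envelope sketch below uses the values from the cell above together with three methods, and assumes each run performs at most `max_interventions` interventions. The function name `count_runs` is illustrative and not part of the framework.

```python
def count_runs(n_scms, n_seeds, n_methods, max_interventions):
    """Return (total runs, upper bound on total interventions)."""
    runs = n_scms * n_seeds * n_methods
    return runs, runs * max_interventions


total_runs, total_interventions = count_runs(
    n_scms=3, n_seeds=5, n_methods=3, max_interventions=15
)
print(f"Total runs across methods: {total_runs}")
print(f"Interventions (upper bound): {total_interventions}")
```

This kind of estimate is useful before raising `n_seeds` or adding checkpoint-based methods, since each extra method adds another full sweep over all SCMs and seeds.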

## 3. Run Baseline Methods Comparison

First, let's compare the three baseline methods:
- Random: Uniform random intervention selection
- Learning: Online structure learning with simple policy
- Oracle: Perfect knowledge of causal structure
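To make the difference between the two extremes concrete, here is a toy sketch of how a random policy and an oracle policy might choose interventions. It assumes the oracle restricts itself to the true parents of the target, which is one plausible reading of "perfect knowledge of causal structure"; the graph, the policy functions, and all names below are hypothetical and do not come from the framework.

```python
import random

# Hypothetical toy graph: variable -> direct parents. Y is the target.
PARENTS = {"X": [], "Z": ["X"], "W": [], "Y": ["X", "Z"]}
TARGET = "Y"
VALUE_RANGE = (-2.0, 2.0)  # mirrors intervention_value_range in eval_config


def random_policy(rng):
    """Pick any non-target variable uniformly at random."""
    candidates = [v for v in PARENTS if v != TARGET]
    return rng.choice(candidates), rng.uniform(*VALUE_RANGE)


def oracle_policy(rng):
    """Pick only among the true parents of the target."""
    return rng.choice(PARENTS[TARGET]), rng.uniform(*VALUE_RANGE)


rng = random.Random(0)
random_picks = [random_policy(rng) for _ in range(200)]
oracle_picks = [oracle_policy(rng) for _ in range(200)]
print("Random policy touched:", sorted({v for v, _ in random_picks}))
print("Oracle policy touched:", sorted({v for v, _ in oracle_picks}))
```

The learning baseline sits between these two: it starts without knowledge of the parents and has to infer them from the data it collects.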

In [None]:
# Setup evaluation runner with baseline methods
baseline_runner = setup_evaluation_runner(
    methods=['random', 'learning', 'oracle'],
    parallel=True  # Enable parallel execution
)

print("Running baseline comparison...")

# Run evaluation
baseline_results = run_evaluation_comparison(
    runner=baseline_runner,
    test_scms=test_scms,
    config=eval_config,
    n_seeds=n_seeds
)

print("\nBaseline evaluation complete!")
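The notebook does not show how `parallel=True` is implemented. As a rough illustration of why parallel execution is possible at all, the sketch below treats every (method, SCM, seed) combination as an independent job and maps the jobs over a thread pool. The `run_single` function and its fake metric are placeholders, and the framework may well use processes or a different scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import random


def run_single(job):
    """Hypothetical single run: returns a fake improvement for one job."""
    method, scm_idx, seed = job
    # Seeding with a string is deterministic, so each job is reproducible.
    rng = random.Random(f"{method}:{scm_idx}:{seed}")
    return {"method": method, "scm_idx": scm_idx, "seed": seed,
            "improvement": rng.gauss(0.0, 1.0)}


toy_methods = ["random", "learning", "oracle"]
jobs = list(product(toy_methods, range(3), range(5)))  # 3 methods x 3 SCMs x 5 seeds

# Runs are independent, so they can be mapped over a worker pool.
# pool.map returns results in the same order as the input jobs.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_records = list(pool.map(run_single, jobs))

sequential_records = [run_single(job) for job in jobs]
print(f"Completed {len(parallel_records)} runs")
print("Same results as sequential:", parallel_records == sequential_records)
```

Giving each job its own seeded random generator is what keeps the parallel and sequential results identical; sharing one global generator across workers would make the outcome depend on scheduling.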

## 4. Analyze Baseline Results

Let's look at the performance of each baseline method.

In [None]:
# Convert results to DataFrame for analysis
baseline_df = results_to_dataframe(baseline_results)

print("Baseline Method Performance:")
print(baseline_df.to_string(index=False, float_format='%.3f'))

# Create bar plot of mean improvements
plt.figure(figsize=(10, 6))
methods = baseline_df['method'].values
improvements = baseline_df['mean_improvement'].values
errors = baseline_df['std_improvement'].values

bars = plt.bar(methods, improvements, yerr=errors, capsize=10)
plt.ylabel('Mean Target Improvement')
plt.title('Baseline Method Comparison')
plt.grid(True, alpha=0.3)

# Color bars by sign: green for a positive improvement, red otherwise
# (assumes a larger 'mean_improvement' is better)
colors = ['green' if x > 0 else 'red' for x in improvements]
for bar, color in zip(bars, colors):
    bar.set_color(color)

plt.tight_layout()
plt.show()
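The `mean_improvement` and `std_improvement` columns are per-method aggregates. The sketch below shows one way such columns could be computed from per-run values, using hypothetical numbers and the sample standard deviation. Whether the framework uses the sample or population standard deviation, and how it defines "improvement", is not stated in this notebook.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical per-run improvements: (method, improvement).
toy_runs = [
    ("random", 0.10), ("random", 0.30), ("random", 0.20),
    ("oracle", 0.90), ("oracle", 1.10), ("oracle", 1.00),
]

# Group the per-run values by method.
by_method = defaultdict(list)
for method, improvement in toy_runs:
    by_method[method].append(improvement)

# Aggregate each group into a mean and a sample standard deviation.
summary = {
    method: {"mean_improvement": mean(values), "std_improvement": stdev(values)}
    for method, values in by_method.items()
}
for method, stats in summary.items():
    print(f"{method}: {stats['mean_improvement']:.3f} +/- {stats['std_improvement']:.3f}")
```

With only a handful of seeds per SCM the standard deviation is itself a noisy estimate, so error bars like those in the bar plot should be read as a rough indication of spread.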

## 5. Plot Learning Curves

Visualize how each method's target value and F1 score evolve over the course of a run, with one panel per SCM.

In [None]:
# Plot learning curves for each SCM
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
scm_names = ['Easy SCM', 'Medium SCM', 'Hard SCM']

for scm_idx, (ax, scm_name) in enumerate(zip(axes, scm_names)):
    # Plot on current axes
    plt.sca(ax)
    plot_learning_curves(
        baseline_results, 
        scm_idx=scm_idx,
        metric='outcome_value',
        title=f'{scm_name} - Target Value'
    )

plt.tight_layout()
plt.show()

# Plot F1 score evolution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for scm_idx, (ax, scm_name) in enumerate(zip(axes, scm_names)):
    plt.sca(ax)
    plot_learning_curves(
        baseline_results,
        scm_idx=scm_idx, 
        metric='f1_score',
        title=f'{scm_name} - F1 Score'
    )

plt.tight_layout()
plt.show()
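The notebook does not define the `f1_score` metric. Assuming it measures how well a method recovers the true parent set of the target, it could be computed as the harmonic mean of precision and recall over predicted parents, as in the sketch below. The function `parent_f1` and the convention for two empty sets are illustrative choices, not the framework's definition.

```python
def parent_f1(predicted, true):
    """F1 score between a predicted and a true parent set."""
    predicted, true = set(predicted), set(true)
    if not predicted and not true:
        return 1.0  # convention: both empty counts as a perfect match
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


print(parent_f1({"X", "Z"}, {"X", "Z"}))  # exact recovery
print(parent_f1({"X", "W"}, {"X", "Z"}))  # one correct, one spurious
print(parent_f1({"W"}, {"X", "Z"}))       # nothing correct
```

Under this reading, a rising F1 curve means the method's estimate of the target's parents is converging on the true structure as interventions accumulate.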

## 6. Generate Summary Report

Create a comprehensive text report of the results.

In [None]:
# Generate and display summary report
report = create_summary_report(baseline_results)
print(report)

## 7. Compare with GRPO/BC Methods (if checkpoints available)

This section demonstrates how to include GRPO and BC methods in the comparison.
You'll need to provide paths to trained checkpoints.

In [None]:
# Example: Setup runner with all methods including GRPO/BC
# NOTE: Update these paths to your actual checkpoint locations

# checkpoint_paths = {
#     'grpo': Path('path/to/grpo/checkpoint'),
#     'bc_surrogate': Path('path/to/bc/surrogate/checkpoint'),
#     'bc_acquisition': Path('path/to/bc/acquisition/checkpoint')
# }

# full_runner = setup_evaluation_runner(
#     methods=['random', 'learning', 'oracle', 'grpo', 'bc_surrogate', 'bc_acquisition', 'bc_both'],
#     checkpoint_paths=checkpoint_paths,
#     parallel=True
# )

# full_results = run_evaluation_comparison(
#     runner=full_runner,
#     test_scms=test_scms,
#     config=eval_config,
#     n_seeds=n_seeds,
#     output_dir=Path('evaluation_results')  # Save results
# )

print("To compare with GRPO/BC methods, uncomment the code above and provide checkpoint paths.")
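Before uncommenting the cell above, it can help to check that every checkpoint path actually exists, so that a missing file is reported up front rather than partway through a long run. The helper `missing_checkpoints` below is a hypothetical convenience function, not part of the framework, and the paths are the same placeholders used above.

```python
from pathlib import Path


def missing_checkpoints(checkpoint_paths):
    """Return the names whose checkpoint path does not exist on disk."""
    return sorted(
        name for name, path in checkpoint_paths.items()
        if not Path(path).exists()
    )


# Placeholder paths, as in the commented-out cell above.
example_paths = {
    "grpo": Path("path/to/grpo/checkpoint"),
    "bc_surrogate": Path("path/to/bc/surrogate/checkpoint"),
    "here": Path("."),  # exists, included only to show a passing entry
}
missing = missing_checkpoints(example_paths)
if missing:
    print(f"Missing checkpoints, skipping these methods: {missing}")
```

One option is to drop the missing names from the `methods` list and run the comparison with whatever checkpoints are available.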

## 8. Statistical Analysis

Inspect the statistical tests stored with the comparison results to check whether the differences between methods are significant.

In [None]:
# Access statistical test results
if baseline_results.statistical_tests:
    print("Statistical Test Results:")
    for test_name, test_result in baseline_results.statistical_tests.items():
        print(f"\n{test_name}:")
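        statistic = test_result.get('statistic')
        p_value = test_result.get('p_value')
        # Format numbers only when present; applying ':.3f' to an 'N/A' fallback raises ValueError
        print(f"  Statistic: {statistic:.3f}" if statistic is not None else "  Statistic: N/A")
        print(f"  P-value: {p_value:.4f}" if p_value is not None else "  P-value: N/A")
        print(f"  Significant (p<0.05): {'Yes' if p_value is not None and p_value < 0.05 else 'No'}")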
else:
    print("No statistical tests were performed (need at least 2 methods).")
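The notebook does not say which test the framework runs. As one example of how a p-value for "method A differs from method B" can be obtained, the sketch below implements a two-sided permutation test on the difference of means, using only the standard library. The scores are hypothetical, and `permutation_test` is an illustrative stand-in rather than the framework's procedure.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Reassign the pooled values to two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-run improvements for two methods (5 seeds each).
oracle_scores = [1.9, 2.1, 2.0, 2.2, 1.8]
random_scores = [0.1, -0.2, 0.3, 0.0, 0.2]
p_value = permutation_test(oracle_scores, random_scores)
print(f"p-value: {p_value:.4f}")
print(f"Significant (p<0.05): {'Yes' if p_value < 0.05 else 'No'}")
```

A permutation test makes no normality assumption, which suits small samples such as five seeds per SCM. When several method pairs are compared, the 0.05 threshold should be adjusted for multiple comparisons.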

## 9. Save and Load Results

Save the results to disk so they can be reloaded for later analysis.

In [None]:
# Save results
import pickle
from datetime import datetime

output_dir = Path('evaluation_results')
output_dir.mkdir(exist_ok=True)

timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
results_path = output_dir / f'baseline_results_{timestamp}.pkl'

with open(results_path, 'wb') as f:
    pickle.dump(baseline_results, f)

print(f"Results saved to: {results_path}")

# Example: Load and visualize saved results
# from src.causal_bayes_opt.evaluation import load_and_visualize_results
# loaded_results = load_and_visualize_results(results_path)
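The round trip itself can be checked without the framework. The sketch below pickles a plain dictionary standing in for the results object, loads it back from a temporary directory, and confirms that the contents survive. The dictionary `toy_results` is hypothetical; the real results object has its own structure.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the framework's results object.
toy_results = {"method_means": {"random": 0.2, "oracle": 1.0}, "n_seeds": 5}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_results.pkl"
    with open(toy_path, "wb") as f:
        pickle.dump(toy_results, f)
    # Only unpickle files you created yourself or otherwise trust.
    with open(toy_path, "rb") as f:
        restored = pickle.load(f)

print("Round trip preserved contents:", restored == toy_results)
```

Pickle files are tied to the class definitions that produced them, so results saved with one version of the code may not load after the result classes change. Unpickling can also execute arbitrary code, so it should only be used on trusted files.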

## Summary

This notebook demonstrated the unified evaluation framework for causal Bayesian optimization:

1. **Simple Setup**: Use `setup_evaluation_runner()` to configure methods
2. **Consistent Interface**: All methods (GRPO, BC, baselines) use the same API
3. **Standardized Results**: Results are in a common format for easy comparison
4. **Built-in Analysis**: Includes visualization and statistical testing tools
5. **Extensible**: Easy to add new evaluation methods by implementing `BaseEvaluator`

The framework makes it straightforward to:
- Compare different methods fairly
- Analyze performance across multiple metrics
- Generate publication-ready plots and tables
- Save and share results

For production use with GRPO/BC methods, simply provide the checkpoint paths when setting up the runner.
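As a rough illustration of the extension point mentioned in item 5, the sketch below shows what implementing a new evaluator could look like. The real `BaseEvaluator` interface is not shown in this notebook, so the class `ToyBaseEvaluator`, its `evaluate` signature, and the returned dictionary are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
import random


class ToyBaseEvaluator(ABC):
    """Hypothetical base class; the real BaseEvaluator may look different."""

    name = "base"

    @abstractmethod
    def evaluate(self, scm, config, seed):
        """Run one evaluation and return a dict of metrics."""


class ConstantEvaluator(ToyBaseEvaluator):
    """Toy method that always intervenes with the same value."""

    name = "constant"

    def __init__(self, value=0.0):
        self.value = value

    def evaluate(self, scm, config, seed):
        rng = random.Random(seed)
        n_steps = config["max_interventions"]
        # Fake outcome trajectory: the constant value plus seeded noise.
        outcomes = [self.value + rng.gauss(0.0, 0.1) for _ in range(n_steps)]
        return {"method": self.name, "seed": seed, "outcomes": outcomes}


evaluator = ConstantEvaluator(value=1.0)
result = evaluator.evaluate(scm=None, config={"max_interventions": 15}, seed=0)
print(result["method"], len(result["outcomes"]))
```

The point of an abstract base class is that the runner only needs to call one method per evaluator, so a new method can join the comparison without changes to the analysis code.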