# Themis Comparison Tutorial

This notebook demonstrates how to compare multiple experiment runs using statistical tests.

## Setup

In [None]:
from themis import evaluate
from themis.comparison import compare_runs
from themis.comparison.statistics import StatisticalTest

## 1. Run Multiple Experiments

First, let's run evaluations with different configurations:

In [None]:
# Experiment 1: Temperature = 0.0
result1 = evaluate(
    benchmark="demo",
    model="fake-math-llm",
    temperature=0.0,
    run_id="temp-0.0",
    limit=10,
)
print(f"Temp 0.0 accuracy: {result1.metrics['ExactMatch']:.2%}")

# Experiment 2: Temperature = 0.7
result2 = evaluate(
    benchmark="demo",
    model="fake-math-llm",
    temperature=0.7,
    run_id="temp-0.7",
    limit=10,
)
print(f"Temp 0.7 accuracy: {result2.metrics['ExactMatch']:.2%}")

## 2. Compare Two Runs

Now let's compare them statistically:

In [None]:
# Compare with bootstrap test (default)
report = compare_runs(
    run_ids=["temp-0.0", "temp-0.7"],
    storage_path=".cache/experiments",
    statistical_test=StatisticalTest.BOOTSTRAP,
    alpha=0.05,  # 95% confidence
)

# Print summary
print(report.summary())
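
To build intuition for what a bootstrap comparison does, here is a minimal standard-library sketch. It is not Themis's implementation: the helper name `paired_bootstrap_ci` and the per-sample scores are made up for illustration. It resamples the paired score differences with replacement and reports a percentile confidence interval for the mean difference. If the interval excludes zero, the difference counts as significant at the chosen `alpha`.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_bootstrap=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired score differences.

    Hypothetical helper for illustration; not part of Themis.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_bootstrap)
    )
    # Percentile interval: cut alpha/2 of the resampled means from each tail.
    lower = means[int((alpha / 2) * n_bootstrap)]
    upper = means[min(int((1 - alpha / 2) * n_bootstrap), n_bootstrap - 1)]
    return lower, upper


# Made-up per-sample ExactMatch scores (1.0 = correct) for two runs.
run_a = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
run_b = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

lower, upper = paired_bootstrap_ci(run_a, run_b)
significant = not (lower <= 0.0 <= upper)
print(f"95% CI for mean difference: [{lower:.3f}, {upper:.3f}]")
print(f"Significant at alpha=0.05: {significant}")
```

With only ten samples per run, as in this tutorial, such intervals tend to be wide, so a visible accuracy gap can still come out as not significant.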

## 3. Access Comparison Results

In [None]:
# Overall best run
print(f"Overall best: {report.overall_best_run}")

# Best per metric
for metric, run_id in report.best_run_per_metric.items():
    print(f"{metric}: {run_id}")

# Detailed pairwise results
for result in report.pairwise_results:
    print(f"\n{result.summary()}")
    if result.is_significant():
        print("  ✓ Statistically significant")
    else:
        print("  ✗ Not statistically significant")

## 4. Different Statistical Tests

The same comparison can be rerun with a different test by changing `statistical_test`:

In [None]:
# T-test
report_ttest = compare_runs(
    run_ids=["temp-0.0", "temp-0.7"],
    storage_path=".cache/experiments",
    statistical_test=StatisticalTest.T_TEST,
)

print("T-test Results:")
for result in report_ttest.pairwise_results:
    if result.test_result:
        print(f"  {result.metric_name}: p={result.test_result.p_value:.4f}")
        if result.test_result.effect_size is not None:  # an effect size of 0.0 is still reported
            print(f"  Effect size (Cohen's d): {result.test_result.effect_size:.3f}")

In [None]:
# Permutation test
report_perm = compare_runs(
    run_ids=["temp-0.0", "temp-0.7"],
    storage_path=".cache/experiments",
    statistical_test=StatisticalTest.PERMUTATION,
    alpha=0.01,  # 99% confidence
)

print("\nPermutation Test Results:")
for result in report_perm.pairwise_results:
    if result.test_result:
        print(f"  {result.metric_name}: p={result.test_result.p_value:.4f}")
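
For readers unfamiliar with permutation tests, the idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not Themis's implementation: the helper name `paired_permutation_p_value` and the scores are made up, and the sketch uses the sign-flip variant for paired data. Under the null hypothesis the two runs are interchangeable on each sample, so the sign of each paired difference is randomly flipped and the p-value is the share of shuffles whose mean difference is at least as extreme as the observed one.

```python
import random
import statistics


def paired_permutation_p_value(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided sign-flip permutation test on paired differences.

    Hypothetical helper for illustration; not part of Themis.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))
    extreme = 0
    for _ in range(n_permutations):
        # Swapping the run labels within a pair flips the sign of its difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(statistics.fmean(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up per-sample ExactMatch scores (1.0 = correct) for two runs.
run_a = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
run_b = [1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

p_value = paired_permutation_p_value(run_a, run_b)
print(f"p = {p_value:.4f}")
print(f"Significant at alpha=0.01: {p_value < 0.01}")
```

A permutation test makes no normality assumption, which is one reason to prefer it over a t-test for small samples of 0/1 scores.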

## 5. Compare Multiple Runs

Compare three or more runs by passing every run ID to `compare_runs`:

In [None]:
# Run third experiment
result3 = evaluate(
    benchmark="demo",
    model="fake-math-llm",
    temperature=1.0,
    run_id="temp-1.0",
    limit=10,
)
print(f"Temp 1.0 accuracy: {result3.metrics['ExactMatch']:.2%}")

# Compare all three
report_multi = compare_runs(
    run_ids=["temp-0.0", "temp-0.7", "temp-1.0"],
    storage_path=".cache/experiments",
)

print("\n" + "=" * 60)
print(report_multi.summary(include_details=False))

## 6. Win/Loss Matrix

View pairwise comparisons:

In [None]:
# Get win/loss matrix for ExactMatch
matrix = report_multi.win_loss_matrices["ExactMatch"]

print("Win/Loss Matrix for ExactMatch:")
print(matrix.to_table())

# Rankings
print("\nRankings:")
for rank, (run_id, wins, losses, ties) in enumerate(matrix.rank_runs(), 1):
    print(f"{rank}. {run_id}: {wins}W-{losses}L-{ties}T")
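
To make the W-L-T records concrete, the following sketch tallies them from a list of pairwise outcomes. Everything in it is hypothetical: the run names, the outcomes, the helpers `tally_records` and `rank_by_record`, and the ranking rule (most wins first, fewest losses as the tie-breaker) are assumptions for illustration and may differ from what `rank_runs()` does.

```python
from collections import defaultdict

# Made-up pairwise outcomes: (run_a, run_b, winner). A winner of None means the
# difference was not statistically significant, which is counted as a tie.
pairwise_outcomes = [
    ("run-a", "run-b", "run-a"),
    ("run-a", "run-c", "run-a"),
    ("run-b", "run-c", None),
]


def tally_records(outcomes):
    """Return {run_id: [wins, losses, ties]} from pairwise outcomes."""
    records = defaultdict(lambda: [0, 0, 0])
    for run_a, run_b, winner in outcomes:
        if winner is None:
            records[run_a][2] += 1
            records[run_b][2] += 1
        else:
            loser = run_b if winner == run_a else run_a
            records[winner][0] += 1
            records[loser][1] += 1
    return dict(records)


def rank_by_record(records):
    """Sort by wins (descending), then losses (ascending); an assumed rule."""
    return sorted(
        ((run_id, *wlt) for run_id, wlt in records.items()),
        key=lambda row: (-row[1], row[2]),
    )


records = tally_records(pairwise_outcomes)
for rank, (run_id, wins, losses, ties) in enumerate(rank_by_record(records), 1):
    print(f"{rank}. {run_id}: {wins}W-{losses}L-{ties}T")
```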

## 7. Export Results

Export comparison reports to various formats:

In [None]:
import json
from pathlib import Path

# Export to JSON
report_dict = report_multi.to_dict()
output_file = Path("comparison_report.json")
output_file.write_text(json.dumps(report_dict, indent=2))
print(f"Exported to: {output_file}")

# View structure
print("\nReport keys:", list(report_dict.keys()))
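
An exported report can be read back with the same standard-library modules. The sketch below round-trips a placeholder dictionary rather than a real report, since the actual keys returned by `to_dict()` are not shown in this tutorial. Passing `default=str` is a defensive assumption: it stringifies any value that `json` cannot encode natively (for example a `Path` or `datetime`), which may or may not be needed for a real report.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for report_multi.to_dict(); the real report's keys may differ.
placeholder_report = {"run_ids": ["run-a", "run-b"], "metrics": ["ExactMatch"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    output_file = Path(tmp_dir) / "comparison_report.json"
    # default=str converts values json cannot encode natively into strings.
    output_file.write_text(json.dumps(placeholder_report, indent=2, default=str))
    # Read the file back and parse it into plain Python objects.
    loaded = json.loads(output_file.read_text())

print("Loaded keys:", sorted(loaded.keys()))
```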

## 8. Programmatic Statistics

Use the statistics module directly:

In [None]:
from themis.comparison.statistics import t_test, bootstrap_confidence_interval

# Example scores
model_a_scores = [0.85, 0.87, 0.83, 0.90, 0.82, 0.88, 0.84, 0.86]
model_b_scores = [0.78, 0.80, 0.79, 0.82, 0.77, 0.81, 0.80, 0.79]

# T-test
t_result = t_test(model_a_scores, model_b_scores, paired=True)
print("T-test:")
print(f"  Statistic: {t_result.statistic:.3f}")
print(f"  P-value: {t_result.p_value:.4f}")
print(f"  Significant: {t_result.significant}")
print(f"  Effect size: {t_result.effect_size:.3f}")

# Bootstrap
boot_result = bootstrap_confidence_interval(
    model_a_scores, model_b_scores, n_bootstrap=10000, confidence_level=0.95, seed=42
)
print("\nBootstrap:")
print(f"  CI: {boot_result.confidence_interval}")
print(f"  Significant: {boot_result.significant}")
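
As a sanity check, the paired t statistic can be computed by hand from the same example scores using only the standard library. This sketch assumes the textbook paired-test formulas: the t statistic is the mean difference divided by its standard error, and the effect size is the paired form of Cohen's d (often written d_z), the mean difference in units of its standard deviation. Themis may define effect size differently (for example with a pooled standard deviation), so the effect size it reports need not match this value.

```python
import math
import statistics

model_a_scores = [0.85, 0.87, 0.83, 0.90, 0.82, 0.88, 0.84, 0.86]
model_b_scores = [0.78, 0.80, 0.79, 0.82, 0.77, 0.81, 0.80, 0.79]

# A paired test works on the per-sample differences, not on the two lists separately.
diffs = [a - b for a, b in zip(model_a_scores, model_b_scores)]
n = len(diffs)
mean_diff = statistics.fmean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)

# Paired t statistic: mean difference divided by its standard error.
t_statistic = mean_diff / (sd_diff / math.sqrt(n))
# Cohen's d for paired data (d_z): mean difference in standard-deviation units.
cohens_dz = mean_diff / sd_diff

print(f"t = {t_statistic:.3f} with {n - 1} degrees of freedom")
print(f"d_z = {cohens_dz:.3f}")
```

The p-value then follows from the t distribution with `n - 1` degrees of freedom, which the standard library does not provide; that step is left to `t_test`.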

## Summary

In this tutorial, you learned:
- ✅ How to compare two or more runs
- ✅ How to use different statistical tests (t-test, bootstrap, permutation)
- ✅ How to interpret p-values and significance
- ✅ How to read win/loss matrices for multiple runs
- ✅ How to export comparison results
- ✅ How to use the statistics functions directly

## Next Steps

- Learn about [custom metrics](../docs/EVALUATION.md)
- Explore [backend extensibility](../docs/EXTENDING_BACKENDS.md)
- Check out the [API server](../docs/API_SERVER.md)

Happy comparing! 🔬