BXS-Lab/EVA-Bench

EVA-Bench: A Scenario-Based Benchmark for Evaluating Domain Knowledge and Agentic Capability of Foundation Models in xEVA Operations

License: Apache 2.0 · Python 3.11+

EVA-Bench is a research benchmark for evaluating AI agents in Exploration EVA (xEVA) space operations. It provides 651 scenario-based tasks with safety-aware scoring to compare foundation models under realistic mission constraints. Tasks are grounded in 84 publicly available NASA documents spanning Apollo, ISS, and Artemis programs.

Key Statistics

| Metric | Count |
| --- | --- |
| Total tasks | 651 |
| Single-query (SQ) tasks | 516 |
| End-to-end (E2E) missions | 135 |
| Scenario families | 6 |
| Difficulty tiers | 3 |
| Models evaluated | 9 |
| NASA corpus documents | 84 |
| Simulated EVA tools | 16 |
| Safety sentinel categories | 5 |

Models Evaluated

| Provider | Models |
| --- | --- |
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5 |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 2.0 Flash |
| OpenAI | GPT-5.1, GPT-5 Mini, GPT-5 Nano |

Installation

# Clone the repository
git clone https://github.com/BXS-Lab/EVA-Bench.git
cd EVA-Bench

# Install in editable mode (required — CLI resolves paths relative to source tree)
pip install -e ".[dev]"

# Set up API keys (only needed for run, run_batch, and judge commands)
cp .env.example .env
# Edit .env with your API keys

Requirements: Python >= 3.11 and an editable install (pip install -e .), since the CLI resolves paths relative to the source tree. Scoring, validation, and tests work without API keys.

Quick Start

# Validate all task files
eva-bench validate

# Run a single task
eva-bench run SQ-F1-T1-001 --model openai/gpt-5.1 --seed 42

# Run a batch
eva-bench run_batch --model openai/gpt-5.1 --split public_dev

# Score a trace
eva-bench score traces/SQ-F1-T1-001_openai_gpt-5.1.json

# Generate reports
eva-bench report traces/ --output-dir reports/

Reproducing Published Results

All benchmark traces are included in traces/benchmark/. To reproduce scoring:

# Re-score all traces (deterministic — no API calls needed)
python scripts/run_benchmark.py --score-only

# Replicate LLM-as-judge scoring (requires ANTHROPIC_API_KEY)
python scripts/replicate_llm_judge.py --model claude-opus-4-6 --dry-run   # preview cost
python scripts/replicate_llm_judge.py --model claude-opus-4-6             # run
python scripts/replicate_llm_judge.py --model claude-opus-4-6 --compare   # run + compare with original

The replicate_llm_judge.py script sends each task's judge bundle to Claude Opus 4.6 with the exact evaluation criteria used in the published results. Use --judge-model to specify a different judge model. Use --resume to continue interrupted runs.

Deterministic scoring is pinned to sentence-transformers==5.1.0 and cross-encoder/nli-deberta-v3-small for reproducibility (see src/eva_bench/scorer/reproducibility.py).

Task Format

Single-Query (SQ) Tasks

Each SQ task is a JSON file in tasks/sq/ containing a question, expected facts, multiple-choice options, and scoring configuration:

{
  "task_id": "SQ-F1-T1-001",
  "track": "SQ",
  "scenario_family": "pre_eva_depress_egress",
  "difficulty_tier": 1,
  "title": "ISS Pre-Breathe Protocol Overview",
  "description": "Identify the approved ISS pre-breathe protocols...",
  "split": "public_dev",
  "source_type": "historical",
  "source_documents": ["NTRS-20140009556"],
  "allowed_tools": ["procedure_lookup", "document_search"],
  "query": "What is the standard pre-breathe protocol duration...?",
  "expected_facts": [
    "multiple approved pre-breathe protocols exist...",
    "purpose is nitrogen washout to prevent DCS..."
  ],
  "mc_options": {"A": "...", "B": "...", "C": "...", "D": "..."},
  "mc_correct": "A",
  "disallowed_claims": ["No pre-breathe is required for ISS EVA"],
  "scoring_config": {
    "metric_weights": {"correctness": 0.35, "completeness": 0.25, "relevance": 0.20, "evidence_attribution": 0.20},
    "sentinel_definitions": [
      {"sentinel_id": "SENT-...", "trigger_condition": "...", "severity": "minor", "safety_multiplier": 0.9}
    ]
  }
}
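As a minimal sketch (not the benchmark's own loader, which uses Pydantic models), an SQ task file can be read and lightly sanity-checked like this; `REQUIRED_KEYS` is an illustrative subset of the fields shown above, and the weight renormalization mirrors the documented runtime behavior:

```python
import json

# Illustrative subset of required fields (assumption, not the full schema).
REQUIRED_KEYS = {"task_id", "track", "query", "expected_facts", "scoring_config"}

def load_sq_task(text: str) -> dict:
    """Parse an SQ task JSON string and renormalize its metric weights."""
    task = json.loads(text)
    missing = REQUIRED_KEYS - task.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Weights are renormalized at runtime, so they need not sum to exactly 1.
    weights = task["scoring_config"]["metric_weights"]
    total = sum(weights.values())
    task["scoring_config"]["metric_weights"] = {k: v / total for k, v in weights.items()}
    return task
```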

End-to-End (E2E) Tasks

E2E tasks in tasks/e2e/ define multi-phase missions with state machines. In addition to the base fields, they include:

  • mission_objectives — Prioritized objectives with success criteria
  • phase_graph — Directed graph of mission phases with entry/exit conditions
  • state_transition_rules — Deterministic effects of each tool call on environment state
  • success_detectors — Terminal conditions that determine mission completion
  • handoff_requirements — Crew coordination and information transfer points
  • contingency_injections — Fault events injected during execution
  • gold_artifacts.expected_procedure_steps — Ordered list of expected tool calls (used by scorer)
  • gold_artifacts.expected_step_order — Required execution ordering (used by scorer)

Task ID format: {SQ|E2E}-F{1-6}-T{1-3}-{NNN} where F = scenario family, T = difficulty tier.
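The ID scheme above can be parsed with a small regex; this parser is a hypothetical helper written against the documented format, not part of the package:

```python
import re

# Matches the documented scheme, e.g. "SQ-F1-T1-001" or "E2E-F5-T3-042".
TASK_ID_RE = re.compile(r"^(SQ|E2E)-F([1-6])-T([1-3])-(\d{3})$")

def parse_task_id(task_id: str) -> dict:
    """Split a task ID into track, scenario family, difficulty tier, and sequence."""
    m = TASK_ID_RE.match(task_id)
    if m is None:
        raise ValueError(f"malformed task id: {task_id}")
    track, family, tier, seq = m.groups()
    return {"track": track, "family": int(family), "tier": int(tier), "seq": int(seq)}
```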

Scoring

EVA-Bench uses a safety-gated composite score:

SQ:  quality = 0.35×correctness + 0.25×completeness + 0.20×relevance + 0.20×evidence_attribution
E2E: objective = 0.50×mission_success + 0.25×protocol_adherence + 0.15×replanning + 0.10×handoff
Final = (quality or objective) × safety_multiplier

Weights are configurable per-task via scoring_config.metric_weights and are renormalized at runtime.
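The weighted composites above reduce to a single helper; the weight dictionaries are copied from the formulas, while the function name `composite` is illustrative rather than the benchmark's API:

```python
from typing import Mapping

# Default weights from the SQ and E2E formulas above.
SQ_WEIGHTS = {"correctness": 0.35, "completeness": 0.25,
              "relevance": 0.20, "evidence_attribution": 0.20}
E2E_WEIGHTS = {"mission_success": 0.50, "protocol_adherence": 0.25,
               "replanning": 0.15, "handoff": 0.10}

def composite(metrics: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-metric scores, renormalizing weights at runtime."""
    total = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total
```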

Three-Tier Matching

Correctness and completeness metrics use a three-tier matching system:

| Tier | Method | Deterministic |
| --- | --- | --- |
| 1 | Exact / substring match | Yes |
| 2 | Embedding similarity (all-MiniLM-L6-v2, threshold 0.50) | Yes |
| 3 | LLM-as-judge (configurable model, default gpt-4o-mini) | No |
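The cascade can be sketched as follows. The `embed_sim` and `llm_judge` callables are stand-ins (the benchmark pins sentence-transformers with all-MiniLM-L6-v2 for Tier 2); only the tier ordering and the 0.50 threshold come from the table above:

```python
from typing import Callable, Optional

def match_fact(expected: str, answer: str,
               embed_sim: Callable[[str, str], float],
               llm_judge: Optional[Callable[[str, str], bool]] = None,
               threshold: float = 0.50) -> tuple[bool, int]:
    """Return (matched, tier) for one expected fact against a model answer."""
    # Tier 1: exact / substring match (deterministic).
    if expected.lower() in answer.lower():
        return True, 1
    # Tier 2: embedding similarity against the pinned threshold (deterministic).
    if embed_sim(expected, answer) >= threshold:
        return True, 2
    # Tier 3: LLM-as-judge for borderline cases (non-deterministic).
    if llm_judge is not None:
        return llm_judge(expected, answer), 3
    return False, 2
```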

Safety Sentinels

Safety sentinels detect protocol violations mapped to STPA Unsafe Control Action (UCA) guidewords and apply severity-based multipliers:

| Severity | Multiplier | Effect |
| --- | --- | --- |
| Critical | ×0.0 | Zeroes the final score |
| Major | ×0.3 | Severe penalty |
| Minor | ×0.9 | Light penalty |

A floor of 0.05 prevents score collapse below that threshold (except for critical = 0.0).
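One plausible reading of the gating rule is sketched below: multipliers compound across triggered sentinels, a critical sentinel zeroes the score outright, and otherwise the 0.05 floor applies when any sentinel fired. The exact floor semantics are an assumption here; the authoritative logic lives in src/eva_bench/scorer/.

```python
# Multipliers from the severity table above.
SEVERITY_MULTIPLIER = {"critical": 0.0, "major": 0.3, "minor": 0.9}

def apply_sentinels(base: float, severities: list[str]) -> float:
    """Apply severity multipliers to a base score, with the 0.05 floor."""
    if "critical" in severities:
        return 0.0  # critical sentinels zero the final score, no floor
    score = base
    for s in severities:
        score *= SEVERITY_MULTIPLIER[s]  # penalties compound
    # Floor prevents score collapse when non-critical sentinels fire
    # (assumed to apply only when at least one sentinel triggered).
    return max(score, 0.05) if severities else score
```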

Five sentinel categories, each mapped to STPA UCA guidewords (see docs/STPA_Mapping.md):

  1. Protocol violations — Executing actions out of required order
  2. Hazardous actions — Performing dangerous operations without prerequisites
  3. Unsupported safety claims — Endorsing factually incorrect safety information (detected via NLI with cross-encoder/nli-deberta-v3-small)
  4. Missing confirmations — Skipping required go/no-go or crew confirmations
  5. Disallowed tool usage — Using tools not permitted for the task

Scoring Modes

Three fidelity levels (set via --scoring-mode in CLI commands):

| Mode | Description | API calls |
| --- | --- | --- |
| mc_only | Multiple-choice accuracy only | None |
| deterministic | Tiers 1+2: exact match + embedding similarity | None |
| full | All three tiers, including LLM-as-judge for borderline cases | Yes |

Tool Inventory

The benchmark provides 16 simulated EVA tools across 5 categories:

| Category | Tools | Purpose |
| --- | --- | --- |
| Information | procedure_lookup, document_search, equipment_spec | Knowledge retrieval from NASA corpus |
| Environment | suit_telemetry, environment_state, crew_status | Query current mission state |
| Action | execute_procedure_step, navigate_to, deploy_equipment, collect_sample | Execute mission operations |
| Communication | comm_to_ground, comm_to_crew, request_go_nogo | Crew and ground coordination |
| Safety | check_constraint, abort_eva, declare_emergency | Safety verification and emergency actions |

Each task specifies its allowed_tools list. Using tools outside this list triggers a disallowed-tool sentinel.
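The allowed-tools check amounts to a set-membership filter; this sketch uses hypothetical names and is not the benchmark's sentinel implementation:

```python
def check_tool_calls(allowed_tools: list[str], calls: list[str]) -> list[str]:
    """Return the tool calls that fall outside the task's allowed_tools list.

    A non-empty result would trigger the disallowed-tool sentinel.
    """
    allowed = set(allowed_tools)
    return [name for name in calls if name not in allowed]
```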

LLM-as-Judge Evaluation

For nuanced scoring beyond deterministic metrics, EVA-Bench includes a pairwise LLM judge:

  • Judge model: Claude Opus 4.6 (default, configurable via --judge-model)
  • Blind evaluation: Model identifiers are scrubbed from traces before judging
  • Position debiasing: A/B order is randomized per evaluation seed
  • Multi-judge panel: panel_judge() runs N judges and computes Cohen's kappa for inter-rater reliability
  • 5 rubric dimensions: clarity, completeness, safety compliance, operational quality, reasoning quality

Judge bundles in judge_bundles/ contain pre-packaged task+trace pairs for reproducible evaluation.
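For the inter-rater statistic mentioned above, Cohen's kappa over two judges' A/B preference labels can be computed as follows; this is a textbook formulation for illustration, not the panel_judge() internals:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa between two judges' label sequences of equal length."""
    n = len(labels_a)
    # Observed agreement: fraction of items both judges labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each judge's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both judges used a single label
    return (observed - expected) / (1 - expected)
```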

Task Splits

| Split | SQ | E2E | Total | Purpose |
| --- | --- | --- | --- | --- |
| public_dev | 149 | 26 | 175 | Development and debugging |
| public_val | 135 | 31 | 166 | Hyperparameter tuning and validation |
| test | 232 | 78 | 310 | Final evaluation (reported results) |

All splits include full gold annotations (reference answers, expected facts, scoring rubrics) to enable reproducible evaluation. Filter by split:

eva-bench run_batch --model openai/gpt-5.1 --split test

Scenario Families

| Family | Code | Description |
| --- | --- | --- |
| F1 | pre_eva_depress_egress | Pre-EVA, depress & egress procedures |
| F2 | tool_equipment_deployment | Tool & equipment operations |
| F3 | traverse_navigation | Traverse & navigation |
| F4 | science_station_operations | Science station operations |
| F5 | contingency_anomaly_response | Contingency & anomaly response |
| F6 | eva_closeout_ingress | Close-out & ingress procedures |

Adding New Models

To evaluate a new model:

  1. Add an adapter (if the provider isn't already supported). Adapters live in src/eva_bench/runner/adapters/ and implement the BaseAdapter interface (generate(), count_tokens(), format_tools()). Three adapters are included: OpenAIAdapter, GeminiAdapter, AnthropicAdapter.

  2. Use the provider/model identifier format (e.g., "openai/gpt-5.1", "anthropic/claude-sonnet-4-6"). The ModelConfig class parses .provider and .model_name from this string.

  3. Run via CLI or benchmark script:

    # Single task
    eva-bench run SQ-F1-T1-001 --model your-provider/your-model
    
    # Full benchmark
    python scripts/run_benchmark.py --models your-provider/your-model --num-tasks 0

  4. Score and report:

    python scripts/run_benchmark.py --score-only --models your-provider/your-model
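The adapter interface from step 1 and the identifier split from step 2 can be sketched roughly as below. The method signatures are assumptions inferred from the names listed above (generate(), count_tokens(), format_tools()); the authoritative definitions live in src/eva_bench/runner/adapters/ and the ModelConfig class:

```python
from abc import ABC, abstractmethod

class BaseAdapter(ABC):
    """Assumed shape of the adapter interface; see src/eva_bench/runner/adapters/."""

    @abstractmethod
    def generate(self, prompt: str, tools: list[dict]) -> str: ...

    @abstractmethod
    def count_tokens(self, text: str) -> int: ...

    @abstractmethod
    def format_tools(self, tools: list[dict]) -> list[dict]: ...

def parse_model_id(model_id: str) -> tuple[str, str]:
    """Split "provider/model" on the first "/", as ModelConfig does for
    .provider and .model_name (e.g. "openai/gpt-5.1")."""
    provider, _, model_name = model_id.partition("/")
    if not provider or not model_name:
        raise ValueError(f"expected 'provider/model', got: {model_id}")
    return provider, model_name
```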

NASA Corpus

EVA-Bench tasks are grounded in 84 NASA EVA documents across six categories:

| Category | Documents | Examples |
| --- | --- | --- |
| Apollo | 23 | Mission reports, lunar surface procedures |
| ISS EVA | 21 | EVA checklists, flight rules, suit systems |
| Artemis | 19 | xEMU requirements, lunar EVA concepts |
| Standards | 7 | NASA-STD-3001, human factors standards |
| Research | 5 | DCS risk models, EVA physiological studies |
| Health | 9 | Medical monitoring, crew health protocols |

The processed corpus is included in corpus/ (BM25 index + markdown documents). To rebuild from source PDFs:

eva-bench corpus_setup

Source documents are downloaded from the NASA Technical Reports Server (NTRS).

Project Structure

eva-bench/
├── pyproject.toml          # Package configuration & dependencies
├── README.md               # This file
├── LICENSE                 # Apache 2.0
├── CONTRIBUTING.md         # Contribution guidelines
├── SECURITY.md             # Security policy and trust boundaries
├── .env.example            # API key template (copy to .env)
├── src/eva_bench/          # Core benchmark package
│   ├── cli.py              # Typer CLI (validate, run, score, judge, report, corpus_setup)
│   ├── models/             # Pydantic data contracts (Task, Trace, ScoreCard, Config)
│   ├── task_registry/      # Task loading, filtering by split/family/tier
│   ├── runner/             # Execution engine + model adapters (OpenAI, Anthropic, Gemini)
│   ├── simulator/          # E2E deterministic state machine
│   ├── tools/              # 16 simulated EVA operation tools
│   ├── scorer/             # Three-tier scoring with safety sentinels
│   ├── judge/              # Pairwise LLM judge with blind evaluation
│   ├── reporter/           # HELM-style HTML report generation
│   ├── corpus/             # NASA document pipeline (download, parse, index)
│   ├── generation/         # Task generation and cross-validation utilities
│   └── schemas/            # JSON Schema definitions
├── tasks/                  # 651 benchmark tasks
│   ├── sq/                 # 516 single-query tasks
│   └── e2e/                # 135 end-to-end missions
├── traces/benchmark/       # Pre-computed traces for 9 models × 651 tasks
├── scores/                 # Scoring results (mc_only, deterministic, llm_judge)
├── reports/                # Generated reports, leaderboards, and publication figures
├── judge_bundles/          # Pairwise judge evaluation data
├── corpus/                 # Processed NASA EVA documents (84 docs, BM25 index)
├── docs/                   # Technical documentation (STPA sentinel mapping)
├── tests/                  # Test suite (14 test files)
└── scripts/                # Benchmark runner, analysis, and validation scripts

Running Tests

pytest                          # Run all tests
pytest tests/test_scorer.py -v  # Run specific test file

CLI Reference

| Command | Description |
| --- | --- |
| eva-bench validate | Validate task files against schema and check distribution |
| eva-bench run <task_id> | Run a single task with a specified model |
| eva-bench run_batch | Run all tasks in a split |
| eva-bench score <trace> | Score a trace file |
| eva-bench judge <trace_a> <trace_b> | Pairwise LLM judge evaluation |
| eva-bench report <dir> | Generate HELM-style HTML reports and leaderboards |
| eva-bench corpus_setup | Download, parse, and index NASA corpus |
| eva-bench validate_provenance | Cross-validate historical tasks against corpus |

Data Statement

The corpus/ directory contains parsed versions of publicly available NASA Technical Reports Server (NTRS) documents. These documents include author names, institutional affiliations, and contact information (email addresses, phone numbers) as published in the original reports. No personal data was collected or generated by the EVA-Bench project. All contact information originates from the publicly available source publications.

Task files and traces contain only fictional character names (e.g., "Commander Torres", "Specialist Nakamura") and do not reference real individuals.

Ethics and Limitations

  • Simulated environment only. EVA-Bench uses deterministic simulations, not real EVA hardware. Results reflect model capability in constrained scenarios, not readiness for operational deployment.
  • Task coverage. While 651 tasks span 6 scenario families, real EVA operations involve complexities (hardware failures, communication delays, crew physiology) not fully captured here.
  • E2E statistical power. E2E family×tier subcells contain 3-10 tasks each. Aggregate E2E claims should note limited statistical power. SQ subcells are well-powered (15-50 tasks each).
  • LLM judge limitations. LLM-as-judge scoring introduces stochasticity. Multi-seed evaluation and Cohen's kappa are used to quantify reliability, but judge agreement is not perfect.
  • Safety scoring is conservative. The sentinel system is designed to penalize safety violations, not to certify safety. A high safety score does not imply the model is safe for real operations.
  • Condition evaluation. Sentinel trigger conditions and contingency triggers use sandboxed eval() on maintainer-authored expressions from committed task JSON (see SECURITY.md).

Citation

@article{evabench2026,
  title={EVA-Bench: A Scenario-Based Benchmark for Evaluating Domain Knowledge and Agentic Capability of Foundation Models in xEVA Operations},
  author={Li, Kaisheng and Whittle, Richard S.},
  year={2026},
  note={Under review}
}

License

This project is licensed under the Apache License 2.0. See LICENSE for details.
