Add Stirrup agent + GDPVal eval/RL environment (#1090)
Merged
Conversation
cmunley1
reviewed
Apr 17, 2026
@@ -0,0 +1,269 @@
#!/usr/bin/env python3
Contributor
should these scripts go into responses_api_agents/stirrup_agent/scripts? same for container
Contributor
Author
You're right, that would be better. just pushed
dznvidia
reviewed
Apr 22, 2026
bxyu-nvidia
previously approved these changes
Apr 24, 2026
Introduces the Stirrup agent wrapper, a pluggable NeMo Gym responses API
agent built on the Stirrup agent-loop framework. Task-specific logic
(prompt construction, file handling, rubric scoring) lives in a
TaskStrategy subclass, so new benchmarks can be added with one file.
Ships with the GDPVal task strategy out of the box, which evaluates
models on the OpenAI GDPVal benchmark of professional knowledge-work
tasks (220 tasks across 9 sectors). Scoring is done by an LLM judge
against a per-task rubric.
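The `TaskStrategy` seam described above can be sketched as follows. This is a minimal illustration of the plug-in idea; the class and method names are assumptions for this sketch, not the PR's actual API.

```python
from abc import ABC, abstractmethod

# Hypothetical sketch: names/signatures are assumptions, not the PR's real code.
class TaskStrategy(ABC):
    """Task-specific logic: prompt construction, file handling, scoring inputs."""

    @abstractmethod
    def build_prompt(self, record: dict) -> str: ...

    @abstractmethod
    def extract_task_info(self, record: dict) -> dict: ...


class GDPValTask(TaskStrategy):
    def build_prompt(self, record: dict) -> str:
        refs = ", ".join(record.get("reference_files", []))
        base = record["prompt"]
        return f"{base}\n\nReference files: {refs}" if refs else base

    def extract_task_info(self, record: dict) -> dict:
        # Keep only the identifying metadata the judge/scorer needs.
        return {k: record[k] for k in ("task_id", "sector", "occupation") if k in record}
```

Under this scheme, adding a new benchmark means writing one such subclass and registering it; the wrapper itself stays unchanged.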
Includes:
- Core framework: StirrupAgentWrapper, TaskStrategy base class,
task registry, Stirrup history conversion utilities, and a
NeMoAgent subclass with tool-response-as-user support.
- GDPVal task: reference-file download, rubric and comparison
scoring modes, deliverable file reader.
- Apptainer provider for optional sandboxed code execution.
- Tavily web-search tool provider (optional, via TAVILY_API_KEY).
- Synthetic single-task example.jsonl for CI smoke testing (no
network dependency).
- Hydra config, Jinja2 prompt templates, README, client example,
unit tests for registry and config instantiation.
- New dependency: stirrup>=0.1.7 (Apache 2.0).
Signed-off-by: Serge Panev <spanev@nvidia.com>
Adds three pieces that support full GDPVal evaluation runs:
1. containers/gdpval.def — Apptainer build definition for a Python 3.12
environment with document-generation dependencies (LibreOffice,
python-docx, openpyxl, reportlab, weasyprint, Pillow, etc.) and
scientific libraries (numpy, pandas, scipy, scikit-learn). When an
operator sets stirrup_agent.gdpval_container_path, all agent
code_exec calls route through this container for isolation.
2. scripts/prepare_gdpval_dataset.py — HuggingFace -> JSONL converter
for the openai/gdpval dataset. Produces the record shape expected
by GDPValTask.extract_task_info (task_id, prompt, sector,
occupation, reference_files, rubric_json, rubric_pretty), with all
metadata values JSON-encoded to satisfy OpenAI's Metadata type
constraint.
3. responses_api_agents/stirrup_agent/setup_scripts/gdpval.sh — the
conventional setup-script wrapper around the prep script, matching
the pattern used by existing agents (swe_agents/setup_scripts/*.sh).
Users can now go from a fresh checkout to a full 220-task dataset with
one command:
bash responses_api_agents/stirrup_agent/setup_scripts/gdpval.sh
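The converter's per-row mapping can be sketched as below. Field names follow the commit message; the helper names and the ~per-row shape are assumptions (the real script reads the openai/gdpval dataset from HuggingFace).

```python
import json

# Fields the commit message lists as the record shape; hypothetical helper names.
META_KEYS = ("sector", "occupation", "reference_files", "rubric_json", "rubric_pretty")

def to_record(row: dict) -> dict:
    record = {"task_id": row["task_id"], "prompt": row["prompt"]}
    for key in META_KEYS:
        if key in row:
            # OpenAI's Metadata type only accepts string values, so JSON-encode each one.
            record[key] = json.dumps(row[key])
    return record

def write_jsonl(rows, path):
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(to_record(row)) + "\n")
```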
Signed-off-by: Serge Panev <spanev@nvidia.com>
Ships five scripts that together form the GDPVal evaluation pipeline
beyond rollout collection:
- scripts/compare_elo.py — pairwise side-by-side comparison of two
models' deliverables by a judge LLM, producing an ELO rating with
configurable priors, position-swap trials, and per-task breakdown.
- scripts/run_rubric_judge.py — rubric-based scoring for collected
rollouts that didn't inline-score during the run, or re-running
with a different judge.
- scripts/calculate_rubric_elo.py — aggregate rubric scores into an
ELO-style ranking across multiple evaluated models.
- scripts/rescore_gdpval.py — serial re-scoring of an existing results
JSONL with a different judge, to compare judges without re-rolling.
- scripts/preconvert_to_pdf.py — pre-converts .docx/.pptx/.xlsx
deliverables to PDF (required before visual pairwise judging).
Uses LibreOffice headless.
All scripts are OpenAI-API-compatible by default: --server-address
defaults to https://api.openai.com/v1 and --judge-model-name to
gpt-4.1-2025-04-14. Users can point at any OpenAI-compatible endpoint
(vLLM, Azure, third-party).
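The pairwise-ELO machinery in compare_elo.py can be sketched as the standard ELO update plus position-swapped judging. Function names and the K-factor are assumptions for illustration, not the script's actual code:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One ELO update; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (A loses)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def run_pairing(judge, deliverable_a, deliverable_b, trials: int = 2) -> float:
    """Average the judge's verdict over position-swapped trials to cancel order bias."""
    total = 0.0
    for i in range(trials):
        swapped = i % 2 == 1
        first, second = (deliverable_b, deliverable_a) if swapped else (deliverable_a, deliverable_b)
        verdict = judge(first, second)  # 1.0 if `first` wins, 0.5 tie, 0.0 loss
        total += 1.0 - verdict if swapped else verdict
    return total / trials
```

Note how a judge that always prefers whichever deliverable appears first nets out to an even 0.5 under position swapping, which is exactly the bias the swap trials are meant to remove.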
Signed-off-by: Serge Panev <spanev@nvidia.com>
Stirrup's ChatCompletionsClient sends a static max_completion_tokens on every request; on long-context models the server rejects the call once the prompt grows. This commit wraps the client to size max_completion_tokens per call (tokenise messages + tools, subtract from context window, leave a buffer), wires the existing model_id / completion_token_buffer config fields through, and pins a few runtime deps the subprocess venv needs but doesn't receive via transitive resolution.
Signed-off-by: Serge Panev <spanev@nvidia.com>
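The per-call sizing logic described above can be sketched as follows. The function name, argument names, and the fallback character-ratio heuristic are assumptions; the real wrapper would use the model's own tokenizer.

```python
import json

def size_max_completion_tokens(messages, tools, context_window: int,
                               completion_token_buffer: int = 512,
                               count_tokens=None) -> int:
    """Per-call completion budget: context window minus prompt size minus a safety buffer."""
    if count_tokens is None:
        # Crude stand-in (~4 chars/token); a real client would use the model tokenizer.
        count_tokens = lambda obj: len(json.dumps(obj)) // 4
    prompt_tokens = count_tokens(messages) + count_tokens(tools)
    return max(1, context_window - prompt_tokens - completion_token_buffer)
```

As the conversation grows, the budget shrinks instead of the server rejecting the request outright.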
…t_rollouts Signed-off-by: Serge Panev <spanev@nvidia.com>
…ailable Signed-off-by: Serge Panev <spanev@nvidia.com>
…erables_dir Signed-off-by: Serge Panev <spanev@nvidia.com>
bxyu-nvidia
previously approved these changes
Apr 26, 2026
…ata validation Signed-off-by: Serge Panev <spanev@nvidia.com>
…_test data validation Signed-off-by: Serge Panev <spanev@nvidia.com>
bxyu-nvidia
approved these changes
Apr 27, 2026
bxyu-nvidia
pushed a commit
that referenced
this pull request
Apr 27, 2026
… reward=0) (#1140)

## Problem

`responses_api_agents/stirrup_agent/app.py:117` decorates the Ray remote function with bare `@ray.remote(scheduling_strategy="SPREAD")`, so Ray dispatches the task to workers using the cluster's default Python rather than the stirrup_agent server's `.venv`. The `stirrup>=0.1.7` extra is declared as an *extra of the stirrup_agent server*, not of core (intentionally, per PR #1090), so it's only present in `responses_api_agents/stirrup_agent/.venv` — not on Ray workers. Every rollout therefore hits, at `_run_stirrup_agent` line 150:

```
from stirrup.tools import DEFAULT_TOOLS
```

with `ModuleNotFoundError: No module named 'stirrup'`. The agent catches the exception per-rollout, sets `reward=0.0`, and proceeds. The resources server then short-circuits without calling the judge.

## Fix

Add `runtime_env={"py_executable": sys.executable}` to the decorator, matching the well-established pattern in the rest of the codebase.

## Test plan

- [x] `ruff check` / `ruff format --check` clean
- [x] AST parse OK
- [ ] End-to-end re-run on the failing 220-task GDPVal config from the EFB side; will pin `install_on_the_fly.commit` to this branch's HEAD and post results in the linked GDPVal thread
- [ ] (Author of #1090 to confirm) — no other places where the bare `@ray.remote` pattern was intentional for stirrup specifically

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Adds a Stirrup-based agent + a GDPVal benchmark built on the NeMo-Gym benchmark convention (`ng_prepare_benchmark` + `ng_e2e_collect_rollouts`), validated on the full 220-task GDPVal set in both rubric and comparison scoring modes.
Architecture

Split into three pieces, matching NeMo-Gym's server-type convention:

Benchmark —
- `benchmarks/gdpval/prepare.py` downloads `openai/gdpval` from HF → `data/gdpval_benchmark.jsonl`
- `config.yaml` wires `gdpval_judge_model` + `gdpval_resources_server` + `gdpval_stirrup_agent`
- `ng_e2e_collect_rollouts +config_paths=[benchmarks/gdpval/config.yaml]`

Resources server — `resources_servers/gdpval/`
- `verify()` and `aggregate_metrics()` with two modes via `reward_mode`:
  - `rubric` (default) — LLM-judge per-criterion score, reward in `[0, 1]`
  - `comparison` — pairwise vs `reference_deliverables_dir`, reward in `{0, 0.5, 1}`; `aggregate_metrics` reduces W/L/T → ELO anchored at `reference_elo` (default 1000)
- Multimodal judge path used whenever content blocks are available.

Agent — `responses_api_agents/stirrup_agent/`
- `StirrupAgentWrapper` is task-agnostic; task-specific logic lives in a `TaskStrategy` subclass (`GDPValTask`)
- `/run` executes the agent, persists deliverables, POSTs to the resources server's `/verify`, and returns the response. The agent is scoring-free.
- `aggregate_metrics` proxies to the resources server so ELO extras flow through; `/verify` errors are caught per-rollout so a single failure can't crash a run
- Tools: `code_exec`, Tavily web-search

Dependency: `stirrup>=0.1.7` (Apache 2.0), declared as an extra of the stirrup_agent server, not of core.
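The comparison-mode reduction of a W/L/T record into an ELO rating anchored at the reference can be sketched by inverting the logistic expected-score formula. The function name and the clamping constant are assumptions for illustration:

```python
import math

def wlt_to_elo(wins: int, losses: int, ties: int, reference_elo: float = 1000.0) -> float:
    """Anchor a W/L/T record vs a fixed reference into a single ELO rating."""
    n = wins + losses + ties
    if n == 0:
        return reference_elo
    score = (wins + 0.5 * ties) / n            # empirical expected score vs the reference
    score = min(max(score, 1e-6), 1.0 - 1e-6)  # clamp so the logit stays finite
    # Invert E = 1 / (1 + 10**((R_ref - R) / 400)) for R:
    return reference_elo + 400.0 * math.log10(score / (1.0 - score))
```

An even record lands exactly on the reference ELO; more wins than losses pushes the rating above it.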
Validation (Ultra V3 SFT iter16k, full 220-task GDPVal, num_repeats=2)

Rubric mode (n=440): (always-visual judge when content blocks are available)

Comparison mode vs fork baseline (4 trials per pairing, n=440):
Running
Test plan

- `pytest resources_servers/gdpval/tests -x` — rubric + comparison unit tests