
Add Stirrup agent + GDPVal eval/RL environment (#1090)

Merged
bxyu-nvidia merged 19 commits into NVIDIA-NeMo:main from Kh4L:stirrup-gdpval-port
Apr 27, 2026
Conversation

@Kh4L (Contributor) commented Apr 17, 2026

Summary

Adds a Stirrup-based agent + a GDPVal benchmark built on the NeMo-Gym
benchmark convention (ng_prepare_benchmark + ng_e2e_collect_rollouts),
validated on the full 220-task GDPVal set in both rubric and comparison
scoring modes.

Architecture

Split into three pieces, matching NeMo-Gym's server-type convention:

Benchmark: benchmarks/gdpval/

  • prepare.py downloads openai/gdpval from HF → data/gdpval_benchmark.jsonl
  • config.yaml wires gdpval_judge_model + gdpval_resources_server +
    gdpval_stirrup_agent
  • Entry point: ng_e2e_collect_rollouts +config_paths=[benchmarks/gdpval/config.yaml]

Resources server: resources_servers/gdpval/

  • Owns verify() and aggregate_metrics() with two modes via reward_mode:
    • rubric (default) — LLM-judge per-criterion score, reward in [0, 1]
    • comparison — pairwise vs reference_deliverables_dir, reward in
      {0, 0.5, 1}; aggregate_metrics reduces W/L/T → ELO anchored at
      reference_elo (default 1000)
  • All scoring, pairwise comparison, and Office→PDF preconversion live here.
    The multimodal judge path is used whenever content blocks are available.
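The W/L/T → ELO reduction can be sketched as follows. The logistic Elo mapping below is an assumption about how aggregate_metrics reduces the counts, but it reproduces the eval_elo reported in the comparison-mode results of this PR (950.6 from W/L/T = 147/208/77 anchored at reference_elo=1000):

```python
import math

def elo_from_wlt(wins: int, losses: int, ties: int,
                 reference_elo: float = 1000.0) -> float:
    """Reduce pairwise W/L/T counts to an Elo rating anchored at the reference.

    Ties count as half a win; the resulting score rate is mapped through the
    standard logistic Elo curve (400-point scale).
    """
    games = wins + losses + ties
    p = (wins + 0.5 * ties) / games
    return reference_elo + 400.0 * math.log10(p / (1.0 - p))

print(round(elo_from_wlt(147, 208, 77), 1))  # -> 950.6
```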

Agent: responses_api_agents/stirrup_agent/

  • StirrupAgentWrapper is task-agnostic; task-specific logic lives in a
    TaskStrategy subclass (GDPValTask)
  • /run executes the agent, persists deliverables, POSTs to the resources
    server's /verify, and returns the response. The agent itself is scoring-free.
  • aggregate_metrics proxies to the resources server so ELO extras flow through;
    /verify errors are caught per rollout so a single failure can't crash a run
  • Optional: Apptainer-backed code_exec, Tavily web-search
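A minimal sketch of that /run flow, with the per-rollout error containment described above; `run_rollout` and `post_verify` are illustrative names, not the actual server API:

```python
def run_rollout(task: dict, agent, post_verify) -> dict:
    """Scoring-free agent step: produce deliverables, delegate scoring to the
    resources server's /verify, and contain any verify failure so one bad
    rollout cannot take down the whole run. (Illustrative names throughout.)"""
    deliverables = agent.execute(task)            # agent only produces files
    try:
        result = post_verify({"task_id": task["task_id"],
                              "deliverables": deliverables})
        reward = result["reward"]
    except Exception as exc:                      # verify failure is contained
        result, reward = {"error": str(exc)}, 0.0
    return {"deliverables": deliverables, "reward": reward, "verify": result}
```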

Dependency: stirrup>=0.1.7 (Apache 2.0) declared as an extra of the
stirrup_agent server, not of core.

Validation (Ultra V3 SFT iter16k, full 220-task GDPVal, num_repeats=2)

Rubric mode (n=440):

  • mean/reward = 0.755, pass@1 = 0.755, pass@2 = 0.821
  • 56% of rollouts score ≥ 0.8
  • Pre-refactor port-v4 baseline: 0.24 → 3.1× lift (dominant contributor:
    always-visual judge when content blocks are available)

Comparison mode vs fork baseline (4 trials per pairing, n=440):

  • W/L/T = 147 / 208 / 77
  • win_rate = 0.429
  • eval_elo = 950.6 (vs fork=1000; port-v3 historical=917 → +34 ELO)

Running

ng_prepare_benchmark '+config_paths=[benchmarks/gdpval/config.yaml]'
ng_e2e_collect_rollouts \
  '+config_paths=[benchmarks/gdpval/config.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]' \
  '++split=benchmark' \
  '++output_jsonl_fpath=results/gdpval.jsonl' \
  "++gdpval_stirrup_agent.responses_api_agents.stirrup_agent.persist_deliverables_dir=$PWD/output/gdpval" \
  # ... policy_* overrides as usual

# Add for comparison mode:
  '++gdpval_resources_server.resources_servers.gdpval.reward_mode=comparison' \
  "++gdpval_resources_server.resources_servers.gdpval.reference_deliverables_dir=/path/to/reference"

Test plan

  • pytest resources_servers/gdpval/tests -x — rubric + comparison unit tests
  • 10-task rubric smoke (mean/reward 0.719)
  • Full 220-task rubric (mean/reward 0.755)
  • 10-task comparison smoke (eval_elo ~1017 on small sample)
  • Full 220-task comparison (eval_elo 950.6)

@copy-pr-bot (Bot) commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Kh4L Kh4L force-pushed the stirrup-gdpval-port branch from f582f06 to dbbd7d3 on April 17, 2026 00:41
@@ -0,0 +1,269 @@
#!/usr/bin/env python3
A reviewer (Contributor) commented:
should these scripts go into responses_api_agents/stirrup_agent/scripts? same for container

Kh4L (Author) replied:

You're right, that would be better. Just pushed.

Comment thread on resources_servers/gdpval/configs/gdpval.yaml (outdated)
@Kh4L Kh4L requested review from cmunley1 and dznvidia April 22, 2026 18:14
@Kh4L Kh4L marked this pull request as ready for review April 23, 2026 06:41
bxyu-nvidia previously approved these changes Apr 24, 2026
Kh4L added 17 commits April 24, 2026 16:27
Introduces the Stirrup agent wrapper, a pluggable NeMo Gym responses API
agent built on the Stirrup agent-loop framework. Task-specific logic
(prompt construction, file handling, rubric scoring) lives in a
TaskStrategy subclass, so new benchmarks can be added with one file.
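As a rough sketch of what a one-file task addition might look like: the base-class interface below is a stub (only extract_task_info is named in this PR; the other method names are assumptions for illustration):

```python
from abc import ABC, abstractmethod

class TaskStrategy(ABC):
    """Stubbed-down base class; the real stirrup_agent interface may differ."""

    @abstractmethod
    def extract_task_info(self, record: dict) -> dict: ...

    @abstractmethod
    def build_prompt(self, info: dict) -> str: ...

class GDPValTask(TaskStrategy):
    """Illustrative strategy: pull the fields scoring needs, build a prompt."""

    def extract_task_info(self, record: dict) -> dict:
        return {"task_id": record["task_id"],
                "rubric": record.get("rubric_json")}

    def build_prompt(self, info: dict) -> str:
        return f"Complete GDPVal task {info['task_id']} and save all deliverables."
```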

Ships with the GDPVal task strategy out of the box, which evaluates
models on the OpenAI GDPVal benchmark of professional knowledge-work
tasks (220 tasks across 9 sectors). Scoring is done by an LLM judge
against a per-task rubric.

Includes:
  - Core framework: StirrupAgentWrapper, TaskStrategy base class,
    task registry, Stirrup history conversion utilities, and a
    NeMoAgent subclass with tool-response-as-user support.
  - GDPVal task: reference-file download, rubric and comparison
    scoring modes, deliverable file reader.
  - Apptainer provider for optional sandboxed code execution.
  - Tavily web-search tool provider (optional, via TAVILY_API_KEY).
  - Synthetic single-task example.jsonl for CI smoke testing (no
    network dependency).
  - Hydra config, Jinja2 prompt templates, README, client example,
    unit tests for registry and config instantiation.
  - New dependency: stirrup>=0.1.7 (Apache 2.0).
Signed-off-by: Serge Panev <spanev@nvidia.com>
Adds three pieces that support full GDPVal evaluation runs:

  1. containers/gdpval.def — Apptainer build definition for a Python 3.12
     environment with document-generation dependencies (LibreOffice,
     python-docx, openpyxl, reportlab, weasyprint, Pillow, etc.) and
     scientific libraries (numpy, pandas, scipy, scikit-learn). When an
     operator sets stirrup_agent.gdpval_container_path, all agent
     code_exec calls route through this container for isolation.

  2. scripts/prepare_gdpval_dataset.py — HuggingFace -> JSONL converter
     for the openai/gdpval dataset. Produces the record shape expected
     by GDPValTask.extract_task_info (task_id, prompt, sector,
     occupation, reference_files, rubric_json, rubric_pretty), with all
     metadata values JSON-encoded to satisfy OpenAI's Metadata type
     constraint.
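A sketch of that conversion, assuming representative field names taken from the record shape above (the actual column names in openai/gdpval may differ); the key point is that non-string values are JSON-encoded so the metadata satisfies the string-to-string constraint:

```python
import json

def to_jsonl_record(row: dict) -> dict:
    """Flatten one HF dataset row into the JSONL record shape described above.
    Field names are illustrative, based on the commit message."""
    metadata = {
        "task_id": row["task_id"],
        "sector": row["sector"],
        "occupation": row["occupation"],
        # Lists/dicts must be JSON-encoded: OpenAI's Metadata type only
        # allows string keys mapped to string values.
        "reference_files": json.dumps(row.get("reference_files", [])),
        "rubric_json": json.dumps(row.get("rubric", {})),
    }
    return {"prompt": row["prompt"], "metadata": metadata}
```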

  3. responses_api_agents/stirrup_agent/setup_scripts/gdpval.sh — the
     conventional setup-script wrapper around the prep script, matching
     the pattern used by existing agents (swe_agents/setup_scripts/*.sh).

Users can now go from a fresh checkout to a full 220-task dataset with
one command:

    bash responses_api_agents/stirrup_agent/setup_scripts/gdpval.sh

Signed-off-by: Serge Panev <spanev@nvidia.com>
Ships five scripts that together form the GDPVal evaluation pipeline
beyond rollout collection:

  - scripts/compare_elo.py — pairwise side-by-side comparison of two
    models' deliverables by a judge LLM, producing an ELO rating with
    configurable priors, position-swap trials, and per-task breakdown.

  - scripts/run_rubric_judge.py — rubric-based scoring for collected
    rollouts that didn't inline-score during the run, or re-running
    with a different judge.

  - scripts/calculate_rubric_elo.py — aggregate rubric scores into an
    ELO-style ranking across multiple evaluated models.

  - scripts/rescore_gdpval.py — serial re-scoring of an existing results
    JSONL with a different judge, to compare judges without re-rolling.

  - scripts/preconvert_to_pdf.py — pre-converts .docx/.pptx/.xlsx
    deliverables to PDF (required before visual pairwise judging).
    Uses LibreOffice headless.

All scripts are OpenAI-API-compatible by default: --server-address
defaults to https://api.openai.com/v1 and --judge-model-name to
gpt-4.1-2025-04-14. Users can point at any OpenAI-compatible endpoint
(vLLM, Azure, third-party).
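The position-swap trials mentioned for compare_elo.py can be sketched like this (`judge` and its verdict strings are hypothetical; the script's actual interface is not shown in this PR):

```python
def position_swapped_score(judge, ours, reference) -> float:
    """Judge a deliverable pair twice with A/B positions swapped so that any
    positional bias in the judge cancels out. Returns the score for `ours`
    in [0, 1]: 1.0 = won both orderings, 0.5 = split or ties, 0.0 = lost both."""
    score = 0.0
    for first, second, ours_is_first in [(ours, reference, True),
                                         (reference, ours, False)]:
        verdict = judge(first, second)  # hypothetical: "first"/"second"/"tie"
        if verdict == "tie":
            score += 0.5
        elif (verdict == "first") == ours_is_first:
            score += 1.0
    return score / 2.0
```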

Signed-off-by: Serge Panev <spanev@nvidia.com>
Stirrup's ChatCompletionsClient sends a static max_completion_tokens
on every request; on long-context models the server rejects the call
once the prompt grows large enough that prompt plus the static
completion budget exceed the context window. This commit wraps the
client to size max_completion_tokens per call (tokenise messages + tools, subtract
from context window, leave a buffer), wires the existing model_id /
completion_token_buffer config fields through, and pins a few runtime
deps the subprocess venv needs but doesn't receive via transitive
resolution.
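The sizing logic reduces to something like the following (the buffer default and numbers are illustrative; the real wrapper tokenises messages + tools with the model's tokenizer rather than taking a precomputed count):

```python
def size_max_completion_tokens(prompt_tokens: int, context_window: int,
                               completion_token_buffer: int = 512) -> int:
    """Per-call completion budget: whatever the prompt leaves free in the
    context window, minus a safety buffer, never below 1."""
    return max(context_window - prompt_tokens - completion_token_buffer, 1)

# A 100k-token prompt on a 131072-token context model with a 512-token buffer:
print(size_max_completion_tokens(100_000, 131_072))  # -> 30560
```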

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
…t_rollouts

Signed-off-by: Serge Panev <spanev@nvidia.com>
…ailable

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
…erables_dir

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
@Kh4L Kh4L force-pushed the stirrup-gdpval-port branch from 0936f16 to 0087d3e on April 24, 2026 23:32
bxyu-nvidia previously approved these changes Apr 26, 2026
…ata validation

Signed-off-by: Serge Panev <spanev@nvidia.com>
…_test data validation

Signed-off-by: Serge Panev <spanev@nvidia.com>
@bxyu-nvidia bxyu-nvidia merged commit b3c550a into NVIDIA-NeMo:main Apr 27, 2026
6 checks passed
bxyu-nvidia pushed a commit that referenced this pull request Apr 27, 2026
… reward=0) (#1140)

## Problem

`responses_api_agents/stirrup_agent/app.py:117` decorates the Ray remote
function with a bare `@ray.remote(scheduling_strategy="SPREAD")`, so Ray
dispatches the task to workers using the cluster's default Python rather
than the stirrup_agent server's `.venv`. The `stirrup>=0.1.7` extra is
declared as an *extra of the stirrup_agent server*, not of core
(intentionally, per PR #1090), so it's only present in
`responses_api_agents/stirrup_agent/.venv` — not on Ray workers.

Every rollout therefore hits, at `_run_stirrup_agent` line 150:

```
from stirrup.tools import DEFAULT_TOOLS
```

with `ModuleNotFoundError: No module named 'stirrup'`. The agent catches
the exception per-rollout, sets `reward=0.0`, and proceeds. The
resources server then short-circuits without calling the judge.

## Fix

Add `runtime_env={"py_executable": sys.executable}` to the decorator,
matching the well-established pattern in the rest of the codebase.



## Test plan

- [x] `ruff check` / `ruff format --check` clean
- [x] AST parse OK
- [ ] End-to-end re-run on the failing 220-task GDPVal config from the
EFB side; will pin `install_on_the_fly.commit` to this branch's HEAD and
post results in the linked GDPVal thread
- [ ] (Author of #1090 to confirm) — no other places where the bare
`@ray.remote` pattern was intentional for stirrup specifically

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
