
Add Stirrup agent + GDPVal eval/RL environment (#1090)

Merged
bxyu-nvidia merged 19 commits into NVIDIA-NeMo:main from Kh4L:stirrup-gdpval-port
Apr 27, 2026
Conversation

@Kh4L (Contributor) commented Apr 17, 2026

Summary

Adds a Stirrup-based agent + a GDPVal benchmark built on the NeMo-Gym
benchmark convention (ng_prepare_benchmark + ng_e2e_collect_rollouts),
validated on the full 220-task GDPVal set in both rubric and comparison
scoring modes.

Architecture

Split into three pieces, matching NeMo-Gym's server-type convention:

Benchmark: benchmarks/gdpval/

  • prepare.py downloads openai/gdpval from HF → data/gdpval_benchmark.jsonl
  • config.yaml wires gdpval_judge_model + gdpval_resources_server +
    gdpval_stirrup_agent
  • Entry point: ng_e2e_collect_rollouts +config_paths=[benchmarks/gdpval/config.yaml]

Resources server: resources_servers/gdpval/

  • Owns verify() and aggregate_metrics() with two modes via reward_mode:
    • rubric (default) — LLM-judge per-criterion score, reward in [0, 1]
    • comparison — pairwise vs reference_deliverables_dir, reward in
      {0, 0.5, 1}; aggregate_metrics reduces W/L/T → ELO anchored at
      reference_elo (default 1000)
  • All scoring, pairwise comparison, and Office→PDF preconversion live here.
    The multimodal judge path is used whenever content blocks are available.
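The W/L/T → ELO reduction can be sketched as follows. The logistic Elo mapping below is an assumption about how aggregate_metrics reduces the counts, but it reproduces the eval_elo reported in the comparison-mode results of this PR (950.6 from W/L/T = 147/208/77 anchored at reference_elo=1000):

```python
import math

def elo_from_wlt(wins: int, losses: int, ties: int,
                 reference_elo: float = 1000.0) -> float:
    """Reduce pairwise W/L/T counts to an Elo rating anchored at the reference.

    Ties count as half a win; the resulting score rate is mapped through the
    standard logistic Elo curve (400-point scale).
    """
    games = wins + losses + ties
    p = (wins + 0.5 * ties) / games
    return reference_elo + 400.0 * math.log10(p / (1.0 - p))

print(round(elo_from_wlt(147, 208, 77), 1))  # -> 950.6
```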

Agent: responses_api_agents/stirrup_agent/

  • StirrupAgentWrapper is task-agnostic; task-specific logic lives in a
    TaskStrategy subclass (GDPValTask)
  • /run executes the agent, persists deliverables, POSTs to the resources
    server's /verify, and returns the response. The agent itself is scoring-free.
  • aggregate_metrics proxies to the resources server so ELO extras flow through;
    /verify errors are caught per rollout so a single failure can't crash a run
  • Optional: Apptainer-backed code_exec, Tavily web-search
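A minimal sketch of that /run flow, with the per-rollout error containment described above; `run_rollout` and `post_verify` are illustrative names, not the actual server API:

```python
def run_rollout(task: dict, agent, post_verify) -> dict:
    """Scoring-free agent step: produce deliverables, delegate scoring to the
    resources server's /verify, and contain any verify failure so one bad
    rollout cannot take down the whole run. (Illustrative names throughout.)"""
    deliverables = agent.execute(task)            # agent only produces files
    try:
        result = post_verify({"task_id": task["task_id"],
                              "deliverables": deliverables})
        reward = result["reward"]
    except Exception as exc:                      # verify failure is contained
        result, reward = {"error": str(exc)}, 0.0
    return {"deliverables": deliverables, "reward": reward, "verify": result}
```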

Dependency: stirrup>=0.1.7 (Apache 2.0) declared as an extra of the
stirrup_agent server, not of core.

Validation (Ultra V3 SFT iter16k, full 220-task GDPVal, num_repeats=2)

Rubric mode (n=440):

  • mean/reward = 0.755, pass@1 = 0.755, pass@2 = 0.821
  • 56% of rollouts score ≥ 0.8
  • Pre-refactor port-v4 baseline: 0.24 → 3.1× lift (dominant contributor:
    always-visual judge when content blocks are available)

Comparison mode vs fork baseline (4 trials per pairing, n=440):

  • W/L/T = 147 / 208 / 77
  • win_rate = 0.429
  • eval_elo = 950.6 (vs fork=1000; port-v3 historical=917 → +34 ELO)

Running

ng_prepare_benchmark '+config_paths=[benchmarks/gdpval/config.yaml]'
ng_e2e_collect_rollouts \
  '+config_paths=[benchmarks/gdpval/config.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]' \
  '++split=benchmark' \
  '++output_jsonl_fpath=results/gdpval.jsonl' \
  "++gdpval_stirrup_agent.responses_api_agents.stirrup_agent.persist_deliverables_dir=$PWD/output/gdpval" \
  # ... policy_* overrides as usual

# Add for comparison mode:
  '++gdpval_resources_server.resources_servers.gdpval.reward_mode=comparison' \
  "++gdpval_resources_server.resources_servers.gdpval.reference_deliverables_dir=/path/to/reference"

Test plan

  • pytest resources_servers/gdpval/tests -x — rubric + comparison unit tests
  • 10-task rubric smoke (mean/reward 0.719)
  • Full 220-task rubric (mean/reward 0.755)
  • 10-task comparison smoke (eval_elo ~1017 on small sample)
  • Full 220-task comparison (eval_elo 950.6)

@copy-pr-bot (Bot) commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Kh4L Kh4L force-pushed the stirrup-gdpval-port branch from f582f06 to dbbd7d3 on April 17, 2026 00:41
@@ -0,0 +1,269 @@
#!/usr/bin/env python3
A reviewer (Contributor) commented:
should these scripts go into responses_api_agents/stirrup_agent/scripts? same for container

Kh4L (Author) replied:

You're right, that would be better. Just pushed.

Comment thread on resources_servers/gdpval/configs/gdpval.yaml (outdated)
@Kh4L Kh4L requested review from cmunley1 and dznvidia April 22, 2026 18:14
@Kh4L Kh4L marked this pull request as ready for review April 23, 2026 06:41
bxyu-nvidia previously approved these changes Apr 24, 2026
Kh4L added 17 commits April 24, 2026 16:27
Introduces the Stirrup agent wrapper, a pluggable NeMo Gym responses API
agent built on the Stirrup agent-loop framework. Task-specific logic
(prompt construction, file handling, rubric scoring) lives in a
TaskStrategy subclass, so new benchmarks can be added with one file.
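As a rough sketch of what a one-file task addition might look like: the base-class interface below is a stub (only extract_task_info is named in this PR; the other method names are assumptions for illustration):

```python
from abc import ABC, abstractmethod

class TaskStrategy(ABC):
    """Stubbed-down base class; the real stirrup_agent interface may differ."""

    @abstractmethod
    def extract_task_info(self, record: dict) -> dict: ...

    @abstractmethod
    def build_prompt(self, info: dict) -> str: ...

class GDPValTask(TaskStrategy):
    """Illustrative strategy: pull the fields scoring needs, build a prompt."""

    def extract_task_info(self, record: dict) -> dict:
        return {"task_id": record["task_id"],
                "rubric": record.get("rubric_json")}

    def build_prompt(self, info: dict) -> str:
        return f"Complete GDPVal task {info['task_id']} and save all deliverables."
```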

Ships with the GDPVal task strategy out of the box, which evaluates
models on the OpenAI GDPVal benchmark of professional knowledge-work
tasks (220 tasks across 9 sectors). Scoring is done by an LLM judge
against a per-task rubric.

Includes:
  - Core framework: StirrupAgentWrapper, TaskStrategy base class,
    task registry, Stirrup history conversion utilities, and a
    NeMoAgent subclass with tool-response-as-user support.
  - GDPVal task: reference-file download, rubric and comparison
    scoring modes, deliverable file reader.
  - Apptainer provider for optional sandboxed code execution.
  - Tavily web-search tool provider (optional, via TAVILY_API_KEY).
  - Synthetic single-task example.jsonl for CI smoke testing (no
    network dependency).
  - Hydra config, Jinja2 prompt templates, README, client example,
    unit tests for registry and config instantiation.
  - New dependency: stirrup>=0.1.7 (Apache 2.0).
Signed-off-by: Serge Panev <spanev@nvidia.com>
Adds three pieces that support full GDPVal evaluation runs:

  1. containers/gdpval.def — Apptainer build definition for a Python 3.12
     environment with document-generation dependencies (LibreOffice,
     python-docx, openpyxl, reportlab, weasyprint, Pillow, etc.) and
     scientific libraries (numpy, pandas, scipy, scikit-learn). When an
     operator sets stirrup_agent.gdpval_container_path, all agent
     code_exec calls route through this container for isolation.

  2. scripts/prepare_gdpval_dataset.py — HuggingFace -> JSONL converter
     for the openai/gdpval dataset. Produces the record shape expected
     by GDPValTask.extract_task_info (task_id, prompt, sector,
     occupation, reference_files, rubric_json, rubric_pretty), with all
     metadata values JSON-encoded to satisfy OpenAI's Metadata type
     constraint.
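A sketch of that conversion, assuming representative field names taken from the record shape above (the actual column names in openai/gdpval may differ); the key point is that non-string values are JSON-encoded so the metadata satisfies the string-to-string constraint:

```python
import json

def to_jsonl_record(row: dict) -> dict:
    """Flatten one HF dataset row into the JSONL record shape described above.
    Field names are illustrative, based on the commit message."""
    metadata = {
        "task_id": row["task_id"],
        "sector": row["sector"],
        "occupation": row["occupation"],
        # Lists/dicts must be JSON-encoded: OpenAI's Metadata type only
        # allows string keys mapped to string values.
        "reference_files": json.dumps(row.get("reference_files", [])),
        "rubric_json": json.dumps(row.get("rubric", {})),
    }
    return {"prompt": row["prompt"], "metadata": metadata}
```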

  3. responses_api_agents/stirrup_agent/setup_scripts/gdpval.sh — the
     conventional setup-script wrapper around the prep script, matching
     the pattern used by existing agents (swe_agents/setup_scripts/*.sh).

Users can now go from a fresh checkout to a full 220-task dataset with
one command:

    bash responses_api_agents/stirrup_agent/setup_scripts/gdpval.sh

Signed-off-by: Serge Panev <spanev@nvidia.com>
Ships five scripts that together form the GDPVal evaluation pipeline
beyond rollout collection:

  - scripts/compare_elo.py — pairwise side-by-side comparison of two
    models' deliverables by a judge LLM, producing an ELO rating with
    configurable priors, position-swap trials, and per-task breakdown.

  - scripts/run_rubric_judge.py — rubric-based scoring for collected
    rollouts that didn't inline-score during the run, or re-running
    with a different judge.

  - scripts/calculate_rubric_elo.py — aggregate rubric scores into an
    ELO-style ranking across multiple evaluated models.

  - scripts/rescore_gdpval.py — serial re-scoring of an existing results
    JSONL with a different judge, to compare judges without re-rolling.

  - scripts/preconvert_to_pdf.py — pre-converts .docx/.pptx/.xlsx
    deliverables to PDF (required before visual pairwise judging).
    Uses LibreOffice headless.

All scripts are OpenAI-API-compatible by default: --server-address
defaults to https://api.openai.com/v1 and --judge-model-name to
gpt-4.1-2025-04-14. Users can point at any OpenAI-compatible endpoint
(vLLM, Azure, third-party).
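The position-swap trials mentioned for compare_elo.py can be sketched like this (`judge` and its verdict strings are hypothetical; the script's actual interface is not shown in this PR):

```python
def position_swapped_score(judge, ours, reference) -> float:
    """Judge a deliverable pair twice with A/B positions swapped so that any
    positional bias in the judge cancels out. Returns the score for `ours`
    in [0, 1]: 1.0 = won both orderings, 0.5 = split or ties, 0.0 = lost both."""
    score = 0.0
    for first, second, ours_is_first in [(ours, reference, True),
                                         (reference, ours, False)]:
        verdict = judge(first, second)  # hypothetical: "first"/"second"/"tie"
        if verdict == "tie":
            score += 0.5
        elif (verdict == "first") == ours_is_first:
            score += 1.0
    return score / 2.0
```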

Signed-off-by: Serge Panev <spanev@nvidia.com>
Stirrup's ChatCompletionsClient sends a static max_completion_tokens
on every request; on long-context models the server rejects the call
once the prompt grows large enough that prompt plus the static
completion budget exceed the context window. This commit wraps the
client to size max_completion_tokens per call (tokenise messages + tools, subtract
from context window, leave a buffer), wires the existing model_id /
completion_token_buffer config fields through, and pins a few runtime
deps the subprocess venv needs but doesn't receive via transitive
resolution.
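The sizing logic reduces to something like the following (the buffer default and numbers are illustrative; the real wrapper tokenises messages + tools with the model's tokenizer rather than taking a precomputed count):

```python
def size_max_completion_tokens(prompt_tokens: int, context_window: int,
                               completion_token_buffer: int = 512) -> int:
    """Per-call completion budget: whatever the prompt leaves free in the
    context window, minus a safety buffer, never below 1."""
    return max(context_window - prompt_tokens - completion_token_buffer, 1)

# A 100k-token prompt on a 131072-token context model with a 512-token buffer:
print(size_max_completion_tokens(100_000, 131_072))  # -> 30560
```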

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
…t_rollouts

Signed-off-by: Serge Panev <spanev@nvidia.com>
…ailable

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
…erables_dir

Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
Signed-off-by: Serge Panev <spanev@nvidia.com>
@Kh4L Kh4L force-pushed the stirrup-gdpval-port branch from 0936f16 to 0087d3e on April 24, 2026 23:32
bxyu-nvidia previously approved these changes Apr 26, 2026
…ata validation

Signed-off-by: Serge Panev <spanev@nvidia.com>
…_test data validation

Signed-off-by: Serge Panev <spanev@nvidia.com>
@bxyu-nvidia bxyu-nvidia merged commit b3c550a into NVIDIA-NeMo:main Apr 27, 2026
6 checks passed
bxyu-nvidia pushed a commit that referenced this pull request Apr 27, 2026
… reward=0) (#1140)

## Problem

`responses_api_agents/stirrup_agent/app.py:117` decorates the Ray remote
function with a bare `@ray.remote(scheduling_strategy="SPREAD")`, so Ray
dispatches the task to workers using the cluster's default Python rather
than the stirrup_agent server's `.venv`. The `stirrup>=0.1.7` extra is
declared as an *extra of the stirrup_agent server*, not of core
(intentionally, per PR #1090), so it's only present in
`responses_api_agents/stirrup_agent/.venv` — not on Ray workers.

Every rollout therefore hits, at `_run_stirrup_agent` line 150:

```
from stirrup.tools import DEFAULT_TOOLS
```

with `ModuleNotFoundError: No module named 'stirrup'`. The agent catches
the exception per-rollout, sets `reward=0.0`, and proceeds. The
resources server then short-circuits without calling the judge.

## Fix

Add `runtime_env={"py_executable": sys.executable}` to the decorator,
matching the well-established pattern in the rest of the codebase.



## Test plan

- [x] `ruff check` / `ruff format --check` clean
- [x] AST parse OK
- [ ] End-to-end re-run on the failing 220-task GDPVal config from the
EFB side; will pin `install_on_the_fly.commit` to this branch's HEAD and
post results in the linked GDPVal thread
- [ ] (Author of #1090 to confirm) — no other places where the bare
`@ray.remote` pattern was intentional for stirrup specifically

Signed-off-by: Alex Gronskiy <agronskiy@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
