Release Summary
NeMo Gym v0.4.0 expands evaluation tooling and agent integrations. It establishes a new monthly release cadence; we will continue to provide day-zero support for Nemotron models, datasets, and environments.
Highlights:
- Unified `gym` CLI: find agents and benchmarks by name with `gym list`, and catch config mistakes early with `gym env validate`
- Diagnose evaluations with BLADE, an analysis skill for agents that reads your evaluation results and produces an evidence-backed report of which tasks failed, why, and the highest-impact fix (e.g. to the agent harness, training, verifier, or prompt)
- Measure the impact of agent skills: run the same tasks with different skill sets and compare how each changes agent performance
- Run agents in isolated sandboxes through a new pluggable provider framework
- More agent harnesses out of the box, including OpenClaw, Pi, and OpenCode
- Connect to hosted inference providers: Fireworks, Together.ai, OpenRouter, and more
- New benchmarks across science, long-context, and interactive tasks
First-Time Contributors
We welcomed 20+ new contributors to this release! A few highlights:
- @marta-sd and @wprazuch led the CLI refactor and clearer config errors
- @hemildesai added the pluggable sandbox provider infrastructure and OpenSandbox as the first built-in provider
- @adil-a laid the groundwork for Gym-owned MCP resources servers, letting a server expose its tools over MCP
- @eric-tramel added the BunsenChem chemistry benchmark
- @jeffwillette added the long machine translation datasets and servers
Thank you to all the new contributors for helping make NeMo Gym better!
Command Line Interface
- One `gym` command for the full workflow, with `gym env`, `gym eval`, `gym list`, and `gym dataset` subcommands
- Reference agents, benchmarks, and environments by name: use `gym list` to see what is available
- `gym env validate` checks your config for missing, malformed, or empty values before a run and reports actionable errors
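As a rough illustration of the kind of pre-flight check described above, the Python sketch below walks a nested config and collects every missing, empty, or unset value into one aggregated report. It is not Gym's implementation: the function name `find_config_problems`, the sample keys, and the treatment of `???` as an "unset" marker (borrowed from the changelog entry about unset '???' config values) are assumptions made for this example.

```python
from typing import Any

UNSET = "???"  # assumed placeholder for a required value that was never filled in


def find_config_problems(node: Any, path: str = "") -> list[str]:
    """Return one message per missing, empty, or unset value under `node`."""
    problems: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            problems.extend(find_config_problems(value, child))
    elif isinstance(node, list):
        if not node:
            problems.append(f"{path}: empty list")
        for index, value in enumerate(node):
            problems.extend(find_config_problems(value, f"{path}[{index}]"))
    elif node is None:
        problems.append(f"{path}: missing value")
    elif isinstance(node, str):
        if node == UNSET:
            problems.append(f"{path}: required value is unset ({UNSET})")
        elif not node.strip():
            problems.append(f"{path}: empty string")
    return problems


# Hypothetical config, shaped only for this example.
sample_config = {
    "agent": {"name": "example_agent", "model": "???"},
    "dataset": {"path": "", "splits": []},
    "output_dir": None,
}

# Report every problem at once instead of stopping at the first one.
for problem in find_config_problems(sample_config):
    print(problem)
```

Collecting all problems before reporting mirrors the "aggregated error" idea: a user can fix every flagged value in one pass rather than rerunning after each failure.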
Evaluation & Diagnostics
- Skill evaluation: measure how agent skills affect performance by running the same tasks with different skill sets. Skills apply at rollout time as a run-level knob, so one dataset works across all skill variants and every rollout is tagged for comparison
- BLADE (Benchmark Level Analysis and Diagnostics Engine): a built-in analysis skill that reads an agent run's rollouts, metrics, and configs and produces an evidence-backed report of which tasks failed, why, and the highest-impact fix (e.g. harness, training, verifier, or prompt)
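The skill comparison described above can be pictured as grouping tagged rollouts by skill set and comparing rewards on the same tasks. The sketch below is a minimal Python illustration under stated assumptions: the field names `task_id`, `skill_set`, and `reward` are hypothetical, the reward values are made-up toy numbers, and none of this reflects Gym's actual rollout schema.

```python
from collections import defaultdict
from statistics import mean

# Toy rollouts with made-up rewards; the field names are hypothetical.
rollouts = [
    {"task_id": "t1", "skill_set": "baseline", "reward": 0.0},
    {"task_id": "t2", "skill_set": "baseline", "reward": 1.0},
    {"task_id": "t1", "skill_set": "with_skills", "reward": 1.0},
    {"task_id": "t2", "skill_set": "with_skills", "reward": 1.0},
]


def mean_reward_by_skill_set(rows):
    """Group rollouts by their skill-set tag and average the reward."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["skill_set"]].append(row["reward"])
    return {tag: mean(values) for tag, values in grouped.items()}


def per_task_delta(rows, baseline, variant):
    """Reward change per task when moving from `baseline` to `variant`.

    Assumes one rollout per (task, skill set); repeated rollouts would
    need to be averaged first.
    """
    by_key = {(r["task_id"], r["skill_set"]): r["reward"] for r in rows}
    tasks = sorted({r["task_id"] for r in rows})
    return {
        task: by_key[(task, variant)] - by_key[(task, baseline)]
        for task in tasks
        if (task, baseline) in by_key and (task, variant) in by_key
    }


print(mean_reward_by_skill_set(rollouts))
print(per_task_delta(rollouts, "baseline", "with_skills"))
```

Because the task set is identical across variants, the per-task difference isolates the effect of the skill set from differences in the data.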
Sandboxing
- Run tool-using and coding agents in isolated sandboxes through a pluggable provider framework
- Built-in OpenSandbox and Apptainer providers, with third-party providers discoverable via entry points
Configure Agent Harnesses
New harnesses join the existing built-in set (Claude Code, Hermes, OpenHands, and more):
- Added OpenCode, OpenClaw, and Pi agents for evaluation
- Claude Code runtime capabilities (tool access, MCP servers, and bare vs. native auto-discovery mode) can now be set in the server config
Configure Models
- New `inference_provider` model server connects to any OpenAI-compatible hosted provider (Fireworks, Together.ai, OpenRouter, DeepInfra, Gemini, and more) with ready-made configs
- Every Gym model server now speaks the Anthropic Messages API, so Anthropic-native harnesses like the Claude Code CLI can run against any model you serve with Gym
New Benchmarks
- Science: CritPt (research-level physics), SciCode (scientific coding), BunsenChem (chemistry multiple-choice), and FrontierScience Research (rubric-scored science)
- Long context: Graphwalks (long-context graph reasoning) and Long Machine Translation (PG19, WMT24++)
- Interactive: TALES, a text-adventure game suite
See the Available Environments table for the full list.
Deprecation Notices
- The legacy `ng_*` and `nemo_gym_*` CLI commands (such as `ng_run` and `ng_collect_rollouts`) are deprecated in favor of the unified `gym` CLI. They still work for now but will be removed in a future release.
Bug Fixes
- Fixed intermittent connection errors during high-concurrency rollout collection
- Invalid YAML in a config file now produces a clear error message instead of a crash
Documentation
- New Build Verifiers section with verification patterns and multi-reward verification
- New Evaluate section covering benchmarks, evaluation metrics, and a guide to agent-native results diagnostics
- New page for configuring and evaluating agent skills
Full Changelog
- ci: bump _release_library.yml to v1.4.3 (#1508) by @ko3n1g
- fix(vllm_model): use reasoning parser option in converter (#1511) by @cmunley1
- fix: Compatibility with vllm 0.20 tool-calling (#1432) by @tdene
- ci: require SHA for release-ref, fix duplicate changelog, add release docs (#1536) by @kajalj22
- added long machine translation datasets and servers (#1458) by @jeffwillette
- feat(genrm_compare): add style density penalty for formatting control (#1543) by @macandro96
- fix: add example data + metrics for longmt_eval (#1559) by @ananthsub
- Add LC benchmarks (#1437) by @hsiehjackson
- removing questions/expected_answers type formating in ns_tools (#1581) by @OliviaViessmann
- ci: parallelize the server test suite (in-process concurrency, ~17min → faster locally + CI) (#1577) by @wprazuch
- ci: pin uv to 0.11.19 (0.11.20 resolver regression breaks the test suite) (#1576) by @wprazuch
- fix: graphwalks data validation (#1587) by @cmunley1
- Support RDKit chemistry answer formats (#1327) by @danecor
- feat: multi-reward tool-call environment and reward_components for GDPO (#1525) by @anjalibshah
- Add FrontierScience Research benchmark (#1553) by @jiacheng-xu
- fix(config): actionable error for unknown server cross-references (#1561) by @wprazuch
- docs: add NGC authentication step to GRPO setup tutorial (Fern) (#1552) by @lbliii
- feat(cli): document ng_init_resources_server generated config inline (#1205 friction #7) (#1597) by @wprazuch
- fix: Fern preview build (#1610) by @chtruong814
- docs: document the Gym to RL framework token-ID data interface (#1554) by @ananthsub
- Add ArXiv MCP tool config (#1419) by @tamohannes
- Add Wikipedia MCP tool config (#1420) by @tamohannes
- Add periodictable MCP tool config (#1422) by @tamohannes
- ci: add Claude Code review workflow (#1622) by @kajalj22
- docs: document use_absolute_ip config option (supersedes #595) (#1621) by @lbliii
- Add CoolProp MCP tool config (#1421) by @tamohannes
- Add particle MCP tool config (#1423) by @tamohannes
- Add radioactive decay MCP tool config (#1424) by @tamohannes
- fix: make logprobs capture robust to top_logprobs=null in vllm model (#1612) by @ananthsub
- Add SciCode Benchmark (#1592) by @fsiino-nvidia
- Add CritPt Benchmark (#1588) by @fsiino-nvidia
- fix: resolve CritPt benchmark config interpolation and add critpt_agent README (#1642) by @linj-glitch
- docs: describe local_vllm_model and extend docs for vllm_model (#1430) by @marta-sd
- Add BLADE analysis skill (#1591) by @jmabry
- feat: make claude code agent runtime capabilities configurable (#1603) by @cwing-nvidia
- Add sandbox API and mini swe agent 2 resource agent (#1377) by @hemildesai
- feat: abstention environment (#1459) by @cmunley1
- feat: reasoning gym environment (#1378) by @cmunley1
- fix(security): upgrade mlflow, grpcio, torch (longmt_eval) for CVE remediation (#1657) by @kajalj22
- feat(docs): add GitHub link to docs navbar (#1654) by @abhay-codes07
- chore: vendor gh-stack agent skill (#1616) by @ananthsub
- feat: arc agi environment (#1460) by @cmunley1
- chore (local_vllm_model): bump vllm 0.17.0 -> 0.20.0 (#1674) by @ananthsub
- Add sandbox coverage unit tests (#1684) by @hemildesai
- fix: refresh blackjack example rollouts (#1683) by @cmunley1
- feat: blackjack environment (#1464) by @cmunley1
- feat: instruction following environment (#1403) by @cmunley1
- feat: circle vlm environments (#1465) by @cmunley1
- feat: calendar environments (#1468) by @cmunley1
- feat: code gen environment (#1467) by @cmunley1
- fix: ensure client keepalive < server keepalive to avoid client keepalive desync errors (#1555) by @ananthsub
- feat: ether0 environment (#1472) by @cmunley1
- feat: [GDPval-AA v2 Updates 1 / n] - GDPval Multi-Reference Model Support (#1663) by @vadam5
- docs: document stacked pull requests in development setup (#1617) by @ananthsub
- docs(config): document the Domain enum (#1205 friction #9 / FEP-1023) (#1633) by @wprazuch
- docs: define Resources/Agent/Model Server in the glossary (#1205 friction #9, #395) (#1634) by @wprazuch
- fix(config): aggregated error for unset '???' config values (#1575) by @wprazuch
- feat: Add a default /v1/messages (Anthropic Messages) route to the base Gym… (#1627) by @ffrujeri
- feat: [GDPval-AA v2 Updates 2 / n] - Task Execution Only Mode (#1722) by @vadam5
- feat: [GDPval-AA v2 Updates 3 / n] - Judge Only Mode (#1725) by @vadam5
- Gym CLI refactor (#1630) by @marta-sd
- feat(config): unify dataset source via discriminated `source:` block (FEP-1025) (#1637) by @wprazuch
- feat(config): unified clean errors for bad/malformed/empty config_paths (#1205 #8/#12; #1488/#1489/#1490) (#1609) by @wprazuch
- feat: environment registry + 'gym list environments' (#1205 friction #8 / M2) (#1635) by @wprazuch
- feat: agent registry — name-based agent discovery + composability (M3 core) (#1671) by @wprazuch
- feat(cli): add 'gym env validate' pre-flight config check (#1205 friction #12) (#1599) by @wprazuch
- ci: fail notify job when Slack webhook returns an error (#1739) by @kajalj22
- Support agent-specific num_repeats in ng_collect_rollouts (#1356) by @gwarmstrong
- docs(fern): adding an evaluation section and fixing other eval references (#1740) by @ritaneves
- feat(sandbox): add Apptainer sandbox provider (#1694) by @arti4nvj
- feat: openclaw agent server (#1580) by @cmunley1
- feat: harness agnostic TerminalBench environment (#1631) by @elisam0
- feat: Add Gym-owned MCP resources server support (#1682) by @adil-a
- feat: opencode agent server (#1594) by @cmunley1
- feat: pi agent server (#1595) by @cmunley1
- docs: add BLADE definition and rename Skills to Benchmark Analysis (#1756) by @sephmard
- docs: move Add a Benchmark from Evaluation to Contribute (#1757) by @sephmard
- docs: add Build Verifiers tab and move Verification Patterns (#1771) by @sephmard
- docs: rename Environment Tutorials to Build Environments (#1774) by @sephmard
- docs: reorder nav and rename Configure Agents and Prepare Data (#1777) by @sephmard
- docs(fern): fix inference provider base URLs to match configs (#1753) by @cwing-nvidia
- docs(fern): rebase Build Verifiers content from PR #1751 (#1779) by @sephmard
- docs(fern): remove redundant Resources Server page (#1781) by @cwing-nvidia
- docs: add multi-reward verification (#1665) by @cwing-nvidia
- feat: agent skill evaluation infrastructure (#1605) by @cwing-nvidia
- docs(fern): fix Agent Skills duplicate heading and page order (#1786) by @cwing-nvidia
- fix(deps): pin prometheus-fastapi-instrumentator>=8.0.2 for FastAPI>=0.137 (#1785) by @yfw
- feat: propagate routed experts through training outputs (#1716) by @zyzhou5
- docs: diagnose results (#1789) by @cwing-nvidia
- docs: source: union, clean config errors, and new discovery/validate CLI commands (#1754) by @wprazuch
- feat: cli error handling (#1804) by @marta-sd
- feat(cli): make built-in assets work from a wheel install / external cwd (#1805) by @wprazuch
- feat: resolve built-in data & prompt files from the install root (#1806) by @wprazuch
- fix: gym dataset collate creates the artifact dir when missing (run-from-wheel) (#1811) by @wprazuch
- feat: clean config errors + run/test built-in servers from any cwd (#1807) by @wprazuch
- docs: add Agent Skills page under Contribute (#981) (#1681) by @lbliii
- feat: integrate TALES: Text Adventure Learning Environment Suite (#1556) by @cmunley1
- docs: enable Fern multi-source (#1816) by @lbliii
- Add BunsenChem benchmark (#1501) by @eric-tramel
- [codex] clarify Fern version snapshots (#1813) by @lbliii
- fix: address issues in CLI docs and examples (#1809) by @marta-sd
- fix(security): bump aiohttp 3.13.3 -> 3.14.1 (#1817) by @kajalj22
- fix: add example_metrics to tales (#1819) by @cmunley1
- docs: fill CLI reference gaps for data prep and rollout collection (#1675) by @lbliii
- docs: knock out small documentation fixes (#1670) by @lbliii
- docs: reformat ecosystem page to template shape (#1579) by @lbliii
- [codex] Split sandbox unit tests in CI (#1778) by @hemildesai
- feat(gdpval): AA v2 prompt, deps, and mandatory Apptainer sandbox (#1714) by @agronskiy
- feat(sandbox): decouple sandbox provider config from agent config (#1708) by @ananthsub
- tau3 banking_knowledge integration (#1573) by @jkyi-nvidia
- docs: remove Example Resources Servers page and reorder Build Verifiers nav (#1791) by @cwing-nvidia
- Rfneves/link 87f42dcd (#1828) by @ritaneves
- feat(cli): add --model-type selector to gym dataset collate (#1834) by @wprazuch
- docs(evaluate): Benchmarks page + expanded environment list (#1826) by @sephmard
- Fix documentation links from DX - high and medium (#1831) by @ritaneves
- fix(scicode): drop scipy from [tool.uv] exclude-dependencies (#1832) by @laszkiewiczp
- fix: restore support for num_repeats=null for backward compatibility (#1833) by @marta-sd
- docs: fix 11 documentation and tutorial issues (DOC/LINK/TUT) (#1837) by @sephmard
- feat(sandbox): provider configs contribute default sandbox metadata (#1709) by @ananthsub
- fix(ns_tools): clean up tool state after verification (#1838) by @gchlebus
- feat: [GDPval-AA v2 Updates 6 / n] - Multi-Stage Task Sampling (#1746) by @vadam5
- ci: shard full server suite (#1842) by @chtruong814
- ci: use PAT for code-freeze branch push (#1845) by @kajalj22
- fix: address CLI issues (#1829) by @marta-sd
- fix: handle malformed yaml when loading extra configs (#1854) by @marta-sd
- fix: print full table for `gym list` and create `cli/utils.py` with helpers (#1858) by @marta-sd
- feat: [GDPval-AA v2 Updates 4 / n] - Re-Run Failed Tasks and Judgements Only (#1846) by @vadam5
- feat: [GDPval-AA v2 Updates 5 / n] - Multi-Judge Panel (#1852) by @vadam5
- fix: ERR-225d2c82 Fix HotpotQA dataset license value (#1841) by @ritaneves
- fix: add 'all' extra, surface auth errors in quickstart (QS fixes) (#1840) by @sephmard
- fix(tau3): update repo pin (#1875) by @cmunley1
- [codex] Add sandbox API docs guide (#1717) by @hemildesai
- docs: add v0.4.0 release notes (#1879) by @cwing-nvidia
- fix: Container guidance is inconsistent across v0.3.0 docs (#1827) by @ffrujeri
- chore: update uv.lock (#1876) by @kajalj22
- docs: add v0.4.0 highlights to README News and trim archive (#1886) by @cwing-nvidia
- fix(security): bump aiohttp >=3.14.1 and Pillow >=12.3.0 (CVE mitigations) (#1885) by @kajalj22
- docs: fix typos in README environment table source configs (#1892) by @cwing-nvidia
- ci: use NVIDIA inference for Claude review (#1878) by @chtruong814
- [mini-swe-agent 2] Quickstart fix + gradeable example data & rollouts (#1896) by @ananthsub
- feat: use all available domain info when listing benchmarks (#1857) by @marta-sd
- feat(benchmarks): Add arguments to preparation script; configurable RULER (#1711) by @prokotg
- docs(fern): add v0.4.0 version snapshot for GA release (#1913) by @kajalj22
- fix(mini_swe_agent_2): don't install agent deps into root venv (openai pin conflict) (#1916) by @ananthsub
- release: bump main to 0.5.0rc0 for next dev cycle (#1921) by @ananthsub