Release v0.4.0 · NVIDIA-NeMo/Gym

Release Summary

NeMo Gym v0.4.0 expands evaluation tooling and agent integrations. It establishes a new monthly release cadence; we will continue to provide day-zero support for Nemotron models, datasets, and environments.

Highlights:

Unified gym CLI: find agents and benchmarks by name with gym list, and catch config mistakes early with gym env validate
Diagnose evaluations with BLADE, an analysis skill for agents that reads your evaluation results and produces an evidence-backed report of which tasks failed, why, and the highest-impact fix (e.g. to the agent harness, training, verifier, or prompt)
Measure the impact of agent skills: run the same tasks with different skill sets and compare how each changes agent performance
Run agents in isolated sandboxes through a new pluggable provider framework
More agent harnesses out of the box, including OpenClaw, Pi, and OpenCode
Connect to hosted inference providers: Fireworks, Together.ai, OpenRouter, and more
New benchmarks across science, long-context, and interactive tasks

First-Time Contributors

We welcomed 20+ new contributors to this release! A few highlights:

@marta-sd and @wprazuch led the CLI refactor and clearer config errors
@hemildesai added the pluggable sandbox provider infrastructure and OpenSandbox as the first built-in
@adil-a laid the groundwork for Gym-owned MCP resources servers, letting a server expose its tools over MCP
@eric-tramel added the BunsenChem chemistry benchmark
@jeffwillette added the long machine translation datasets and servers

Thank you to all the new contributors for helping make NeMo Gym better!

Command Line Interface

One gym command for the full workflow, with gym env, gym eval, gym list, and gym dataset subcommands
Reference agents, benchmarks, and environments by name: use gym list to see what is available
gym env validate checks your config for missing, malformed, or empty values before a run and reports actionable errors

Evaluation & Diagnostics

Skill evaluation: measure how agent skills affect performance by running the same tasks with different skill sets. Skills apply at rollout time as a run-level knob, so one dataset works across all skill variants and every rollout is tagged for comparison
BLADE (Benchmark Level Analysis and Diagnostics Engine): a built-in analysis skill that reads an agent run's rollouts, metrics, and configs and produces an evidence-backed report of which tasks failed, why, and the highest-impact fix (e.g. harness, training, verifier, or prompt)

Sandboxing

Run tool-using and coding agents in isolated sandboxes through a pluggable provider framework
Built-in OpenSandbox and Apptainer providers, with third-party providers discoverable via entry points

Configure Agent Harnesses

New harnesses join the existing built-in set (Claude Code, Hermes, OpenHands, and more):

Added OpenCode, OpenClaw, and Pi agents for evaluation
Claude Code runtime capabilities (tool access, MCP servers, and bare vs. native auto-discovery mode) are now easily set via the server config

Configure Models

New inference_provider model server connects to any OpenAI-compatible hosted provider (Fireworks, Together.ai, OpenRouter, DeepInfra, Gemini, and more) with ready-made configs
Every Gym model server now speaks the Anthropic Messages API, so Anthropic-native harnesses like the Claude Code CLI can run against any model you serve with Gym

New Benchmarks

Science: CritPt (research-level physics), SciCode (scientific coding), BunsenChem (chemistry multiple-choice), and FrontierScience Research (rubric-scored science)
Long context: Graphwalks (long-context graph reasoning) and Long Machine Translation (PG19, WMT24++)
Interactive: TALES, a text-adventure game suite

See the Available Environments table for the full list.

Deprecation Notices

The legacy ng_* and nemo_gym_* CLI commands (such as ng_run and ng_collect_rollouts) are deprecated in favor of the unified gym CLI. They still work for now but will be removed in a future release.

Bug Fixes

Fixed intermittent connection errors during high-concurrency rollout collection
Clear error messages instead of crashes when a config file contains invalid YAML

Documentation

New Build Verifiers section with verification patterns and multi-reward verification
New Evaluate section covering benchmarks, evaluation metrics, and a guide to agent-native results diagnostics
New page for configuring and evaluating agent skills

Full Changelog

ci: bump _release_library.yml to v1.4.3 (#1508) by @ko3n1g
fix(vllm_model): use reasoning parser option in converter (#1511) by @cmunley1
fix: Compatibility with vllm 0.20 tool-calling (#1432) by @tdene
ci: require SHA for release-ref, fix duplicate changelog, add release docs (#1536) by @kajalj22
added long machine translation datasets and servers (#1458) by @jeffwillette
feat(genrm_compare): add style density penalty for formatting control (#1543) by @macandro96
fix: add example data + metrics for longmt_eval (#1559) by @ananthsub
Add LC benchmarks (#1437) by @hsiehjackson
removing questions/expected_answers type formating in ns_tools (#1581) by @OliviaViessmann
ci: parallelize the server test suite (in-process concurrency, ~17min → faster locally + CI) (#1577) by @wprazuch
ci: pin uv to 0.11.19 (0.11.20 resolver regression breaks the test suite) (#1576) by @wprazuch
fix: graphwalks data validation (#1587) by @cmunley1
Support RDKit chemistry answer formats (#1327) by @danecor
feat: multi-reward tool-call environment and reward_components for GDPO (#1525) by @anjalibshah
Add FrontierScience Research benchmark (#1553) by @jiacheng-xu
fix(config): actionable error for unknown server cross-references (#1561) by @wprazuch
docs: add NGC authentication step to GRPO setup tutorial (Fern) (#1552) by @lbliii
feat(cli): document ng_init_resources_server generated config inline (#1205 friction #7) (#1597) by @wprazuch
fix: Fern preview build (#1610) by @chtruong814
docs: document the Gym to RL framework token-ID data interface (#1554) by @ananthsub
Add ArXiv MCP tool config (#1419) by @tamohannes
Add Wikipedia MCP tool config (#1420) by @tamohannes
Add periodictable MCP tool config (#1422) by @tamohannes
ci: add Claude Code review workflow (#1622) by @kajalj22
docs: document use_absolute_ip config option (supersedes #595) (#1621) by @lbliii
Add CoolProp MCP tool config (#1421) by @tamohannes
Add particle MCP tool config (#1423) by @tamohannes
Add radioactive decay MCP tool config (#1424) by @tamohannes
fix: make logprobs capture robust to top_logprobs=null in vllm model (#1612) by @ananthsub
Add SciCode Benchmark (#1592) by @fsiino-nvidia
Add CritPt Benchmark (#1588) by @fsiino-nvidia
fix: resolve CritPt benchmark config interpolation and add critpt_agent README (#1642) by @linj-glitch
docs: describe local_vllm_model and extend docs for vllm_model (#1430) by @marta-sd
Add BLADE analysis skill (#1591) by @jmabry
feat: make claude code agent runtime capabilities configurable (#1603) by @cwing-nvidia
Add sandbox API and mini swe agent 2 resource agent (#1377) by @hemildesai
feat: abstention environment (#1459) by @cmunley1
feat: reasoning gym environment (#1378) by @cmunley1
fix(security): upgrade mlflow, grpcio, torch (longmt_eval) for CVE remediation (#1657) by @kajalj22
feat(docs): add GitHub link to docs navbar (#1654) by @abhay-codes07
chore: vendor gh-stack agent skill (#1616) by @ananthsub
feat: arc agi environment (#1460) by @cmunley1
chore (local_vllm_model): bump vllm 0.17.0 -> 0.20.0 (#1674) by @ananthsub
Add sandbox coverage unit tests (#1684) by @hemildesai
fix: refresh blackjack example rollouts (#1683) by @cmunley1
feat: blackjack environment (#1464) by @cmunley1
feat: instruction following environment (#1403) by @cmunley1
feat: circle vlm environments (#1465) by @cmunley1
feat: calendar environments (#1468) by @cmunley1
feat: code gen environment (#1467) by @cmunley1
fix: ensure client keepalive < server keepalive to avoid client keepalive desync errors (#1555) by @ananthsub
feat: ether0 environment (#1472) by @cmunley1
feat: [GDPval-AA v2 Updates 1 / n] - GDPval Multi-Reference Model Support (#1663) by @vadam5
docs: document stacked pull requests in development setup (#1617) by @ananthsub
docs(config): document the Domain enum (#1205 friction #9 / FEP-1023) (#1633) by @wprazuch
docs: define Resources/Agent/Model Server in the glossary (#1205 friction #9, #395) (#1634) by @wprazuch
fix(config): aggregated error for unset '???' config values (#1575) by @wprazuch
feat: Add a default /v1/messages (Anthropic Messages) route to the base Gym… (#1627) by @ffrujeri
feat: [GDPval-AA v2 Updates 2 / n] - Task Execution Only Mode (#1722) by @vadam5
feat: [GDPval-AA v2 Updates 3 / n] - Judge Only Mode (#1725) by @vadam5
Gym CLI refactor (#1630) by @marta-sd
feat(config): unify dataset source via discriminated source: block (FEP-1025) (#1637) by @wprazuch
feat(config): unified clean errors for bad/malformed/empty config_paths (#1205 #8/#12; #1488/#1489/#1490) (#1609) by @wprazuch
feat: environment registry + 'gym list environments' (#1205 friction #8 / M2) (#1635) by @wprazuch
feat: agent registry — name-based agent discovery + composability (M3 core) (#1671) by @wprazuch
feat(cli): add 'gym env validate' pre-flight config check (#1205 friction #12) (#1599) by @wprazuch
ci: fail notify job when Slack webhook returns an error (#1739) by @kajalj22
Support agent-specific num_repeats in ng_collect_rollouts (#1356) by @gwarmstrong
docs(fern): adding an evaluation section and fixing other eval references (#1740) by @ritaneves
feat(sandbox): add Apptainer sandbox provider (#1694) by @arti4nvj
feat: openclaw agent server (#1580) by @cmunley1
feat: harness agnostic TerminalBench environment (#1631) by @elisam0
feat: Add Gym-owned MCP resources server support (#1682) by @adil-a
feat: opencode agent server (#1594) by @cmunley1
feat: pi agent server (#1595) by @cmunley1
docs: add BLADE definition and rename Skills to Benchmark Analysis (#1756) by @sephmard
docs: move Add a Benchmark from Evaluation to Contribute (#1757) by @sephmard
docs: add Build Verifiers tab and move Verification Patterns (#1771) by @sephmard
docs: rename Environment Tutorials to Build Environments (#1774) by @sephmard
docs: reorder nav and rename Configure Agents and Prepare Data (#1777) by @sephmard
docs(fern): fix inference provider base URLs to match configs (#1753) by @cwing-nvidia
docs(fern): rebase Build Verifiers content from PR #1751 (#1779) by @sephmard
docs(fern): remove redundant Resources Server page (#1781) by @cwing-nvidia
docs: add multi-reward verification (#1665) by @cwing-nvidia
feat: agent skill evaluation infrastructure (#1605) by @cwing-nvidia
docs(fern): fix Agent Skills duplicate heading and page order (#1786) by @cwing-nvidia
fix(deps): pin prometheus-fastapi-instrumentator>=8.0.2 for FastAPI>=0.137 (#1785) by @yfw
feat: propagate routed experts through training outputs (#1716) by @zyzhou5
docs: diagnose results (#1789) by @cwing-nvidia
docs: source: union, clean config errors, and new discovery/validate CLI commands (#1754) by @wprazuch
feat: cli error handling (#1804) by @marta-sd
feat(cli): make built-in assets work from a wheel install / external cwd (#1805) by @wprazuch
feat: resolve built-in data & prompt files from the install root (#1806) by @wprazuch
fix: gym dataset collate creates the artifact dir when missing (run-from-wheel) (#1811) by @wprazuch
feat: clean config errors + run/test built-in servers from any cwd (#1807) by @wprazuch
docs: add Agent Skills page under Contribute (#981) (#1681) by @lbliii
feat: integrate TALES: Text Adventure Learning Environment Suite (#1556) by @cmunley1
docs: enable Fern multi-source (#1816) by @lbliii
Add BunsenChem benchmark (#1501) by @eric-tramel
[codex] clarify Fern version snapshots (#1813) by @lbliii
fix: address issues in CLI docs and examples (#1809) by @marta-sd
fix(security): bump aiohttp 3.13.3 -> 3.14.1 (#1817) by @kajalj22
fix: add example_metrics to tales (#1819) by @cmunley1
docs: fill CLI reference gaps for data prep and rollout collection (#1675) by @lbliii
docs: knock out small documentation fixes (#1670) by @lbliii
docs: reformat ecosystem page to template shape (#1579) by @lbliii
[codex] Split sandbox unit tests in CI (#1778) by @hemildesai
feat(gdpval): AA v2 prompt, deps, and mandatory Apptainer sandbox (#1714) by @agronskiy
feat(sandbox): decouple sandbox provider config from agent config (#1708) by @ananthsub
tau3 banking_knowledge integration (#1573) by @jkyi-nvidia
docs: remove Example Resources Servers page and reorder Build Verifiers nav (#1791) by @cwing-nvidia
Rfneves/link 87f42dcd (#1828) by @ritaneves
feat(cli): add --model-type selector to gym dataset collate (#1834) by @wprazuch
docs(evaluate): Benchmarks page + expanded environment list (#1826) by @sephmard
Fix documentation links from DX - high and medium (#1831) by @ritaneves
fix(scicode): drop scipy from [tool.uv] exclude-dependencies (#1832) by @laszkiewiczp
fix: restore support for num_repeats=null for backward compatibility (#1833) by @marta-sd
docs: fix 11 documentation and tutorial issues (DOC/LINK/TUT) (#1837) by @sephmard
feat(sandbox): provider configs contribute default sandbox metadata (#1709) by @ananthsub
fix(ns_tools): clean up tool state after verification (#1838) by @gchlebus
feat: [GDPval-AA v2 Updates 6 / n] - Multi-Stage Task Sampling (#1746) by @vadam5
ci: shard full server suite (#1842) by @chtruong814
ci: use PAT for code-freeze branch push (#1845) by @kajalj22
fix: address CLI issues (#1829) by @marta-sd
fix: handle malformed yaml when loading extra configs (#1854) by @marta-sd
fix: print full table for gym list and create cli/utils.py with helpers (#1858) by @marta-sd
feat: [GDPval-AA v2 Updates 4 / n] - Re-Run Failed Tasks and Judgements Only (#1846) by @vadam5
feat: [GDPval-AA v2 Updates 5 / n] - Multi-Judge Panel (#1852) by @vadam5
fix: ERR-225d2c82 Fix HotpotQA dataset license value (#1841) by @ritaneves
fix: add 'all' extra, surface auth errors in quickstart (QS fixes) (#1840) by @sephmard
fix(tau3): update repo pin (#1875) by @cmunley1
[codex] Add sandbox API docs guide (#1717) by @hemildesai
docs: add v0.4.0 release notes (#1879) by @cwing-nvidia
fix: Container guidance is inconsistent across v0.3.0 docs (#1827) by @ffrujeri
chore: update uv.lock (#1876) by @kajalj22
docs: add v0.4.0 highlights to README News and trim archive (#1886) by @cwing-nvidia
fix(security): bump aiohttp >=3.14.1 and Pillow >=12.3.0 (CVE mitigations) (#1885) by @kajalj22
docs: fix typos in README environment table source configs (#1892) by @cwing-nvidia
ci: use NVIDIA inference for Claude review (#1878) by @chtruong814
[mini-swe-agent 2] Quickstart fix + gradeable example data & rollouts (#1896) by @ananthsub
feat: use all available domain info when listing benchmarks (#1857) by @marta-sd
feat(benchmarks): Add arguments to preparation script; configurable RULER (#1711) by @prokotg
docs(fern): add v0.4.0 version snapshot for GA release (#1913) by @kajalj22
fix(mini_swe_agent_2): don't install agent deps into root venv (openai pin conflict) (#1916) by @ananthsub
release: bump main to 0.5.0rc0 for next dev cycle (#1921) by @ananthsub

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

v0.4.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Release Summary

First-Time Contributors

Command Line Interface

Evaluation & Diagnostics

Sandboxing

Configure Agent Harnesses

Configure Models

New Benchmarks

Deprecation Notices

Bug Fixes

Documentation

Contributors

Uh oh!