fix(nemo_gym): hard-fail when rollouts returned < requested#2394
Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
…than requested (NVIDIA-NeMo#2305)

When the NeMo Gym agent dies mid-rollout (commonly: CPU OOM during
rollout collection when ``num_prompts_per_step *
num_generations_per_prompt`` is too large for the host), the rollout
iterator silently runs out. Today the loop still ``break``s out of the
retry-on-NaN block (the NaN check is a no-op against an empty result
list) and the function returns a list of ``None`` rollouts. Training
keeps running, logging "empty epochs", instead of crashing.

Surface a hard ``RuntimeError`` immediately when fewer rollouts are
returned than were requested, with a message pointing at the most
likely root cause and the relevant config knob:

    NeMo Gym returned only N rollouts but M were requested. The Gym
    agent likely crashed or was killed mid-rollout (commonly: CPU OOM
    during rollout collection when num_prompts_per_step *
    num_generations_per_prompt is too large for the host). Check the
    Gym agent's logs for the underlying error and consider lowering
    num_prompts_per_step. See issue NVIDIA-NeMo#2305.

Why no automated test
---------------------
``NemoGym`` is decorated ``@ray.remote(...)  # pragma: no cover`` per
the project's coverage convention for actor classes (see
``skills/testing/SKILL.md``). Exercising ``run_rollouts`` requires
spinning up a Ray actor and a mock NeMo Gym head server, which is
beyond the L0 unit-test tier and is what the existing functional tests
cover. The fix here is a counted invariant (``len(results) ==
requested``) on values local to the function, so the failure mode is
straightforward to reason about by inspection.

Refs: NVIDIA-NeMo#2305
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
Summary
Closes #2305.
When the NeMo Gym agent dies mid-rollout (commonly: CPU OOM during
rollout collection when `num_prompts_per_step *
num_generations_per_prompt` is too large for the host) the rollout
iterator in `NemoGym.run_rollouts` silently runs out. Today the loop
still `break`s out of the retry-on-NaN block (the NaN check is a
no-op against an empty result list) and the function returns a list
of `None` rollouts. Training keeps running, logging "empty epochs",
instead of crashing — exactly the problem reported in #2305.
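The failure mode can be sketched in a few lines. This is a hypothetical, simplified stand-in for the real loop in `NemoGym.run_rollouts` (the function and variable names here are invented for illustration), showing why an exhausted iterator slips past the NaN check:

```python
def collect_rollouts(rollout_iter, requested):
    """Sketch of the pre-fix behavior: an exhausted iterator is not an error."""
    results = []
    for item in rollout_iter:  # silently stops early if the Gym agent died
        results.append(item)
    # Retry-on-NaN check: vacuously passes over an empty/short result list,
    # so the loop "break"s out of the retry block as if everything succeeded.
    if not any(r is None for r in results):
        pass  # break
    # Caller ends up with None placeholders for the missing rollouts.
    return results + [None] * (requested - len(results))

print(collect_rollouts(iter([]), 4))  # [None, None, None, None]
```

Nothing in this path raises, so training marches on with empty batches.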
What changes
Surface a hard `RuntimeError` immediately when fewer rollouts are
returned than were requested, with a message pointing at the most
likely root cause and the relevant config knob:
```
RuntimeError: NeMo Gym returned only N rollouts but M were requested.
The Gym agent likely crashed or was killed mid-rollout (commonly:
CPU OOM during rollout collection when num_prompts_per_step *
num_generations_per_prompt is too large for the host). Check the Gym
agent's logs for the underlying error and consider lowering
num_prompts_per_step. See issue #2305.
```
The check happens before the retry-on-NaN logic, so the
error-on-incomplete-collection signal isn't masked by the retry path
that's intended for a different failure mode (occasional NaN logprobs
that succeed on retry).
Implementation
Eighteen-line addition inside `NemoGym.run_rollouts` between the
result-collection `for` loop and the NaN check:
```python
if len(nemo_rl_results) < nemo_gym_num_rows:
raise RuntimeError(
f"NeMo Gym returned only {len(nemo_rl_results)} rollouts "
f"but {nemo_gym_num_rows} were requested. ..."
)
```
Why no automated test
`NemoGym` is decorated with `@ray.remote(...) # pragma: no cover`
per the project's coverage convention for actor classes (see
`skills/testing/SKILL.md`). Exercising `run_rollouts` requires
spinning up a Ray actor and a mock NeMo Gym head server, which is
beyond the L0 unit-test tier and is covered by the existing
functional tests. The fix is a counted invariant on values local to
the function, so the failure mode is straightforward to reason about
by inspection.
Out of scope
The example YAML config ships with `num_prompts_per_step: 4096`. The accompanying inline comment in
the YAML explains this is intentional for off-policy async training
(`256 prompts/step * 16 off-policy steps`), so I'm leaving the
config alone here. With this PR, users who hit the CPU OOM will
get a clear error pointing at the right knob to lower for their
hardware.
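For reference, the arithmetic behind that intentional value (numbers taken from the YAML comment quoted above; variable names are mine):

```python
# 256 prompts per optimizer step, held for 16 off-policy async steps.
prompts_per_optimizer_step = 256
off_policy_steps = 16
num_prompts_per_step = prompts_per_optimizer_step * off_policy_steps
print(num_prompts_per_step)  # 4096
```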
Test plan
Manually re-run a configuration that reproduces the original CPU OOM
(the new error should fire immediately instead of producing empty
epochs).
Refs: #2305