
fix(nemo_gym): hard-fail when rollouts returned < requested #2394

Open

lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from lonexreb:fix/2305-gym-empty-epochs

Conversation


lonexreb commented May 4, 2026

Summary

Closes #2305.

When the NeMo Gym agent dies mid-rollout (commonly: CPU OOM during
rollout collection when `num_prompts_per_step *
num_generations_per_prompt` is too large for the host) the rollout
iterator in `NemoGym.run_rollouts` silently runs out. Today the loop
still `break`s out of the retry-on-NaN block (the NaN check is a
no-op against an empty result list) and the function returns a list
of `None` rollouts. Training keeps running, logging "empty epochs",
instead of crashing — exactly the problem reported in #2305.
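
To make the failure mode concrete, here is a self-contained toy that reproduces the shape of the bug. Every name and the exact structure are simplifications invented for illustration, not the real `run_rollouts` body:

```python
import math

def crashed_agent(n_requested):
    """Stands in for the Gym agent: dies after yielding 2 of n_requested rollouts."""
    for i in range(min(2, n_requested)):
        yield {"row": i, "logprobs": [0.1, 0.2]}
    # CPU OOM kills the agent here; the iterator simply stops early

def collect_rollouts_buggy(n_requested):
    results = list(crashed_agent(n_requested))

    # Retry-on-NaN block: over a short (or empty) result list the NaN
    # check is vacuously False, so the loop breaks on the first pass.
    for _attempt in range(3):
        if not any(math.isnan(lp) for r in results for lp in r["logprobs"]):
            break

    # Pad to the requested length -- the "list of None rollouts"
    return results + [None] * (n_requested - len(results))

print(collect_rollouts_buggy(8))  # 2 real rollouts + 6 Nones, no error raised
```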

What changes

Surface a hard `RuntimeError` immediately when fewer rollouts are
returned than were requested, with a message pointing at the most
likely root cause and the relevant config knob:

```
RuntimeError: NeMo Gym returned only N rollouts but M were requested.
The Gym agent likely crashed or was killed mid-rollout (commonly:
CPU OOM during rollout collection when num_prompts_per_step *
num_generations_per_prompt is too large for the host). Check the Gym
agent's logs for the underlying error and consider lowering
num_prompts_per_step. See issue #2305.
```

The check happens before the retry-on-NaN logic, so the
error-on-incomplete-collection signal isn't masked by the retry path
that's intended for a different failure mode (occasional NaN logprobs
that succeed on retry).

Implementation

Eighteen-line addition inside `NemoGym.run_rollouts` between the
result-collection `for` loop and the NaN check:

```python
if len(nemo_rl_results) < nemo_gym_num_rows:
    raise RuntimeError(
        f"NeMo Gym returned only {len(nemo_rl_results)} rollouts "
        f"but {nemo_gym_num_rows} were requested. ..."
    )
```
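
For reference, a fuller sketch with the complete message quoted in the Summary spelled out. This is an illustrative reconstruction of the guard and its placement, not the verbatim diff:

```python
# Illustrative reconstruction, not the exact eighteen-line diff. The
# guard sits before the retry-on-NaN loop so a short collection is
# never mistaken for a transient NaN-logprob failure.
if len(nemo_rl_results) < nemo_gym_num_rows:
    raise RuntimeError(
        f"NeMo Gym returned only {len(nemo_rl_results)} rollouts but "
        f"{nemo_gym_num_rows} were requested. The Gym agent likely "
        "crashed or was killed mid-rollout (commonly: CPU OOM during "
        "rollout collection when num_prompts_per_step * "
        "num_generations_per_prompt is too large for the host). Check "
        "the Gym agent's logs for the underlying error and consider "
        "lowering num_prompts_per_step. See issue #2305."
    )

# ...the NaN-retry logic runs only after this point.
```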

Why no automated test

`NemoGym` is decorated with `@ray.remote(...) # pragma: no cover`
per the project's coverage convention for actor classes (see
`skills/testing/SKILL.md`). Exercising `run_rollouts` requires
spinning up a Ray actor and a mock NeMo Gym head server, which is
beyond the L0 unit-test tier and is covered by the existing
functional tests. The fix is a counted invariant on values local to
the function, so the failure mode is straightforward to reason about
by inspection.

Out of scope

  • The reporter also notes `grpo_qwen3_30ba3b_instruct.yaml` uses
    `num_prompts_per_step: 4096`. The accompanying inline comment in
    the YAML explains this is intentional for off-policy async training
    (`256 prompts/step * 16 off-policy steps`), so I'm leaving the
    config alone here. With this PR, users who hit the CPU OOM will
    get a clear error pointing at the right knob to lower for their
    hardware; the sizing sketch below makes the product concrete.
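
As a back-of-the-envelope check of why the product matters: only `num_prompts_per_step: 4096` below comes from the config discussed above; the `num_generations_per_prompt` value is a made-up example, not taken from that YAML:

```python
# Rollouts collected on the host per step scale with the product of
# the two knobs. 4096 is from grpo_qwen3_30ba3b_instruct.yaml; 8 is a
# hypothetical example value chosen only for illustration.
num_prompts_per_step = 4096       # 256 prompts/step * 16 off-policy steps
num_generations_per_prompt = 8    # illustrative only

print(num_prompts_per_step * num_generations_per_prompt)  # 32768 rollouts/step
```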

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` — clean
  • CI `L0`
  • Functional smoke test from the maintainers on a config that
    reproduces the original CPU OOM (the new error should fire
    immediately instead of producing empty epochs).

Refs: #2305

fix(nemo_gym): hard-fail when rollout collection returns fewer rows than requested (NVIDIA-NeMo#2305)

When the NeMo Gym agent dies mid-rollout (commonly: CPU OOM during
rollout collection when ``num_prompts_per_step *
num_generations_per_prompt`` is too large for the host) the rollout
iterator silently runs out. Today the loop still ``break``s out of
the retry-on-NaN block — the NaN check is a no-op against an empty
result list — and the function returns a list of ``None`` rollouts.
Training keeps running, logging "empty epochs", instead of crashing.

Surface a hard ``RuntimeError`` immediately when fewer rollouts are
returned than were requested, with a message pointing at the most
likely root cause and the relevant config knob:

    NeMo Gym returned only N rollouts but M were requested. The Gym
    agent likely crashed or was killed mid-rollout (commonly: CPU OOM
    during rollout collection when num_prompts_per_step *
    num_generations_per_prompt is too large for the host). Check the
    Gym agent's logs for the underlying error and consider lowering
    num_prompts_per_step. See issue NVIDIA-NeMo#2305.

Why no automated test
---------------------

``NemoGym`` is decorated ``@ray.remote(...)  # pragma: no cover`` per
the project's coverage convention for actor classes (see
@skills/testing/SKILL.md). Exercising ``run_rollouts`` requires
spinning up a Ray actor and a mock NeMo Gym head server, which is
beyond the L0 unit-test tier and is what the existing functional
tests cover. The fix here is a counted invariant
(``len(results) == requested``) on values local to the function, so
the failure mode is straightforward to reason about by inspection.

Refs: NVIDIA-NeMo#2305
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
lonexreb requested review from a team as code owners May 4, 2026 08:27

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lonexreb changed the title from "fix(nemo_gym): hard-fail when rollout collection returns fewer rows than requested" to "fix(nemo_gym): hard-fail when rollouts returned < requested" on May 5, 2026


Development

Successfully merging this pull request may close these issues.

Gym: Empty epochs if Gym agent fails
