fix(nemo_gym): hard-fail when rollouts returned < requested#2394
Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
…than requested (NVIDIA-NeMo#2305)

When the NeMo Gym agent dies mid-rollout (commonly: CPU OOM during
rollout collection when ``num_prompts_per_step *
num_generations_per_prompt`` is too large for the host), the rollout
iterator silently runs out. Today the loop still ``break``s out of the
retry-on-NaN block (the NaN check is a no-op against an empty result
list) and the function returns a list of ``None`` rollouts. Training
keeps running, logging "empty epochs", instead of crashing.

Surface a hard ``RuntimeError`` immediately when fewer rollouts are
returned than were requested, with a message pointing at the most
likely root cause and the relevant config knob:

    NeMo Gym returned only N rollouts but M were requested. The Gym
    agent likely crashed or was killed mid-rollout (commonly: CPU OOM
    during rollout collection when num_prompts_per_step *
    num_generations_per_prompt is too large for the host). Check the
    Gym agent's logs for the underlying error and consider lowering
    num_prompts_per_step. See issue NVIDIA-NeMo#2305.

Why no automated test
---------------------
``NemoGym`` is decorated ``@ray.remote(...)  # pragma: no cover`` per
the project's coverage convention for actor classes (see
``skills/testing/SKILL.md``). Exercising ``run_rollouts`` requires
spinning up a Ray actor and a mock NeMo Gym head server, which is
beyond the L0 unit-test tier and is what the existing functional tests
cover. The fix here is a counted invariant (``len(results) ==
requested``) on values local to the function, so the failure mode is
straightforward to reason about by inspection.

Refs: NVIDIA-NeMo#2305
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
Summary
Closes #2305.
When the NeMo Gym agent dies mid-rollout (commonly: CPU OOM during
rollout collection when `num_prompts_per_step *
num_generations_per_prompt` is too large for the host) the rollout
iterator in `NemoGym.run_rollouts` silently runs out. Today the loop
still `break`s out of the retry-on-NaN block (the NaN check is a
no-op against an empty result list) and the function returns a list
of `None` rollouts. Training keeps running, logging "empty epochs",
instead of crashing — exactly the problem reported in #2305.
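The failure mode can be sketched in a few lines. This is a hypothetical, simplified stand-in for the real loop in `NemoGym.run_rollouts` (the function and variable names here are invented for illustration), showing why an exhausted iterator slips past the NaN check:

```python
def collect_rollouts(rollout_iter, requested):
    """Sketch of the pre-fix behavior: an exhausted iterator is not an error."""
    results = []
    for item in rollout_iter:  # silently stops early if the Gym agent died
        results.append(item)
    # Retry-on-NaN check: vacuously passes over an empty/short result list,
    # so the loop "break"s out of the retry block as if everything succeeded.
    if not any(r is None for r in results):
        pass  # break
    # Caller ends up with None placeholders for the missing rollouts.
    return results + [None] * (requested - len(results))

print(collect_rollouts(iter([]), 4))  # [None, None, None, None]
```

Nothing in this path raises, so training marches on with empty batches.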
What changes
Surface a hard `RuntimeError` immediately when fewer rollouts are
returned than were requested, with a message pointing at the most
likely root cause and the relevant config knob:
```
RuntimeError: NeMo Gym returned only N rollouts but M were requested.
The Gym agent likely crashed or was killed mid-rollout (commonly:
CPU OOM during rollout collection when num_prompts_per_step *
num_generations_per_prompt is too large for the host). Check the Gym
agent's logs for the underlying error and consider lowering
num_prompts_per_step. See issue #2305.
```
The check happens before the retry-on-NaN logic, so the
error-on-incomplete-collection signal isn't masked by the retry path
that's intended for a different failure mode (occasional NaN logprobs
that succeed on retry).
Implementation
Eighteen-line addition inside `NemoGym.run_rollouts` between the
result-collection `for` loop and the NaN check:
```python
if len(nemo_rl_results) < nemo_gym_num_rows:
raise RuntimeError(
f"NeMo Gym returned only {len(nemo_rl_results)} rollouts "
f"but {nemo_gym_num_rows} were requested. ..."
)
```
Why no automated test
`NemoGym` is decorated with `@ray.remote(...) # pragma: no cover`
per the project's coverage convention for actor classes (see
`skills/testing/SKILL.md`). Exercising `run_rollouts` requires
spinning up a Ray actor and a mock NeMo Gym head server, which is
beyond the L0 unit-test tier and is covered by the existing
functional tests. The fix is a counted invariant on values local to
the function, so the failure mode is straightforward to reason about
by inspection.
Out of scope
The example YAML config ships with `num_prompts_per_step: 4096`. The accompanying inline comment in
the YAML explains this is intentional for off-policy async training
(`256 prompts/step * 16 off-policy steps`), so I'm leaving the
config alone here. With this PR, users who hit the CPU OOM will
get a clear error pointing at the right knob to lower for their
hardware.
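For reference, the arithmetic behind that intentional value (numbers taken from the YAML comment quoted above; variable names are mine):

```python
# 256 prompts per optimizer step, held for 16 off-policy async steps.
prompts_per_optimizer_step = 256
off_policy_steps = 16
num_prompts_per_step = prompts_per_optimizer_step * off_policy_steps
print(num_prompts_per_step)  # 4096
```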
Test plan
Manually re-run a configuration that reproduces the original CPU OOM
(the new error should fire immediately instead of producing empty
epochs).
Refs: #2305