fix(workers): extend offload guard to v2 + Megatron paths #2393

Open
lonexreb wants to merge 3 commits into NVIDIA-NeMo:main from
lonexreb:fix/1141-helpful-prepare-for-error-v2-mcore

Conversation


@lonexreb lonexreb commented May 4, 2026

Summary

Stacks on #2392. Brings the same offload guard from #2392 to the
`DTensorPolicyWorkerV2` and `MegatronPolicyWorker` paths, so users
on the v2 dtensor backend and the Megatron-Core backend get the same
clear `RuntimeError` instead of a CUDA "illegal memory access" when a
custom training loop forgets `prepare_for_*` after
`offload_after_refit`. Together with #2392 this completes the
policy-worker side of #1141.

What changes

Both workers now:

  • Track `self._weights_offloaded`. v2 sets it to `False` at the end
    of `__init__`; the Megatron guard uses `getattr(self, '_weights_offloaded', False)`
    because its `__init__` is split across many private helpers and
    the flag only becomes meaningful once a phase-level offload is
    possible.
  • Flip the flag in `offload_after_refit` and unflip in both
    `prepare_for_training` and `prepare_for_lp_inference`.
  • Call a new `_assert_weights_on_device(method_name)` helper as the
    first line of `train`, `get_logprobs`, and (v2 only) `score`.
  • Gain docstrings on the prepare/offload methods that make the
    contract explicit; a minimal sketch of the full guard pattern
    follows below.

The error text matches #2392 — same format, same hint, same issue link
— so the user-visible behaviour is uniform across the three workers.
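
For reference, here is a minimal sketch of the guard pattern described above.
The flag, helper, and method names come from this PR; the class name and the
method bodies are placeholders, and the real workers differ in the details:

    # Minimal sketch of the guard pattern; class name and bodies are placeholders.
    class PolicyWorkerSketch:
        def __init__(self):
            # Phase-level offload state; False means weights are resident on GPU.
            self._weights_offloaded = False

        def offload_after_refit(self):
            ...  # move weights / optimizer state to CPU
            self._weights_offloaded = True

        def prepare_for_training(self):
            ...  # move weights back to GPU for training
            self._weights_offloaded = False

        def prepare_for_lp_inference(self):
            ...  # move weights back to GPU for logprob inference / scoring
            self._weights_offloaded = False

        def _assert_weights_on_device(self, method_name: str) -> None:
            # getattr covers the Megatron case where __init__ never set the flag.
            if getattr(self, "_weights_offloaded", False):
                raise RuntimeError(
                    f"{method_name}() was called while the policy weights are "
                    "offloaded to CPU. Call the appropriate prepare step "
                    "(prepare_for_training() / prepare_for_lp_inference()) after "
                    "offload_after_refit(). See issue NVIDIA-NeMo#1141."
                )

        def train(self, *args, **kwargs):
            self._assert_weights_on_device("train")
            ...  # forward / backward on GPU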

Coverage

New `tests/unit/models/policy/test_worker_offload_guard_v2_mcore.py`:

  • v2 tests (no marker, run at L0):
    • `test_v2_passes_when_on_gpu`
    • `test_v2_train_message_points_at_prepare_for_training`
    • `test_v2_inference_message_points_at_prepare_for_lp_inference` (parametrized over `get_logprobs` / `score`)
  • Megatron tests (`@pytest.mark.mcore`):
    • `test_passes_when_on_gpu`
    • `test_passes_when_flag_unset` — extra case covering the
      `getattr(..., False)` branch.
    • `test_train_message`, `test_get_logprobs_message`

The tests build a bare worker via `__new__` and patch the flag, so
they need neither Ray, distributed init, nor a real model.
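
As an illustration of that pattern, a sketch under the assumption that the v2
worker is importable as shown (the import path is a guess; the flag, helper,
class, and test names come from this PR):

    # Sketch of the flag-patching test pattern; the import path is an assumption.
    import pytest

    from nemo_rl.models.policy.dtensor_policy_worker_v2 import DTensorPolicyWorkerV2


    def _bare_worker(offloaded: bool) -> DTensorPolicyWorkerV2:
        # __new__ skips __init__, so no Ray, distributed init, or model is needed.
        worker = DTensorPolicyWorkerV2.__new__(DTensorPolicyWorkerV2)
        worker._weights_offloaded = offloaded
        return worker


    def test_v2_passes_when_on_gpu():
        # No exception when the weights are resident on GPU.
        _bare_worker(offloaded=False)._assert_weights_on_device("train")


    def test_v2_train_message_points_at_prepare_for_training():
        with pytest.raises(RuntimeError, match="prepare_for_training"):
            _bare_worker(offloaded=True)._assert_weights_on_device("train")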

Stacking note

This branch is built on top of #2392's branch. Until #2392 merges, the
diff GitHub shows for this PR will include #2392's changes. After #2392
merges, this PR's diff will narrow to the additions documented above
(~221 lines).

Out of scope

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` — clean
  • CI `L0` (v2 tests run as part of `L0_Unit_Tests_Other.sh`)
  • CI `L0` with `--mcore-only` (Megatron tests).

Refs: #1141
Depends on: #2392

lonexreb added 2 commits May 4, 2026 03:08
…le offloaded (NVIDIA-NeMo#1141)

Custom training loops sometimes skip ``prepare_for_training()`` /
``prepare_for_lp_inference()`` after ``offload_after_refit()`` has been
called. The model is still on CPU, so the next ``train()``,
``get_logprobs()``, or ``score()`` call drives a CUDA op against
CPU-resident weights and crashes with the opaque

    CUDA error: an illegal memory access was encountered

There is no signal in that error pointing at the missing prepare step,
so debugging it is painful — especially for users writing their first
custom loop.

Track the offload state on the worker and surface a clear
``RuntimeError`` at the entry of each GPU-bound public method that
points the caller at the exact prepare step they skipped:

    train() was called while the policy weights are offloaded to CPU.
    This usually means a custom training loop forgot to call
    prepare_for_training() after offload_after_refit(). Call the
    appropriate prepare step before invoking train(); otherwise the
    underlying CUDA op will fail with "illegal memory access". See
    issue NVIDIA-NeMo#1141.
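
Put differently, the contract a custom loop has to respect looks like the
hypothetical sketch below; only the worker method names come from this change:

    # Hypothetical custom loop; only the worker method names are from this change.
    def run_custom_phase(worker, batch):
        worker.offload_after_refit()          # weights now live on CPU

        # ... refit / bookkeeping that does not touch the policy weights ...

        worker.prepare_for_training()         # move weights back to GPU first
        worker.train(batch)                   # guard passes: flag is False again

        worker.prepare_for_lp_inference()     # likewise before inference-side calls
        return worker.get_logprobs(batch)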

Implementation:

* Add ``self._weights_offloaded: bool`` (initialized False) to
  ``DTensorPolicyWorkerImpl``.
* ``offload_after_refit()`` flips it to True.
* ``prepare_for_training()`` and ``prepare_for_lp_inference()`` flip
  it back to False (and gain docstrings making the contract explicit).
* New ``_assert_weights_on_device(method_name)`` helper raises the
  detailed ``RuntimeError`` when the flag is set.
* ``train()``, ``get_logprobs()``, ``score()`` call the helper as the
  first line of their body.

The flag is *only* about the explicit phase-level offload performed by
``offload_after_refit()``. The ``dtensor_cfg.cpu_offload`` mode (which
keeps weights on CPU but transparently shuttles them per-layer) is
unaffected — the flag stays False under that mode and existing
behavior is preserved.

Coverage:

* New ``tests/unit/models/policy/test_dtensor_worker_offload_guard.py``
  exercises the on-device pass-through, the train()-side error
  message, and the inference-side error messages (parametrized over
  get_logprobs / score). The tests build a bare worker via
  ``__new__`` and patch the flag, so they run at the L0 tier without
  needing Ray, distributed init, or a real model.

Out of scope (intentionally): the same guard pattern would be useful in
``megatron_policy_worker.py`` and ``dtensor_policy_worker_v2.py``.
Those workers have a slightly different offload state model and merit
their own focused PRs.

Refs: NVIDIA-NeMo#1141
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
…o#1141)

Follow-up to NVIDIA-NeMo#2392 that brings the same offload guard to the
``DTensorPolicyWorkerV2`` and ``MegatronPolicyWorker`` paths so users
on the v2 dtensor backend and the Megatron-Core backend get the same
clear ``RuntimeError`` instead of a CUDA "illegal memory access" when
a custom training loop forgets ``prepare_for_*`` after
``offload_after_refit``.

Both workers now:

* Track ``self._weights_offloaded`` (initialized False on the v2
  worker; the Megatron worker uses ``getattr(..., False)`` because
  its ``__init__`` is split across many private helpers and the flag
  is only meaningful once a phase-level offload is possible).
* Flip the flag in ``offload_after_refit`` and unflip in both
  ``prepare_for_training`` and ``prepare_for_lp_inference``.
* Call a new ``_assert_weights_on_device(method_name)`` helper as the
  first line of ``train``, ``get_logprobs``, and (v2 only) ``score``.

Coverage:

* New ``tests/unit/models/policy/test_worker_offload_guard_v2_mcore.py``
  mirrors the v1 guard tests and adds an extra Megatron-only case
  covering the ``getattr(..., False)`` "flag-unset" branch (the
  Megatron worker may not have set the attribute yet on first use).
  All v2 tests run at L0; Megatron tests are marked
  ``@pytest.mark.mcore``.

Together with NVIDIA-NeMo#2392 this completes the policy-worker side of NVIDIA-NeMo#1141.

Refs: NVIDIA-NeMo#1141
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
@lonexreb lonexreb requested review from a team as code owners May 4, 2026 08:18

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…VIDIA-NeMo#1141)

Address review feedback on PRs NVIDIA-NeMo#2392/NVIDIA-NeMo#2393: the guard pattern was
applied to ``train``, ``get_logprobs``, and ``score`` but missed two
other GPU-bound public APIs that exhibit the same opaque-CUDA-crash
failure mode if called while weights are offloaded:

* ``get_topk_logits`` — exists on all three workers (dtensor v1/v2
  and Megatron) and runs a forward pass; exposed via
  ``lm_policy.py:get_topk_logits``.
* ``generate`` — Megatron-only public API for HuggingFace-framework
  generation; exposed via ``lm_policy.py:generate``.

Add ``self._assert_weights_on_device(...)`` as the first line of each.
The hint resolves to ``prepare_for_lp_inference()`` for all four
inference-side methods (``get_logprobs``, ``score``, ``get_topk_logits``,
``generate``) — same as before — since users of these APIs typically
run them after refit / offload.
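
One way to picture that hint resolution is the table below; it is only an
illustration, the method and prepare-step names come from this commit but the
mapping mechanism itself is an assumption:

    # Illustrative per-method hint table; the mechanism is an assumption.
    _PREPARE_HINTS = {
        "train": "prepare_for_training()",
        "get_logprobs": "prepare_for_lp_inference()",
        "score": "prepare_for_lp_inference()",
        "get_topk_logits": "prepare_for_lp_inference()",
        "generate": "prepare_for_lp_inference()",
    }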

Tests in ``test_dtensor_worker_offload_guard.py`` and
``test_worker_offload_guard_v2_mcore.py`` are extended to parametrize
over the new method names so the inference-side hint is locked in
end-to-end.

Refs: NVIDIA-NeMo#1141
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>

lonexreb commented May 4, 2026

Extended the offload guard to get_topk_logits and generate (commit 11da369f) per code review.

The original PRs (#2392 and this one) applied the guard to `train`, `get_logprobs`, and `score` but missed two other GPU-bound public APIs that exhibit the same opaque-CUDA-crash failure mode if called while weights are offloaded:

  • `get_topk_logits` — exists on all three workers (dtensor v1/v2 and Megatron) and runs a forward pass; exposed via `lm_policy.py:get_topk_logits`.
  • `generate` — Megatron-only public API for HuggingFace-framework generation; exposed via `lm_policy.py:generate`.

Both now call `self._assert_weights_on_device(...)` as the first line. The hint resolves to `prepare_for_lp_inference()` for all four inference-side methods, matching the existing convention.

Tests in `test_dtensor_worker_offload_guard.py` and `test_worker_offload_guard_v2_mcore.py` are extended to parametrize over the new method names so the hint is locked in end-to-end.
