
fix(dtensor): clear error when train/get_logprobs/score run offloaded #2392

Open
lonexreb wants to merge 1 commit into NVIDIA-NeMo:main from lonexreb:fix/1141-helpful-prepare-for-error

Conversation

lonexreb commented May 4, 2026

Summary

Closes #1141 (dtensor path; mcore + dtensor-v2 are tracked as out-of-scope below).

Custom training loops sometimes skip `prepare_for_training()` /
`prepare_for_lp_inference()` after `offload_after_refit()` has been
called. The model is still on CPU, so the next `train()`,
`get_logprobs()`, or `score()` call drives a CUDA op against
CPU-resident weights and crashes with the opaque

```
CUDA error: an illegal memory access was encountered
```

There is no signal in that error pointing at the missing prepare step,
so debugging it is painful — especially for users writing their first
custom loop.
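
For illustration, the failing pattern is roughly the following (`policy` and `batch` are placeholder names, not identifiers from this repo):

```
policy.prepare_for_training()
metrics = policy.train(batch)

policy.offload_after_refit()      # weights and optimizer state move to CPU
# ... refit / generation happens elsewhere ...

# prepare_for_lp_inference() is missing here, so the weights are still on
# CPU; before this change the next line died with "illegal memory access".
logprobs = policy.get_logprobs(batch)
```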

Behavioural change

Track the offload state on the worker and surface a clear
`RuntimeError` at the entry of each GPU-bound public method:

```
RuntimeError: train() was called while the policy weights are offloaded
to CPU. This usually means a custom training loop forgot to call
prepare_for_training() after offload_after_refit(). Call the
appropriate prepare step before invoking train(); otherwise the
underlying CUDA op will fail with "illegal memory access". See
issue #1141.
```

The same shape of message is emitted for `get_logprobs()` and
`score()`, with the prepare step in the hint adjusted accordingly
(`prepare_for_lp_inference()` for both).

Implementation

  • Add `self._weights_offloaded: bool` (init `False`) on
    `DTensorPolicyWorkerImpl`.
  • `offload_after_refit()` flips it to `True`.
  • `prepare_for_training()` and `prepare_for_lp_inference()` flip it
    back to `False` (and gain docstrings that document the contract).
  • New `_assert_weights_on_device(method_name)` helper raises the
    detailed `RuntimeError` when the flag is set.
  • `train()`, `get_logprobs()`, `score()` call the helper as the
    first line of their body.

The flag is only about the explicit phase-level offload performed
by `offload_after_refit()`. The `dtensor_cfg.cpu_offload` mode
(which keeps weights on CPU but transparently shuttles them per-layer)
is unaffected — the flag stays `False` under that mode and the
existing behaviour is preserved exactly.
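
A minimal sketch of the resulting worker (argument lists, method bodies, and the exact message wording are simplified; only the flag handling is shown):

```
class DTensorPolicyWorkerImpl:
    def __init__(self, *args, **kwargs):
        # Only the explicit phase-level offload from offload_after_refit()
        # sets this flag; the dtensor_cfg.cpu_offload mode never touches it.
        self._weights_offloaded: bool = False

    def _assert_weights_on_device(self, method_name: str) -> None:
        if self._weights_offloaded:
            prepare = (
                "prepare_for_training()"
                if method_name == "train"
                else "prepare_for_lp_inference()"
            )
            raise RuntimeError(
                f"{method_name}() was called while the policy weights are "
                f"offloaded to CPU. This usually means a custom training loop "
                f"forgot to call {prepare} after offload_after_refit(). Call "
                f"the appropriate prepare step before invoking {method_name}(); "
                'otherwise the underlying CUDA op will fail with "illegal '
                'memory access". See issue #1141.'
            )

    def offload_after_refit(self):
        self._weights_offloaded = True
        # ... existing logic that moves weights/optimizer state to CPU ...

    def prepare_for_training(self):
        self._weights_offloaded = False
        # ... existing logic that moves everything back onto the GPU ...

    def prepare_for_lp_inference(self):
        self._weights_offloaded = False
        # ... existing logic ...

    def train(self, data):
        self._assert_weights_on_device("train")
        # ... existing training step ...

    def get_logprobs(self, data):
        self._assert_weights_on_device("get_logprobs")
        # ... existing forward pass ...

    def score(self, data):
        self._assert_weights_on_device("score")
        # ... existing scoring forward pass ...
```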

Coverage

New `tests/unit/models/policy/test_dtensor_worker_offload_guard.py`:

  • `test_assert_weights_on_device_passes_when_weights_on_gpu` — the
    guard is a no-op for all three callers when weights are on GPU.
  • `test_assert_weights_on_device_train_raises_with_helpful_message` —
    validates that the message names `train()` and
    `prepare_for_training()`, describes the underlying CUDA failure
    mode, and cites the issue number.
  • `test_assert_weights_on_device_inference_raises_with_helpful_message`
    — parametrized over `get_logprobs` / `score`, validates the
    inference-side message points at `prepare_for_lp_inference()`.

The tests build a bare worker via `__new__` and patch the flag, so
they run at the L0 tier without Ray, distributed init, or a real
model.
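
The `train()`-side check reduces to roughly this (the import path is an assumption about the module layout, not taken from the repo):

```
import pytest

# Illustrative import path; use wherever DTensorPolicyWorkerImpl actually lives.
from nemo_rl.models.policy.dtensor_policy_worker import DTensorPolicyWorkerImpl


def test_assert_weights_on_device_train_raises_with_helpful_message():
    worker = DTensorPolicyWorkerImpl.__new__(DTensorPolicyWorkerImpl)
    worker._weights_offloaded = True
    with pytest.raises(RuntimeError, match=r"prepare_for_training\(\)"):
        worker._assert_weights_on_device("train")
```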

Out of scope

The same guard pattern would be useful in
`megatron_policy_worker.py` and `dtensor_policy_worker_v2.py`.
Those workers have a slightly different offload state model
(MegatronCore's own offload semantics, FSDP2 vs FSDP1, etc.) and
merit their own focused PRs. Documentation of the contract via
docstrings on the prepare/offload methods covers the most common path
in the meantime.

Test plan

  • `ruff check` — clean
  • `ruff format` — clean
  • `python3 -m py_compile` — clean
  • CI `L0` (`L0_Unit_Tests_Other.sh` runs the new test file).

Refs: #1141

…le offloaded (NVIDIA-NeMo#1141)

Refs: NVIDIA-NeMo#1141
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
lonexreb requested review from a team as code owners May 4, 2026 08:08
copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lonexreb added a commit to lonexreb/RL that referenced this pull request May 4, 2026
…VIDIA-NeMo#1141)

Address review feedback on PRs NVIDIA-NeMo#2392/NVIDIA-NeMo#2393: the guard pattern was
applied to ``train``, ``get_logprobs``, and ``score`` but missed two
other GPU-bound public APIs that exhibit the same opaque-CUDA-crash
failure mode if called while weights are offloaded:

* ``get_topk_logits`` — exists on all three workers (dtensor v1/v2
  and Megatron) and runs a forward pass; exposed via
  ``lm_policy.py:get_topk_logits``.
* ``generate`` — Megatron-only public API for HuggingFace-framework
  generation; exposed via ``lm_policy.py:generate``.

Add ``self._assert_weights_on_device(...)`` as the first line of each.
The hint resolves to ``prepare_for_lp_inference()`` for all four
inference-side methods (``get_logprobs``, ``score``, ``get_topk_logits``,
``generate``) — same as before — since users of these APIs typically
run them after refit / offload.

Tests in ``test_dtensor_worker_offload_guard.py`` and
``test_worker_offload_guard_v2_mcore.py`` are extended to parametrize
over the new method names so the inference-side hint is locked in
end-to-end.

Refs: NVIDIA-NeMo#1141
Signed-off-by: lonexreb <reach2shubhankar@gmail.com>
lonexreb changed the title from "fix(dtensor): clear error when train/get_logprobs/score is called while offloaded" to "fix(dtensor): clear error when train/get_logprobs/score run offloaded" on May 5, 2026


Development

Successfully merging this pull request may close these issues.

More helpful error message if prepare_for_*() not called
