Fix optimizer CPU offload for megatron-fsdp dtensor param #4623
Conversation
Handle Megatron-FSDP DTensor parameters and gradients by operating on local shards before CPU optimizer offload copies. This avoids dispatching pin_memory/is_pinned through DTensor and lets pin_cpu_params control CPU parameter pinning.
/ok to test
@wplf, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 5da7d67
```python
def _to_local_if_dtensor(tensor):
    if HAVE_DTENSOR and isinstance(tensor, DTensor):
        return tensor.to_local()
```
BTW, if we don't need differentiability (as in this optimizer step), `dtensor._local_tensor` is sometimes more robust, since we can be sure we do not receive a copy of the original local Tensor. (Don't ask me why I know this.)
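The distinction pointed out here can be modeled with a toy mock (an assumption, so the snippet runs without torch): in real DTensor, `to_local()` goes through an autograd path and may hand back a wrapper around the shard, whereas `_local_tensor` is the attribute holding the shard itself.

```python
class Shard:
    """Mock stand-in for a local torch.Tensor shard."""
    pass


class DTensor:
    """Toy model of DTensor: to_local() may not return the stored object."""
    def __init__(self, shard):
        self._local_tensor = shard

    def to_local(self):
        # Modeled as returning a fresh wrapper object rather than the
        # stored shard, mimicking the autograd-preserving path.
        view = Shard()
        view.base = self._local_tensor
        return view


shard = Shard()
dt = DTensor(shard)
print(dt.to_local() is shard)     # -> False (may be a new wrapper)
print(dt._local_tensor is shard)  # -> True (always the original shard)
```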
Hi @cspades,
Currently, training crashes when using megatron-fsdp together with --optimizer-cpu-offload. The root cause is that the hybrid optimizer does not take mfsdp into consideration; using `DTensor._local_tensor` resolves the issue.
However, I haven't yet checked this fix thoroughly for training convergence.
I'll verify training stability and convert `DTensor.to_local()` to `DTensor._local_tensor` as soon as possible.
Summary
Tests