
Fix optimizer CPU offload for megatron-fsdp dtensor param #4623

Open
wplf wants to merge 1 commit into NVIDIA:dev from wplf:jinliang/fix-mfsdp-optimizer-offload

Conversation

@wplf
Member

@wplf wplf commented May 5, 2026

Summary

  • Handle DTensor parameters and gradients by operating on local shards before optimizer CPU offload copies.
  • Avoid dispatching pin_memory/is_pinned through DTensor and respect pin_cpu_params.

Tests

  • uv run isort megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py
  • PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=. python -m pytest tests/unit_tests/test_optimizer_cpu_offloading.py -q

Test result

(screenshot of the passing pytest run attached to the PR)

Handle Megatron-FSDP DTensor parameters and gradients by operating on local shards before CPU optimizer offload copies. This avoids dispatching pin_memory/is_pinned through DTensor and lets pin_cpu_params control CPU parameter pinning.
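
A minimal sketch of that approach (illustrative only, not the exact code in this PR; the helper make_cpu_copy and the surrounding structure are assumptions about how the hybrid optimizer builds its CPU-side parameter copies):

import torch

try:
    from torch.distributed.tensor import DTensor  # torch.distributed._tensor on older PyTorch
    HAVE_DTENSOR = True
except ImportError:
    HAVE_DTENSOR = False

def _local_view(tensor):
    # Unwrap a DTensor to its plain local shard so pin_memory()/is_pinned()
    # run on a regular torch.Tensor instead of dispatching through DTensor.
    if HAVE_DTENSOR and isinstance(tensor, DTensor):
        return tensor.to_local()
    return tensor

def make_cpu_copy(param, pin_cpu_params):
    # Build the CPU-side copy used by the offloaded optimizer (hypothetical helper);
    # pinning is applied only when pin_cpu_params is enabled.
    local = _local_view(param)
    cpu_copy = torch.empty_like(local, device="cpu")
    if pin_cpu_params and not cpu_copy.is_pinned():
        cpu_copy = cpu_copy.pin_memory()
    cpu_copy.copy_(local.detach(), non_blocking=True)
    return cpu_copy
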
@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wplf wplf changed the title Fix optimizer CPU offload for DTensor params Fix optimizer CPU offload for megatron-fsdp dtensor param May 5, 2026
@wplf wplf self-assigned this May 5, 2026
@wplf wplf marked this pull request as ready for review May 5, 2026 04:50
@wplf wplf requested review from a team as code owners May 5, 2026 04:50
@wplf
Member Author

wplf commented May 5, 2026

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

/ok to test

@wplf, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@wplf
Member Author

wplf commented May 5, 2026

/ok to test 5da7d67

@yaox12 yaox12 requested a review from shjwudp May 6, 2026 01:38
Member

@cspades cspades left a comment


@wplf Could you share some information on what the bug was? Curious what was the root cause and how it is related to DTensors!


def _to_local_if_dtensor(tensor):
    """Return the local shard when given a DTensor; otherwise return the tensor unchanged."""
    if HAVE_DTENSOR and isinstance(tensor, DTensor):
        return tensor.to_local()
    return tensor
Member


BTW, if we don't need differentiability (such as in this optimizer step), sometimes dtensor._local_tensor is more robust since we can be sure we do not receive a copy of the original local Tensor. (Don't ask me why I know this.)
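
A small sketch of that suggestion (names here are illustrative, not from this PR): since the optimizer step does not need autograd, reading the raw attribute skips the autograd-aware to_local() path and references the DTensor's own local storage directly.

from torch.distributed.tensor import DTensor

def optimizer_local_shard(tensor):
    # Raw local shard; in-place copies on it mutate the DTensor's actual storage.
    return tensor._local_tensor if isinstance(tensor, DTensor) else tensor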

Member Author

@wplf wplf May 7, 2026


Hi @cspades,

Currently, training crashes when megatron-fsdp is used together with --optimizer-cpu-offload. The root cause is that the hybrid optimizer does not take Megatron-FSDP DTensor parameters into account; using DTensor._local_tensor resolves the issue.

However, I haven't yet checked this fix thoroughly for training convergence.

I'll verify training stability and convert DTensor.to_local() to DTensor._local_tensor as soon as possible.

