Fix optimizer CPU offload for megatron-fsdp dtensor param #4623
Conversation
Handle Megatron-FSDP DTensor parameters and gradients by operating on local shards before CPU optimizer offload copies. This avoids dispatching pin_memory/is_pinned through DTensor and lets pin_cpu_params control CPU parameter pinning.
/ok to test
@wplf, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 5da7d67
```python
def _to_local_if_dtensor(tensor):
    if HAVE_DTENSOR and isinstance(tensor, DTensor):
        return tensor.to_local()
```
BTW, if we don't need differentiability (as in this optimizer step), `dtensor._local_tensor` is sometimes more robust, since we can be sure we do not receive a copy of the original local Tensor. (Don't ask me why I know this.)
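The distinction pointed out here can be modeled with a toy mock (an assumption, so the snippet runs without torch): in real DTensor, `to_local()` goes through an autograd path and may hand back a wrapper around the shard, whereas `_local_tensor` is the attribute holding the shard itself.

```python
class Shard:
    """Mock stand-in for a local torch.Tensor shard."""
    pass


class DTensor:
    """Toy model of DTensor: to_local() may not return the stored object."""
    def __init__(self, shard):
        self._local_tensor = shard

    def to_local(self):
        # Modeled as returning a fresh wrapper object rather than the
        # stored shard, mimicking the autograd-preserving path.
        view = Shard()
        view.base = self._local_tensor
        return view


shard = Shard()
dt = DTensor(shard)
print(dt.to_local() is shard)     # -> False (may be a new wrapper)
print(dt._local_tensor is shard)  # -> True (always the original shard)
```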
Hi @cspades,
Currently, training crashes when using megatron-fsdp together with --optimizer-cpu-offload. The root cause is that the hybrid optimizer does not take mfsdp into consideration; using `DTensor._local_tensor` resolves the issue.
However, I haven't yet checked this fix thoroughly for training convergence.
I'll verify training stability and convert `DTensor.to_local()` to `DTensor._local_tensor` as soon as possible.
Summary
Tests