[draft] support async param gather in layer-wise optimizer#2787
FDecaYed wants to merge 1 commit into NVIDIA:dev from
Conversation
    async_op=async_op,
    )
    else:
        assert async_op, "Layer-wise optimizer requires overlap_param_gather=True"
Does async allgather with layer-wise still require `use-distributed-optimizer`? I think yes, so that DDP can make the buckets, right?
For simplicity, to demo the idea, I didn't touch that part. Currently, if `use-distributed-optimizer` is off, all async-related functionality is turned off and the code errors out.
But technically this is not required; we just need to change those checks from `if use-distributed-optimizer` to `if (use-distributed-optimizer or use-layer-wise)`.
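A minimal sketch of the gating change described above (the function and flag names here are illustrative, not the actual Megatron-LM code):

```python
# Hypothetical sketch of relaxing the async-gather gating: instead of
# requiring use_distributed_optimizer alone, the layer-wise optimizer
# also qualifies. Names are illustrative, not Megatron-LM's real API.

def async_gather_enabled(use_distributed_optimizer: bool,
                         use_layer_wise: bool,
                         overlap_param_gather: bool) -> bool:
    """Return True when the async param all-gather machinery should be active."""
    # Old check: only the distributed-optimizer path enabled async gather.
    # New check: the layer-wise optimizer path qualifies as well.
    return (use_distributed_optimizer or use_layer_wise) and overlap_param_gather

# Mirrors the assert in the diff above: the layer-wise path requires
# overlap_param_gather to be on.
assert async_gather_enabled(False, True, True)
assert not async_gather_enabled(False, True, False)
```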
mkhona-nvidia left a comment
Another feature is that the current layer-wise optimizer allows resharding EP sizes in the middle of training. AFAIK, using Adam's distopt does not allow this because of the way the DDP buckets are constructed. Can we avoid this and continue to allow reshardable EP sizes in the middle of training with layer-wise?
    pg_collection: Optional[ProcessGroupCollection] = None,
    init_state_fn_list: Optional[List[Callable]] = None,
    model_chunks: Optional[List] = None,
    async_allgather: Optional[bool] = False,
Reuse `overlap_param_gather` if the flag indicates the same thing; no need to introduce new names.
My rough feeling is it should still be supported. The async feature just moves the param allgather from within `optimizer.step` into a forward pre-hook that calls the bucket function; checkpointing should not be affected.
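The comment above describes moving the all-gather out of `optimizer.step` and into a forward pre-hook so that the gather for the next bucket overlaps with the current layer's compute. Here is a dependency-free sketch of that scheduling idea; the classes and method names (`Bucket`, `start_param_sync`, etc.) are illustrative stand-ins, not Megatron-LM's actual infrastructure:

```python
# Pure-Python sketch (no torch dependency) of the overlap scheme: each
# layer's pre-hook waits on its own bucket's gather, then launches the
# async gather for the next layer's bucket. All names are illustrative.

class Bucket:
    def __init__(self, name):
        self.name = name
        self.gather_issued = False
        self.gather_done = False

    def start_param_sync(self):
        # Real code would launch an async all-gather here.
        self.gather_issued = True

    def finish_param_sync(self):
        # Real code would wait on the async handle here.
        assert self.gather_issued, "gather must be issued before the layer runs"
        self.gather_done = True

class Layer:
    def __init__(self, bucket):
        self.bucket = bucket
        self.pre_hooks = []

    def register_forward_pre_hook(self, fn):
        self.pre_hooks.append(fn)

    def forward(self, x):
        for hook in self.pre_hooks:
            hook(self)
        return x + 1  # stand-in for real forward compute

layers = [Layer(Bucket(f"bucket{i}")) for i in range(3)]

# Pre-hook: ensure this layer's params are gathered, then prefetch the next.
def make_hook(next_layer):
    def hook(layer):
        layer.bucket.finish_param_sync()
        if next_layer is not None:
            next_layer.bucket.start_param_sync()
    return hook

for i, layer in enumerate(layers):
    nxt = layers[i + 1] if i + 1 < len(layers) else None
    layer.register_forward_pre_hook(make_hook(nxt))

layers[0].bucket.start_param_sync()  # the first gather is kicked off eagerly
x = 0
for layer in layers:
    x = layer.forward(x)
assert all(l.bucket.gather_done for l in layers)
```

Since only the *timing* of the gather changes, not which parameters end up materialized, this is consistent with the claim that checkpointing should be unaffected.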
I guess the issue is that we don't have a good understanding of why AdamW's distopt cannot allow resharding of EP in the middle of training. My feeling is that it is because of the way the DDP buckets are made (a separate bucketing scheme for EP=true and a separate one for the rest of the network). If the layer-wise distopt's `overlap_param_gather` uses the same scheme again, I think we will run into the same issue.
I think the EP buckets are separate from the regular DP weights regardless of whether dist-opt is on, and async gather doesn't affect that either.
Integrate async param all-gather from upstream PR NVIDIA#2787 so that dist_muon/dist_mop can overlap parameter all-gather with forward compute via DDP's existing bucket and forward-pre-hook infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What does this PR do?
This is draft code for now, so it likely doesn't run. It's meant to demo how to support the feature before a proper implementation.
Current Architecture
DistributedOptimizer implements async param gather with these components:
Changes needed for layer-wise
- `lw_params_list` in addition to the `params_list` that each bucket already holds (think of it as per-bucket layer-wise sharding)
- `all_gather_into_tensor` to `all_gather` (`_v`)
- `finish_param_sync()`
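A rough sketch of what the per-bucket change might look like. The `LayerWiseBucket` class and the `all_gather_uneven` helper are hypothetical; only `params_list`, `lw_params_list`, and `finish_param_sync()` come from the list above. The point of the switch from `all_gather_into_tensor` to the list-based `all_gather` (or an `all_gather_v`-style variant) is that layer-wise shards need not be equal-sized across ranks:

```python
# Pure-Python sketch of the bucket changes listed above. Shard sizes are
# made uneven on purpose: torch.distributed.all_gather_into_tensor assumes
# equal-sized per-rank shards, while the list-based all_gather (or an
# all_gather_v-style collective) can handle uneven ones. All class and
# helper names here are hypothetical.

def all_gather_uneven(shards):
    """Stand-in for a list-based all_gather over uneven per-rank shards."""
    out = []
    for shard in shards:
        out.extend(shard)
    return out

class LayerWiseBucket:
    def __init__(self, params_list, lw_params_list):
        self.params_list = params_list        # full params the bucket already holds
        self.lw_params_list = lw_params_list  # per-rank layer-wise shards

    def finish_param_sync(self):
        # Gather the uneven per-rank shards back into the full param list.
        gathered = all_gather_uneven(self.lw_params_list)
        assert len(gathered) == len(self.params_list)
        self.params_list[:] = gathered

bucket = LayerWiseBucket(
    params_list=[0.0] * 5,
    lw_params_list=[[1.0, 2.0], [3.0], [4.0, 5.0]],  # uneven shards: 2, 1, 2
)
bucket.finish_param_sync()
assert bucket.params_list == [1.0, 2.0, 3.0, 4.0, 5.0]
```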