Reset AG_pipeline bucket status after validation step. #3155
gautham-kollu merged 10 commits into NVIDIA:main
Conversation
        model_chunk.ddp_config.overlap_param_gather
        and model_chunk.ddp_config.use_megatron_fsdp
    ):
        model_chunk.start_param_sync(sync_and_return=True)
Can start_param_sync(sync_and_return=True) be replaced with synchronize_param_gather here?
No, I tried that, but it is not possible because synchronize_param_gather is not part of the FullyShardedDataParallel class. We need to call synchronize_param_gather within start_param_sync.
I see. Could you try putting the synchronize_param_gather function on FullyShardedDataParallel (example)? The goal is to keep the start_param_sync function simple, thank you.
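Following the suggestion above, here is a minimal sketch of what exposing synchronize_param_gather directly on FullyShardedDataParallel could look like. The buffer attribute and its all_gather_params / wait_for_all_gather methods are illustrative names, not the actual Megatron-LM API; this only shows the shape of the refactor, with the wait logic moved out of start_param_sync.

```python
# Illustrative sketch only: the buffer object and its method names are
# assumptions, not the merged Megatron-LM implementation.

class FullyShardedDataParallel:
    def __init__(self, module, param_and_grad_buffer):
        self.module = module
        self.param_and_grad_buffer = param_and_grad_buffer

    def start_param_sync(self):
        # Kick off the (possibly overlapped) all-gather of parameters.
        # No sync_and_return flag needed: this method only starts the sync.
        self.param_and_grad_buffer.all_gather_params()

    def synchronize_param_gather(self):
        # Block until any in-flight parameter all-gather has completed,
        # so callers no longer need start_param_sync(sync_and_return=True).
        self.param_and_grad_buffer.wait_for_all_gather()
```

With this split, forward_backward_func can call model_chunk.synchronize_param_gather() directly, which is the cleanup the reviewer asked for.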
/ok to test 1be259b
1be259b to b4bef4f
Use synchronize_param_gather directly in forward_backward_func
Cleanup
5e36632 to 45016a4
/ok to test 5e06d9d
/ok to test 300df2f
/ok to test 67ab72e
    if forward_only:
        for model_chunk in [model]:
            if (
                model_chunk.ddp_config.overlap_param_gather
    def __getattr__(self, name: str) -> Union[Tensor, "Module"]:
        if "_parameters" in self.__dict__:
            _parameters = self.__dict__["_parameters"]
            if name in _parameters:
                return _parameters[name]
        if "_buffers" in self.__dict__:
            _buffers = self.__dict__["_buffers"]
            if name in _buffers:
                return _buffers[name]
        if "_modules" in self.__dict__:
            modules = self.__dict__["_modules"]
            if name in modules:
                return modules[name]
>       raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{name}'"
        )
E       AttributeError: 'GPTModel' object has no attribute 'ddp_config'

/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py:1967: AttributeError
------------------------------ Captured log call -------------------------------
WARNING  megatron.core.rerun_state_machine:rank_utils.py:43 Implicit initialization of Rerun State Machine!
WARNING  megatron.core.rerun_state_machine:rank_utils.py:43 RerunStateMachine initialized in mode RerunMode.DISABLED
______________ test_forward_vpp[tp_pp_vpp1-pp_layout1-False-True] ______________
check fsdp is enabled first
/ok to test 0db839b
None if not present
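The commits "check fsdp is enabled first" and "None if not present" respond to the AttributeError above: a plain GPTModel that is not wrapped in Megatron-FSDP has no ddp_config attribute, so it must be fetched defensively before the overlap check. A minimal sketch of that guard, where the helper name maybe_sync_param_gather is hypothetical and only the attribute names from the diff are taken from the PR:

```python
# Hypothetical helper illustrating the fix: fetch ddp_config with a None
# default ("None if not present") instead of letting nn.Module.__getattr__
# raise AttributeError on an unwrapped GPTModel.

def maybe_sync_param_gather(model_chunk):
    ddp_config = getattr(model_chunk, "ddp_config", None)
    if ddp_config is None:
        # Not wrapped in (Megatron-)FSDP: nothing to synchronize.
        return
    if ddp_config.overlap_param_gather and ddp_config.use_megatron_fsdp:
        model_chunk.synchronize_param_gather()
```

This keeps the forward-only path a no-op for bare models while still draining the overlapped parameter all-gather when Megatron-FSDP is active.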
/ok to test 23c4fd1
/ok to test 3b68346

🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24363433680
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24365170039
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24369618791
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/24371185458
What does this PR do ?
Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

Core 0.8)

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label
Add the Expert Review label when your PR is ready for review.

(Step 2): Collect the expert reviewers' reviews
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review
Add the Final Review label.

(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, select Cherry-pick after this PR has been merged to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.