[dev]: faster implementation of mHC fused kernels#4624
/claude review |
Review Summary
Large PR that refactors the mHC fused kernels. It adds a unified dispatch layer (Triton → cuTile → torch fallback), split-K for the `proj_rms` forward kernel, a fused `proj_rms_compute_h` forward+backward path, Triton implementations for `sinkhorn`/`h_aggregate`/`h_post_bda`, `BroadcastTensorFused` for efficient 3-way gradient accumulation, and cuTile autotuning via `cuda.tile_experimental`.
The architecture, test coverage, and numerical correctness look solid — new tests cover every new kernel, backend combination, autotune path, and the BroadcastTensorFused autograd function.
Only issue found: several blocks of leftover debug code (commented-out `print('hahahahahah')`, `# split_k = 1` overrides, dead commented-out kernel lines) that should be cleaned up before merge. See inline comments.
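For readers unfamiliar with the dispatch pattern named in the summary, here is a minimal sketch of a "first available backend wins" dispatcher following the Triton → cuTile → torch order. It is not the PR's code: `KernelDispatcher`, `BACKEND_ORDER`, `register`, `resolve`, and the `sum_of_squares` op are hypothetical names, and the kernels are pure-Python stand-ins so the example runs without a GPU.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Preference order taken from the review summary; the backend names are
# illustrative labels, not identifiers from the PR.
BACKEND_ORDER: Tuple[str, ...] = ("triton", "cutile", "torch")


class KernelDispatcher:
    """Pick the first available implementation of one op (hypothetical helper)."""

    def __init__(self, order: Sequence[str] = BACKEND_ORDER) -> None:
        self._order = tuple(order)
        self._impls: Dict[str, Callable] = {}

    def register(self, backend: str, fn: Callable, available: bool = True) -> None:
        if backend not in self._order:
            raise ValueError(f"unknown backend: {backend!r}")
        # Unavailable backends are skipped, so resolve() never returns them.
        if available:
            self._impls[backend] = fn

    def resolve(self, preferred: Optional[str] = None) -> Tuple[str, Callable]:
        # An explicit preference is tried first, then the default order.
        candidates = ([preferred] if preferred else []) + list(self._order)
        for name in candidates:
            if name in self._impls:
                return name, self._impls[name]
        raise RuntimeError("no backend registered for this op")

    def __call__(self, *args, **kwargs):
        _, fn = self.resolve()
        return fn(*args, **kwargs)


def _reference_sum_of_squares(xs):
    # Stand-in for a plain torch fallback; pure Python so the sketch runs anywhere.
    return sum(x * x for x in xs)


sum_of_squares = KernelDispatcher()
sum_of_squares.register("torch", _reference_sum_of_squares)
# A fused kernel would be registered only when its runtime imports cleanly;
# here the Triton slot is marked unavailable, so calls fall back to "torch".
sum_of_squares.register("triton", lambda xs: -1, available=False)
```

Calling `sum_of_squares([1, 2, 3])` goes through the fallback here. Keeping the reference implementation registered at all times is one way to let tests compare each fused backend against it.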
|
/claude review |
8d732b1 to d585a92
d585a92 to 0361866
|
/ok to test 0361866 |
@jingqiny-99, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test 8bfc93a |
|
/ok to test d5163e8 |
62db48a to cfa4916
|
/ok to test 62db48a |
@jingqiny-99, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
/ok to test cfa4916 |
|
/claude review |
383e1fe to 582637c
|
/ok to test 582637c |
|
/ok to test 1f678ab |
|
/ok to test 77584db |
What does this PR do?
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or mention @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.