
Group_topk: moe_fused_gate support num expert is not power of 2#2604

Merged
valarLip merged 1 commit into main from jun/group_topk on Apr 3, 2026

Conversation

@junhaha666 (Contributor)

Tested on MI308 (benchmark screenshots attached in the PR).


Copilot AI review requested due to automatic review settings April 3, 2026 05:45

github-actions bot commented Apr 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2604 --add-label <label>`

@junhaha666 junhaha666 requested a review from a team April 3, 2026 05:49

Copilot AI left a comment


Pull request overview

This PR extends the MoE fused gating/top-k path to handle non-power-of-two expert counts by relaxing the prior host-side constraint and adding specialized dispatch configurations, and updates the Python wrapper to drop the previous “power-of-2 experts only” limitation.

Changes:

  • Relaxed moe_fused_gate’s host-side “num_experts must be power-of-2” constraint and added tuned specializations for num_experts = 192 and 96.
  • Adjusted moe_fused_gate_impl loading logic to use scalar loads for dynamic params and vector loads for static params.
  • Updated biased_grouped_topk routing logic to no longer force a fallback for non-power-of-two expert counts.
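The routing change in the wrapper can be sketched as follows. This is a minimal illustrative sketch, not aiter's actual code: `is_supported_expert_count` is a hypothetical helper, and the specialized counts (96, 192) are taken from the PR description.

```python
def is_power_of_two(n: int) -> bool:
    # Standard bit trick: a positive power of two has exactly one bit set,
    # so n & (n - 1) clears that bit and yields zero.
    return n > 0 and (n & (n - 1)) == 0


def is_supported_expert_count(num_experts: int) -> bool:
    # Hypothetical helper illustrating the routing decision.
    # Before this PR: only power-of-2 expert counts took the fused path.
    # After: tuned specializations such as 96 and 192 are also accepted,
    # so the wrapper no longer forces a fallback for them.
    return is_power_of_two(num_experts) or num_experts in (96, 192)
```

With such a predicate, `biased_grouped_topk` could dispatch to `moe_fused_gate` for 96 or 192 experts instead of falling back to the unfused path.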

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
csrc/kernels/moe_fused_gate.cu Adds new expert-count specializations, changes input/bias loading strategy, and introduces a HIP device guard; removes (comments out) the power-of-2 check.
aiter/ops/topk.py Updates wrapper routing logic to use moe_fused_gate without checking num_experts power-of-2; updates copyright year.
Comments suppressed due to low confidence (1)

csrc/kernels/moe_fused_gate.cu:581

  • The comment above the divisibility check is incorrect: num_experts % num_expert_group == 0 does not imply num_expert_group is a power of 2. Also, multithread_reduce(..., thread_num) in hip_reduce.h only has explicit implementations for thread_num in {1,2,4,8,16,32}, so non-power-of-two num_expert_group will fall through and produce incorrect results. Please add a TORCH_CHECK that num_expert_group is a supported power-of-two (and ideally <= 32), and update/remove the misleading comment.
```cpp
// Check 2: Ensure that num_experts is divisible by num_expert_group. (this also means
// num_expert_group is power of 2)
TORCH_CHECK(num_experts % num_expert_group == 0,
            "num_experts must be divisible by num_expert_group, but got ",
            num_experts,
            " / ",
            num_expert_group);
```
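The reviewer's point is that divisibility does not imply a power of two (e.g. 192 % 3 == 0, but 3 is not a power of two), so the host-side check should validate `num_expert_group` directly. A minimal sketch of the suggested condition, written here in Python for clarity (the actual check would be a `TORCH_CHECK` in the C++ host code):

```python
def is_valid_expert_group(num_expert_group: int) -> bool:
    # Per the review comment, multithread_reduce only has specializations
    # for thread_num in {1, 2, 4, 8, 16, 32}, so the group count must be
    # a power of two no larger than 32. The bit trick num & (num - 1) == 0
    # holds exactly when a positive integer has a single bit set.
    return (0 < num_expert_group <= 32
            and (num_expert_group & (num_expert_group - 1)) == 0)
```

For example, `num_experts = 192` with `num_expert_group = 3` passes the divisibility check but would produce incorrect reductions; the stricter predicate rejects it.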


Review comment threads: csrc/kernels/moe_fused_gate.cu (5), aiter/ops/topk.py (2)
@valarLip valarLip merged commit a381890 into main Apr 3, 2026
37 of 39 checks passed
@valarLip valarLip deleted the jun/group_topk branch April 3, 2026 10:35
