
Update some HIP kernels to support different warp_size (topksoftmax/grouptopk, cache, sample) #2599

Merged
valarLip merged 6 commits into main from jun/wip_045_new on Apr 4, 2026

Conversation

@junhaha666 (Contributor)

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@junhaha666 junhaha666 requested review from a team and Copilot April 2, 2026 15:05

github-actions bot commented Apr 2, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2599 --add-label <label>`


Copilot AI left a comment


Pull request overview

This PR updates several HIP kernels and helpers to better handle GPUs with different warp (wavefront) sizes by replacing hard-coded 64 assumptions with WARP_SIZE / runtime queries, and tightening architecture gating for certain asm paths.

Changes:

  • Make topk-softmax (including grouped topk) kernels compute warp-related launch/reduction behavior using WARP_SIZE / get_warp_size_func() instead of assuming wave64.
  • Update cache reshape+quant kernels and sampling kernels to remove hard-coded warp-size template parameters and use shared warp-size helpers.
  • Gate MoE asm topk-softmax usage/tests to specific gfx targets (gfx942, gfx950).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| `op_tests/test_moeTopkSoftmax.py` | Skips asm path except on gfx942/gfx950 to match supported targets. |
| `csrc/kernels/topk_softmax_kernels.cu` | Makes launch bounds and load vectorization more warp-size aware; adjusts BYTES_PER_LDG selection. |
| `csrc/kernels/topk_softmax_kernels_group.cu` | Reworks reductions to use shared wave_reduce and makes defaults depend on WARP_SIZE. |
| `csrc/kernels/sample_kernels.cu` | Removes fixed warp-size template parameters to rely on shared warp utilities. |
| `csrc/kernels/moe_fused_gate.cu` | Computes rows-per-warp/CTA from WARP_SIZE inside the kernel for portability. |
| `csrc/kernels/cache_kernels.cu` | Uses WARP_SIZE in per-token-quant reshape kernel indexing and adjusts launch sizing. |
| `csrc/include/hip_reduce.h` | Changes default warp-size template parameters to use WARP_SIZE. |
| `csrc/include/aiter_hip_common.h` | Adds warning about using WARP_SIZE as a host-side constexpr. |
| `aiter/fused_moe.py` | Restricts asm fused_topk usage to gfx942/gfx950. |


Comment threads:

  • csrc/kernels/topk_softmax_kernels_group.cu (2 threads)
  • csrc/kernels/topk_softmax_kernels.cu (3 threads)
@valarLip valarLip merged commit 1e9d2cc into main Apr 4, 2026
44 of 47 checks passed
@valarLip valarLip deleted the jun/wip_045_new branch April 4, 2026 08:02
yzhou103 pushed a commit that referenced this pull request Apr 8, 2026
…topk, cache, sample) (#2599)

* update topk_softmax

* update hip group topk

* rm warpsize in  sample_kernels.cu

* update cache.cu

* update

* update2

3 participants