
fix(moe): route SwiGLU MXFP4 unshuffled weights to CK-Tile instead of CK2stages #3518

Open: srinivamd wants to merge 4 commits into main from fix/swiglu-mxfp4-ck2stages-dispatch-rocm25478


@srinivamd

@srinivamd srinivamd commented Jun 3, 2026

Summary

Fixes ROCM-25478: AITER JIT build crash (fatal error: 'gemm_moe_ck2stages_lookup.h' file not found) when serving amd/gpt-oss-20b-w-mxfp4-a-bf16 with VLLM_ROCM_USE_AITER=1 on MI355 (gfx950).

Problem

AITER has two CK MoE 2-stages codegen systems:

| Path | Directory | swiglu support |
| --- | --- | --- |
| Old CK codegen | csrc/ck_gemm_moe_2stages_codegen/ | NO (choices=["silu", "gelu"]) |
| New CK Tile codegen | csrc/ck_tile_gemm_moe_2stages/ | YES (choices=["silu", "gelu", "swiglu"]) |
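
The effect of the two choices lists can be reproduced with standard-library argparse. This is only a sketch: the `accepts` helper and the parser name are hypothetical, and only the `--activation` flag and the two choice lists are taken from the table above. The real gen_instances.py scripts take other arguments that are not modeled here.

```python
import argparse
import contextlib
import io

# Choice lists copied from the table above.
OLD_CK_CHOICES = ["silu", "gelu"]
CK_TILE_CHOICES = ["silu", "gelu", "swiglu"]


def accepts(choices, activation):
    """Return True if a parser restricted to `choices` accepts `activation`."""
    parser = argparse.ArgumentParser(prog="gen_instances_sketch")  # hypothetical name
    parser.add_argument("--activation", choices=choices, default="silu")
    try:
        # argparse reports an invalid choice on stderr and raises SystemExit.
        with contextlib.redirect_stderr(io.StringIO()):
            parser.parse_args(["--activation", activation])
    except SystemExit:
        return False
    return True


if __name__ == "__main__":
    print("old CK accepts swiglu: ", accepts(OLD_CK_CHOICES, "swiglu"))
    print("CK Tile accepts swiglu:", accepts(CK_TILE_CHOICES, "swiglu"))
```

When the generator exits at argument parsing, it writes no output files, which is consistent with the missing gemm_moe_ck2stages_lookup.h reported in the summary.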

When serving gpt-oss-20b (SwiGLU + MXFP4) with unshuffled HuggingFace weights (is_shuffled=False), the dispatch in fused_moe.py falls through three guards:

  1. FlyDSL guard (line ~1375): requires is_shuffled=True → fails for unshuffled weights
  2. CK-Tile heuristic (line ~1455): excludes Swiglu + fp4x2 activations → skipped
  3. CK2stages fallthrough (line ~1491): accepts fp4x2 weights → matches → invokes old CK codegen with --activation swiglu → argparse rejects → no gemm_moe_ck2stages_lookup.h generated → hipcc crash
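
The fall-through above can be illustrated with a toy first-match dispatcher. This is a simplified model, not the code in fused_moe.py: `MoeConfig`, `select_backend_before_fix`, and the string values are hypothetical, and each guard is reduced to the single condition named in the list (the real guards check more fields).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoeConfig:
    activation: str    # e.g. "swiglu" or "silu"
    q_dtype_w: str     # weight quantization dtype, e.g. "fp4x2"
    is_shuffled: bool  # False for unshuffled HuggingFace weights


def select_backend_before_fix(cfg: MoeConfig) -> str:
    """Toy model of the pre-fix dispatch order; the first matching guard wins."""
    # 1. FlyDSL guard: modeled by its is_shuffled=True requirement.
    if cfg.is_shuffled and cfg.activation == "swiglu" and cfg.q_dtype_w == "fp4x2":
        return "flydsl"
    # 2. CK-Tile heuristic: modeled as excluding the swiglu + fp4x2 combination.
    if cfg.q_dtype_w == "fp4x2" and cfg.activation != "swiglu":
        return "cktile"
    # 3. CK2stages fallthrough: accepts any fp4x2 weights, including swiglu.
    if cfg.q_dtype_w == "fp4x2":
        return "ck2stages"
    return "other"


if __name__ == "__main__":
    unshuffled = MoeConfig(activation="swiglu", q_dtype_w="fp4x2", is_shuffled=False)
    print(select_backend_before_fix(unshuffled))
```

Under these assumptions the unshuffled SwiGLU + MXFP4 configuration reaches the CK2stages branch, while the same configuration with is_shuffled=True is taken by the FlyDSL guard.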

PRs #2972, #3123, and #3153 added FlyDSL SwiGLU MXFP4 support, but the FlyDSL guard requires is_shuffled=True, so that support is unreachable for unshuffled gpt-oss weights.

Fix

Insert a new dispatch guard between the CK-Tile heuristic and the CK2stages fallthrough. The guard catches SwiGLU + MXFP4 combinations and routes them to CK Tile, which already supports swiglu via act_dict["swiglu"] = 2 in csrc/ck_tile_gemm_moe_2stages/gen_instances.py, instead of letting them fall through to the old CK codegen, which structurally cannot handle swiglu.

The guard matches: activation == Swiglu AND q_dtype_w == fp4x2 AND q_type == per_1x32 AND dtype in [bf16, fp16] AND no explicit kernelName1.
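
The match condition can be written as a single predicate. The sketch below is hypothetical: the function name, the string stand-ins for AITER's activation and dtype enums, and the treatment of "no explicit kernelName1" as an empty or None value are all assumptions made for illustration, not the code added by this PR.

```python
def matches_swiglu_mxfp4_guard(activation, q_dtype_w, q_type, dtype, kernel_name1=None):
    """Return True when a config should be routed to CK Tile by the new guard.

    All arguments are plain strings here; the real dispatch uses AITER's own
    enum and dtype objects (assumption: they compare the same way).
    """
    return (
        activation == "swiglu"
        and q_dtype_w == "fp4x2"
        and q_type == "per_1x32"
        and dtype in ("bf16", "fp16")
        and not kernel_name1  # an explicitly requested kernel bypasses the guard
    )


if __name__ == "__main__":
    # gpt-oss style config: SwiGLU, MXFP4 weights, bf16 compute.
    print(matches_swiglu_mxfp4_guard("swiglu", "fp4x2", "per_1x32", "bf16"))
    # Same config with a pinned kernel name (placeholder value) is left alone.
    print(matches_swiglu_mxfp4_guard("swiglu", "fp4x2", "per_1x32", "bf16", "some_kernel"))
```

Because every clause must hold, non-SwiGLU MXFP4 configurations and configurations with a pinned kernel name do not match this guard and continue to the existing CK2stages branch.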

Test plan

# Reproduce the failure (before fix):
VLLM_ROCM_USE_AITER=1 vllm serve amd/gpt-oss-20b-w-mxfp4-a-bf16 \
  --tokenizer openai/gpt-oss-20b \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enforce-eager

# Should no longer crash with 'gemm_moe_ck2stages_lookup.h' not found

Related

  • ROCM-25478 — original bug report
  • #2972 — FlyDSL SwiGLU MXFP4 path (requires is_shuffled=True)
  • #3470 — native MXFP4 MoE backend (replaces both codegen paths, not yet merged)
  • vllm-project/vllm#36193 — same root cause class in vLLM

… CK2stages

The old CK2stages codegen (gen_instances.py) only supports silu/gelu
activations. Passing swiglu causes it to never generate
gemm_moe_ck2stages_lookup.h, crashing with a hipcc fatal error.

When serving gpt-oss-20b-w-mxfp4-a-bf16 (SwiGLU + MXFP4) with
unshuffled HuggingFace weights (is_shuffled=False), the dispatch in
fused_moe.py falls through three guards:
1. FlyDSL guard requires is_shuffled=True - fails
2. CK-Tile heuristic excludes Swiglu+fp4x2 activations - skipped
3. CK2stages fallthrough accepts fp4x2 weights - matches then crashes

Insert a catch-all guard before the CK2stages fallthrough that routes
SwiGLU + MXFP4 combinations to CK-Tile, which already supports swiglu
(act_dict["swiglu"] = 2 in CK Tile gen_instances.py).

Fixes: ROCM-25478
@srinivamd srinivamd requested a review from a team June 3, 2026 11:48
@github-actions
Contributor

github-actions Bot commented Jun 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3518 --add-label <label>

The previous commit accidentally deleted the CK2stages if-block body
(condition continuation, flydsl/cktile/ck2stages stage2 dispatch, and
return) when inserting the new SwiGLU guard above it. This left a
truncated `if` expression with `return` inside the condition — a syntax
error caught by ruff CI.

Restore the full CK2stages block from main so non-SwiGLU MXFP4 paths
(and all other CK2stages-eligible configurations) continue to work.
@srinivamd
Author

Fixed in the latest push — the prior commit accidentally truncated the CK2stages if block when inserting the new SwiGLU guard above it, leaving return inside an incomplete condition expression. The full block (condition, flydsl/cktile/ck2stages stage2 dispatch, and return MOEMetadata(...)) is now restored. Both ruff syntax errors on line 1527 should be resolved.

@srinivamd
Author

@ROCm/team_aiter please review and merge if it is ok
