Support Qwen3.5-35B-A3B PTPC model by zovonoir · Pull Request #889 · ROCm/ATOM

zovonoir · 2026-05-22T14:10:44Z

Summary

This patch adds support for the PTPC version of the Qwen3.5-35B-A3B model.

Test plan

python -m compileall atom/model_ops/linear.py

Made with Cursor

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

This PR adds runtime and kernel-level support needed for the PTPC variant of Qwen3.5-35B-A3B, including a fused Triton MRoPE path and an optional fast decode path for GDN attention, plus some guardrails for SGLang KV-cache layout shuffling.

Changes:

Add a specialized Triton fused Q/K MRoPE implementation and integrate it into Qwen3NextAttention.
Introduce an opt-in “lossy fast” GDN decode kernel path gated by a new env var (ATOM_ENABLE_GDN_DECODE_LOSSY_FAST).
Add early validation/error messaging for invalid SGLang --page-size values that would crash during KV-cache reshaping.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/test_envs.py` | Adds coverage for the new `ATOM_ENABLE_GDN_DECODE_LOSSY_FAST` env var default and override behavior. |
| `atom/utils/envs.py` | Registers `ATOM_ENABLE_GDN_DECODE_LOSSY_FAST` in the centralized env var registry. |
| `atom/plugin/sglang/attention_backend/sgl_attn_backend.py` | Adds fail-fast checks for invalid page sizes in the layout-shuffle KV write path. |
| `atom/models/qwen3_next.py` | Uses `try_mrope_qk_fused()` to accelerate Q/K rotary application when eligible. |
| `atom/model_ops/triton_mrope.py` | New Triton kernels implementing Qwen3.5-specific fused MRoPE for Q/K. |
| `atom/model_ops/linear.py` | Adjusts scale-shard sizing for packed shard loading under `QuantType.per_1x128`. |
| `atom/model_ops/fla_ops/fused_recurrent.py` | Adds a fused decode-time GDN update kernel (`gdn_decode_update_lossy_fast`). |
| `atom/model_ops/fla_ops/__init__.py` | Exports `gdn_decode_update_lossy_fast` from `fla_ops`. |
| `atom/model_ops/attention_gdn.py` | Adds an env-gated switch to the new fused decode kernel path for non-spec decode-only batches. |
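The env-gated switch described for `atom/model_ops/attention_gdn.py` can be pictured with a minimal sketch. Only the variable name `ATOM_ENABLE_GDN_DECODE_LOSSY_FAST`, its opt-in nature, and the "non-spec decode-only" condition come from the review summary. The helper names `env_flag` and `pick_decode_path`, the set of accepted truthy strings, and the returned labels are hypothetical and are not ATOM's actual registry API.

```python
import os


def env_flag(name, default=False, environ=None):
    """Read a boolean flag from the environment (parsing rules are an assumption)."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def pick_decode_path(is_decode_only, is_spec_decode, environ=None):
    """Choose the GDN decode path; hypothetical stand-in for the real dispatch."""
    enabled = env_flag("ATOM_ENABLE_GDN_DECODE_LOSSY_FAST", environ=environ)
    # The fast path is opt-in and, per the review summary, only applies to
    # decode-only batches that are not speculative decode.
    if enabled and is_decode_only and not is_spec_decode:
        return "lossy_fast"
    return "default"
```

Passing `environ` explicitly keeps the sketch testable without mutating the process environment.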

+    if block_size < x:
+        raise ValueError(
+            f"ATOM reshape_and_cache_shuffle_triton requires block_size (page_size) "
+            f">= {x} for kv_cache_dtype={key_cache.dtype}, got block_size={block_size}. "
+            f"The V cache template uses shape [num_blocks, num_kv_heads, "
+            f"block_size // x, head_size, x], which collapses to a 0-sized dimension "
+            f"and crashes view_as() when page_size < x. "
+            f"Fix: launch sglang with `--page-size {x}` (or larger, e.g. 64). "
+            f"This constraint applies to non-MLA models whose head_dim != 256 "
+            f"(MLA models and head_dim==256 models take a different code path)."
+        )


+        if not self.use_mla and head_dim != 256:
+            required_page_size = 16 // k_buffer.element_size()
+            if self.page_size < required_page_size:
+                raise ValueError(
+                    f"ATOM attention backend requires --page-size >= "
+                    f"{required_page_size} for non-MLA models with "
+                    f"head_dim={head_dim} and kv_cache_dtype={k_buffer.dtype} "
+                    f"(current --page-size={self.page_size}). "
+                    f"The internal layout-shuffle kernel computes "
+                    f"block_size // x with x={required_page_size}, which "
+                    f"degenerates to 0 when page_size < x and crashes during "
+                    f"CUDA graph capture. "
+                    f"Fix: launch sglang with `--page-size {required_page_size}` "
+                    f"(or larger, e.g. 64)."
+                )
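The arithmetic behind both guards can be sketched in plain Python. The expression `16 // element_size` is taken from the diff above. The function names `required_page_size` and `check_page_size` are hypothetical, and the element sizes in the comments (2 bytes for a 16-bit KV cache, 1 byte for an 8-bit one) are illustrative assumptions rather than values stated in the PR.

```python
def required_page_size(element_size_bytes, vector_bytes=16):
    """Minimum page size x, mirroring `16 // k_buffer.element_size()` in the diff."""
    if not 0 < element_size_bytes <= vector_bytes:
        raise ValueError(f"unsupported element size: {element_size_bytes}")
    return vector_bytes // element_size_bytes


def check_page_size(page_size, element_size_bytes):
    """Fail fast when block_size // x would collapse to a 0-sized dimension."""
    x = required_page_size(element_size_bytes)
    if page_size // x == 0:  # equivalent to page_size < x for positive sizes
        raise ValueError(
            f"page_size={page_size} is smaller than x={x}; "
            f"the [.., block_size // x, head_size, x] layout would be empty"
        )


# Illustrative values (assumed element sizes):
#   2-byte elements -> x = 8
#   1-byte elements -> x = 16
```

With these assumed sizes, a page size of 1 fails for either element width, while 16 passes for both.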


            for shard_id, shard_size in zip(loaded_shard_id, shard_sizes):
-                if param is getattr(self, "weight_scale", None) or param is getattr(
-                    self, "input_scale", None
-                ):
-                    shard_size //= 128
+                is_scale_param = param is getattr(
+                    self, "weight_scale", None
+                ) or param is getattr(self, "input_scale", None)
+                if is_scale_param and self.quant_type == QuantType.per_1x128:
+                    shard_size = (shard_size + 127) // 128
                shard = loaded_weight.narrow(self.tp_dim, current_offset, shard_size)
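The effect of switching from floor to ceiling division can be seen in a small standalone sketch. It is not the ATOM loader: `scale_shard_sizes` and `shard_offsets` are hypothetical helpers, and the shard sizes are made-up numbers chosen so that one shard is smaller than the 128-element group.

```python
def scale_shard_sizes(shard_sizes, group=128, round_up=True):
    """Map weight shard sizes to scale shard sizes for a 1x128 grouping."""
    if round_up:
        # New behaviour in the diff: (shard_size + 127) // 128
        return [(s + group - 1) // group for s in shard_sizes]
    # Old behaviour in the diff: shard_size //= 128
    return [s // group for s in shard_sizes]


def shard_offsets(sizes):
    """Running start offset of each shard along the sharded dimension."""
    offsets, current = [], 0
    for size in sizes:
        offsets.append(current)
        current += size
    return offsets


weight_shards = [4096, 512, 64]  # hypothetical packed shard sizes
old_sizes = scale_shard_sizes(weight_shards, round_up=False)
new_sizes = scale_shard_sizes(weight_shards)
print(old_sizes, shard_offsets(old_sizes))
print(new_sizes, shard_offsets(new_sizes))
```

With floor division the 64-wide shard gets a scale size of 0, so a `narrow` call would select nothing for it. Ceiling division gives it one scale entry. In the diff the rounding is additionally restricted to `QuantType.per_1x128`, so other quantization types keep the unscaled `shard_size`.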


ZhiweiYan-96 · 2026-06-04T05:55:23Z

For the sglang plugin side, please rebase onto the main commit that has the refactored attention backend.

zovonoir and others added 15 commits May 19, 2026 20:51

add gdn decode fast kernel

145ee16

resolve gdn code conflicts

eaf3fb3

Merge branch 'main' into opt-qwen35b

b09ed93

resolve gdn code conflicts

d008b09

solve mispelling error

db009de

solve redundant import error

9f8645f

add layernorm and rope optimization

08c2177

revert non-gdn optimization changes

1791f22

Co-authored-by: Cursor <cursoragent@cursor.com>

revert gdn changes

de840ba

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'origin/main' into opt-qwen35b

9adc15f

add gdn decode lossy fast kernel

71a373d

revert sglang benchmark file changes

4f11f07

Co-authored-by: Cursor <cursoragent@cursor.com>

gate gdn decode lossy fast path

3054c45

Co-authored-by: Cursor <cursoragent@cursor.com>

add fused mrope qk path

c2db458

Co-authored-by: Cursor <cursoragent@cursor.com>

support ptpc packed linear scales

ce80f33

Co-authored-by: Cursor <cursoragent@cursor.com>

zovonoir assigned qichu-yun May 22, 2026

zovonoir requested a review from qichu-yun May 22, 2026 14:16

zovonoir and others added 3 commits May 23, 2026 11:32

address gdn decode review comments

010ef3e

Co-authored-by: Cursor <cursoragent@cursor.com>

tighten mrope fused path guards

8846dff

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge PR 888 review fixes into PTPC branch

7a5a457

wuhuikx requested review from Jasen2201 and Yuechguo May 25, 2026 03:44

zovonoir and others added 2 commits May 25, 2026 15:23

fail fast on invalid attention page size

fe7ff92

Co-authored-by: Cursor <cursoragent@cursor.com>

support ptpc packed linear scales

c44e701

Co-authored-by: Cursor <cursoragent@cursor.com>

zovonoir force-pushed the opt-qwen35b-ptpc-linear branch from 7a5a457 to c44e701 May 27, 2026 07:47

zovonoir changed the base branch from opt-qwen35b-mrope-fused to main May 27, 2026 07:47

wuhuikx requested review from ZhiweiYan-96 and wanzhenchn May 28, 2026 03:10

ci: trigger workflows after base change

d36bada

sunway513 mentioned this pull request May 30, 2026

ATOM Development Roadmap (2026 Q2) sunway513/ATOM#63

Closed

sunway513 mentioned this pull request May 31, 2026

ATOM Development Roadmap (2026 Q2) #988

Open

zovonoir mentioned this pull request Jun 1, 2026

Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK #838

Merged

valarLip previously approved these changes Jun 1, 2026

Merge pull request #905 from ROCm/fix-attn-page-size-guard

d3e0ed0

[fix][attn] fail fast when --page-size < kv element width

Copilot AI review requested due to automatic review settings June 4, 2026 05:38

zovonoir dismissed valarLip’s stale review via d3e0ed0 June 4, 2026 05:38

Copilot started reviewing on behalf of zovonoir June 4, 2026 05:38

Copilot AI reviewed Jun 4, 2026

zovonoir closed this Jun 5, 2026

zovonoir wants to merge 22 commits into main from opt-qwen35b-ptpc-linear

5 participants
