Support Qwen3.5-35B-A3B PTPC model #889

Closed
zovonoir wants to merge 22 commits into main from opt-qwen35b-ptpc-linear


Conversation

@zovonoir
Contributor

Summary

This patch adds support for the PTPC version of the Qwen3.5-35B-A3B model.

Test plan

  • python -m compileall atom/model_ops/linear.py

Made with Cursor

zovonoir and others added 3 commits May 23, 2026 11:32
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@wuhuikx wuhuikx requested review from Jasen2201 and Yuechguo May 25, 2026 03:44
zovonoir and others added 2 commits May 25, 2026 15:23
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@zovonoir zovonoir force-pushed the opt-qwen35b-ptpc-linear branch from 7a5a457 to c44e701 Compare May 27, 2026 07:47
@zovonoir zovonoir changed the base branch from opt-qwen35b-mrope-fused to main May 27, 2026 07:47
valarLip previously approved these changes Jun 1, 2026
[fix][attn] fail fast when --page-size < kv element width
Copilot AI review requested due to automatic review settings June 4, 2026 05:38
Contributor

Copilot AI left a comment

Pull request overview

This PR adds runtime and kernel-level support needed for the PTPC variant of Qwen3.5-35B-A3B, including a fused Triton MRoPE path and an optional fast decode path for GDN attention, plus some guardrails for SGLang KV-cache layout shuffling.

Changes:

  • Add a specialized Triton fused Q/K MRoPE implementation and integrate it into Qwen3NextAttention.
  • Introduce an opt-in “lossy fast” GDN decode kernel path gated by a new env var (ATOM_ENABLE_GDN_DECODE_LOSSY_FAST).
  • Add early validation/error messaging for invalid SGLang --page-size values that would crash during KV-cache reshaping.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/test_envs.py` | Adds coverage for the new ATOM_ENABLE_GDN_DECODE_LOSSY_FAST env var default and override behavior. |
| `atom/utils/envs.py` | Registers ATOM_ENABLE_GDN_DECODE_LOSSY_FAST in the centralized env var registry. |
| `atom/plugin/sglang/attention_backend/sgl_attn_backend.py` | Adds fail-fast checks for invalid page sizes in the layout-shuffle KV write path. |
| `atom/models/qwen3_next.py` | Uses try_mrope_qk_fused() to accelerate Q/K rotary application when eligible. |
| `atom/model_ops/triton_mrope.py` | New Triton kernels implementing Qwen3.5-specific fused MRoPE for Q/K. |
| `atom/model_ops/linear.py` | Adjusts scale-shard sizing for packed shard loading under QuantType.per_1x128. |
| `atom/model_ops/fla_ops/fused_recurrent.py` | Adds a fused decode-time GDN update kernel (gdn_decode_update_lossy_fast). |
| `atom/model_ops/fla_ops/__init__.py` | Exports gdn_decode_update_lossy_fast from fla_ops. |
| `atom/model_ops/attention_gdn.py` | Adds an env-gated switch to the new fused decode kernel path for non-spec decode-only batches. |
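
The env-gated switch described for atom/model_ops/attention_gdn.py could look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: the helper names `env_flag` and `use_lossy_fast_decode`, the accepted truthy strings, and the batch parameters are hypothetical. Only the variable name and the "opt-in, non-spec, decode-only" conditions come from the review summary.

```python
import os

ENV_NAME = "ATOM_ENABLE_GDN_DECODE_LOSSY_FAST"


def env_flag(name, environ=None):
    """Return True only if the variable is set to a truthy string.

    Assumption: unset means disabled, since the path is described as opt-in.
    The accepted truthy spellings are illustrative.
    """
    environ = os.environ if environ is None else environ
    return environ.get(name, "0").strip().lower() in {"1", "true", "yes", "on"}


def use_lossy_fast_decode(environ, is_spec_decode, is_decode_only):
    """Hypothetical gate: fast path only for non-spec, decode-only batches."""
    return env_flag(ENV_NAME, environ) and not is_spec_decode and is_decode_only
```

Passing the environment as a mapping keeps the gate easy to exercise in a unit test without mutating the real process environment.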

Comment on lines +139 to +149
if block_size < x:
    raise ValueError(
        f"ATOM reshape_and_cache_shuffle_triton requires block_size (page_size) "
        f">= {x} for kv_cache_dtype={key_cache.dtype}, got block_size={block_size}. "
        f"The V cache template uses shape [num_blocks, num_kv_heads, "
        f"block_size // x, head_size, x], which collapses to a 0-sized dimension "
        f"and crashes view_as() when page_size < x. "
        f"Fix: launch sglang with `--page-size {x}` (or larger, e.g. 64). "
        f"This constraint applies to non-MLA models whose head_dim != 256 "
        f"(MLA models and head_dim==256 models take a different code path)."
    )
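
To see why the template shape in this message collapses, the arithmetic can be reproduced without any tensors. The sketch below only evaluates the shape `[num_blocks, num_kv_heads, block_size // x, head_size, x]` quoted in the error text. The helper names and the concrete sizes are illustrative assumptions, and `x = 8` assumes a 2-byte KV dtype under the `16 // element_size` rule used in the backend check.

```python
def v_cache_template_shape(num_blocks, num_kv_heads, block_size, head_size, x):
    """Shape quoted in the error message: the block dimension is split by x."""
    return (num_blocks, num_kv_heads, block_size // x, head_size, x)


def numel(shape):
    """Total number of elements for a shape tuple."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# Illustrative sizes only.
ok = v_cache_template_shape(4, 2, 16, 128, 8)   # block_size >= x
bad = v_cache_template_shape(4, 2, 1, 128, 8)   # block_size < x

print(ok, numel(ok))
print(bad, numel(bad))
```

With `block_size=1` the third dimension is `1 // 8 == 0`, so the template holds no elements even though the real cache does, which is the mismatch the fail-fast check reports up front.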
Comment on lines +304 to +318
if not self.use_mla and head_dim != 256:
    required_page_size = 16 // k_buffer.element_size()
    if self.page_size < required_page_size:
        raise ValueError(
            f"ATOM attention backend requires --page-size >= "
            f"{required_page_size} for non-MLA models with "
            f"head_dim={head_dim} and kv_cache_dtype={k_buffer.dtype} "
            f"(current --page-size={self.page_size}). "
            f"The internal layout-shuffle kernel computes "
            f"block_size // x with x={required_page_size}, which "
            f"degenerates to 0 when page_size < x and crashes during "
            f"CUDA graph capture. "
            f"Fix: launch sglang with `--page-size {required_page_size}` "
            f"(or larger, e.g. 64)."
        )
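
The minimum page size implied by `16 // k_buffer.element_size()` depends only on the byte width of a KV element. A minimal sketch of the same rule as a standalone validator follows. The function names are hypothetical and the element sizes are passed in directly rather than read from a tensor, so this is not the backend's actual code.

```python
def required_page_size(element_size_bytes):
    """Mirror of `16 // k_buffer.element_size()` from the check above."""
    return 16 // element_size_bytes


def validate_page_size(page_size, element_size_bytes):
    """Raise early, before any reshape, if the page size is too small."""
    required = required_page_size(element_size_bytes)
    if page_size < required:
        raise ValueError(
            f"--page-size must be >= {required} "
            f"for a {element_size_bytes}-byte KV element, got {page_size}"
        )
    return required


# 1-byte, 2-byte, and 4-byte elements need page sizes of at least 16, 8, and 4.
for size in (1, 2, 4):
    print(size, required_page_size(size))
```

Under this rule a narrower KV dtype raises the minimum: a 1-byte cache needs `--page-size 16` where a 2-byte cache needs only 8.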
Comment thread atom/model_ops/linear.py
Comment on lines 724 to 730
  for shard_id, shard_size in zip(loaded_shard_id, shard_sizes):
-     if param is getattr(self, "weight_scale", None) or param is getattr(
-         self, "input_scale", None
-     ):
-         shard_size //= 128
+     is_scale_param = param is getattr(
+         self, "weight_scale", None
+     ) or param is getattr(self, "input_scale", None)
+     if is_scale_param and self.quant_type == QuantType.per_1x128:
+         shard_size = (shard_size + 127) // 128
      shard = loaded_weight.narrow(self.tp_dim, current_offset, shard_size)
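
The effect of this change can be checked with plain integers. The sketch below contrasts the earlier floor division with the ceiling division now applied only under `QuantType.per_1x128`. The function names and the boolean `per_1x128` parameter are stand-ins introduced for illustration, not identifiers from atom/model_ops/linear.py.

```python
def old_scale_shard_size(shard_size, is_scale_param, group=128):
    """Earlier behavior: floor division for every scale parameter."""
    return shard_size // group if is_scale_param else shard_size


def new_scale_shard_size(shard_size, is_scale_param, per_1x128, group=128):
    """Updated behavior: ceiling division, and only for per_1x128 scales."""
    if is_scale_param and per_1x128:
        return (shard_size + group - 1) // group
    return shard_size


# A shard narrower than one 128-wide group still needs one scale entry.
print(old_scale_shard_size(64, True), new_scale_shard_size(64, True, True))
# A shard that is not a multiple of 128 rounds up instead of down.
print(old_scale_shard_size(192, True), new_scale_shard_size(192, True, True))
```

Floor division yields 0 for a 64-wide shard, so `narrow()` would select an empty slice of the scale tensor. Ceiling division keeps one scale entry per partially filled group, and other quantization types keep the shard size unchanged.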
@ZhiweiYan-96
Contributor

For the SGLang plugin side, please rebase onto the main commit that contains the refactored attention backend.

@zovonoir zovonoir closed this Jun 5, 2026


5 participants