Support Qwen3.5-35B-A3B PTPC model #889
Closed
zovonoir wants to merge 22 commits into
Co-authored-by: Cursor <cursoragent@cursor.com>
7a5a457 to c44e701
valarLip previously approved these changes on Jun 1, 2026
[fix][attn] fail fast when --page-size < kv element width
Pull request overview
This PR adds runtime and kernel-level support needed for the PTPC variant of Qwen3.5-35B-A3B, including a fused Triton MRoPE path and an optional fast decode path for GDN attention, plus some guardrails for SGLang KV-cache layout shuffling.
Changes:
- Add a specialized Triton fused Q/K MRoPE implementation and integrate it into `Qwen3NextAttention`.
- Introduce an opt-in "lossy fast" GDN decode kernel path gated by a new env var (`ATOM_ENABLE_GDN_DECODE_LOSSY_FAST`).
- Add early validation/error messaging for invalid SGLang `--page-size` values that would crash during KV-cache reshaping.
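
The opt-in gating in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in `atom/utils/envs.py`: the helper `env_flag`, the function `use_lossy_fast_decode`, and the set of accepted truthy strings are assumptions made for the example. Only the variable name `ATOM_ENABLE_GDN_DECODE_LOSSY_FAST` comes from the PR.

```python
import os

ENV_NAME = "ATOM_ENABLE_GDN_DECODE_LOSSY_FAST"  # name taken from the PR


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: parse a boolean env var, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Assumed convention: a few common truthy spellings enable the flag.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def use_lossy_fast_decode() -> bool:
    # Assumed default: disabled, since the PR describes the path as opt-in.
    return env_flag(ENV_NAME, default=False)
```

Reading the flag through one helper keeps the default in a single place, which is the kind of default/override behavior a test such as `tests/test_envs.py` would exercise.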
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `tests/test_envs.py` | Adds coverage for the new `ATOM_ENABLE_GDN_DECODE_LOSSY_FAST` env var default and override behavior. |
| `atom/utils/envs.py` | Registers `ATOM_ENABLE_GDN_DECODE_LOSSY_FAST` in the centralized env var registry. |
| `atom/plugin/sglang/attention_backend/sgl_attn_backend.py` | Adds fail-fast checks for invalid page sizes in the layout-shuffle KV write path. |
| `atom/models/qwen3_next.py` | Uses `try_mrope_qk_fused()` to accelerate Q/K rotary application when eligible. |
| `atom/model_ops/triton_mrope.py` | New Triton kernels implementing Qwen3.5-specific fused MRoPE for Q/K. |
| `atom/model_ops/linear.py` | Adjusts scale-shard sizing for packed shard loading under `QuantType.per_1x128`. |
| `atom/model_ops/fla_ops/fused_recurrent.py` | Adds a fused decode-time GDN update kernel (`gdn_decode_update_lossy_fast`). |
| `atom/model_ops/fla_ops/__init__.py` | Exports `gdn_decode_update_lossy_fast` from `fla_ops`. |
| `atom/model_ops/attention_gdn.py` | Adds an env-gated switch to the new fused decode kernel path for non-spec decode-only batches. |
Comment on lines +139 to +149

    if block_size < x:
        raise ValueError(
            f"ATOM reshape_and_cache_shuffle_triton requires block_size (page_size) "
            f">= {x} for kv_cache_dtype={key_cache.dtype}, got block_size={block_size}. "
            f"The V cache template uses shape [num_blocks, num_kv_heads, "
            f"block_size // x, head_size, x], which collapses to a 0-sized dimension "
            f"and crashes view_as() when page_size < x. "
            f"Fix: launch sglang with `--page-size {x}` (or larger, e.g. 64). "
            f"This constraint applies to non-MLA models whose head_dim != 256 "
            f"(MLA models and head_dim==256 models take a different code path)."
        )
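
The collapse described in this error message can be reproduced with plain integer arithmetic. The sketch below only evaluates the shape quoted in the message; the function names and the example sizes are made up for illustration, and no actual cache tensor or Triton kernel is involved.

```python
from math import prod


def v_cache_template_shape(num_blocks, num_kv_heads, block_size, head_size, x):
    """Shape quoted in the error message:
    [num_blocks, num_kv_heads, block_size // x, head_size, x]."""
    return (num_blocks, num_kv_heads, block_size // x, head_size, x)


def flat_numel(num_blocks, num_kv_heads, block_size, head_size):
    """Element count of the un-shuffled V cache (illustrative)."""
    return num_blocks * num_kv_heads * block_size * head_size


# Example sizes are hypothetical; x = 8 corresponds to a 2-byte KV dtype.
ok_shape = v_cache_template_shape(4, 2, block_size=8, head_size=128, x=8)
bad_shape = v_cache_template_shape(4, 2, block_size=1, head_size=128, x=8)

# With block_size >= x the template holds every element of the flat buffer.
ok_matches = prod(ok_shape) == flat_numel(4, 2, 8, 128)
# With block_size < x the third dimension is 0, so the template is empty
# while the real buffer is not, and a view between the two cannot succeed.
bad_is_empty = prod(bad_shape) == 0 and flat_numel(4, 2, 1, 128) > 0
```

Checking `block_size < x` before building the template turns a shape mismatch deep inside the kernel path into an immediate, actionable `ValueError`.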
Comment on lines +304 to +318

    if not self.use_mla and head_dim != 256:
        required_page_size = 16 // k_buffer.element_size()
        if self.page_size < required_page_size:
            raise ValueError(
                f"ATOM attention backend requires --page-size >= "
                f"{required_page_size} for non-MLA models with "
                f"head_dim={head_dim} and kv_cache_dtype={k_buffer.dtype} "
                f"(current --page-size={self.page_size}). "
                f"The internal layout-shuffle kernel computes "
                f"block_size // x with x={required_page_size}, which "
                f"degenerates to 0 when page_size < x and crashes during "
                f"CUDA graph capture. "
                f"Fix: launch sglang with `--page-size {required_page_size}` "
                f"(or larger, e.g. 64)."
            )
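
To make the threshold concrete, the expression `16 // k_buffer.element_size()` from this hunk can be evaluated for a few element widths. The sketch below is standalone Python that mirrors the check without importing torch; `check_page_size` and the dtype labels in the table are illustrative names, not part of the backend.

```python
# Bytes per element for some common KV-cache dtypes (labels are illustrative).
KV_ELEMENT_BYTES = {"fp8": 1, "bf16": 2, "fp16": 2, "fp32": 4}


def required_page_size(element_size: int) -> int:
    """Mirrors `16 // k_buffer.element_size()` from the hunk above."""
    return 16 // element_size


def check_page_size(page_size: int, element_size: int) -> None:
    """Hypothetical stand-in for the fail-fast check in the backend."""
    need = required_page_size(element_size)
    if page_size < need:
        raise ValueError(
            f"--page-size must be >= {need} for {element_size}-byte KV elements, "
            f"got {page_size}"
        )


# Minimum page size per dtype under this rule.
minimums = {name: required_page_size(nbytes) for name, nbytes in KV_ELEMENT_BYTES.items()}
```

Under this rule a 1-byte KV dtype needs the largest minimum page size (16), while a page size of 64, as suggested in the error message, satisfies every width listed here.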
Comment on lines 724 to 730

         for shard_id, shard_size in zip(loaded_shard_id, shard_sizes):
    -        if param is getattr(self, "weight_scale", None) or param is getattr(
    -            self, "input_scale", None
    -        ):
    -            shard_size //= 128
    +        is_scale_param = param is getattr(
    +            self, "weight_scale", None
    +        ) or param is getattr(self, "input_scale", None)
    +        if is_scale_param and self.quant_type == QuantType.per_1x128:
    +            shard_size = (shard_size + 127) // 128
             shard = loaded_weight.narrow(self.tp_dim, current_offset, shard_size)
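
The two sizing expressions in this hunk, `shard_size // 128` and `(shard_size + 127) // 128`, agree whenever `shard_size` is a multiple of 128 and differ otherwise. A small sketch of that arithmetic follows; the function names are illustrative, and which shard sizes actually occur for this model is not stated in the PR.

```python
GROUP = 128  # group width implied by QuantType.per_1x128


def floor_scale_rows(shard_size: int, group: int = GROUP) -> int:
    """Floor division: rounds a partial group down."""
    return shard_size // group


def ceil_scale_rows(shard_size: int, group: int = GROUP) -> int:
    """Ceiling division: a partial group still gets one scale entry."""
    return (shard_size + group - 1) // group


# Multiples of 128 give the same answer either way.
same = (floor_scale_rows(256), ceil_scale_rows(256))
# A shard narrower than one group gets 0 scale entries with floor division.
narrow = (floor_scale_rows(64), ceil_scale_rows(64))
# A shard that ends mid-group loses its last scale entry with floor division.
ragged = (floor_scale_rows(192), ceil_scale_rows(192))
```

With floor division, a shard narrower than 128 would be given a zero-length slice by `narrow()`, which is the case the ceiling form avoids.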
For the SGLang plugin side, please rebase onto the main commit that contains the refactored attention backend.
Summary
This patch adds support for the PTPC version of the Qwen3.5-35B-A3B model.
Test plan
`python -m compileall atom/model_ops/linear.py`

Made with Cursor