Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK by zovonoir · Pull Request #838 · ROCm/ATOM

zovonoir · 2026-05-20T02:11:30Z


Summary

This PR ships two complementary decode-time optimizations for Qwen3.5-35B-A3B-FP8. The MRoPE QK fused path was originally in #888 and has been merged here per review feedback that the two changes should land together.

1. GDN decode recurrent fast path

A decode-only fast path for ATOM GDN recurrent attention. The optimized path is enabled only for the common non-spec decode case:

no prefill
no speculative sequence mask
one decode token per active request
1D ssm_state_indices
a/b shaped as [num_decodes, local_num_v_heads]

For unsupported layouts, the existing fused_gdn_gating + fused_recurrent_gated_delta_rule fallback is preserved.
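
The conditions listed above amount to a single eligibility predicate. The following is a minimal sketch of that check, assuming a hypothetical `DecodeBatch` record; the field names are illustrative and are not ATOM's actual attention metadata.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DecodeBatch:
    """Hypothetical stand-in for the decode metadata; not ATOM's real class."""

    num_prefills: int
    num_decodes: int
    num_decode_tokens: int
    has_spec_sequence_mask: bool
    ssm_state_indices_ndim: int
    ab_shape: Tuple[int, ...]
    local_num_v_heads: int


def can_use_lossy_fast_path(batch: DecodeBatch, enabled: bool) -> bool:
    """Return True only when every condition from the list above holds."""
    return (
        enabled
        and batch.num_prefills == 0
        and not batch.has_spec_sequence_mask
        # One decode token per active request.
        and batch.num_decode_tokens == batch.num_decodes
        and batch.ssm_state_indices_ndim == 1
        and batch.ab_shape == (batch.num_decodes, batch.local_num_v_heads)
    )
```

Any batch that fails the predicate would take the existing fused_gdn_gating + fused_recurrent_gated_delta_rule fallback.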

The fast path fuses GDN gating computation with the recurrent state update and avoids launching the separate gating kernel. The implementation is intentionally named gdn_decode_update_lossy_fast because it is an approximate decode path: it keeps the high-performance kernel structure and may introduce small bf16-scale numerical differences in the recurrent state update. End-to-end GSM8K validation shows accuracy remains on par with main.
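
For readers unfamiliar with the update being fused, here is a pure-Python reference for one token and one head. It is a sketch under assumptions: the PR text does not spell out the formulas, so the gating (`g = -exp(a_log) * softplus(a + dt_bias)`, `beta = sigmoid(b)`) and the decay/write/read order follow the commonly used gated delta rule formulation, and the parameter name `a_log` is hypothetical. Q/K normalization and query scaling are omitted, and this is not the Triton kernel.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gdn_gates(a_log: float, a: float, dt_bias: float, b: float) -> Tuple[float, float]:
    """Assumed gating: log-decay g <= 0 and write strength beta in (0, 1)."""
    g = -math.exp(a_log) * softplus(a + dt_bias)
    beta = 1.0 / (1.0 + math.exp(-b))
    return g, beta


def gdn_decode_step(state: Matrix, q, k, v, g: float, beta: float) -> List[float]:
    """Decay the K x V state, apply the delta-rule write, then read with q."""
    decay = math.exp(g)
    num_k, num_v = len(k), len(v)
    for i in range(num_k):
        for j in range(num_v):
            state[i][j] *= decay
    # What the decayed state currently predicts for key k.
    pred = [sum(state[i][j] * k[i] for i in range(num_k)) for j in range(num_v)]
    # Gated correction toward the new value v.
    delta = [beta * (v[j] - pred[j]) for j in range(num_v)]
    for i in range(num_k):
        for j in range(num_v):
            state[i][j] += k[i] * delta[j]
    return [sum(state[i][j] * q[i] for i in range(num_k)) for j in range(num_v)]
```

Computing the gates inside the same kernel as the state update removes one launch and one round trip of `g` and `beta` through memory, which is the fusion the paragraph above describes.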

The path is guarded by ATOM_ENABLE_GDN_DECODE_LOSSY_FAST and defaults to disabled. Set ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1 to enable the approximate fast path for A/B testing or performance runs.
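
A minimal sketch of how such a default-off switch can be read. The exact parsing in atom/utils/envs.py is not shown in this PR text, so treating only the literal "1" as enabled is an assumption, and `gdn_lossy_fast_enabled` is a hypothetical helper name.

```python
import os
from typing import Mapping


def gdn_lossy_fast_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Assumed parsing: only the literal "1" enables the path; unset means off."""
    return environ.get("ATOM_ENABLE_GDN_DECODE_LOSSY_FAST", "0") == "1"
```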

PAD_SLOT_ID (-1) handling: the kernel writes zeros to out and skips state load/store for idle/padded slots, so downstream consumers of the full out buffer never see uninitialized memory.
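
The padding rule can be illustrated with a toy per-row loop. This is a sketch, not the kernel: the state is reduced to one float per slot, and `decode_with_padding` and `update` are hypothetical names.

```python
from typing import Callable, Dict, List

PAD_SLOT_ID = -1


def decode_with_padding(
    slot_ids: List[int],
    states: Dict[int, float],
    update: Callable[[float], float],
    out: List[float],
) -> None:
    """Per-row decode loop mirroring the padding behaviour described above."""
    for row, slot in enumerate(slot_ids):
        if slot < 0:
            # Idle/padded row: give the output a defined value, touch no state.
            out[row] = 0.0
            continue
        states[slot] = update(states[slot])
        out[row] = states[slot]
```

With `slot_ids = [0, PAD_SLOT_ID, 3]` the middle output row becomes 0.0 and no state entry is read or written for it.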

2. Fused MRoPE Q/K Triton path

A specialized Triton MRoPE Q/K fusion for the Qwen3.5 hot path, wired into Qwen3NextAttention after Q/K norm and before attention. Falls back to the generic rotary embedding path for unsupported shapes (head_size != 256, non-MRoPE positions, etc.).

The dispatcher try_mrope_qk_fused is decorated with @torch.compiler.disable so the Python-level shape branches (e.g. positions.shape[0] != 3, q.shape[1] != num_q_heads * head_size) never reach Dynamo. Otherwise Dynamo specializes the symbolic dims to constants, conflicting with dims the runner marked dynamic and tripping ConstraintViolationError under MMStar dynamic-shape compile. This matches the convention used by chunk_gated_delta_rule in atom/model_ops/fla_ops/chunk.py.

Notes on the external `SGLANG_MAMBA_SSM_DTYPE=bfloat16` setting

The opt+env measurements below also used the external SGLang runtime setting SGLANG_MAMBA_SSM_DTYPE=bfloat16 to store the Mamba/GDN SSM state in bf16. That setting is not introduced by this PR; it is included only as an additional benchmark configuration. Because the decode kernel is strongly memory-bandwidth bound, bf16 SSM state provides additional throughput improvement, with a larger accuracy tradeoff in the measured GSM8K run.
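
A back-of-the-envelope view of why the state dtype matters for a bandwidth-bound kernel, using the standalone kernel-check shape from the Validation section (N=224, HV=32, K=128, V=128). It assumes one K x V state matrix per value head per request and an fp32 state when the setting is absent; both are assumptions for illustration.

```python
def ssm_state_bytes(
    num_requests: int, num_v_heads: int, head_k: int, head_v: int, bytes_per_elem: int
) -> int:
    """Bytes of recurrent state for one layer (one K x V matrix per value head)."""
    return num_requests * num_v_heads * head_k * head_v * bytes_per_elem


fp32 = ssm_state_bytes(224, 32, 128, 128, 4)
bf16 = ssm_state_bytes(224, 32, 128, 128, 2)
print(f"fp32 state: {fp32 / 2**20:.0f} MiB, bf16 state: {bf16 / 2**20:.0f} MiB per layer")
# prints "fp32 state: 448 MiB, bf16 state: 224 MiB per layer"
```

Each decode step loads and stores the state once, so halving the element size roughly halves that traffic per GDN layer.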

Performance

Benchmark config (MI308X):

Model: Qwen3.5-35B-A3B-FP8
CONC: 224, ISL: 4094, OSL: 2048, Prompts: 448
TP/EP: 1/1
Runtime: SGLang FP8

Configurations:

main: baseline.
opt (GDN only): opt-qwen35b with ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1, MRoPE fused path disabled.
opt (GDN+MRoPE): this PR with ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1, fused MRoPE active.
opt+env: opt (GDN only) plus external SGLANG_MAMBA_SSM_DTYPE=bfloat16.

Metric	main	opt (GDN only)	opt (GDN+MRoPE)	opt+env	GDN+MRoPE vs main
Total token throughput (tok/s)	7466.90	7870.39	8004.41	8103.82	+7.20%
Input token throughput (tok/s)	4977.12	5246.07	5335.41	5401.67	+7.20%
Output token throughput (tok/s)	2489.78	2624.32	2669.01	2702.15	+7.20%
Request throughput (req/s)	1.22	1.28	—	1.32	—
Benchmark duration (s)	368.51	349.62	—	339.55	—
Mean E2E (ms)	176401	167139	164893	162062	-6.52%
Mean TTFT (ms)	17782	17852	17771	17824	-0.06%
Mean TPOT (ms)	77.44	72.93	71.87	70.46	-7.19%
P99 TPOT (ms)	85.47	81.00	79.83	78.46	-6.60%
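
The last column can be reproduced directly from the main and opt (GDN+MRoPE) columns. A small sketch; `pct_change` is a helper introduced here for illustration.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


rows = {
    "Total token throughput (tok/s)": (7466.90, 8004.41),
    "Mean E2E (ms)": (176401, 164893),
    "Mean TTFT (ms)": (17782, 17771),
    "Mean TPOT (ms)": (77.44, 71.87),
    "P99 TPOT (ms)": (85.47, 79.83),
}
for name, (main, opt) in rows.items():
    print(f"{name}: {pct_change(main, opt):+.2f}%")
```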

For reference, a fake-return experiment that bypassed fused_recurrent_gated_delta_rule entirely showed an approximate upper bound of QPS 1.4882, TPOT 61 ms, Total throughput 9140 tok/s — confirming the GDN recurrent update is a meaningful decode-time bottleneck and this PR captures a conservative portion of the opportunity.
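
To put "a conservative portion" into numbers, the sketch below measures how much of the gap between main and the fake-return bound is closed. It uses the opt (GDN only) column, since the bound only removes the GDN recurrent update, and it assumes the bound was measured under the same benchmark config. `captured_fraction` is a hypothetical helper.

```python
def captured_fraction(baseline: float, achieved: float, bound: float) -> float:
    """Share of the baseline-to-bound gap that the achieved value closes."""
    return (achieved - baseline) / (bound - baseline)


throughput = captured_fraction(7466.90, 7870.39, 9140.0)
tpot = captured_fraction(77.44, 72.93, 61.0)
print(f"throughput gap closed: {throughput:.0%}, TPOT gap closed: {tpot:.0%}")
```

Under these assumptions the GDN fast path closes roughly a quarter of the measured headroom. The bound itself is approximate, so the fractions are indicative only.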

Accuracy

GSM8K, 5-shot:

Version	flexible-extract	strict-match	flexible delta vs main	strict delta vs main
main	0.8946 ± 0.0085	0.9052 ± 0.0081	—	—
opt (GDN only)	0.8984 ± 0.0083	0.9090 ± 0.0079	+0.0038	+0.0038
opt (GDN+MRoPE)	0.895 ± 0.0097	0.903 ± 0.0094	+0.0004	-0.0022
opt+env (GDN + bf16 SSM)	0.8825	0.8931	-0.0121	-0.0121

The combined GDN+MRoPE accuracy remains on par with main. opt+env provides the highest throughput but with a measurable GSM8K drop in this run.
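
One way to read "on par" is to express each delta in units of its uncertainty. The sketch assumes the ± figures in the table are standard errors and treats the two runs as independent, which overstates the noise somewhat because both runs use the same test set. No ± is reported for opt+env, so it is left out.

```python
import math


def delta_in_stderr_units(
    main: float, main_se: float, other: float, other_se: float
) -> float:
    """Difference divided by the combined standard error (independence assumed)."""
    return (other - main) / math.hypot(main_se, other_se)


flex = delta_in_stderr_units(0.8946, 0.0085, 0.895, 0.0097)
strict = delta_in_stderr_units(0.9052, 0.0081, 0.903, 0.0094)
print(f"flexible-extract: {flex:+.2f} SE, strict-match: {strict:+.2f} SE")
```

Both GDN+MRoPE deltas come out well under one combined standard error, which is consistent with the statement above.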

Cross-model benchmarks on Qwen3 dense (e.g. 27B) and other MoE models (e.g. 397B) are in progress and will be appended once available.

Validation

Additional standalone kernel checks were run for the target shape:

N=224, H=16, HV=32, K=128, V=128, dtype=bfloat16

The optimized fast path may produce small bf16-scale differences versus the reference recurrent path. End-to-end GSM8K validation and decode benchmark results above were measured with the current gdn_decode_update_lossy_fast + fused MRoPE implementation.
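
For a sense of what "bf16-scale" means, the sketch below rounds a float32 value to bfloat16 using only the standard library. bfloat16 keeps 8 significant bits, so one rounding can change a value by up to about 2^-8 in relative terms. `to_bf16` is a helper written for this illustration and handles only finite values in float32 range.

```python
import struct


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 bits that bfloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 0.1
rounded = to_bf16(value)
print(rounded, abs(rounded - value) / value)
```

Because the recurrent state is updated at every decode step, errors of this size can accumulate over a long generation, which is why the end-to-end GSM8K comparison is the relevant check.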

The existing fallbacks remain in place for prefill, speculative decode, mixed layouts, unsupported metadata shapes, and unsupported MRoPE shapes.

Files

atom/model_ops/attention_gdn.py — wire gdn_decode_update_lossy_fast into the GDN attention forward path; restore core_attn_out tail-zeroing for CUDA-graph padding safety.
atom/model_ops/fla_ops/__init__.py — export gdn_decode_update_lossy_fast.
atom/model_ops/fla_ops/fused_recurrent.py — add gdn_decode_update_lossy_fast + Triton kernel with PAD_SLOT_ID handling, contiguous-state validation, and HV % H checks.
atom/model_ops/triton_mrope.py — new specialized Qwen3.5 MRoPE Q/K Triton kernels (tiled + per-token) with @torch.compiler.disable-decorated dispatcher.
atom/models/qwen3_next.py — wire try_mrope_qk_fused into Qwen3NextAttention.
atom/utils/envs.py — add ATOM_ENABLE_GDN_DECODE_LOSSY_FAST env var.
tests/test_envs.py — include new env var in _ATOM_ENV_VARS cleanup list and add override assertion.

Copilot

Pull request overview

This PR adds a dedicated Triton “fast decode” update kernel for Gated Delta Net (GDN) and wires it into the Qwen3.5 decode path to improve decode throughput by avoiding materializing gates and using a streamlined state update.

Changes:

Introduces gdn_decode_update_fast (Triton kernel + Python wrapper) for decode-time state update and output computation.
Updates atom/model_ops/attention_gdn.py to conditionally use the fast decode kernel for non-spec decode cases and to skip gate materialization when not needed.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`atom/model_ops/fla_ops/gdn_decode_fast.py`	Adds a new Triton kernel + wrapper intended to accelerate GDN decode updates.
`atom/model_ops/attention_gdn.py`	Integrates the new fast decode path and adjusts gating computation to be conditional.

Comments suppressed due to low confidence (1)

atom/model_ops/fla_ops/gdn_decode_fast.py:170

The kernel launch passes threshold=threshold, but _gdn_decode_update_kernel does not have a threshold parameter (it currently declares threshould). This will fail at runtime when launching the Triton kernel; make the argument name consistent between the kernel signature and the launch call.

        bv,
        n_value_heads // n_heads,
        num_warps=4,
        num_stages=1,
        threshold=threshold
    )


Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (5)

atom/model_ops/attention_gdn.py:30

This try/except currently re-imports and overwrites gdn_decode_update_fast, but due to the earlier unconditional import it is effectively dead code (and can’t protect startup). After switching to a guarded import, consider using a single import style (either try: from ... import ... or importlib.import_module) to avoid shadowing/duplication.

try:
    gdn_decode_update_fast = importlib.import_module(
        "atom.model_ops.fla_ops.gdn_decode_fast"
    ).gdn_decode_update_fast
except ImportError:
    gdn_decode_update_fast = None

atom/model_ops/fla_ops/gdn_decode_fast.py:167

HEADS_PER_V is computed as n_value_heads // n_heads and then used as a divisor in the kernel (i_h = i_hv // HEADS_PER_V). If n_value_heads < n_heads or the ratio isn’t an integer mapping, this will either divide by zero or silently mis-map heads. Add explicit validation (e.g., n_value_heads % n_heads == 0 and n_value_heads >= n_heads) before launching the kernel.

        bk,
        bv,
        n_value_heads // n_heads,
        num_warps=4,
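
A sketch of the validation this comment asks for, written as a standalone helper. `heads_per_v` is a hypothetical name; the wrapper in the PR may structure the check differently.

```python
def heads_per_v(n_heads: int, n_value_heads: int) -> int:
    """Return the value-head group size, or raise if the mapping is not exact."""
    if n_heads <= 0 or n_value_heads < n_heads:
        raise ValueError(
            f"need 0 < n_heads <= n_value_heads, got {n_heads} and {n_value_heads}"
        )
    if n_value_heads % n_heads != 0:
        raise ValueError(
            f"n_value_heads={n_value_heads} is not a multiple of n_heads={n_heads}"
        )
    return n_value_heads // n_heads


group = heads_per_v(16, 32)
# Value head i_hv reads the q/k head i_hv // group, as in the kernel expression above.
print([i_hv // group for i_hv in range(6)])
```

Rejecting `n_value_heads < n_heads` up front is what prevents the zero divisor mentioned in the comment.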

atom/model_ops/fla_ops/gdn_decode_fast.py:141

cu_seqlens is required (ValueError if None) but is not used anywhere in the fast path. Either remove this requirement or use cu_seqlens to validate that the input really matches the supported decode-only shape (e.g., one token per sequence / cu_seqlens[-1] == n_tokens) so incorrect calls fail early for the right reason.

    if cu_seqlens is None:
        raise ValueError("gdn_decode_update_fast requires cu_seqlens for decode")
    if num_accepted_tokens is not None:
        raise ValueError("gdn_decode_update_fast does not support spec decoding")
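
If `cu_seqlens` stays a required argument, it could be used to verify the decode-only layout. A sketch on plain Python sequences; `check_decode_cu_seqlens` is a hypothetical helper, and in the real wrapper `cu_seqlens` is presumably a device tensor, so an equivalent check there may force a host sync.

```python
from typing import Sequence


def check_decode_cu_seqlens(cu_seqlens: Sequence[int], n_tokens: int) -> None:
    """Raise unless cu_seqlens describes exactly one token per sequence."""
    if len(cu_seqlens) != n_tokens + 1 or cu_seqlens[0] != 0:
        raise ValueError("cu_seqlens must have n_tokens + 1 entries starting at 0")
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        if end - start != 1:
            raise ValueError("fast path supports exactly one decode token per sequence")
```

For three decode requests the only accepted value is `[0, 1, 2, 3]`.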

atom/model_ops/fla_ops/gdn_decode_fast.py:156

The wrapper doesn’t enforce contiguity/strides for q/k/v but the kernel uses implicit contiguous flattening (no strides passed). To match existing patterns (e.g., fused_recurrent_gated_delta_rule), consider calling .contiguous() on q, k, and v (and potentially out) or passing explicit strides to the kernel.

        a,
        dt_bias,
        q,
        k,
        v,
        b,
        out,
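
The contiguity concern can be shown without a tensor library: flat indexing inside a kernel is only valid when the strides match a dense row-major layout. The helpers below are hypothetical and operate on shape/stride tuples; with PyTorch tensors the same question is answered by `Tensor.is_contiguous()`.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a dense row-major array with the given shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_dense_row_major(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True when flat indexing is valid; strides of size-1 dims are ignored."""
    expected = contiguous_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))
```

A q tensor of shape (1, 224, 16, 128) passes with strides (458752, 2048, 128, 1), while a transposed view of the same storage does not and would need `.contiguous()` or explicit strides passed to the kernel.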

atom/model_ops/fla_ops/gdn_decode_fast.py:127

gdn_decode_update_fast implicitly assumes the flattened varlen layout used elsewhere (batch dimension == 1). Right now it takes q.shape[0]/v.shape[0] as _ and then uses squeeze(0) plus linear indexing that will be incorrect if q/v have batch > 1. Add an explicit check (e.g., q.shape[0] == v.shape[0] == 1) and raise a clear error if not.

    _, n_tokens, n_heads, head_k = q.shape
    _, _, n_value_heads, head_v = v.shape
    out = torch.empty_like(v).squeeze(0) if o is None else o.squeeze(0)

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

zovonoir · 2026-05-21T02:11:17Z

Accuracy drop confirmed on the latest Docker image; still debugging.


Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

zovonoir · 2026-05-27T05:24:25Z

Please rebase main and then check the accuracy CI test.

It seems there is a 5% performance gain. Could you provide more details on how the above optimization methods perform on Qwen3 dense models (e.g., 27B) and other MoE models (e.g., 397B)?

Testing in progress.

When ssm_state_indices contains a negative slot id (e.g. SGLang's PAD_SLOT_ID = -1 for idle/padded decode slots) the kernel previously returned early without writing to out, leaving the corresponding positions in the output tensor uninitialized and propagating garbage into downstream ops. Match the safer behavior expected by callers: write zeros to out for the invalid slot and skip the state load/store entirely. Addresses the latest Copilot review comment on PR #838.

zovonoir · 2026-05-28T14:18:20Z

@wanzhenchn
Here are the testing results:
main : main branch without this patch
opt: this patch
bf16: prepend "SGLANG_MAMBA_SSM_DTYPE=bfloat16"
fp32: delete "SGLANG_MAMBA_SSM_DTYPE=bfloat16"

Fix Check Code Style with Black CI failure on #838.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Merges the MRoPE Q/K fusion work originally in #888 into this PR so the two related Qwen3.5-35B-A3B-FP8 optimizations ship together (per review feedback that #888's stand-alone +1.7% gain is too small to justify a separate PR).

Adds:
- atom/model_ops/triton_mrope.py: specialized Qwen3.5 MRoPE Q/K Triton kernels (tiled + per-token) with a try_mrope_qk_fused dispatcher decorated with @torch.compiler.disable so Dynamo cannot specialize positions/q/k symbolic dims to constants (was tripping ConstraintViolationError under MMStar dynamic-shape compile).
- atom/models/qwen3_next.py: wires try_mrope_qk_fused into Qwen3NextAttention after qk_norm; falls back to the generic rotary_emb path when the shapes don't match.

Combined effect over main (MI308X, CONC 224, ISL 4094, OSL 2048, TP/EP 1/1, ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1):
- Total token throughput: 7466.90 -> 8004.41 tok/s (+7.20%)
- Mean E2E latency: 176401 -> 164893 ms (-6.52%)
- Mean TPOT: 77.44 -> 71.87 ms (-7.19%)

GSM8K 5-shot remains on par with main:
- flexible-extract: 0.895 (vs 0.8946 baseline)
- strict-match: 0.903 (vs 0.9052 baseline)

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

zovonoir · 2026-05-29T05:16:50Z

@wanzhenchn thanks for the review. Status on your two requests:

Rebase main + accuracy CI: done. main has been merged into the branch (commit 7d7f4fa), bringing in 20 commits including #966 (qwen3.5 MTP accept-rate fix) and #939 (scheduler fix for the dead HBM prefix cache under deferred-output mode, a regression from #902). Accuracy CI is now re-running on the rebased HEAD.
Performance on Qwen3 dense (27B) and other MoE (397B): cross-model benchmarks are in progress on my side. I'll post the numbers here as soon as the runs finish.

Separately, per your suggestion on #888, I've merged the fused MRoPE QK path into this PR (commit 9d0e326) and closed #888. The combined PR now delivers GDN decode + MRoPE fusion together:

Total token throughput: 7466.90 → 8004.41 tok/s (+7.20% vs main)
Mean E2E latency: 176401 → 164893 ms (-6.52% vs main)
Mean TPOT: 77.44 → 71.87 ms (-7.19% vs main)
GSM8K 5-shot: 0.895 (flex) / 0.903 (strict) — on par with main (0.8946 / 0.9052)

PR title and description have been updated to reflect the combined scope. Please take another look when you have a chance.

Previously try_mrope_qk_fused used @torch.compiler.disable to keep the Python shape branches out of Dynamo. That fixed the original ConstraintViolationError but introduced a new MMStar failure:

torch._dynamo.exc.BackendCompilerFailed: backend='...VllmBackend' raised: AssertionError: VllmBackend can only be called once

The graph break inserted by @torch.compiler.disable inside the compiled Qwen3NextAttention forward causes Dynamo to invoke ATOM's VllmBackend a second time on the same instance.

Switch to torch.compiler.is_compiling() early-return: under compile we skip the fused path entirely (fall back to self.rotary_emb, identical to main), eager mode keeps the fused-path perf gain. No graph break, no double-backend invocation.
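
The early-return pattern described in this commit message can be sketched as follows. This is not the PR's dispatcher: `try_fused_sketch` and `fused_kernel` are hypothetical names, only the head_size check from the PR description is shown, and the import fallback exists solely so the sketch runs where PyTorch is unavailable.

```python
from typing import Callable, Optional, Tuple

try:
    import torch

    _is_compiling = torch.compiler.is_compiling
except Exception:
    # Fallback so the sketch still runs without a recent PyTorch install.
    def _is_compiling() -> bool:
        return False


def try_fused_sketch(q, k, fused_kernel: Callable, head_size: int) -> Optional[Tuple]:
    """Return fused (q, k), or None to tell the caller to use the generic path."""
    if _is_compiling():
        # Under torch.compile the traced code takes the generic rotary path,
        # so no graph break is introduced.
        return None
    if head_size != 256:
        return None  # unsupported shape
    return fused_kernel(q, k)
```

The caller treats `None` as "use self.rotary_emb", so compiled runs behave like main while eager runs keep the fused kernel.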

Copilot

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

zovonoir · 2026-06-01T02:03:31Z

@wanzhenchn gentle ping — when you have a moment, could you take another look at this PR? Quick recap of changes since your last review:

✅ Rebased onto main (now includes #966 qwen3.5 MTP accept-rate fix, #939 scheduler HBM prefix-cache fix, #740 chunk prefill support, etc.)
✅ Merged #888's fused MRoPE QK path into this PR per your suggestion (#888 closed)
✅ Resolved the latest Copilot review item (PAD_SLOT_ID handling in the lossy decode kernel)
✅ Fixed an MMStar-only ConstraintViolationError → VllmBackend can only be called once regression introduced by the MRoPE path under torch.compile (early-return when compiling, eager mode unaffected)
✅ Lint clean (Black / Ruff / Pre Checkin all green)
✅ Qwen3.5-35B-A3B-FP8 TP2 accuracy CI passing on both runs
⏳ Cross-model benchmarks on Qwen3 dense / 397B-MoE still in progress; I'll post them as they come in

Note on the remaining red accuracy jobs (DSR1-FP8, DSV4-Pro, Qwen3.5-397B-MXFP4): these fail identically on #889 which only changes 5 lines in linear.py, so they're upstream main regressions / infra flakes, not from this PR.

Would also appreciate a look at #889 (Qwen3.5-35B-A3B PTPC support, single-file linear.py change, now rebased onto main) — it's MERGEABLE and just waiting on REVIEW_REQUIRED to clear.

Thanks!

zovonoir · 2026-06-01T06:23:20Z

Thanks @wanzhenchn for the offline confirmation!

CI status update for context: the latest run on 1538f12e has the target model (Qwen3.5-35B-A3B-FP8 TP2 SGLang) passing in 13m26s, along with DeepSeek-R1-0528, DeepSeek-V4-Pro / V4-Pro MTP, GLM-5.1 family, Kimi-K2.5, gpt-oss-120b, Qwen3-Next-80B, Meta-Llama-3-8B, DSR1-FP8 TP4, and Copilot review (clean). Lint (Black/Ruff/Pre Checkin) all green.

The 5 remaining red accuracy jobs are all infra (I checked each log):

Job	Real cause
Qwen3.5-35B-A3B-FP8 TP2 (vLLM)	`ValueError: Free memory on device cuda:0 (176.87/251.98 GiB) ... less than 0.9` — GPU held by another process
DSR1-FP4 TP4	`No such image: atom_sglang_base:ci`
Kimi-K2-Thinking-MXFP4 TP4	`No such image: atom_oot_base:ci`
DSR1-FP8 TP8 (failed in 23s)	`No such image: atom_oot_base:ci`
Qwen3.5-397B-A17B-MXFP4	aiter wheel missing

These all reproduce identically on #889 (which only touches 5 lines in linear.py), so they're upstream main regressions / runner flakes, not from this PR.

@valarLip @Pleaplusone since you've both worked recently on atom/model_ops/fla_ops/ and atom/models/qwen3_next.py, would either of you have a moment to formally approve this so it can land? wanzhenchn LGTM'd offline but doesn't have approve rights.

Same ask would apply to #889 (Qwen3.5-35B-A3B PTPC support, single-file linear.py change) once this one merges.

Thanks all!

num_tokens equals positions.shape[1], which changes every batch (mixed prefill/decode, varying decode batch sizes). With tl.constexpr, Triton specializes and recompiles the kernel for every distinct value, which defeats the perf gain of the fused path. num_tokens is only used in a runtime mask (row_mask = rows < num_tokens), so it does not need constexpr semantics. Drop the annotation so the kernel is compiled once per shape group. Addresses Copilot review r3322237301.
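
The recompilation effect can be modelled with a toy cache keyed on compile-time arguments. This illustrates the idea only and is not Triton's actual cache: `SpecializingLauncher` is hypothetical, and real Triton may additionally specialize on a few properties of runtime integers.

```python
from typing import Dict, Tuple


class SpecializingLauncher:
    """Toy JIT model: one compile per distinct tuple of constexpr values."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple, str] = {}

    def launch(self, constexprs: Tuple, runtime_args: Tuple) -> None:
        if constexprs not in self.cache:
            # Stands in for an expensive kernel compilation.
            self.cache[constexprs] = f"binary for {constexprs}"
        # runtime_args would be passed to the cached binary; they never affect the key.


batch_sizes = [1, 7, 32, 7, 224, 1]
as_constexpr, as_runtime = SpecializingLauncher(), SpecializingLauncher()
for n in batch_sizes:
    as_constexpr.launch(constexprs=(256, n), runtime_args=())
    as_runtime.launch(constexprs=(256,), runtime_args=(n,))
print(len(as_constexpr.cache), len(as_runtime.cache))  # prints "4 1"
```

With `num_tokens` in the key, every new batch size costs a compile; as a runtime argument it only feeds the `rows < num_tokens` mask.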

valarLip · 2026-05-20T02:27:32Z


+try:
+    gdn_decode_update_fast = importlib.import_module(
+        "atom.model_ops.fla_ops.gdn_decode_fast"


The code itself is already in ATOM, so why do we need this weird try... except?

… QK (#838)" This reverts commit dfb8eda.

* Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK (#838)
* add gdn decode fast kernel
* resolve gdn code conflicts
* resolve gdn code conflicts
* solve mispelling error
* solve redundant import error
* add layernorm and rope optimization
* revert non-gdn optimization changes
  Co-authored-by: Cursor <cursoragent@cursor.com>
* revert gdn changes
  Co-authored-by: Cursor <cursoragent@cursor.com>
* add gdn decode lossy fast kernel
* revert sglang benchmark file changes
  Co-authored-by: Cursor <cursoragent@cursor.com>
* gate gdn decode lossy fast path
  Co-authored-by: Cursor <cursoragent@cursor.com>
* address gdn decode review comments
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(gdn): zero out for PAD_SLOT_ID in lossy fast kernel
  When ssm_state_indices contains a negative slot id (e.g. SGLang's PAD_SLOT_ID = -1 for idle/padded decode slots) the kernel previously returned early without writing to out, leaving the corresponding positions in the output tensor uninitialized and propagating garbage into downstream ops. Match the safer behavior expected by callers: write zeros to out for the invalid slot and skip the state load/store entirely. Addresses the latest Copilot review comment on PR #838.
* style: apply black formatting
  Fix Check Code Style with Black CI failure on #838.
* perf(qwen3.5): add fused MRoPE QK Triton path
  Merges the MRoPE Q/K fusion work originally in #888 into this PR so the two related Qwen3.5-35B-A3B-FP8 optimizations ship together (per review feedback that #888's stand-alone +1.7% gain is too small to justify a separate PR). Adds:
  - atom/model_ops/triton_mrope.py: specialized Qwen3.5 MRoPE Q/K Triton kernels (tiled + per-token) with a try_mrope_qk_fused dispatcher decorated with @torch.compiler.disable so Dynamo cannot specialize positions/q/k symbolic dims to constants (was tripping ConstraintViolationError under MMStar dynamic-shape compile).
  - atom/models/qwen3_next.py: wires try_mrope_qk_fused into Qwen3NextAttention after qk_norm; falls back to the generic rotary_emb path when the shapes don't match.
  Combined effect over main (MI308X, CONC 224, ISL 4094, OSL 2048, TP/EP 1/1, ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1):
  - Total token throughput: 7466.90 -> 8004.41 tok/s (+7.20%)
  - Mean E2E latency: 176401 -> 164893 ms (-6.52%)
  - Mean TPOT: 77.44 -> 71.87 ms (-7.19%)
  GSM8K 5-shot remains on par with main:
  - flexible-extract: 0.895 (vs 0.8946 baseline)
  - strict-match: 0.903 (vs 0.9052 baseline)
* fix(mrope): early-return under torch.compile instead of graph break
  Previously try_mrope_qk_fused used @torch.compiler.disable to keep the Python shape branches out of Dynamo. That fixed the original ConstraintViolationError but introduced a new MMStar failure: torch._dynamo.exc.BackendCompilerFailed: backend='...VllmBackend' raised: AssertionError: VllmBackend can only be called once. The graph break inserted by @torch.compiler.disable inside the compiled Qwen3NextAttention forward causes Dynamo to invoke ATOM's VllmBackend a second time on the same instance. Switch to torch.compiler.is_compiling() early-return: under compile we skip the fused path entirely (fall back to self.rotary_emb, identical to main), eager mode keeps the fused-path perf gain. No graph break, no double-backend invocation.
* perf(mrope): drop tl.constexpr on num_tokens to avoid recompilation
  num_tokens equals positions.shape[1], which changes every batch (mixed prefill/decode, varying decode batch sizes). With tl.constexpr, Triton specializes and recompiles the kernel for every distinct value, which defeats the perf gain of the fused path. num_tokens is only used in a runtime mask (row_mask = rows < num_tokens), so it does not need constexpr semantics. Drop the annotation so the kernel is compiled once per shape group. Addresses Copilot review r3322237301.
--------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: zovonoir <zovonoir@users.noreply.github.com> * fix(spec_decode): support DP attention with MTP in Deepseek V4 (#1001) * fix(spec_decode): support DP attention with MTP draft Refresh dp_metadata per draft step (force variable-length DP path) and add num_spec_step + scheduled_spec_decode_tokens to the dummy decode batch so DP+MTP runs stay in lockstep. * style: apply black formatting --------- Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com> * Remove qkv 256 tok limitation (#999) * [Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention (#750) * [feat][Attention Refactor] Reconstruct the Attention arch Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * ci(benchmark): raise benchmark drain MAX_MIN 30->60 and step timeout 60->80 (#1019) High-concurrency long-context benchmarks (DP-attention 8k/1k c=1024, which runs num_prompts = conc*10 = 10240) need ~48 min wall: ~14 min warmup + ~34 min for the measured run (10 waves of 1024 at ~3:20/wave). The benchmark drain's MAX_MIN=30 cut them off mid-run with exit 4 (timeout), failing the job even though the server was healthy and still making progress. Raise the benchmark drain MAX_MIN 30->60 and the "Run benchmark" step timeout-minutes 60->80 so these runs complete. Fast jobs are unaffected (drain exits on client completion, well before MAX_MIN); genuine hangs/faults still surface quickly via STUCK_POLLS (3 min) and fault detection, not MAX_MIN. The accuracy drain (MAX_MIN=30) is left unchanged. * [atom-vllm-benchmark] Retrieve model case name (#1022) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * ci(accuracy): set Qwen3.5-35B-A3B TP2 baseline to 0.85 (#993) Mean of first 4 valid CI runs after PR #893 (0.8226 / 0.8529 / 0.8620 / 0.8628). Threshold 0.83 unchanged. 
Co-authored-by: JiaoliangYu <jiaolyu@amd.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * fix: support PTPC indexer wk FP8 scales (#1009) * fix: support MTP indexer wk FP8 scales Allow DeepSeek-V3.2 MTP checkpoints to load indexer.wk tensors that use per-channel FP8 scales while preserving the existing block-scale path. Co-authored-by: Cursor <cursoragent@cursor.com> * fix: clarify PTPC indexer wk scale support Describe the per-channel FP8 scale path as PTPC quantization support rather than MTP-specific behavior. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com> * [Fix] Enable dpsk r1 mxfp4 V2 model (#934) * [Fix] Enable dpsk r1 mxfp4 V2 model * [Benchmark] Change model to dpsk v2 model for sglang plugin * [Fix] Move MXFP4 kv_b_proj preservation into SGLang MLA * [Fix] Handle SGLang MXFP4 kv_b_proj postprocess order * Add fused chunk GDN prefill path for Qwen3.5-35B (#921) * Add fused chunk GDN prefill path for Qwen3.5-35B Port AMD HIP fast path from sglang's flash-linear-attention to chunk_gated_delta_rule prefill. Fuses 4 kernels into 3. * remove unused o_16 in fused_merge_recompute_kernel * format NT_16 ternary on single line for black * [fix](attn): fix slot mapping in model runner v2 (#1015) Co-authored-by: perzhang <perzhang@amd.com> * [MoE] adapt to triton_kernels matmul_ogs -> matmul rename (#763) Upstream triton_kernels merged the `matmul_ogs` module into `matmul` and the `matmul_ogs_details` package into `matmul_details`. The `PrecisionConfig` dataclass was also reshaped: `weight_scale` is now `b_mx_scale`, and setting it requires `b_microblock_size` to be provided explicitly (enforced by an assert in the new `matmul()`). - fused_moe_triton: try importing `FnSpecs / FusedActivation / PrecisionConfig / matmul` from `triton_kernels.matmul` first, fall back to the old `triton_kernels.matmul_ogs` path. Alias `matmul as matmul_ogs` so existing call sites stay unchanged. 
- moe (Mxfp4MoEMethod.process_weights_after_loading): same dual-path import for `FlexCtx / PrecisionConfig`; detect the kwarg name via `dataclasses.fields` so the old `weight_scale=` path keeps working while the new API takes `b_mx_scale=` + `b_microblock_size=`. - Drop the `_amd_smem_safe_tile` workaround that pinned block_m / block_n on gfx950: the underlying LDS-spill is no longer reproducible against current triton / triton_kernels. Co-authored-by: jianlian <jianlian@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * CI: Use linux-atom-mi35x-1 in docker release pipeline * [atom-vllm benchmark] set 0 to random range ratio for vllm bench (#1029) * Fix AW benchmark fixed length config (#1020) Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> * Clarify AW benchmark matrix job name (#1021) * Clarify AW benchmark matrix job name * Use explicit zero ratio for AW benchmark cases --------- Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> --------- Co-authored-by: wuhuikx <hattie.wu@amd.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> * [atom-sgl-benchmark] Debug timeout (#977) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * [atom-vllm benchmark] allow P0 benchmarks at 128 and 256 concurrency (#1036) Allow P0 benchmarks at 128 and 256 concurrency (#1030) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * fix: chunk prefill (#1032) * remove disable deepseek v4 chunk prefill flag * fix(scheduler): use num_tokens for preempted seq re-prefill chunk size Preempted seqs keep their decoded token_ids (preempt() only deallocates KV blocks) so seq.num_tokens > seq.num_prompt_tokens on re-admit. Computing num_new_tokens from num_prompt_tokens caused chunk=0 when a fully-cached prefix exhausted num_prompt_tokens, triggering the "chunk must be positive" assert under high concurrency benchmarks. 
* fix format * fix sparse_attn_v4_paged_prefill for MI308 (#1003) * [ATOM SGLang] SGL plugin Attention Refractory (#863) * add work log * [ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full attention backend * remove work log * [ATOM-SGL][Attn refrac] Route DeepSeek MLA through an SGLang wrapper Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention backend as the host/backend layers. Shrink deepseek_mla_forward.py into a helper module and clarify absorbed vs non-absorbed path naming. * [ATOM SGL] runtime extraction * [ATOM-SGL][Runtime] Introduce model adapter specs Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Runtime] Keep custom wrappers out of generated entries Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Split full attention backend helpers Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Format refactored attention files Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Fix ruff findings in refactored attention code Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Avoid DeepSeek MLA wrapper module cycle Co-authored-by: Cursor <cursoragent@cursor.com> * fix rebase issue * precheckin * prepare for sglang only * import error meet in qwen3.5 * qwen3.5 acc fix * [Fix] Limit static FP4 linear kv_b_proj post-processing --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: qichu-yun <qichu@amd.com> * [enable EP] deepseek V4 (#875) * [enable EP] deepseek V4 * update * [KV-events] ZMQ publisher for KV cache events (#869) * feat(kv-events): ZMQ publisher for KV cache events Add a KV cache lifecycle event pipeline so external consumers can track when blocks become resident, are evicted, or move across tiers. 
- atom/distributed/kv_events.py: EventBatch + tagged-union schema (BlockStored, BlockRemoved, AllBlocksCleared, BlockTransferred); ZMQ PUB publisher with a background sender thread and bounded queue (drops oldest on slow subscriber). - atom/model_engine/block_manager.py: emit BlockStored on prefix-cache coalesced runs, BlockRemoved on lazy LRU eviction, AllBlocksCleared on clear_cache(); record_remote_store() hook for remote-transfer connectors to emit BlockStored(medium=REMOTE). - atom/model_engine/scheduler.py: publish_kv_events() drains the BlockManager event log per scheduler step into one EventBatch; shutdown_kv_events() tears down the publisher on engine shutdown. - atom/model_engine/engine_core.py: publisher lifecycle wiring. - atom/utils/envs.py: ATOM_KV_EVENTS_{ENABLE,PUBLISHER,ENDPOINT,TOPIC, HWM,BUFFER_STEPS} env vars. - atom/config.py: KV-events config knobs. - tests/test_kv_events.py: schema round-trip + tagged-union batch. BlockTransferred and medium in {CPU, DISK} are reserved in the schema but not emitted yet. The hybrid-cache metadata fields on BlockStored (kv_cache_spec_kind, kv_cache_spec_sliding_window) are reserved wire slots emitted as None until a follow-up wires them from the cache-spec coordinator. Review feedback (incorporated): - Make pyzmq an optional runtime dep: import zmq inside ZmqEventPublisher so BlockManager's unconditional import of this module no longer requires pyzmq when KV events are disabled. - Validate buffer_steps >= 1 in ZmqEventPublisher so 0 (which Python's queue.Queue treats as unbounded) can't silently disable backpressure. - Track encode failures in stats (encode_errors counter) instead of swallowing the exception silently. - Add BlockManager.kv_events_enabled property so the scheduler stops reaching into _event_log directly. - Use the MEDIUM_REMOTE constant rather than the "REMOTE" string literal in record_remote_store. 
- Use pytest.importorskip("zmq") and an inproc:// endpoint in test_zmq_publisher_roundtrip so the test no longer hard-codes a TCP port and can be skipped cleanly when pyzmq is absent. * chore(kv-events): trim verbose comments and docstrings Remove descriptive comments and docstrings that restated what the code already says, leaving only the ones whose WHY is non-obvious (lazy eviction point, coalesced-store parent semantics, sticky cache_miss invariant, drop-on-overflow design, clear_cache live-seq invariant). * fix(kv-events): import MEDIUM_REMOTE for record_remote_store The earlier commit added a MEDIUM_REMOTE reference at the record_remote_store() emit site but the import line was never added, which would have raised NameError on first remote-store callback. Path wasn't exercised in the local smoke run because we never wired a KV-transfer producer. * fix(kv-events): close shutdown race and drop unused _EventBatch * fix(kv-events): align KVEventsConfig defaults with env * fix(kv-events): teardown safety, multipart docstring, parent_hash dedupe * fix(kv-events): no BlockRemoved on cache-hit block reuse * fix(kv-events): chain parent on remote store, atomic drain, longer linger * fix(kv-events): use sub.poll in test_zmq_publisher_roundtrip * Merge branch 'main' into feat/kv-events * fix(kv-events): publish on every step, skip cached blocks on remote-store, safer shutdown * fix(kv-events): default endpoint to loopback for safer opt-in * fix(kv-events): default group_idx to None to match vLLM wire layout * fix(kv-events): call hash_blocks before fwd_output idx-skip main's postprocess() skipped seqs whose idx is None (prefill step pattern) before calling hash_blocks(), so the prefill seq's hashes were never registered and BlockStored was never emitted. Move the hash_blocks call above the idx-None continue so it runs on every prefill step regardless of the fwd_output idx mapping. 
* test(kv-events): rename test_cache_hit_emits_no_new_store -> only_new_blocks * kv_events: log first encode error, count shutdown drops, hoist event-log check * black format * kv_events: harden finally, add overflow/encode tests * pyproject: add msgspec to deps * [atom-vllm benchmark] enable DeepSeek V3.2 quick reduce envs (#1047) * [atom-vllm] enable DeepSeek V3.2 quick reduce envs Co-authored-by: Cursor <cursoragent@cursor.com> * add accuracy recipe --------- Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * fix: warmup uses full token budget for DP (#1024) * fix: warmup uses full token budget * only for dp attn --------- Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com> * feat: support DeepSeek-V4-Flash-Base model on gfx942 device. (#996) * Expose ATOM test base image input (#1053) * [atom-vllm-benchmark] Add model case amd/DeepSeek-V3.2-mtp-ptpc for AW_P0 (#1039) * Add model case amd/DeepSeek-V3.2-mtp-ptpc for AW_P0 * First run non-mtp version * Remove 'MTP' from choice_label * Add model case amd/DeepSeek-V3.2-mtp-ptpc to accuracy and recipe * Add launch params to deepseek v3.2 ptpc --------- Co-authored-by: root <root@hjbog-srdc-15.amd.com> * [atom-vllm-benchmark] Change AW execution logic from one server one job to one server multi jobs (#1005) * Rename to AW (#1000) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * Debug 'no such file or directory benchmark_matrix.json' (#1002) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * [minimax dev_perf] remove qkv token 256 limitation for ar fusion (#1004) * [atom-vllm benchmark] refine model case name (#995) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * Remove qkv 256 tok limitation --------- Co-authored-by: junyyang-amd <junyyang@amd.com> Co-authored-by: root <root@hjbog-srdc-15.amd.com> * Change AW execution logic from one server one job to one server multi jobs * Change the content as suggested * Fix metadata naming after rebase --------- Co-authored-by: 
root <root@hjbog-srdc-15.amd.com> Co-authored-by: Yutao Xu <xytpai@foxmail.com> * [Feat] Fused qknorm + quant for dpsk v2 model (#963) * [Feat] Fused qknorm + quant for dpsk v2 model * [Fix] Localize SGLang MXFP4 projection preservation --------- Co-authored-by: Cursor <cursoragent@cursor.com> * use ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM to enable non shuffle triton gemm (#1031) * use ATOM_USE_FP4_TRITON_GEMM to enable non shuffle triton gemm Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * update env name and add comments Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix(v4): drop redundant cu_seqlens_q refill in attention metadata builder (#1058) cu_seqlens_q is already populated in ModelRunner as a variable-length prefix sum over num_scheduled_tokens, with the [scheduled_bs+1:bs+1] tail padded to the boundary value for cudagraph. The DeepseekV4 attention metadata builder re-filled it with a uniform np.arange sized scheduled_bs+1, overwriting ModelRunner's correct values. Remove the redundant fill and copy bs+1 entries so the GPU buffer matches the range ModelRunner populates. Also split a grouped local import into per-line imports (isort). 
* [ATOM-vLLM] Upgrade vLLM version to v0.22.0 (#1006) upgrade atom-vllm vllm version to 0.22.0 Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * [feat] Add RLHF rollout integration support (verl) (#549) * [verl] feat: add trust_remote_code arg and compilation_config dict support * [verl] feat: add logprobs and request_id support across sampling pipeline * [verl] feat: weight sync, memory lifecycle and DP isolation for verl integration (TP+DP) * [verl] feat: utility command dispatch and broadcast communication * [verl] feat: basic integration with verl - load_weights, sleep/wake_up API * [atom] fix: rope parameters handling, remove CLI trust_remote_code, and minor fixes * [atom] feat: implement packed weight handling in ModelRunner for FP8 parameters * [verl] refactor: decouple RLHF rollout logic from inference engine into atom/rollout/ * [verl] feat: extend tokenIDProcessor for logprobs support and enhance ModelRunner with DP isolation handling * fix: patch NCCL device binding for DP-isolated ModelRunner * refactor: minimize diff against main by reverting non-functional changes * refactor: improve code readability by formatting and organizing function parameters and comments across multiple files * refactor: extract sleep logic from engine_core busy_loop into helper methods * [verl] refactor: merge logprobs and DP isolation into base ModelRunner, simplify RLHFModelRunner * refactor: rename sleep state variables and update related logic for RL training in EngineCore and ModelRunner * fix: restore mark_trace profiler around cudagraph capture * docs: add veRL + Megatron + ATOM environment setup guide for ROCm * [verl] feat: add logprobs and request_id support across sampling pipeline * [verl] refactor: unify load_weights API with auto mode selection * fix: batch token ID processing in tokenIDProcessor * fix: use process group size instead of config for DP-isolated mode Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * [rollout, atom] fix: align DP logic with 
main * [rollout] fix: remove unnecessary DP config overrides and RLHF APIs from LLMEngine --------- Co-authored-by: Claude Opus 4 <noreply@anthropic.com> * fix Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * trim decode tensors for moe Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * [atom-vllm recipe] align recipe to nightly script (#1040) Co-authored-by: perzhang <perzhang@amd.com> * [sgl-atom][docker]add optional sglang_tag_suffix (#1068) * add docker prefix Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * Enable standalone DeepSeek NextN draft model (#964) Co-authored-by: zhuyuhua-v <yuhzhu@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Feat] enable dualstream in mtp (#1049) * [atom-vllm-benchmark] Change matrix cell launches one server for one ISL/OSL pair + all concurrency (#1075) --------- Co-authored-by: Jun Yan Yang * [atom-vllm benchmark] recover warmup to concurrency Co-authored-by: perzhang <perzhang@amd.com> * Update SGLANG accuracy runner (#1084) * [plugin][perf] refine pa dispatch for better perf (#1038) * add pa dispatch for GLM-4.7 and clean code * refine the dispatch * fix minimax acc * revert unnecessary change * clean code --------- Co-authored-by: Guanbao Yu <gyu@amd.com> * fix fused_moe (#1076) * fix non triton routing expert mask in moe Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fold heads to 8 Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * black Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> Co-authored-by: Zhu Jiale <69138280+zovonoir@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: zovonoir <zovonoir@users.noreply.github.com> Co-authored-by: ZhangLirong <lirzhang@amd.com> Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com> Co-authored-by: Yutao Xu 
<xytpai@foxmail.com> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: junyyang-amd <junyyang@amd.com> Co-authored-by: root <root@hjbog-srdc-15.amd.com> Co-authored-by: JiaoliangYu <Jiaoliang.Yu@amd.com> Co-authored-by: JiaoliangYu <jiaolyu@amd.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com> Co-authored-by: qichu-yun <qichu@amd.com> Co-authored-by: ningding01 <niding@amd.com> Co-authored-by: PerryZhang01 <Perry.Zhang@amd.com> Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: jianhao <Jianhao.Liang@amd.com> Co-authored-by: jianlian <jianlian@amd.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com> Co-authored-by: Jiayun <jiayyu@amd.com> Co-authored-by: Wang, Yiting <18916612990@163.com> Co-authored-by: Zhiwei <yanzhw5@mail3.sysu.edu.cn> Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: Bongwoo Bak <bongwoobak@gmail.com> Co-authored-by: junna2016 <xingjunna.xjn@alibaba-inc.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Hexiang Wang <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: Sijing Yang <Sijing.Yang@amd.com> Co-authored-by: Ling Zhang <69022634+ZLkanyo009@users.noreply.github.com> Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: Wang, Yiting <yitiwang@amd.com>

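The KV-events commit message above notes that `buffer_steps` must be at least 1, because Python's `queue.Queue` treats a `maxsize` of 0 as unbounded, and that the publisher drops the oldest batch when a subscriber is slow. A minimal sketch of that behaviour follows. `DropOldestBuffer` is a hypothetical class written for illustration; it is not the `ZmqEventPublisher` code from the commit.

```python
import queue


class DropOldestBuffer:
    """Hypothetical bounded buffer; illustrative only, not ATOM's implementation."""

    def __init__(self, buffer_steps: int) -> None:
        # queue.Queue(maxsize=0) is unbounded, so reject values below 1.
        if buffer_steps < 1:
            raise ValueError("buffer_steps must be >= 1")
        self._q: queue.Queue = queue.Queue(maxsize=buffer_steps)
        self.dropped = 0  # number of entries evicted to make room

    def put(self, item) -> None:
        # Single-producer sketch: evict the oldest entry when the queue is full.
        while True:
            try:
                self._q.put_nowait(item)
                return
            except queue.Full:
                try:
                    self._q.get_nowait()
                    self.dropped += 1
                except queue.Empty:
                    pass  # a consumer drained the queue first; retry the put

    def drain(self) -> list:
        # Return everything currently buffered, oldest first.
        items = []
        while True:
            try:
                items.append(self._q.get_nowait())
            except queue.Empty:
                return items


buf = DropOldestBuffer(buffer_steps=2)
for step in range(4):
    buf.put(step)
result = buf.drain()
```

With a capacity of 2 and four puts, the two oldest entries are evicted and counted in `dropped`, so a slow consumer only ever sees the most recent batches.
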
zovonoir added 2 commits May 19, 2026 20:51

add gdn decode fast kernel

145ee16

resolve gdn code conflicts

eaf3fb3

Copilot AI review requested due to automatic review settings May 20, 2026 02:11

Copilot started reviewing on behalf of zovonoir May 20, 2026 02:13

zovonoir changed the title ~~add gdn decode fast kernel~~ Add fast GDN decode kernel for Qwen3.5-35B-A3B-FP8 May 20, 2026

Merge branch 'main' into opt-qwen35b

b09ed93

Copilot AI reviewed May 20, 2026

Comment thread atom/model_ops/fla_ops/gdn_decode_fast.py Outdated

Comment thread atom/model_ops/fla_ops/gdn_decode_fast.py Outdated

Comment thread atom/model_ops/attention_gdn.py Outdated

Comment thread atom/model_ops/attention_gdn.py

zovonoir added 2 commits May 20, 2026 10:18

resolve gdn code conflicts

d008b09

solve mispelling error

db009de

Copilot AI review requested due to automatic review settings May 20, 2026 02:20

Copilot started reviewing on behalf of zovonoir May 20, 2026 02:21

Copilot AI reviewed May 20, 2026

Comment thread atom/model_ops/attention_gdn.py Outdated

Comment thread atom/model_ops/fla_ops/gdn_decode_fast.py Outdated

zovonoir added 2 commits May 20, 2026 10:28

solve redundant import error

9f8645f

add layernorm and rope optimization

08c2177

Copilot AI review requested due to automatic review settings May 20, 2026 13:30

zovonoir changed the title ~~Add fast GDN decode kernel for Qwen3.5-35B-A3B-FP8~~ Add Qwen3.5-35B-A3B-FP8 optimization kernel May 20, 2026

Copilot started reviewing on behalf of zovonoir May 20, 2026 13:31

Copilot AI reviewed May 20, 2026

Comment thread atom/model_ops/attention_gdn.py Outdated

Comment thread atom/models/qwen3_next.py Outdated

Comment thread atom/models/qwen3_next.py

Comment thread atom/model_ops/triton_mrope.py

Comment thread atom/model_ops/fla_ops/gdn_decode_fast.py Outdated

zovonoir and others added 4 commits May 21, 2026 10:15

revert non-gdn optimization changes

1791f22

Co-authored-by: Cursor <cursoragent@cursor.com>

revert gdn changes

de840ba

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'origin/main' into opt-qwen35b

9adc15f

add gdn decode lossy fast kernel

71a373d

Copilot AI review requested due to automatic review settings May 21, 2026 13:30

Copilot started reviewing on behalf of zovonoir May 21, 2026 13:30

revert sglang benchmark file changes

4f11f07

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot AI reviewed May 21, 2026

Comment thread atom/model_ops/fla_ops/fused_recurrent.py Outdated

Comment thread atom/model_ops/fla_ops/fused_recurrent.py Outdated

gate gdn decode lossy fast path

3054c45

Co-authored-by: Cursor <cursoragent@cursor.com>

zovonoir self-assigned this May 22, 2026

Merge branch 'main' into opt-qwen35b

f360f4c

Copilot started reviewing on behalf of zovonoir May 27, 2026 05:12

Copilot AI reviewed May 27, 2026

Comment thread atom/model_ops/fla_ops/fused_recurrent.py

wuhuikx requested a review from wanzhenchn May 28, 2026 03:08

style: apply black formatting

1b5fa8f

Fix Check Code Style with Black CI failure on #838.

Copilot AI review requested due to automatic review settings May 29, 2026 03:41

Copilot started reviewing on behalf of zovonoir May 29, 2026 03:42

Copilot AI reviewed May 29, 2026

zovonoir added 2 commits May 29, 2026 11:48

Merge remote-tracking branch 'origin/main' into opt-qwen35b

7d7f4fa

Copilot AI review requested due to automatic review settings May 29, 2026 04:58

Copilot started reviewing on behalf of zovonoir May 29, 2026 04:59

Copilot AI reviewed May 29, 2026

Comment thread atom/model_ops/triton_mrope.py Outdated

zovonoir changed the title ~~Add Qwen3.5-35B-A3B-FP8 optimization kernel~~ Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK May 29, 2026

zovonoir and others added 2 commits May 29, 2026 21:03

Merge branch 'main' into opt-qwen35b

40ef94c

Copilot AI review requested due to automatic review settings June 1, 2026 01:55

Copilot started reviewing on behalf of zovonoir June 1, 2026 01:55

Copilot AI reviewed Jun 1, 2026

valarLip approved these changes Jun 1, 2026

valarLip merged commit dfb8eda into main Jun 1, 2026
25 of 31 checks passed

valarLip deleted the opt-qwen35b branch June 1, 2026 10:00

sunway513 pushed a commit that referenced this pull request Jun 3, 2026

Revert "Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK (#838)"

26f23a0

This reverts commit dfb8eda.

Conversation

zovonoir commented May 20, 2026 • edited

Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK

Summary

1. GDN decode recurrent fast path

2. Fused MRoPE Q/K Triton path

Notes on the external SGLANG_MAMBA_SSM_DTYPE=bfloat16 setting

Performance

Accuracy

Validation

Files

Copilot AI left a comment

Pull request overview

Reviewed changes

Copilot AI left a comment

Pull request overview

Copilot AI left a comment

Pull request overview

zovonoir commented May 21, 2026

Copilot AI left a comment

Pull request overview

Copilot AI left a comment

Pull request overview

zovonoir commented May 27, 2026

zovonoir commented May 28, 2026 • edited

Copilot AI left a comment

Pull request overview

Copilot AI left a comment

Pull request overview

zovonoir commented May 29, 2026

Copilot AI left a comment

Pull request overview

zovonoir commented Jun 1, 2026

zovonoir commented Jun 1, 2026

valarLip May 20, 2026
