[ATOM SGLang] SGL plugin Attention Refractory by ZhiweiYan-96 · Pull Request #863 · ROCm/ATOM

ZhiweiYan-96 · 2026-05-21T02:44:43Z

ATOM SGLang Attention Refactor

Status

Summary

This RFC proposes a staged refactor of the ATOM SGLang plugin attention stack. The goal is to make SGLang-specific runtime, model adaptation, and attention backend responsibilities explicit.

The current direction is:

Decouple generic SGLang full-attention backend code from model-specific DeepSeek MLA code.
Route DeepSeek MLA through SGLangDeepseekMLAAttention as an explicit model-level attention adapter.
Extract SGLang runtime state and ForwardBatch -> ATOM forward_context bridging into scoped runtime utilities.
Introduce a first model adapter registry via SGLangModelAdapterSpec so existing special cases are declared instead of hard-coded.
Split ATOMAttnBackendForSgl by backend lifecycle responsibility.
Keep shared metadata / kernel-call reuse experimental until interfaces are proven stable.

Background

The existing SGLang plugin support grew through several overlapping concerns:

ATOMAttnBackendForSgl handles metadata construction, cache writes, CUDA graph metadata, MHA/MLA dispatch, speculative modes, and kernel calls.
DeepSeek MLA model-specific logic used to live close to generic SGLang attention backend code.
base_model_wrapper.py collected generic wrapper logic, runtime state, model-specific flags, and forward-context bridging.
Some model adaptations were hard-coded by architecture name, for example DeepSeek patching and Qwen3.5 prepare-time config remapping.

Recent branches split these concerns:

This PR collects the changes from the following branches:

attn_model_decouple ([ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full attention backend, zejunchen-zejun/ATOM#28) separates SGLang full-attention backend files from model-specific DeepSeek MLA files.
attn_refrac_share_model ([ATOM-SGL][Attn refrac] Route DeepSeek MLA through an SGLang wrapper, zejunchen-zejun/ATOM#29) introduces SGLangDeepseekMLAAttention as a model-level DeepSeek MLA adapter.
attn_refractory_runtime ([ATOM SGL] runtime extraction, zejunchen-zejun/ATOM#30) extracts SGLang runtime context into atom/plugin/sglang/runtime.
attn_backend_split ([ATOM SGL] Split AtomAttnSGLBackend based on responsibility, zejunchen-zejun/ATOM#31) starts splitting full-attention backend helpers out of ATOMAttnBackendForSgl.
sglang_model_adapter (Zhiwei/sglang model adapter, zejunchen-zejun/ATOM#32) introduces a first function-based model adapter spec.

Goals

Make file ownership and runtime ownership obvious.
Keep generic SGLang full-attention backend code free of model-specific DeepSeek MLA semantics.
Provide a consistent way to express model adaptation needs.
Preserve existing supported model behavior.
Create extension points for future V3.2 sparse indexer and V4 hybrid attention work.

Target architecture

Model adapter layer
  Qwen3.5 outer wrapper
  DeepSeek MLA semantic adapter
  DeepSeek MTP draft wrapper
  future V3.2 / V4 model-specific adapters

SGLang runtime layer
  SGLangForwardBatchMetadata
  SGLangPluginRuntime
  plugin_runtime_scope
  model adapter registry

SGLang framework attention layer
  RadixAttention
  ATOMAttnBackendForSgl
  SGLang token/KV pools
  decode / extend / graph lifecycle

Kernel interface layer
  ForwardMetadata / AttentionMetaData-like fields
  KV indices / indptr / page table layout
  aiter / triton kernel call interfaces

Refactor Tracks

Track 1: Attention File and Responsibility Decoupling

This track has two parts: first, separate generic SGLang full-attention files from DeepSeek-specific MLA files; second, split the remaining full-attention backend by responsibility instead of keeping all backend lifecycle logic in ATOMAttnBackendForSgl.

The first problem was file ownership. Generic SGLang full-attention backend code and DeepSeek-specific MLA helpers lived too close together. The refactor moves them apart:

atom/plugin/sglang/attention_backend/full_attention/
  full_attention_backend.py
  radix_attention.py

atom/plugin/sglang/models/
  deepseek_mla.py
  deepseek_mla_attention.py
  deepseek_mla_forward.py

This track is represented by attn_model_decouple. Its purpose is not to change runtime behavior. Its purpose is to establish ownership:

full_attention/ owns SGLang framework backend behavior.
models/deepseek_mla*.py owns DeepSeek model-specific MLA behavior.
RadixAttention remains the SGLang framework adapter.

This is the foundation for every later PR. Without this move, DeepSeek-specific logic would continue to leak into generic backend files.

The second problem is that ATOMAttnBackendForSgl still owns too many backend responsibilities after the file move. The refactor starts splitting it into focused helpers:

full_attention/
  full_attention_backend.py  # backend orchestrator and dispatch
  metadata.py                # ForwardMetadata
  kv_cache.py                # cache layout shuffle helpers
  pa_metadata.py             # PA persistent metadata helpers

Future splits can continue along the same responsibility boundary:

metadata_builder.py          # decode / extend metadata construction
cuda_graph.py                # CUDA graph capture/replay metadata
decode.py                    # decode dispatch
extend.py                    # extend/prefill dispatch

The split should not be drawn at the top level as an MHA backend versus an MLA backend. MHA and MLA are dispatch cases, whereas metadata construction, KV cache layout, CUDA graph handling, PA metadata, and speculative modes cut across both.

Track 2: `SGLangDeepseekMLAAttention`

DeepSeek MLA cannot be treated like Qwen-style q/k/v attention. Its model forward passes latent MLA state:

hidden_states_or_q_c
kv_c_normed
k_pe
positions
q_scale

These are model-level semantic inputs, not backend-ready attention inputs. The refactor introduces SGLangDeepseekMLAAttention to own this lowering:

DeepseekV2MLAAttention.forward()
  -> self.mla_attn(...)
  -> SGLangDeepseekMLAAttention
  -> RadixAttention
  -> ATOMAttnBackendForSgl

This track is represented by attn_refrac_share_model.

The important design choice is that the wrapper sits above RadixAttention. RadixAttention is an SGLang framework adapter: it expects attention-ready tensors and a ForwardBatch. DeepSeek MLA, however, calls self.mla_attn(...) with model-specific latent state. The wrapper is the place where that semantic gap is closed.

SGLangDeepseekMLAAttention is responsible for:

resolving forward_batch from explicit kwargs or current runtime context,
gathering scattered runtime inputs when SGLang TP communication scatters them,
projecting q_c to final query when needed,
splitting and applying RoPE to q/k RoPE components,
choosing absorbed vs non-absorbed MLA path,
staging latent KV into SGLang's KV pool,
calling the underlying RadixAttention / SGLang backend,
applying DeepSeek MLA V up-projection and output projection.

The absorbed path roughly lowers:

q_input + kv_c_normed + k_pe
  -> q projection
  -> q_nope absorbed BMM
  -> latent KV attention
  -> V up-projection
  -> o_proj

The non-absorbed path roughly lowers:

q_input + kv_c_normed + k_pe
  -> q projection
  -> kv_b_proj expands latent KV into K/V
  -> standard q/k/v-shaped attention
  -> o_proj
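
Both paths should produce the same attention scores; the absorbed path folds the K up-projection into the query so that attention can run directly on the latent KV. The following is a minimal pure-Python sketch of that identity for the no-RoPE component only. It assumes a single head and a toy up-projection matrix; `w_uk`, `q_nope`, `kv_c`, and the helper functions are hypothetical illustrations, not names or shapes taken from the ATOM code.

```python
# Illustrative only: w_uk, q_nope, and kv_c are hypothetical toy values.

def matvec(m, v):
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# K up-projection: maps a latent vector (dim 2) to a per-head key (dim 3).
w_uk = [[1.0, 0.0],
        [0.5, 2.0],
        [-1.0, 1.0]]

q_nope = [0.2, -0.4, 1.0]          # per-head query, no-RoPE part (dim 3)
kv_c = [[1.0, 2.0], [0.0, -1.0]]   # latent KV for two cached tokens (dim 2)

# Non-absorbed: expand each latent into a full key, then score.
scores_expanded = [dot(q_nope, matvec(w_uk, c)) for c in kv_c]

# Absorbed: fold w_uk into the query once, then score against the latents.
q_latent = matvec(transpose(w_uk), q_nope)
scores_absorbed = [dot(q_latent, c) for c in kv_c]
```

In this toy setting the absorbed form never materializes a full key per cached token, which is the property that lets it pair with latent KV attention.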

The wrapper should not own generic backend concerns such as page table construction, CUDA graph replay, or PA metadata buffers. Those stay under the SGLang framework backend.

It solves several problems:

avoids monkey-patching the entire DeepSeek attention forward path,
keeps absorbed / non-absorbed MLA dispatch near the model semantic boundary,
prevents generic SGLang full-attention backend code from needing to understand DeepSeek latent tensors,
gives future DeepSeek variants a clear place to attach model-level semantic adapters.

Track 3: SGLang Runtime Bridge

The SGLang wrapper must translate framework runtime state into what ATOM model code expects. This includes:

current ForwardBatch,
PP proxy tensors,
dummy / idle batch handling,
ATOM plugin framework/config globals,
ATOM forward_context,
target/draft wrapper state for speculative decoding.

The refactor extracts this into atom/plugin/sglang/runtime:

runtime/context.py
  SGLangForwardBatchMetadata
  get_current_forward_batch
  plugin_runtime_scope

runtime/forward_context.py
  SGLangPluginRuntime

runtime/model_arch.py
  model adapter registry

This track is represented by attn_refractory_runtime.

There are three distinct runtime problems:

1. Current SGLang Forward State

Some model-level adapters need access to the current SGLang ForwardBatch without threading it through every intermediate ATOM model call. The runtime package provides SGLangForwardBatchMetadata for this:

SGLangForwardBatchMetadata
  forward_batch
  pp_proxy_tensors
  save_kv_cache

It also keeps get_current_forward_batch() as a narrow compatibility path for adapters such as RadixAttention fallback lookup and DeepSeek MLA wrapper input resolution.
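
One way to expose the current batch without threading it through every call is a `contextvars.ContextVar` bound for the duration of a forward pass. The sketch below assumes that approach; `bind_forward_batch`, the variable name, and the dict-shaped batch are hypothetical stand-ins, and the body of `get_current_forward_batch` here is an illustration rather than the runtime package's implementation.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical holder for the batch currently being processed.
_current_forward_batch = contextvars.ContextVar(
    "current_forward_batch", default=None
)


@contextmanager
def bind_forward_batch(forward_batch):
    """Bind forward_batch for the enclosed block, then restore the old value."""
    token = _current_forward_batch.set(forward_batch)
    try:
        yield forward_batch
    finally:
        _current_forward_batch.reset(token)


def get_current_forward_batch():
    """Return the bound batch, or None when called outside a forward scope."""
    return _current_forward_batch.get()


def deep_model_op():
    # A nested op looks the batch up instead of receiving it as an argument.
    batch = get_current_forward_batch()
    return None if batch is None else batch["batch_size"]


with bind_forward_batch({"batch_size": 4}):
    inside = deep_model_op()
outside = deep_model_op()
```

Resetting with the token, rather than setting the variable back to None, keeps nested bindings correct.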

2. ATOM Plugin Global State

ATOM still has process-global plugin state:

atom.plugin.prepare._CURRENT_FRAMEWORK
atom.config._current_atom_config

SGLang target/draft model wrappers can coexist, especially under speculative decoding. plugin_runtime_scope() scopes those globals around construction, load, patch, and forward sections so one wrapper does not leak runtime state into another.
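
The scoping idea can be sketched as a context manager that installs one wrapper's values into process-global state and restores the previous values on exit. This is an illustration under assumed names (`_STATE`, `runtime_scope`), not the actual `plugin_runtime_scope` implementation.

```python
from contextlib import contextmanager

# Stand-in for process-global plugin state (hypothetical layout).
_STATE = {"framework": None, "atom_config": None}


@contextmanager
def runtime_scope(framework, atom_config):
    """Install one wrapper's globals for the enclosed block, then restore."""
    saved = dict(_STATE)
    _STATE.update(framework=framework, atom_config=atom_config)
    try:
        yield
    finally:
        _STATE.clear()
        _STATE.update(saved)


seen = []
with runtime_scope("sglang", "target-config"):
    seen.append(_STATE["atom_config"])
    # A draft wrapper running inside the target's scope sees its own config...
    with runtime_scope("sglang", "draft-config"):
        seen.append(_STATE["atom_config"])
    # ...and the target's config is back once the draft scope exits.
    seen.append(_STATE["atom_config"])
after = dict(_STATE)
```

Restoring in a `finally` block means an exception raised during construction, load, or forward still leaves the globals as they were found.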

3. SGLang `ForwardBatch` to ATOM `forward_context`

Many ATOM model ops read atom.utils.forward_context.get_forward_context() for information such as:

positions,
prefill/decode mode,
dummy/idle run status,
graph batch size,
DP token distribution,
attention metadata used by MoE padding or auxiliary ops.

SGLangPluginRuntime is a scoped adapter for model wrappers:

with SGLangPluginRuntime(
    atom_config=atom_config,
    forward_batch=forward_batch,
    positions=positions,
    input_ids=input_ids,
    input_embeds=input_embeds,
) as runtime:
    hidden_states = model(
        input_ids=runtime.input_ids,
        positions=runtime.positions,
        inputs_embeds=runtime.input_embeds,
    )
    hidden_states = runtime.trim_output(hidden_states)

It owns:

binding the current ForwardBatch,
materializing ATOM-compatible dummy inputs for SGLang idle batches,
setting and resetting ATOM forward context,
resolving DP token counts for ATOM-side metadata,
trimming ATOM dummy outputs back to SGLang-visible token counts.
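
The dummy-input and trimming responsibilities can be illustrated with a small scoped helper. The sketch assumes that an idle batch is represented by an empty token list and that a single placeholder token is enough to keep the model body runnable; `IdleBatchRuntime`, its fields, and `toy_model` are hypothetical, and plain lists stand in for tensors.

```python
class IdleBatchRuntime:
    """Pads an idle (empty) batch with a dummy token and trims the output."""

    def __init__(self, input_ids, positions):
        self.visible_tokens = len(input_ids)  # rows the framework expects back
        self.input_ids = list(input_ids)
        self.positions = list(positions)

    def __enter__(self):
        if self.visible_tokens == 0:
            # Materialize one dummy token so the model body has work to do.
            self.input_ids = [0]
            self.positions = [0]
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions

    def trim_output(self, hidden_states):
        # Drop rows that were produced only for dummy tokens.
        return hidden_states[: self.visible_tokens]


def toy_model(input_ids, positions):
    # One "hidden state" row per input token.
    return [[float(t), float(p)] for t, p in zip(input_ids, positions)]


with IdleBatchRuntime(input_ids=[], positions=[]) as runtime:
    raw = toy_model(runtime.input_ids, runtime.positions)
    idle_out = runtime.trim_output(raw)

with IdleBatchRuntime(input_ids=[7, 8], positions=[0, 1]) as runtime:
    real_out = runtime.trim_output(
        toy_model(runtime.input_ids, runtime.positions)
    )
```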

The important boundary is:

SGLang model wrapper -> ATOM model body

The runtime bridge is not for ATOMAttnBackendForSgl kernel dispatch. The full-attention backend should use SGLang ForwardBatch and backend metadata directly.

This separation prevents a common failure mode: pushing model-wrapper runtime concerns into the attention backend simply because both happen to see ForwardBatch.

Track 4: Model Adapter Interface

The current code already has multiple model adaptation patterns:

Qwen3 / Qwen3Moe use the default base wrapper.
Qwen3Next needs GDN runtime context binding.
Qwen3.5 keeps the upstream SGLang outer wrapper and swaps in an ATOM language model stack.
DeepSeek V3 MLA needs install-time attention adaptation.
DeepSeek MTP needs a draft wrapper, config override, layer-id retagging, and embed/head sharing.
Future V3.2 needs sparse indexer side cache / top-k buffer handling.
Future V4 needs hybrid state/cache/metadata ownership.

Using more booleans in ModelArchSpec does not scale. The first implementation step is SGLangModelAdapterSpec:

from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class SGLangModelAdapterSpec:
    wrapper_binds_gdn_context: bool = False
    prepare_config: Callable[[Any, str], None] | None = None
    install_adapters: Callable[[Any], None] | None = None

This is intentionally small. It replaces hard-coded special cases without claiming to be a complete future-proof framework.

Current uses:

DeepseekV3ForCausalLM uses install_adapters=setup_deepseek_for_sglang.
Qwen3NextForCausalLM keeps wrapper_binds_gdn_context=True.
Qwen3_5ForConditionalGeneration and Qwen3_5MoeForConditionalGeneration use prepare_config=apply_prepare_model_adaptations.

Future lifecycle hooks may include:

construct_model,
load_weights,
post_load,
runtime_policy,
output_policy,
cache owner registration,
metadata adapter registration.

The key point is that new models should declare adaptation needs through a registry instead of adding new one-off branches in the generic wrapper.
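
A registry of this kind can be sketched as a dictionary from architecture name to spec, with the generic wrapper calling whichever hooks are present. The spec fields mirror the dataclass above; the registry name, the lookup helper, the toy hook bodies, the toy architecture names, and the reading of the second `prepare_config` argument as an architecture name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class AdapterSpec:
    wrapper_binds_gdn_context: bool = False
    prepare_config: Optional[Callable[[Any, str], None]] = None
    install_adapters: Optional[Callable[[Any], None]] = None


def toy_prepare_config(config: dict, arch: str) -> None:
    config["remapped_for"] = arch      # placeholder for config remapping


def toy_install_adapters(model: dict) -> None:
    model["attention"] = "wrapped"     # placeholder for submodule wrapping


# Hypothetical registry keyed by architecture name.
REGISTRY = {
    "ToyRemapModel": AdapterSpec(prepare_config=toy_prepare_config),
    "ToyPatchedModel": AdapterSpec(install_adapters=toy_install_adapters),
}
DEFAULT_SPEC = AdapterSpec()


def get_spec(arch: str) -> AdapterSpec:
    return REGISTRY.get(arch, DEFAULT_SPEC)


def build(arch: str) -> tuple:
    """Generic wrapper path: no per-architecture branches, only declared hooks."""
    spec = get_spec(arch)
    config, model = {}, {"attention": "plain"}
    if spec.prepare_config is not None:
        spec.prepare_config(config, arch)
    if spec.install_adapters is not None:
        spec.install_adapters(model)
    return config, model


remap_result = build("ToyRemapModel")
patched_result = build("ToyPatchedModel")
default_result = build("UnknownModel")
```

Adding a model then means adding one registry entry; `build` itself does not change.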

This track is represented by sglang_model_adapter. It is intentionally a small first step: it codifies existing DeepSeek and Qwen3.5 special cases without trying to solve every future model family in one PR.

The intended lifecycle for future adapters is:

prepare_config
  Patch or remap config before ATOM model construction.

construct_model
  Optional custom construction for outer wrappers, draft models, or hybrid runtimes.

install_adapters
  Patch/wrap submodules after construction, such as DeepSeek MLA attention wrappers.

load_weights / post_load
  Optional custom checkpoint mapping, shared-weight binding, or post-load transforms.

runtime_policy
  Declare whether the model needs default runtime, GDN context, context-only forward,
  or a custom runtime bridge.

output_policy
  Declare how hidden states become SGLang-visible outputs.

The first PR only implements the two hooks that existing code already needs:

prepare_config
install_adapters

It deliberately leaves the rest as design direction. That keeps review scope small while still moving away from boolean flags.

Existing mappings:

Qwen3 / Qwen3Moe
  default adapter

Qwen3Next
  wrapper_binds_gdn_context=True

Qwen3.5 / Qwen3.5-MoE
  prepare_config=apply_prepare_model_adaptations

DeepSeek V3 MLA
  install_adapters=setup_deepseek_for_sglang

Future mappings should be additive:

DeepSeek MTP
  construct_model + load_weights + runtime policy + embed/head sharing

DeepSeek V3.2
  install indexer adapter + cache owner hook + sparse metadata adapter

DeepSeek V4
  custom construction + state cache owner + V4 metadata/runtime adapter

The adapter registry is therefore a coordination point, not a replacement for model-specific modules. Complex models should still keep their logic in focused files such as deepseek_mla_attention.py, deepseek_nextn_wrapper.py, or a future deepseek_v4_adapter.py.

Copilot

Pull request overview

This PR refactors the ATOM SGLang plugin attention stack to make SGLang runtime state, model-level adaptation (e.g., DeepSeek MLA), and full-attention backend responsibilities explicit and better separated. It introduces a small model-adapter registry, moves runtime/forward-context bridging into a dedicated runtime package, and splits the previously monolithic backend helpers into focused modules while keeping behavior aligned with existing supported models.

Changes:

Introduces atom.plugin.sglang.runtime (scoped runtime globals, forward-context bridge, and model adapter registry) and updates wrappers to use it.
Decouples DeepSeek MLA model adaptation into atom/plugin/sglang/models/deepseek_mla* and removes the old monolithic sgl_attention_mla.py.
Splits the SGLang full-attention backend into helper modules (metadata, kv_cache, pa_metadata) and updates import paths across plugin and core ops.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

File	Description
tests/plugin/test_sglang_register.py	Updates mocks/imports for the renamed full-attention backend module and additional model imports.
tests/plugin/test_sglang_model_wrapper.py	Updates DeepSeek MLA setup-hook import path to the new `models.deepseek_mla` module.
atom/plugin/sglang/runtime/model_arch.py	Adds `SGLangModelAdapterSpec` + registry for prepare/install hooks and wrapper flags.
atom/plugin/sglang/runtime/forward_context.py	Adds `SGLangPluginRuntime` to bridge `ForwardBatch` into ATOM forward_context and handle dummy/idle batches.
atom/plugin/sglang/runtime/context.py	Adds scoped runtime utilities (`plugin_runtime_scope`, forward-batch ContextVars, metadata binding helpers).
atom/plugin/sglang/runtime/__init__.py	Exposes the runtime utilities as a public package surface.
atom/plugin/sglang/models/qwen3_5.py	Switches to runtime package import and updates comment to reference `MODEL_ARCH_SPECS`.
atom/plugin/sglang/models/deepseek_nextn_wrapper.py	Migrates draft wrapper to `SGLangPluginRuntime` + `plugin_runtime_scope`.
atom/plugin/sglang/models/deepseek_mla.py	Adds install-time DeepSeek MLA patch entrypoint (`setup_deepseek_for_sglang`) in a model-owned module.
atom/plugin/sglang/models/deepseek_mla_forward.py	Extracts DeepSeek MLA shared helper functions (BMM paths, weight post-load processing, KV staging).
atom/plugin/sglang/models/deepseek_mla_attention.py	Adds `SGLangDeepseekMLAAttention` model-level adapter to lower latent MLA inputs into backend-ready attention calls.
atom/plugin/sglang/models/base_model_wrapper.py	Replaces embedded runtime/context logic with `atom.plugin.sglang.runtime` and adapter-driven hooks.
atom/plugin/sglang/attention_backend/sgl_attention_mla.py	Removes the old monolithic DeepSeek MLA SGLang plugin module.
atom/plugin/sglang/attention_backend/full_attention/radix_attention.py	Updates fallback `get_current_forward_batch` import to runtime package.
atom/plugin/sglang/attention_backend/full_attention/pa_metadata.py	Adds helper module for PA persistent metadata buffer allocation/build.
atom/plugin/sglang/attention_backend/full_attention/metadata.py	Adds `ForwardMetadata` dataclass in its own module.
atom/plugin/sglang/attention_backend/full_attention/kv_cache.py	Moves KV layout shuffle kernel + helper into a dedicated module.
atom/plugin/sglang/attention_backend/full_attention/full_attention_backend.py	Refactors backend to use extracted helper modules and updates naming/imports.
atom/plugin/sglang/attention_backend/full_attention/__init__.py	Adds package exports for full-attention backend components.
atom/plugin/sglang/attention_backend/attention_gdn.py	Updates import path for `SGLangForwardBatchMetadata` to runtime package.
atom/plugin/register.py	Updates custom attention backend import path to the new full-attention backend module.
atom/plugin/prepare.py	Routes model-specific config preparation via the new model adapter spec (`get_model_arch_spec`).
atom/model_ops/attentions/aiter_attention.py	Updates `RadixAttention` import path to the new full-attention location.
atom/model_ops/__init__.py	Updates `RadixAttention` import path to the new full-attention location.


+            with SGLangPluginRuntime(
+                atom_config=self.atom_config,
+                forward_batch=forward_batch,
+                positions=positions,
+                input_ids=input_ids,
+                input_embeds=input_embeds,
+            ):
                hidden_states = self.model(
                    input_ids=input_ids,
                    positions=positions,
                    hidden_states=forward_batch.spec_info.hidden_states,
                    inputs_embeds=input_embeds,
                )


Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.

+    """Fuse q/k RMSNorm and q quant using ATOM's DeepSeek-V2 path."""
+
+    (q_quantized, q_scale), q_normed, k_nope_normed, _ = _fuse_rmsnorm_quant(
+        q,
+        attn.q_a_layernorm.weight,
+        attn.q_a_layernorm.eps,
+        k_nope,
+        attn.kv_a_layernorm.weight,
+        attn.kv_a_layernorm.eps,
+        None,
+        dtype_quant=attn.quant_dtype,
+        shuffle=False,
+        scale_shuffle_padding=False,
+        group_size=128,
+        quant_type=_linear_quant_type_value(attn.q_b_proj),
+        output_unquantized_inp1=output_unquantized_q,
+        transpose_scale=True,
+    )


Copilot

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 1 comment.

zejunchen-zejun · 2026-06-01T11:09:22Z

prepare_model method is atom-sglang specific right? Can be moved under atom/plugin/sglang

thanks for suggestions, moved

Copilot

Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

atom/plugin/__init__.py:7

atom.plugin no longer re-exports prepare_model, which is a breaking change for callers using from atom.plugin import prepare_model. If you keep the backward-compatible wrapper in atom/plugin/prepare.py, consider re-exporting it here as well.

from .prepare import is_plugin_mode, is_sglang, is_vllm

__all__ = [
    "is_sglang",
    "is_vllm",
    "is_plugin_mode",
]

 def _set_framework_backbone(framework: str) -> None:
    if framework.lower() not in _SUPPORTED_FRAMEWORKS:
        raise ValueError(f"Unsupported framework {framework} for ATOM to plug in")
    global _CURRENT_FRAMEWORK
    _CURRENT_FRAMEWORK = framework


 from atom.model_engine.llm_engine import LLMEngine
 from atom.sampling_params import SamplingParams

-# interface for upper framework to construct the model from ATOM
-from atom.plugin import prepare_model
+from atom.plugin.sglang import prepare_model_for_sglang

 __all__ = [
    "LLMEngine",
    "SamplingParams",
-    "prepare_model",
+    "prepare_model_for_sglang",
 ]


Copilot

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.

+                model_inputs = dict(
+                    input_ids=runtime.input_ids,
+                    positions=runtime.positions,
+                    intermediate_tensors=SGLangForwardBatchMetadata.to_intermediate_tensors(
+                        pp_proxy_tensors, metadata
+                    ),


+    num_slots, num_kv_heads, head_dim = k_buffer.shape
+    num_blocks = num_slots // block_size
+    num_slots_with_block = num_blocks * block_size
+    k_buffer = k_buffer[:num_slots_with_block].view(
+        num_blocks, block_size, num_kv_heads, head_dim
+    )
+    v_buffer = v_buffer[:num_slots_with_block].view(
+        num_blocks, block_size, num_kv_heads, head_dim
+    )
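
The slicing above drops any trailing slots that do not fill a whole block before reshaping. A small worked example of that arithmetic follows, with a flat list standing in for the buffer and toy sizes chosen for illustration; `split_into_blocks` is a hypothetical helper, not part of the PR.

```python
def split_into_blocks(slots, block_size):
    """Group a flat slot list into full blocks, discarding a partial tail."""
    num_blocks = len(slots) // block_size
    usable = num_blocks * block_size
    return [slots[i : i + block_size] for i in range(0, usable, block_size)]


# 10 slots with block_size 4 give 2 full blocks; slots 8 and 9 are left out.
blocks = split_into_blocks(list(range(10)), block_size=4)
```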


…ention backend

Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention backend as the host/backend layers. Shrink deepseek_mla_forward.py into a helper module and clarify absorbed vs non-absorbed path naming.

Co-authored-by: Cursor <cursoragent@cursor.com>

Copilot

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 1 comment.

+    def to_intermediate_tensors(
+        intermediate_tensors,
+        metadata: Optional["SGLangForwardBatchMetadata"],
+    ):
+        if intermediate_tensors is not None or metadata is None:
+            return intermediate_tensors
+        pp_proxy_tensors = metadata.pp_proxy_tensors
+        if pp_proxy_tensors is None:
+            return intermediate_tensors
+        tensors = getattr(pp_proxy_tensors, "tensors", None)
+        if tensors is None:
+            return intermediate_tensors


* Qwen3.5-35B-A3B-FP8: GDN decode lossy fast path + fused MRoPE QK (#838)
* add gdn decode fast kernel
* resolve gdn code conflicts
* resolve gdn code conflicts
* solve mispelling error
* solve redundant import error
* add layernorm and rope optimization
* revert non-gdn optimization changes
  Co-authored-by: Cursor <cursoragent@cursor.com>
* revert gdn changes
  Co-authored-by: Cursor <cursoragent@cursor.com>
* add gdn decode lossy fast kernel
* revert sglang benchmark file changes
  Co-authored-by: Cursor <cursoragent@cursor.com>
* gate gdn decode lossy fast path
  Co-authored-by: Cursor <cursoragent@cursor.com>
* address gdn decode review comments
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(gdn): zero out for PAD_SLOT_ID in lossy fast kernel
  When ssm_state_indices contains a negative slot id (e.g. SGLang's PAD_SLOT_ID = -1 for idle/padded decode slots) the kernel previously returned early without writing to out, leaving the corresponding positions in the output tensor uninitialized and propagating garbage into downstream ops. Match the safer behavior expected by callers: write zeros to out for the invalid slot and skip the state load/store entirely.
  Addresses the latest Copilot review comment on PR #838.
* style: apply black formatting
  Fix Check Code Style with Black CI failure on #838.
* perf(qwen3.5): add fused MRoPE QK Triton path
  Merges the MRoPE Q/K fusion work originally in #888 into this PR so the two related Qwen3.5-35B-A3B-FP8 optimizations ship together (per review feedback that #888's stand-alone +1.7% gain is too small to justify a separate PR).
  Adds:
  - atom/model_ops/triton_mrope.py: specialized Qwen3.5 MRoPE Q/K Triton kernels (tiled + per-token) with a try_mrope_qk_fused dispatcher decorated with @torch.compiler.disable so Dynamo cannot specialize positions/q/k symbolic dims to constants (was tripping ConstraintViolationError under MMStar dynamic-shape compile).
  - atom/models/qwen3_next.py: wires try_mrope_qk_fused into Qwen3NextAttention after qk_norm; falls back to the generic rotary_emb path when the shapes don't match.
  Combined effect over main (MI308X, CONC 224, ISL 4094, OSL 2048, TP/EP 1/1, ATOM_ENABLE_GDN_DECODE_LOSSY_FAST=1):
  - Total token throughput: 7466.90 -> 8004.41 tok/s (+7.20%)
  - Mean E2E latency: 176401 -> 164893 ms (-6.52%)
  - Mean TPOT: 77.44 -> 71.87 ms (-7.19%)
  GSM8K 5-shot remains on par with main:
  - flexible-extract: 0.895 (vs 0.8946 baseline)
  - strict-match: 0.903 (vs 0.9052 baseline)
* fix(mrope): early-return under torch.compile instead of graph break
  Previously try_mrope_qk_fused used @torch.compiler.disable to keep the Python shape branches out of Dynamo. That fixed the original ConstraintViolationError but introduced a new MMStar failure:
    torch._dynamo.exc.BackendCompilerFailed: backend='...VllmBackend' raised: AssertionError: VllmBackend can only be called once
  The graph break inserted by @torch.compiler.disable inside the compiled Qwen3NextAttention forward causes Dynamo to invoke ATOM's VllmBackend a second time on the same instance.
  Switch to torch.compiler.is_compiling() early-return: under compile we skip the fused path entirely (fall back to self.rotary_emb, identical to main), eager mode keeps the fused-path perf gain. No graph break, no double-backend invocation.
* perf(mrope): drop tl.constexpr on num_tokens to avoid recompilation
  num_tokens equals positions.shape[1], which changes every batch (mixed prefill/decode, varying decode batch sizes). With tl.constexpr, Triton specializes and recompiles the kernel for every distinct value, which defeats the perf gain of the fused path.
  num_tokens is only used in a runtime mask (row_mask = rows < num_tokens), so it does not need constexpr semantics. Drop the annotation so the kernel is compiled once per shape group.
  Addresses Copilot review r3322237301.
  ---------
  Co-authored-by: Cursor <cursoragent@cursor.com>
  Co-authored-by: zovonoir <zovonoir@users.noreply.github.com>
* fix(spec_decode): support DP attention with MTP in Deepseek V4 (#1001)
* fix(spec_decode): support DP attention with MTP draft
  Refresh dp_metadata per draft step (force variable-length DP path) and add num_spec_step + scheduled_spec_decode_tokens to the dummy decode batch so DP+MTP runs stay in lockstep.
* style: apply black formatting
  ---------
  Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com>
* Remove qkv 256 tok limitation (#999)
* [Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention (#750)
* [feat][Attention Refactor] Reconstruct the Attention arch
  Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* ci(benchmark): raise benchmark drain MAX_MIN 30->60 and step timeout 60->80 (#1019)
  High-concurrency long-context benchmarks (DP-attention 8k/1k c=1024, which runs num_prompts = conc*10 = 10240) need ~48 min wall: ~14 min warmup + ~34 min for the measured run (10 waves of 1024 at ~3:20/wave). The benchmark drain's MAX_MIN=30 cut them off mid-run with exit 4 (timeout), failing the job even though the server was healthy and still making progress.
  Raise the benchmark drain MAX_MIN 30->60 and the "Run benchmark" step timeout-minutes 60->80 so these runs complete. Fast jobs are unaffected (drain exits on client completion, well before MAX_MIN); genuine hangs/faults still surface quickly via STUCK_POLLS (3 min) and fault detection, not MAX_MIN. The accuracy drain (MAX_MIN=30) is left unchanged.
* [atom-vllm-benchmark] Retrieve model case name (#1022)
  Co-authored-by: root <root@hjbog-srdc-15.amd.com>
* ci(accuracy): set Qwen3.5-35B-A3B TP2 baseline to 0.85 (#993)
  Mean of first 4 valid CI runs after PR #893 (0.8226 / 0.8529 / 0.8620 / 0.8628). Threshold 0.83 unchanged.
  Co-authored-by: JiaoliangYu <jiaolyu@amd.com>
  Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix: support PTPC indexer wk FP8 scales (#1009)
* fix: support MTP indexer wk FP8 scales
  Allow DeepSeek-V3.2 MTP checkpoints to load indexer.wk tensors that use per-channel FP8 scales while preserving the existing block-scale path.
  Co-authored-by: Cursor <cursoragent@cursor.com>
* fix: clarify PTPC indexer wk scale support
  Describe the per-channel FP8 scale path as PTPC quantization support rather than MTP-specific behavior.
  Co-authored-by: Cursor <cursoragent@cursor.com>
  ---------
  Co-authored-by: Cursor <cursoragent@cursor.com>
* [Fix] Enable dpsk r1 mxfp4 V2 model (#934)
* [Fix] Enable dpsk r1 mxfp4 V2 model
* [Benchmark] Change model to dpsk v2 model for sglang plugin
* [Fix] Move MXFP4 kv_b_proj preservation into SGLang MLA
* [Fix] Handle SGLang MXFP4 kv_b_proj postprocess order
* Add fused chunk GDN prefill path for Qwen3.5-35B (#921)
* Add fused chunk GDN prefill path for Qwen3.5-35B
  Port AMD HIP fast path from sglang's flash-linear-attention to chunk_gated_delta_rule prefill. Fuses 4 kernels into 3.
* remove unused o_16 in fused_merge_recompute_kernel
* format NT_16 ternary on single line for black
* [fix](attn): fix slot mapping in model runner v2 (#1015)
  Co-authored-by: perzhang <perzhang@amd.com>
* [MoE] adapt to triton_kernels matmul_ogs -> matmul rename (#763)
  Upstream triton_kernels merged the `matmul_ogs` module into `matmul` and the `matmul_ogs_details` package into `matmul_details`. The `PrecisionConfig` dataclass was also reshaped: `weight_scale` is now `b_mx_scale`, and setting it requires `b_microblock_size` to be provided explicitly (enforced by an assert in the new `matmul()`).
  - fused_moe_triton: try importing `FnSpecs / FusedActivation / PrecisionConfig / matmul` from `triton_kernels.matmul` first, fall back to the old `triton_kernels.matmul_ogs` path. Alias `matmul as matmul_ogs` so existing call sites stay unchanged.
  - moe (Mxfp4MoEMethod.process_weights_after_loading): same dual-path import for `FlexCtx / PrecisionConfig`; detect the kwarg name via `dataclasses.fields` so the old `weight_scale=` path keeps working while the new API takes `b_mx_scale=` + `b_microblock_size=`.
  - Drop the `_amd_smem_safe_tile` workaround that pinned block_m / block_n on gfx950: the underlying LDS-spill is no longer reproducible against current triton / triton_kernels.
  Co-authored-by: jianlian <jianlian@amd.com>
  Co-authored-by: Cursor <cursoragent@cursor.com>
* CI: Use linux-atom-mi35x-1 in docker release pipeline
* [atom-vllm benchmark] set 0 to random range ratio for vllm bench (#1029)
* Fix AW benchmark fixed length config (#1020)
  Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
* Clarify AW benchmark matrix job name (#1021)
* Clarify AW benchmark matrix job name
* Use explicit zero ratio for AW benchmark cases
  ---------
  Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
  ---------
  Co-authored-by: wuhuikx <hattie.wu@amd.com>
  Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
* [atom-sgl-benchmark] Debug timeout (#977)
  Co-authored-by: root <root@hjbog-srdc-15.amd.com>
* [atom-vllm benchmark] allow P0 benchmarks at 128 and 256 concurrency (#1036)
  Allow P0 benchmarks at 128 and 256 concurrency (#1030)
  Co-authored-by: root <root@hjbog-srdc-15.amd.com>
* fix: chunk prefill (#1032)
* remove disable deepseek v4 chunk prefill flag
* fix(scheduler): use num_tokens for preempted seq re-prefill chunk size
  Preempted seqs keep their decoded token_ids (preempt() only deallocates KV blocks) so seq.num_tokens > seq.num_prompt_tokens on re-admit. Computing num_new_tokens from num_prompt_tokens caused chunk=0 when a fully-cached prefix exhausted num_prompt_tokens, triggering the "chunk must be positive" assert under high concurrency benchmarks.
* fix format * fix sparse_attn_v4_paged_prefill for MI308 (#1003) * [ATOM SGLang] SGL plugin Attention Refractory (#863) * add work log * [ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full attention backend * remove work log * [ATOM-SGL][Attn refrac] Route DeepSeek MLA through an SGLang wrapper Move the SGLang DeepSeek MLA runtime entry from legacy forward glue into SGLangDeepseekMLAAttention while keeping RadixAttention and the full-attention backend as the host/backend layers. Shrink deepseek_mla_forward.py into a helper module and clarify absorbed vs non-absorbed path naming. * [ATOM SGL] runtime extraction * [ATOM-SGL][Runtime] Introduce model adapter specs Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Runtime] Keep custom wrappers out of generated entries Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Split full attention backend helpers Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Format refactored attention files Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Fix ruff findings in refactored attention code Co-authored-by: Cursor <cursoragent@cursor.com> * [ATOM-SGL][Attn refrac] Avoid DeepSeek MLA wrapper module cycle Co-authored-by: Cursor <cursoragent@cursor.com> * fix rebase issue * precheckin * prepare for sglang only * import error meet in qwen3.5 * qwen3.5 acc fix * [Fix] Limit static FP4 linear kv_b_proj post-processing --------- Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: qichu-yun <qichu@amd.com> * [enable EP] deepseek V4 (#875) * [enable EP] deepseek V4 * update * [KV-events] ZMQ publisher for KV cache events (#869) * feat(kv-events): ZMQ publisher for KV cache events Add a KV cache lifecycle event pipeline so external consumers can track when blocks become resident, are evicted, or move across tiers. 
- atom/distributed/kv_events.py: EventBatch + tagged-union schema (BlockStored, BlockRemoved, AllBlocksCleared, BlockTransferred); ZMQ PUB publisher with a background sender thread and bounded queue (drops oldest on slow subscriber).
- atom/model_engine/block_manager.py: emit BlockStored on prefix-cache coalesced runs, BlockRemoved on lazy LRU eviction, AllBlocksCleared on clear_cache(); record_remote_store() hook for remote-transfer connectors to emit BlockStored(medium=REMOTE).
- atom/model_engine/scheduler.py: publish_kv_events() drains the BlockManager event log per scheduler step into one EventBatch; shutdown_kv_events() tears down the publisher on engine shutdown.
- atom/model_engine/engine_core.py: publisher lifecycle wiring.
- atom/utils/envs.py: ATOM_KV_EVENTS_{ENABLE,PUBLISHER,ENDPOINT,TOPIC,HWM,BUFFER_STEPS} env vars.
- atom/config.py: KV-events config knobs.
- tests/test_kv_events.py: schema round-trip + tagged-union batch.

BlockTransferred and medium in {CPU, DISK} are reserved in the schema but not emitted yet. The hybrid-cache metadata fields on BlockStored (kv_cache_spec_kind, kv_cache_spec_sliding_window) are reserved wire slots emitted as None until a follow-up wires them from the cache-spec coordinator.

Review feedback (incorporated):
- Make pyzmq an optional runtime dep: import zmq inside ZmqEventPublisher so BlockManager's unconditional import of this module no longer requires pyzmq when KV events are disabled.
- Validate buffer_steps >= 1 in ZmqEventPublisher so 0 (which Python's queue.Queue treats as unbounded) can't silently disable backpressure.
- Track encode failures in stats (encode_errors counter) instead of swallowing the exception silently.
- Add BlockManager.kv_events_enabled property so the scheduler stops reaching into _event_log directly.
- Use the MEDIUM_REMOTE constant rather than the "REMOTE" string literal in record_remote_store.
- Use pytest.importorskip("zmq") and an inproc:// endpoint in test_zmq_publisher_roundtrip so the test no longer hard-codes a TCP port and can be skipped cleanly when pyzmq is absent.

* chore(kv-events): trim verbose comments and docstrings

  Remove descriptive comments and docstrings that restated what the code already says, leaving only the ones whose WHY is non-obvious (lazy eviction point, coalesced-store parent semantics, sticky cache_miss invariant, drop-on-overflow design, clear_cache live-seq invariant).

* fix(kv-events): import MEDIUM_REMOTE for record_remote_store

  The earlier commit added a MEDIUM_REMOTE reference at the record_remote_store() emit site but the import line was never added, which would have raised NameError on first remote-store callback. Path wasn't exercised in the local smoke run because we never wired a KV-transfer producer.

* fix(kv-events): close shutdown race and drop unused _EventBatch
* fix(kv-events): align KVEventsConfig defaults with env
* fix(kv-events): teardown safety, multipart docstring, parent_hash dedupe
* fix(kv-events): no BlockRemoved on cache-hit block reuse
* fix(kv-events): chain parent on remote store, atomic drain, longer linger
* fix(kv-events): use sub.poll in test_zmq_publisher_roundtrip
* Merge branch 'main' into feat/kv-events
* fix(kv-events): publish on every step, skip cached blocks on remote-store, safer shutdown
* fix(kv-events): default endpoint to loopback for safer opt-in
* fix(kv-events): default group_idx to None to match vLLM wire layout
* fix(kv-events): call hash_blocks before fwd_output idx-skip

  main's postprocess() skipped seqs whose idx is None (prefill step pattern) before calling hash_blocks(), so the prefill seq's hashes were never registered and BlockStored was never emitted. Move the hash_blocks call above the idx-None continue so it runs on every prefill step regardless of the fwd_output idx mapping.
* test(kv-events): rename test_cache_hit_emits_no_new_store -> only_new_blocks * kv_events: log first encode error, count shutdown drops, hoist event-log check * black format * kv_events: harden finally, add overflow/encode tests * pyproject: add msgspec to deps * [atom-vllm benchmark] enable DeepSeek V3.2 quick reduce envs (#1047) * [atom-vllm] enable DeepSeek V3.2 quick reduce envs Co-authored-by: Cursor <cursoragent@cursor.com> * add accuracy recipe --------- Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * fix: warmup uses full token budget for DP (#1024) * fix: warmup uses full token budget * only for dp attn --------- Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com> * feat: support DeepSeek-V4-Flash-Base model on gfx942 device. (#996) * Expose ATOM test base image input (#1053) * [atom-vllm-benchmark] Add model case amd/DeepSeek-V3.2-mtp-ptpc for AW_P0 (#1039) * Add model case amd/DeepSeek-V3.2-mtp-ptpc for AW_P0 * First run non-mtp version * Remove 'MTP' from choice_label * Add model case amd/DeepSeek-V3.2-mtp-ptpc to accuracy and recipe * Add launch params to deepseek v3.2 ptpc --------- Co-authored-by: root <root@hjbog-srdc-15.amd.com> * [atom-vllm-benchmark] Change AW execution logic from one server one job to one server multi jobs (#1005) * Rename to AW (#1000) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * Debug 'no such file or directory benchmark_matrix.json' (#1002) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * [minimax dev_perf] remove qkv token 256 limitation for ar fusion (#1004) * [atom-vllm benchmark] refine model case name (#995) Co-authored-by: root <root@hjbog-srdc-15.amd.com> * Remove qkv 256 tok limitation --------- Co-authored-by: junyyang-amd <junyyang@amd.com> Co-authored-by: root <root@hjbog-srdc-15.amd.com> * Change AW execution logic from one server one job to one server multi jobs * Change the content as suggested * Fix metadata naming after rebase --------- Co-authored-by: 
root <root@hjbog-srdc-15.amd.com> Co-authored-by: Yutao Xu <xytpai@foxmail.com> * [Feat] Fused qknorm + quant for dpsk v2 model (#963) * [Feat] Fused qknorm + quant for dpsk v2 model * [Fix] Localize SGLang MXFP4 projection preservation --------- Co-authored-by: Cursor <cursoragent@cursor.com> * use ATOM_USE_FP4_NON_SHUFFLE_TRITON_GEMM to enable non shuffle triton gemm (#1031) * use ATOM_USE_FP4_TRITON_GEMM to enable non shuffle triton gemm Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * update env name and add comments Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * Apply suggestions from code review Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --------- Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> * fix(v4): drop redundant cu_seqlens_q refill in attention metadata builder (#1058) cu_seqlens_q is already populated in ModelRunner as a variable-length prefix sum over num_scheduled_tokens, with the [scheduled_bs+1:bs+1] tail padded to the boundary value for cudagraph. The DeepseekV4 attention metadata builder re-filled it with a uniform np.arange sized scheduled_bs+1, overwriting ModelRunner's correct values. Remove the redundant fill and copy bs+1 entries so the GPU buffer matches the range ModelRunner populates. Also split a grouped local import into per-line imports (isort). 
* [ATOM-vLLM] Upgrade vLLM version to v0.22.0 (#1006) upgrade atom-vllm vllm version to 0.22.0 Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> * [feat] Add RLHF rollout integration support (verl) (#549) * [verl] feat: add trust_remote_code arg and compilation_config dict support * [verl] feat: add logprobs and request_id support across sampling pipeline * [verl] feat: weight sync, memory lifecycle and DP isolation for verl integration (TP+DP) * [verl] feat: utility command dispatch and broadcast communication * [verl] feat: basic integration with verl - load_weights, sleep/wake_up API * [atom] fix: rope parameters handling, remove CLI trust_remote_code, and minor fixes * [atom] feat: implement packed weight handling in ModelRunner for FP8 parameters * [verl] refactor: decouple RLHF rollout logic from inference engine into atom/rollout/ * [verl] feat: extend tokenIDProcessor for logprobs support and enhance ModelRunner with DP isolation handling * fix: patch NCCL device binding for DP-isolated ModelRunner * refactor: minimize diff against main by reverting non-functional changes * refactor: improve code readability by formatting and organizing function parameters and comments across multiple files * refactor: extract sleep logic from engine_core busy_loop into helper methods * [verl] refactor: merge logprobs and DP isolation into base ModelRunner, simplify RLHFModelRunner * refactor: rename sleep state variables and update related logic for RL training in EngineCore and ModelRunner * fix: restore mark_trace profiler around cudagraph capture * docs: add veRL + Megatron + ATOM environment setup guide for ROCm * [verl] feat: add logprobs and request_id support across sampling pipeline * [verl] refactor: unify load_weights API with auto mode selection * fix: batch token ID processing in tokenIDProcessor * fix: use process group size instead of config for DP-isolated mode Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * [rollout, atom] fix: align DP logic with 
main * [rollout] fix: remove unnecessary DP config overrides and RLHF APIs from LLMEngine --------- Co-authored-by: Claude Opus 4 <noreply@anthropic.com> * fix Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * trim decode tensors for moe Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * [atom-vllm recipe] align recipe to nightly script (#1040) Co-authored-by: perzhang <perzhang@amd.com> * [sgl-atom][docker]add optional sglang_tag_suffix (#1068) * add docker prefix Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> * Enable standalone DeepSeek NextN draft model (#964) Co-authored-by: zhuyuhua-v <yuhzhu@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Feat] enable dualstream in mtp (#1049) * [atom-vllm-benchmark] Change matrix cell launches one server for one ISL/OSL pair + all concurrency (#1075) --------- Co-authored-by: Jun Yan Yang * [atom-vllm benchmark] recover warmup to concurrency Co-authored-by: perzhang <perzhang@amd.com> * Update SGLANG accuracy runner (#1084) * [plugin][perf] refine pa dispatch for better perf (#1038) * add pa dispatch for GLM-4.7 and clean code * refine the dispatch * fix minimax acc * revert unnecessary change * clean code --------- Co-authored-by: Guanbao Yu <gyu@amd.com> * fix fused_moe (#1076) * fix non triton routing expert mask in moe Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * fold heads to 8 Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * black Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: zhuyuhua-v <yuhzhu@amd.com> Signed-off-by: whx-sjtu <xiaowang990929@gmail.com> Co-authored-by: Zhu Jiale <69138280+zovonoir@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: zovonoir <zovonoir@users.noreply.github.com> Co-authored-by: ZhangLirong <lirzhang@amd.com> Co-authored-by: ZhangLirong-amd <ZhangLirong@amd.com> Co-authored-by: Yutao Xu 
<xytpai@foxmail.com> Co-authored-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: junyyang-amd <junyyang@amd.com> Co-authored-by: root <root@hjbog-srdc-15.amd.com> Co-authored-by: JiaoliangYu <Jiaoliang.Yu@amd.com> Co-authored-by: JiaoliangYu <jiaolyu@amd.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com> Co-authored-by: qichu-yun <qichu@amd.com> Co-authored-by: ningding01 <niding@amd.com> Co-authored-by: PerryZhang01 <Perry.Zhang@amd.com> Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: jianhao <Jianhao.Liang@amd.com> Co-authored-by: jianlian <jianlian@amd.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com> Co-authored-by: Jiayun <jiayyu@amd.com> Co-authored-by: Wang, Yiting <18916612990@163.com> Co-authored-by: Zhiwei <yanzhw5@mail3.sysu.edu.cn> Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: Bongwoo Bak <bongwoobak@gmail.com> Co-authored-by: junna2016 <xingjunna.xjn@alibaba-inc.com> Co-authored-by: Zhu Yuhua <yuhzhu@amd.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Hexiang Wang <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: Sijing Yang <Sijing.Yang@amd.com> Co-authored-by: Ling Zhang <69022634+ZLkanyo009@users.noreply.github.com> Co-authored-by: gbyu-amd <Guanbao.Yu@amd.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: Wang, Yiting <yitiwang@amd.com>
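
The KV-events entries in the commit message above mention a bounded queue that drops the oldest batch when the subscriber is slow, and a `buffer_steps >= 1` check because `queue.Queue` treats a size of 0 as unbounded. The following is only a minimal sketch of that idea using the standard library. `BoundedEventBuffer`, `put`, `drain`, and `dropped` are hypothetical names chosen for illustration; this is not the actual `ZmqEventPublisher` code.

```python
import queue


class BoundedEventBuffer:
    """Hypothetical stand-in for a publisher send queue (not ATOM code)."""

    def __init__(self, buffer_steps: int) -> None:
        # queue.Queue(maxsize=0) means "unbounded", which would silently
        # disable backpressure, so reject anything below 1 up front.
        if buffer_steps < 1:
            raise ValueError("buffer_steps must be >= 1")
        self._queue: queue.Queue = queue.Queue(maxsize=buffer_steps)
        self.dropped = 0  # number of batches discarded on overflow

    def put(self, batch) -> None:
        """Enqueue a batch, discarding the oldest one if the queue is full."""
        while True:
            try:
                self._queue.put_nowait(batch)
                return
            except queue.Full:
                try:
                    self._queue.get_nowait()  # drop the oldest entry
                    self.dropped += 1
                except queue.Empty:
                    pass  # a consumer drained it first; retry the put

    def drain(self) -> list:
        """Return everything currently queued, oldest first."""
        items = []
        while True:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                return items
```

With `buffer_steps=2`, putting three batches keeps only the two newest and increments `dropped` once. Counting drops, rather than blocking the producer, matches the drop-on-overflow design the commit message describes.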

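The `matmul_ogs -> matmul` entry in the commit message above says the keyword name for the weight scale is detected via `dataclasses.fields`, so the old `weight_scale=` path keeps working while the new API takes `b_mx_scale=` plus `b_microblock_size=`. A minimal sketch of that detection follows. The two dataclasses are hypothetical stand-ins for the old and new `PrecisionConfig` shapes, and `build_precision_config` is an illustrative helper, not the function used in ATOM.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class OldPrecisionConfig:
    """Hypothetical stand-in for the pre-rename config shape."""
    weight_scale: Any = None


@dataclass
class NewPrecisionConfig:
    """Hypothetical stand-in for the post-rename config shape."""
    b_mx_scale: Any = None
    b_microblock_size: Optional[int] = None


def build_precision_config(config_cls, scale, microblock_size: int):
    """Construct config_cls using whichever keyword names it declares."""
    names = {f.name for f in fields(config_cls)}
    if "b_mx_scale" in names:
        # New-style API: the scale must come with an explicit block size.
        return config_cls(b_mx_scale=scale, b_microblock_size=microblock_size)
    # Old-style API: a single weight_scale keyword.
    return config_cls(weight_scale=scale)


# The block size 32 below is an illustrative value only.
old_cfg = build_precision_config(OldPrecisionConfig, "scale", 32)
new_cfg = build_precision_config(NewPrecisionConfig, "scale", 32)
```

Inspecting the dataclass fields keeps a single call site for both library versions instead of branching on a version string.
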
ZhiweiYan-96 requested review from ganyi1996ppo, wuhuikx, zejunchen-zejun and zhuyuhua-v May 21, 2026 02:46

ZhiweiYan-96 marked this pull request as ready for review May 21, 2026 09:18

Copilot AI review requested due to automatic review settings May 21, 2026 09:18

Copilot started reviewing on behalf of ZhiweiYan-96 May 21, 2026 09:19

Copilot AI reviewed May 21, 2026

ZhiweiYan-96 force-pushed the zhiwei/attn_refrac_integrated branch from 86024e8 to 8de3516 Compare May 21, 2026 09:26

Copilot AI review requested due to automatic review settings May 21, 2026 09:31

Copilot started reviewing on behalf of ZhiweiYan-96 May 21, 2026 09:31

Copilot AI reviewed May 21, 2026

Copilot AI review requested due to automatic review settings May 21, 2026 10:21

Copilot started reviewing on behalf of ZhiweiYan-96 May 21, 2026 10:21

Copilot AI reviewed May 21, 2026

Comment thread atom/plugin/sglang/models/deepseek_nextn_wrapper.py Outdated

ZhiweiYan-96 force-pushed the zhiwei/attn_refrac_integrated branch from a2b222a to 8c18045 Compare June 1, 2026 06:02

Copilot AI review requested due to automatic review settings June 1, 2026 06:33

Copilot started reviewing on behalf of ZhiweiYan-96 June 1, 2026 06:33

Copilot AI reviewed Jun 1, 2026

Comment thread atom/plugin/sglang/models/base_model_wrapper.py

zejunchen-zejun reviewed Jun 1, 2026

Copilot AI review requested due to automatic review settings June 2, 2026 06:24

ZhiweiYan-96 force-pushed the zhiwei/attn_refrac_integrated branch from 9a86586 to c51ad8d Compare June 2, 2026 06:24

Copilot started reviewing on behalf of ZhiweiYan-96 June 2, 2026 06:24

Copilot AI reviewed Jun 2, 2026

Copilot AI review requested due to automatic review settings June 2, 2026 07:43

ZhiweiYan-96 force-pushed the zhiwei/attn_refrac_integrated branch from 79cc2ea to 1043270 Compare June 2, 2026 07:43

Copilot started reviewing on behalf of ZhiweiYan-96 June 2, 2026 07:43

Copilot AI reviewed Jun 2, 2026

ZhiweiYan-96 added 2 commits June 2, 2026 09:14

add work log

1aada4b

[ATOM-SGL][Attn refrac] Separate model-specific MLA from SGL full attention backend

755cbf7

ZhiweiYan-96 and others added 15 commits June 2, 2026 09:15

remove work log

83c5f69

[ATOM SGL] runtime extraction

128b37c

[ATOM-SGL][Runtime] Introduce model adapter specs

853ee3f

Co-authored-by: Cursor <cursoragent@cursor.com>

[ATOM-SGL][Runtime] Keep custom wrappers out of generated entries

6ed65b9

Co-authored-by: Cursor <cursoragent@cursor.com>

[ATOM-SGL][Attn refrac] Split full attention backend helpers

6322f77

Co-authored-by: Cursor <cursoragent@cursor.com>

[ATOM-SGL][Attn refrac] Format refactored attention files

1ba3960

Co-authored-by: Cursor <cursoragent@cursor.com>

[ATOM-SGL][Attn refrac] Fix ruff findings in refactored attention code

0680a4b

Co-authored-by: Cursor <cursoragent@cursor.com>

[ATOM-SGL][Attn refrac] Avoid DeepSeek MLA wrapper module cycle

21d6c25

Co-authored-by: Cursor <cursoragent@cursor.com>

fix rebase issue

74829e9

precheckin

f8474a1

prepare for sglang only

bf1cd4f

import error meet in qwen3.5

961821e

qwen3.5 acc fix

94829c7

[Fix] Limit static FP4 linear kv_b_proj post-processing

ffe41ae

ZhiweiYan-96 force-pushed the zhiwei/attn_refrac_integrated branch from 3fd6144 to ffe41ae Compare June 2, 2026 14:23

Copilot AI review requested due to automatic review settings June 2, 2026 14:23

Copilot started reviewing on behalf of ZhiweiYan-96 June 2, 2026 14:23

Copilot AI reviewed Jun 2, 2026

valarLip approved these changes Jun 2, 2026

zhuyuhua-v merged commit ffb1ae3 into ROCm:main Jun 3, 2026
27 of 31 checks passed

zhuyuhua-v merged 17 commits into ROCm:main from zejunchen-zejun:zhiwei/attn_refrac_integrated

6 participants
