[ATOM-SGLang][Feat] Enable Deepseek v3 MTP by ZhiweiYan-96 · Pull Request #643 · ROCm/ATOM

ZhiweiYan-96 · 2026-04-24T09:45:12Z

Proposed Design

1. MTP module creation: Override the draft architecture through the external model package

As background, it helps to first describe how SGLang loads the DeepSeek MTP module in its native flow. From SGLang's point of view, DeepSeek MTP is not an auxiliary block hidden inside the target model. It is a standalone draft architecture. The loading path is roughly:

The user enables speculative decoding through server arguments such as --speculative-algorithm NEXTN
SGLang normalizes NEXTN into the EAGLE runtime family in server args
The draft ModelConfig rewrites the DeepSeek V3 draft architecture to DeepseekV3ForCausalLMNextN inside _config_draft_model()
ModelRegistry resolves the model class by that architecture name
The resolved class is instantiated as the draft model and then used by the speculative worker in the propose / verify / extend lifecycle

In other words, SGLang first interprets "DeepSeek MTP" as "a separately loaded draft model", and only then enters the runtime phase. The external model package hook works exactly at this architecture-resolution stage.

On the MTP side, SGLang uses DeepseekV3ForCausalLMNextN as the model architecture.

The following diagram shows the native SGLang view of how the MTP module is loaded:

flowchart TD
    subgraph SGL["SGLang domain"]
        A["CLI / server args<br/>--speculative-algorithm NEXTN"]
        B["Normalize algorithm<br/>NEXTN -> EAGLE"]
        C["Build draft ModelConfig"]
        D["_config_draft_model()<br/>rewrite architecture to<br/>DeepseekV3ForCausalLMNextN"]
        E["ModelRegistry.resolve_model_cls(...)"]
        F["Instantiate draft model class"]
        G["Speculative worker uses draft model<br/>propose / verify / extend"]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
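The loading path above can be sketched as a toy model. This is a minimal illustration under stated assumptions, not SGLang's code: `normalize_algorithm`, `config_draft_model`, `DraftConfig`, `REGISTRY`, and `UpstreamNextN` are hypothetical stand-ins, and the target architecture name `DeepseekV3ForCausalLM` is assumed rather than taken from this PR.

```python
from dataclasses import dataclass


class UpstreamNextN:
    """Toy stand-in for the upstream draft implementation."""


# Toy registry keyed by architecture name (hypothetical, not SGLang's ModelRegistry).
REGISTRY = {"DeepseekV3ForCausalLMNextN": UpstreamNextN}


def normalize_algorithm(name: str) -> str:
    # NEXTN is folded into the EAGLE runtime family.
    return "EAGLE" if name == "NEXTN" else name


@dataclass
class DraftConfig:
    architecture: str


def config_draft_model(target_architecture: str) -> DraftConfig:
    # The draft config rewrites the target architecture name to the NextN one.
    # "DeepseekV3ForCausalLM" is an assumed target name for illustration.
    if target_architecture == "DeepseekV3ForCausalLM":
        return DraftConfig("DeepseekV3ForCausalLMNextN")
    return DraftConfig(target_architecture)


def resolve_model_cls(config: DraftConfig) -> type:
    # Resolution is purely by architecture name.
    return REGISTRY[config.architecture]


algorithm = normalize_algorithm("NEXTN")
draft_cls = resolve_model_cls(config_draft_model("DeepseekV3ForCausalLM"))
draft_model = draft_cls()  # instantiated as a separately loaded draft model
```

The point of the sketch is that the draft model is selected by name alone, which is what makes a same-name override possible.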

SGLang allows external model packages to register architectures through SGLANG_EXTERNAL_MODEL_PACKAGE and a module-level EntryClass. This is also the core mechanism of the ATOM SGLang plugin. The plugin uses this mechanism to expose a class with the exact architecture name expected by SGLang, DeepseekV3ForCausalLMNextN.

Once this class is available in the plugin package, SGLang resolves the draft architecture to the plugin implementation instead of the upstream one in sglang/srt/models/deepseek_nextn.py.

The following diagram illustrates what "overriding the draft architecture" means in practice:

flowchart TD
    subgraph SGL["SGLang domain"]
        A["launch server<br/>SGLANG_EXTERNAL_MODEL_PACKAGE=atom.plugin.sglang.models"]
        B["Import external model package"]
        C["Read module EntryClass"]
        D["Register architecture:<br/>DeepseekV3ForCausalLMNextN"]
        E["ModelRegistry.resolve_model_cls(...)"]
        H["upstream draft implementation<br/>sglang/srt/models/deepseek_nextn.py"]
    end

    subgraph PLUGIN["ATOM SGLang Plugin domain"]
        F["plugin wrapper<br/>DeepseekV3ForCausalLMNextN"]
    end

    subgraph CORE["ATOM Core domain"]
        G["DeepSeekMTP"]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G

    H -. "same architecture name is overridden" .-> D

The important point is that architecture resolution and ModelRegistry selection still happen inside the SGLang domain. The ATOM SGLang Plugin domain only contributes a same-name wrapper through the external package entry point, while the actual draft computation is delegated to DeepSeekMTP in the ATOM Core domain. This makes it easier to separate who owns scheduling and model resolution from who owns the draft implementation.
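The same-name override can be illustrated with a toy registry. This is a sketch under assumptions: `build_registry` is a hypothetical helper, and the precedence rule (external entries replace built-in entries with the same name) is assumed from the behavior described above rather than copied from SGLang.

```python
import types

ARCH = "DeepseekV3ForCausalLMNextN"

# Two distinct classes that share one architecture name.
upstream_cls = type(ARCH, (), {"origin": "upstream"})
plugin_cls = type(ARCH, (), {"origin": "plugin"})


def build_registry(builtin: dict, external_module: types.ModuleType) -> dict:
    """Toy registry builder: external EntryClass entries win on name clashes."""
    registry = dict(builtin)
    entry = getattr(external_module, "EntryClass", None)
    if entry is not None:
        entries = entry if isinstance(entry, (list, tuple)) else [entry]
        for cls in entries:
            registry[cls.__name__] = cls
    return registry


# A stand-in for the external model package module.
external = types.ModuleType("toy_external_models")
external.EntryClass = plugin_cls

registry = build_registry({ARCH: upstream_cls}, external)
resolved = registry[ARCH]
```

Here `resolved.origin` is `"plugin"`: the lookup key is unchanged, only the class behind it differs.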

2. MTP Wrapper in ATOM: A thin wrapper as the compatibility bridge

The plugin adds a lightweight wrapper named DeepseekV3ForCausalLMNextN. Externally, it matches the draft-model interface expected by SGLang. Internally, it delegates the actual draft computation to ATOM DeepSeekMTP.

The wrapper is responsible for:

creating the plugin-mode atom_config
rewriting configuration semantics so they match DeepSeekMTP
instantiating ATOM/atom/models/deepseek_mtp.py::DeepSeekMTP
exposing get_embed_and_head(), set_embed_and_head(), and set_embed() so speculative workers can share embeddings and LM head weights with the target model
consuming forward_batch.spec_info.hidden_states in forward()
loading weights through the spec_decode=True path

The design principle is:

keep the SGLang architecture name and draft-worker contract unchanged at the top layer
reuse ATOM DeepSeekMTP as the implementation at the lower layer

This minimizes duplication, avoids recreating the upstream NextN hierarchy inside the plugin, and makes future improvements to ATOM's native MTP implementation reusable in plugin mode.
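As a rough illustration of the thin-wrapper idea, the sketch below uses plain Python objects instead of tensors. `ToyCoreMTP` and `ToyDraftWrapper` are hypothetical stand-ins, not the plugin's classes; only the method names `get_embed_and_head()` and `set_embed_and_head()` come from the description above.

```python
class ToyCoreMTP:
    """Stand-in for the core draft implementation (not ATOM's DeepSeekMTP)."""

    def __init__(self):
        self.embed = "draft-embed"
        self.head = "draft-head"

    def forward(self, input_ids, hidden_states):
        # Pair each token with the hidden state handed over by the target model.
        return list(zip(input_ids, hidden_states))


class ToyDraftWrapper:
    """Keeps the outer contract stable and delegates the work to the core model."""

    def __init__(self):
        self.model = ToyCoreMTP()

    def get_embed_and_head(self):
        return self.model.embed, self.model.head

    def set_embed_and_head(self, embed, head):
        # Share the target model's embedding and LM head with the draft model.
        self.model.embed = embed
        self.model.head = head

    def forward(self, input_ids, spec_hidden_states):
        # In the real wrapper the hidden states would come from spec_info.
        return self.model.forward(input_ids, spec_hidden_states)


wrapper = ToyDraftWrapper()
wrapper.set_embed_and_head("target-embed", "target-head")
shared = wrapper.get_embed_and_head()
out = wrapper.forward([1, 2], ["h1", "h2"])
```

The wrapper holds no draft logic of its own, which is why improvements to the core implementation carry over without changes at this layer.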

Risks

Intrusive changes to runtime-variable control code

ATOM core code currently relies on some process-global runtime/config state. In speculative mode, target and draft wrappers coexist. Without isolation, initializing or running the draft wrapper may overwrite global state used by the target path, leading to subtle cross-contamination in MoE or attention behavior.

To address this, the plugin introduces a runtime scope that explicitly binds and restores the proper runtime context around wrapper __init__, forward(), and load_weights(). This allows target and draft instances to coexist safely.

However, this also makes an architectural issue visible: the current plugin system still has meaningful complexity around process-global state management. In order to let multiple wrappers coexist, the plugin must repeatedly switch and restore global runtime state at execution boundaries. In that sense, runtime scoping should be understood as a containment mechanism for the current global-state model, not as the ideal long-term abstraction. It solves the correctness problem for this branch, but it also suggests a future direction toward fewer implicit globals and more explicitly instantiated runtime state.

Direct attn backend replacement `sglang_aiter_backend.AiterAttnBackend = ATOMAttnBackendForSgl`

The reason this direct replacement is needed can be summarized by the key SGLang call chain:

# eagle_worker.py
self.draft_attn_backend = DraftBackendFactory(...).create_decode_backend()

# draft_utils.py
def _create_aiter_decode_backend(self):
    return AiterMultiStepDraftBackend(...)

# aiter_backend.py
for i in range(self.speculative_num_steps - 1):
    self.attn_backends.append(AiterAttnBackend(model_runner, ...))

In other words, EAGLE draft multi-step decode actually goes through:

flowchart LR
    A["EAGLEWorker"] --> B["DraftBackendFactory"]
    B --> C["AiterMultiStepDraftBackend"]
    C --> D["AiterAttnBackend(...)"]
    R["ATOM-sglang attention registry"] -. "not used on this path" .-> D

So if the plugin only overrides the "aiter" registry entry, but does not also rewrite:

sglang.srt.layers.attention.aiter_backend.AiterAttnBackend

then EAGLE draft decode still directly constructs the upstream AiterAttnBackend. That is why this monkeypatch is hacky, but still practically necessary on the current branch.

The plugin is mutating an upstream module symbol directly. This is not a clean extension point.
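The difference between overriding a registry entry and rebinding a module symbol can be shown with a toy module. Everything here is a hypothetical stand-in (`backend_module`, `registry`, `build_multi_step_backends`); it only mirrors the shape of the call chain above, where the multi-step backend constructs the class by name instead of going through the registry.

```python
import types

backend_module = types.ModuleType("toy_aiter_backend")


class UpstreamBackend:
    origin = "upstream"


class PluginBackend:
    origin = "plugin"


backend_module.AiterAttnBackend = UpstreamBackend
registry = {"aiter": UpstreamBackend}


def build_multi_step_backends(num_steps: int) -> list:
    # Looks the class up on the module at call time and never consults the registry.
    return [backend_module.AiterAttnBackend() for _ in range(num_steps - 1)]


# Overriding only the registry entry does not affect the direct construction path.
registry["aiter"] = PluginBackend
before = [b.origin for b in build_multi_step_backends(3)]

# Rebinding the module symbol does.
backend_module.AiterAttnBackend = PluginBackend
after = [b.origin for b in build_multi_step_backends(3)]
```

In this toy, `before` still contains upstream instances and `after` contains plugin instances, which matches the reasoning for why the monkeypatch is needed on this branch.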

Other changes

Complete the radix attention forward path for speculative modes, such as:

TARGET_VERIFY
DRAFT_EXTEND

Accuracy

Acceptance ratio

Copilot

Pull request overview

Enables DeepSeek v3 MTP (NextN/EAGLE draft model) support in the ATOM SGLang plugin by introducing a draft-architecture wrapper and extending the attention backend to handle speculative modes while isolating ATOM’s process-global runtime state between target and draft models.

Changes:

Add a SGLang external-model DeepseekV3ForCausalLMNextN wrapper that delegates draft computation to ATOM DeepSeekMTP.
Introduce plugin_runtime_scope() and apply it around init/forward/load to scope global runtime/config state per wrapper instance.
Extend the SGLang attention backend to support speculative TARGET_VERIFY / DRAFT_EXTEND (including CUDA graph capture/replay paths) and monkeypatch the upstream AiterAttnBackend symbol for draft paths.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


File	Description
atom/plugin/sglang/models/deepseek_nextn_wrapper.py	New SGLang draft-model wrapper class (`DeepseekV3ForCausalLMNextN`) backed by ATOM `DeepSeekMTP`, including runtime scoping and layer-id retagging.
atom/plugin/sglang/models/base_model_wrapper.py	Add `plugin_runtime_scope()` and use it to scope ATOM globals; add embed/head sharing helpers; scope forward + weight loading.
atom/plugin/sglang/attention_backend/sgl_attn_backend.py	Add speculative-mode metadata init for MLA and extend CUDA-graph capture/replay + speculative forward path handling.
atom/plugin/sglang/attention_backend/radix_attention.py	Ensure `k_scale`/`v_scale` parameters are CUDA-resident for SGLang RadixAttention usage.
atom/plugin/register.py	Monkeypatch SGLang’s `AiterAttnBackend` symbol to route direct draft-path construction to `ATOMAttnBackendForSgl`.


Copilot · 2026-04-24T09:53:48Z

+        del embed_owner.embed_tokens.weight
+        del self.model.lm_head.weight
+        embed_owner.embed_tokens.weight = embed
+        self.model.lm_head.weight = head
+        torch.cuda.empty_cache()
+        torch.cuda.synchronize()
+


set_embed_and_head() calls torch.cuda.empty_cache() / torch.cuda.synchronize() unconditionally. In CPU-only environments (or when CUDA isn’t initialized), this will raise. Consider guarding with torch.cuda.is_available() (as done in deepseek_nextn_wrapper._sync_replaced_weights) or syncing based on embed.device.type == "cuda".

Copilot · 2026-04-24T09:53:49Z

+        del embed_owner.embed_tokens.weight
+        embed_owner.embed_tokens.weight = embed
+        torch.cuda.empty_cache()
+        torch.cuda.synchronize()


set_embed() calls torch.cuda.empty_cache() / torch.cuda.synchronize() unconditionally, which will error on CPU-only runs. Please guard these calls (e.g., torch.cuda.is_available() or embed.is_cuda) to keep the wrapper usable in non-CUDA test/mocking environments.
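One possible shape for the guard suggested in these two comments is a small shared helper. This is a sketch, not the fix that was merged: `release_and_sync_if_cuda` is a hypothetical name, and the optional import exists only so the snippet stays importable where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable where PyTorch is absent
    torch = None


def release_and_sync_if_cuda(device_type: str) -> bool:
    """Free cached blocks and synchronize only when the tensor lives on CUDA.

    `device_type` is expected to be something like `embed.device.type`.
    Returns True if the CUDA calls were made, False if they were skipped.
    """
    if device_type != "cuda" or torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    return True


# On a CPU tensor the helper is a no-op.
skipped = release_and_sync_if_cuda("cpu")
```

Both `set_embed_and_head()` and `set_embed()` could call such a helper after swapping the weights, passing the device type of the new embedding.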

wanzhenchn · 2026-05-08T07:46:46Z

-            if self.attn.v_scale_float is None:
-                self.attn.v_scale_float = 1.0
+            elif not self.attn.v_scale.is_cuda:
+                self.attn.v_scale = torch.nn.Parameter(


modified, thx

wanzhenchn · 2026-05-08T09:07:29Z

+                _set_runtime_layer_id(nested_attn, local_layer_id)
+
+
+class DeepseekV3ForCausalLMNextN(nn.Module):


Perhaps you could implement the custom DeepseekV3ForCausalLMNextN by inheriting from the _AtomCausalLMBaseForSglang base class and adapting prepare_model() in plugin/prepare.py. This would avoid a lot of redundant initialization code (such as self.xx = xx, register_ops_to_sglang(), set_attn_cls(), init_aiter_dist(), and setting up the engine type with plugin_runtime_scope()).

Thanks for the suggestion. Currently, most of the complexity lies in the forward and load methods; init, by contrast, is not a big burden. This module is used only for MTP, whereas _AtomCausalLMBaseForSglang is assumed to be the target model. I prefer not to mix the responsibilities of these two classes for now.

zhuyuhua-v · 2026-05-15T08:40:33Z

+        input_embeds: torch.Tensor = None,
+        **kwargs,
+    ):
+        del kwargs


why are we del kwargs?

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

atom/plugin/sglang/models/base_model_wrapper.py:437

set_embed() also unconditionally calls torch.cuda.empty_cache() / torch.cuda.synchronize(), which can raise when CUDA is not available. Please guard these calls with torch.cuda.is_available() (or a shared helper) to keep the wrapper usable in CPU-only/test environments.

        del embed_owner.embed_tokens.weight
        embed_owner.embed_tokens.weight = embed
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

ZhiweiYan-96 · 2026-05-18T07:54:00Z

Hi @valarLip, could you please take a look at this PR, which adds MTP support on the SGLang side? All suggestions are appreciated.

whx-sjtu · 2026-05-18T08:49:26Z

+    def _init_draft_extend_mla(self, bs, forward_batch):
+        """Init MLA metadata for speculative draft_extend."""


It looks like we re-implement the whole metadata build methods. Why can't we reuse sglang's original metadata?

Yes, we could reuse it, but the SGLang side is expected to adopt the same kernels as ATOM once they mature, so fields are likely to be added or removed (this has already happened in the non-speculative case). Not reusing the metadata here is therefore intentional.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings April 24, 2026 09:45

ZhiweiYan-96 marked this pull request as draft April 24, 2026 09:45

Copilot started reviewing on behalf of ZhiweiYan-96 April 24, 2026 09:46 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

ZhiweiYan-96 force-pushed the zhiwei/ds_mtp branch from e07ea02 to c34a46f Compare May 8, 2026 05:34

ZhiweiYan-96 requested review from Yuechguo, ZLkanyo009, qichu-yun, wanzhenchn, wuhuikx and zhuyuhua-v May 8, 2026 06:15

wanzhenchn reviewed May 8, 2026

View reviewed changes

zhuyuhua-v reviewed May 15, 2026

View reviewed changes

qichu-yun previously approved these changes May 18, 2026

View reviewed changes

ZhiweiYan-96 dismissed qichu-yun’s stale review via 5f02173 May 18, 2026 06:36

ZhiweiYan-96 force-pushed the zhiwei/ds_mtp branch from 5f02173 to 5b8ebab Compare May 18, 2026 07:36

ZhiweiYan-96 changed the title ~~[Draft][ATOM-SGLang][Feat] Enable Deepseek v3 MTP~~ [ATOM-SGLang][Feat] Enable Deepseek v3 MTP May 18, 2026

ZhiweiYan-96 marked this pull request as ready for review May 18, 2026 07:37

Copilot AI review requested due to automatic review settings May 18, 2026 07:37

Copilot started reviewing on behalf of ZhiweiYan-96 May 18, 2026 07:38 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread atom/plugin/sglang/models/base_model_wrapper.py

Comment thread atom/plugin/sglang/attention_backend/sgl_attn_backend.py

Comment thread atom/plugin/sglang/models/deepseek_nextn_wrapper.py

ZhiweiYan-96 requested a review from valarLip May 18, 2026 07:52

whx-sjtu reviewed May 18, 2026

View reviewed changes

valarLip previously approved these changes May 18, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 18, 2026 09:54

ZhiweiYan-96 dismissed valarLip’s stale review via 4d44ed5 May 18, 2026 09:54

Copilot started reviewing on behalf of ZhiweiYan-96 May 18, 2026 09:55 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread atom/plugin/sglang/models/base_model_wrapper.py

Comment thread atom/plugin/sglang/attention_backend/sgl_attn_backend.py

Copilot AI review requested due to automatic review settings May 18, 2026 13:49

ZhiweiYan-96 force-pushed the zhiwei/ds_mtp branch from c1eb4b3 to 7c0969b Compare May 18, 2026 13:49

Copilot started reviewing on behalf of ZhiweiYan-96 May 18, 2026 13:50 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread atom/plugin/sglang/models/deepseek_nextn_wrapper.py

ZhiweiYan-96 and others added 11 commits May 18, 2026 15:11

MTP(num_step=1) for DeeepSeek

01311ee

Add work log for claude debug

04e8ece

adopt new attn constructor args

8345a3d

rm worklog

ecc4e8e

use atom_parameter

9c59168

kwargs handle

f89d34d

rebase main

379f38a

precheckin

e64db7d

fix k_scale v_scale error

e25f7fa

new commit

f7af634

fix blank

08611de

Copilot AI review requested due to automatic review settings May 18, 2026 15:14

ZhiweiYan-96 force-pushed the zhiwei/ds_mtp branch from d2da383 to 08611de Compare May 18, 2026 15:14

Copilot started reviewing on behalf of ZhiweiYan-96 May 18, 2026 15:15 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread atom/plugin/sglang/models/deepseek_nextn_wrapper.py

Comment thread atom/plugin/sglang/models/deepseek_nextn_wrapper.py

Comment thread atom/plugin/sglang/attention_backend/sgl_attn_backend.py

fix qwen3.5 acc

bf09cfb

valarLip approved these changes May 19, 2026

View reviewed changes

zhuyuhua-v merged commit 28b455a into ROCm:main May 19, 2026
59 of 71 checks passed

