Fix qwen35 dp #4535

Merged
lvhan028 merged 3 commits into InternLM:main from grimoire:fix-qwen35-tp
Apr 23, 2026
Conversation

@grimoire
Collaborator

dp/cudagraph may pad state_ids with -1, which would then be clamped to 0 in the model.

State 0 is reserved for dummy inputs; multiple dummy inputs may write to the same state, leading to invalid outputs (nan/inf).
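The padding collision can be sketched as follows (a simplified pure-Python stand-in, not lmdeploy code):

```python
# CUDA-graph capture pads the batch to a fixed size; padded slots get state_id -1.
state_ids = [3, 7, -1, -1]          # two real requests + two padding slots

# Before this fix: ids were clamped to 0, so every padding slot aliased state 0,
# and the reserved dummy state received multiple conflicting writes.
clamped = [max(i, 0) for i in state_ids]
assert clamped == [3, 7, 0, 0]

# After this fix: negative ids are preserved and the kernel skips them entirely.
valid = [i for i in state_ids if i >= 0]
assert valid == [3, 7]
```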

@grimoire grimoire requested a review from yao-fengchen April 18, 2026 08:58
Contributor

Copilot AI left a comment


Pull request overview

This PR addresses NaN/Inf issues seen with Qwen3.5 under DP + CUDA graph execution by preventing padded/invalid state_ids from colliding with the reserved dummy state, and by hardening a few related kernels/buffer initializations.

Changes:

  • Preserve negative state_ids for gated-delta decoding and update the CUDA kernel to ignore invalid (<0) states.
  • Adjust CUDA graph buffer initialization to fill both Q/KV seqlens padding consistently to avoid flash-attn OOB.
  • Introduce a MoE all-reduce wrapper and route Qwen3.5 MoE output reduction through it; add a small div-by-zero guard in split-K reduce.
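The div-by-zero guard mentioned in the last bullet can be illustrated with a minimal sketch (a pure-Python simplification of the combine step, not the Triton kernel in pagedattention.py):

```python
def reduce_splits(partial_outs, partial_sumexp, eps=1e-10):
    """Combine split-K partial attention outputs.

    partial_outs: per-split normalized outputs; partial_sumexp: per-split
    softmax denominators. A fully padded split contributes a sum-exp of 0,
    so without eps the final division could be 0/0 -> nan.
    """
    total = sum(partial_sumexp)
    weighted = sum(o * s for o, s in zip(partial_outs, partial_sumexp))
    return weighted / (total + eps)
```

With the epsilon, an all-padding batch yields 0.0 instead of nan/inf, matching the intent of the kernel change.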

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
lmdeploy/pytorch/nn/moe/base.py Adds build_moe_all_reduce() and MoEAllReduce helper for post-MoE reduction/shared-TP enablement.
lmdeploy/pytorch/nn/gated_delta.py Keeps original (possibly negative) state_ids and uses them for decoding-state indices.
lmdeploy/pytorch/models/utils/cudagraph.py Initializes both q/kv seqlens padding via qkv_seqlens to prevent kernel OOB.
lmdeploy/pytorch/models/qwen3_5_moe.py Uses new MoE all-reduce helper and updates shared expert TP selection.
lmdeploy/pytorch/kernels/cuda/pagedattention.py Adds epsilon to avoid div-by-zero in _reduce_split_kernel.
lmdeploy/pytorch/kernels/cuda/gated_delta_rule.py Skips loading initial state when state_id < 0.
Comments suppressed due to low confidence (1)

lmdeploy/pytorch/models/qwen3_5_moe.py:106

  • After switching to self.moe_all_reduce(...), the old DP/TP gating block that sets self._all_reduce (and the dp/world_size locals) is now unused, and the new reduction path is unconditional. Once MoEAllReduce is fixed to use the right group/enable conditions, please either delete the dead _all_reduce logic or wire its condition into build_moe_all_reduce() so behavior matches the previous (dp==1 and moe_tp>1) guard.
        self.moe_all_reduce = self.experts.build_moe_all_reduce()

        self.shared_expert = Qwen3_5MLP(
            config=config,
            intermediate_size=config.shared_expert_intermediate_size,
            dtype=dtype,
            device=device,
            is_tp=self.moe_all_reduce.enable_shared_tp(),
            all_reduce=False,
            prefix=add_prefix('shared_expert', prefix),
        )
        self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False, device=device, dtype=dtype)

        # get all reduce
        dist_ctx = get_dist_manager().current_context()
        dp = dist_ctx.dist_config.dp
        world_size = dist_ctx.dist_config.moe_tp
        if dp == 1 and world_size > 1:
            self._all_reduce = True
        else:
            self._all_reduce = False
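One way to fold the old `dp == 1 and moe_tp > 1` guard into the helper, per the review comment (a hypothetical sketch — the real `build_moe_all_reduce()` in lmdeploy/pytorch/nn/moe/base.py has a different signature; the reduce call is injected here so the sketch stays self-contained):

```python
class MoEAllReduce:
    """Hypothetical helper wrapping the post-MoE all-reduce decision."""

    def __init__(self, enabled, all_reduce_fn=None):
        self.enabled = enabled
        # e.g. dist.all_reduce bound to the MoE TP process group
        self._all_reduce_fn = all_reduce_fn

    def enable_shared_tp(self):
        # Shared-expert TP follows the same condition as the MoE reduction.
        return self.enabled

    def forward(self, x):
        if self.enabled and self._all_reduce_fn is not None:
            self._all_reduce_fn(x)
        return x


def build_moe_all_reduce(dp, moe_tp, all_reduce_fn=None):
    # Mirror the previous guard: reduce only when dp == 1 and moe_tp > 1.
    return MoEAllReduce(enabled=(dp == 1 and moe_tp > 1),
                        all_reduce_fn=all_reduce_fn)
```

This keeps the reduction conditional, so the dead `_all_reduce` block above could then be deleted without changing behavior.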



Comment on lines +351 to +355
def forward(self, x: torch.Tensor):
    """forward."""
    if self._all_reduce:
        dist.all_reduce(x)
    return x

Copilot AI Apr 22, 2026


MoEAllReduce.forward() calls dist.all_reduce(x) without specifying a group. In lmdeploy.pytorch.distributed.all_reduce, the default group is 'tp' which maps to the attention TP group, not the MoE TP group, so this can reduce across the wrong ranks when attn_tp != moe_tp. Please pass the correct MoE TP process group (e.g., from DistContext.moe_tp_group.gpu_group) into MoEAllReduce and use it in the all-reduce call.
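The group mismatch can be sketched with stubs (all class and attribute names except `moe_tp_group.gpu_group`, which the comment itself cites, are illustrative stand-ins, not lmdeploy types):

```python
class _Group:
    """Stub standing in for a torch.distributed process group."""
    def __init__(self, ranks):
        self.ranks = ranks


class _TPGroupStub:
    def __init__(self, ranks):
        self.gpu_group = _Group(ranks)


class DistContextStub:
    """Stub DistContext holding separate attention-TP and MoE-TP groups."""
    def __init__(self, attn_tp_ranks, moe_tp_ranks):
        self.attn_tp_group = _TPGroupStub(attn_tp_ranks)  # the default 'tp' target
        self.moe_tp_group = _TPGroupStub(moe_tp_ranks)


def moe_reduce_group(ctx):
    # The suggested fix: reduce over the MoE TP ranks,
    # not the default attention TP group.
    return ctx.moe_tp_group.gpu_group
```

When attn_tp != moe_tp the two groups span different rank sets, so defaulting to the attention TP group reduces across the wrong ranks.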

Comment thread: lmdeploy/pytorch/nn/moe/base.py (outdated)
Comment thread: lmdeploy/pytorch/kernels/cuda/pagedattention.py
@lvhan028 lvhan028 merged commit 964c878 into InternLM:main Apr 23, 2026
5 of 6 checks passed
