perf: simplify Qwen3.5-MoE state_dict_adapter + DTensor passthrough #1589

Merged

HuiyingLi merged 1 commit into main from huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter on Mar 23, 2026
Conversation

@HuiyingLi (Contributor)

Summary

Applies the same optimization as #1570 (Qwen3-VL-MoE) to the Qwen3.5-MoE state_dict_adapter, which had the same unnecessary complexity.

Changes

  • to_hf: Removed all_gather_object, full_tensor(), per-expert CPU copies, .to(dtype) casts, and the device_mesh branch (~85 lines removed). Now just renames keys + transpose(1, 2). No comms needed — DCP handles distributed save.
  • from_hf: Added DTensor vs plain tensor path distinction:
    • DCP path (DTensor): rename + transpose — no EP slicing, no create_dtensor_from_local.
    • Init path (plain tensor): slice + transpose + create DTensor (unchanged behavior).
  • convert_single_tensor_to_hf: Removed to_local() and .to(self.dtype) calls. Simple key rename + transpose; DTensors pass through.
  • Removed import torch.distributed as dist (no longer needed).
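The simplified to_hf path described above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the actual adapter: the key patterns and helper names are hypothetical, and the real code operates on torch tensors (using `tensor.transpose(1, 2)` and letting DTensors pass through) rather than this pure-Python list stand-in.

```python
import re

# Hypothetical key pattern; the real Qwen3.5-MoE adapter's regexes differ.
_EXPERT_KEY = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.experts\.(gate_up_proj|down_proj)$"
)

def transpose_1_2(t):
    """Swap dims 1 and 2 of a 3D nested list, mimicking tensor.transpose(1, 2)
    on an [num_experts, d0, d1] expert weight."""
    return [[[t[e][i][j] for i in range(len(t[e]))]
             for j in range(len(t[e][0]))]
            for e in range(len(t))]

def to_hf(native_state):
    """Simplified to_hf: rename keys + transpose(1, 2), nothing else.
    No collectives, no full_tensor(), no per-expert CPU copies --
    DCP handles the distributed save, so sharded values pass through."""
    hf_state = {}
    for key, value in native_state.items():
        m = _EXPERT_KEY.match(key)
        if m:
            layer, proj = m.groups()
            # Hypothetical HF-side key form.
            hf_key = f"model.layers.{layer}.mlp.experts.{proj}.weight"
            hf_state[hf_key] = transpose_1_2(value)
        else:
            hf_state[key] = value
    return hf_state
```

For example, a single-expert `[1, 3, 2]` value comes out as `[1, 2, 3]` under the renamed key, with non-expert keys untouched.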

Benchmark (Qwen3.5-35B-A3B real weights, 1811 keys, 848 expert keys, single node)

| Operation                | Before | After  | Speedup |
|--------------------------|--------|--------|---------|
| from_hf (plain tensor)   | 1.9 ms | 1.7 ms | 1.1x    |
| to_hf (no device_mesh)   | 1.8 ms | 0.9 ms | 2.0x    |
| to_hf (with device_mesh) | hangs  |        |         |

The main gain is eliminating the to_hf hang with device_mesh (same issue as #1570) and removing unnecessary all-gather comms on the DCP save path.
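The from_hf path split (DCP resume vs. fresh init) boils down to a single isinstance dispatch. A minimal sketch of that branching logic, using a stub class standing in for `torch.distributed.tensor.DTensor` so it runs without a process group; the helper callbacks are hypothetical:

```python
class DTensorStub:
    """Stand-in for torch.distributed.tensor.DTensor; the real adapter
    checks the real class."""
    def __init__(self, local):
        self.local = local

def from_hf_value(value, slice_for_ep, make_dtensor):
    """Sketch of the from_hf dispatch (names hypothetical):
    - DCP path: value is already a DTensor, so it passes through untouched
      (only the key rename + transpose apply elsewhere) -- no EP slicing,
      no create_dtensor_from_local.
    - Init path: value is a plain tensor, so slice it for this EP rank
      and wrap the shard into a DTensor (unchanged behavior)."""
    if isinstance(value, DTensorStub):
        return value                      # DCP path: pass through
    sliced = slice_for_ep(value)          # init path: slice for this rank
    return make_dtensor(sliced)           # ...then wrap into a DTensor
```

With no slicing or wrapping on the DCP path, there is also nothing left to all-gather, which is what removes both the comms and the hang.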

E2E validation (8x H100, Qwen3.5-35B-A3B real weights)

  • Run 1: Trained 30 steps → checkpoint saved at step 29
  • Run 2: Resumed from checkpoint → trained to step 59 → checkpoint saved
  • Loss stable, no errors

Test plan

  • 26 unit tests pass (3 old all_gather tests removed, 2 new DTensor/transpose tests added)
  • Round-trip correctness verified (HF → native → HF, all expert keys match)
  • E2E train + resume on 8x H100 with real weights
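The round-trip item in the test plan amounts to asserting that the HF → native → HF key mapping is the identity. A self-contained sketch with hypothetical key forms (the real adapter's mappings are more involved and also transpose values):

```python
def native_to_hf_key(key):
    # Hypothetical rule: expert keys gain a ".weight" suffix on the HF side.
    if ".mlp.experts." in key and not key.endswith(".weight"):
        return key + ".weight"
    return key

def hf_to_native_key(key):
    # Inverse rule: strip the ".weight" suffix from expert keys.
    if ".mlp.experts." in key and key.endswith(".weight"):
        return key[: -len(".weight")]
    return key

keys = [
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.mlp.experts.down_proj",
    "model.embed_tokens.weight",
]
# Round trip must be the identity on every key, expert or not.
assert [hf_to_native_key(native_to_hf_key(k)) for k in keys] == keys
```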

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Mar 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi force-pushed the huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter branch from ebf61f8 to 45e0e40 on March 21, 2026 01:44
@HuiyingLi (Contributor, Author)

/ok to test 45e0e40

Same fix as PR #1570 for Qwen3-VL-MoE, applied to Qwen3.5-MoE:

- to_hf: removed all_gather_object, full_tensor(), per-expert CPU copies,
  .to(dtype) casts. Now just renames keys + transpose(1,2). No comms.
- from_hf: DTensor passthrough for DCP path (rename + transpose).
  Plain tensor init path unchanged (slice + transpose + create DTensor).
- convert_single_tensor_to_hf: removed to_local() and .to(dtype).
  Just rename + transpose, DTensors pass through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi force-pushed the huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter branch from 45e0e40 to 389899b on March 21, 2026 01:46
@HuiyingLi (Contributor, Author)

/ok to test 389899b

@hemildesai (Contributor)

/claude review

@HuiyingLi HuiyingLi merged commit b858b94 into main Mar 23, 2026
52 checks passed
@HuiyingLi HuiyingLi deleted the huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter branch March 23, 2026 19:20
torsli pushed a commit that referenced this pull request Mar 24, 2026: perf: simplify Qwen3.5-MoE state_dict_adapter to rename+transpose only (#1589)
linnanwang pushed a commit that referenced this pull request Apr 24, 2026: perf: simplify Qwen3.5-MoE state_dict_adapter to rename+transpose only (#1589)
2 participants