perf: simplify Qwen3.5-MoE state_dict_adapter + DTensor passthrough #1589

Merged

HuiyingLi merged 1 commit into main from huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter on Mar 23, 2026
Conversation

@HuiyingLi (Contributor)

Summary

Applies the same optimization as #1570 (Qwen3-VL-MoE) to the Qwen3.5-MoE state_dict_adapter, which had the same unnecessary complexity.

Changes

  • to_hf: Removed all_gather_object, full_tensor(), per-expert CPU copies, .to(dtype) casts, and the device_mesh branch (~85 lines removed). Now just renames keys + transpose(1, 2). No comms needed — DCP handles distributed save.
  • from_hf: Added DTensor vs plain tensor path distinction:
    • DCP path (DTensor): rename + transpose — no EP slicing, no create_dtensor_from_local.
    • Init path (plain tensor): slice + transpose + create DTensor (unchanged behavior).
  • convert_single_tensor_to_hf: Removed to_local() and .to(self.dtype) calls. Simple key rename + transpose; DTensors pass through.
  • Removed import torch.distributed as dist (no longer needed).
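The simplified to_hf path described above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the actual adapter: the key patterns and helper names are hypothetical, and the real code operates on torch tensors (using `tensor.transpose(1, 2)` and letting DTensors pass through) rather than this pure-Python list stand-in.

```python
import re

# Hypothetical key pattern; the real Qwen3.5-MoE adapter's regexes differ.
_EXPERT_KEY = re.compile(
    r"^model\.layers\.(\d+)\.mlp\.experts\.(gate_up_proj|down_proj)$"
)

def transpose_1_2(t):
    """Swap dims 1 and 2 of a 3D nested list, mimicking tensor.transpose(1, 2)
    on an [num_experts, d0, d1] expert weight."""
    return [[[t[e][i][j] for i in range(len(t[e]))]
             for j in range(len(t[e][0]))]
            for e in range(len(t))]

def to_hf(native_state):
    """Simplified to_hf: rename keys + transpose(1, 2), nothing else.
    No collectives, no full_tensor(), no per-expert CPU copies --
    DCP handles the distributed save, so sharded values pass through."""
    hf_state = {}
    for key, value in native_state.items():
        m = _EXPERT_KEY.match(key)
        if m:
            layer, proj = m.groups()
            # Hypothetical HF-side key form.
            hf_key = f"model.layers.{layer}.mlp.experts.{proj}.weight"
            hf_state[hf_key] = transpose_1_2(value)
        else:
            hf_state[key] = value
    return hf_state
```

For example, a single-expert `[1, 3, 2]` value comes out as `[1, 2, 3]` under the renamed key, with non-expert keys untouched.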

Benchmark (Qwen3.5-35B-A3B real weights, 1811 keys, 848 expert keys, single node)

| Operation                | Before | After  | Speedup |
|--------------------------|--------|--------|---------|
| from_hf (plain tensor)   | 1.9 ms | 1.7 ms | 1.1x    |
| to_hf (no device_mesh)   | 1.8 ms | 0.9 ms | 2.0x    |
| to_hf (with device_mesh) | hangs  |        |         |

The main gain is eliminating the to_hf hang with device_mesh (same issue as #1570) and removing unnecessary all-gather comms on the DCP save path.
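The from_hf path split (DCP resume vs. fresh init) boils down to a single isinstance dispatch. A minimal sketch of that branching logic, using a stub class standing in for `torch.distributed.tensor.DTensor` so it runs without a process group; the helper callbacks are hypothetical:

```python
class DTensorStub:
    """Stand-in for torch.distributed.tensor.DTensor; the real adapter
    checks the real class."""
    def __init__(self, local):
        self.local = local

def from_hf_value(value, slice_for_ep, make_dtensor):
    """Sketch of the from_hf dispatch (names hypothetical):
    - DCP path: value is already a DTensor, so it passes through untouched
      (only the key rename + transpose apply elsewhere) -- no EP slicing,
      no create_dtensor_from_local.
    - Init path: value is a plain tensor, so slice it for this EP rank
      and wrap the shard into a DTensor (unchanged behavior)."""
    if isinstance(value, DTensorStub):
        return value                      # DCP path: pass through
    sliced = slice_for_ep(value)          # init path: slice for this rank
    return make_dtensor(sliced)           # ...then wrap into a DTensor
```

With no slicing or wrapping on the DCP path, there is also nothing left to all-gather, which is what removes both the comms and the hang.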

E2E validation (8x H100, Qwen3.5-35B-A3B real weights)

  • Run 1: Trained 30 steps → checkpoint saved at step 29
  • Run 2: Resumed from checkpoint → trained to step 59 → checkpoint saved
  • Loss stable, no errors

Test plan

  • 26 unit tests pass (3 old all_gather tests removed, 2 new DTensor/transpose tests added)
  • Round-trip correctness verified (HF → native → HF, all expert keys match)
  • E2E train + resume on 8x H100 with real weights
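The round-trip item in the test plan amounts to asserting that the HF → native → HF key mapping is the identity. A self-contained sketch with hypothetical key forms (the real adapter's mappings are more involved and also transpose values):

```python
def native_to_hf_key(key):
    # Hypothetical rule: expert keys gain a ".weight" suffix on the HF side.
    if ".mlp.experts." in key and not key.endswith(".weight"):
        return key + ".weight"
    return key

def hf_to_native_key(key):
    # Inverse rule: strip the ".weight" suffix from expert keys.
    if ".mlp.experts." in key and key.endswith(".weight"):
        return key[: -len(".weight")]
    return key

keys = [
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.0.mlp.experts.down_proj",
    "model.embed_tokens.weight",
]
# Round trip must be the identity on every key, expert or not.
assert [hf_to_native_key(native_to_hf_key(k)) for k in keys] == keys
```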

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Mar 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi force-pushed the huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter branch from ebf61f8 to 45e0e40 on March 21, 2026 01:44
@HuiyingLi (Contributor, Author)

/ok to test 45e0e40

Same fix as PR #1570 for Qwen3-VL-MoE, applied to Qwen3.5-MoE:

- to_hf: removed all_gather_object, full_tensor(), per-expert CPU copies,
  .to(dtype) casts. Now just renames keys + transpose(1,2). No comms.
- from_hf: DTensor passthrough for DCP path (rename + transpose).
  Plain tensor init path unchanged (slice + transpose + create DTensor).
- convert_single_tensor_to_hf: removed to_local() and .to(dtype).
  Just rename + transpose, DTensors pass through.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@HuiyingLi force-pushed the huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter branch from 45e0e40 to 389899b on March 21, 2026 01:46
@HuiyingLi (Contributor, Author)

/ok to test 389899b

@hemildesai (Contributor)

/claude review

@HuiyingLi HuiyingLi merged commit b858b94 into main Mar 23, 2026
52 checks passed
@HuiyingLi HuiyingLi deleted the huiyingl/perf-qwen3_5_moe-simplify-state-dict-adapter branch March 23, 2026 19:20
torsli pushed a commit that referenced this pull request Mar 24, 2026: perf: simplify Qwen3.5-MoE state_dict_adapter to rename+transpose only (#1589)
linnanwang pushed a commit that referenced this pull request Apr 24, 2026: perf: simplify Qwen3.5-MoE state_dict_adapter to rename+transpose only (#1589)
2 participants