
fix: swap DTensor shard placements after transpose in Step3p5 state dict adapter #1691

Merged
akoumpa merged 2 commits into main from fix/step3p5-dtensor-placement-transpose on Apr 6, 2026

Conversation

@adil-a (Collaborator) commented Apr 6, 2026

Summary

  • Fix Step3p5 model checkpoint loading crash when ep_shard_size > 1 (i.e. DP replicas exist beyond EP)
  • The Step3p5StateDictAdapter transposes dims 1↔2 when converting between HF and native MoE weight formats, but was recreating DTensors with the original Shard placements — causing Shard(1) (from FSDP2's shard_placement_fn=lambda _: Shard(1) in the MoE parallelizer) to apply to the wrong axis after transpose
  • This produced incorrect global shapes (e.g. [288, 8192, 640] instead of [288, 4096, 1280] for down_proj.weight), making DCP load fail with ValueError: Size mismatch
  • The fix swaps Shard(1) ↔ Shard(2) in placements after every transpose(1, 2) across all 4 conversion sites (2 in to_hf, 2 in from_hf); see the sketch below
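
In sketch form (illustrative only; the helper names below are made up for this note, while the torch.distributed.tensor APIs are real, and the PR's actual helper is _create_dtensor_from_local_or_reference):

```python
# Illustrative sketch of the fix, not the PR's exact code: after
# transposing dims 1 and 2 of the local shard, any Shard(1)/Shard(2)
# placement must be swapped so DTensor.from_local infers the correct
# global shape.
from torch.distributed.tensor import DTensor, Shard

def swap_shard_dims_1_2(placements):
    # Shard(1) <-> Shard(2); Replicate and other placements pass through.
    out = []
    for p in placements:
        if isinstance(p, Shard) and p.dim == 1:
            out.append(Shard(2))
        elif isinstance(p, Shard) and p.dim == 2:
            out.append(Shard(1))
        else:
            out.append(p)
    return tuple(out)

def transpose_expert_weight(w: DTensor) -> DTensor:
    local = w.to_local().transpose(1, 2).contiguous()
    return DTensor.from_local(local, w.device_mesh, swap_shard_dims_1_2(w.placements))
```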

Test plan

  • All 17 existing Step3p5 state dict adapter unit tests pass
  • Validated on cw-dfw with the same CI config (PP=2, EP=32, 16 nodes, 128 GPUs): the model loads, trains 10 steps (loss 5.12 → 2.92), and saves a consolidated safetensors checkpoint successfully
  • CI job that originally failed: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/292910876

🤖 Generated with Claude Code

fix: swap DTensor shard placements after transpose in Step3p5 state dict adapter

When FSDP2 shards MoE expert params on dim 1 (via ep_shard mesh), the
Step3p5StateDictAdapter's to_hf/from_hf conversions transposed dims 1
and 2 of the local tensor but recreated the DTensor with the original
placements. This caused Shard(1) to apply to the wrong axis, producing
incorrect global shapes (e.g. [288, 8192, 640] instead of [288, 4096,
1280] for down_proj.weight), which made DCP checkpoint loading fail
with a size mismatch error.

The fix swaps Shard(1) <-> Shard(2) in placements after every
transpose(1, 2) so DTensor.from_local infers the correct global shape.

Signed-off-by: adil-a <adasif@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
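
To make the size mismatch concrete, this is the shape inference DTensor.from_local performs, using the down_proj numbers from the description (consistent with an ep_shard size of 2: 128 GPUs / PP=2 / EP=32; the helper below is purely illustrative):

```python
# DTensor.from_local infers the global shape by multiplying the sharded
# dim by the mesh size. With the down_proj local shard [288, 4096, 640]
# on a 2-way ep_shard mesh:
def inferred_global(local_shape, shard_dim, mesh_size):
    shape = list(local_shape)
    shape[shard_dim] *= mesh_size
    return shape

local = [288, 4096, 640]
print(inferred_global(local, shard_dim=2, mesh_size=2))  # [288, 4096, 1280]  correct (Shard(2))
print(inferred_global(local, shard_dim=1, mesh_size=2))  # [288, 8192, 640]   the bug (Shard(1))
```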
@copy-pr-bot (Bot) commented Apr 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

refactor: use placements_override param instead of bypassing helper

Pass swapped placements through _create_dtensor_from_local_or_reference
via a new placements_override parameter instead of duplicating the
DTensor creation logic inline.

Signed-off-by: adil-a <adasif@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
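
A plausible shape for the refactored helper (only the name _create_dtensor_from_local_or_reference and the placements_override parameter appear in this PR; the rest of the signature is an assumption):

```python
from typing import Optional, Sequence

import torch
from torch.distributed.tensor import DTensor, Placement

def _create_dtensor_from_local_or_reference(
    local_tensor: torch.Tensor,
    reference: DTensor,
    placements_override: Optional[Sequence[Placement]] = None,
) -> DTensor:
    # Assumed signature: reuse the reference DTensor's mesh, and its
    # placements unless the caller passes swapped ones, as the four
    # transpose(1, 2) conversion sites now do.
    placements = (tuple(placements_override)
                  if placements_override is not None
                  else reference.placements)
    return DTensor.from_local(local_tensor, reference.device_mesh, placements)
```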
@adil-a (Collaborator, Author) commented Apr 6, 2026

/ok to test 276298c

@akoumpa merged commit 5018039 into main Apr 6, 2026
53 of 54 checks passed
@akoumpa deleted the fix/step3p5-dtensor-placement-transpose branch April 6, 2026 22:19
linnanwang pushed a commit that referenced this pull request Apr 24, 2026
fix: swap DTensor shard placements after transpose in Step3p5 state dict adapter (#1691)
