fix: resolve TP+PP for nemotron super 49B#1607

Merged
akoumpa merged 4 commits into main from huiyingl/fix-tp-pp-pipeline-parallelism-bugs
Mar 25, 2026

Conversation

@HuiyingLi (Contributor) commented Mar 25, 2026

single node hellaswag: [image]
2 nodes squad: [image]

When pipeline parallelism splits a model, `nn.ModuleList` layers are
converted to `nn.ModuleDict`. Three issues surfaced with custom models
(e.g. DeciLM/Nemotron-49B) that use explicit `self.num_heads` in
attention views and return tuples from decoder layers:

1. `_update_attention_head_counts_for_tp` iterates `for layer in layers`,
   which yields string keys (not modules) for `ModuleDict` — head counts
   were never updated, causing shape mismatches in the Q/K/V view.

2. The walrus-operator fallback for `causal_mask_mapping` could leave a
   raw 2D `attention_mask` in place of the expected 4D causal mask when
   the import or computation failed silently.

3. The batch device-move code filtered out `None` values from nested
   dicts, dropping `causal_mask_mapping` entries for sdpa-configured
   models where `create_causal_mask` returns `None`.
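The iteration bug in (1) can be sketched without torch: `nn.ModuleDict`, like a plain `dict`, yields string keys under direct iteration, while `nn.ModuleList` yields the modules themselves. A minimal normalizing helper (the name `iter_layers` is illustrative, not the repo's API; plain containers stand in for the torch ones):

```python
def iter_layers(layers):
    """Yield layer modules whether `layers` is list-like (nn.ModuleList)
    or dict-like (nn.ModuleDict, as produced by the PP split)."""
    # Dict-like containers yield string keys under plain iteration, so a
    # `for layer in layers` loop would silently skip every module and the
    # head counts would never be updated; take .values() instead.
    if hasattr(layers, "values"):
        yield from layers.values()
    else:
        yield from layers

# Plain containers stand in for nn.ModuleList / nn.ModuleDict here.
module_list = ["layer0", "layer1"]
module_dict = {"0": "layer0", "1": "layer1"}
assert list(iter_layers(module_list)) == ["layer0", "layer1"]
assert list(iter_layers(module_dict)) == ["layer0", "layer1"]
# Naive iteration over the dict yields keys, not modules:
assert list(module_dict) == ["0", "1"]
```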
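Issue (3) comes down to how the device-move helper treats `None` inside nested dicts. A hedged sketch of the fixed behavior (pure Python; `move_to_device` and `FakeTensor` are illustrative stand-ins, not the repo's actual helper or `torch.Tensor`):

```python
def move_to_device(obj, device):
    """Recursively move tensor-like values to `device`, preserving None
    entries (e.g. causal_mask_mapping values when create_causal_mask
    returns None for sdpa) instead of filtering them out."""
    if isinstance(obj, dict):
        # Keep every key, including those mapped to None.
        return {k: move_to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_to_device(v, device) for v in obj)
    if hasattr(obj, "to"):  # tensor-like
        return obj.to(device)
    return obj  # None, ints, strings pass through unchanged

class FakeTensor:
    """Minimal stand-in for torch.Tensor in this sketch."""
    def __init__(self, device="cpu"):
        self.device = device
    def to(self, device):
        return FakeTensor(device)

batch = {"input_ids": FakeTensor(), "causal_mask_mapping": {"sdpa": None}}
moved = move_to_device(batch, "cuda:0")
assert moved["input_ids"].device == "cuda:0"
assert "sdpa" in moved["causal_mask_mapping"]        # entry survives the move
assert moved["causal_mask_mapping"]["sdpa"] is None  # None is preserved, not dropped
```

The buggy variant effectively did `{k: v.to(device) for k, v in d.items() if v is not None}`, which is why the sdpa mapping entries vanished before reaching the attention code.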

Additionally, decoder layers in older-style HF models (pre-v5) return
tuples rather than bare tensors, and raw 2D padding masks that leak
through the pipeline schedule need to be dropped before reaching
custom attention code.
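The tuple-vs-tensor contract above can be normalized with a tiny unwrap step (the helper name is illustrative; pre-v5 HF decoder layers put `hidden_states` first in the returned tuple):

```python
def unwrap_hidden_states(layer_output):
    """Return hidden_states from a decoder layer's output.

    Older-style (pre-v5) HF decoder layers return a tuple whose first
    element is hidden_states (optionally followed by attention weights
    or cache entries); newer layers return the tensor directly.
    """
    if isinstance(layer_output, tuple):
        return layer_output[0]
    return layer_output

# Strings stand in for tensors in this sketch.
assert unwrap_hidden_states(("hs", "attn_weights")) == "hs"
assert unwrap_hidden_states("hs") == "hs"
```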

Verified on nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 with tp4pp2
(100 training steps, hellaswag dataset, 8xH100).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@copy-pr-bot (Bot) commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa (Contributor) commented Mar 25, 2026

/ok to test 15f9cbe

@HuiyingLi (Contributor, Author) commented

/claude review

claude[bot] previously approved these changes Mar 25, 2026

LGTM. Both fixes are correct and well-targeted:

  1. ModuleDict iteration (`parallelizer.py`): properly handles the PP-converted `ModuleDict` by iterating `.values()` instead of yielding keys.
  2. Tuple unpacking + kwargs removal (`hf_utils.py`): correctly extracts `hidden_states` from the decoder layer's tuple output, matching the standard HF contract.

akoumpa commented Mar 25, 2026

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@thomasdhc thomasdhc force-pushed the huiyingl/fix-tp-pp-pipeline-parallelism-bugs branch from 4abe87d to 578e85c Compare March 25, 2026 23:09
akoumpa added 2 commits March 25, 2026 16:23
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa (Contributor) left a comment

LGTM. thank you @HuiyingLi

@akoumpa akoumpa merged commit 9db8c1f into main Mar 25, 2026
4 checks passed
@akoumpa akoumpa deleted the huiyingl/fix-tp-pp-pipeline-parallelism-bugs branch March 25, 2026 23:25
@chtruong814 (Contributor) commented

/claude review

claude[bot] left a comment

LGTM

linnanwang pushed a commit that referenced this pull request Apr 24, 2026
* fix: resolve TP+PP pipeline parallelism bugs for custom HF models

* update recipe

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>