cp: fix: gpt_oss_20b_single_gpu_peft CI crash with nproc_per_node override (1835) into r0.4.0#1844

Merged
akoumpa merged 1 commit into r0.4.0 from cherry-pick-1835-r0.4.0
Apr 14, 2026

Conversation

@svcnvidia-nemo-ci
Contributor

beep boop [🤖]: Hi @adil-a 👋,

we've cherry-picked #1835 into `r0.4.0` for you! 🚀

Please review and approve this cherry-pick at your convenience!

fix: gpt_oss_20b_single_gpu_peft CI crash with `nproc_per_node` override

The single-GPU gpt_oss_20b peft recipe was being launched with 8 GPUs
in CI, causing a `ValueError: Unexpected dimension name: dp_replicate`
in the MoE state_dict_adapter during checkpoint loading.

Root cause: the recipe targets DGX Spark (1 GPU) but CI defaults to
NPROC_PER_NODE=8. With 8 GPUs the FSDP2 mesh includes dp_replicate,
which the MoE weight dequantization code did not handle.
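
A minimal sketch of the defensive check named in the fix list, assuming the adapter can see the DTensor mesh dimension names as a tuple of strings. The EP dimension names (`ep`, `ep_shard`, `ep_replicate`) and both function names are hypothetical, chosen for illustration; they are not taken from the actual state_dict_adapter code.

```python
# Hypothetical EP dimension names; the real adapter may use different ones.
EP_DIM_NAMES = frozenset({"ep", "ep_shard", "ep_replicate"})


def has_ep_dims(mesh_dim_names):
    """Return True if any mesh dimension is an expert-parallel (EP) dimension."""
    return any(name in EP_DIM_NAMES for name in mesh_dim_names)


def plan_redistribution(mesh_dim_names):
    """Decide whether EP redistribution should run for this mesh.

    A mesh made only of data-parallel dimensions (for example one that
    contains dp_replicate) has nothing to redistribute across experts,
    so it is skipped instead of being rejected as an unexpected name.
    """
    if not has_ep_dims(mesh_dim_names):
        return "skip"
    return "redistribute"
```

With these assumptions, a mesh such as `("dp_replicate",)` is skipped rather than triggering an error on an unrecognized dimension name.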

Fixes:
- Add `ci.nproc_per_node: 1` to run this recipe on 1 GPU in CI
- Add CONFIG_NPROC_PER_NODE support in finetune_launcher.sh and
  generate_ci_tests.py so per-recipe GPU count overrides work
- Skip EP redistribution in state_dict_adapter when no EP dims
  are present in the DTensor mesh (defensive fix)
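
The per-recipe override in the first two fixes could work roughly as follows. This is a sketch under stated assumptions, not the code in generate_ci_tests.py or finetune_launcher.sh: the function names are hypothetical, the recipe config is modeled as a plain dict, and the precedence (CONFIG_NPROC_PER_NODE over NPROC_PER_NODE over a default of 8) is an assumption based on the description above.

```python
DEFAULT_NPROC_PER_NODE = 8  # CI default stated in the commit message


def launcher_overrides(recipe_cfg):
    """Build the env vars a test generator might export for one recipe.

    Only recipes that set ci.nproc_per_node produce an override
    (hypothetical helper; the dict stands in for the parsed recipe YAML).
    """
    ci_cfg = recipe_cfg.get("ci") or {}
    if "nproc_per_node" not in ci_cfg:
        return {}
    return {"CONFIG_NPROC_PER_NODE": str(int(ci_cfg["nproc_per_node"]))}


def effective_nproc(env):
    """Resolve the GPU count a launcher would use (assumed precedence)."""
    if "CONFIG_NPROC_PER_NODE" in env:
        return int(env["CONFIG_NPROC_PER_NODE"])
    return int(env.get("NPROC_PER_NODE", DEFAULT_NPROC_PER_NODE))


# Example: the single-GPU recipe overrides the CI-wide default of 8.
env = {"NPROC_PER_NODE": "8"}
env.update(launcher_overrides({"ci": {"nproc_per_node": 1}}))
print(effective_nproc(env))
```

Recipes without a `ci.nproc_per_node` entry produce no override and keep the CI-wide value.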

Verified: 100 steps on 1 GPU, loss 4.90 → 2.25, 61.7 GiB peak memory.

Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test f7bdbf5

@copy-pr-bot
copy-pr-bot Bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa merged commit 48a8229 into r0.4.0 Apr 14, 2026
52 of 54 checks passed
@akoumpa akoumpa deleted the cherry-pick-1835-r0.4.0 branch April 14, 2026 22:02
Labels

cherry-pick Run CICD Trigger Testing CICD

3 participants