fix: auto-compute dp_replicate_size from world_size (#1302)
Conversation
When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs), `ParallelismConfig` raises "total_size does not match num_processes" because `dp_replicate_size` defaults to 1. Auto-compute `dp_replicate_size = world_size // (dp_shard_size * cp_size)` so that intra-node FSDP2 sharding plus inter-node data-parallel replication works without requiring users to set `dp_replicate_size` manually.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
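The fix reduces to one line of integer arithmetic. A minimal sketch (the standalone function form is illustrative; the actual change lives inline in `examples/speculative_decoding/main.py`):

```python
def compute_dp_replicate_size(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    """Number of data-parallel replicas left over once shard and
    context-parallel ranks are accounted for."""
    return world_size // (dp_shard_size * cp_size)

# 8 GPUs across 2 nodes, FSDP2-sharding within each 4-GPU node:
print(compute_dp_replicate_size(world_size=8, dp_shard_size=4))  # prints 2
```

With `dp_replicate_size=2`, the 2x4 device mesh satisfies `ParallelismConfig`'s requirement that the product of the parallel sizes equals the number of processes.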
Actionable comments posted: 1
Inline comments:
In `@examples/speculative_decoding/main.py`:
- Around line 215-223: Before computing dp_replicate_size, validate the
topology: compute parallel_size = training_args.dp_shard_size *
training_args.cp_size and check that parallel_size > 0 and world_size %
parallel_size == 0; if not, raise a clear ValueError explaining world_size,
dp_shard_size, cp_size and the expected divisibility so we fail fast; only then
compute dp_replicate_size = world_size // parallel_size and pass it into
ParallelismConfig (references: world_size, parallel_size, dp_replicate_size,
training_args.dp_shard_size, training_args.cp_size, ParallelismConfig).
ChenhanYu left a comment:
Correct fix for multi-node FSDP2 where dp_shard_size < world_size. One suggestion: add a divisibility guard (if world_size % parallel_size != 0: raise ValueError(...)) to catch misconfigurations early instead of letting ParallelismConfig fail with a confusing error. Also note the torch.cuda.device_count() fallback returns per-node count, not world size — correct for single-node but worth a comment. LGTM.
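The suggested divisibility guard could look like the following sketch (names mirror the review comment, not necessarily the merged code; `world_size` is passed in rather than read from the environment to keep the example self-contained):

```python
def resolve_dp_replicate_size(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    parallel_size = dp_shard_size * cp_size
    # Fail fast with a clear message instead of letting ParallelismConfig
    # raise a confusing "total_size does not match num_processes" error.
    if parallel_size <= 0 or world_size % parallel_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_shard_size * cp_size ({dp_shard_size} * {cp_size} = {parallel_size})"
        )
    return world_size // parallel_size

# Valid: 8 ranks sharded in groups of 4 -> 2 replica groups.
print(resolve_dp_replicate_size(8, 4))  # prints 2
# Invalid: 8 ranks cannot be split into groups of 3 -> rejected up front.
try:
    resolve_dp_replicate_size(8, 3)
except ValueError as e:
    print("rejected:", e)
```

Note the reviewer's caveat also applies here: if `world_size` is derived from `torch.cuda.device_count()`, that is a per-node count, which is only a correct fallback for single-node runs.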
Codecov Report — ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1302      +/-   ##
==========================================
+ Coverage   75.38%   75.98%   +0.60%
==========================================
  Files         462      462
  Lines       49960    49960
==========================================
+ Hits        37662    37962     +300
+ Misses      12298    11998     -300
Address review feedback:
- Add ValueError if world_size is not divisible by dp_shard_size * cp_size
- Comment that torch.cuda.device_count() is per-node, not world_size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
## Summary

- When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs across 2 nodes), `ParallelismConfig` raises `total_size (4) does not match num_processes (8)` because `dp_replicate_size` defaults to 1
- Auto-compute `dp_replicate_size = world_size // (dp_shard_size * cp_size)` so intra-node FSDP2 sharding + inter-node data-parallel replication works without manual config
- This enables `dp_shard_size` to be set to the per-node GPU count (better NVLink utilization) while automatically creating replicas across nodes

## Test plan

- [ ] Verify single-node training (dp_shard_size == world_size, dp_replicate_size == 1) is unchanged
- [ ] Verify multi-node with dp_shard_size < world_size creates correct replica groups
- [ ] Verify existing EAGLE3/DFlash configs still work

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Refactor**
  * Enhanced parallelism configuration initialization in the speculative decoding example to better handle distributed training scenarios.

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
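As a back-of-the-envelope check on the first two test-plan items, the computed sizes should satisfy the `ParallelismConfig` invariant that the product of the parallel sizes equals the number of processes (this is illustrative arithmetic, not the actual test code):

```python
# dp_replicate_size = world_size // (dp_shard_size * cp_size)
scenarios = [
    # (world_size, dp_shard_size, cp_size, expected_replicas)
    (8, 8, 1, 1),  # single node: one replica, behavior unchanged
    (8, 4, 1, 2),  # 2 nodes x 4 GPUs: two replica groups across nodes
]
for world_size, dp_shard, cp, expected in scenarios:
    dp_replicate = world_size // (dp_shard * cp)
    assert dp_replicate == expected
    # ParallelismConfig requires total_size == num_processes:
    assert dp_replicate * dp_shard * cp == world_size
print("all scenarios consistent")
```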