
fix: auto-compute dp_replicate_size from world_size#1302

Merged
yeyu-nvidia merged 2 commits into main from yeyu/fix-dp-replicate-size
Apr 20, 2026

Conversation

@yeyu-nvidia
Contributor

@yeyu-nvidia yeyu-nvidia commented Apr 20, 2026

Summary

  • When dp_shard_size < world_size (e.g., dp_shard_size=4 on 8 GPUs across 2 nodes), ParallelismConfig raises "total_size (4) does not match num_processes (8)" because dp_replicate_size defaults to 1
  • Auto-compute dp_replicate_size = world_size // (dp_shard_size * cp_size) so intra-node FSDP2 sharding + inter-node data-parallel replication works without manual config
  • This enables dp_shard_size to be set to per-node GPU count (better NVLink utilization) while automatically creating replicas across nodes
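The auto-computation described above can be sketched as follows. This is a minimal illustration, not the merged code; `compute_dp_replicate_size` is a hypothetical helper name, and reading `WORLD_SIZE` mirrors what torchrun exports:

```python
import os


def compute_dp_replicate_size(dp_shard_size: int, cp_size: int = 1) -> int:
    """Derive dp_replicate_size so that
    dp_replicate_size * dp_shard_size * cp_size == world_size."""
    # torchrun sets WORLD_SIZE; default to 1 for non-distributed runs.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return world_size // (dp_shard_size * cp_size)
```

With dp_shard_size=4 and cp_size=1 on 8 processes, this yields dp_replicate_size=2: each node holds one FSDP2 shard group, and the two nodes replicate each other.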

Test plan

  • Verify single-node training (dp_shard_size == world_size, dp_replicate_size == 1) unchanged
  • Verify multi-node with dp_shard_size < world_size creates correct replica groups
  • Verify existing EAGLE3/DFlash configs still work

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor
    • Improved parallelism setup in the speculative decoding example to better detect and validate available devices, derive replication size, and ensure consistent distributed-training configuration.

When dp_shard_size < world_size (e.g., dp_shard_size=4 on 8 GPUs),
ParallelismConfig raises "total_size does not match num_processes"
because dp_replicate_size defaults to 1.

Auto-compute dp_replicate_size = world_size // (dp_shard_size * cp_size)
so that intra-node FSDP2 sharding + inter-node data-parallel replication
works without requiring users to manually set dp_replicate_size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner April 20, 2026 19:29
@yeyu-nvidia yeyu-nvidia requested a review from h-guo18 April 20, 2026 19:29
@coderabbitai
Contributor

coderabbitai Bot commented Apr 20, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 137650a0-9f04-4159-ae6b-0d9b3bbb0353

📥 Commits

Reviewing files that changed from the base of the PR and between 83226f4 and d9bb6c4.

📒 Files selected for processing (1)
  • examples/speculative_decoding/main.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/speculative_decoding/main.py

📝 Walkthrough

Walkthrough

The train() function in the speculative decoding example now computes dp_replicate_size from the world size and existing parallelism parameters, validates divisibility, and includes dp_replicate_size in the ParallelismConfig when cp_size > 1 or dp_shard_size > 1.

Changes

  • ParallelismConfig Initialization (examples/speculative_decoding/main.py): When cp_size > 1 or dp_shard_size > 1, compute world_size from WORLD_SIZE (falling back to torch.cuda.device_count()), derive parallel_size = dp_shard_size * cp_size, assert world_size % parallel_size == 0, set dp_replicate_size = world_size // parallel_size, and pass dp_replicate_size into training_args.parallelism_config alongside cp_size and dp_shard_size.
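The world-size resolution step in the walkthrough can be sketched like this (illustrative only; the guard against a missing torch install is an assumption, and note the per-node caveat raised in review):

```python
import os


def resolve_world_size() -> int:
    """Resolve the global process count for distributed training."""
    # torchrun exports WORLD_SIZE; prefer it when present.
    env_ws = os.environ.get("WORLD_SIZE")
    if env_ws is not None:
        return int(env_ws)
    # Fallback: local GPU count. This is the *per-node* count, which
    # only equals world_size on a single node.
    try:
        import torch
        return max(torch.cuda.device_count(), 1)
    except ImportError:
        return 1
```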

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title accurately describes the main change (auto-computing dp_replicate_size from world_size), which is the core fix for the ParallelismConfig error.
  • Security Anti-Patterns ✅ Passed: no security anti-patterns detected (no torch.load with weights_only=False, numpy.load with allow_pickle=True, trust_remote_code=True, eval(), exec(), or # nosec comments in the modified code).



@github-actions
Contributor

github-actions Bot commented Apr 20, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-20 20:40 UTC

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/speculative_decoding/main.py`:
- Around line 215-223: Before computing dp_replicate_size, validate the
topology: compute parallel_size = training_args.dp_shard_size *
training_args.cp_size and check that parallel_size > 0 and world_size %
parallel_size == 0; if not, raise a clear ValueError explaining world_size,
dp_shard_size, cp_size and the expected divisibility so we fail fast; only then
compute dp_replicate_size = world_size // parallel_size and pass it into
ParallelismConfig (references: world_size, parallel_size, dp_replicate_size,
training_args.dp_shard_size, training_args.cp_size, ParallelismConfig).
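The validation the comment asks for could look like this (a sketch with assumed names such as `validated_dp_replicate_size`, not the merged code):

```python
def validated_dp_replicate_size(world_size: int, dp_shard_size: int,
                                cp_size: int) -> int:
    """Compute dp_replicate_size, failing fast on an invalid topology."""
    parallel_size = dp_shard_size * cp_size
    # Raise an explicit error here instead of letting ParallelismConfig
    # fail later with a less specific size-mismatch message.
    if parallel_size <= 0 or world_size % parallel_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_shard_size * cp_size ({dp_shard_size} * {cp_size} "
            f"= {parallel_size})"
        )
    return world_size // parallel_size
```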

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 393b2bf5-3bcb-4a5e-b0fc-63301c39967f

📥 Commits

Reviewing files that changed from the base of the PR and between 289a239 and 83226f4.

📒 Files selected for processing (1)
  • examples/speculative_decoding/main.py

Comment thread examples/speculative_decoding/main.py
@yeyu-nvidia yeyu-nvidia enabled auto-merge (squash) April 20, 2026 19:37
Collaborator

@ChenhanYu ChenhanYu left a comment


Correct fix for multi-node FSDP2 where dp_shard_size < world_size. One suggestion: add a divisibility guard (if world_size % parallel_size != 0: raise ValueError(...)) to catch misconfigurations early instead of letting ParallelismConfig fail with a confusing error. Also note the torch.cuda.device_count() fallback returns per-node count, not world size — correct for single-node but worth a comment. LGTM.

@yeyu-nvidia yeyu-nvidia added the cherry-pick-0.44.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Apr 20, 2026
@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.98%. Comparing base (289a239) to head (d9bb6c4).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1302      +/-   ##
==========================================
+ Coverage   75.38%   75.98%   +0.60%     
==========================================
  Files         462      462              
  Lines       49960    49960              
==========================================
+ Hits        37662    37962     +300     
+ Misses      12298    11998     -300     
Flag Coverage Δ
examples 41.56% <ø> (+0.85%) ⬆️
regression 14.85% <ø> (+0.06%) ⬆️
unit 52.40% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Address review feedback:
- Add ValueError if world_size is not divisible by dp_shard_size * cp_size
- Comment that torch.cuda.device_count() is per-node, not world_size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
@yeyu-nvidia yeyu-nvidia merged commit 2fef374 into main Apr 20, 2026
37 checks passed
@yeyu-nvidia yeyu-nvidia deleted the yeyu/fix-dp-replicate-size branch April 20, 2026 20:39
kevalmorabia97 pushed a commit that referenced this pull request Apr 21, 2026
@kevalmorabia97 kevalmorabia97 added the cherry-pick-done Added by bot once PR is cherry-picked to the release branch label Apr 21, 2026
