fix: auto-compute dp_replicate_size from world_size (#1302)
Conversation
When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs), `ParallelismConfig` raises "total_size does not match num_processes" because `dp_replicate_size` defaults to 1. Auto-compute `dp_replicate_size = world_size // (dp_shard_size * cp_size)` so that intra-node FSDP2 sharding plus inter-node data-parallel replication works without requiring users to set `dp_replicate_size` manually.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
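The fix reduces to one line of integer arithmetic. A minimal sketch (the standalone function form is illustrative; the actual change lives inline in `examples/speculative_decoding/main.py`):

```python
def compute_dp_replicate_size(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    """Number of data-parallel replicas left over once shard and
    context-parallel ranks are accounted for."""
    return world_size // (dp_shard_size * cp_size)

# 8 GPUs across 2 nodes, FSDP2-sharding within each 4-GPU node:
print(compute_dp_replicate_size(world_size=8, dp_shard_size=4))  # prints 2
```

With `dp_replicate_size=2`, the 2x4 device mesh satisfies `ParallelismConfig`'s requirement that the product of the parallel sizes equals the number of processes.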
Actionable comments posted: 1
Inline comments:
In `@examples/speculative_decoding/main.py`:
- Around line 215-223: Before computing dp_replicate_size, validate the
topology: compute parallel_size = training_args.dp_shard_size *
training_args.cp_size and check that parallel_size > 0 and world_size %
parallel_size == 0; if not, raise a clear ValueError explaining world_size,
dp_shard_size, cp_size and the expected divisibility so we fail fast; only then
compute dp_replicate_size = world_size // parallel_size and pass it into
ParallelismConfig (references: world_size, parallel_size, dp_replicate_size,
training_args.dp_shard_size, training_args.cp_size, ParallelismConfig).
ChenhanYu left a comment:
Correct fix for multi-node FSDP2 where dp_shard_size < world_size. One suggestion: add a divisibility guard (if world_size % parallel_size != 0: raise ValueError(...)) to catch misconfigurations early instead of letting ParallelismConfig fail with a confusing error. Also note the torch.cuda.device_count() fallback returns per-node count, not world size — correct for single-node but worth a comment. LGTM.
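The suggested divisibility guard could look like the following sketch (names mirror the review comment, not necessarily the merged code; `world_size` is passed in rather than read from the environment to keep the example self-contained):

```python
def resolve_dp_replicate_size(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    parallel_size = dp_shard_size * cp_size
    # Fail fast with a clear message instead of letting ParallelismConfig
    # raise a confusing "total_size does not match num_processes" error.
    if parallel_size <= 0 or world_size % parallel_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_shard_size * cp_size ({dp_shard_size} * {cp_size} = {parallel_size})"
        )
    return world_size // parallel_size

# Valid: 8 ranks sharded in groups of 4 -> 2 replica groups.
print(resolve_dp_replicate_size(8, 4))  # prints 2
# Invalid: 8 ranks cannot be split into groups of 3 -> rejected up front.
try:
    resolve_dp_replicate_size(8, 3)
except ValueError as e:
    print("rejected:", e)
```

Note the reviewer's caveat also applies here: if `world_size` is derived from `torch.cuda.device_count()`, that is a per-node count, which is only a correct fallback for single-node runs.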
Codecov Report — ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1302      +/-   ##
==========================================
+ Coverage   75.38%   75.98%   +0.60%
==========================================
  Files         462      462
  Lines       49960    49960
==========================================
+ Hits        37662    37962     +300
+ Misses      12298    11998     -300
Address review feedback:
- Add ValueError if world_size is not divisible by dp_shard_size * cp_size
- Comment that torch.cuda.device_count() is per-node, not world_size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
## Summary

- When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs across 2 nodes), `ParallelismConfig` raises `total_size (4) does not match num_processes (8)` because `dp_replicate_size` defaults to 1
- Auto-compute `dp_replicate_size = world_size // (dp_shard_size * cp_size)` so intra-node FSDP2 sharding + inter-node data-parallel replication works without manual config
- This enables `dp_shard_size` to be set to the per-node GPU count (better NVLink utilization) while automatically creating replicas across nodes

## Test plan

- [ ] Verify single-node training (dp_shard_size == world_size, dp_replicate_size == 1) is unchanged
- [ ] Verify multi-node with dp_shard_size < world_size creates correct replica groups
- [ ] Verify existing EAGLE3/DFlash configs still work

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Refactor**
  * Enhanced parallelism configuration initialization in the speculative decoding example to better handle distributed training scenarios.

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
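As a back-of-the-envelope check on the first two test-plan items, the computed sizes should satisfy the `ParallelismConfig` invariant that the product of the parallel sizes equals the number of processes (this is illustrative arithmetic, not the actual test code):

```python
# dp_replicate_size = world_size // (dp_shard_size * cp_size)
scenarios = [
    # (world_size, dp_shard_size, cp_size, expected_replicas)
    (8, 8, 1, 1),  # single node: one replica, behavior unchanged
    (8, 4, 1, 2),  # 2 nodes x 4 GPUs: two replica groups across nodes
]
for world_size, dp_shard, cp, expected in scenarios:
    dp_replicate = world_size // (dp_shard * cp)
    assert dp_replicate == expected
    # ParallelismConfig requires total_size == num_processes:
    assert dp_replicate * dp_shard * cp == world_size
print("all scenarios consistent")
```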