
cp: fix: enhancing non-colocated refit performance by having inclusive comm group (#1264) into r0.4.0 #1311

Merged
terrykong merged 1 commit into r0.4.0 from cherry-pick-1264-r0.4.0 on Oct 10, 2025
Conversation

@chtruong814
Contributor

@chtruong814 chtruong814 commented Oct 8, 2025

beep boop [🤖]: Hi @youngeunkwon0405 👋,

we've cherry-picked #1264 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features
    • Support separate training and inference cluster sizes with explicit train-world-size configuration.
    • Unified collective initialization across training and inference, improving reliability in non-colocated deployments.
  • Bug Fixes
    • Corrected world-size and rank calculations to prevent synchronization issues during setup.
    • Standardized weight broadcasting across all ranks for more consistent initialization.
  • Tests
    • Added smoke test validating non-colocated distillation setup.
    • Updated vLLM generation tests to use dynamic world sizes and pass train-world-size, ensuring compatibility with varied cluster configurations.

…mm group (#1264)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@github-actions

github-actions bot commented Oct 8, 2025

ℹ️ File Consistency Check

Check based on commit: 03c4f5c (PR #1311 from cherry-pick-1264-r0.4.0)

✅ DTensor Policy Worker Synchronization Check

Both DTensor policy worker files were modified in this PR:

  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py

Please ensure that the changes are consistent between both files where applicable.


This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.

@coderabbitai
Contributor

coderabbitai bot commented Oct 8, 2025

📝 Walkthrough

Walkthrough

Adds explicit train_world_size across algorithms, policies, and vLLM generation. World size is now computed as train_world_size + inference_world_size. All init_collective signatures and call sites are updated to accept and pass train_world_size. Rank computations and process group setup are adjusted accordingly; rank==0 special-casing for group creation and broadcasts is removed.
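The rank layout this walkthrough describes can be sketched in a few lines (illustrative only; the function and variable names here are not from the NeMo-RL codebase):

```python
# Sketch of the inclusive communication group layout: training ranks come
# first, and inference (vLLM) ranks are offset past them.
def compute_ranks(train_world_size: int, inference_world_size: int):
    world_size = train_world_size + inference_world_size
    train_ranks = list(range(train_world_size))
    # Inference workers are shifted after the training ranks:
    inference_ranks = [train_world_size + r for r in range(inference_world_size)]
    return world_size, train_ranks, inference_ranks

# Example: 4 training GPUs and 2 inference GPUs share one collective.
world_size, train_ranks, inference_ranks = compute_ranks(4, 2)
# world_size == 6, train_ranks == [0, 1, 2, 3], inference_ranks == [4, 5]
```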

Changes

Cohort / File(s) / Summary
  • Algorithms (world size calc & init calls): nemo_rl/algorithms/distillation.py, nemo_rl/algorithms/grpo.py. Compute train_world_size from the train cluster; set world_size = train_world_size + inference_world_size; pass train_world_size to policy and generation init_collective.
  • Policy interface & entrypoint: nemo_rl/models/policy/interfaces.py, nemo_rl/models/policy/lm_policy.py. Update ColocatablePolicyInterface.Policy.init_collective signatures to include keyword-only train_world_size; propagate the argument when invoking workers.
  • Policy workers (DTensor/Megatron): nemo_rl/models/policy/dtensor_policy_worker.py, .../dtensor_policy_worker_v2.py, .../megatron_policy_worker.py. Add keyword-only train_world_size to init_collective; always create StatelessProcessGroup with self.rank; initialize PyNcclCommunicator unconditionally; remove rank==0 gating in broadcasts (still src=0).
  • vLLM generation stack: nemo_rl/models/generation/vllm/vllm_backend.py, .../vllm_generation.py, .../vllm_worker.py, .../vllm_worker_async.py. Add train_world_size to init_collective APIs; compute worker rank as train_world_size + rank_prefix + local_rank; pass train_world_size through sync/async RPC args.
  • Tests: tests/unit/algorithms/test_distillation.py, tests/unit/models/generation/test_vllm_generation.py. Add a non-colocated distillation setup smoke test with mocks; update vLLM generation tests to derive world sizes dynamically and pass train_world_size to init_collective.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Orchestrator
  participant TrainCluster as Train Cluster
  participant InferenceCluster as Inference Cluster
  participant Policy as Policy (train)
  participant Gen as vLLM Generation (inference)
  participant WorkerT as Policy Workers
  participant WorkerG as vLLM Workers

  Orchestrator->>TrainCluster: world_size()
  TrainCluster-->>Orchestrator: train_world_size
  Orchestrator->>InferenceCluster: world_size()
  InferenceCluster-->>Orchestrator: inference_world_size
  Note over Orchestrator: world_size = train_world_size + inference_world_size

  Orchestrator->>Policy: init_collective(ip, port, world_size, train_world_size)
  Policy->>WorkerT: create StatelessProcessGroup(rank=self.rank,<br/>world_size)
  WorkerT->>WorkerT: init PyNcclCommunicator(group)
  WorkerT->>WorkerT: broadcast(src=0)

  Orchestrator->>Gen: init_collective(ip, port, world_size, train_world_size)
  Gen->>WorkerG: collective_rpc("init_collective",<br/>rank_prefix, ip, port, world_size, train_world_size)
  WorkerG->>WorkerG: rank = train_world_size + rank_prefix + local_rank
  WorkerG->>WorkerG: create StatelessProcessGroup(rank, world_size)
  WorkerG->>WorkerG: ready
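The worker rank formula in the diagram (rank = train_world_size + rank_prefix + local_rank) can be reproduced directly; the helper name here is illustrative:

```python
# Illustrative reproduction of the vLLM worker rank computation shown in
# the sequence diagram above.
def vllm_worker_rank(train_world_size: int, rank_prefix: int, local_rank: int) -> int:
    # Inference worker ranks start after all training ranks; rank_prefix
    # accounts for workers belonging to earlier inference engines.
    return train_world_size + rank_prefix + local_rank

# With 4 training ranks, the second worker (local_rank=1) of an engine
# whose workers start at prefix 2 lands at global rank 4 + 2 + 1 == 7.
assert vllm_worker_rank(4, 2, 1) == 7
```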

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested labels

r0.4.0, Performance

Suggested reviewers

  • yuki-97
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (3 warnings)
  • Title Check ⚠️ Warning: the title is structured as a cherry-pick reference rather than focusing on the main change, and includes noise like "cp:" and "into r0.4.0"; it does not explicitly communicate the introduction of a train_world_size parameter or the inclusive communication group improvement. Resolution: update the title to a clear, concise sentence that directly summarizes the key change (for example: "Add train_world_size parameter to include both training and inference nodes in the communication group") without cherry-pick or branch-merging metadata.
  • Docstring Coverage ⚠️ Warning: docstring coverage is 32.50%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes ⚠️ Warning: the PR introduces widespread API changes to collective initialization across multiple modules, so it qualifies as a major change, yet the PR description retrieved from GitHub contains only the automatic cherry-pick message without any testing or performance information, so the custom check fails. Resolution: update the PR description with the relevant test results or performance data (and numerics if applicable), then re-run this check.
✅ Passed checks (1 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/unit/algorithms/test_distillation.py (1)

418-525: Consider modelling world_size in DummyCluster

Nice to have the smoke test, but since DummyCluster.world_size() always returns 1, we won’t catch regressions in the train/inference world-size arithmetic that motivated this PR. It’d be cheap insurance to let the dummy accept a target world size (e.g., derived from bundle_ct_per_node_list) so the mocked setup mirrors the real topology a bit more closely.
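One way to implement the suggestion, assuming DummyCluster is constructed from bundle_ct_per_node_list (a hedged sketch, not the test's actual shape):

```python
# Hypothetical sketch of the reviewer's suggestion: derive the dummy
# cluster's world size from its bundle counts instead of hardcoding 1,
# so the mocked setup exercises the train/inference world-size arithmetic.
class DummyCluster:
    def __init__(self, bundle_ct_per_node_list: list[int]):
        self._bundle_ct_per_node_list = bundle_ct_per_node_list

    def world_size(self) -> int:
        # Mirror the real topology: total bundles across all nodes.
        return sum(self._bundle_ct_per_node_list)

# A two-node cluster with 2 bundles per node reports world_size 4.
assert DummyCluster([2, 2]).world_size() == 4
```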

nemo_rl/models/policy/dtensor_policy_worker.py (1)

504-516: Document unused train_world_size parameter in init_collective
The train_world_size parameter is required by the interface but isn’t used in this implementation; add a brief comment (e.g. as in megatron_policy_worker) to clarify its purpose.
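A minimal sketch of the clarifying comment being requested (class name and comment wording are hypothetical):

```python
# Hedged sketch of the suggested documentation for the unused parameter;
# the real dtensor_policy_worker.py signature may differ in detail.
class DTensorPolicyWorkerSketch:
    def init_collective(
        self, ip: str, port: int, world_size: int, *, train_world_size: int
    ) -> None:
        # train_world_size is unused in this worker; it is accepted only to
        # satisfy the shared policy init_collective interface. The vLLM
        # backend consumes it to offset inference ranks past training ranks.
        ...
```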

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4c0103 and 03c4f5c.

📒 Files selected for processing (13)
  • nemo_rl/algorithms/distillation.py (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_backend.py (1 hunks)
  • nemo_rl/models/generation/vllm/vllm_generation.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker.py (2 hunks)
  • nemo_rl/models/generation/vllm/vllm_worker_async.py (2 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker.py (2 hunks)
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py (2 hunks)
  • nemo_rl/models/policy/interfaces.py (1 hunks)
  • nemo_rl/models/policy/lm_policy.py (1 hunks)
  • nemo_rl/models/policy/megatron_policy_worker.py (2 hunks)
  • tests/unit/algorithms/test_distillation.py (1 hunks)
  • tests/unit/models/generation/test_vllm_generation.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/models/generation/vllm/vllm_worker.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/interfaces.py
  • tests/unit/algorithms/test_distillation.py
  • nemo_rl/models/policy/megatron_policy_worker.py
  • tests/unit/models/generation/test_vllm_generation.py
  • nemo_rl/models/generation/vllm/vllm_generation.py
  • nemo_rl/models/generation/vllm/vllm_backend.py
  • nemo_rl/algorithms/distillation.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/models/generation/vllm/vllm_worker_async.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/models/generation/vllm/vllm_worker.py
  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/policy/dtensor_policy_worker.py
  • nemo_rl/models/policy/interfaces.py
  • nemo_rl/models/policy/megatron_policy_worker.py
  • nemo_rl/models/generation/vllm/vllm_generation.py
  • nemo_rl/models/generation/vllm/vllm_backend.py
  • nemo_rl/algorithms/distillation.py
  • nemo_rl/models/policy/lm_policy.py
  • nemo_rl/models/policy/dtensor_policy_worker_v2.py
  • nemo_rl/models/generation/vllm/vllm_worker_async.py
🧬 Code graph analysis (12)
nemo_rl/models/generation/vllm/vllm_worker.py (2)
tests/unit/algorithms/test_distillation.py (1)
  • world_size (479-480)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
nemo_rl/algorithms/grpo.py (10)
tests/unit/algorithms/test_distillation.py (3)
  • world_size (479-480)
  • init_collective (492-493)
  • init_collective (505-506)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
nemo_rl/models/generation/vllm/vllm_backend.py (1)
  • init_collective (34-55)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • init_collective (370-407)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • init_collective (479-496)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • init_collective (504-515)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • init_collective (462-472)
nemo_rl/models/policy/interfaces.py (1)
  • init_collective (143-146)
nemo_rl/models/policy/lm_policy.py (1)
  • init_collective (236-248)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • init_collective (823-836)
nemo_rl/models/policy/dtensor_policy_worker.py (8)
nemo_rl/models/generation/vllm/vllm_backend.py (1)
  • init_collective (34-55)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • init_collective (370-407)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • init_collective (479-496)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • init_collective (462-472)
nemo_rl/models/policy/interfaces.py (1)
  • init_collective (143-146)
nemo_rl/models/policy/lm_policy.py (1)
  • init_collective (236-248)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • init_collective (823-836)
tests/unit/algorithms/test_distillation.py (3)
  • init_collective (492-493)
  • init_collective (505-506)
  • world_size (479-480)
nemo_rl/models/policy/interfaces.py (2)
tests/unit/algorithms/test_distillation.py (1)
  • world_size (479-480)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
tests/unit/algorithms/test_distillation.py (1)
nemo_rl/algorithms/distillation.py (1)
  • setup (150-460)
nemo_rl/models/policy/megatron_policy_worker.py (7)
nemo_rl/models/generation/vllm/vllm_backend.py (1)
  • init_collective (34-55)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • init_collective (370-407)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • init_collective (479-496)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • init_collective (504-515)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • init_collective (462-472)
nemo_rl/models/policy/interfaces.py (1)
  • init_collective (143-146)
nemo_rl/models/policy/lm_policy.py (1)
  • init_collective (236-248)
tests/unit/models/generation/test_vllm_generation.py (11)
tests/unit/algorithms/test_distillation.py (3)
  • world_size (479-480)
  • init_collective (492-493)
  • init_collective (505-506)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
nemo_rl/models/generation/vllm/vllm_backend.py (1)
  • init_collective (34-55)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • init_collective (370-407)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • init_collective (479-496)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • init_collective (504-515)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • init_collective (462-472)
nemo_rl/models/policy/interfaces.py (1)
  • init_collective (143-146)
nemo_rl/models/policy/lm_policy.py (1)
  • init_collective (236-248)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • init_collective (823-836)
nemo_rl/models/generation/interfaces.py (1)
  • init_collective (212-216)
nemo_rl/models/generation/vllm/vllm_generation.py (2)
tests/unit/algorithms/test_distillation.py (1)
  • world_size (479-480)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
nemo_rl/algorithms/distillation.py (9)
tests/unit/algorithms/test_distillation.py (3)
  • world_size (479-480)
  • init_collective (492-493)
  • init_collective (505-506)
nemo_rl/models/generation/vllm/vllm_backend.py (1)
  • init_collective (34-55)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • init_collective (370-407)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • init_collective (479-496)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • init_collective (504-515)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (1)
  • init_collective (462-472)
nemo_rl/models/policy/interfaces.py (1)
  • init_collective (143-146)
nemo_rl/models/policy/lm_policy.py (1)
  • init_collective (236-248)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • init_collective (823-836)
nemo_rl/models/policy/lm_policy.py (2)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
nemo_rl/distributed/worker_groups.py (1)
  • run_all_workers_single_data (728-772)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (8)
nemo_rl/models/generation/vllm/vllm_backend.py (1)
  • init_collective (34-55)
nemo_rl/models/generation/vllm/vllm_generation.py (1)
  • init_collective (370-407)
nemo_rl/models/generation/vllm/vllm_worker.py (1)
  • init_collective (479-496)
nemo_rl/models/policy/dtensor_policy_worker.py (1)
  • init_collective (504-515)
nemo_rl/models/policy/interfaces.py (1)
  • init_collective (143-146)
nemo_rl/models/policy/lm_policy.py (1)
  • init_collective (236-248)
nemo_rl/models/policy/megatron_policy_worker.py (1)
  • init_collective (823-836)
tests/unit/algorithms/test_distillation.py (3)
  • init_collective (492-493)
  • init_collective (505-506)
  • world_size (479-480)
nemo_rl/models/generation/vllm/vllm_worker_async.py (2)
tests/unit/algorithms/test_distillation.py (1)
  • world_size (479-480)
nemo_rl/distributed/virtual_cluster.py (1)
  • world_size (357-358)
🪛 Ruff (0.13.3)
nemo_rl/models/policy/dtensor_policy_worker.py

505-505: Unused method argument: train_world_size

(ARG002)

tests/unit/algorithms/test_distillation.py

492-492: Unused method argument: args

(ARG002)


492-492: Unused method argument: kwargs

(ARG002)


502-502: Unused method argument: args

(ARG002)


502-502: Unused method argument: kwargs

(ARG002)


505-505: Unused method argument: args

(ARG002)


505-505: Unused method argument: kwargs

(ARG002)

nemo_rl/models/policy/megatron_policy_worker.py

824-824: Unused method argument: train_world_size

(ARG002)

nemo_rl/models/policy/dtensor_policy_worker_v2.py

463-463: Unused method argument: train_world_size

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (9)
nemo_rl/models/generation/vllm/vllm_backend.py (1)

35-49: Rank offset makes sense

Adding train_world_size to the rank calculation cleanly shifts vLLM workers after the training ranks, which is exactly what we need now that train and inference groups share one collective. Looks good.

nemo_rl/models/generation/vllm/vllm_generation.py (1)

370-404: Propagation of train_world_size is solid

Making train_world_size keyword-only and threading it through common_kwargs keeps the public API unambiguous while ensuring every worker sees the same world metadata.

nemo_rl/models/policy/interfaces.py (1)

142-145: Interface stays consistent

Making train_world_size part of the abstract method keeps every policy implementation honest about the new collective contract.

tests/unit/models/generation/test_vllm_generation.py (1)

944-965: Test update matches the new contract

Deriving world_size from train_world_size + inference_world_size and passing the train slice explicitly keeps the test in lockstep with the revised collective semantics.

nemo_rl/models/policy/lm_policy.py (1)

236-246: Forwarding looks good

The policy now forwards train_world_size alongside the existing parameters, so the worker group gets everything it needs without positional ambiguity.

nemo_rl/models/generation/vllm/vllm_worker_async.py (1)

395-412: Async path is wired correctly

Including train_world_size in the async RPC keeps the worker extension aligned with the new offset logic, achieving parity with the sync worker.

nemo_rl/models/policy/dtensor_policy_worker.py (1)

1812-1813: LGTM! Broadcast simplification aligns with inclusive comm group design.

The removal of rank-specific gating and unconditional dtype conversion and broadcast is correct. All ranks now participate in the communication group, while src=0 ensures rank 0 remains the source for the broadcast operation.
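The behavioral change can be modeled with a toy in-process broadcast (a pure-Python stand-in for the NCCL broadcast; names are illustrative):

```python
# Toy model of the change: previously only rank 0 entered the broadcast
# path; now every rank in the inclusive group participates, with rank 0
# still acting as the source (src=0).
def broadcast_weights(rank_values: dict[int, float], src: int = 0) -> dict[int, float]:
    # All ranks participate and receive the source rank's value.
    src_value = rank_values[src]
    return {rank: src_value for rank in rank_values}

# Rank 0 holds the trained weight; ranks 1 and 2 receive it.
state = {0: 1.5, 1: 0.0, 2: 0.0}
assert broadcast_weights(state) == {0: 1.5, 1: 1.5, 2: 1.5}
```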

nemo_rl/models/policy/megatron_policy_worker.py (2)

1741-1743: LGTM! Broadcast simplification aligns with inclusive comm group design.

The comment "broadcast from train rank 0 to all other ranks (training and inference)" clearly explains the intent. The removal of rank-specific gating allows all ranks to participate in the communication group, while src=0 ensures rank 0 is the broadcast source.


823-836: Approve code changes: unused train_world_size parameter is intentional
train_world_size is deliberately unused in init_collective of megatron_policy_worker.py for API consistency; vLLM backends use it to compute rank offsets.

@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 9, 2025
@terrykong terrykong enabled auto-merge (squash) October 9, 2025 20:31
@terrykong terrykong merged commit 1faf2f1 into r0.4.0 Oct 10, 2025
78 of 88 checks passed
@terrykong terrykong deleted the cherry-pick-1264-r0.4.0 branch October 10, 2025 02:27
@coderabbitai coderabbitai bot mentioned this pull request Dec 2, 2025

Labels

  • cherry-pick
  • CI:L1 (Run doctests, unit tests, and functional tests)
  • Run CICD
