fix: Add check for world size and parallelism enabled by parthchadha · Pull Request #1190 · NVIDIA-NeMo/RL

parthchadha · 2025-09-22T22:31:32Z

What does this PR do ?

Fixes #1182

Summary by CodeRabbit

Bug Fixes
- Added runtime validation to ensure cluster world size is compatible with configured parallelism (PP, CP, TP), preventing misconfigurations.
- Improved error messages with clear guidance on required world size and data-parallel factor when configuration is invalid.
Tests
- Introduced comprehensive unit tests covering validation across DTensor and Megatron backends.
- Verified correct handling of valid/invalid configurations, including divisibility and minimum data-parallel requirements.
- Ensured worker groups are only instantiated when validation passes.

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

coderabbitai · 2025-09-22T22:38:40Z

📝 Walkthrough

Walkthrough

Adds runtime validation in Policy.__init__ to check cluster world size against configured PP, CP, and TP. Computes model_parallel_size and DP, raising ValueError if world size is insufficient or not divisible. Validation runs before worker group creation and sharding layout. Adds unit tests covering DTensor and Megatron scenarios.
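
The check described in the walkthrough can be sketched as follows. This is a minimal illustration under stated assumptions, not the code merged in `lm_policy.py`: the helper name `validate_world_size` and the exact error wording are hypothetical, and the sketch assumes all parallel sizes are positive integers.

```python
def validate_world_size(world_size: int, pp_size: int, cp_size: int, tp_size: int) -> int:
    """Return the data-parallel (DP) size, or raise ValueError if the layout does not fit.

    Hypothetical helper: the PR performs this check inline in Policy.__init__.
    """
    model_parallel_size = pp_size * cp_size * tp_size

    # DP must be at least 1, so the cluster needs at least one full model replica.
    if world_size < model_parallel_size:
        raise ValueError(
            f"World size ({world_size}) is insufficient: "
            f"PP({pp_size}) * CP({cp_size}) * TP({tp_size}) = {model_parallel_size}."
        )

    # DP must be an integer, so the world size must divide evenly.
    if world_size % model_parallel_size != 0:
        raise ValueError(
            f"World size ({world_size}) is not divisible by "
            f"model parallel size ({model_parallel_size})."
        )

    return world_size // model_parallel_size
```

For example, `validate_world_size(8, 1, 1, 2)` returns a DP size of 4, while a world size of 6 with TP=4 is rejected because 6 is not a multiple of 4.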

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Policy validation logic `nemo_rl/models/policy/lm_policy.py` | Introduces world-size compatibility checks: compute `model_parallel_size = pp_size * cp_size * tp_size`, compare with `cluster.world_size()`, validate divisibility to ensure integer DP, and raise detailed `ValueError` on failure; executed before sharding layout and worker group setup. |
| Unit tests for validation `tests/unit/models/policy/test_policy_validation.py` | New tests mocking cluster/tokenizer/worker components; parameterized cases for DTensor (PP=1) and Megatron (PP>1); assert successful init on valid configs; verify specific error messages and that worker group is not instantiated on failure. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor U as Caller
  participant P as Policy.__init__
  participant C as Cluster
  participant WB as WorkerBuilder/RayWorkerGroup
  participant SL as Sharding Layout

  U->>P: create Policy(config)
  P->>C: world_size = cluster.world_size()
  P->>P: compute pp, cp, tp, model_parallel_size = pp*cp*tp
  alt world_size < model_parallel_size
    P-->>U: raise ValueError (insufficient world size)
    note right of P: No worker group constructed
  else world_size % model_parallel_size != 0
    P-->>U: raise ValueError (non-integer DP)
    note right of P: No worker group constructed
  else
    P->>P: DP = world_size / model_parallel_size
    P->>WB: select worker_builder_cls and env
    P->>SL: construct sharding layout
    P-->>U: initialized Policy
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

✅ Passed checks (6 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "fix: Add check for world size and parallelism enabled" succinctly and accurately summarizes the main change: adding runtime validation to ensure cluster.world_size is compatible with configured parallelism (TP/PP/CP) to prevent reshape/crash errors. It is concise, specific to the primary fix in lm_policy.py, and readable for a teammate scanning history. |
| Linked Issues Check | ✅ Passed | The PR implements the requested early validation from linked issue [#1182] by computing model_parallel_size = TP × PP × CP, comparing it to cluster.world_size, and raising descriptive ValueErrors when world_size is too small or not divisible, and the added unit tests exercise these failure and success cases. This directly addresses the coding objective to prevent the reshape crash described in #1182. |
| Out of Scope Changes Check | ✅ Passed | The changes are limited to runtime validation in nemo_rl/models/policy/lm_policy.py and a new unit test file tests/unit/models/policy/test_policy_validation.py, with no edits to other modules or public API signatures, so no out-of-scope or unrelated modifications were introduced. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71%, which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | The PR adds early runtime validation of world_size vs. PP/CP/TP in Policy.__init__ and introduces a focused unit test suite exercising valid/invalid configurations. This is a minor, defensive change that prevents a crash and does not alter numerics, convergence, or performance characteristics. While the PR description does not include explicit test results, the addition of unit tests covers the new logic and there is no indication of performance- or accuracy-impacting behavior. Therefore, it satisfies the check’s “minor change” criterion. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_rl/models/policy/lm_policy.py (1)

705-713: Make destructor resilient when init fails before setting worker_group.

If validation raises early, `__del__` can run on a partially initialized object and raise AttributeError.

Apply this diff:

     def __del__(self) -> None:
         """Shuts down the worker groups when the object is deleted or is garbage collected.
 
         This is an extra safety net in case the user forgets to call worker_group.shutdown() and the pointer to
         the object is lost due to leaving a function scope. It's always recommended that the
         user calls worker_group.shutdown().
         """
-        self.worker_group.shutdown()
+        try:
+            wg = getattr(self, "worker_group", None)
+            if wg is not None:
+                wg.shutdown()
+        except Exception:
+            # Best-effort cleanup; avoid raising from __del__
+            pass
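
The failure mode can be reproduced in isolation. Python still calls `__del__` on an object whose `__init__` raised, and in CPython an exception escaping `__del__` is reported on stderr as "Exception ignored in ..." rather than propagated. The sketch below uses hypothetical stand-in classes (`FakeWorkerGroup`, `GuardedPolicy`), not the real `Policy`, to show the guarded destructor tolerating a missing attribute.

```python
class FakeWorkerGroup:
    """Stand-in for a worker group (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.shut_down = False

    def shutdown(self) -> None:
        self.shut_down = True


class GuardedPolicy:
    """Simplified stand-in for Policy with validation ahead of worker creation."""

    def __init__(self, world_size: int, model_parallel_size: int) -> None:
        if world_size < model_parallel_size:
            # Raised before self.worker_group is ever assigned.
            raise ValueError("insufficient world size")
        self.worker_group = FakeWorkerGroup()

    def __del__(self) -> None:
        # Tolerate a partially initialized instance: the attribute may be missing.
        wg = getattr(self, "worker_group", None)
        if wg is not None:
            wg.shutdown()
```

With the unguarded `self.worker_group.shutdown()`, constructing `GuardedPolicy(1, 2)` would emit an extra AttributeError report on top of the intended ValueError.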

🧹 Nitpick comments (6)

nemo_rl/models/policy/lm_policy.py (1)

122-127: Slim error messages or move formatting into a helper to satisfy TRY003.

Ruff TRY003 flags long messages that are built at the raise site instead of inside the exception class. Either shorten the messages or centralize formatting (e.g., a small _format_world_size_error(...)) and keep the exception text concise.

Example (minimal change):

-            raise ValueError(
-                f"World size ({actual_world_size}) is insufficient for the parallelism configuration. "
-                f"Required minimum world size: PP({pp_size}) * CP({cp_size}) * TP({tp_size}) = {model_parallel_size}. "
-                f"This would result in DP = {actual_world_size}/{model_parallel_size} = {actual_world_size / model_parallel_size:.3f}, but DP must be ≥ 1. "
-                f"Please either increase the number of GPUs/nodes or reduce the parallelism parameters."
-            )
+            dp = actual_world_size / model_parallel_size
+            raise ValueError(
+                f"Insufficient world size ({actual_world_size}); need at least PP({pp_size})*CP({cp_size})*TP({tp_size})={model_parallel_size}. "
+                f"Computed DP={dp:.3f} < 1."
+            )

Also applies to: 131-136
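
Another way to satisfy TRY003 is to let a dedicated exception class own the message. A minimal sketch, assuming a hypothetical `InsufficientWorldSizeError` that subclasses `ValueError` so existing `except ValueError` handlers and tests keep working (neither name is from the PR):

```python
class InsufficientWorldSizeError(ValueError):
    """Hypothetical exception that formats its own message."""

    def __init__(self, world_size: int, pp_size: int, cp_size: int, tp_size: int) -> None:
        self.world_size = world_size
        self.required = pp_size * cp_size * tp_size
        super().__init__(
            f"Insufficient world size ({world_size}); need at least "
            f"PP({pp_size})*CP({cp_size})*TP({tp_size})={self.required}."
        )


def check_minimum_world_size(world_size: int, pp_size: int, cp_size: int, tp_size: int) -> None:
    """Raise if the cluster cannot hold one full model replica (illustrative helper)."""
    if world_size < pp_size * cp_size * tp_size:
        # The raise site stays short; formatting lives in the exception class.
        raise InsufficientWorldSizeError(world_size, pp_size, cp_size, tp_size)
```

The trade-off is one more class to maintain in exchange for structured fields (`world_size`, `required`) that callers can inspect without parsing the message.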

tests/unit/models/policy/test_policy_validation.py (5)

57-59: Remove unused parameter ‘pp’ or make intent explicit.

pp isn’t used in DTensor config. Rename to _pp to silence ARG001 and document PP=1 for DTensor.

-def create_dtensor_config(
-    model_name: str, tp: int, pp: int = 1, cp: int = 1
-) -> PolicyConfig:
+def create_dtensor_config(
+    model_name: str, tp: int, _pp: int = 1, cp: int = 1
+) -> PolicyConfig:

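Call sites that pass the argument positionally are unaffected; call sites that pass it by keyword (pp=1) must be updated to _pp=1 or drop the argument.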
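
One caveat with the rename: Python binds keyword arguments by parameter name, so a call that passes `pp=1` by keyword raises `TypeError` once the parameter is called `_pp`. A small sketch with a simplified stand-in that returns a plain dict (the dict keys are hypothetical, not the `PolicyConfig` schema):

```python
def create_dtensor_config(model_name: str, tp: int, _pp: int = 1, cp: int = 1) -> dict:
    """Simplified stand-in; DTensor ignores pipeline parallelism, so _pp is unused."""
    return {
        "model_name": model_name,
        "tensor_parallel_size": tp,   # hypothetical key name
        "context_parallel_size": cp,  # hypothetical key name
    }


# Positional call: still valid after the rename.
positional = create_dtensor_config("tiny-model", 2, 1, 1)

# Keyword call with the old name: rejected by Python.
try:
    create_dtensor_config("tiny-model", tp=2, pp=1)
except TypeError as err:
    keyword_error = str(err)
```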

210-221: Drop try/except on success path; let pytest surface unexpected exceptions.

Catching broad Exception (BLE001) hides useful tracebacks. Just construct Policy and assert the mock call.

-    if should_pass:
-        # Should succeed without raising an exception
-        try:
-            policy = Policy(cluster=cluster, config=config, tokenizer=tokenizer)
-            # Verify the calculated DP makes sense
-            expected_dp = world_size // (1 * cp * tp)  # PP=1 for DTensor
-            assert expected_dp >= 1, f"Expected DP should be >= 1, got {expected_dp}"
-            # Verify that worker group was created (validation passed)
-            mock_ray_worker_group.assert_called_once()
-        except Exception as e:
-            pytest.fail(f"Expected success for {description}, but got error: {e}")
+    if should_pass:
+        Policy(cluster=cluster, config=config, tokenizer=tokenizer)
+        expected_dp = world_size // (1 * cp * tp)  # PP=1 for DTensor
+        assert expected_dp >= 1
+        mock_ray_worker_group.assert_called_once()

302-313: Same: remove broad try/except in Megatron success path.

-    if should_pass:
-        # Should succeed without raising an exception
-        try:
-            policy = Policy(cluster=cluster, config=config, tokenizer=tokenizer)
-            # Verify the calculated DP makes sense
-            expected_dp = world_size // (pp * cp * tp)
-            assert expected_dp >= 1, f"Expected DP should be >= 1, got {expected_dp}"
-            # Verify that worker group was created (validation passed)
-            mock_ray_worker_group.assert_called_once()
-        except Exception as e:
-            pytest.fail(f"Expected success for {description}, but got error: {e}")
+    if should_pass:
+        Policy(cluster=cluster, config=config, tokenizer=tokenizer)
+        expected_dp = world_size // (pp * cp * tp)
+        assert expected_dp >= 1
+        mock_ray_worker_group.assert_called_once()
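
The shape of the simplified tests can be sketched without the real `Policy`. The example below uses `unittest.mock.MagicMock` and a hypothetical `build_policy` stand-in; on the success path nothing is caught, so an unexpected exception surfaces with its full traceback.

```python
from unittest.mock import MagicMock


def build_policy(world_size: int, model_parallel_size: int, worker_group_factory) -> int:
    """Hypothetical stand-in for Policy(...): validate first, then create workers."""
    if world_size % model_parallel_size != 0:
        raise ValueError("world size not divisible by model parallel size")
    worker_group_factory()
    return world_size // model_parallel_size


def test_valid_config_creates_worker_group() -> None:
    factory = MagicMock()
    # No try/except: any unexpected error fails the test with its own traceback.
    assert build_policy(8, 4, factory) == 2
    factory.assert_called_once()


def test_invalid_config_skips_worker_group() -> None:
    factory = MagicMock()
    # Under pytest this would normally be written with pytest.raises(ValueError).
    try:
        build_policy(6, 4, factory)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
    factory.assert_not_called()
```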

165-183: Add tests for invalid zero/negative parallel sizes to lock in the new guard.

Parametrize a few cases like (world_size=8, tp=0) and (tp=-1) for DTensor to assert ValueError.

Example additions:

# Invalid: non-positive TP/CP
(8, 0, 1, False, "invalid", "Invalid: TP=0"),
(8, -1, 1, False, "invalid", "Invalid: TP<0"),
(8, 4, 0, False, "invalid", "Invalid: CP=0"),

And assert on "must be a positive integer" in the error message.
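
A sketch of what such a guard and its invalid cases might look like. The helper `require_positive` and the message format are assumptions; only the phrase "must be a positive integer" comes from the suggestion above. The case list is written so it could be passed to `pytest.mark.parametrize`, but the sketch itself uses only the standard library.

```python
def require_positive(name: str, value: int) -> None:
    """Hypothetical guard: reject zero, negative, or non-integer parallel sizes."""
    if not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")


# (world_size, tp, cp) tuples that should all be rejected.
INVALID_CASES = [
    (8, 0, 1),   # TP=0
    (8, -1, 1),  # TP<0
    (8, 4, 0),   # CP=0
]


def collect_error_messages() -> list[str]:
    """Run the guard over every invalid case and return the messages raised."""
    messages = []
    for _world_size, tp, cp in INVALID_CASES:
        try:
            require_positive("tensor_parallel_size", tp)
            require_positive("context_parallel_size", cp)
        except ValueError as err:
            messages.append(str(err))
    return messages
```

Running the guard before computing `pp * cp * tp` also avoids a `ZeroDivisionError` in the later divisibility check.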

245-275: Mirror zero/negative checks for Megatron (PP/TP/CP).

Add cases like PP=0, TP=0 to ensure the same failure mode in Megatron.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 42aa41b and d98f855.

📒 Files selected for processing (2)

nemo_rl/models/policy/lm_policy.py (1 hunks)
tests/unit/models/policy/test_policy_validation.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
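
As an illustration of the two exception-handling guidelines in this list (narrow `except` clauses, minimal `try` body with the main logic in `else`), here is a hedged sketch. The function `get_world_size` and the duck-typed `cluster` argument are hypothetical and only mirror the `world_size()` method referenced elsewhere in this review.

```python
def get_world_size(cluster: object) -> int:
    """Return cluster.world_size() for any object that provides that method."""
    try:
        # Keep the try body to the single operation that may fail.
        world_size_fn = cluster.world_size
    except AttributeError:
        # Catch only the specific exception the lookup can raise.
        raise TypeError("cluster must provide a world_size() method") from None
    else:
        # Main logic runs outside the try body, so its errors are not masked.
        return int(world_size_fn())
```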

Files:

nemo_rl/models/policy/lm_policy.py
tests/unit/models/policy/test_policy_validation.py

nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

nemo_rl/models/policy/lm_policy.py

🧬 Code graph analysis (2)

nemo_rl/models/policy/lm_policy.py (3)

tests/unit/models/generation/test_vllm_generation.py (1)

cluster (221-232)

tests/unit/utils/test_native_checkpoint.py (1)

cluster (96-109)

nemo_rl/distributed/virtual_cluster.py (1)

world_size (357-358)

tests/unit/models/policy/test_policy_validation.py (4)

nemo_rl/models/policy/__init__.py (1)

PolicyConfig (141-163)

nemo_rl/models/policy/lm_policy.py (1)

Policy (56-722)

nemo_rl/distributed/virtual_cluster.py (3)

world_size (357-358)

get_placement_groups (347-355)

get_available_address_and_port (363-397)

tests/unit/conftest.py (1)

tiny_llama_model_path (456-480)

🪛 Ruff (0.13.1)

nemo_rl/models/policy/lm_policy.py

122-127: Avoid specifying long messages outside the exception class

(TRY003)

131-136: Avoid specifying long messages outside the exception class

(TRY003)

tests/unit/models/policy/test_policy_validation.py

58-58: Unused function argument: pp

(ARG001)

213-213: Local variable policy is assigned to but never used

Remove assignment to unused variable policy

(F841)

219-219: Do not catch blind exception: Exception

(BLE001)

305-305: Local variable policy is assigned to but never used

Remove assignment to unused variable policy

(F841)

311-311: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

Signed-off-by: Parth Chadha <pchadha@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

Add check for world size and parallelism enabled

d98f855

Signed-off-by: Parth Chadha <pchadha@nvidia.com>

parthchadha requested review from a team as code owners September 22, 2025 22:31

coderabbitai bot reviewed Sep 22, 2025

yuki-97 reviewed Sep 23, 2025

yuki-97 approved these changes Sep 23, 2025

terrykong approved these changes Sep 23, 2025

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Sep 23, 2025

terrykong enabled auto-merge (squash) September 23, 2025 06:14

terrykong temporarily deployed to nemo-ci September 23, 2025 06:14 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci September 23, 2025 07:53 — with GitHub Actions Inactive

terrykong merged commit 051c2f7 into main Sep 23, 2025
43 checks passed

terrykong deleted the pchadha-fix-1182 branch September 23, 2025 09:20

PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025

fix: Add check for world size and parallelism enabled (NVIDIA-NeMo#1190)

c6310c3

Signed-off-by: Parth Chadha <pchadha@nvidia.com>
