
cp: feat: Overlap param iteration and broadcast in non-colocated refit (1379) into r0.4.0 (#1418)

Merged
terrykong merged 1 commit into r0.4.0 from cherry-pick-1379-r0.4.0 on Oct 25, 2025

Conversation


@chtruong814 (Contributor) commented Oct 23, 2025

beep boop [🤖]: Hi @youngeunkwon0405 👋,

we've cherry-picked #1379 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

  • New Features

    • Added NRL_REFIT_NUM_BUFFERS environment variable for configurable buffer management (defaults to 2)
    • Added get_num_buffers() function to query buffer configuration
  • Improvements

    • Increased the target packed tensor device memory ratio from 0.01 to 0.02 for better refit throughput
    • Optimized data processing pipeline with independent per-buffer handling

…1379)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

coderabbitai bot commented Oct 23, 2025

📝 Walkthrough

Walkthrough

The pull request introduces multi-buffer packing and broadcasting logic to nemo_rl/utils/packed_tensor.py, replacing the previous single-buffer approach. It adds environment variable control via NRL_REFIT_NUM_BUFFERS (default 2), implements per-buffer CUDA streams and circular buffer management, and increases the target packed tensor memory ratio from 0.01 to 0.02 of total CUDA device memory.

Changes

Cohort / File(s) Summary
Multi-buffer packing refactor
nemo_rl/utils/packed_tensor.py
Introduces get_num_buffers() function and multi-buffer packing logic with circular buffer advancement, per-buffer CUDA streams, separate packing tensors, and independent per-buffer state. Updates producer/consumer flows to maintain per-buffer metadata, sizes, and offsets. Increases target tensor size memory ratio from 0.01 to 0.02. Preserves existing post-processing contract within per-buffer streams.
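As a rough illustration of the configuration hook described above, `get_num_buffers()` presumably reads `NRL_REFIT_NUM_BUFFERS` once and falls back to 2. Only the function name, environment variable name, and default come from the summary; the caching mechanism shown here is an assumption:

```python
import functools
import os


@functools.lru_cache(maxsize=1)  # caching strategy assumed, not confirmed from source
def get_num_buffers() -> int:
    """Number of refit packing buffers, set via NRL_REFIT_NUM_BUFFERS (default 2)."""
    return int(os.environ.get("NRL_REFIT_NUM_BUFFERS", "2"))
```

With the variable unset, the function returns 2, enabling double-buffering by default.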

Sequence Diagram

sequenceDiagram
    autonumber
    actor Prod as Producer
    participant CB as Circular Buffer<br/>(NRL_REFIT_NUM_BUFFERS)
    participant Stream as Per-Buffer<br/>CUDA Stream
    participant Pack as Packed Tensor<br/>(per-buffer)
    actor Cons as Consumer

    rect rgb(220, 240, 255)
    Note over Prod,Cons: Buffer Cycle (n=0 to num_buffers-1)
    Prod->>CB: Advance to buffer n
    CB->>Stream: Synchronize stream[n]
    Stream->>Pack: Process data in buffer[n]
    Pack->>Pack: Build packed tensor[n]<br/>(per-buffer metadata)
    Pack->>Cons: Broadcast packed tensor[n]
    Cons->>Pack: Unpack using<br/>per-buffer metadata
    end

    Note over Prod,Cons: Repeat until StopIteration per-buffer
    Note over Prod,Cons: Last partial broadcasts preserved
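The buffer cycle in the diagram can be illustrated with a CPU-only sketch. All names here are hypothetical; the real producer additionally synchronizes a per-buffer CUDA stream before reusing a buffer and launches the broadcast of each packed tensor on that stream:

```python
from typing import Iterator, List, Tuple


def pack_in_chunks(
    iterator: Iterator[int], num_buffers: int, target_size: int
) -> List[Tuple[int, List[int]]]:
    """Illustrative circular-buffer packing (no CUDA).

    Fills buffer n until it reaches target_size, records it in broadcast
    order, then advances to buffer (n + 1) % num_buffers. A trailing partial
    buffer is preserved, mirroring the StopIteration handling described above.
    """
    packed: List[Tuple[int, List[int]]] = []  # completed (buffer_idx, chunk) pairs
    chunks: List[List[int]] = [[] for _ in range(num_buffers)]
    idx = 0
    try:
        while True:
            chunks[idx].append(next(iterator))
            if len(chunks[idx]) >= target_size:
                packed.append((idx, chunks[idx]))  # "broadcast" buffer idx
                chunks[idx] = []  # real code first waits on stream[idx] before reuse
                idx = (idx + 1) % num_buffers
    except StopIteration:
        if chunks[idx]:
            packed.append((idx, chunks[idx]))  # last partial broadcast preserved
    return packed
```

For example, packing five items with two buffers and a target size of 2 alternates buffers 0, 1, 0, with the final partial chunk kept.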

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes involve significant logic modifications within a single file, including circular buffer management, per-buffer stream handling, and updates to producer/consumer flows. The logic density is moderate, requiring careful review of state management and synchronization patterns, but the changes remain cohesive within a single component and the patterns are consistent throughout.

Suggested labels

CI:L1

Suggested reviewers

  • youngeunkwon0405
  • terrykong

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Test Results For Major Changes ⚠️ Warning: This PR is a cherry-pick of PR #1379 into the r0.4.0 release branch and introduces major changes, including multi-buffer packing/broadcasting logic, increased memory allocation for packed tensors, new per-buffer CUDA streams, and a new public function. These changes directly affect the refit performance optimization introduced in the original PR #1379, which is part of significant performance improvements documented for large MoE models. However, the PR description contains only a standard cherry-pick notification, with no test results, convergence validation, or performance metrics, so there is no evidence that these major architectural changes avoid regressions when backported to r0.4.0. Resolution: update the PR description with validation appropriate for a cherry-pick of major changes into a release branch: (1) CI/test results confirming the multi-buffer packing implementation passes all unit tests on the r0.4.0 branch, (2) convergence or training-stability results showing no numeric changes or training degradation, (3) performance metrics showing the refit optimization remains effective on r0.4.0, and (4) whether any branch-specific adjustments were needed during the cherry-pick. This documentation is essential for release branch quality assurance.
✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The pull request title "cp: feat: Overlap param iteration and broadcast in non-colocated refit (1379) into r0.4.0" describes a cherry-pick of PR #1379 into the r0.4.0 branch. The substantive part of the title, "Overlap param iteration and broadcast in non-colocated refit", clearly relates to the primary changes in the PR, which introduce multi-buffer packing/broadcasting logic designed to overlap parameter iteration with broadcast operations. Although the cherry-pick syntax (the cp: prefix and "into r0.4.0") adds some noise, the core message is specific enough that a teammate scanning history would understand the primary architectural change.


coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ae2afa and 63b3673.

📒 Files selected for processing (1)
  • nemo_rl/utils/packed_tensor.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/utils/packed_tensor.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/utils/packed_tensor.py
🧬 Code graph analysis (1)
nemo_rl/utils/packed_tensor.py (1)
tests/unit/utils/test_packed_tensor.py (3)
  • broadcast (33-37)
  • broadcast (47-51)
  • post_unpack_func (111-114)
🔇 Additional comments (4)
nemo_rl/utils/packed_tensor.py (4)

34-36: LGTM!

The new function correctly implements configurable buffer count with appropriate caching and a sensible default for double-buffering.


64-95: LGTM: Correct pipelining pattern for overlapping iteration and broadcast.

The circular buffer pattern with per-buffer CUDA streams correctly implements pipelined execution. Stream synchronization before buffer reuse ensures previous work completes, and the StopIteration handling preserves partial buffers. This enables overlapping parameter iteration/packing with broadcast operations.


142-203: LGTM: Consumer correctly mirrors producer pipelining pattern.

The consumer implementation correctly mirrors the producer's circular buffer and stream synchronization pattern. Per-buffer metadata tracking, unpacking, and StopIteration handling are all correct. The different iterator contract (metadata vs. tensors) is expected by design since producer and consumer operate on synchronized parameter iterations.


25-25: Based on my verification of the memory configuration changes, the concern about memory impact can be safely resolved.

No additional verification needed—memory safeguards are in place.

The doubled memory ratio (0.01 → 0.02) is protected by a 5GB hard cap, making the configuration safe across typical GPU sizes. With 2 buffers (default), maximum expected allocation is ~1–2GB on standard GPUs (40GB–80GB), well below the cap. For smaller GPUs (16GB), the per-buffer allocation remains under 500MB, which is acceptable. The codebase already has memory monitoring utilities, and no OOM issues were identified related to this configuration. This change is reasonable and intentional for performance optimization while maintaining safety constraints.
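The sizing rule discussed above (2% of total device memory with a hard cap of roughly 5 GB) works out as follows. The function name is hypothetical and the constants are taken from this review comment; the actual code may apply them differently:

```python
def target_packed_tensor_size(
    total_device_mem_bytes: int,
    ratio: float = 0.02,            # increased from 0.01 in this PR
    cap_bytes: int = 5 * 1024**3,   # ~5 GB hard cap mentioned in the review
) -> int:
    """Per-buffer packed tensor budget: a ratio of total device memory, hard-capped."""
    return min(int(total_device_mem_bytes * ratio), cap_bytes)


# 80 GB GPU: 2% of 80 GiB ≈ 1.6 GiB per buffer, well under the cap.
# 16 GB GPU: 2% of 16 GiB ≈ 0.32 GiB per buffer.
# 400 GB device: 2% would be 8 GiB, so the 5 GiB cap kicks in.
```

This matches the reviewer's arithmetic: with the default of 2 buffers, total packed-tensor allocation stays in the low single-digit GiB range on typical GPUs.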

Comment on lines +78 to +80
tensor = post_iter_func(next(iterator)).view(torch.uint8).view(-1)
packing_tensor_list[buffer_idx].append(tensor)
packing_tensor_sizes[buffer_idx] += tensor.view(torch.uint8).numel()

⚠️ Potential issue | 🟡 Minor

Remove redundant tensor view on line 80.

Line 78 already converts the tensor to torch.uint8 and stores it in tensor, so the .view(torch.uint8) call on line 80 is redundant. This wastes compute on every tensor packed.

Apply this diff:

                    tensor = post_iter_func(next(iterator)).view(torch.uint8).view(-1)
                    packing_tensor_list[buffer_idx].append(tensor)
-                    packing_tensor_sizes[buffer_idx] += tensor.view(torch.uint8).numel()
+                    packing_tensor_sizes[buffer_idx] += tensor.numel()
                    if packing_tensor_sizes[buffer_idx] > target_packed_tensor_size:
🤖 Prompt for AI Agents
In nemo_rl/utils/packed_tensor.py around lines 78 to 80, remove the redundant
tensor.view(torch.uint8) on line 80 — tensor is already converted to torch.uint8
and flattened on line 78, so change packing_tensor_sizes[buffer_idx] +=
tensor.view(torch.uint8).numel() to packing_tensor_sizes[buffer_idx] +=
tensor.numel() (or tensor.view(-1).numel() if you want to be explicit about
flattening).

@terrykong terrykong enabled auto-merge (squash) October 24, 2025 22:59
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 24, 2025
@terrykong terrykong merged commit 09b8a92 into r0.4.0 Oct 25, 2025
79 of 88 checks passed
@terrykong terrykong deleted the cherry-pick-1379-r0.4.0 branch October 25, 2025 20:02
terrykong pushed a commit that referenced this pull request Nov 19, 2025
…it (1379)` into `r0.4.0` (#1418)

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Youngeun Kwon <youngeunk@nvidia.com>

Labels

cherry-pick CI:L1 Run doctests, unit tests, and functional tests Run CICD


3 participants