
Conversation


@yuki-97 yuki-97 commented Oct 17, 2025

Previously we forgot to multiply the dataloader length by max_num_epochs when calculating train_iters.
This PR fixes it.
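The fix can be sketched as follows. This is a minimal illustration, not the actual NeMo-RL code: `grpo_config` and `dataloader` here are stand-ins for the corresponding objects in `nemo_rl/algorithms/grpo.py`.

```python
# Sketch of the corrected train_iters calculation (illustrative only).
def total_train_iters(grpo_config: dict, dataloader) -> int:
    # Buggy version: min(grpo_config["max_num_steps"], len(dataloader))
    # capped every run at a single pass over the dataloader.
    # Fixed version: multiply the dataloader length by max_num_epochs
    # before taking the minimum with max_num_steps.
    return min(
        grpo_config["max_num_steps"],
        grpo_config["max_num_epochs"] * len(dataloader),
    )

# Example: a 100-batch dataloader, 3 epochs, step cap of 250.
cfg = {"max_num_steps": 250, "max_num_epochs": 3}
print(total_train_iters(cfg, range(100)))  # -> 250 (capped by max_num_steps)
```

With the old formula, the same configuration would have returned 100, silently ignoring the two remaining epochs.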

Summary by CodeRabbit

  • Bug Fixes
    • Refined the calculation of total training iterations to provide more granular control over training duration through epoch-based configuration, allowing better flexibility in managing training sessions and step limits.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 requested a review from a team as a code owner October 17, 2025 06:41
@yuki-97 yuki-97 added the CI:L1 Run doctests, unit tests, and functional tests label Oct 17, 2025
@yuki-97 yuki-97 requested review from ashors1 and terrykong October 17, 2025 06:41

coderabbitai bot commented Oct 17, 2025

📝 Walkthrough

The calculation of total_train_iters in the setup method is updated to incorporate max_num_epochs. The new formula multiplies the dataloader length by max_num_epochs before taking the minimum with max_num_steps, providing epoch-based iteration control instead of a direct step cap.

Changes

Training iteration calculation (nemo_rl/algorithms/grpo.py):
Modified the total_train_iters calculation to use min(grpo_config["max_num_steps"], grpo_config["max_num_epochs"] * len(dataloader)) instead of min(grpo_config["max_num_steps"], len(dataloader)), enabling epoch-based control.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning This PR is a bug fix that modifies the train_iters calculation in the GRPO algorithm's setup function to correctly include the max_num_epochs multiplier. While the change itself is straightforward and logically correct, the CI workflow shows a critical failure: "CI quality check (No tests run: Label CI:L*): FAILURE", indicating that no tests were executed for this PR. The PR description contains no test results, convergence verification, or evidence demonstrating the absence of regressions. Since this change affects the number of training iterations the algorithm may execute (potentially increasing it via the max_num_epochs factor), it could impact training convergence and numerical outcomes, requiring explicit verification before merge. The PR must document that appropriate tests have been run and have passed. Specifically, the PR description should include: (1) confirmation that the CI tests passed (currently showing "CI quality check: FAILURE"), (2) results from the existing unit tests in tests/unit/algorithms/test_grpo.py and functional tests in tests/functional/grpo.sh, and (3) evidence that training with the fix does not cause regressions (e.g., comparable convergence curves or loss values). At minimum, the PR author should apply the appropriate CI labels to trigger test execution and document the test results in the PR description before merging.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix: fix mcore train_iters in grpo" is clearly related to the main change in the changeset. The PR specifically fixes the calculation of total_train_iters in the GRPO algorithm, and the title accurately identifies both the component being modified (train_iters) and the file/module (grpo). While the title doesn't elaborate on the specific nature of the fix (adding max_num_epochs multiplication) or explain what "mcore" refers to, it is sufficiently clear and specific that a teammate reviewing the commit history would understand the PR addresses a train_iters issue in grpo. The title is not vague, misleading, or off-topic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dee3fd9 and b0e62e7.

📒 Files selected for processing (1)
  • nemo_rl/algorithms/grpo.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
🧬 Code graph analysis (1)
nemo_rl/algorithms/grpo.py (1)
tests/check_metrics.py (1)
  • min (24-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Docs_Tests
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (1)
nemo_rl/algorithms/grpo.py (1)

423-426: LGTM! The fix correctly calculates total training iterations.

The updated formula now properly accounts for multiple epochs by multiplying the dataloader length by max_num_epochs. This aligns with the training loop behavior (line 650), which iterates through epochs up to max_num_epochs while respecting the max_num_steps limit. The min() ensures we don't exceed the configured maximum steps.
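The loop behavior described above can be sketched as a toy model (hypothetical names; the real training loop in grpo.py is more involved): iterate through epochs up to max_num_epochs, stopping once max_num_steps total steps have been taken.

```python
# Toy model of the training loop: epochs bounded by max_num_epochs,
# total steps bounded by max_num_steps. Illustrative only.
def run_training(max_num_epochs: int, max_num_steps: int, dataloader) -> int:
    step = 0
    for _epoch in range(max_num_epochs):
        for _batch in dataloader:
            if step >= max_num_steps:
                return step
            step += 1  # one training iteration per batch
    return step
```

The number of steps this loop performs is exactly min(max_num_steps, max_num_epochs * len(dataloader)), which is why the corrected total_train_iters formula matches the loop.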



@ashors1 ashors1 left a comment


Thanks for the fix Yuki!

@terrykong

Tests passed, but setting to docs since I needed to add the r0.4.0 label https://github.com/NVIDIA-NeMo/RL/actions/runs/18584732148?pr=1383

@terrykong terrykong added r0.4.0 CI:docs Run doctest and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 17, 2025
@terrykong terrykong enabled auto-merge (squash) October 17, 2025 16:34
@terrykong terrykong merged commit 905a224 into main Oct 17, 2025
70 of 72 checks passed
@terrykong terrykong deleted the yukih/fix-train-iters branch October 17, 2025 16:41
chtruong814 pushed a commit that referenced this pull request Oct 17, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
lbliii pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Yuki Huang <yukih@nvidia.com>
4 participants