feat: add valid_tokens_per_sec metric and total_valid_tokens to save state by terrykong · Pull Request #1249 · NVIDIA-NeMo/RL

terrykong · 2025-10-02T00:49:05Z

PR to replace #803

Rebase and apply it to all algorithms

FYI @xxman-google . will make sure to credit the PR to you

coderabbitai · 2025-10-02T00:52:44Z

📝 Walkthrough

Walkthrough

Adds total_valid_tokens tracking to SFT training state and metrics in nemo_rl/algorithms/sft.py. Aggregates minibatch metrics, updates counters, logs derived metrics (total_valid_tokens in millions, valid_tokens_per_sec), and persists the new counter in checkpoints. Validation flow and initial validation remain unchanged.

Changes

Cohort / File(s)	Summary of Changes
SFT training state and metrics `nemo_rl/algorithms/sft.py`	- Add SFTSaveState field: total_valid_tokens: int (init 0). - Track and accumulate global_valid_toks into total_valid_tokens; log derived total_valid_tokens (M). - Aggregate minibatch metrics (mean: lr, wd, global_valid_seqs, global_valid_toks; sum: others). - Compute and log valid_tokens_per_sec from timing and global_valid_toks. - Persist total_valid_tokens in checkpoints; load from state when resuming.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Trainer
  participant SFT as SFT Loop
  participant Aggregator as Metrics Aggregator
  participant Logger
  participant Checkpointer

  Trainer->>SFT: run step()
  SFT->>Aggregator: provide train_results (loss, grad_norm, lr, wd, global_valid_toks, ...)
  Aggregator-->>SFT: aggregated metrics (means/sums)
  Note over SFT: total_valid_tokens += global_valid_toks<br/>derive total_valid_tokens (M)
  SFT->>Logger: log metrics + valid_tokens_per_sec
  alt checkpoint step
    SFT->>Checkpointer: save sft_save_state incl. total_valid_tokens
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Test Results For Major Changes	✅ Passed	The PR description lacks any testing information, yet the changes are limited to adding logging metrics and persisting a derived counter without altering training logic or numerical behavior, so they qualify as minor per the custom check criteria.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title clearly and concisely describes the two main additions in the changeset—namely the new valid_tokens_per_sec metric and the total_valid_tokens field in the save state—so it accurately reflects the primary purpose of the PR.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tk/pr-803

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d82ca75 and 320c633.

📒 Files selected for processing (1)

nemo_rl/algorithms/sft.py (6 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

nemo_rl/algorithms/sft.py

nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

nemo_rl/algorithms/sft.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Lint check
GitHub Check: Post submodule check comment / Comment on PR
GitHub Check: Post automodel integration comment / Comment on PR

🔇 Additional comments (3)

nemo_rl/algorithms/sft.py (3)

53-53: LGTM! Clear documentation of the new tracking field.

The field addition is well-documented and appropriately typed as a required field in the TypedDict.

62-62: LGTM! Correct default initialization.

Initializing the token counter to 0 is appropriate for a new training run.

513-513: LGTM! Correct checkpoint persistence.

The total_valid_tokens counter is correctly persisted in the checkpoint, following the same pattern as other state variables.

nemo_rl/algorithms/sft.py

Co-authored-by: Xuehan <xxman@google.com> Signed-off-by: Xuehan <xxman@google.com> Signed-off-by: Terry Kong <terryk@nvidia.com>

Signed-off-by: Terry Kong <terryk@nvidia.com>

nemo_rl/algorithms/dpo.py

nemo_rl/algorithms/grpo.py

ZhiyuLi-Nvidia · 2025-10-02T04:55:14Z

Thank you @xxman-google & @terrykong.
Just 2 nit, LGTM.

Signed-off-by: Terry Kong <terryk@nvidia.com>

ZhiyuLi-Nvidia

Thank you @terrykong. LGTM!

…state (NVIDIA-NeMo#1249) Signed-off-by: Xuehan <xxman@google.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Xuehan <xxman@google.com>

…state (NVIDIA-NeMo#1249) Signed-off-by: Xuehan <xxman@google.com> Signed-off-by: Terry Kong <terryk@nvidia.com> Co-authored-by: Xuehan <xxman@google.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>

terrykong requested a review from a team as a code owner October 2, 2025 00:49

coderabbitai bot reviewed Oct 2, 2025

View reviewed changes

nemo_rl/algorithms/sft.py Outdated Show resolved Hide resolved

nemo_rl/algorithms/sft.py Outdated Show resolved Hide resolved

nemo_rl/algorithms/sft.py Outdated Show resolved Hide resolved

terrykong requested a review from a team as a code owner October 2, 2025 01:19

terrykong requested a review from ZhiyuLi-Nvidia October 2, 2025 04:35

feat: add valid token metrics

33b39d6

Co-authored-by: Xuehan <xxman@google.com> Signed-off-by: Xuehan <xxman@google.com> Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong force-pushed the tk/pr-803 branch from f3ec359 to 33b39d6 Compare October 2, 2025 04:36

terrykong changed the title ~~add two additional metrics for logging.~~ feat: add valid_tokens_per_sec metric and total_valid_tokens to the save state Oct 2, 2025

terrykong changed the title ~~feat: add valid_tokens_per_sec metric and total_valid_tokens to the save state~~ feat: add valid_tokens_per_sec metric and total_valid_tokens to save state Oct 2, 2025

terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 2, 2025

terrykong temporarily deployed to nemo-ci October 2, 2025 04:38 — with GitHub Actions Inactive

coderabbit

c9a3171

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 2, 2025

terrykong enabled auto-merge (squash) October 2, 2025 04:48

terrykong temporarily deployed to nemo-ci October 2, 2025 04:49 — with GitHub Actions Inactive

ZhiyuLi-Nvidia reviewed Oct 2, 2025

View reviewed changes

nemo_rl/algorithms/dpo.py Outdated Show resolved Hide resolved

terrykong temporarily deployed to nemo-ci October 2, 2025 04:49 — with GitHub Actions Inactive

ZhiyuLi-Nvidia reviewed Oct 2, 2025

View reviewed changes

nemo_rl/algorithms/grpo.py Show resolved Hide resolved

ZhiyuLi-Nvidia previously approved these changes Oct 2, 2025

View reviewed changes

per gpu

3b51ca0

Signed-off-by: Terry Kong <terryk@nvidia.com>

terrykong dismissed ZhiyuLi-Nvidia’s stale review via 3b51ca0 October 2, 2025 05:31

terrykong requested a review from a team as a code owner October 2, 2025 05:31

terrykong added CI:L1 Run doctests, unit tests, and functional tests r0.4.0 and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 2, 2025

terrykong temporarily deployed to nemo-ci October 2, 2025 05:32 — with GitHub Actions Inactive

terrykong mentioned this pull request Oct 2, 2025

feat: log total_valid_tokens and tokens_per_sec during SFT training. #803

Closed

4 tasks

terrykong temporarily deployed to nemo-ci October 2, 2025 05:33 — with GitHub Actions Inactive

terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 3, 2025

terrykong temporarily deployed to nemo-ci October 3, 2025 04:59 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci October 3, 2025 05:00 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci October 3, 2025 07:06 — with GitHub Actions Inactive

ZhiyuLi-Nvidia approved these changes Oct 3, 2025

View reviewed changes

terrykong merged commit d0e203c into main Oct 3, 2025
68 of 72 checks passed

terrykong deleted the tk/pr-803 branch October 3, 2025 21:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add valid_tokens_per_sec metric and total_valid_tokens to save state#1249

feat: add valid_tokens_per_sec metric and total_valid_tokens to save state#1249
terrykong merged 3 commits intomainfrom
tk/pr-803

terrykong commented Oct 2, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Oct 2, 2025 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ZhiyuLi-Nvidia commented Oct 2, 2025

Uh oh!

ZhiyuLi-Nvidia left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

terrykong commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Oct 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ZhiyuLi-Nvidia commented Oct 2, 2025

Uh oh!

ZhiyuLi-Nvidia left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

terrykong commented Oct 2, 2025 •

edited

Loading

coderabbitai bot commented Oct 2, 2025 •

edited

Loading