Skip to content

feat: add valid_tokens_per_sec metric and total_valid_tokens to save state#1249

Merged
terrykong merged 3 commits intomainfrom
tk/pr-803
Oct 3, 2025
Merged

feat: add valid_tokens_per_sec metric and total_valid_tokens to save state#1249
terrykong merged 3 commits intomainfrom
tk/pr-803

Conversation

@terrykong
Copy link
Collaborator

@terrykong terrykong commented Oct 2, 2025

PR to replace #803

Rebase and apply it to all algorithms

FYI @xxman-google . will make sure to credit the PR to you

@terrykong terrykong requested a review from a team as a code owner October 2, 2025 00:49
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 2, 2025

📝 Walkthrough

Walkthrough

Adds total_valid_tokens tracking to SFT training state and metrics in nemo_rl/algorithms/sft.py. Aggregates minibatch metrics, updates counters, logs derived metrics (total_valid_tokens in millions, valid_tokens_per_sec), and persists the new counter in checkpoints. Validation flow and initial validation remain unchanged.

Changes

Cohort / File(s) Summary of Changes
SFT training state and metrics
nemo_rl/algorithms/sft.py
- Add SFTSaveState field: total_valid_tokens: int (init 0).
- Track and accumulate global_valid_toks into total_valid_tokens; log derived total_valid_tokens (M).
- Aggregate minibatch metrics (mean: lr, wd, global_valid_seqs, global_valid_toks; sum: others).
- Compute and log valid_tokens_per_sec from timing and global_valid_toks.
- Persist total_valid_tokens in checkpoints; load from state when resuming.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Trainer
  participant SFT as SFT Loop
  participant Aggregator as Metrics Aggregator
  participant Logger
  participant Checkpointer

  Trainer->>SFT: run step()
  SFT->>Aggregator: provide train_results (loss, grad_norm, lr, wd, global_valid_toks, ...)
  Aggregator-->>SFT: aggregated metrics (means/sums)
  Note over SFT: total_valid_tokens += global_valid_toks<br/>derive total_valid_tokens (M)
  SFT->>Logger: log metrics + valid_tokens_per_sec
  alt checkpoint step
    SFT->>Checkpointer: save sft_save_state incl. total_valid_tokens
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)
Check name Status Explanation
Test Results For Major Changes ✅ Passed The PR description lacks any testing information, yet the changes are limited to adding logging metrics and persisting a derived counter without altering training logic or numerical behavior, so they qualify as minor per the custom check criteria.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly and concisely describes the two main additions in the changeset—namely the new valid_tokens_per_sec metric and the total_valid_tokens field in the save state—so it accurately reflects the primary purpose of the PR.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/pr-803

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d82ca75 and 320c633.

📒 Files selected for processing (1)
  • nemo_rl/algorithms/sft.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/sft.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/sft.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR
🔇 Additional comments (3)
nemo_rl/algorithms/sft.py (3)

53-53: LGTM! Clear documentation of the new tracking field.

The field addition is well-documented and appropriately typed as a required field in the TypedDict.


62-62: LGTM! Correct default initialization.

Initializing the token counter to 0 is appropriate for a new training run.


513-513: LGTM! Correct checkpoint persistence.

The total_valid_tokens counter is correctly persisted in the checkpoint, following the same pattern as other state variables.

@terrykong terrykong requested a review from a team as a code owner October 2, 2025 01:19
Co-authored-by: Xuehan <xxman@google.com>
Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong changed the title add two additional metrics for logging. feat: add valid_tokens_per_sec metric and total_valid_tokens to the save state Oct 2, 2025
@terrykong terrykong changed the title feat: add valid_tokens_per_sec metric and total_valid_tokens to the save state feat: add valid_tokens_per_sec metric and total_valid_tokens to save state Oct 2, 2025
@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Oct 2, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 2, 2025
@terrykong terrykong enabled auto-merge (squash) October 2, 2025 04:48
@ZhiyuLi-Nvidia
Copy link
Contributor

Thank you @xxman-google & @terrykong.
Just 2 nit, LGTM.

ZhiyuLi-Nvidia
ZhiyuLi-Nvidia previously approved these changes Oct 2, 2025
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from a team as a code owner October 2, 2025 05:31
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests r0.4.0 and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 2, 2025
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 3, 2025
Copy link
Contributor

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @terrykong. LGTM!

@terrykong terrykong merged commit d0e203c into main Oct 3, 2025
68 of 72 checks passed
@terrykong terrykong deleted the tk/pr-803 branch October 3, 2025 21:10
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…state (NVIDIA-NeMo#1249)

Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Xuehan <xxman@google.com>
aroshanghias-nvd pushed a commit to aroshanghias-nvd/nemo-rl that referenced this pull request Feb 6, 2026
…state (NVIDIA-NeMo#1249)

Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Xuehan <xxman@google.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…state (NVIDIA-NeMo#1249)

Signed-off-by: Xuehan <xxman@google.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Xuehan <xxman@google.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants