feat: log total_valid_tokens and tokens_per_sec during SFT training.#803
xxman-google wants to merge 1 commit into NVIDIA-NeMo:main
Signed-off-by: Xuehan <xxman@google.com>
```python
total_steps: int  # Track total number of steps across all epochs
val_loss: NotRequired[float]  # Optional field - may not be present during training
consumed_samples: int
total_valid_tokens: int  # Track total number of non-padding tokens during training
```
Is this supposed to track a different metric than `RL/nemo_rl/experience/rollouts.py` (line 605 in `7b3fad8`)? Can you use that instead? That one shows up as `mean_total_tokens_per_sample`.
Yes, I am tracking non-padded tokens, while that one tracks all tokens.
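A minimal sketch of that distinction (the `pad_token_id` convention and helper name are assumptions for illustration, not the PR's actual code):

```python
import numpy as np

def count_valid_tokens(input_ids: np.ndarray, pad_token_id: int) -> int:
    # "Valid" tokens exclude padding; a raw per-sample total would not.
    return int((input_ids != pad_token_id).sum())

# Two sequences right-padded to length 5 with pad_token_id=0:
batch = np.array([[5, 6, 7, 0, 0],
                  [8, 9, 0, 0, 0]])
print(count_valid_tokens(batch, pad_token_id=0))  # 5 valid tokens
print(batch.size)                                 # 10 total tokens
```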
```python
print(f"  • {k}: {v:.2f}s ({percent:.1f}%)")

timing_metrics["valid_tokens_per_sec"] = (
    metrics["global_valid_toks"] / total_time
)
```
Would it be better to add/switch to `valid_tokens_per_sec_per_gpu`? It is normalized for num_gpus. If we can scale linearly using data parallelism with more GPUs, then `valid_tokens_per_sec_per_gpu` would be a better metric for tracking the efficiency/speed of SFT training.
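A sketch of the suggested normalization (the function name and example numbers are illustrative, not from the PR):

```python
def valid_tokens_per_sec_per_gpu(global_valid_toks: int,
                                 total_time: float,
                                 num_gpus: int) -> float:
    # Dividing by num_gpus makes throughput comparable across runs of
    # different data-parallel sizes: under linear scaling, adding GPUs
    # should leave this metric roughly constant.
    return global_valid_toks / total_time / num_gpus

# e.g. 1.2M valid tokens in 10 s on 8 GPUs:
print(valid_tokens_per_sec_per_gpu(1_200_000, 10.0, 8))  # 15000.0
```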
```python
}
metrics.update(train_results["all_mb_metrics"])
for k, v in metrics.items():
    if k in {"lr", "wd", "global_valid_seqs", "global_valid_toks"}:
```
A naive question: why do we take the mean of `global_valid_seqs` and `global_valid_toks`? Are these metrics already normalized over num_workers?
@xxman-google I think I am mainly confused by the naming. It is called "global_xxx", yet we take a mean average right after:

```python
if k in {"lr", "wd", "global_valid_seqs", "global_valid_toks"}:
    metrics[k] = np.mean(v).item()
```
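To illustrate the naming question (all values are made up): if each microbatch logs the same already-reduced global count, the mean simply recovers it; but if the values were per-microbatch local counts, a mean would understate the true total and a sum would be needed:

```python
import numpy as np

# Case 1: every microbatch reports the same already-reduced global count.
already_global = [4096, 4096, 4096]
print(np.mean(already_global).item())  # 4096.0 -- mean recovers the count

# Case 2: each microbatch reports its own local count.
per_microbatch = [1000, 1500, 1596]
print(int(np.sum(per_microbatch)))     # 4096 -- the true total needs a sum
print(np.mean(per_microbatch).item())  # ~1365.3 -- a mean understates it
```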
@xxman-google thank you for the contribution. Just curious if you can address the last 3 comments, then we are good to go.
Closing. Continuing in #1249.
What does this PR do?
Add two additional metrics for logging during SFT. It is useful to know how many tokens (excluding padding) we have trained on in total, and to log training speed in tokens per second.
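A hedged sketch of how the two metrics relate (the function name and numbers are illustrative, not the PR's actual code):

```python
def update_metrics(total_valid_tokens: int,
                   step_valid_tokens: int,
                   step_time_s: float):
    # Running total of non-padding tokens, plus per-step throughput.
    total_valid_tokens += step_valid_tokens
    tokens_per_sec = step_valid_tokens / step_time_s
    return total_valid_tokens, tokens_per_sec

total, tps = update_metrics(1_000_000, 8192, 2.0)
print(total, tps)  # 1008192 4096.0
```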
Issues
List issues that this PR closes (syntax): N/A
Usage
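A minimal sketch of the checkpoint-resume behavior (the state-dict keys other than `total_valid_tokens` are assumptions based on the diff, and the step hook is hypothetical):

```python
# Save: the running count lives in the training state alongside
# existing fields such as total_steps / consumed_samples.
state = {"total_steps": 1200,
         "consumed_samples": 48_000,
         "total_valid_tokens": 9_800_000}

def on_train_step(state: dict, step_valid_tokens: int) -> dict:
    # Resume: keep accumulating from the restored value
    # instead of restarting at 0 after an interruption.
    state["total_valid_tokens"] += step_valid_tokens
    return state

state = on_train_step(state, 8192)
print(state["total_valid_tokens"])  # 9808192
```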
`total_valid_tokens` is saved into the checkpoint so it can be resumed when training is interrupted.

Before your PR is "Ready for review"
Pre checks:
Additional Information