fix: permit conflicting metrics in nightly tests + config code_snapshot dir #954

Merged: terrykong merged 1 commit into main from tk/metric-conflicts-in-tests on Aug 21, 2025

Conversation

Collaborator

@terrykong terrykong commented Aug 20, 2025

The metric calculation can error when multiple sets of TB logs contain data for the same step. This change permits such conflicts, since they usually arise when a stage is pre-empted: the following stage has to go back to the last checkpoint, so the same step is logged in both the pre-empted TB logs and the new ones.
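
As a rough illustration of the behavior described above (not the code in this PR), here is a minimal sketch of merging per-step metrics from several logs. The function name `merge_step_metrics`, the `allow_conflicts` flag, the choice to let the later log win, and the sample values are all assumptions made for this example.

```python
from typing import Dict, List


def merge_step_metrics(
    runs: List[Dict[int, float]], allow_conflicts: bool = False
) -> Dict[int, float]:
    """Merge step -> value mappings from several logs, given in chronological order.

    Hypothetical helper: when a step appears in more than one log, either raise
    (strict mode) or keep the value from the later log (assumed policy).
    """
    merged: Dict[int, float] = {}
    for run in runs:
        for step, value in run.items():
            if step in merged and not allow_conflicts:
                raise ValueError(f"step {step} appears in more than one log")
            # The later log overwrites the earlier one when conflicts are allowed.
            merged[step] = value
    return merged


# A pre-empted stage logged step 3, then the next stage resumed from the
# last checkpoint and logged step 3 again.
preempted = {1: 0.9, 2: 0.8, 3: 0.7}
resumed = {3: 0.71, 4: 0.6}

merged = merge_step_metrics([preempted, resumed], allow_conflicts=True)
print(merged)
```

With `allow_conflicts=False`, the same call raises `ValueError` on step 3, which mirrors the failure mode the description mentions.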

This change also makes the code_snapshot dir configurable, which is helpful for versioning runs.

…hot dir configurability

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested a review from chtruong814 August 20, 2025 18:28
@terrykong terrykong enabled auto-merge August 20, 2025 18:29
@terrykong terrykong changed the title from "fix: permit conflicting metrics in nightly/release tests + code_snapshot dir configurability" to "fix: permit conflicting metrics in nightly tests + code_snapshot dir configurability" Aug 20, 2025
@terrykong terrykong changed the title from "fix: permit conflicting metrics in nightly tests + code_snapshot dir configurability" to "fix: permit conflicting metrics in nightly tests + config code_snapshot dir" Aug 20, 2025
@terrykong terrykong added this pull request to the merge queue Aug 20, 2025
Merged via the queue into main with commit eb48be9 Aug 21, 2025
23 of 27 checks passed
@terrykong terrykong deleted the tk/metric-conflicts-in-tests branch August 21, 2025 13:53
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
…ot dir (#954)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 28, 2025
…ot dir (NVIDIA-NeMo#954)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025
…ot dir (NVIDIA-NeMo#954)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…ot dir (NVIDIA-NeMo#954)

Signed-off-by: Terry Kong <terryk@nvidia.com>
