ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values by ko3n1g · Pull Request #4601 · NVIDIA/Megatron-LM

ko3n1g · 2026-05-04T09:49:01Z

After the change to https://github.com/NVIDIA/Megatron-LM/commits/main/tests/functional_tests/shell_test_utils/_run_training.sh we forgot to update the nightly golden values of the test in question. Perhaps the test was still in the queue at the time we updated the values. Looking at the dashboard, it's the only change that happened in that span that clearly explains a shift in golden values.

Claude summary

Summary

Heals the deterministic golden-values regression on gpt3_7b_tp4_pp1_memory_speed (dgx_gb200, dev environment) introduced by #3779 (rampup batch size scheduler replacement).

The APPROXIMATE test already passes — only the EXACT (deterministic) comparison failed.
Deltas are last-decimal noise on lm loss and num-zeros; no metric drift on iteration-time, mem-allocated-bytes, or mem-max-allocated-bytes.
Other env/platform combos (a100, h100, lts) are unaffected.
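To illustrate why the EXACT comparison can fail while the APPROXIMATE one still passes, here is a minimal sketch. It is not the repository's test harness: the function names and the 5% relative tolerance are assumptions made for illustration, and the two values are the ones shown in the example diff of this PR.

```python
import math

def compare_exact(actual, golden):
    """Return the keys whose values differ at all (deterministic check)."""
    return [key for key in golden if actual[key] != golden[key]]

def compare_approximate(actual, golden, rel_tol=0.05):
    """Return the keys whose values differ by more than rel_tol.

    rel_tol=0.05 is a hypothetical tolerance, not the value used in CI.
    """
    return [
        key for key in golden
        if not math.isclose(actual[key], golden[key], rel_tol=rel_tol)
    ]

# Old golden values and newly observed values, taken from the example diff.
golden = {"3": 12.60284, "25": 10.99869}
actual = {"3": 12.60283, "25": 10.99872}

exact_failures = compare_exact(actual, golden)
approx_failures = compare_approximate(actual, golden)

print("exact failures:", exact_failures)
print("approximate failures:", approx_failures)
```

Under these assumptions both keys fail the exact check and neither fails the approximate one, which matches the behavior described above.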

Verification

Failing job (pre-fix): https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/jobs/302065814
Healing pipeline (all green, includes nightly run on gb200): https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/pipelines/50196728
Healed test job: https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/jobs/310457478

Example diff

-            "3": 12.60284,
+            "3": 12.60283,
-            "25": 10.99869
+            "25": 10.99872
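As a quick arithmetic check of the "last-decimal noise" claim, the following sketch computes the absolute and relative deltas for the two pairs in the diff above. The variable names are illustrative only; which metric the two keys belong to is not stated in the diff.

```python
# (old golden value, new golden value) for each key in the example diff.
pairs = {
    "3": (12.60284, 12.60283),
    "25": (10.99869, 10.99872),
}

deltas = {}
for key, (old, new) in pairs.items():
    abs_delta = abs(new - old)
    rel_delta = abs_delta / abs(old)
    deltas[key] = (abs_delta, rel_delta)
    print(f"key {key}: abs delta = {abs_delta:.1e}, rel delta = {rel_delta:.1e}")
```

Both absolute deltas are on the order of 1e-5, that is, a change in the fifth decimal place of values of magnitude around 11 to 13.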

Heals deterministic lm-loss / num-zeros mismatch on dev_dgx_gb200 after the rampup batch size scheduler was replaced (NVIDIA#3779). Approximate test already passes; deltas are within last-decimal noise. Verified in https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/pipelines/50196728

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g · 2026-05-04T09:49:06Z

/ok to test d23d965

github-actions · 2026-05-04T09:49:10Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

copy-pr-bot · 2026-05-04T09:49:12Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

svcnvidia-nemo-ci marked this pull request as draft May 4, 2026 09:49

copy-pr-bot Bot temporarily deployed to test May 4, 2026 09:50 Inactive

ko3n1g requested a review from chtruong814 May 4, 2026 09:51

ko3n1g marked this pull request as ready for review May 4, 2026 09:51

svcnvidia-nemo-ci requested a review from a team May 4, 2026 09:51

svcnvidia-nemo-ci added the complexity: low label May 4, 2026

ko3n1g merged commit 0efa47a into NVIDIA:main May 4, 2026
70 checks passed

ko3n1g merged 1 commit into NVIDIA:main from ko3n1g:ko3n1g/heal/gpt3-7b-tp4-pp1-memory-speed-gb200
