ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values#4601

Merged
ko3n1g merged 1 commit into NVIDIA:main from
ko3n1g:ko3n1g/heal/gpt3-7b-tp4-pp1-memory-speed-gb200
May 4, 2026

@ko3n1g ko3n1g commented May 4, 2026

After the changes in https://github.com/NVIDIA/Megatron-LM/commits/main/tests/functional_tests/shell_test_utils/_run_training.sh, we forgot to update the nightly golden values of the test in question. Perhaps the test was still in the queue when we updated the values. Looking at the dashboard, it's the only change that happened in that span that clearly explains a shift in golden values.

Claude summary

Summary

Heals the deterministic golden-values regression on gpt3_7b_tp4_pp1_memory_speed (dgx_gb200, dev environment) introduced by #3779 (rampup batch size scheduler replacement).

  • The APPROXIMATE test already passes — only the EXACT (deterministic) comparison failed.
  • Deltas are last-decimal noise on lm loss and num-zeros; no metric drift on iteration-time, mem-allocated-bytes, or mem-max-allocated-bytes.
  • Other env/platform combos (a100, h100, lts) are unaffected.
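The difference between the two comparison modes can be illustrated with a minimal sketch. This is not the Megatron-LM test harness: the function names `compare_exact` and `compare_approx` and the 5% relative tolerance are assumptions chosen for illustration, and the sample values are taken from the example diff in this PR.

```python
import math

def compare_exact(golden, actual):
    """Return the steps whose values differ at all (deterministic check)."""
    return [step for step, value in golden.items() if actual.get(step) != value]

def compare_approx(golden, actual, rel_tol=0.05):
    """Return the steps whose values differ by more than rel_tol.

    rel_tol=0.05 is a hypothetical tolerance, not the harness's real setting.
    """
    return [
        step
        for step, value in golden.items()
        if step not in actual
        or not math.isclose(actual[step], value, rel_tol=rel_tol)
    ]

# Old golden values vs. newly observed values (from the example diff).
golden = {"3": 12.60284, "25": 10.99869}
actual = {"3": 12.60283, "25": 10.99872}

exact_mismatches = compare_exact(golden, actual)
approx_mismatches = compare_approx(golden, actual)
```

Under these assumptions the exact check flags both steps while the approximate check flags none, which matches the failure pattern described above.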

Verification

Example diff

-            "3": 12.60284,
+            "3": 12.60283,
-            "25": 10.99869
+            "25": 10.99872
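As a rough check of the "last-decimal noise" claim, the size of the shift in the two values shown above can be computed directly. The helper name `deltas` is hypothetical, and only the two steps quoted in the example diff are used; the full golden-values file is not reproduced here.

```python
# Values copied from the example diff: step -> (old, new).
old = {"3": 12.60284, "25": 10.99869}
new = {"3": 12.60283, "25": 10.99872}

def deltas(old_values, new_values):
    """Map each step to (absolute delta, relative delta) between old and new."""
    result = {}
    for step, old_value in old_values.items():
        abs_delta = abs(new_values[step] - old_value)
        result[step] = (abs_delta, abs_delta / abs(old_value))
    return result

result = deltas(old, new)
for step, (abs_delta, rel_delta) in result.items():
    print(f"step {step}: abs={abs_delta:.1e} rel={rel_delta:.1e}")
```

Both shifts are confined to the fifth decimal place, so an exact comparison fails while any reasonable relative tolerance still passes.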

Heals deterministic lm-loss / num-zeros mismatch on dev_dgx_gb200 after
the rampup batch size scheduler was replaced (NVIDIA#3779). Approximate test
already passes; deltas are within last-decimal noise.

Verified in https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/pipelines/50196728

Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented May 4, 2026

/ok to test d23d965

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft May 4, 2026 09:49

github-actions Bot commented May 4, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.


copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ko3n1g ko3n1g requested a review from chtruong814 May 4, 2026 09:51
@ko3n1g ko3n1g marked this pull request as ready for review May 4, 2026 09:51
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team May 4, 2026 09:51
@ko3n1g ko3n1g merged commit 0efa47a into NVIDIA:main May 4, 2026
70 checks passed