Adding blackwell tests by shanmugamr1992 · Pull Request #5113 · NVIDIA/Megatron-LM

shanmugamr1992 · 2026-06-02T18:20:36Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

copy-pr-bot · 2026-06-02T18:20:41Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-06-02T18:20:54Z

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

Add the oncall reviewer (optional reviewer)
Add required review teams based on your changes

See the contribution guide for more details.

greptile-apps · 2026-06-02T18:25:14Z

Reviews (1): Last reviewed commit: "Adding blackwell tests"

greptile-apps · 2026-06-02T18:25:23Z

+  script_setup: |
+    unset https_proxy
+    echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
+
+    cd /opt
+    rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
+    git init
+    git remote add origin $MCORE_REPO
+    git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
+    git fetch origin $MCORE_MR_COMMIT
+    git checkout $MCORE_MR_COMMIT
+    git rev-parse HEAD
+  script: |-
+    cd /opt/megatron-lm


Missing set -euo pipefail in script_setup and script

Both gpt-perf-dp4.yaml and hybrid-perf-ep4.yaml are missing set -euo pipefail in their script_setup (and script) sections, while the paired gpt-perf.yaml and hybrid-perf.yaml include it. Without set -e, a failing git fetch or git checkout would not abort the setup — the test could proceed against a wrong or stale repo checkout, producing misleading results.

Suggested change

-  script_setup: |
-    unset https_proxy
-    echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
-
-    cd /opt
-    rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
-    git init
-    git remote add origin $MCORE_REPO
-    git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
-    git fetch origin $MCORE_MR_COMMIT
-    git checkout $MCORE_MR_COMMIT
-    git rev-parse HEAD
-  script: |-
-    cd /opt/megatron-lm
+  script_setup: |
+    set -euo pipefail
+    unset https_proxy
+    echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
+
+    cd /opt
+    rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
+    git init
+    git remote add origin $MCORE_REPO
+    git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
+    git fetch origin $MCORE_MR_COMMIT
+    git checkout $MCORE_MR_COMMIT
+    git rev-parse HEAD
+  script: |-
+    set -euo pipefail
+    cd /opt/megatron-lm
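
The failure mode described in this review comment can be reproduced in isolation. The sketch below is illustrative only and is not part of the PR: it runs the same two-step shell snippet with and without `set -e`, using `false` as a stand-in for a failing `git fetch` or `git checkout`. The helper name `run_snippet` is hypothetical, and the example assumes a POSIX `sh` is on the PATH. It exercises only the `-e` flag; `-u` and `-o pipefail` are not covered because `pipefail` is not supported by every `sh`.

```python
import subprocess
from typing import Tuple

# `false` stands in for a failing `git fetch` or `git checkout`;
# the `echo` stands in for the test that runs afterwards.
UNGUARDED = 'false\necho "running test"\n'
GUARDED = "set -e\n" + UNGUARDED


def run_snippet(script: str) -> Tuple[int, str]:
    """Run a shell snippet with `sh -c` and return (exit code, stdout)."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=False
    )
    return result.returncode, result.stdout.strip()


if __name__ == "__main__":
    # Without `set -e` the failure is ignored and the later step still runs.
    print("unguarded:", run_snippet(UNGUARDED))
    # With `set -e` the shell stops at the first failing command.
    print("guarded:  ", run_snippet(GUARDED))
```

In the unguarded case the snippet exits with status 0 and still prints "running test"; in the guarded case the shell exits with a non-zero status before the `echo` runs.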

greptile-apps · 2026-06-02T18:25:24Z

+  script_setup: |
+    unset https_proxy
+    echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
+
+    cd /opt
+    rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
+    git init
+    git remote add origin $MCORE_REPO
+    git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
+    git fetch origin $MCORE_MR_COMMIT
+    git checkout $MCORE_MR_COMMIT
+    git rev-parse HEAD
+  script: |-
+    cd /opt/megatron-lm


Missing set -euo pipefail in script_setup and script

Same issue as gpt-perf-dp4.yaml: script_setup and script lack set -euo pipefail, while the analogous hybrid-perf.yaml has it in both sections. A failed git checkout during setup would silently be ignored and the test would run against stale code.

Suggested change

-  script_setup: |
-    unset https_proxy
-    echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
-
-    cd /opt
-    rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
-    git init
-    git remote add origin $MCORE_REPO
-    git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
-    git fetch origin $MCORE_MR_COMMIT
-    git checkout $MCORE_MR_COMMIT
-    git rev-parse HEAD
-  script: |-
-    cd /opt/megatron-lm
+  script_setup: |
+    set -euo pipefail
+    unset https_proxy
+    echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
+
+    cd /opt
+    rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
+    git init
+    git remote add origin $MCORE_REPO
+    git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
+    git fetch origin $MCORE_MR_COMMIT
+    git checkout $MCORE_MR_COMMIT
+    git rev-parse HEAD
+  script: |-
+    set -euo pipefail
+    cd /opt/megatron-lm

balasaajay

quick question: why did we add a test launch harness for perf tests rather than reuse the existing one for functional tests? Were there feature gaps in the harness?

The gpt-perf-dp4.yaml and hybrid-perf-ep4.yaml recipes were missing the set -euo pipefail guard present in their sibling recipes. Without it, a failing git fetch/checkout in script_setup would be silently ignored and the test could run against stale code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shanmugamr1992 · 2026-06-02T20:35:27Z

@balasaajay Good question. The perf path needs to stand up an inference server, sweep across batch sizes, and compare throughput/latency against a per-platform baseline_values.json with tolerance bands. The functional harness is built around a single training/eval run plus golden-value diffs — it doesn't model batch-size sweeps or perf-baseline tolerances. Rather than retrofit those concepts into the functional harness, we kept the perf driver (run_perf_test.sh + compare_to_baseline.py) separate. No feature gap blocked reuse per se; the two are just measuring different things, so separating them keeps both harnesses simple.
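
To make the idea of a per-batch-size baseline with tolerance bands concrete, here is a minimal sketch. It is not the code in compare_to_baseline.py, and the layout of baseline_values.json is not shown in this thread: the dictionary shape, the metric names `throughput` and `latency_ms`, the function name `find_regressions`, and the 5% default tolerance are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layout: {batch_size: {"throughput": ..., "latency_ms": ...}}
Metrics = Dict[str, Dict[str, float]]


def find_regressions(
    measured: Metrics, baseline: Metrics, tolerance: float = 0.05
) -> List[str]:
    """Return one message per metric that falls outside its tolerance band."""
    failures: List[str] = []
    for batch_size, expected in baseline.items():
        actual = measured.get(batch_size)
        if actual is None:
            failures.append(f"batch {batch_size}: no measurement")
            continue
        # Throughput: higher is better, so fail when it drops below the band.
        if actual["throughput"] < expected["throughput"] * (1 - tolerance):
            failures.append(f"batch {batch_size}: throughput regressed")
        # Latency: lower is better, so fail when it rises above the band.
        if actual["latency_ms"] > expected["latency_ms"] * (1 + tolerance):
            failures.append(f"batch {batch_size}: latency regressed")
    return failures


if __name__ == "__main__":
    # Made-up numbers, purely to exercise the function.
    baseline = {
        "1": {"throughput": 100.0, "latency_ms": 10.0},
        "8": {"throughput": 600.0, "latency_ms": 14.0},
    }
    measured = {
        "1": {"throughput": 97.0, "latency_ms": 10.3},
        "8": {"throughput": 540.0, "latency_ms": 14.2},
    }
    print(find_regressions(measured, baseline))
```

With these made-up numbers, batch size 1 stays inside both bands, while batch size 8 is flagged because 540 is more than 5% below the baseline throughput of 600. A sweep with a pass/fail band per batch size is a different shape of check from a single run diffed against golden values, which is the distinction drawn in the reply above.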

greptile-apps · 2026-06-02T20:40:09Z

Reviews (2): Last reviewed commit: "Add set -euo pipefail to GB200 perf reci..."

shanmugamr1992 · 2026-06-03T18:27:16Z

/ok to test 0023a2e

svcnvidia-nemo-ci · 2026-06-03T19:51:15Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26909132625

svcnvidia-nemo-ci · 2026-06-03T20:02:57Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26909742582

Adding blackwell tests

63318d8

shanmugamr1992 requested a review from a team as a code owner June 2, 2026 18:20

svcnvidia-nemo-ci marked this pull request as draft June 2, 2026 18:20

shanmugamr1992 marked this pull request as ready for review June 2, 2026 18:21

svcnvidia-nemo-ci added the complexity: medium label Jun 2, 2026

greptile-apps Bot reviewed Jun 2, 2026

balasaajay approved these changes Jun 2, 2026

svcnvidia-nemo-ci added the Approved label (All necessary approvals have been made) Jun 2, 2026

shanmugamr1992 enabled auto-merge June 2, 2026 20:28

copy-pr-bot Bot temporarily deployed to public June 3, 2026 18:28 Inactive

copy-pr-bot Bot temporarily deployed to test June 3, 2026 18:28 Inactive

copy-pr-bot Bot temporarily deployed to public June 3, 2026 18:31 Inactive

copy-pr-bot Bot temporarily deployed to public June 3, 2026 18:39 Inactive

shanmugamr1992 added this pull request to the merge queue Jun 3, 2026

Merged via the queue into NVIDIA:main with commit a377dee Jun 3, 2026
81 checks passed

shanmugamr1992 deleted the oci-tests branch June 3, 2026 21:45

shanmugamr1992 merged 2 commits into NVIDIA:main from shanmugamr1992:oci-tests

3 participants
