Skip to content

Adding blackwell tests#5113

Merged
shanmugamr1992 merged 2 commits into
NVIDIA:mainfrom
shanmugamr1992:oci-tests
Jun 3, 2026
Merged

Adding blackwell tests#5113
shanmugamr1992 merged 2 commits into
NVIDIA:mainfrom
shanmugamr1992:oci-tests

Conversation

@shanmugamr1992
Copy link
Copy Markdown
Contributor

  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

@shanmugamr1992 shanmugamr1992 requested a review from a team as a code owner June 2, 2026 18:20
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft June 2, 2026 18:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@shanmugamr1992 shanmugamr1992 marked this pull request as ready for review June 2, 2026 18:21
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 2, 2026

Reviews (1): Last reviewed commit: "Adding blackwell tests" | Re-trigger Greptile

Comment on lines +17 to +30
script_setup: |
unset https_proxy
echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc

cd /opt
rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
git init
git remote add origin $MCORE_REPO
git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
git fetch origin $MCORE_MR_COMMIT
git checkout $MCORE_MR_COMMIT
git rev-parse HEAD
script: |-
cd /opt/megatron-lm
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing set -euo pipefail in script_setup and script

Both gpt-perf-dp4.yaml and hybrid-perf-ep4.yaml are missing set -euo pipefail in their script_setup (and script) sections, while the paired gpt-perf.yaml and hybrid-perf.yaml include it. Without set -e, a failing git fetch or git checkout would not abort the setup — the test could proceed against a wrong or stale repo checkout, producing misleading results.

Suggested change
script_setup: |
unset https_proxy
echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
cd /opt
rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
git init
git remote add origin $MCORE_REPO
git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
git fetch origin $MCORE_MR_COMMIT
git checkout $MCORE_MR_COMMIT
git rev-parse HEAD
script: |-
cd /opt/megatron-lm
script_setup: |
set -euo pipefail
unset https_proxy
echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
cd /opt
rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
git init
git remote add origin $MCORE_REPO
git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
git fetch origin $MCORE_MR_COMMIT
git checkout $MCORE_MR_COMMIT
git rev-parse HEAD
script: |-
set -euo pipefail
cd /opt/megatron-lm

Comment on lines +17 to +30
script_setup: |
unset https_proxy
echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc

cd /opt
rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
git init
git remote add origin $MCORE_REPO
git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
git fetch origin $MCORE_MR_COMMIT
git checkout $MCORE_MR_COMMIT
git rev-parse HEAD
script: |-
cd /opt/megatron-lm
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Missing set -euo pipefail in script_setup and script

Same issue as gpt-perf-dp4.yaml: script_setup and script lack set -euo pipefail, while the analogous hybrid-perf.yaml has it in both sections. A failed git checkout during setup would silently be ignored and the test would run against stale code.

Suggested change
script_setup: |
unset https_proxy
echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
cd /opt
rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
git init
git remote add origin $MCORE_REPO
git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
git fetch origin $MCORE_MR_COMMIT
git checkout $MCORE_MR_COMMIT
git rev-parse HEAD
script: |-
cd /opt/megatron-lm
script_setup: |
set -euo pipefail
unset https_proxy
echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
cd /opt
rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
git init
git remote add origin $MCORE_REPO
git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
git fetch origin $MCORE_MR_COMMIT
git checkout $MCORE_MR_COMMIT
git rev-parse HEAD
script: |-
set -euo pipefail
cd /opt/megatron-lm

Copy link
Copy Markdown
Contributor

@balasaajay balasaajay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick question: why did we add a test launch harness for perf tests rather than reuse the existing one for functional tests? Were there feature gaps in the harness?

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Jun 2, 2026
@shanmugamr1992 shanmugamr1992 enabled auto-merge June 2, 2026 20:28
The gpt-perf-dp4.yaml and hybrid-perf-ep4.yaml recipes were missing the
set -euo pipefail guard present in their sibling recipes. Without it, a
failing git fetch/checkout in script_setup would be silently ignored and
the test could run against stale code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shanmugamr1992
Copy link
Copy Markdown
Contributor Author

@balasaajay Good question. The perf path needs to stand up an inference server, sweep across batch sizes, and compare throughput/latency against a per-platform baseline_values.json with tolerance bands. The functional harness is built around a single training/eval run plus golden-value diffs — it doesn't model batch-size sweeps or perf-baseline tolerances. Rather than retrofit those concepts into the functional harness, we kept the perf driver (run_perf_test.sh + compare_to_baseline.py) separate. No feature gap blocked reuse per se; the two are just measuring different things, so separating them keeps both harnesses simple.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Jun 2, 2026

Reviews (2): Last reviewed commit: "Add set -euo pipefail to GB200 perf reci..." | Re-trigger Greptile

@shanmugamr1992
Copy link
Copy Markdown
Contributor Author

/ok to test 0023a2e

@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26909132625

@svcnvidia-nemo-ci
Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26909742582

Merged via the queue into NVIDIA:main with commit a377dee Jun 3, 2026
81 checks passed
@shanmugamr1992 shanmugamr1992 deleted the oci-tests branch June 3, 2026 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants