Adding blackwell tests #5113
Reviews (1): Last reviewed commit: "Adding blackwell tests"
script_setup: |
  unset https_proxy
  echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
  cd /opt
  rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
  git init
  git remote add origin $MCORE_REPO
  git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
  git fetch origin $MCORE_MR_COMMIT
  git checkout $MCORE_MR_COMMIT
  git rev-parse HEAD
script: |-
  cd /opt/megatron-lm
Missing set -euo pipefail in script_setup and script
Both gpt-perf-dp4.yaml and hybrid-perf-ep4.yaml are missing set -euo pipefail in their script_setup and script sections, while the paired gpt-perf.yaml and hybrid-perf.yaml include it. Without set -e, a failing git fetch or git checkout would not abort the setup, so the test could proceed against a wrong or stale repo checkout and produce misleading results.
Suggested change:
script_setup: |
  set -euo pipefail
  unset https_proxy
  echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
  cd /opt
  rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
  git init
  git remote add origin $MCORE_REPO
  git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
  git fetch origin $MCORE_MR_COMMIT
  git checkout $MCORE_MR_COMMIT
  git rev-parse HEAD
script: |-
  set -euo pipefail
  cd /opt/megatron-lm
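The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch, not part of the PR: `false` stands in for a failing `git checkout`, and the variable names are made up for illustration. It assumes `bash` is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the recipe's setup steps: `false` plays the role
# of a failing `git checkout`, and the echo is whatever would run next.
setup_body='false; echo "continued after failure"'

# Without the guard, the failure is ignored and the next command still runs.
without_guard=$(bash -c "$setup_body"; echo "exit=$?")

# With the guard, bash stops at the first failing command and exits non-zero.
with_guard=$(bash -c "set -euo pipefail; $setup_body"; echo "exit=$?")

printf 'without guard:\n%s\n' "$without_guard"
printf 'with guard:\n%s\n' "$with_guard"
```

In the unguarded run the child shell reports exit status 0 even though its first command failed, which is how a broken checkout could go unnoticed by the harness.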
script_setup: |
  unset https_proxy
  echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
  cd /opt
  rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
  git init
  git remote add origin $MCORE_REPO
  git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
  git fetch origin $MCORE_MR_COMMIT
  git checkout $MCORE_MR_COMMIT
  git rev-parse HEAD
script: |-
  cd /opt/megatron-lm
Missing set -euo pipefail in script_setup and script
Same issue as gpt-perf-dp4.yaml: script_setup and script lack set -euo pipefail, while the analogous hybrid-perf.yaml has it in both sections. A failed git checkout during setup would be silently ignored and the test would run against stale code.
Suggested change:
script_setup: |
  set -euo pipefail
  unset https_proxy
  echo "machine gitlab-master.nvidia.com login okoenig password $RO_API_TOKEN" | tee -a /root/.netrc
  cd /opt
  rm -rf /opt/megatron-lm; mkdir megatron-lm; cd megatron-lm
  git init
  git remote add origin $MCORE_REPO
  git fetch origin '+refs/merge-requests/*:refs/remotes/merge-requests/*'
  git fetch origin $MCORE_MR_COMMIT
  git checkout $MCORE_MR_COMMIT
  git rev-parse HEAD
script: |-
  set -euo pipefail
  cd /opt/megatron-lm
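The review comments focus on `-e`, but the guard also enables `-u` and `-o pipefail`. The sketch below illustrates what those two options change, using only standard bash behavior. `HYPOTHETICAL_COMMIT` is a made-up name standing in for a variable such as `$MCORE_MR_COMMIT`, and nothing here is taken from the PR itself.

```shell
#!/usr/bin/env bash
# Make sure the illustrative variable is not inherited by the child shells.
unset HYPOTHETICAL_COMMIT

# Without -u, an unset variable expands to an empty string, so the command
# line silently loses its argument.
lax_unset=$(bash -c 'echo "git fetch origin $HYPOTHETICAL_COMMIT"')

# With -u, bash refuses to expand the unset variable and exits non-zero
# before the echo runs, so the fallback branch fires.
strict_unset=$(bash -c 'set -u; echo "git fetch origin $HYPOTHETICAL_COMMIT"' 2>/dev/null || echo "aborted")

# Without pipefail, a pipeline reports only the status of its last command.
lax_pipe=$(bash -c 'false | cat; echo "status=$?"')

# With pipefail, a failure anywhere in the pipeline becomes its status.
strict_pipe=$(bash -c 'set -o pipefail; false | cat; echo "status=$?"')

echo "unset variable, no -u:  '$lax_unset'"
echo "unset variable, -u:     '$strict_unset'"
echo "pipeline, no pipefail:  $lax_pipe"
echo "pipeline, pipefail:     $strict_pipe"
```

In the recipes this would matter for lines such as the `echo ... | tee -a /root/.netrc` pipeline and the `git fetch origin $MCORE_MR_COMMIT` call, assuming those variables can be unset in some runs.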
balasaajay left a comment
quick question: why did we add a test launch harness for perf tests rather than reuse the existing one for functional tests? Were there feature gaps in the harness?
The gpt-perf-dp4.yaml and hybrid-perf-ep4.yaml recipes were missing the set -euo pipefail guard present in their sibling recipes. Without it, a failing git fetch/checkout in script_setup would be silently ignored and the test could run against stale code.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@balasaajay Good question. The perf path needs to stand up an inference server, sweep across batch sizes, and compare throughput/latency against a per-platform
Reviews (2): Last reviewed commit: "Add set -euo pipefail to GB200 perf reci..."
/ok to test 0023a2e
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26909132625
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26909742582
Contribution process
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.
For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.