Fix Nemo_CICD_Test not catching cancelled/skipped functional tests #3947
Merged
ko3n1g merged 11 commits into NVIDIA:main on Mar 19, 2026
Contributor
Author
|
/ok to test |
Previously, Nemo_CICD_Test would pass even when functional test jobs were cancelled mid-run or silently skipped (e.g. when a parse job failed and produced an empty matrix). The broad SKIPPING_IS_ALLOWED flag masked these failures for merge_group and ci_workload triggers.
- Add direct needs.result checks for unit tests (must succeed), H100 integration tests (must succeed), and GB200 integration tests (success or skipped allowed for non-maintainer PRs)
- Replace SKIPPING_IS_ALLOWED with an explicit early-exit for docs-only and deployment workflows, which intentionally skip all tests
- Extend the broad job scan to also catch cancelled individual matrix instances (e.g. a single test cancelled mid-run)
- Improve failure output to show both the job name and its conclusion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Non-maintainer PRs legitimately skip GB200 tests (no runners available), but maintainer runs (PRs, merge queue, nightly) must always run them. Thread IS_MAINTAINER from is-not-external-contributor into Nemo_CICD_Test so the GB200 skipped-allowed exemption only applies when appropriate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
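The gate rules described in the two commit messages above can be sketched in Python as follows. This is only an illustration of the decision logic, not the workflow's actual shell step: the function name, the dictionary shape, and the unit-test job name `cicd-unit-tests-latest` are assumptions made for the example, and treating a missing result as `skipped` is likewise a choice of this sketch.

```python
# Hypothetical sketch of the Nemo_CICD_Test decision rules; not the real workflow code.
UNIT = "cicd-unit-tests-latest"  # assumed name, not taken from the PR
H100 = "cicd-integration-tests-latest-h100"
GB200 = "cicd-integration-tests-latest-gb200"


def nemo_cicd_gate(results, is_maintainer, docs_only=False, is_deployment_workflow=False):
    """Return (passed, problems) for a mapping of job name -> result string."""
    # Docs-only and deployment workflows intentionally skip all tests.
    if docs_only or is_deployment_workflow:
        return True, []

    allowed = {
        UNIT: {"success"},
        H100: {"success"},
        # Only non-maintainer runs may skip GB200 (no runners available).
        GB200: {"success"} if is_maintainer else {"success", "skipped"},
    }

    problems = []
    for job, ok_results in allowed.items():
        # Assumption: a job with no recorded result is treated as skipped.
        result = results.get(job, "skipped")
        if result not in ok_results:
            problems.append(f"{job}: {result}")
    return not problems, problems


if __name__ == "__main__":
    run = {UNIT: "success", H100: "success", GB200: "skipped"}
    print(nemo_cicd_gate(run, is_maintainer=True))
    print(nemo_cicd_gate(run, is_maintainer=False))
```

Listing the allowed results per job, rather than listing the forbidden ones, means that `cancelled` and any other unexpected result fail the gate by default.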
Adds a test_nemo_cicd_gate workflow_dispatch input with three scenarios:
- all_pass: mocks all expensive jobs and expects Nemo_CICD_Test to pass
- h100_skipped: forces cicd-integration-tests-latest-h100 to be skipped
(via job if-condition) — gate must fail
- gb200_skipped: forces cicd-integration-tests-latest-gb200 to be skipped
— gate must fail (maintainer run, so gb200 skipped is not allowed)
In test mode:
- ubuntu-latest runners replace GPU runners (no GPU cost)
- cicd-wait-in-queue environment gate is bypassed
- cicd-container-build skips the actual Docker build
- parse jobs emit a single-item mock matrix (one cheap job per group)
- test jobs skip the .github/actions composite action and just echo
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… branches nv-gha-runners/get-pr-info only works with push events; calling it during workflow_dispatch (which has no PR event context) causes it to fail and cascades into all downstream jobs being skipped. Guard every Get PR info step with github.event_name == 'push' so manual dispatches are unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The external FW-CI-templates preflight workflow has its own Get PR info step that fails on workflow_dispatch events. Since we cannot modify the reusable workflow, skip it entirely in test mode. When a job is skipped (not failed), GitHub Actions success() still returns true for downstream jobs, so the full critical path (container-build → parse → unit/integration tests) unblocks without changing any downstream conditions. The pre-flight outputs are empty but all downstream job conditions already fall through to success() which is sufficient. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
success() returns false when a dependency is skipped, causing the whole critical path (container-build → parse → tests) to cascade-skip even though pre-flight was intentionally bypassed in test mode. Add inputs.test_nemo_cicd_gate != 'disabled' as a fallback in every (success() || ...) block so jobs proceed without needing pre-flight outputs. On non-dispatch events inputs default to the empty string, and '' != 'disabled' would be true, so the fallback also requires inputs.test_nemo_cicd_gate != '' to stay inert outside test mode.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
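The empty-string pitfall in this fallback is easy to miss, so here is a small Python model of the condition. It is a sketch only: `should_run` and its parameters are hypothetical names, `deps_succeeded` stands in for what `success()` would report, and the real workflow expression may contain further terms not described in this PR.

```python
# Hypothetical model of the (success() || test-mode) job condition.
def should_run(deps_succeeded: bool, test_nemo_cicd_gate: str) -> bool:
    """Mimic: success() || (input != 'disabled' && input != '')."""
    # Non-dispatch events leave the input empty, so both comparisons are needed:
    # checking only != 'disabled' would switch test mode on for every push.
    test_mode = test_nemo_cicd_gate != "disabled" and test_nemo_cicd_gate != ""
    return deps_succeeded or test_mode


if __name__ == "__main__":
    for deps_ok, value in [(False, ""), (False, "disabled"), (False, "all_pass"), (True, "")]:
        print(deps_ok, repr(value), should_run(deps_ok, value))
```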
cicd-container-build: skip entirely in test mode (apt-get fails on ubuntu-latest without root). Downstream parse jobs already bypass the container-build result via the test_nemo_cicd_gate condition added in the previous commit.
is-not-external-contributor: treat workflow_dispatch as a maintainer run (triggering a dispatch requires write access to the repo, so the SSO check is unnecessary and fails with an empty username when Get PR info is skipped on non-push events).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cleans up all workflow_dispatch test_nemo_cicd_gate machinery that was used to validate the Nemo_CICD_Test gate fix: input definition, runner overrides, pre-flight skip, cicd-container-build skip, success() bypass lines, parse job mock outputs, and mock test steps.
Retains the two real fixes introduced alongside the scaffolding:
- Get PR info steps guarded with github.event_name == 'push'
- Nemo_CICD_Test gate rewrite (direct needs result checks + cancelled detection)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8d95c5e to 0f2e72a
chtruong814 approved these changes on Mar 19, 2026
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Summary
- Nemo_CICD_Test previously passed even when functional tests were cancelled mid-run or silently skipped (e.g. when a parse job failed and produced an empty matrix), because the broad SKIPPING_IS_ALLOWED flag masked these failures for merge_group and ci_workload triggers
- gpt/gpt_grpo_basic_function - latest was cancelled but Nemo_CICD_Test still passed
Changes
- Add direct needs.result checks for the three test groups: unit tests and H100 integration tests must be success; GB200 integration tests allow skipped (non-maintainer PRs) but not failure/cancelled
- Replace SKIPPING_IS_ALLOWED (which allowed all skips on merge queue and nightly runs) with an explicit early-exit for docs_only and is_deployment_workflow, which are the only legitimate cases where tests are intentionally skipped
- Extend the broad job scan to also catch cancelled conclusions (previously only failure was checked)
Test plan
- … Nemo_CICD_Test passing
- … Nemo_CICD_Test
- … Nemo_CICD_Test
🤖 Generated with Claude Code
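As an illustration of the broadened job scan, the following Python sketch flags both failure and cancelled conclusions and reports each job name together with its conclusion. It assumes the jobs are available as a list of records with `name` and `conclusion` fields; how the workflow actually obtains that list is not shown in this PR, and the helper names and the second job entry are hypothetical.

```python
# Hypothetical sketch of the broad job scan; not the workflow's actual step.
BAD_CONCLUSIONS = {"failure", "cancelled"}  # previously only "failure" was checked


def find_bad_jobs(jobs):
    """Return (name, conclusion) pairs for jobs that should fail the gate."""
    return [
        (job["name"], job["conclusion"])
        for job in jobs
        if job.get("conclusion") in BAD_CONCLUSIONS
    ]


def format_report(bad_jobs):
    """Render one 'name: conclusion' line per offending job."""
    return "\n".join(f"{name}: {conclusion}" for name, conclusion in bad_jobs)


if __name__ == "__main__":
    jobs = [
        {"name": "gpt/gpt_grpo_basic_function - latest", "conclusion": "cancelled"},
        {"name": "some-other-job", "conclusion": "success"},  # made-up entry
    ]
    print(format_report(find_bad_jobs(jobs)))
```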