Skip to content

Fix Nemo_CICD_Test not catching cancelled/skipped functional tests#3947

Merged
ko3n1g merged 11 commits intoNVIDIA:mainfrom
ko3n1g:fix/cicd-nemo-test-cancelled-skipped-detection
Mar 19, 2026
Merged

Fix Nemo_CICD_Test not catching cancelled/skipped functional tests#3947
ko3n1g merged 11 commits intoNVIDIA:mainfrom
ko3n1g:fix/cicd-nemo-test-cancelled-skipped-detection

Conversation

@ko3n1g
Copy link
Contributor

@ko3n1g ko3n1g commented Mar 19, 2026

Summary

  • Nemo_CICD_Test previously passed even when functional tests were cancelled mid-run or silently skipped (e.g. when a parse job failed and produced an empty matrix), because the broad SKIPPING_IS_ALLOWED flag masked these failures for merge_group and ci_workload triggers
  • Example broken run: https://github.com/NVIDIA/Megatron-LM/actions/runs/23253541987/job/67621459669gpt/gpt_grpo_basic_function - latest was cancelled but Nemo_CICD_Test still passed

Changes

  • Add direct needs.result checks for the three test groups: unit tests and H100 integration tests must be success; GB200 integration tests allow skipped (non-maintainer PRs) but not failure/cancelled
  • Replace the broad SKIPPING_IS_ALLOWED (which allowed all skips on merge queue and nightly runs) with an explicit early-exit for docs_only and is_deployment_workflow, which are the only legitimate cases where tests are intentionally skipped
  • Extend the individual job scan to also catch cancelled conclusions (previously only failure was checked)
  • Improve failure output to show both job name and conclusion

Test plan

  • Verify a merge queue run where all tests pass still results in Nemo_CICD_Test passing
  • Verify a docs-only PR still passes Nemo_CICD_Test
  • Verify a run with a cancelled functional test now fails Nemo_CICD_Test

🤖 Generated with Claude Code

@ko3n1g ko3n1g requested review from a team as code owners March 19, 2026 15:01
@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft March 19, 2026 15:01
@github-actions
Copy link
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 19, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@ko3n1g
Copy link
Contributor Author

ko3n1g commented Mar 19, 2026

/ok to test

5 similar comments
@ko3n1g
Copy link
Contributor Author

ko3n1g commented Mar 19, 2026

/ok to test

@ko3n1g
Copy link
Contributor Author

ko3n1g commented Mar 19, 2026

/ok to test

@ko3n1g
Copy link
Contributor Author

ko3n1g commented Mar 19, 2026

/ok to test

@ko3n1g
Copy link
Contributor Author

ko3n1g commented Mar 19, 2026

/ok to test

@ko3n1g
Copy link
Contributor Author

ko3n1g commented Mar 19, 2026

/ok to test

ko3n1g and others added 8 commits March 19, 2026 15:45
Previously, Nemo_CICD_Test would pass even when functional test jobs were
cancelled mid-run or silently skipped (e.g. when a parse job failed and
produced an empty matrix). The broad SKIPPING_IS_ALLOWED flag masked these
failures for merge_group and ci_workload triggers.

- Add direct needs.result checks for unit tests (must succeed), H100
  integration tests (must succeed), and GB200 integration tests (success
  or skipped allowed for non-maintainer PRs)
- Replace SKIPPING_IS_ALLOWED with an explicit early-exit for docs-only
  and deployment workflows, which intentionally skip all tests
- Extend the broad job scan to also catch cancelled individual matrix
  instances (e.g. a single test cancelled mid-run)
- Improve failure output to show both the job name and its conclusion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Non-maintainer PRs legitimately skip GB200 tests (no runners available),
but maintainer runs (PRs, merge queue, nightly) must always run them.
Thread IS_MAINTAINER from is-not-external-contributor into Nemo_CICD_Test
so the GB200 skipped-allowed exemption only applies when appropriate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a test_nemo_cicd_gate workflow_dispatch input with three scenarios:
  - all_pass: mocks all expensive jobs and expects Nemo_CICD_Test to pass
  - h100_skipped: forces cicd-integration-tests-latest-h100 to be skipped
    (via job if-condition) — gate must fail
  - gb200_skipped: forces cicd-integration-tests-latest-gb200 to be skipped
    — gate must fail (maintainer run, so gb200 skipped is not allowed)

In test mode:
  - ubuntu-latest runners replace GPU runners (no GPU cost)
  - cicd-wait-in-queue environment gate is bypassed
  - cicd-container-build skips the actual Docker build
  - parse jobs emit a single-item mock matrix (one cheap job per group)
  - test jobs skip the .github/actions composite action and just echo

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… branches

nv-gha-runners/get-pr-info only works with push events; calling it during
workflow_dispatch (which has no PR event context) causes it to fail and
cascades into all downstream jobs being skipped. Guard every Get PR info
step with github.event_name == 'push' so manual dispatches are unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The external FW-CI-templates preflight workflow has its own Get PR info
step that fails on workflow_dispatch events. Since we cannot modify the
reusable workflow, skip it entirely in test mode.

When a job is skipped (not failed), GitHub Actions success() still returns
true for downstream jobs, so the full critical path (container-build →
parse → unit/integration tests) unblocks without changing any downstream
conditions. The pre-flight outputs are empty but all downstream job
conditions already fall through to success() which is sufficient.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
success() returns false when a dependency is skipped, causing the whole
critical path (container-build → parse → tests) to cascade-skip even
though pre-flight was intentionally bypassed in test mode.

Add inputs.test_nemo_cicd_gate != 'disabled' as a fallback in every
(success() || ...) block so jobs proceed without needing pre-flight
outputs. The check is safe for non-dispatch events since inputs default
to empty string, and '' != 'disabled' would be true — guard against that
by also requiring != ''.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cicd-container-build: skip entirely in test mode (apt-get fails on
ubuntu-latest without root). Downstream parse jobs already bypass the
container-build result via the test_nemo_cicd_gate condition added
in the previous commit.

is-not-external-contributor: treat workflow_dispatch as a maintainer
run (triggering a dispatch requires write access to the repo, so the
SSO check is unnecessary and fails with an empty username when Get PR
info is skipped on non-push events).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cleans up all workflow_dispatch test_nemo_cicd_gate machinery that was
used to validate the Nemo_CICD_Test gate fix: input definition, runner
overrides, pre-flight skip, cicd-container-build skip, success() bypass
lines, parse job mock outputs, and mock test steps.

Retains the two real fixes introduced alongside the scaffolding:
- Get PR info steps guarded with github.event_name == 'push'
- Nemo_CICD_Test gate rewrite (direct needs result checks + cancelled detection)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ko3n1g ko3n1g force-pushed the fix/cicd-nemo-test-cancelled-skipped-detection branch from 8d95c5e to 0f2e72a Compare March 19, 2026 15:45
@ko3n1g ko3n1g marked this pull request as ready for review March 19, 2026 15:45
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 19, 2026 15:46
@ko3n1g ko3n1g removed request for a team March 19, 2026 15:51
@ko3n1g ko3n1g removed request for a team March 19, 2026 15:51
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label Mar 19, 2026
ko3n1g added 2 commits March 19, 2026 16:03
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g merged commit f9a4196 into NVIDIA:main Mar 19, 2026
55 of 56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: low

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants