Fix macOS xdist crash in GHZ noise validation test #4305

Open
KevinSailema wants to merge 1 commit into NVIDIA:main from KevinSailema:fix/issue-4281-macos-ghz-noise-xdist-crash

@KevinSailema
Contributor

Fixes #4281.

  • This PR addresses an intermittent macOS CI crash where an xdist worker dies while running test_simple_run_ghz_with_noise.
  • The change is intentionally minimal and limited to test/validation flow, without touching product runtime code.

Root cause addressed

  • The failing path uses global target state in a noisy kernel test, and this can be fragile under parallel test execution in macOS CI.
  • The failure mode is a worker crash, not a deterministic assertion failure.

What changed

  • Hardened target lifecycle in test_run_kernel.py:
      • test_simple_run_ghz_with_noise now guarantees target reset via try/finally.
      • Results are materialized before resetting target to avoid teardown/lifecycle races.
  • Updated macOS validation strategy in validate_pycudaq.sh:
      • Keep core tests parallel with xdist.
      • Exclude test_simple_run_ghz_with_noise from the parallel core batch on macOS.
      • Run that single test serially immediately after core tests.

Why this approach

  • Keeps maximum parallel coverage for the rest of the suite.
  • Isolates only the unstable test in macOS CI.
  • Minimizes blast radius and avoids broad skips or module-wide serialization.

Validation

  • The target test passed locally both serially and under xdist.
  • Stress runs of the target test under xdist were stable across repeated iterations.
  • Adjacent GHZ tests remained green, serially and under xdist, after the change.
  • Script syntax was validated, and the change scope was confirmed as limited to the two intended files.

Impact

  • Reduces flaky macOS worker crashes for the validation pipeline.
  • Preserves test coverage while improving CI reliability.

Addresses NVIDIA#4281 by hardening target lifecycle in test_simple_run_ghz_with_noise and running that test serially on macOS validation while keeping the rest of core tests parallel.
@copy-pr-bot

copy-pr-bot bot commented Apr 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


macos_serial_core_tests=()
if $is_macos; then
    ghz_noise_test="test_simple_run_ghz_with_noise"
Collaborator

I'm not a fan of hardcoding the test here.

I don't believe it always fails on this test specifically. I started creating GH issues whenever I see these crashes, to try to paint a better picture. I think the failing test is random and likely has something to do with the runner environment.


Development

Successfully merging this pull request may close these issues.

[CI] [macos] worker 'gw1' crashed while running 'build/validation/tests/kernel/test_run_kernel.py::test_simple_run_ghz_with_noise'

2 participants