tests: kill zombie IPC child processes after join timeout #2121
aryanputta wants to merge 4 commits into
When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown gets stuck, so child processes never exit.

The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released, tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue NVIDIA#2004 (always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC handle is released before fixture teardown. Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours.

Fixes NVIDIA#2004
Force-pushed from 72ccbe7 to df61e49.
Thanks @aryanputta for jumping in! I'm running a Cursor code review right now...
When Python 3.12 CI runs, env-vars enables compute-sanitizer with --target-processes=all, which attaches to every mp.Process child the tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown gets stuck, so child processes never exit.

The existing join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive. That zombie keeps its IPC handle open. When pytest teardown runs ipc_memory_resource's mr.close(), it blocks waiting for the handle to be released -- tying up the runner for hours until GitHub Actions force-cancels the job. This is the exact pattern in issue NVIDIA#2004 (always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC handle is released before fixture teardown. A stderr warning is printed when kill() fires so the failure is clearly attributable to the sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0. Tests still fail (exit code is non-zero or completed is False), just in seconds rather than hours.

Fixes NVIDIA#2004
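The kill-after-join pattern described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `hung_child`, `run_and_reap`, the shortened timeout value, the warning text, and the use of the `fork` start method are all assumptions made to keep the example self-contained (the real tests involve CUDA IPC buffers and may start children differently).

```python
import multiprocessing as mp
import sys
import time

# Assumption: shortened for the sketch; the PR discussion mentions 30 seconds.
CHILD_TIMEOUT_SEC = 0.5


def hung_child():
    """Hypothetical stand-in for a child that never exits on its own."""
    while True:
        time.sleep(0.1)


def run_and_reap(target):
    """Start a child, wait up to CHILD_TIMEOUT_SEC, kill it if still alive.

    Returns a (timed_out, exitcode) tuple.
    """
    # Assumption: "fork" (POSIX only) keeps the sketch self-contained;
    # the real tests may use a different start method.
    ctx = mp.get_context("fork")
    process = ctx.Process(target=target)
    process.start()

    # join() with a timeout returns silently even if the child is still running.
    process.join(timeout=CHILD_TIMEOUT_SEC)
    timed_out = process.is_alive()
    if timed_out:
        print(
            f"WARNING: child pid {process.pid} still alive after "
            f"{CHILD_TIMEOUT_SEC}s; killing it",
            file=sys.stderr,
        )
        process.kill()
        process.join()  # reap the killed child so exitcode is populated
    return timed_out, process.exitcode
```

Calling `run_and_reap(hung_child)` returns shortly after the timeout with `timed_out` set and a non-zero exit code, so a test that asserts on the exit code still fails, only quickly.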
Force-pushed from 948d30c to 5b802a0.
rwgk left a comment
PR 2121 Initial Review
No blocking findings.
The change matches the failure mode observed in issue #2004: after a child
process exceeds join(timeout=30), it is now killed before parent-side teardown
can block on IPC handles. The TestIpcReexport path also now asserts completed,
so a timed-out child becomes a real fast failure instead of quietly continuing
into teardown.
Residual risks
- Other memory_ipc tests also use join(timeout=CHILD_TIMEOUT_SEC) without
  killing children. The evidence in issue 2004 points specifically at
  test_send_buffers.py, so I would not block this PR on broadening the fix,
  but it may be worth a follow-up helper if hangs reappear elsewhere.
- Full CI has not run yet because the external-contributor PR is awaiting
  validation. Current checks are only metadata/pre-commit.
- I could not run the focused test locally because TestVenv appears stale for
  this branch and fails importing cuda.core._memory._managed_buffer.
git diff --check passed.
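The follow-up helper mentioned in the first residual risk could look something like the sketch below. The name `join_or_kill`, its signature, and the demo children are hypothetical and do not come from the repository; the sketch only shows how a single function might apply the same join-then-kill step to several children, returning the ones it had to kill.

```python
import multiprocessing as mp
import sys
import time


def join_or_kill(processes, timeout):
    """Hypothetical helper: join each process, kill any that outlive `timeout`.

    The timeout applies to each process individually, mirroring a plain
    join(timeout=...) call per child. Returns the list of killed processes.
    """
    killed = []
    for process in processes:
        process.join(timeout=timeout)
        if process.is_alive():
            print(
                f"WARNING: child pid {process.pid} did not exit within "
                f"{timeout}s; killing it",
                file=sys.stderr,
            )
            process.kill()
            process.join()  # reap the child so exitcode is populated
            killed.append(process)
    return killed


def quick_child():
    """Demo child that exits cleanly right away."""
    return None


def hung_child():
    """Demo child that never exits on its own."""
    while True:
        time.sleep(0.1)
```

A test could then assert that the returned list is empty. Children that exit cleanly are untouched, so the helper changes behavior only on the timeout path.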
Oh - @aryanputta please don't force push.
/ok to test 036a8f5
sorry!!
No worries. It was only very slightly confusing. Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.
haha, me too. I always test locally as well and I got no issues, so hopefully it's the same. Edit: did I stop it by accident? I clicked comment and I think I clicked stop. I am so sorry.
CI / Test linux-64 / Python 3.12, CUDA 13.2.1 (wheels), GPU rtx4090, wsl (push) seems to be a flake. I'll keep an eye out here and will rerun when the rest has finished.
Unfortunately, this still seems to hang: CI / Test linux-64 / Python 3.12, CUDA 12.9.1 (local), GPU l4 (push)
@Andy-Jost for visibility.
@aryanputta FYI: Andy is experimenting under #2124 with broader fixes.
I cancelled the hanging job. Waiting for @Andy-Jost's experiments under #2124.
The fix in the initial commit only covered test_send_buffers.py. The 63-minute CI hang confirmed the deadlock was occurring in one of the other test files.

Applies the same kill() after join(timeout=...) pattern to all remaining memory_ipc tests:
- test_errors.py (1 join)
- test_event_ipc.py (2 joins)
- test_ipc_duplicate_import.py (1 join)
- test_leaks.py (3 joins)
- test_memory_ipc.py (7 joins across 4 test classes)
- test_peer_access.py (2 joins)
- test_serialize.py (3 joins)

See issue NVIDIA#2004.
hey @rwgk @Andy-Jost -- after seeing the 63min hang in ci (step 29 died in the cuda.core test run) i checked the rest of the memory_ipc suite and found 7 more files with bare join(timeout=...) calls. just pushed a second commit that extends the same kill-after-join pattern to all remaining files.
andy's layered approach in #2124 is the cleaner fix so please prioritize that -- but if you need a fallback while #2124 is still in draft, this pr now covers the full suite. no need to run ci here unless #2124 doesn't work out
What
TestIpcSendBuffers and TestIpcReexport call process.join(timeout=30) to wait for child processes but never kill them if the join times out. This causes multi-hour CI hangs.
Root cause
ci/tools/env-vars enables compute-sanitizer (--target-processes=all) on Python 3.12 + local CTK + Linux, exactly the combination in every hanging run from issue #2004. With --target-processes=all, compute-sanitizer attaches to every mp.Process child. On CUDA 12.9.1 its IPC teardown analysis gets stuck, so the child never exits.
The parent's join(timeout=30) returns after 30 seconds, but the child is still alive. That zombie holds an open IPC memory handle. When pytest runs fixture teardown (ipc_memory_resource's mr.close()), it blocks on the held handle indefinitely. The runner gets tied up for 4–6 hours until GitHub Actions force-cancels the job.
Fix
After join(timeout=...), kill any process that is still alive. This forces IPC handle release before fixture teardown runs. The test still fails (exit code is non-zero, or completed is False), just in seconds instead of hours.
No behavior change when children exit cleanly.
Related
Closes #2004