tests: kill zombie IPC child processes after join timeout#2121

Open
aryanputta wants to merge 4 commits into NVIDIA:main from aryanputta:fix/ipc-test-zombie-hang


Contributor

@aryanputta aryanputta commented May 21, 2026

What

TestIpcSendBuffers and TestIpcReexport call process.join(timeout=30) to wait for child processes but never kill them if the join times out. This causes multi-hour CI hangs.

Root cause

ci/tools/env-vars enables compute-sanitizer (--target-processes=all) on Python 3.12 + local CTK + Linux — exactly the combination in every hanging run from issue #2004. With --target-processes=all, compute-sanitizer attaches to every mp.Process child. On CUDA 12.9.1 its IPC teardown analysis gets stuck, so the child never exits.

The parent's join(timeout=30) returns after 30 seconds, but the child is still alive. That hung child (not a zombie in the strict sense, since it has not exited) holds an open IPC memory handle. When pytest runs fixture teardown (ipc_memory_resource's mr.close()), it blocks on the held handle indefinitely. The runner stays tied up for 4–6 hours until GitHub Actions force-cancels the job.

Fix

After join(timeout=...), kill any process that is still alive. This forces IPC handle release before fixture teardown runs. The test still fails (exit code is non-zero, or completed is False), just in seconds instead of hours.

No behavior change when children exit cleanly.

Related

Closes #2004

Contributor

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label May 21, 2026
When Python 3.12 CI runs, env-vars enables compute-sanitizer with
--target-processes=all, which attaches to every mp.Process child the
tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown
gets stuck, so child processes never exit. The existing
join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive.

That zombie keeps its IPC handle open. When pytest teardown runs
ipc_memory_resource's mr.close(), it blocks waiting for the handle to
be released — tying up the runner for hours until GitHub Actions
force-cancels the job. This is the exact pattern in issue NVIDIA#2004
(always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC
handle is released before fixture teardown. Tests still fail (exit code
is non-zero or completed is False), just in seconds rather than hours.

Fixes NVIDIA#2004
@aryanputta aryanputta force-pushed the fix/ipc-test-zombie-hang branch from 72ccbe7 to df61e49 Compare May 21, 2026 16:10
@rwgk rwgk requested a review from Andy-Jost May 21, 2026 16:17
@rwgk rwgk added the P0 High priority - Must do! label May 21, 2026
@rwgk rwgk added this to the cuda.core next milestone May 21, 2026
@rwgk rwgk added the bug Something isn't working label May 21, 2026
@rwgk rwgk self-requested a review May 21, 2026 16:23
Contributor

rwgk commented May 21, 2026

Thanks @aryanputta for jumping in! I'm running a Cursor code review right now...

When Python 3.12 CI runs, env-vars enables compute-sanitizer with
--target-processes=all, which attaches to every mp.Process child the
tests spawn. On CUDA 12.9.1 the sanitizer analysis of IPC buffer teardown
gets stuck, so child processes never exit. The existing
join(timeout=CHILD_TIMEOUT_SEC) returns but leaves the child alive.

That zombie keeps its IPC handle open. When pytest teardown runs
ipc_memory_resource's mr.close(), it blocks waiting for the handle to
be released -- tying up the runner for hours until GitHub Actions
force-cancels the job. This is the exact pattern in issue NVIDIA#2004
(always Python 3.12 + CUDA 12.9.1 local).

Fix: after join(timeout=...), kill any process still alive so the IPC
handle is released before fixture teardown. A stderr warning is printed
when kill() fires so the failure is clearly attributable to the
sanitizer/IPC deadlock rather than appearing as a generic exitcode != 0.
Tests still fail (exit code is non-zero or completed is False), just in
seconds rather than hours.

Fixes NVIDIA#2004
@aryanputta aryanputta force-pushed the fix/ipc-test-zombie-hang branch from 948d30c to 5b802a0 Compare May 21, 2026 16:27
Contributor

@rwgk rwgk left a comment


PR 2121 Initial Review

No blocking findings.

The change matches the issue #2004 observed failure mode: after a child process
exceeds join(timeout=30), it is now killed before parent-side teardown can
block on IPC handles. The TestIpcReexport path also now asserts completed,
so a timed-out child becomes a real fast failure instead of quietly continuing
into teardown.

Residual risks

  • Other memory_ipc tests also use join(timeout=CHILD_TIMEOUT_SEC) without
    killing children. The evidence in issue 2004 points specifically at
    test_send_buffers.py, so I would not block this PR on broadening the fix,
    but it may be worth a follow-up helper if hangs reappear elsewhere.
  • Full CI has not run yet because the external-contributor PR is awaiting
    validation. Current checks are only metadata/pre-commit.
  • I could not run the focused test locally because TestVenv appears stale for
    this branch and fails importing cuda.core._memory._managed_buffer.
    git diff --check passed.
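
The follow-up helper mentioned in the first residual risk might look roughly like the sketch below. It is not taken from this PR or from #2124: the name `join_all_or_kill` and the shared-deadline behavior are assumptions made for illustration. It joins several children against one deadline, prints a stderr warning for each straggler, and kills it.

```python
import multiprocessing as mp
import sys
import time


def join_all_or_kill(processes, timeout):
    """Join every process against one shared deadline and kill stragglers.

    Returns the list of processes that had to be killed (empty if all exited).
    """
    deadline = time.monotonic() + timeout
    killed = []
    for proc in processes:
        # Each join only gets whatever is left of the shared budget.
        remaining = max(0.0, deadline - time.monotonic())
        proc.join(timeout=remaining)
        if proc.is_alive():
            print(
                f"warning: child pid={proc.pid} still alive after "
                f"{timeout}s; killing it",
                file=sys.stderr,
            )
            proc.kill()
            proc.join()  # reap so the exit code is available to the test
            killed.append(proc)
    return killed
```

A test could then assert `not join_all_or_kill([p1, p2], CHILD_TIMEOUT_SEC)` in place of each bare join, which keeps the timeout handling in one place across the memory_ipc tests.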

Contributor

rwgk commented May 21, 2026

Oh - @aryanputta please don't force push.

Contributor

rwgk commented May 21, 2026

/ok to test 036a8f5

@aryanputta
Contributor Author

> Oh - @aryanputta please don't force push.

sorry!!

Contributor

rwgk commented May 21, 2026

> > Oh - @aryanputta please don't force push.
>
> sorry!!

No worries. It was only very slightly confusing.

Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.

Contributor Author

aryanputta commented May 21, 2026

> > > Oh - @aryanputta please don't force push.
> >
> > sorry!!
>
> No worries. It was only very slightly confusing.
>
> Holding my breath that the two troublesome jobs (under PR #2118) will sail through now.

haha, me too. I always test locally as well and got no issues, so hopefully it's the same here.

Edit: did I stop it by accident? I clicked comment and I think I clicked stop. I am so sorry.

@aryanputta aryanputta closed this May 21, 2026

@aryanputta aryanputta reopened this May 21, 2026
Contributor

rwgk commented May 21, 2026

CI / Test linux-64 / Python 3.12, CUDA 13.2.1 (wheels), GPU rtx4090, wsl (push)

seems to be a flake. I'll keep an eye out here and will rerun when the rest has finished.

Contributor

rwgk commented May 21, 2026

This still seems to hang unfortunately:

CI / Test linux-64 / Python 3.12, CUDA 12.9.1 (local), GPU l4 (push)

@Andy-Jost for visibility

@aryanputta FYI: Andy is experimenting under #2124 with broader fixes.

Contributor

rwgk commented May 21, 2026

I cancelled the hanging job. Waiting for @Andy-Jost's experiments under #2124.

The fix in the initial commit only covered test_send_buffers.py. The
63-minute CI hang confirmed the deadlock was occurring in one of the
other test files. Applies the same kill() after join(timeout=...) pattern
to all remaining memory_ipc tests:

- test_errors.py (1 join)
- test_event_ipc.py (2 joins)
- test_ipc_duplicate_import.py (1 join)
- test_leaks.py (3 joins)
- test_memory_ipc.py (7 joins across 4 test classes)
- test_peer_access.py (2 joins)
- test_serialize.py (3 joins)

See issue NVIDIA#2004.
@aryanputta
Contributor Author

hey @rwgk @Andy-Jost -- after seeing the 63min hang in ci (step 29 died in the cuda.core test run) i checked the rest of the memory_ipc suite and found 7 more files with bare join(timeout=CHILD_TIMEOUT_SEC) and no kill after. the original commit here only covered test_send_buffers.py so the hang just moved to the next unprotected test.

just pushed a second commit that extends the same kill-after-join pattern to all remaining files:

  • test_memory_ipc.py (7 joins across 4 classes)
  • test_leaks.py (3 joins)
  • test_serialize.py (3 joins)
  • test_event_ipc.py (2 joins)
  • test_peer_access.py (2 joins)
  • test_errors.py (1 join)
  • test_ipc_duplicate_import.py (1 join)

andy's layered approach in #2124 is the cleaner fix so please prioritize that -- but if you need a fallback while #2124 is still in draft, this pr now covers the full suite. no need to run ci here unless #2124 doesn't work out
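
One rough way to repeat the audit for bare joins described above is a small text scan. This is a heuristic sketch and not how the files listed here were actually found: the function names, the regular expressions, and the five-line lookahead window are all assumptions, and a textual match can miss or misreport cases that a real review would catch.

```python
import re
from pathlib import Path

JOIN_RE = re.compile(r"\.join\(\s*timeout\s*=")
KILL_RE = re.compile(r"\.(kill|terminate)\(")


def find_bare_joins(source, lookahead=5):
    """Return 1-based line numbers of join(timeout=...) calls that are not
    followed by kill()/terminate() within `lookahead` lines or before the
    next join(timeout=...), whichever comes first."""
    lines = source.splitlines()
    bare = []
    for i, line in enumerate(lines):
        if not JOIN_RE.search(line):
            continue
        guarded = False
        for nxt in lines[i + 1 : i + 1 + lookahead]:
            if JOIN_RE.search(nxt):
                break  # a later join starts its own window
            if KILL_RE.search(nxt):
                guarded = True
                break
        if not guarded:
            bare.append(i + 1)
    return bare


def audit_directory(root):
    """Map each test_*.py file under `root` to its bare-join line numbers."""
    report = {}
    for path in sorted(Path(root).rglob("test_*.py")):
        hits = find_bare_joins(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `audit_directory` on the memory_ipc test directory would return a dictionary of file paths to suspicious line numbers, which can then be checked by hand.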


Labels

  • bug: Something isn't working
  • cuda.core: Everything related to the cuda.core module
  • P0: High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory_ipc/test_send_buffers.py::TestIpcReexport::test_main[DeviceMR] hanging in CI

2 participants