wp.copy() cross-device non-contig source can read recycled staging buffer #1384

@shi-eric

Description

Bug Description

On multi-GPU systems, wp.copy(dst, src) where src is non-contiguous on one CUDA device and dst is contiguous on another CUDA device can occasionally read from a staging buffer whose memory has already been recycled by the source device's memory pool. The destination ends up with stale data from an earlier copy.

Observed on a Windows multi-GPU runner (2x A40, no CUDA peer access) as a rare test_copy_indexed_cuda1_cuda0 failure where the first 16 bytes of the 256-element destination array b4 carried the 16 bytes that should have been written to b1 in a preceding wp.copy call:

AssertionError:
Arrays are not equal
Mismatched elements: 4 / 256 (1.56%)
 [0, 0, 0, 0]: 1111.0 (ACTUAL), 1.0 (DESIRED)
 [0, 0, 0, 1]: 1115.0 (ACTUAL), 5.0 (DESIRED)
 [0, 0, 0, 2]: 1118.0 (ACTUAL), 8.0 (DESIRED)
 [0, 0, 0, 3]: 1119.0 (ACTUAL), 9.0 (DESIRED)

[1, 5, 8, 9] is exactly the payload of the 1D destination b1 from the earlier wp.copy(b1, a1) call in the same test.

Reproduction

The failing test is warp.tests.test_copy.TestCopy.test_copy_indexed_cuda1_cuda0. It pins the array types via test.assertFalse(a_N.is_contiguous) and test.assertTrue(b_N.is_contiguous) at warp/tests/test_copy.py:165-173, so the non-contiguous-source to contiguous-destination path is the one exercised. The failing assertion is assert_np_equal(a4.numpy(), b4.numpy()) at line 185, after four sequential wp.copy(b_N, a_N) calls across devices. The failure rate on the Windows multi-GPU runner is low enough that retries usually pass.

Root cause

In warp/_src/context.py, the cross-device copy path (copy() around line 10181) allocates a staging buffer on the source device via src = src.contiguous(). After the peer copy is enqueued on the destination device's stream via cuMemcpyPeerAsync, copy() returns, the local src reference to the staging buffer is dropped, and its __del__ runs cudaFreeAsync(ptr, NULL) on the source device.

cudaFreeAsync on the null stream orders the free against the source device's streams only. The pending peer DMA on the destination device's stream is not tracked by the source device's pool, so the pool can recycle the staging slot while the DMA is still reading from it. The wait_stream at the start of the next wp.copy does not help, because the free was already scheduled when the previous wp.copy returned.
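The ordering hazard can be modeled in pure Python (no GPU required). The class and timeline below are illustrative stand-ins for the CUDA stream-ordered allocator, not Warp's actual implementation; which copy ends up as the victim depends on timing, and this sketch shows one possible interleaving:

```python
# Toy model of the race: a "free" that only orders against the owning
# device makes the slot reusable while a peer DMA read is still pending.

class ToyPool:
    """Toy stream-ordered pool: a freed slot becomes reusable as soon as
    the owning device's own work drains. Pending reads issued from another
    device's stream (the peer DMA) are invisible to it."""

    def __init__(self):
        self._free = []

    def alloc(self, size):
        # Prefer a recycled slot, mimicking cudaMallocAsync reuse.
        for buf in self._free:
            if len(buf) >= size:
                self._free.remove(buf)
                return buf
        return bytearray(size)

    def free_async(self, buf):
        # cudaFreeAsync(ptr, NULL): ordered against the source device
        # only, so the slot is immediately eligible for reuse here.
        self._free.append(buf)


pool = ToyPool()

# copy(b1, a1): gather a1 into a staging buffer, enqueue the peer DMA
# (modeled as a deferred read), then drop the staging reference.
staging1 = pool.alloc(16)
staging1[:] = b"payload-for-b1.."           # written by the gather kernel
pending_dma = lambda: bytes(staging1)       # DMA has not executed yet
pool.free_async(staging1)                   # __del__ -> cudaFreeAsync

# copy(b2, a2): the pool hands back the recycled slot...
staging2 = pool.alloc(16)
assert staging2 is staging1                 # same memory
staging2[:] = b"payload-for-b2.."           # ...and the next gather clobbers it

# The deferred DMA for the first copy finally reads the wrong payload.
assert pending_dma() == b"payload-for-b2.."
```

The essential point the model captures is that the pool's recycling decision and the peer DMA's read are ordered on different devices' streams, so nothing serializes them.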

On systems without P2P access, cuMemcpyPeerAsync stages through host memory, which widens the window between peer copy enqueue and the device-side read, making the race easier to hit.
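One possible fix shape, again modeled in pure Python: defer recycling of the staging slot until the destination stream signals that the peer DMA has completed (in CUDA terms, record an event on the destination stream after cuMemcpyPeerAsync and make the staging buffer's release wait on it). The class and names below are hypothetical, a sketch of the required ordering rather than a proposed patch:

```python
# Toy pool whose freed slots carry a completion callback standing in for
# a CUDA event recorded on the destination stream after the peer DMA.

class EventOrderedPool:
    """alloc() skips freed slots whose pending peer reads have not
    signaled completion yet."""

    def __init__(self):
        self._free = []                     # list of (buf, event_done)

    def alloc(self, size):
        for buf, done in self._free:
            if len(buf) >= size and done():
                self._free.remove((buf, done))
                return buf
        return bytearray(size)

    def free_async(self, buf, done):
        self._free.append((buf, done))


pool = EventOrderedPool()
dma_done = False                            # "event" on the dst stream

staging1 = pool.alloc(16)
staging1[:] = b"payload-for-b1.."
pool.free_async(staging1, lambda: dma_done)

# The next copy cannot reuse the slot while the DMA is outstanding:
staging2 = pool.alloc(16)
assert staging2 is not staging1

# Once the destination stream signals completion, the slot is reusable.
dma_done = True
staging3 = pool.alloc(16)
assert staging3 is staging1
```

An equivalent effect could presumably be had by making the source stream wait on the destination-stream event before the cudaFreeAsync is issued; either way, the recycling must be ordered after the peer DMA, not merely after source-device work.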

System Information

  • Warp 1.13.0.dev (reproduced on Windows multi-GPU CI runner)
  • Windows, 2x NVIDIA A40 (sm_86), mempool enabled, CUDA peer access not supported
  • CUDA Toolkit 12.9, Driver 12.8
  • In principle the same race can occur on Linux multi-GPU systems, though it has not been observed there
