Bug Description
On multi-GPU systems, wp.copy(dst, src) where src is non-contiguous on one CUDA device and dst is contiguous on another CUDA device can occasionally read from a staging buffer whose memory has already been recycled by the source device's memory pool. The destination ends up with stale data from an earlier copy.
Observed on a Windows multi-GPU runner (2x A40, no CUDA peer access) as a rare test_copy_indexed_cuda1_cuda0 failure where the first 16 bytes of the 256-element destination array b4 carried the 16 bytes that should have been written to b1 in a preceding wp.copy call:
AssertionError:
Arrays are not equal
Mismatched elements: 4 / 256 (1.56%)
[0, 0, 0, 0]: 1111.0 (ACTUAL), 1.0 (DESIRED)
[0, 0, 0, 1]: 1115.0 (ACTUAL), 5.0 (DESIRED)
[0, 0, 0, 2]: 1118.0 (ACTUAL), 8.0 (DESIRED)
[0, 0, 0, 3]: 1119.0 (ACTUAL), 9.0 (DESIRED)
[1, 5, 8, 9] is exactly the payload of the 1D destination b1 from the earlier wp.copy(b1, a1) call in the same test.
Reproduction
The failing test is warp.tests.test_copy.TestCopy.test_copy_indexed_cuda1_cuda0. It pins the source and destination layouts via test.assertFalse(a_N.is_contiguous) and test.assertTrue(b_N.is_contiguous) at warp/tests/test_copy.py:165-173, so the non-contiguous-source-to-contiguous-destination path is the one exercised. The failing assertion is assert_np_equal(a4.numpy(), b4.numpy()) at line 185, after four sequential wp.copy(b_N, a_N) calls across devices. The failure rate on the Windows multi-GPU runner is low enough that retries usually pass.
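For context, here is a minimal sketch of the pattern the test exercises. This is not the actual test code; shapes, values, and device names are illustrative only:

```python
import numpy as np
import warp as wp

wp.init()

src_dev, dst_dev = "cuda:1", "cuda:0"

for i in range(4):
    # Indexed view of a larger array -> non-contiguous source on cuda:1.
    data = wp.array(np.arange(512, dtype=np.float32) + 1000.0 * i, device=src_dev)
    idx = wp.array(np.arange(0, 512, 2, dtype=np.int32), device=src_dev)
    a = wp.indexedarray(data, [idx])

    # Contiguous destination on cuda:0.
    b = wp.zeros(256, dtype=wp.float32, device=dst_dev)

    # Cross-device copy: Warp first stages the indexed source into a temporary
    # contiguous buffer on cuda:1, then enqueues the peer copy to cuda:0.
    wp.copy(b, a)

    # With the race, b can occasionally hold data left over from an earlier
    # iteration's staging buffer instead of this iteration's values.
    np.testing.assert_array_equal(b.numpy(), a.numpy())
```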
Root Cause
In warp/_src/context.py, the cross-device copy path (copy() around line 10181) allocates a staging buffer on the source device via src = src.contiguous(). After the peer copy is enqueued on the destination device's stream via cuMemcpyPeerAsync, copy() returns, the local src reference to the staging buffer is dropped, and its __del__ runs cudaFreeAsync(ptr, NULL) on the source device.
cudaFreeAsync on the null stream orders the free against the source device's streams only. The pending peer DMA on the destination device's stream is not tracked by the source device's pool, so the pool can recycle the staging slot while the DMA is still reading from it. The wait_stream at the start of the next wp.copy does not help, because the free was already scheduled when the previous wp.copy returned.
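For illustration only, and not the actual fix: the missing ordering can be expressed with Warp's public stream and event API, assuming the staging buffer's stream-ordered free were issued on the source device's current stream rather than the null stream. Device names below are placeholders.

```python
import warp as wp

wp.init()

src_device, dst_device = "cuda:1", "cuda:0"   # placeholder device names

src_stream = wp.get_stream(src_device)   # stream that would order the staging free
dst_stream = wp.get_stream(dst_device)   # stream that runs the peer copy

# In the copy path this would sit right after cuMemcpyPeerAsync is enqueued:
# record an event on the destination stream and make the source stream wait
# on it. If the staging buffer's free were then issued on src_stream instead
# of the null stream, the source pool could not hand the slot to a later
# allocation until the peer DMA has finished reading it.
copy_done = dst_stream.record_event()
src_stream.wait_event(copy_done)
```

Keeping the staging buffer referenced until an event recorded after the peer copy has completed would give the same ordering.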
On systems without P2P access, cuMemcpyPeerAsync stages through host memory, which widens the window between peer copy enqueue and the device-side read, making the race easier to hit.
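A possible application-level mitigation that follows from this analysis (untested, and heavy-handed): drain the destination device after each cross-device copy from a non-contiguous source, so the peer DMA has finished reading the staging buffer before any later copy can allocate from, and recycle, the source pool. Sketch:

```python
import numpy as np
import warp as wp

wp.init()

data = wp.array(np.arange(512, dtype=np.float32), device="cuda:1")
idx = wp.array(np.arange(0, 512, 2, dtype=np.int32), device="cuda:1")
a = wp.indexedarray(data, [idx])                      # non-contiguous source
b = wp.zeros(256, dtype=wp.float32, device="cuda:0")  # contiguous destination

wp.copy(b, a)
# Drain the destination device so the peer DMA has finished reading the
# staging buffer before anything else can allocate from cuda:1's pool and
# receive the recycled slot.
wp.synchronize_device("cuda:0")
```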
System Information
- Warp 1.13.0.dev (reproduced on Windows multi-GPU CI runner)
- Windows, 2x NVIDIA A40 (sm_86), mempool enabled, CUDA peer access not supported
- CUDA Toolkit 12.9, Driver 12.8
- In principle the same race can occur on Linux multi-GPU systems, though it has not been observed there