Bug Description
On multi-GPU systems, wp.copy(dst, src) where src is non-contiguous on one CUDA device and dst is contiguous on another CUDA device can occasionally read from a staging buffer whose memory has already been recycled by the source device's memory pool. The destination ends up with stale data from an earlier copy.
Observed on a Windows multi-GPU runner (2x A40, no CUDA peer access) as a rare test_copy_indexed_cuda1_cuda0 failure where the first 16 bytes of the 256-element destination array b4 carried the 16 bytes that should have been written to b1 in a preceding wp.copy call:
AssertionError:
Arrays are not equal
Mismatched elements: 4 / 256 (1.56%)
[0, 0, 0, 0]: 1111.0 (ACTUAL), 1.0 (DESIRED)
[0, 0, 0, 1]: 1115.0 (ACTUAL), 5.0 (DESIRED)
[0, 0, 0, 2]: 1118.0 (ACTUAL), 8.0 (DESIRED)
[0, 0, 0, 3]: 1119.0 (ACTUAL), 9.0 (DESIRED)
[1, 5, 8, 9] is exactly the payload of the 1D destination b1 from the earlier wp.copy(b1, a1) call in the same test.
Reproduction
The failing test is warp.tests.test_copy.TestCopy.test_copy_indexed_cuda1_cuda0. It pins the source and destination layouts via test.assertFalse(a_N.is_contiguous) and test.assertTrue(b_N.is_contiguous) at warp/tests/test_copy.py:165-173, so the non-contiguous-source-to-contiguous-destination path is the one exercised. The failing assertion is assert_np_equal(a4.numpy(), b4.numpy()) at line 185, after four sequential wp.copy(b_N, a_N) calls across devices. The failure rate on the Windows multi-GPU runner is low enough that retries usually pass.
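For context, here is a minimal sketch of the pattern the test exercises. This is not the actual test code; shapes, values, and device names are illustrative only:

```python
import numpy as np
import warp as wp

wp.init()

src_dev, dst_dev = "cuda:1", "cuda:0"

for i in range(4):
    # Indexed view of a larger array -> non-contiguous source on cuda:1.
    data = wp.array(np.arange(512, dtype=np.float32) + 1000.0 * i, device=src_dev)
    idx = wp.array(np.arange(0, 512, 2, dtype=np.int32), device=src_dev)
    a = wp.indexedarray(data, [idx])

    # Contiguous destination on cuda:0.
    b = wp.zeros(256, dtype=wp.float32, device=dst_dev)

    # Cross-device copy: Warp first stages the indexed source into a temporary
    # contiguous buffer on cuda:1, then enqueues the peer copy to cuda:0.
    wp.copy(b, a)

    # With the race, b can occasionally hold data left over from an earlier
    # iteration's staging buffer instead of this iteration's values.
    np.testing.assert_array_equal(b.numpy(), a.numpy())
```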
Root Cause
In warp/_src/context.py, the cross-device copy path (copy() around line 10181) allocates a staging buffer on the source device via src = src.contiguous(). After the peer copy is enqueued on the destination device's stream via cuMemcpyPeerAsync, copy() returns, the local src reference to the staging buffer is dropped, and its __del__ runs cudaFreeAsync(ptr, NULL) on the source device.
cudaFreeAsync on the null stream orders the free against the source device's streams only. The pending peer DMA on the destination device's stream is not tracked by the source device's pool, so the pool can recycle the staging slot while the DMA is still reading from it. The wait_stream at the start of the next wp.copy does not help, because the free was already scheduled when the previous wp.copy returned.
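For illustration only, and not the actual fix: the missing ordering can be expressed with Warp's public stream and event API, assuming the staging buffer's stream-ordered free were issued on the source device's current stream rather than the null stream. Device names below are placeholders.

```python
import warp as wp

wp.init()

src_device, dst_device = "cuda:1", "cuda:0"   # placeholder device names

src_stream = wp.get_stream(src_device)   # stream that would order the staging free
dst_stream = wp.get_stream(dst_device)   # stream that runs the peer copy

# In the copy path this would sit right after cuMemcpyPeerAsync is enqueued:
# record an event on the destination stream and make the source stream wait
# on it. If the staging buffer's free were then issued on src_stream instead
# of the null stream, the source pool could not hand the slot to a later
# allocation until the peer DMA has finished reading it.
copy_done = dst_stream.record_event()
src_stream.wait_event(copy_done)
```

Keeping the staging buffer referenced until an event recorded after the peer copy has completed would give the same ordering.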
On systems without P2P access, cuMemcpyPeerAsync stages through host memory, which widens the window between peer copy enqueue and the device-side read, making the race easier to hit.
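A possible application-level mitigation that follows from this analysis (untested, and heavy-handed): drain the destination device after each cross-device copy from a non-contiguous source, so the peer DMA has finished reading the staging buffer before any later copy can allocate from, and recycle, the source pool. Sketch:

```python
import numpy as np
import warp as wp

wp.init()

data = wp.array(np.arange(512, dtype=np.float32), device="cuda:1")
idx = wp.array(np.arange(0, 512, 2, dtype=np.int32), device="cuda:1")
a = wp.indexedarray(data, [idx])                      # non-contiguous source
b = wp.zeros(256, dtype=wp.float32, device="cuda:0")  # contiguous destination

wp.copy(b, a)
# Drain the destination device so the peer DMA has finished reading the
# staging buffer before anything else can allocate from cuda:1's pool and
# receive the recycled slot.
wp.synchronize_device("cuda:0")
```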
System Information
- Warp 1.13.0.dev (reproduced on Windows multi-GPU CI runner)
- Windows, 2x NVIDIA A40 (sm_86), mempool enabled, CUDA peer access not supported
- CUDA Toolkit 12.9, Driver 12.8
- In principle the same race can occur on Linux multi-GPU systems, though it has not been observed there