I'm running into a bottleneck when launching many Thrust scan instances concurrently across multiple streams and CPU threads. I believe the problem is that Thrust uses CUB's scan implementation underneath, which needs temporary storage to perform the operation, and allocating and deallocating that temporary storage causes the entire device to synchronize on a cudaFree() call.
CUDA 11.2 adds support for cudaMallocAsync() and cudaFreeAsync() (the stream-ordered memory allocator), so if a stream is supplied to the Thrust execution policy, shouldn't the default behavior be for the temporary-storage malloc and free calls to happen on that same stream?
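To make the request concrete, here is a rough sketch of what I can do from the user side today, assuming Thrust routes its temporary storage through an allocator passed to the execution policy (the pattern from Thrust's custom temporary allocation example). The names `stream_ordered_allocator` and `scan_on_stream_demo` are mine and purely illustrative, and I have not confirmed that this removes every synchronization inside the scan; it only shows the behavior I would expect as the default.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <new>

// Hypothetical allocator (not part of Thrust): serves temporary-storage
// requests with the CUDA 11.2 stream-ordered allocator on a fixed stream.
struct stream_ordered_allocator {
  using value_type = char;

  cudaStream_t stream;

  explicit stream_ordered_allocator(cudaStream_t s) : stream(s) {}

  char* allocate(std::ptrdiff_t num_bytes) {
    void* p = nullptr;
    if (cudaMallocAsync(&p, static_cast<std::size_t>(num_bytes), stream) != cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<char*>(p);
  }

  void deallocate(char* p, std::size_t /*num_bytes*/) {
    // Ordered after the work already enqueued on `stream`, so no
    // device-wide synchronization like cudaFree() is required.
    cudaFreeAsync(p, stream);
  }
};

// Runs one inclusive scan on its own stream and checks the last element.
bool scan_on_stream_demo() {
  cudaStream_t stream;
  if (cudaStreamCreate(&stream) != cudaSuccess) return false;

  bool ok = false;
  {
    const int n = 1 << 20;
    thrust::device_vector<int> data(n, 1);  // n ones
    stream_ordered_allocator alloc(stream);

    // Allocator and stream are both attached to the execution policy.
    thrust::inclusive_scan(thrust::cuda::par(alloc).on(stream),
                           data.begin(), data.end(), data.begin());

    cudaStreamSynchronize(stream);
    ok = (data.back() == n);  // the inclusive scan of n ones ends in n
  }
  cudaStreamDestroy(stream);
  return ok;
}
```

In the real use case each CPU thread would hold its own stream and allocator instance, and the `device_vector` setup here (which still goes through the regular allocation path) would be replaced by preallocated buffers.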
I can provide a minimal working example if needed.