Support for stream ordered memory allocators. #768

@neoblizz

I am running into a bottleneck when launching many Thrust scan instances concurrently across multiple streams and CPU threads. I believe the problem is that Thrust uses CUB's underlying scan implementation, which needs temporary storage to perform the operation. Allocating and deallocating this small temporary buffer causes the entire device to synchronize on each cudaFree() call.

CUDA 11.2 adds support for cudaMallocAsync() and cudaFreeAsync() (stream-ordered memory allocators). So if a stream is supplied to the Thrust execution policy, shouldn't the default behavior be to perform the malloc and free calls for the temporary storage on that same stream?
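
For illustration, here is a minimal sketch of how this behavior could be approximated today, assuming CUDA 11.2 or newer and a Thrust version that accepts a custom temporary-storage allocator through `thrust::cuda::par(alloc).on(stream)`. The names `stream_ordered_allocator` and `scan_on_stream` are hypothetical and not part of Thrust; this is not the library's default behavior, only one way the requested semantics might look from user code.

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical allocator (not part of Thrust): serves Thrust's temporary
// storage requests with stream-ordered allocations on a single stream.
struct stream_ordered_allocator
{
  using value_type = char;

  explicit stream_ordered_allocator(cudaStream_t s) : stream(s) {}

  char* allocate(std::ptrdiff_t num_bytes)
  {
    void* p = nullptr;
    if (cudaMallocAsync(&p, static_cast<std::size_t>(num_bytes), stream) != cudaSuccess)
      throw std::bad_alloc();
    return static_cast<char*>(p);
  }

  void deallocate(char* p, std::size_t)
  {
    // Ordered after the work already enqueued on `stream`;
    // unlike cudaFree(), this does not synchronize the whole device.
    cudaFreeAsync(p, stream);
  }

  cudaStream_t stream;
};

// Hypothetical helper: inclusive scan of host data on a private stream.
std::vector<int> scan_on_stream(const std::vector<int>& host_in)
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  std::vector<int> host_out(host_in.size());
  {
    // Note: device_vector still allocates and frees with the default
    // (synchronizing) path; only the scan's temporary storage is
    // routed through the stream-ordered allocator below.
    thrust::device_vector<int> d(host_in.begin(), host_in.end());

    stream_ordered_allocator alloc(stream);
    thrust::inclusive_scan(thrust::cuda::par(alloc).on(stream),
                           d.begin(), d.end(), d.begin());

    cudaMemcpyAsync(host_out.data(), thrust::raw_pointer_cast(d.data()),
                    host_out.size() * sizeof(int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
  }
  cudaStreamDestroy(stream);
  return host_out;
}
```

In a multi-threaded setting, each CPU thread would own its stream and its own allocator instance, so the temporary buffers of concurrent scans never pass through cudaFree(). Error handling is kept deliberately thin here.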

I can provide a minimal working example if needed.

Labels

thrust: For all items related to Thrust.

Status

Done
