Support for stream ordered memory allocators.

I am running into a bottleneck when launching many `thrust` scans concurrently across multiple streams and CPU threads. As far as I can tell, `thrust` uses CUB's scan implementation underneath, which needs temporary storage to perform the operation. Allocating and freeing this small temporary buffer on every call causes the entire device to synchronize at the `cudaFree()` call.
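For context, here is a minimal sketch of the temporary-storage pattern in question, calling CUB directly instead of going through `thrust`. It assumes `int` data and uses a hypothetical `scan_workspace` struct (my own name, not part of CUB or Thrust) that keeps one scratch buffer per stream, so `cudaMalloc()`/`cudaFree()` only run when the buffer has to grow. This is one possible workaround, not how `thrust` does it internally.

```cuda
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not a CUB/Thrust type): scratch buffer owned by the
// caller, intended to be kept per stream / per thread and reused across scans.
struct scan_workspace {
  void*       d_temp = nullptr;
  std::size_t bytes  = 0;
};

// Inclusive prefix sum of n ints on `stream`, reusing ws.d_temp when it is
// already large enough.
cudaError_t inclusive_sum_reusing(scan_workspace& ws, const int* d_in,
                                  int* d_out, int n, cudaStream_t stream) {
  // Phase 1: with a null temp pointer, CUB only reports the bytes it needs.
  std::size_t needed = 0;
  cudaError_t err =
      cub::DeviceScan::InclusiveSum(nullptr, needed, d_in, d_out, n, stream);
  if (err != cudaSuccess) return err;
  if (needed == 0) needed = 1;  // keep the pointer non-null for phase 2

  if (needed > ws.bytes) {
    // Growing the buffer still pays for a device-wide cudaFree()/cudaMalloc(),
    // but only when the required size increases, not on every scan.
    cudaFree(ws.d_temp);
    ws.d_temp = nullptr;
    ws.bytes  = 0;
    err = cudaMalloc(&ws.d_temp, needed);
    if (err != cudaSuccess) return err;
    ws.bytes = needed;
  }

  // Phase 2: run the scan on the stream, using the caller-owned buffer.
  return cub::DeviceScan::InclusiveSum(ws.d_temp, ws.bytes, d_in, d_out, n,
                                       stream);
}
```

The call is asynchronous with respect to the host, so the stream needs to be synchronized before `d_out` is read, and `ws.d_temp` should be released with `cudaFree()` once the workspace is no longer needed.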

CUDA 11.2 adds support for stream-ordered memory allocation (`cudaMallocAsync()` / `cudaFreeAsync()`). If a stream is supplied to the thrust execution policy, shouldn't the default behavior be to allocate and free the temporary storage on that same stream?
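Until something like this is the default, the behavior can be approximated by passing a custom temporary allocator to the execution policy. The following is a rough sketch, assuming CUDA 11.2 or newer and a Thrust version that accepts `thrust::cuda::par(alloc).on(stream)`; `stream_ordered_allocator` and `scan_on_stream` are names made up for this example, not Thrust APIs.

```cuda
#include <thrust/scan.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <new>

// Hypothetical allocator (not part of Thrust): serves Thrust's temporary
// storage requests from the stream-ordered pool of a single stream.
struct stream_ordered_allocator {
  typedef char value_type;

  cudaStream_t stream;

  explicit stream_ordered_allocator(cudaStream_t s) : stream(s) {}

  char* allocate(std::ptrdiff_t num_bytes) {
    void* p = nullptr;
    // Allocation is ordered on `stream` instead of synchronizing the device.
    if (cudaMallocAsync(&p, static_cast<std::size_t>(num_bytes), stream) !=
        cudaSuccess) {
      throw std::bad_alloc();
    }
    return static_cast<char*>(p);
  }

  void deallocate(char* p, std::size_t) {
    // The free is queued behind the work already submitted to `stream`.
    cudaFreeAsync(p, stream);
  }
};

// Inclusive scan of n ints on `stream`, with temporary storage allocated and
// freed on that same stream.
void scan_on_stream(const int* d_in, int* d_out, int n, cudaStream_t stream) {
  stream_ordered_allocator alloc(stream);
  thrust::inclusive_scan(thrust::cuda::par(alloc).on(stream),
                         d_in, d_in + n, d_out);
}
```

This only changes where the temporary storage comes from. Depending on the Thrust version, the algorithm itself may still synchronize the stream before returning, so it is worth profiling to confirm that the device-wide synchronization is actually gone.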

I can provide a minimal working example if needed.

Support for stream ordered memory allocators. #768
