Is this a duplicate?
Area
Thrust
Is your feature request related to a problem? Please describe.
We want to be able to address common workloads for sparse-attention mechanisms. Some of the key requirements are:
- batch sizes in the range [1, 64] (can be provided at compile time)
- segment size in the range [512, 1M] (provided via device-accessible iterator)
- k in the range [100, 9999], with a focus on k=2048
- key type is a composite type consisting of an f32 value together with an i32 index
- input is provided from a contiguous iterator, where each segment is specified by an offset into that iterator. Encountering NaN or +/-Inf can be considered UB.
- output requirements:
  - sorted output (want to be able to specify the tie-breaker criterion)
  - deterministic selection, achieved via fusing f32 keys with i32 indexes
- targeted GPUs: B200, B300, and beyond
- support for CUDA graphs
This issue can be closed with benchmark results comparing cub::DeviceTopK with segmented TopK using clusters.
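
To make the "fusing f32 keys with i32 indexes" requirement concrete, here is a minimal host-side sketch of one way it could work. It is not the CUB API and not the proposed cluster implementation. The helper names (`float_to_ordered_bits`, `fuse_key`, `fused_index`, `segmented_topk_reference`) are hypothetical, and the tie-breaker (smaller index wins among equal values) is an assumption chosen for illustration. Such a reference could also serve as a correctness check for benchmark outputs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Map an f32 to a u32 whose unsigned order matches the float's numeric order.
// NaN and +/-Inf get no special handling (the issue treats them as UB).
inline std::uint32_t float_to_ordered_bits(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    // Negative values: flip all bits. Non-negative values: set the sign bit.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Fuse value and index into one 64-bit key (hypothetical layout):
// high 32 bits = order-preserving value bits, low 32 bits = ~index.
// Comparing fused keys in descending order sorts by value, and among equal
// values the smaller index compares larger, so it is selected first.
inline std::uint64_t fuse_key(float value, std::int32_t index) {
    if (value == 0.0f) value = 0.0f;  // canonicalize -0.0 to +0.0
    const std::uint64_t hi = float_to_ordered_bits(value);
    const std::uint32_t lo = ~static_cast<std::uint32_t>(index);
    return (hi << 32) | lo;
}

// Recover the index from the low 32 bits of a fused key.
inline std::int32_t fused_index(std::uint64_t key) {
    return static_cast<std::int32_t>(~static_cast<std::uint32_t>(key));
}

// Host reference: segment s covers [offsets[s], offsets[s + 1]) of `values`.
// Returns, per segment, the segment-relative indices of the k largest values,
// sorted by descending value with ties broken by smaller index.
std::vector<std::vector<std::int32_t>> segmented_topk_reference(
    const std::vector<float>& values,
    const std::vector<std::size_t>& offsets,
    std::size_t k) {
    assert(!offsets.empty() && offsets.back() <= values.size());
    std::vector<std::vector<std::int32_t>> result;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        assert(offsets[s] <= offsets[s + 1]);
        const std::size_t begin = offsets[s];
        const std::size_t end = offsets[s + 1];
        std::vector<std::uint64_t> keys;
        keys.reserve(end - begin);
        for (std::size_t i = begin; i < end; ++i) {
            keys.push_back(fuse_key(values[i], static_cast<std::int32_t>(i - begin)));
        }
        const std::size_t take = std::min(k, keys.size());
        // Fused keys are unique within a segment, so the result is deterministic.
        std::partial_sort(keys.begin(),
                          keys.begin() + static_cast<std::ptrdiff_t>(take),
                          keys.end(), std::greater<std::uint64_t>());
        std::vector<std::int32_t> top;
        top.reserve(take);
        for (std::size_t j = 0; j < take; ++j) {
            top.push_back(fused_index(keys[j]));
        }
        result.push_back(top);
    }
    return result;
}
```

Because every fused key within a segment is unique, any correct selection over these keys returns the same elements in the same order regardless of scheduling. A different tie-breaker (for example, larger index wins) would only change how the low 32 bits are encoded.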
Describe the solution you'd like
Prototype cluster optimization for segmented TopK
Describe alternatives you've considered
No response
Additional context
No response