[FEA]: PoC segmented TopK cluster optimization

### Is this a duplicate?

- [x] I confirmed there appear to be no [duplicate issues](https://github.com/NVIDIA/cccl/issues) for this request and that I agree to the [Code of Conduct](CODE_OF_CONDUCT.md)

### Area

Thrust

### Is your feature request related to a problem? Please describe.

We want to address common workloads for sparse-attention mechanisms. The key requirements are:

- batch size in the range `[1, 64]` (can be provided at compile time)
- segment size in the range `[512, 1M]` (provided via a device-accessible iterator)
- `k` in the range `[100, 9999]`, with a focus on `k=2048`
- key type is a composite of an `f32` value and an `i32` index
- input is provided through a contiguous iterator, where each segment is specified by an offset into that iterator. Behavior on NaN or +/-Inf inputs can be considered undefined (UB).
- output requirements:
  - sorted output (with a user-specifiable tie-breaker criterion)
  - [deterministic selection, achieved by fusing `f32` keys with `i32` indices]
- targeted GPUs: B200, B300, and beyond
- support for CUDA graphs
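To make the "fused key" requirement concrete, here is a minimal host-side reference sketch in C++. It is not the proposed CUB implementation and uses no CCCL API. The names `ordered_bits`, `fuse`, `fused_index`, and `topk_segment` are hypothetical. It also makes assumptions the issue does not state: top-k selects the largest keys, indices are segment-local and non-negative, and ties go to the smaller index. Under those assumptions, packing the order-preserving bits of the `f32` key above the inverted `i32` index gives a single 64-bit value whose unsigned order is total, so selection and sorted output are deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Map an f32 to a u32 whose unsigned order matches the float order.
// NaN is not handled (the issue treats NaN and +/-Inf as UB).
// Note: with this mapping -0.0f orders strictly below +0.0f.
inline std::uint32_t ordered_bits(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof(u));
  // Negative values: flip all bits. Non-negative values: set the sign bit.
  return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Fuse key and index into one 64-bit sortable value.
// Assumed tie-breaker: among equal keys, the smaller index ranks higher,
// so the index is stored inverted in the low 32 bits.
inline std::uint64_t fuse(float key, std::int32_t index) {
  const std::uint32_t tie = ~static_cast<std::uint32_t>(index);
  return (static_cast<std::uint64_t>(ordered_bits(key)) << 32) | tie;
}

// Recover the segment-local index from a fused value.
inline std::int32_t fused_index(std::uint64_t fused) {
  return static_cast<std::int32_t>(~static_cast<std::uint32_t>(fused));
}

// Host reference: segment-local indices of the k largest keys in
// keys[begin, end), sorted by descending key, ties by ascending index.
std::vector<std::int32_t> topk_segment(const std::vector<float>& keys,
                                       std::size_t begin,
                                       std::size_t end,
                                       std::size_t k) {
  assert(begin <= end && end <= keys.size() && k <= end - begin);
  std::vector<std::uint64_t> fused;
  fused.reserve(end - begin);
  for (std::size_t i = begin; i < end; ++i) {
    fused.push_back(fuse(keys[i], static_cast<std::int32_t>(i - begin)));
  }
  // Partial sort places the k largest fused values first, in sorted order.
  std::partial_sort(fused.begin(), fused.begin() + k, fused.end(),
                    std::greater<std::uint64_t>());
  std::vector<std::int32_t> out(k);
  for (std::size_t j = 0; j < k; ++j) {
    out[j] = fused_index(fused[j]);
  }
  return out;
}
```

A reference like this could serve as a correctness oracle when benchmarking a device implementation: because the fused values are unique within a segment, any correct top-k over them must produce exactly the same sorted result.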



This issue can be closed once [benchmark](https://github.com/NVIDIA/cccl/blob/main/cub/benchmarks/bench/segmented_topk/variable/keys.cu) results comparing `cub::DeviceTopK` with a cluster-based segmented TopK are available.

### Describe the solution you'd like

Prototype a cluster optimization for segmented TopK.

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
