[FEA]: PoC segmented TopK cluster optimization #9077

@gevtushenko

Is this a duplicate?

Area

Thrust

Is your feature request related to a problem? Please describe.

We want to be able to address common workloads for sparse-attention mechanisms. Some of the key requirements are:

  • batch sizes in the range [1, 64] (can be provided at compile time)
  • segment size is in the range of [512, 1M] (provided via device-accessible iterator)
  • k is in the range of [100, 9999], focus on k=2048
  • key type is a composite of an f32 value and an i32 index
  • input is provided through a contiguous iterator, and each segment is specified by an offset into that iterator. NaN or +/-Inf in the input can be treated as undefined behavior (UB).
  • output requirements:
    • sorted output (want to be able to specify the tie-breaker criterion),
    • [deterministic selection, achieved via fusing f32 keys with i32 indexes].
  • targeted GPUs: B200, B300, and beyond
  • support for CUDA graphs
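
To make the "fused key" requirement concrete, here is a minimal host-side sketch of one way an f32 value and an i32 index could be packed into a single 64-bit key so that selection and ordering are deterministic. It is not the CUB implementation. The names `ordered_bits`, `fuse_key`, `index_of`, and `segmented_topk_reference` are hypothetical, and the tie-breaker (smaller index wins among equal values) is an assumption chosen for illustration. The sketch also assumes "top" means largest and that inputs contain no NaN, in line with the UB note above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Map an f32 to a u32 whose unsigned order matches the float's numeric order.
// NaN is not handled here (the requirements treat NaN as UB).
inline std::uint32_t ordered_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u));
    // Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

// Fuse value and index into one 64-bit key: a larger value gives a larger key.
// Assumed tie-breaker: among equal values, the smaller index gets the larger key.
inline std::uint64_t fuse_key(float value, std::int32_t index) {
    const std::uint64_t hi = ordered_bits(value);
    const std::uint64_t lo = ~static_cast<std::uint32_t>(index);
    return (hi << 32) | lo;
}

// Recover the index stored in the low 32 bits of a fused key.
inline std::int32_t index_of(std::uint64_t key) {
    return static_cast<std::int32_t>(~static_cast<std::uint32_t>(key));
}

// CPU reference: for each segment [offsets[s], offsets[s + 1]) return the
// segment-relative indices of the k largest values, sorted in descending order.
inline std::vector<std::vector<std::int32_t>> segmented_topk_reference(
    const std::vector<float>& values,
    const std::vector<std::int64_t>& offsets,
    std::size_t k) {
    std::vector<std::vector<std::int32_t>> result;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        const std::int64_t begin = offsets[s];
        const std::int64_t end = offsets[s + 1];
        std::vector<std::uint64_t> keys;
        keys.reserve(static_cast<std::size_t>(end - begin));
        for (std::int64_t i = begin; i < end; ++i) {
            keys.push_back(fuse_key(values[static_cast<std::size_t>(i)],
                                    static_cast<std::int32_t>(i - begin)));
        }
        const std::size_t kk = std::min(k, keys.size());
        // Because fused keys are unique within a segment, the result does not
        // depend on the order in which equal values are visited.
        std::partial_sort(keys.begin(),
                          keys.begin() + static_cast<std::ptrdiff_t>(kk),
                          keys.end(), std::greater<std::uint64_t>());
        std::vector<std::int32_t> top;
        top.reserve(kk);
        for (std::size_t j = 0; j < kk; ++j) {
            top.push_back(index_of(keys[j]));
        }
        result.push_back(top);
    }
    return result;
}
```

A reference like this could also serve as a correctness oracle when benchmarking a cluster-based prototype: since every fused key in a segment is distinct, any correct implementation using the same key layout should produce identical output.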

This issue can be closed with benchmark results comparing cub::DeviceTopK with segmented TopK using clusters.

Describe the solution you'd like

Prototype cluster optimization for segmented TopK

Describe alternatives you've considered

No response

Additional context

No response

Status
In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions