perf(group): gate the multi-key Top-K candidate finder on input size #227

Merged: ser-vasilich merged 1 commit into master from perf/group-topk-large-input-gate on Jun 5, 2026
@ser-vasilich
Collaborator

Summary

The TopK candidate finder in exec_group (the use_emit_filter && top_count_take > 0 && n_keys > 1 block) is single-threaded: it
builds one SoA hash table sized to n_scan * 4/3 (cc[]/ck64[]/ck32[])
and scans the full input sequentially, then refines
aggregates for the K winners in a second pass. The shortcut pays
off when n_groups ≪ n_scan and the K winners absorb most of the rows:
Pass-2 then re-aggregates only K rows' worth of state.
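
The two-pass shape described above can be illustrated with a minimal single-threaded sketch. This is not the code from exec_group: the struct and function names (`soa_ht`, `topk_row`, `ht_slot`, `pick_top_k`, `topk_two_pass`), the hash function, the 32-bit count width, the two-component key, and the SUM aggregate are all assumptions made for illustration. Only the array names cc/ck64/ck32 and the `n_scan * 4/3` sizing come from the description.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical SoA table; array names follow the PR text, widths are assumed. */
typedef struct {
    size_t    n_slots;
    uint32_t *cc;    /* row count per slot, 0 means empty */
    uint64_t *ck64;  /* first key component */
    uint32_t *ck32;  /* second key component */
} soa_ht;

typedef struct { uint64_t k64; uint32_t k32; uint64_t count; int64_t sum; } topk_row;

/* Linear probing: return the slot holding the key, or the empty slot for it. */
static size_t ht_slot(const soa_ht *t, uint64_t k64, uint32_t k32) {
    uint64_t h = (k64 ^ ((uint64_t)k32 << 32)) * 0x9E3779B97F4A7C15ULL;
    size_t s = (size_t)((h ^ (h >> 32)) % t->n_slots);
    while (t->cc[s] != 0 && (t->ck64[s] != k64 || t->ck32[s] != k32))
        s = (s + 1 == t->n_slots) ? 0 : s + 1;
    return s;
}

/* Keep the k slots with the largest counts in win[], sorted descending. */
static size_t pick_top_k(const soa_ht *t, size_t k, size_t *win) {
    size_t n_win = 0;
    for (size_t s = 0; s < t->n_slots; s++) {
        if (t->cc[s] == 0) continue;
        size_t pos = n_win;
        while (pos > 0 && t->cc[win[pos - 1]] < t->cc[s]) pos--;
        if (pos >= k) continue;                    /* not among the top k */
        size_t last = (n_win < k) ? n_win : k - 1; /* drop the tail if full */
        for (size_t j = last; j > pos; j--) win[j] = win[j - 1];
        win[pos] = s;
        if (n_win < k) n_win++;
    }
    return n_win;
}

/* Returns the number of winners written to out[] (capacity k). */
size_t topk_two_pass(const uint64_t *k64, const uint32_t *k32, const int64_t *val,
                     size_t n_scan, size_t k, topk_row *out) {
    soa_ht t;
    t.n_slots = n_scan * 4 / 3 + 1;  /* +1 guarantees an empty slot exists */
    t.cc   = calloc(t.n_slots, sizeof *t.cc);
    t.ck64 = calloc(t.n_slots, sizeof *t.ck64);
    t.ck32 = calloc(t.n_slots, sizeof *t.ck32);
    size_t *win = calloc(k ? k : 1, sizeof *win);
    assert(t.cc && t.ck64 && t.ck32 && win);

    /* Pass 1: count rows per composite key, skipping all other aggregates. */
    for (size_t i = 0; i < n_scan; i++) {
        size_t s = ht_slot(&t, k64[i], k32[i]);
        if (t.cc[s] == 0) { t.ck64[s] = k64[i]; t.ck32[s] = k32[i]; }
        t.cc[s]++;
    }

    size_t n_win = pick_top_k(&t, k, win);
    for (size_t w = 0; w < n_win; w++)
        out[w] = (topk_row){ t.ck64[win[w]], t.ck32[win[w]], 0, 0 };

    /* Pass 2: rescan and aggregate only for the K winning keys. */
    for (size_t i = 0; i < n_scan; i++)
        for (size_t w = 0; w < n_win; w++)
            if (out[w].k64 == k64[i] && out[w].k32 == k32[i]) {
                out[w].count++;
                out[w].sum += val[i];
                break;
            }

    free(win); free(t.cc); free(t.ck64); free(t.ck32);
    return n_win;
}
```

In this sketch Pass 1 touches one random slot per input row, which is cheap while the table stays cache-resident and becomes the dominant cost once it does not.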

For uniform high-cardinality inputs (10M rows × ~10M distinct
composite keys, e.g. ClickBench q32 by {WatchID, ClientIP}) the
SoA HT is hundreds of MB, every probe is an L3/DRAM miss, the
single-threaded scan is latency-bound, and Pass-2 gains nothing
because nearly every group already has count = 1. The parallel
radix_v2_phase1_fn path with per-(worker, partition) shards runs
~3-4× faster on such inputs.
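
A rough footprint estimate makes the cache argument concrete. The sketch below assumes one 16-bit cc entry (suggested by the movzwl load in the profile), one 64-bit ck64 entry, and one 32-bit ck32 entry per slot, with exactly `n_scan * 4/3` slots. The function names are hypothetical, and the real table may round its size up or carry additional per-slot state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Slot count as stated in the description: n_scan * 4/3. */
size_t soa_ht_slots(size_t n_scan) {
    assert(n_scan <= SIZE_MAX / 4);  /* guard the multiplication */
    return n_scan * 4 / 3;
}

/* Estimated bytes across the three parallel arrays (assumed widths). */
size_t soa_ht_bytes(size_t n_scan) {
    size_t per_slot = sizeof(uint16_t)   /* cc[]   */
                    + sizeof(uint64_t)   /* ck64[] */
                    + sizeof(uint32_t);  /* ck32[] */
    return soa_ht_slots(n_scan) * per_slot;
}
```

Under these assumptions, 10M rows give about 187 MB, far beyond any L3, while the 1M-row threshold gives about 18.7 MB, which can plausibly fit in a large L3.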

Add n_scan <= 1000000 to the TopK candidate gate so large inputs
fall through to the parallel path. Smaller inputs (where the
single-thread SoA HT fits L2/L3 and Pass-1's skip-the-other-aggs
trade is still worth it) keep the existing fast path.
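
The resulting gate can be sketched as a single predicate. The condition terms and the 1000000 threshold are taken from the description; the function name, the macro name, and the parameter types are hypothetical, and the actual code presumably evaluates this inline in exec_group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold from the PR; the macro name is an assumption. */
#define TOPK_MAX_SCAN_ROWS 1000000

/* True when the single-threaded TopK candidate finder should be used;
 * false means the input falls through to the parallel radix path. */
bool use_topk_candidate_finder(bool use_emit_filter, int64_t top_count_take,
                               int n_keys, int64_t n_scan) {
    assert(n_keys >= 0 && n_scan >= 0);
    return use_emit_filter
        && top_count_take > 0
        && n_keys > 1
        && n_scan <= TOPK_MAX_SCAN_ROWS;
}
```

Note that the bound is inclusive: an input of exactly 1,000,000 rows still takes the single-threaded path.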

Profile of the hot loop before the gate (q32, 30 reps):

movzwl 0x0(%r13),%esi   # load cc[slot]
test   %si,%si          # ← 64% of exec_group self-time waits here

ClickBench 10M:

q32  ~890 → ~204 ms   (-77%)

Tests: 3232/3234 pass (unchanged).

@ser-vasilich merged commit cb54b51 into master on Jun 5, 2026
4 checks passed