perf(group): gate the multi-key Top-K candidate finder on input size by ser-vasilich · Pull Request #227 · RayforceDB/rayforce

ser-vasilich · 2026-06-05T15:16:37Z

Summary

The TopK candidate finder in exec_group (the use_emit_filter && top_count_take > 0 && n_keys > 1 block) is single-threaded: it
builds one SoA hash table sized to n_scan * 4/3 (cc[]/ck64[]/
ck32[]) and scans the full input sequentially, then refines
aggregates for the K winners in a second pass. The shortcut pays
off when n_groups ≪ n_scan and the K winners absorb most of the rows
— Pass-2 then re-aggregates only K rows worth of state.

For uniform high-cardinality inputs (10M rows × ~10M distinct
composite keys, e.g. ClickBench q32 by {WatchID, ClientIP}) the
SoA HT is hundreds of MB, every probe is an L3/DRAM miss, the
single-threaded scan is latency-bound, and Pass-2 gains nothing
because nearly every group already has count = 1. The parallel
radix_v2_phase1_fn path with per-(worker, partition) shards runs
~3-4× faster on such inputs.

Add n_scan <= 1000000 to the TopK candidate gate so large inputs
fall through to the parallel path. Smaller inputs (where the
single-thread SoA HT fits L2/L3 and Pass-1's skip-the-other-aggs
trade is still worth it) keep the existing fast path.

Profile of the hot loop before the gate (q32, 30 reps):

movzwl 0x0(%r13),%esi   # load cc[slot]
test   %si,%si          # ← 64% of exec_group self-time waits here

ClickBench 10M:

q32  ~890 → ~204 ms   (-77%)

Tests: 3232/3234 pass (unchanged).

The TopK candidate finder in exec_group is single-threaded: it builds one SoA hash table sized to n_scan * 4/3 (cc[]/ck64[]/ck32[]) and scans the full input sequentially, then refines aggregates for the K winners in a second pass. The shortcut pays off when n_groups is much smaller than n_scan and the K winners absorb most of the rows — Pass-2 then re-aggregates only K << n_groups rows worth of state. For uniform high-cardinality inputs (10M rows × ~10M distinct composite keys) the SoA HT is hundreds of MB, every probe is an L3/DRAM miss, the single-threaded scan is latency-bound, and Pass-2 gains nothing because nearly every group already has count = 1. The parallel radix_v2 path with per-(worker, partition) shards runs ~3-4× faster on such inputs. Add `n_scan <= 1000000` to the TopK-candidate gate so large inputs fall through to the parallel path. Smaller inputs (where the single-thread SoA HT fits L2/L3 and Pass-1's skip-the-other-aggs trade is worthwhile) keep the existing fast path. ClickBench 10M: q32 ~890 → ~204 ms

ser-vasilich mentioned this pull request Jun 5, 2026

perf(query): narrow + materialize before group for WHERE + multi-key #228

Merged

ser-vasilich merged commit cb54b51 into master Jun 5, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(group): gate the multi-key Top-K candidate finder on input size#227

perf(group): gate the multi-key Top-K candidate finder on input size#227
ser-vasilich merged 1 commit into
masterfrom
perf/group-topk-large-input-gate

ser-vasilich commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ser-vasilich commented Jun 5, 2026

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant