Add non-allocating IColumn::computeHashInto column hash kernel by harikrishnan94 · Pull Request #106538 · ClickHouse/ClickHouse

harikrishnan94 · 2026-06-05T10:08:10Z

Split out the hash-production half of #106120 into its own pull request, as requested in review: keep the per-column hash calculation separate from the batched scatter interface so each can be reviewed and benchmarked in isolation.

Motivation

Several in-memory routing paths — sharded aggregation (BufferedShardByHashTransform), grace_hash join bucketing (JoinUtils::scatterBlockByHash), and parallel window partitioning (ScatterByPartitionTransform) — hashed their key columns by allocating a WeakHash32 per chunk and chaining IColumn::getWeakHash32 across the key columns, then applying a separate intHashCRC32 finalizer before mapping to a shard. The per-chunk allocation and the extra finalizer dominate for narrow keys at high throughput.

Changes

IColumn::computeHashInto(row_begin, row_end, hash_out, initial) is a non-allocating per-row hash kernel that writes a finalized 32-bit hash for each row into a caller-provided buffer. The per-row hash is produced with hardware CRC32C — hashCRC32/intHashCRC32/updateWeakHash32, i.e. _mm_crc32_u64 on x86 (SSE4.2), __crc32cd on ARM, with a software fallback — seeded with WEAK_HASH32_INITIAL_VALUE, consuming a full 64-bit word in a single instruction. Multi-column keys compose in place — with no intermediate arrays — via combineWeakHash32, a single order-sensitive CRC32C step that packs the prior accumulator and the new per-row hash into the two halves of one 64-bit CRC input. Type-specialized kernels are provided for the common column types (MULTITARGET_FUNCTION_X86_V4).

The kernel obeys a representation-independence contract: a materialized column and a transparent wrapper of the same values (ColumnConst, ColumnLowCardinality, ColumnSparse, ColumnReplicated) compose to identical hashes, so equal multi-column keys are never split across shards or join buckets because of physical representation. Every column type combines its finalized per-row hash through the same combineWeakHash32; wrappers and composite columns (ColumnTuple/ColumnArray/ColumnNullable/ColumnObject/ColumnFunction/ColumnVariant, and the dummy ColumnNothing) combine the finalized nested hash.

IColumn::getWeakHash32, WeakHash32, and WeakHash.h/WeakHash.cpp are removed; the three consumers now fill a PaddedPODArray<UInt32> via chained computeHashInto. The CRC32C updateWeakHash32 primitive stays, with its default initial value WEAK_HASH32_INITIAL_VALUE and the new combineWeakHash32 combiner co-located in Hash.h. Because every column already writes a finalized, well-distributed 32-bit hash, the separate intHashCRC32 finalizer in hashToSelector is dropped — the sharder consumes the hash directly: hash & mask for power-of-two bucket counts (the grace_hash path) and Lemire fastrange (h * P) >> 32 (mapToRange) otherwise.

The per-row hash values differ from the old per-chunk getWeakHash32 chain. Changing the exact hash values is safe because all three consumers route in-memory per query; per-key stable shard identity was never guaranteed.

Testing

New unit test gtest_compute_hash_into (33 cases): row-range independence, initial/combine semantics, null-payload independence, array-length seeding, BFloat16 raw-bit hashing, chi-squared distribution uniformity, order sensitivity, wide 128/256-bit folding (UInt128/UInt256/Decimal128/Decimal256), and representation independence across Const/Sparse/LowCardinality/Replicated/Nothing for UInt64/String/Tuple/Array/Decimal/FixedString keys.
New stateless test 04312_grace_hash_vs_hash_join_key_types asserts grace_hash joins match the default hash join across UInt/String/Nullable/LowCardinality/Array/Tuple keys.
Verified locally against the existing sharded-aggregation suite (03927, 03928, 04097–04100) and a parallel-window correctness check that routes through ScatterByPartitionTransform.

Related: #105936
Related: #106120
Related: #106120 (review)

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Speed up sharded aggregation, grace_hash join bucketing, and parallel window partitioning by replacing the per-chunk allocating getWeakHash32 column hashing with a non-allocating IColumn::computeHashInto kernel that writes finalized per-row hardware-CRC32C hashes and composes multi-column keys in place via combineWeakHash32.

Made with Cursor

Split out the hash-production half of ClickHouse#106120 into its own change, as requested in review: keep the per-column hash calculation separate from the batched scatter interface so each can be reviewed and benchmarked in isolation. `IColumn::computeHashInto(row_begin, row_end, hash_out, initial)` is a non-allocating per-row hash kernel that writes 32-bit `fmix32`-based hashes into a caller-provided buffer, with SIMD overrides (`MULTITARGET_FUNCTION_X86_V4_V3`) for the common column types. Multi-column keys compose in place via `fmix32Combined` with no intermediate arrays, and the contract is representation-independent: a materialized column and a transparent wrapper (`ColumnConst`, `ColumnLowCardinality`, `ColumnSparse`, `ColumnReplicated`) of the same values hash identically. `IColumn::getWeakHash32`, `WeakHash32` and `WeakHash.h`/`WeakHash.cpp` are removed; the remaining consumers (`JoinUtils::scatterBlockByHash`, `ScatterByPartitionTransform`, `BufferedShardByHashTransform`) fill a `PaddedPODArray<UInt32>` via chained `computeHashInto`. The CRC32C `updateWeakHash32` primitive stays, with its default initial value relocated to `WEAK_HASH32_INITIAL_VALUE` in `Hash.h`. `mapToRange` maps a 32-bit hash uniformly to a shard index via Lemire fastrange. Routing hash values change from CRC32C to the `fmix32` finalizer, which is safe because both consumers route in-memory per query; per-key stable shard identity was never guaranteed. Verified that `grace_hash` joins match the default `hash` join across the changed key families (`04312_grace_hash_vs_hash_join_key_types`). This intentionally excludes the `DB::ColumnsScatter::scatter` primitive, the `shard_by_hash_input_batch_bytes` input-batching mode, and their tests from ClickHouse#106120; the consumers retain the legacy per-chunk `IColumn::scatter`. Related: ClickHouse#105936 Related: ClickHouse#106120 Related: ClickHouse#106120 (review) Co-authored-by: Cursor <cursoragent@cursor.com>

clickhouse-gh · 2026-06-05T10:08:47Z

Workflow [PR], commit [1c5a9fc]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR replaces the allocating getWeakHash32 path with IColumn::computeHashInto and wires it into sharded aggregation, grace_hash join bucketing, and parallel-window partitioning. The current code addresses the earlier correctness and test issues I rechecked, including dummy-column composition, ColumnFunction composition, ColumnLowCardinality/ColumnReplicated tests, wide-value coverage, the grace_hash final selector mix, and the end-to-end benchmark loop. I am still not ready to approve because the PR is categorized as Performance Improvement, but the latest visible evidence does not prove the final sharded aggregation path is faster.

Missing context / blind spots

⚠️ I could not validate the current CI performance comparison with .claude/tools/fetch_perf_report.py: it exits with clickhouse not found in PATH, and no local clickhouse or clickhouse-local binary is available in this checkout. A latest final-commit performance report or benchmark output would close this gap.

Findings

⚠️ Majors

[src/Processors/Transforms/BufferedShardByHashTransform.cpp:150] [dismissed by author -- https://github.com/Add non-allocating IColumn::computeHashInto column hash kernel #106538#discussion_r3371924944] The PR promises a Performance Improvement for the sharded aggregation hashing path, but the public discussion still only contains a cloud report showing ClickBench GROUP BY regressions (Q30 +10.6%, Q33 +8.2%) and no latest before/after numbers for the final 1c5a9fc1 state. The benchmark now matches the consumer loop, but the performance claim still needs measured proof. Suggested fix: attach current benchmark_compute_hash_into before/after output or wait for a current performance report covering the affected high-cardinality sharded aggregation cases; otherwise adjust the changelog category or entry so it does not claim an unproven performance improvement.

Tests

⚠️ Missing focused performance evidence for the final 1c5a9fc1 state. Correctness coverage is much stronger now, but the Performance Improvement contract needs measured proof.

Final Verdict

Status: ⚠️ Request changes

Minimum required action: provide current final-commit before/after performance evidence for the affected sharded aggregation and hashing paths, or change the PR metadata/changelog so it no longer promises Performance Improvement.

Fix two test issues flagged in review of the `computeHashInto` kernel: - `InitialFlagSemantics` computed `combined_manual` but never asserted it, so the test passed for any `initial=false` implementation that merely mutated the buffer. Assert the documented contract `EXPECT_EQ(combined_api, combined_manual)`. - The distributional chi-square tests seeded `rng` from `randomSeed`, so a correct hash could exceed the `p=0.001` cutoff and flake in CI. Use a fixed seed for reproducibility. Document the intentional layout-dependence of `ColumnDynamic::computeHashInto` and `ColumnObject::computeHashInto`: like the former `getWeakHash32` and `updateHashWithValueRange`, the weak routing hash is not guaranteed to be equal for logically equal values stored with different variant/path layouts. This matches prior behavior; the in-memory scatter consumers only need a fast per-query partitioning, not a layout-stable hash. Co-authored-by: Cursor <cursoragent@cursor.com>

`ColumnFunction::computeHashInto` streamed each captured column directly into the caller's buffer on the `initial == false` path, producing `combine(h_b, combine(h_a, prior))` instead of the contract's `combine(combine(h_b, h_a), prior)`. Since `ColumnConst` feeds back the finalized per-row value, a materialized `ColumnFunction` and a `ColumnConst(ColumnFunction)` of the same value composed differently when the function was not the first key, which could route equal keys to different shards or `grace_hash` partitions. Mirror `ColumnTuple`: build the finalized function hash in a scratch buffer, then combine it once via `fmix32Combined`. The empty-capture non-initial case now combines `0` instead of leaving the prior hash unchanged. Add deterministic `gtest_compute_hash_into` coverage for the >8-byte folding path in `hashValueRaw32`/`hashDecimalRaw32`, which the existing cases never reached: `UInt128`, `UInt256`, `Decimal128`, and `Decimal256`, each checked for distinct hashes, row-range split independence, and `initial == false` (`ColumnConst`) composition. Co-authored-by: Cursor <cursoragent@cursor.com>

…nd Replicated The representation-independence block (test 16) advertised `ColumnLowCardinality` and `ColumnReplicated` in its comment, but only built `ColumnConst` and `ColumnSparse`. The stateless `04312_grace_hash` test does not cover `ColumnLowCardinality::computeHashInto` either, because `JoinUtils::materializeColumn` strips LowCardinality keys before hashing. Add direct gtests for both wrappers: build a materialized column and a logically equal `ColumnLowCardinality` / `ColumnReplicated` with per-row varying values (a small pool with repeats so the dictionary/index mapping is non-trivial), then assert `composeKey(prefix, materialized) == composeKey(prefix, wrapper)` (the `initial == false` composition path) plus row-range split consistency via `expectRangeSplitConsistent`. Co-authored-by: Cursor <cursoragent@cursor.com>

clickhouse-gh · 2026-06-05T12:16:17Z

📊 Cloud Performance Report

🟢 AI verdict: improvement — 2 query(s) improved out of 38 analysed

This PR rewrites the column-level weak-hash routine into a range-based, partially AVX-512-vectorized implementation. That hash drives row partitioning for hash joins and parallel aggregation, so the two flagged improvements both land on join-heavy TPC-H queries: Q7 is 34% faster and Q20 is 47% faster, both confirmed by both tests. Q7's measurements were tight and the result is clean; Q20's underlying runs were noisier, but the delta is large enough to be a real shift. Both improvements are consistent with what this change touches.

clickbench

🟢 No significant changes

tpch_adapted_1_official

🟢 2 improved

Flagged queries (2 of 22)

	Query	Verdict	Baseline med (ms)	PR med (ms)	Change	q-value	Hint
🟢	7	improvement	113	75	-33.6%	<0.0001	join_operator: PR vectorizes weak-hash row partitioning used by hash joins; -34% on this join-heavy query is plausible.
🟢	20	improvement	246	131	-46.8%	<0.0001	join_operator: Faster weak-hash partitioning fits this join/subquery-heavy query; large delta though base runs were noisy.

_{q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.}

Debug info

StressHouse run: cdc9dc5f-85dd-4d3c-aea3-a7ffc0c8afff
MIRAI run: ffc67b0f-8d06-40f7-bad7-2f945c53ed5c
PR check IDs:
- clickbench_32812_1780938588
- clickbench_32825_1780938588
- clickbench_32831_1780938588
- tpch_adapted_1_official_32837_1780938588
- tpch_adapted_1_official_33238_1780938620

The non-allocating `IColumn::computeHashInto` kernel (PR ClickHouse#106538) bundled a CRC32C -> `fmix32` routing-hash switch that regressed join-heavy workloads: the Cloud Performance Report on the PR flagged TPC-H Q4 +33% slower because `fmix32` is far costlier than hardware CRC32C on the in-memory, per-query scatter paths (sharded aggregation, `grace_hash` bucketing, parallel-window partitioning, hash-join scatter) -- paths that never need a stable shard identity. Keep the non-allocating per-row buffer and the representation-independence contract, but restore CRC32C as both the per-row hash and the cross-column combiner: - Per-row `h(row)` is a finalized 32-bit CRC32C hash seeded with `WEAK_HASH32_INITIAL_VALUE`: `hashCRC32(value, ...)` for `ColumnVector` (one `_mm_crc32_u64` / `__crc32cd`, no hi/lo split), `intHashCRC32(v.value, ...)` for `ColumnDecimal`, and `updateWeakHash32` for the string/fixed-string/ aggregate families. Float `-0.0` -> `+0.0` normalization is preserved. - Composite keys combine each column's FINALIZED per-row hash via one shared, order-sensitive `combineWeakHash32` that packs `prior` and `value` into the two halves of a single 64-bit CRC input (a plain `intHashCRC32(value, prior)` is order-insensitive for 32-bit operands and would collide `(a, b)` with `(b, a)`). Wrappers (`Const` / `LowCardinality` / `Sparse` / `Replicated` / `Tuple` / `Array` / `Nullable` / `Object` / `Function` / `Variant`) combine the finalized nested hash, so SQL-equal keys hash identically across representations. - Remove `Common/HashCombine32.h` (`fmix32` / `fmix32Combined`) and every reference, comment, and doc. Validated by `gtest_compute_hash_into` (32/32) and `04312_grace_hash_vs_hash_join_key_types`. On the per-row hash benchmark the UInt64 (bigint) join-key path drops from ~1.36 to ~0.27 ns/row (~5.1x faster), restoring the fast scalar CRC32C path TPC-H Q4 joins use. Co-authored-by: Cursor <cursoragent@cursor.com>

`IColumnDummy::computeHashInto` only wrote `hash_out` on the `initial == true` path and left it unchanged when `initial == false`. A dummy column's finalized per-row hash is `0`, so a key `(prefix, materialized ColumnNothing)` hashed as just `prefix`, while `(prefix, ColumnConst(ColumnNothing))` combined `0` into `prefix` through `ColumnConst`. That breaks the representation-independence contract and can split equal multi-column keys if `Nothing`/dummy columns reach the scatter paths. Mirror the empty `ColumnTuple`/`ColumnFunction` handling: combine `0` via `combineWeakHash32` on the non-initial path. Add a `gtest_compute_hash_into` case asserting a materialized `ColumnNothing` and `ColumnConst(ColumnNothing)` compose identically. Addresses review comment: ClickHouse#106538 (comment) Co-authored-by: Cursor <cursoragent@cursor.com>

Stacked on top of ClickHouse#106538 (`IColumn::computeHashInto` hash kernel PR). This commit adds the scatter half and the `shard_by_hash_input_batch_bytes` input-batching mode. `DB::ColumnsScatter::scatter(source_columns, pids_per_source, num_shards, rows_per_shard)` is a batched, O(1)-dispatch physical split. The caller supplies one `IColumn` from each of B pending chunks at the same column-position, along with per-source `UInt32` partition-ids and precomputed per-shard row counts. A static function-pointer table keyed on `IColumn::getDataType()` selects the typed kernel with no virtual call on the hot path. Fast-path typed kernels for `ColumnVector`, `ColumnDecimal`, `ColumnFixedString`, `ColumnString`, `ColumnNullable`, and `ColumnTuple`; all other types delegate to the legacy `IColumn::scatter` fallback. Mixed-representation batches (`ColumnConst`/`ColumnSparse`/`ColumnReplicated` mixed with materialized columns) are normalized via `materializeTransparentWrappers` before dispatch, preserving `LowCardinality`; homogeneous `LowCardinality` batches use the fallback. `BufferedShardByHashTransform` input batching — new setting `shard_by_hash_input_batch_bytes` (default 2 MiB; 0 = legacy per-chunk path). The transform accumulates input chunks until the byte budget is reached, then flushes via `ColumnsScatter::scatter` once per column-position. Hash production uses `IColumn::computeHashInto` (CRC32C, from ClickHouse#106538) + `mapToRange`. `ScatterByPartitionTransform` is wired to `ColumnsScatter::scatter` for the same O(1)-dispatch benefit on the parallel-window partitioning path. Also adds: - `benchmark_columns_scatter.cpp` — RFC sweep over K and P - `gtest_columns_scatter.cpp` — equivalence vs legacy for all fast-path types, mixed-representation regressions, `LowCardinality` type preservation, and per-key count assertions - `04278_04101_sharded_aggregation_batched_input.sql` — batched vs legacy per-key-identical `GROUP BY` results across key types with small block size to exercise the threshold path, multi-chunk batches, and EOF flush Related: ClickHouse#105936 Related: ClickHouse#106120 Related: ClickHouse#106538 Co-authored-by: Cursor <cursoragent@cursor.com>

…pansion Two correctness/resource fixes in `ColumnsScatter::scatter`: 1. **`scatterFallback` nested-representation mismatch** — when `IColumn::scatter` is called across source chunks that share a logical type but differ in their nested concrete representation (e.g. `Array(UInt64)` with a full nested `ColumnUInt64` in one chunk vs `ColumnSparse(UInt64)` in another), the subsequent `insertRangeFrom` call on the destination's nested column would `assert_cast` on the wrong concrete type, causing undefined behaviour in release builds. Fix: add `deepNormalizeForFallback` which strips `ColumnConst`, `ColumnSparse`, and `ColumnReplicated` recursively at every nesting level (recursing into `ColumnArray` data and `ColumnTuple` elements) while preserving `ColumnLowCardinality`. `scatterFallback` normalizes all sources through this function before calling `IColumn::scatter`, and clones the destination from the normalized first source so all cross-chunk `insertRangeFrom` calls operate on matching concrete column types. 2. **Eager `ColumnConst` materialization before the UInt32 overflow guard** — the overflow guard previously ran after `materializeTransparentWrappers`, expanding O(1)-byte `ColumnConst` sources into full columns even when the only path taken was the size_t-safe `scatterFallback`. For const-heavy batches accumulated by `BufferedShardByHashTransform`, this could expand millions of logical rows into full allocations unnecessarily. Fix: move the UInt32 overflow guard before the wrapper materialization block. Additionally, add a fast compact path for a homogeneous batch of equal-valued `ColumnConst` sources: when all sources share the same constant value, produce `ColumnConst(value, rows_per_shard[s])` per shard via `cloneResized` — no materialization, O(1) memory, matching the behavior the legacy per-chunk `IColumn::scatter` had for constant-key workloads. Co-authored-by: Cursor <cursoragent@cursor.com>

…nConst` compaction Two new tests in `gtest_columns_scatter`: - `FallbackArrayMixedNestedRepresentation`: scatters two `Array(UInt64)` chunks where the first has a full nested `ColumnUInt64` and the second has a `ColumnSparse(UInt64)` nested column (identical logical values). Verifies that the scattered results match a reference built from two fully-materialized sources. Before the fix, the cross-chunk `insertRangeFrom` in `scatterFallback` would `assert_cast` a `ColumnSparse` as `ColumnVector` and produce wrong values or UB. - `ConstColumnStaysCompact`: scatters three `ColumnConst(UInt64=99)` sources through `ColumnsScatter::scatter` and asserts that every per-shard output column `isConst()`. Verifies that the new all-const compact path returns `ColumnConst(99, rows_per_shard[s])` without materializing the sources into full columns. Co-authored-by: Cursor <cursoragent@cursor.com>

…ce_hash buckets The power-of-two bucket selection in `scatterBlockByHashPow2` used `hash & mask`, which extracts only the low bits of the CRC32C hash. CRC32C has poor avalanche in the low bits for structured input (e.g. sequential integers), leaving the bucket distribution/performance contract unproven for the `grace_hash` path. Replace with Knuth's multiplicative (Fibonacci) hash: multiply by `floor(2^32/phi) = 0x9E3779B1` and take the top `log2(num_shards)` bits via `>> (32 - K)`. The high bits of the product depend on all 32 input bits, so even poor CRC32C low-bit distribution becomes uniform bucket assignment. This mirrors the approach already used by `scatterBlockByHashGeneric` and `mapToRange` (both use high-bit Lemire fastrange); the pow2 path now has consistent behaviour. Add two chi-squared distribution tests in `gtest_compute_hash_into` that exercise the Fibonacci sharder path directly: - `DistributionPow2FibRandom`: random UInt32 keys, 64 buckets. - `DistributionPow2FibSequential`: sequential UInt64 keys (0..65535), the pathological case for raw CRC32C low bits, verified uniform after the fix. Addresses reviewer comment on: ClickHouse#106538 Related: ClickHouse#106538 Related: ClickHouse#106120 Co-authored-by: Cursor <cursoragent@cursor.com>

Keep the `IColumn::computeHashInto` kernel work while removing the `ColumnsScatter` batching path, related setting, tests, and benchmark from this branch. Co-authored-by: Cursor <cursoragent@cursor.com>

Three correctness fixes for the `IColumn::computeHashInto` branch: 1. **BLOCKER — stale hash buffer in scatter transforms**: `ScatterByPartitionTransform` and `BufferedShardByHashTransform` reused a member `PaddedPODArray<UInt32>` across blocks via `resize_fill(n, WEAK_HASH32_INITIAL_VALUE)`. `resize_fill` only fills the grown tail `[old_size, n)`, so when a later block is `<=` the prior size (the common case with equal-size blocks) the buffer is NOT re-initialized. `computeHashInto` with `initial=false` then combines into stale data, routing equal keys to different shards across blocks — splitting window partitions and aggregation groups → wrong results. Fixed by `resize` + `std::fill(..., WEAK_HASH32_INITIAL_VALUE)`. 2. **Dropped grace_hash finalizer**: `JoinUtils::hashToSelector` was calling `sharder(hashes[i])` instead of `sharder(intHashCRC32(hashes[i]))`, losing the CRC32 avalanche step that the old `scatterBlockByHash` path depended on. Restored. 3. **Float `-0.0` normalization deviation**: `weakHashValue32` in `ColumnVector.cpp` was normalizing `-0.0` to `+0.0` before hashing, deviating from the old `getWeakHash32` byte-for-byte behavior. Reverted to raw-bit hashing. Updated the `ColumnBFloat16RawBitsAndSignedZero` test accordingly (`EXPECT_NE`). Also fixes a pre-existing `-Wimplicit-int-float-conversion` build error in `benchmark_column_insert_many_from.cpp`. Adds regression test `04316_fix_scatter_buffer_reuse_window` that catches the stale-buffer bug via a parallel window function cross-joined with a `GROUP BY` reference. Co-authored-by: Cursor <cursoragent@cursor.com>

Both ScatterByPartitionTransform and BufferedShardByHashTransform reused a member PaddedPODArray<UInt32> across blocks via resize_fill(n, value), which only fills the grown tail [old_size, n). When a later block is <= a prior block in size (the common case), no fill happens and the buffer retains stale hashes from the previous block. computeHashInto(initial=false) then combines into those stale values, routing equal keys to different shards across blocks and producing split window partitions / aggregation groups. Fix: replace resize_fill with resize + std::fill(…, WEAK_HASH32_INITIAL_VALUE) so the full buffer is always re-initialized each block. Also: - Restore the intHashCRC32 finalizer in JoinUtils::hashToSelector (grace_hash routing is now bit-identical to old WeakHash32-based code). - Revert float -0.0 normalization in weakHashValue32 (raw-bit hashing matches old getWeakHash32 byte-for-byte; SQL-equal -0.0/+0.0 routing to different shards is pre-existing old behavior). - Add regression test 04316_fix_scatter_buffer_reuse_window that cross-joins a parallel window function result against a GROUP BY reference and expects 0 mismatches. Co-authored-by: Cursor <cursoragent@cursor.com>

The `BM_HashAndFastrange_New` benchmark claimed to capture the `BufferedShardByHashTransform` hot path but diverged from the actual consumer loop in `generateOutputChunks`: it started the chain with `initial=true` for the first key column and omitted the per-iteration buffer seeding. The transform instead fills `hash_buffer` with `WEAK_HASH32_INITIAL_VALUE` every block and calls `computeHashInto` with `initial=false` for every key column. For `K=1` the benchmark therefore measured just `h(row) + mapToRange` instead of `combineWeakHash32(h(row), WEAK_HASH32_INITIAL_VALUE) + mapToRange`, overstating the improvement and missing regressions in the real loop. Seed `hash_buf` with `WEAK_HASH32_INITIAL_VALUE` each iteration (the re-seed is part of the consumer hot path) and call `computeHashInto` with `initial=false` for all columns, so the benchmark times the exact loop the PR's performance contract is about. Addresses review comment: ClickHouse#106538 (review) Co-authored-by: Cursor <cursoragent@cursor.com>

clickhouse-gh · 2026-06-08T19:32:56Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.60%	84.60%	+0.00%
Functions	92.30%	92.30%	+0.00%
Branches	77.20%	77.30%	+0.10%

Changed lines: Changed C/C++ lines covered by tests: 1001/1183 (84.62%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 12 line(s) · Uncovered code

Full report · Diff report

nickitat

That is indeed a much better-scoped and easier-to-review patch. Thanks.

nickitat · 2026-06-08T18:59:02Z

+///
+/// Two output-width overloads share the same kernel shape:
+///   - UInt64 output feeds an IColumn::Selector directly (no widening copy).
+///   - UInt32 output halves the selector bandwidth where a 32-bit index suffices.


The implementation itself is about the same length as the comments - I'm not sure it is actually useful. Also, if you want to comment on the implementation details, please do so next to the actual code and not in the header.

nickitat · 2026-06-08T19:18:24Z

+    /// Partition rows by shard using Lemire fastrange.
+    /// The CRC32C-based per-row hash is well-distributed across all 32 bits, so the high
+    /// bits fed into the shift are uniform — no extra finalizer is needed.


Suggested change

/// Partition rows by shard using Lemire fastrange.

/// The CRC32C-based per-row hash is well-distributed across all 32 bits, so the high

/// bits fed into the shift are uniform — no extra finalizer is needed.

Let's not repeat the same comment at every call site.

nickitat · 2026-06-08T19:19:04Z

+    hash_buffer.resize(num_rows);
+    std::fill(hash_buffer.begin(), hash_buffer.end(), WEAK_HASH32_INITIAL_VALUE);


Suggested change

hash_buffer.resize(num_rows);

std::fill(hash_buffer.begin(), hash_buffer.end(), WEAK_HASH32_INITIAL_VALUE);

hash_buffer.assign(num_rows, WEAK_HASH32_INITIAL_VALUE);

nickitat · 2026-06-08T19:22:41Z

    chassert(!columns.empty());

-    hash.reset(num_rows);
+    hash.resize(num_rows);


The same assign suggestion.

nickitat · 2026-06-08T19:25:20Z

    for (size_t row = 0; row < num_rows; ++row)
-        selector[row] = (static_cast<UInt64>(hash_data[row]) * output_size) >> 32; /// The "fastrange" method from Daniel Lemire
+        selector[row] = (static_cast<UInt64>(hash[row]) * output_size) >> 32; /// The "fastrange" method from Daniel Lemire


Can we use mapToRange here?

nickitat · 2026-06-08T19:27:45Z

+        columns[column_number]->computeHashInto(0, num_rows, hash.data(), false);

-    const PaddedPODArray<UInt32> & hash_data = hash.getData();
    IColumn::Selector selector(num_rows);


Looks like we can reuse selector too.

nickitat · 2026-06-08T19:39:03Z

+    return static_cast<uint32_t>(intHashCRC32(v.value, WEAK_HASH32_INITIAL_VALUE));
+}
+
+MULTITARGET_FUNCTION_X86_V4(


What are we trying to achieve with avx-512 here? crc builtins won't vectorise AFAIK, and potential unrolling also shouldn't really speed anything up.

nickitat · 2026-06-08T19:39:47Z

                if (direction == IColumn::PermutationSortDirection::Ascending && stability == IColumn::PermutationSortStability::Unstable)
                    try_sort = trySort(res.begin(), res.end(), comparator_ascending);
-                else if (direction == IColumn::PermutationSortDirection::Ascending && stability == IColumn::PermutationSortStability::Stable)
+                else if (


Please avoid formatting unchanged code.

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 5, 2026

harikrishnan94 requested review from nickitat and nihalzp June 5, 2026 10:12

harikrishnan94 added the can be tested Allows running workflows for external contributors label Jun 5, 2026