Skip to content

Add non-allocating IColumn::computeHashInto column hash kernel#106538

Open
harikrishnan94 wants to merge 14 commits into
ClickHouse:masterfrom
harikrishnan94:icolumn-compute-hash-into
Open

Add non-allocating IColumn::computeHashInto column hash kernel#106538
harikrishnan94 wants to merge 14 commits into
ClickHouse:masterfrom
harikrishnan94:icolumn-compute-hash-into

Conversation

@harikrishnan94

@harikrishnan94 harikrishnan94 commented Jun 5, 2026

Copy link
Copy Markdown

Split out the hash-production half of #106120 into its own pull request, as requested in review: keep the per-column hash calculation separate from the batched scatter interface so each can be reviewed and benchmarked in isolation.

Motivation

Several in-memory routing paths — sharded aggregation (BufferedShardByHashTransform), grace_hash join bucketing (JoinUtils::scatterBlockByHash), and parallel window partitioning (ScatterByPartitionTransform) — hashed their key columns by allocating a WeakHash32 per chunk and chaining IColumn::getWeakHash32 across the key columns, then applying a separate intHashCRC32 finalizer before mapping to a shard. The per-chunk allocation and the extra finalizer dominate for narrow keys at high throughput.

Changes

IColumn::computeHashInto(row_begin, row_end, hash_out, initial) is a non-allocating per-row hash kernel that writes a finalized 32-bit hash for each row into a caller-provided buffer. The per-row hash is produced with hardware CRC32C — hashCRC32/intHashCRC32/updateWeakHash32, i.e. _mm_crc32_u64 on x86 (SSE4.2), __crc32cd on ARM, with a software fallback — seeded with WEAK_HASH32_INITIAL_VALUE, consuming a full 64-bit word in a single instruction. Multi-column keys compose in place — with no intermediate arrays — via combineWeakHash32, a single order-sensitive CRC32C step that packs the prior accumulator and the new per-row hash into the two halves of one 64-bit CRC input. Type-specialized kernels are provided for the common column types (MULTITARGET_FUNCTION_X86_V4).

The kernel obeys a representation-independence contract: a materialized column and a transparent wrapper of the same values (ColumnConst, ColumnLowCardinality, ColumnSparse, ColumnReplicated) compose to identical hashes, so equal multi-column keys are never split across shards or join buckets because of physical representation. Every column type combines its finalized per-row hash through the same combineWeakHash32; wrappers and composite columns (ColumnTuple/ColumnArray/ColumnNullable/ColumnObject/ColumnFunction/ColumnVariant, and the dummy ColumnNothing) combine the finalized nested hash.

IColumn::getWeakHash32, WeakHash32, and WeakHash.h/WeakHash.cpp are removed; the three consumers now fill a PaddedPODArray<UInt32> via chained computeHashInto. The CRC32C updateWeakHash32 primitive stays, with its default initial value WEAK_HASH32_INITIAL_VALUE and the new combineWeakHash32 combiner co-located in Hash.h. Because every column already writes a finalized, well-distributed 32-bit hash, the separate intHashCRC32 finalizer in hashToSelector is dropped — the sharder consumes the hash directly: hash & mask for power-of-two bucket counts (the grace_hash path) and Lemire fastrange (h * P) >> 32 (mapToRange) otherwise.

The per-row hash values differ from the old per-chunk getWeakHash32 chain. Changing the exact hash values is safe because all three consumers route in-memory per query; per-key stable shard identity was never guaranteed.

Testing

  • New unit test gtest_compute_hash_into (33 cases): row-range independence, initial/combine semantics, null-payload independence, array-length seeding, BFloat16 raw-bit hashing, chi-squared distribution uniformity, order sensitivity, wide 128/256-bit folding (UInt128/UInt256/Decimal128/Decimal256), and representation independence across Const/Sparse/LowCardinality/Replicated/Nothing for UInt64/String/Tuple/Array/Decimal/FixedString keys.
  • New stateless test 04312_grace_hash_vs_hash_join_key_types asserts grace_hash joins match the default hash join across UInt/String/Nullable/LowCardinality/Array/Tuple keys.
  • Verified locally against the existing sharded-aggregation suite (03927, 03928, 0409704100) and a parallel-window correctness check that routes through ScatterByPartitionTransform.

Related: #105936
Related: #106120
Related: #106120 (review)

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Speed up sharded aggregation, grace_hash join bucketing, and parallel window partitioning by replacing the per-chunk allocating getWeakHash32 column hashing with a non-allocating IColumn::computeHashInto kernel that writes finalized per-row hardware-CRC32C hashes and composes multi-column keys in place via combineWeakHash32.

Made with Cursor

Split out the hash-production half of ClickHouse#106120 into its own change, as
requested in review: keep the per-column hash calculation separate from
the batched scatter interface so each can be reviewed and benchmarked in
isolation.

`IColumn::computeHashInto(row_begin, row_end, hash_out, initial)` is a
non-allocating per-row hash kernel that writes 32-bit `fmix32`-based hashes
into a caller-provided buffer, with SIMD overrides
(`MULTITARGET_FUNCTION_X86_V4_V3`) for the common column types. Multi-column
keys compose in place via `fmix32Combined` with no intermediate arrays, and
the contract is representation-independent: a materialized column and a
transparent wrapper (`ColumnConst`, `ColumnLowCardinality`, `ColumnSparse`,
`ColumnReplicated`) of the same values hash identically.

`IColumn::getWeakHash32`, `WeakHash32` and `WeakHash.h`/`WeakHash.cpp` are
removed; the remaining consumers (`JoinUtils::scatterBlockByHash`,
`ScatterByPartitionTransform`, `BufferedShardByHashTransform`) fill a
`PaddedPODArray<UInt32>` via chained `computeHashInto`. The CRC32C
`updateWeakHash32` primitive stays, with its default initial value relocated
to `WEAK_HASH32_INITIAL_VALUE` in `Hash.h`. `mapToRange` maps a 32-bit hash
uniformly to a shard index via Lemire fastrange.

Routing hash values change from CRC32C to the `fmix32` finalizer, which is
safe because both consumers route in-memory per query; per-key stable shard
identity was never guaranteed. Verified that `grace_hash` joins match the
default `hash` join across the changed key families
(`04312_grace_hash_vs_hash_join_key_types`).

This intentionally excludes the `DB::ColumnsScatter::scatter` primitive, the
`shard_by_hash_input_batch_bytes` input-batching mode, and their tests from
ClickHouse#106120; the consumers retain the legacy per-chunk `IColumn::scatter`.

Related: ClickHouse#105936
Related: ClickHouse#106120
Related: ClickHouse#106120 (review)

Co-authored-by: Cursor <cursoragent@cursor.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [1c5a9fc]

Summary:


AI Review

Summary

This PR replaces the allocating getWeakHash32 path with IColumn::computeHashInto and wires it into sharded aggregation, grace_hash join bucketing, and parallel-window partitioning. The current code addresses the earlier correctness and test issues I rechecked, including dummy-column composition, ColumnFunction composition, ColumnLowCardinality/ColumnReplicated tests, wide-value coverage, the grace_hash final selector mix, and the end-to-end benchmark loop. I am still not ready to approve because the PR is categorized as Performance Improvement, but the latest visible evidence does not prove the final sharded aggregation path is faster.

Missing context / blind spots
  • ⚠️ I could not validate the current CI performance comparison with .claude/tools/fetch_perf_report.py: it exits with clickhouse not found in PATH, and no local clickhouse or clickhouse-local binary is available in this checkout. A latest final-commit performance report or benchmark output would close this gap.
Findings

⚠️ Majors

  • [src/Processors/Transforms/BufferedShardByHashTransform.cpp:150] [dismissed by author -- https://github.com/Add non-allocating IColumn::computeHashInto column hash kernel #106538#discussion_r3371924944] The PR promises a Performance Improvement for the sharded aggregation hashing path, but the public discussion still only contains a cloud report showing ClickBench GROUP BY regressions (Q30 +10.6%, Q33 +8.2%) and no latest before/after numbers for the final 1c5a9fc1 state. The benchmark now matches the consumer loop, but the performance claim still needs measured proof. Suggested fix: attach current benchmark_compute_hash_into before/after output or wait for a current performance report covering the affected high-cardinality sharded aggregation cases; otherwise adjust the changelog category or entry so it does not claim an unproven performance improvement.
Tests
  • ⚠️ Missing focused performance evidence for the final 1c5a9fc1 state. Correctness coverage is much stronger now, but the Performance Improvement contract needs measured proof.
Final Verdict

Status: ⚠️ Request changes

Minimum required action: provide current final-commit before/after performance evidence for the affected sharded aggregation and hashing paths, or change the PR metadata/changelog so it no longer promises Performance Improvement.

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 5, 2026
@harikrishnan94 harikrishnan94 added the can be tested Allows running workflows for external contributors label Jun 5, 2026
Comment thread src/Columns/ColumnDynamic.h
Comment thread src/Columns/ColumnObject.cpp
Comment thread src/Columns/tests/gtest_compute_hash_into.cpp
Comment thread src/Columns/tests/gtest_compute_hash_into.cpp Outdated
Fix two test issues flagged in review of the `computeHashInto` kernel:

- `InitialFlagSemantics` computed `combined_manual` but never asserted it,
  so the test passed for any `initial=false` implementation that merely
  mutated the buffer. Assert the documented contract
  `EXPECT_EQ(combined_api, combined_manual)`.
- The distributional chi-square tests seeded `rng` from `randomSeed`, so a
  correct hash could exceed the `p=0.001` cutoff and flake in CI. Use a
  fixed seed for reproducibility.

Document the intentional layout-dependence of `ColumnDynamic::computeHashInto`
and `ColumnObject::computeHashInto`: like the former `getWeakHash32` and
`updateHashWithValueRange`, the weak routing hash is not guaranteed to be
equal for logically equal values stored with different variant/path layouts.
This matches prior behavior; the in-memory scatter consumers only need a fast
per-query partitioning, not a layout-stable hash.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/Columns/ColumnFunction.cpp Outdated
Comment thread src/Columns/ColumnVector.cpp Outdated
Comment thread src/Columns/tests/gtest_compute_hash_into.cpp
harikrishnan94 and others added 2 commits June 5, 2026 17:14
`ColumnFunction::computeHashInto` streamed each captured column directly into
the caller's buffer on the `initial == false` path, producing
`combine(h_b, combine(h_a, prior))` instead of the contract's
`combine(combine(h_b, h_a), prior)`. Since `ColumnConst` feeds back the
finalized per-row value, a materialized `ColumnFunction` and a
`ColumnConst(ColumnFunction)` of the same value composed differently when the
function was not the first key, which could route equal keys to different
shards or `grace_hash` partitions. Mirror `ColumnTuple`: build the finalized
function hash in a scratch buffer, then combine it once via `fmix32Combined`.
The empty-capture non-initial case now combines `0` instead of leaving the
prior hash unchanged.

Add deterministic `gtest_compute_hash_into` coverage for the >8-byte folding
path in `hashValueRaw32`/`hashDecimalRaw32`, which the existing cases never
reached: `UInt128`, `UInt256`, `Decimal128`, and `Decimal256`, each checked
for distinct hashes, row-range split independence, and `initial == false`
(`ColumnConst`) composition.

Co-authored-by: Cursor <cursoragent@cursor.com>
…nd Replicated

The representation-independence block (test 16) advertised `ColumnLowCardinality`
and `ColumnReplicated` in its comment, but only built `ColumnConst` and
`ColumnSparse`. The stateless `04312_grace_hash` test does not cover
`ColumnLowCardinality::computeHashInto` either, because `JoinUtils::materializeColumn`
strips LowCardinality keys before hashing.

Add direct gtests for both wrappers: build a materialized column and a logically
equal `ColumnLowCardinality` / `ColumnReplicated` with per-row varying values (a
small pool with repeats so the dictionary/index mapping is non-trivial), then
assert `composeKey(prefix, materialized) == composeKey(prefix, wrapper)` (the
`initial == false` composition path) plus row-range split consistency via
`expectRangeSplitConsistent`.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/Columns/IColumnDummy.h
@clickhouse-gh

clickhouse-gh Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

🟢 AI verdict: improvement2 query(s) improved out of 38 analysed

This PR rewrites the column-level weak-hash routine into a range-based, partially AVX-512-vectorized implementation. That hash drives row partitioning for hash joins and parallel aggregation, so the two flagged improvements both land on join-heavy TPC-H queries: Q7 is 34% faster and Q20 is 47% faster, both confirmed by both tests. Q7's measurements were tight and the result is clean; Q20's underlying runs were noisier, but the delta is large enough to be a real shift. Both improvements are consistent with what this change touches.

clickbench

🟢 No significant changes

tpch_adapted_1_official

🟢 2 improved

Flagged queries (2 of 22)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
🟢 7 improvement 113 75 -33.6% <0.0001 join_operator: PR vectorizes weak-hash row partitioning used by hash joins; -34% on this join-heavy query is plausible.
🟢 20 improvement 246 131 -46.8% <0.0001 join_operator: Faster weak-hash partitioning fits this join/subquery-heavy query; large delta though base runs were noisy.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: cdc9dc5f-85dd-4d3c-aea3-a7ffc0c8afff
  • MIRAI run: ffc67b0f-8d06-40f7-bad7-2f945c53ed5c
  • PR check IDs:
    • clickbench_32812_1780938588
    • clickbench_32825_1780938588
    • clickbench_32831_1780938588
    • tpch_adapted_1_official_32837_1780938588
    • tpch_adapted_1_official_33238_1780938620

harikrishnan94 and others added 2 commits June 5, 2026 19:41
The non-allocating `IColumn::computeHashInto` kernel (PR ClickHouse#106538) bundled a
CRC32C -> `fmix32` routing-hash switch that regressed join-heavy workloads: the
Cloud Performance Report on the PR flagged TPC-H Q4 +33% slower because `fmix32`
is far costlier than hardware CRC32C on the in-memory, per-query scatter paths
(sharded aggregation, `grace_hash` bucketing, parallel-window partitioning,
hash-join scatter) -- paths that never need a stable shard identity.

Keep the non-allocating per-row buffer and the representation-independence
contract, but restore CRC32C as both the per-row hash and the cross-column
combiner:

- Per-row `h(row)` is a finalized 32-bit CRC32C hash seeded with
  `WEAK_HASH32_INITIAL_VALUE`: `hashCRC32(value, ...)` for `ColumnVector`
  (one `_mm_crc32_u64` / `__crc32cd`, no hi/lo split), `intHashCRC32(v.value, ...)`
  for `ColumnDecimal`, and `updateWeakHash32` for the string/fixed-string/
  aggregate families. Float `-0.0` -> `+0.0` normalization is preserved.
- Composite keys combine each column's FINALIZED per-row hash via one shared,
  order-sensitive `combineWeakHash32` that packs `prior` and `value` into the two
  halves of a single 64-bit CRC input (a plain `intHashCRC32(value, prior)` is
  order-insensitive for 32-bit operands and would collide `(a, b)` with `(b, a)`).
  Wrappers (`Const` / `LowCardinality` / `Sparse` / `Replicated` / `Tuple` /
  `Array` / `Nullable` / `Object` / `Function` / `Variant`) combine the finalized
  nested hash, so SQL-equal keys hash identically across representations.
- Remove `Common/HashCombine32.h` (`fmix32` / `fmix32Combined`) and every
  reference, comment, and doc.

Validated by `gtest_compute_hash_into` (32/32) and
`04312_grace_hash_vs_hash_join_key_types`. On the per-row hash benchmark the
UInt64 (bigint) join-key path drops from ~1.36 to ~0.27 ns/row (~5.1x faster),
restoring the fast scalar CRC32C path TPC-H Q4 joins use.

Co-authored-by: Cursor <cursoragent@cursor.com>
`IColumnDummy::computeHashInto` only wrote `hash_out` on the `initial == true`
path and left it unchanged when `initial == false`. A dummy column's finalized
per-row hash is `0`, so a key `(prefix, materialized ColumnNothing)` hashed as
just `prefix`, while `(prefix, ColumnConst(ColumnNothing))` combined `0` into
`prefix` through `ColumnConst`. That breaks the representation-independence
contract and can split equal multi-column keys if `Nothing`/dummy columns reach
the scatter paths.

Mirror the empty `ColumnTuple`/`ColumnFunction` handling: combine `0` via
`combineWeakHash32` on the non-initial path. Add a `gtest_compute_hash_into`
case asserting a materialized `ColumnNothing` and `ColumnConst(ColumnNothing)`
compose identically.

Addresses review comment:
ClickHouse#106538 (comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/Interpreters/JoinUtils.h Outdated
@nickitat nickitat self-assigned this Jun 5, 2026
Stacked on top of ClickHouse#106538 (`IColumn::computeHashInto` hash kernel PR).
This commit adds the scatter half and the `shard_by_hash_input_batch_bytes`
input-batching mode.

`DB::ColumnsScatter::scatter(source_columns, pids_per_source, num_shards, rows_per_shard)`
is a batched, O(1)-dispatch physical split. The caller supplies one `IColumn`
from each of B pending chunks at the same column-position, along with
per-source `UInt32` partition-ids and precomputed per-shard row counts.
A static function-pointer table keyed on `IColumn::getDataType()` selects the
typed kernel with no virtual call on the hot path.

Fast-path typed kernels for `ColumnVector`, `ColumnDecimal`, `ColumnFixedString`,
`ColumnString`, `ColumnNullable`, and `ColumnTuple`; all other types delegate to
the legacy `IColumn::scatter` fallback. Mixed-representation batches
(`ColumnConst`/`ColumnSparse`/`ColumnReplicated` mixed with materialized columns)
are normalized via `materializeTransparentWrappers` before dispatch, preserving
`LowCardinality`; homogeneous `LowCardinality` batches use the fallback.

`BufferedShardByHashTransform` input batching — new setting
`shard_by_hash_input_batch_bytes` (default 2 MiB; 0 = legacy per-chunk path).
The transform accumulates input chunks until the byte budget is reached, then
flushes via `ColumnsScatter::scatter` once per column-position. Hash production
uses `IColumn::computeHashInto` (CRC32C, from ClickHouse#106538) + `mapToRange`.

`ScatterByPartitionTransform` is wired to `ColumnsScatter::scatter` for the
same O(1)-dispatch benefit on the parallel-window partitioning path.

Also adds:
- `benchmark_columns_scatter.cpp` — RFC sweep over K and P
- `gtest_columns_scatter.cpp` — equivalence vs legacy for all fast-path types,
  mixed-representation regressions, `LowCardinality` type preservation, and
  per-key count assertions
- `04278_04101_sharded_aggregation_batched_input.sql` — batched vs legacy
  per-key-identical `GROUP BY` results across key types with small block size
  to exercise the threshold path, multi-chunk batches, and EOF flush

Related: ClickHouse#105936
Related: ClickHouse#106120
Related: ClickHouse#106538

Co-authored-by: Cursor <cursoragent@cursor.com>
harikrishnan94 and others added 2 commits June 6, 2026 16:42
…pansion

Two correctness/resource fixes in `ColumnsScatter::scatter`:

1. **`scatterFallback` nested-representation mismatch** — when `IColumn::scatter`
   is called across source chunks that share a logical type but differ in their
   nested concrete representation (e.g. `Array(UInt64)` with a full nested
   `ColumnUInt64` in one chunk vs `ColumnSparse(UInt64)` in another), the
   subsequent `insertRangeFrom` call on the destination's nested column would
   `assert_cast` on the wrong concrete type, causing undefined behaviour in
   release builds.

   Fix: add `deepNormalizeForFallback` which strips `ColumnConst`,
   `ColumnSparse`, and `ColumnReplicated` recursively at every nesting level
   (recursing into `ColumnArray` data and `ColumnTuple` elements) while
   preserving `ColumnLowCardinality`. `scatterFallback` normalizes all sources
   through this function before calling `IColumn::scatter`, and clones the
   destination from the normalized first source so all cross-chunk
   `insertRangeFrom` calls operate on matching concrete column types.

2. **Eager `ColumnConst` materialization before the UInt32 overflow guard** —
   the overflow guard previously ran after `materializeTransparentWrappers`,
   expanding O(1)-byte `ColumnConst` sources into full columns even when the
   only path taken was the size_t-safe `scatterFallback`. For const-heavy
   batches accumulated by `BufferedShardByHashTransform`, this could expand
   millions of logical rows into full allocations unnecessarily.

   Fix: move the UInt32 overflow guard before the wrapper materialization block.
   Additionally, add a fast compact path for a homogeneous batch of equal-valued
   `ColumnConst` sources: when all sources share the same constant value,
   produce `ColumnConst(value, rows_per_shard[s])` per shard via
   `cloneResized` — no materialization, O(1) memory, matching the behavior the
   legacy per-chunk `IColumn::scatter` had for constant-key workloads.

Co-authored-by: Cursor <cursoragent@cursor.com>
…nConst` compaction

Two new tests in `gtest_columns_scatter`:

- `FallbackArrayMixedNestedRepresentation`: scatters two `Array(UInt64)`
  chunks where the first has a full nested `ColumnUInt64` and the second has a
  `ColumnSparse(UInt64)` nested column (identical logical values). Verifies that
  the scattered results match a reference built from two fully-materialized
  sources. Before the fix, the cross-chunk `insertRangeFrom` in `scatterFallback`
  would `assert_cast` a `ColumnSparse` as `ColumnVector` and produce wrong values
  or UB.

- `ConstColumnStaysCompact`: scatters three `ColumnConst(UInt64=99)` sources
  through `ColumnsScatter::scatter` and asserts that every per-shard output
  column `isConst()`. Verifies that the new all-const compact path returns
  `ColumnConst(99, rows_per_shard[s])` without materializing the sources into
  full columns.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/Columns/ColumnsScatter.cpp Outdated
Comment thread src/Columns/ColumnsScatter.cpp Outdated
harikrishnan94 and others added 2 commits June 7, 2026 14:40
…ce_hash buckets

The power-of-two bucket selection in `scatterBlockByHashPow2` used `hash & mask`,
which extracts only the low bits of the CRC32C hash. CRC32C has poor avalanche in
the low bits for structured input (e.g. sequential integers), leaving the bucket
distribution/performance contract unproven for the `grace_hash` path.

Replace with Knuth's multiplicative (Fibonacci) hash: multiply by
`floor(2^32/phi) = 0x9E3779B1` and take the top `log2(num_shards)` bits
via `>> (32 - K)`. The high bits of the product depend on all 32 input bits,
so even poor CRC32C low-bit distribution becomes uniform bucket assignment.
This mirrors the approach already used by `scatterBlockByHashGeneric` and
`mapToRange` (both use high-bit Lemire fastrange); the pow2 path now has
consistent behaviour.

Add two chi-squared distribution tests in `gtest_compute_hash_into` that
exercise the Fibonacci sharder path directly:
- `DistributionPow2FibRandom`: random UInt32 keys, 64 buckets.
- `DistributionPow2FibSequential`: sequential UInt64 keys (0..65535), the
  pathological case for raw CRC32C low bits, verified uniform after the fix.

Addresses reviewer comment on:
ClickHouse#106538

Related: ClickHouse#106538
Related: ClickHouse#106120

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the `IColumn::computeHashInto` kernel work while removing the `ColumnsScatter` batching path, related setting, tests, and benchmark from this branch.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/Processors/Transforms/BufferedShardByHashTransform.cpp
harikrishnan94 and others added 2 commits June 8, 2026 19:39
Three correctness fixes for the `IColumn::computeHashInto` branch:

1. **BLOCKER — stale hash buffer in scatter transforms**: `ScatterByPartitionTransform`
   and `BufferedShardByHashTransform` reused a member `PaddedPODArray<UInt32>` across
   blocks via `resize_fill(n, WEAK_HASH32_INITIAL_VALUE)`. `resize_fill` only fills the
   grown tail `[old_size, n)`, so when a later block is `<=` the prior size (the common
   case with equal-size blocks) the buffer is NOT re-initialized. `computeHashInto` with
   `initial=false` then combines into stale data, routing equal keys to different shards
   across blocks — splitting window partitions and aggregation groups → wrong results.
   Fixed by `resize` + `std::fill(..., WEAK_HASH32_INITIAL_VALUE)`.

2. **Dropped grace_hash finalizer**: `JoinUtils::hashToSelector` was calling
   `sharder(hashes[i])` instead of `sharder(intHashCRC32(hashes[i]))`, losing the
   CRC32 avalanche step that the old `scatterBlockByHash` path depended on. Restored.

3. **Float `-0.0` normalization deviation**: `weakHashValue32` in `ColumnVector.cpp`
   was normalizing `-0.0` to `+0.0` before hashing, deviating from the old
   `getWeakHash32` byte-for-byte behavior. Reverted to raw-bit hashing.
   Updated the `ColumnBFloat16RawBitsAndSignedZero` test accordingly (`EXPECT_NE`).

Also fixes a pre-existing `-Wimplicit-int-float-conversion` build error in
`benchmark_column_insert_many_from.cpp`.

Adds regression test `04316_fix_scatter_buffer_reuse_window` that catches the
stale-buffer bug via a parallel window function cross-joined with a `GROUP BY`
reference.

Co-authored-by: Cursor <cursoragent@cursor.com>
Both ScatterByPartitionTransform and BufferedShardByHashTransform reused
a member PaddedPODArray<UInt32> across blocks via resize_fill(n, value),
which only fills the grown tail [old_size, n). When a later block is <=
a prior block in size (the common case), no fill happens and the buffer
retains stale hashes from the previous block. computeHashInto(initial=false)
then combines into those stale values, routing equal keys to different shards
across blocks and producing split window partitions / aggregation groups.

Fix: replace resize_fill with resize + std::fill(…, WEAK_HASH32_INITIAL_VALUE)
so the full buffer is always re-initialized each block.

Also:
- Restore the intHashCRC32 finalizer in JoinUtils::hashToSelector (grace_hash
  routing is now bit-identical to old WeakHash32-based code).
- Revert float -0.0 normalization in weakHashValue32 (raw-bit hashing matches
  old getWeakHash32 byte-for-byte; SQL-equal -0.0/+0.0 routing to different
  shards is pre-existing old behavior).
- Add regression test 04316_fix_scatter_buffer_reuse_window that cross-joins a
  parallel window function result against a GROUP BY reference and expects 0
  mismatches.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread src/Columns/benchmarks/benchmark_compute_hash_into.cpp Outdated
The `BM_HashAndFastrange_New` benchmark claimed to capture the
`BufferedShardByHashTransform` hot path but diverged from the actual
consumer loop in `generateOutputChunks`: it started the chain with
`initial=true` for the first key column and omitted the per-iteration
buffer seeding. The transform instead fills `hash_buffer` with
`WEAK_HASH32_INITIAL_VALUE` every block and calls `computeHashInto` with
`initial=false` for every key column. For `K=1` the benchmark therefore
measured just `h(row) + mapToRange` instead of
`combineWeakHash32(h(row), WEAK_HASH32_INITIAL_VALUE) + mapToRange`,
overstating the improvement and missing regressions in the real loop.

Seed `hash_buf` with `WEAK_HASH32_INITIAL_VALUE` each iteration (the
re-seed is part of the consumer hot path) and call `computeHashInto`
with `initial=false` for all columns, so the benchmark times the exact
loop the PR's performance contract is about.

Addresses review comment:
ClickHouse#106538 (review)

Co-authored-by: Cursor <cursoragent@cursor.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.60% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.20% 77.30% +0.10%

Changed lines: Changed C/C++ lines covered by tests: 1001/1183 (84.62%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 12 line(s) · Uncovered code

Full report · Diff report

@nickitat nickitat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is indeed a much better-scoped and easier-to-review patch. Thanks.

Comment thread src/Common/MapToRange.h
///
/// Two output-width overloads share the same kernel shape:
/// - UInt64 output feeds an IColumn::Selector directly (no widening copy).
/// - UInt32 output halves the selector bandwidth where a 32-bit index suffices.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation itself is about the same length as the comments - I'm not sure it is actually useful. Also, if you want to comment on the implementation details, please do so next to the actual code and not in the header.

Comment on lines +146 to +148
/// Partition rows by shard using Lemire fastrange.
/// The CRC32C-based per-row hash is well-distributed across all 32 bits, so the high
/// bits fed into the shift are uniform — no extra finalizer is needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Partition rows by shard using Lemire fastrange.
/// The CRC32C-based per-row hash is well-distributed across all 32 bits, so the high
/// bits fed into the shift are uniform — no extra finalizer is needed.

Let's not repeat the same comment at every call site.

Comment on lines +141 to +142
hash_buffer.resize(num_rows);
std::fill(hash_buffer.begin(), hash_buffer.end(), WEAK_HASH32_INITIAL_VALUE);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
hash_buffer.resize(num_rows);
std::fill(hash_buffer.begin(), hash_buffer.end(), WEAK_HASH32_INITIAL_VALUE);
hash_buffer.assign(num_rows, WEAK_HASH32_INITIAL_VALUE);

chassert(!columns.empty());

hash.reset(num_rows);
hash.resize(num_rows);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same assign suggestion.

Comment on lines 148 to +149
for (size_t row = 0; row < num_rows; ++row)
selector[row] = (static_cast<UInt64>(hash_data[row]) * output_size) >> 32; /// The "fastrange" method from Daniel Lemire
selector[row] = (static_cast<UInt64>(hash[row]) * output_size) >> 32; /// The "fastrange" method from Daniel Lemire

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use mapToRange here?

columns[column_number]->computeHashInto(0, num_rows, hash.data(), false);

const PaddedPODArray<UInt32> & hash_data = hash.getData();
IColumn::Selector selector(num_rows);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can reuse selector too.

return static_cast<uint32_t>(intHashCRC32(v.value, WEAK_HASH32_INITIAL_VALUE));
}

MULTITARGET_FUNCTION_X86_V4(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we trying to achieve with avx-512 here? crc builtins won't vectorise AFAIK, and potential unrolling also shouldn't really speed anything up.

if (direction == IColumn::PermutationSortDirection::Ascending && stability == IColumn::PermutationSortStability::Unstable)
try_sort = trySort(res.begin(), res.end(), comparator_ascending);
else if (direction == IColumn::PermutationSortDirection::Ascending && stability == IColumn::PermutationSortStability::Stable)
else if (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid formatting unchanged code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants