Add non-allocating IColumn::computeHashInto column hash kernel#106538
Add non-allocating IColumn::computeHashInto column hash kernel#106538harikrishnan94 wants to merge 14 commits into
Conversation
Split out the hash-production half of ClickHouse#106120 into its own change, as requested in review: keep the per-column hash calculation separate from the batched scatter interface so each can be reviewed and benchmarked in isolation. `IColumn::computeHashInto(row_begin, row_end, hash_out, initial)` is a non-allocating per-row hash kernel that writes 32-bit `fmix32`-based hashes into a caller-provided buffer, with SIMD overrides (`MULTITARGET_FUNCTION_X86_V4_V3`) for the common column types. Multi-column keys compose in place via `fmix32Combined` with no intermediate arrays, and the contract is representation-independent: a materialized column and a transparent wrapper (`ColumnConst`, `ColumnLowCardinality`, `ColumnSparse`, `ColumnReplicated`) of the same values hash identically. `IColumn::getWeakHash32`, `WeakHash32` and `WeakHash.h`/`WeakHash.cpp` are removed; the remaining consumers (`JoinUtils::scatterBlockByHash`, `ScatterByPartitionTransform`, `BufferedShardByHashTransform`) fill a `PaddedPODArray<UInt32>` via chained `computeHashInto`. The CRC32C `updateWeakHash32` primitive stays, with its default initial value relocated to `WEAK_HASH32_INITIAL_VALUE` in `Hash.h`. `mapToRange` maps a 32-bit hash uniformly to a shard index via Lemire fastrange. Routing hash values change from CRC32C to the `fmix32` finalizer, which is safe because both consumers route in-memory per query; per-key stable shard identity was never guaranteed. Verified that `grace_hash` joins match the default `hash` join across the changed key families (`04312_grace_hash_vs_hash_join_key_types`). This intentionally excludes the `DB::ColumnsScatter::scatter` primitive, the `shard_by_hash_input_batch_bytes` input-batching mode, and their tests from ClickHouse#106120; the consumers retain the legacy per-chunk `IColumn::scatter`. Related: ClickHouse#105936 Related: ClickHouse#106120 Related: ClickHouse#106120 (review) Co-authored-by: Cursor <cursoragent@cursor.com>
|
Workflow [PR], commit [1c5a9fc] Summary: ✅
AI ReviewSummaryThis PR replaces the allocating Missing context / blind spots
Findings
Tests
Final VerdictStatus: Minimum required action: provide current final-commit before/after performance evidence for the affected sharded aggregation and hashing paths, or change the PR metadata/changelog so it no longer promises |
Fix two test issues flagged in review of the `computeHashInto` kernel: - `InitialFlagSemantics` computed `combined_manual` but never asserted it, so the test passed for any `initial=false` implementation that merely mutated the buffer. Assert the documented contract `EXPECT_EQ(combined_api, combined_manual)`. - The distributional chi-square tests seeded `rng` from `randomSeed`, so a correct hash could exceed the `p=0.001` cutoff and flake in CI. Use a fixed seed for reproducibility. Document the intentional layout-dependence of `ColumnDynamic::computeHashInto` and `ColumnObject::computeHashInto`: like the former `getWeakHash32` and `updateHashWithValueRange`, the weak routing hash is not guaranteed to be equal for logically equal values stored with different variant/path layouts. This matches prior behavior; the in-memory scatter consumers only need a fast per-query partitioning, not a layout-stable hash. Co-authored-by: Cursor <cursoragent@cursor.com>
`ColumnFunction::computeHashInto` streamed each captured column directly into the caller's buffer on the `initial == false` path, producing `combine(h_b, combine(h_a, prior))` instead of the contract's `combine(combine(h_b, h_a), prior)`. Since `ColumnConst` feeds back the finalized per-row value, a materialized `ColumnFunction` and a `ColumnConst(ColumnFunction)` of the same value composed differently when the function was not the first key, which could route equal keys to different shards or `grace_hash` partitions. Mirror `ColumnTuple`: build the finalized function hash in a scratch buffer, then combine it once via `fmix32Combined`. The empty-capture non-initial case now combines `0` instead of leaving the prior hash unchanged. Add deterministic `gtest_compute_hash_into` coverage for the >8-byte folding path in `hashValueRaw32`/`hashDecimalRaw32`, which the existing cases never reached: `UInt128`, `UInt256`, `Decimal128`, and `Decimal256`, each checked for distinct hashes, row-range split independence, and `initial == false` (`ColumnConst`) composition. Co-authored-by: Cursor <cursoragent@cursor.com>
…nd Replicated The representation-independence block (test 16) advertised `ColumnLowCardinality` and `ColumnReplicated` in its comment, but only built `ColumnConst` and `ColumnSparse`. The stateless `04312_grace_hash` test does not cover `ColumnLowCardinality::computeHashInto` either, because `JoinUtils::materializeColumn` strips LowCardinality keys before hashing. Add direct gtests for both wrappers: build a materialized column and a logically equal `ColumnLowCardinality` / `ColumnReplicated` with per-row varying values (a small pool with repeats so the dictionary/index mapping is non-trivial), then assert `composeKey(prefix, materialized) == composeKey(prefix, wrapper)` (the `initial == false` composition path) plus row-range split consistency via `expectRangeSplitConsistent`. Co-authored-by: Cursor <cursoragent@cursor.com>
|
📊 Cloud Performance Report 🟢 AI verdict: This PR rewrites the column-level weak-hash routine into a range-based, partially AVX-512-vectorized implementation. That hash drives row partitioning for hash joins and parallel aggregation, so the two flagged improvements both land on join-heavy TPC-H queries: Q7 is 34% faster and Q20 is 47% faster, both confirmed by both tests. Q7's measurements were tight and the result is clean; Q20's underlying runs were noisier, but the delta is large enough to be a real shift. Both improvements are consistent with what this change touches. clickbench🟢 No significant changes tpch_adapted_1_official🟢 2 improved Flagged queries (2 of 22)
q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on. Debug info
|
The non-allocating `IColumn::computeHashInto` kernel (PR ClickHouse#106538) bundled a CRC32C -> `fmix32` routing-hash switch that regressed join-heavy workloads: the Cloud Performance Report on the PR flagged TPC-H Q4 +33% slower because `fmix32` is far costlier than hardware CRC32C on the in-memory, per-query scatter paths (sharded aggregation, `grace_hash` bucketing, parallel-window partitioning, hash-join scatter) -- paths that never need a stable shard identity. Keep the non-allocating per-row buffer and the representation-independence contract, but restore CRC32C as both the per-row hash and the cross-column combiner: - Per-row `h(row)` is a finalized 32-bit CRC32C hash seeded with `WEAK_HASH32_INITIAL_VALUE`: `hashCRC32(value, ...)` for `ColumnVector` (one `_mm_crc32_u64` / `__crc32cd`, no hi/lo split), `intHashCRC32(v.value, ...)` for `ColumnDecimal`, and `updateWeakHash32` for the string/fixed-string/ aggregate families. Float `-0.0` -> `+0.0` normalization is preserved. - Composite keys combine each column's FINALIZED per-row hash via one shared, order-sensitive `combineWeakHash32` that packs `prior` and `value` into the two halves of a single 64-bit CRC input (a plain `intHashCRC32(value, prior)` is order-insensitive for 32-bit operands and would collide `(a, b)` with `(b, a)`). Wrappers (`Const` / `LowCardinality` / `Sparse` / `Replicated` / `Tuple` / `Array` / `Nullable` / `Object` / `Function` / `Variant`) combine the finalized nested hash, so SQL-equal keys hash identically across representations. - Remove `Common/HashCombine32.h` (`fmix32` / `fmix32Combined`) and every reference, comment, and doc. Validated by `gtest_compute_hash_into` (32/32) and `04312_grace_hash_vs_hash_join_key_types`. On the per-row hash benchmark the UInt64 (bigint) join-key path drops from ~1.36 to ~0.27 ns/row (~5.1x faster), restoring the fast scalar CRC32C path TPC-H Q4 joins use. Co-authored-by: Cursor <cursoragent@cursor.com>
`IColumnDummy::computeHashInto` only wrote `hash_out` on the `initial == true` path and left it unchanged when `initial == false`. A dummy column's finalized per-row hash is `0`, so a key `(prefix, materialized ColumnNothing)` hashed as just `prefix`, while `(prefix, ColumnConst(ColumnNothing))` combined `0` into `prefix` through `ColumnConst`. That breaks the representation-independence contract and can split equal multi-column keys if `Nothing`/dummy columns reach the scatter paths. Mirror the empty `ColumnTuple`/`ColumnFunction` handling: combine `0` via `combineWeakHash32` on the non-initial path. Add a `gtest_compute_hash_into` case asserting a materialized `ColumnNothing` and `ColumnConst(ColumnNothing)` compose identically. Addresses review comment: ClickHouse#106538 (comment) Co-authored-by: Cursor <cursoragent@cursor.com>
Stacked on top of ClickHouse#106538 (`IColumn::computeHashInto` hash kernel PR). This commit adds the scatter half and the `shard_by_hash_input_batch_bytes` input-batching mode. `DB::ColumnsScatter::scatter(source_columns, pids_per_source, num_shards, rows_per_shard)` is a batched, O(1)-dispatch physical split. The caller supplies one `IColumn` from each of B pending chunks at the same column-position, along with per-source `UInt32` partition-ids and precomputed per-shard row counts. A static function-pointer table keyed on `IColumn::getDataType()` selects the typed kernel with no virtual call on the hot path. Fast-path typed kernels for `ColumnVector`, `ColumnDecimal`, `ColumnFixedString`, `ColumnString`, `ColumnNullable`, and `ColumnTuple`; all other types delegate to the legacy `IColumn::scatter` fallback. Mixed-representation batches (`ColumnConst`/`ColumnSparse`/`ColumnReplicated` mixed with materialized columns) are normalized via `materializeTransparentWrappers` before dispatch, preserving `LowCardinality`; homogeneous `LowCardinality` batches use the fallback. `BufferedShardByHashTransform` input batching — new setting `shard_by_hash_input_batch_bytes` (default 2 MiB; 0 = legacy per-chunk path). The transform accumulates input chunks until the byte budget is reached, then flushes via `ColumnsScatter::scatter` once per column-position. Hash production uses `IColumn::computeHashInto` (CRC32C, from ClickHouse#106538) + `mapToRange`. `ScatterByPartitionTransform` is wired to `ColumnsScatter::scatter` for the same O(1)-dispatch benefit on the parallel-window partitioning path. Also adds: - `benchmark_columns_scatter.cpp` — RFC sweep over K and P - `gtest_columns_scatter.cpp` — equivalence vs legacy for all fast-path types, mixed-representation regressions, `LowCardinality` type preservation, and per-key count assertions - `04278_04101_sharded_aggregation_batched_input.sql` — batched vs legacy per-key-identical `GROUP BY` results across key types with small block size to exercise the threshold path, multi-chunk batches, and EOF flush Related: ClickHouse#105936 Related: ClickHouse#106120 Related: ClickHouse#106538 Co-authored-by: Cursor <cursoragent@cursor.com>
…pansion Two correctness/resource fixes in `ColumnsScatter::scatter`: 1. **`scatterFallback` nested-representation mismatch** — when `IColumn::scatter` is called across source chunks that share a logical type but differ in their nested concrete representation (e.g. `Array(UInt64)` with a full nested `ColumnUInt64` in one chunk vs `ColumnSparse(UInt64)` in another), the subsequent `insertRangeFrom` call on the destination's nested column would `assert_cast` on the wrong concrete type, causing undefined behaviour in release builds. Fix: add `deepNormalizeForFallback` which strips `ColumnConst`, `ColumnSparse`, and `ColumnReplicated` recursively at every nesting level (recursing into `ColumnArray` data and `ColumnTuple` elements) while preserving `ColumnLowCardinality`. `scatterFallback` normalizes all sources through this function before calling `IColumn::scatter`, and clones the destination from the normalized first source so all cross-chunk `insertRangeFrom` calls operate on matching concrete column types. 2. **Eager `ColumnConst` materialization before the UInt32 overflow guard** — the overflow guard previously ran after `materializeTransparentWrappers`, expanding O(1)-byte `ColumnConst` sources into full columns even when the only path taken was the size_t-safe `scatterFallback`. For const-heavy batches accumulated by `BufferedShardByHashTransform`, this could expand millions of logical rows into full allocations unnecessarily. Fix: move the UInt32 overflow guard before the wrapper materialization block. Additionally, add a fast compact path for a homogeneous batch of equal-valued `ColumnConst` sources: when all sources share the same constant value, produce `ColumnConst(value, rows_per_shard[s])` per shard via `cloneResized` — no materialization, O(1) memory, matching the behavior the legacy per-chunk `IColumn::scatter` had for constant-key workloads. Co-authored-by: Cursor <cursoragent@cursor.com>
…nConst` compaction Two new tests in `gtest_columns_scatter`: - `FallbackArrayMixedNestedRepresentation`: scatters two `Array(UInt64)` chunks where the first has a full nested `ColumnUInt64` and the second has a `ColumnSparse(UInt64)` nested column (identical logical values). Verifies that the scattered results match a reference built from two fully-materialized sources. Before the fix, the cross-chunk `insertRangeFrom` in `scatterFallback` would `assert_cast` a `ColumnSparse` as `ColumnVector` and produce wrong values or UB. - `ConstColumnStaysCompact`: scatters three `ColumnConst(UInt64=99)` sources through `ColumnsScatter::scatter` and asserts that every per-shard output column `isConst()`. Verifies that the new all-const compact path returns `ColumnConst(99, rows_per_shard[s])` without materializing the sources into full columns. Co-authored-by: Cursor <cursoragent@cursor.com>
…ce_hash buckets The power-of-two bucket selection in `scatterBlockByHashPow2` used `hash & mask`, which extracts only the low bits of the CRC32C hash. CRC32C has poor avalanche in the low bits for structured input (e.g. sequential integers), leaving the bucket distribution/performance contract unproven for the `grace_hash` path. Replace with Knuth's multiplicative (Fibonacci) hash: multiply by `floor(2^32/phi) = 0x9E3779B1` and take the top `log2(num_shards)` bits via `>> (32 - K)`. The high bits of the product depend on all 32 input bits, so even poor CRC32C low-bit distribution becomes uniform bucket assignment. This mirrors the approach already used by `scatterBlockByHashGeneric` and `mapToRange` (both use high-bit Lemire fastrange); the pow2 path now has consistent behaviour. Add two chi-squared distribution tests in `gtest_compute_hash_into` that exercise the Fibonacci sharder path directly: - `DistributionPow2FibRandom`: random UInt32 keys, 64 buckets. - `DistributionPow2FibSequential`: sequential UInt64 keys (0..65535), the pathological case for raw CRC32C low bits, verified uniform after the fix. Addresses reviewer comment on: ClickHouse#106538 Related: ClickHouse#106538 Related: ClickHouse#106120 Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the `IColumn::computeHashInto` kernel work while removing the `ColumnsScatter` batching path, related setting, tests, and benchmark from this branch. Co-authored-by: Cursor <cursoragent@cursor.com>
Three correctness fixes for the `IColumn::computeHashInto` branch: 1. **BLOCKER — stale hash buffer in scatter transforms**: `ScatterByPartitionTransform` and `BufferedShardByHashTransform` reused a member `PaddedPODArray<UInt32>` across blocks via `resize_fill(n, WEAK_HASH32_INITIAL_VALUE)`. `resize_fill` only fills the grown tail `[old_size, n)`, so when a later block is `<=` the prior size (the common case with equal-size blocks) the buffer is NOT re-initialized. `computeHashInto` with `initial=false` then combines into stale data, routing equal keys to different shards across blocks — splitting window partitions and aggregation groups → wrong results. Fixed by `resize` + `std::fill(..., WEAK_HASH32_INITIAL_VALUE)`. 2. **Dropped grace_hash finalizer**: `JoinUtils::hashToSelector` was calling `sharder(hashes[i])` instead of `sharder(intHashCRC32(hashes[i]))`, losing the CRC32 avalanche step that the old `scatterBlockByHash` path depended on. Restored. 3. **Float `-0.0` normalization deviation**: `weakHashValue32` in `ColumnVector.cpp` was normalizing `-0.0` to `+0.0` before hashing, deviating from the old `getWeakHash32` byte-for-byte behavior. Reverted to raw-bit hashing. Updated the `ColumnBFloat16RawBitsAndSignedZero` test accordingly (`EXPECT_NE`). Also fixes a pre-existing `-Wimplicit-int-float-conversion` build error in `benchmark_column_insert_many_from.cpp`. Adds regression test `04316_fix_scatter_buffer_reuse_window` that catches the stale-buffer bug via a parallel window function cross-joined with a `GROUP BY` reference. Co-authored-by: Cursor <cursoragent@cursor.com>
Both ScatterByPartitionTransform and BufferedShardByHashTransform reused a member PaddedPODArray<UInt32> across blocks via resize_fill(n, value), which only fills the grown tail [old_size, n). When a later block is <= a prior block in size (the common case), no fill happens and the buffer retains stale hashes from the previous block. computeHashInto(initial=false) then combines into those stale values, routing equal keys to different shards across blocks and producing split window partitions / aggregation groups. Fix: replace resize_fill with resize + std::fill(…, WEAK_HASH32_INITIAL_VALUE) so the full buffer is always re-initialized each block. Also: - Restore the intHashCRC32 finalizer in JoinUtils::hashToSelector (grace_hash routing is now bit-identical to old WeakHash32-based code). - Revert float -0.0 normalization in weakHashValue32 (raw-bit hashing matches old getWeakHash32 byte-for-byte; SQL-equal -0.0/+0.0 routing to different shards is pre-existing old behavior). - Add regression test 04316_fix_scatter_buffer_reuse_window that cross-joins a parallel window function result against a GROUP BY reference and expects 0 mismatches. Co-authored-by: Cursor <cursoragent@cursor.com>
The `BM_HashAndFastrange_New` benchmark claimed to capture the `BufferedShardByHashTransform` hot path but diverged from the actual consumer loop in `generateOutputChunks`: it started the chain with `initial=true` for the first key column and omitted the per-iteration buffer seeding. The transform instead fills `hash_buffer` with `WEAK_HASH32_INITIAL_VALUE` every block and calls `computeHashInto` with `initial=false` for every key column. For `K=1` the benchmark therefore measured just `h(row) + mapToRange` instead of `combineWeakHash32(h(row), WEAK_HASH32_INITIAL_VALUE) + mapToRange`, overstating the improvement and missing regressions in the real loop. Seed `hash_buf` with `WEAK_HASH32_INITIAL_VALUE` each iteration (the re-seed is part of the consumer hot path) and call `computeHashInto` with `initial=false` for all columns, so the benchmark times the exact loop the PR's performance contract is about. Addresses review comment: ClickHouse#106538 (review) Co-authored-by: Cursor <cursoragent@cursor.com>
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 1001/1183 (84.62%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 12 line(s) · Uncovered code |
| /// | ||
| /// Two output-width overloads share the same kernel shape: | ||
| /// - UInt64 output feeds an IColumn::Selector directly (no widening copy). | ||
| /// - UInt32 output halves the selector bandwidth where a 32-bit index suffices. |
There was a problem hiding this comment.
The implementation itself is about the same length as the comments - I'm not sure it is actually useful. Also, if you want to comment on the implementation details, please do so next to the actual code and not in the header.
| /// Partition rows by shard using Lemire fastrange. | ||
| /// The CRC32C-based per-row hash is well-distributed across all 32 bits, so the high | ||
| /// bits fed into the shift are uniform — no extra finalizer is needed. |
There was a problem hiding this comment.
| /// Partition rows by shard using Lemire fastrange. | |
| /// The CRC32C-based per-row hash is well-distributed across all 32 bits, so the high | |
| /// bits fed into the shift are uniform — no extra finalizer is needed. |
Let's not repeat the same comment at every call site.
| hash_buffer.resize(num_rows); | ||
| std::fill(hash_buffer.begin(), hash_buffer.end(), WEAK_HASH32_INITIAL_VALUE); |
There was a problem hiding this comment.
| hash_buffer.resize(num_rows); | |
| std::fill(hash_buffer.begin(), hash_buffer.end(), WEAK_HASH32_INITIAL_VALUE); | |
| hash_buffer.assign(num_rows, WEAK_HASH32_INITIAL_VALUE); |
| chassert(!columns.empty()); | ||
|
|
||
| hash.reset(num_rows); | ||
| hash.resize(num_rows); |
| for (size_t row = 0; row < num_rows; ++row) | ||
| selector[row] = (static_cast<UInt64>(hash_data[row]) * output_size) >> 32; /// The "fastrange" method from Daniel Lemire | ||
| selector[row] = (static_cast<UInt64>(hash[row]) * output_size) >> 32; /// The "fastrange" method from Daniel Lemire |
| columns[column_number]->computeHashInto(0, num_rows, hash.data(), false); | ||
|
|
||
| const PaddedPODArray<UInt32> & hash_data = hash.getData(); | ||
| IColumn::Selector selector(num_rows); |
There was a problem hiding this comment.
Looks like we can reuse selector too.
| return static_cast<uint32_t>(intHashCRC32(v.value, WEAK_HASH32_INITIAL_VALUE)); | ||
| } | ||
|
|
||
| MULTITARGET_FUNCTION_X86_V4( |
There was a problem hiding this comment.
What are we trying to achieve with avx-512 here? crc builtins won't vectorise AFAIK, and potential unrolling also shouldn't really speed anything up.
| if (direction == IColumn::PermutationSortDirection::Ascending && stability == IColumn::PermutationSortStability::Unstable) | ||
| try_sort = trySort(res.begin(), res.end(), comparator_ascending); | ||
| else if (direction == IColumn::PermutationSortDirection::Ascending && stability == IColumn::PermutationSortStability::Stable) | ||
| else if ( |
There was a problem hiding this comment.
Please avoid formatting unchanged code.
Split out the hash-production half of #106120 into its own pull request, as requested in review: keep the per-column hash calculation separate from the batched scatter interface so each can be reviewed and benchmarked in isolation.
Motivation
Several in-memory routing paths — sharded aggregation (
BufferedShardByHashTransform),grace_hashjoin bucketing (JoinUtils::scatterBlockByHash), and parallel window partitioning (ScatterByPartitionTransform) — hashed their key columns by allocating aWeakHash32per chunk and chainingIColumn::getWeakHash32across the key columns, then applying a separateintHashCRC32finalizer before mapping to a shard. The per-chunk allocation and the extra finalizer dominate for narrow keys at high throughput.Changes
IColumn::computeHashInto(row_begin, row_end, hash_out, initial)is a non-allocating per-row hash kernel that writes a finalized 32-bit hash for each row into a caller-provided buffer. The per-row hash is produced with hardware CRC32C —hashCRC32/intHashCRC32/updateWeakHash32, i.e._mm_crc32_u64on x86 (SSE4.2),__crc32cdon ARM, with a software fallback — seeded withWEAK_HASH32_INITIAL_VALUE, consuming a full 64-bit word in a single instruction. Multi-column keys compose in place — with no intermediate arrays — viacombineWeakHash32, a single order-sensitive CRC32C step that packs the prior accumulator and the new per-row hash into the two halves of one 64-bit CRC input. Type-specialized kernels are provided for the common column types (MULTITARGET_FUNCTION_X86_V4).The kernel obeys a representation-independence contract: a materialized column and a transparent wrapper of the same values (
ColumnConst,ColumnLowCardinality,ColumnSparse,ColumnReplicated) compose to identical hashes, so equal multi-column keys are never split across shards or join buckets because of physical representation. Every column type combines its finalized per-row hash through the samecombineWeakHash32; wrappers and composite columns (ColumnTuple/ColumnArray/ColumnNullable/ColumnObject/ColumnFunction/ColumnVariant, and the dummyColumnNothing) combine the finalized nested hash.IColumn::getWeakHash32,WeakHash32, andWeakHash.h/WeakHash.cppare removed; the three consumers now fill aPaddedPODArray<UInt32>via chainedcomputeHashInto. The CRC32CupdateWeakHash32primitive stays, with its default initial valueWEAK_HASH32_INITIAL_VALUEand the newcombineWeakHash32combiner co-located inHash.h. Because every column already writes a finalized, well-distributed 32-bit hash, the separateintHashCRC32finalizer inhashToSelectoris dropped — the sharder consumes the hash directly:hash & maskfor power-of-two bucket counts (thegrace_hashpath) and Lemire fastrange(h * P) >> 32(mapToRange) otherwise.The per-row hash values differ from the old per-chunk
getWeakHash32chain. Changing the exact hash values is safe because all three consumers route in-memory per query; per-key stable shard identity was never guaranteed.Testing
gtest_compute_hash_into(33 cases): row-range independence,initial/combine semantics, null-payload independence, array-length seeding,BFloat16raw-bit hashing, chi-squared distribution uniformity, order sensitivity, wide 128/256-bit folding (UInt128/UInt256/Decimal128/Decimal256), and representation independence acrossConst/Sparse/LowCardinality/Replicated/NothingforUInt64/String/Tuple/Array/Decimal/FixedStringkeys.04312_grace_hash_vs_hash_join_key_typesassertsgrace_hashjoins match the defaulthashjoin acrossUInt/String/Nullable/LowCardinality/Array/Tuplekeys.03927,03928,04097–04100) and a parallel-window correctness check that routes throughScatterByPartitionTransform.Related: #105936
Related: #106120
Related: #106120 (review)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Speed up sharded aggregation,
grace_hashjoin bucketing, and parallel window partitioning by replacing the per-chunk allocatinggetWeakHash32column hashing with a non-allocatingIColumn::computeHashIntokernel that writes finalized per-row hardware-CRC32C hashes and composes multi-column keys in place viacombineWeakHash32.Made with Cursor