Fix ColumnReplicated type mismatch crash in merge algorithms#101036
Fix ColumnReplicated type mismatch crash in merge algorithms#101036Avogar merged 3 commits intoClickHouse:masterfrom
Conversation
Pre-PR Validation GateSession: a) Deterministic repro?Yes. Unit test b) Root cause explained?
c) Fix matches root cause?Yes. Primary fix: d) Test intent preserved / new tests added?3 new unit tests covering all three insertion paths ( e) Both directions demonstrated?
All 3 new tests pass. All 7 existing |
|
cc @Avogar @CurtizJ — could you review this? It fixes the STID 2508 family of crashes where ColumnReplicated columns from lazy JOIN replication pass through merge algorithm consume() without being materialized, causing type mismatch assertions when MergedData destinations are regular columns (initialized before the ColumnReplicated inputs arrived). |
|
Workflow [PR], commit [e9eae8c] Summary: ❌
AI ReviewSummaryThis PR fixes a real correctness issue where late-arriving ClickHouse Rules
Final Verdict
|
| { | ||
| columns[i] = ColumnReplicated::create(std::move(chunk_columns[i])); | ||
| } | ||
| else if (!columns[i]->isReplicated() && chunk_columns[i]->isReplicated()) |
There was a problem hiding this comment.
❌ insertChunk still has a type-mismatch path for non-sort dynamic/statistics columns because the new replicated-materialization branch runs after hasDynamicStructure / hasStatistics branches.
Concrete trace:
initializesees late/empty input, so destinationcolumns[i]stays regular (notColumnReplicated).- Later
insertChunkgetschunk_columns[i] = ColumnReplicated(ColumnObject)(or anotherhasDynamicStructure/hasStatisticstype). - At line 126,
columns[i]->hasDynamicStructure()is true, so code executescolumns[i] = columns[i]->cloneEmpty(); columns[i]->insertRangeFrom(*chunk_columns[i], 0, num_rows);. ColumnObject::insertRangeFromdoesassert_cast<const ColumnObject &>(src);srcisColumnReplicated, so this is an exception in debug/ASan and UB in release.
So the mismatch guard at lines 142-146 is never reached for these column families. Please move replicated materialization before hasDynamicStructure / hasStatistics handling (or materialize chunk_columns[i] up front once per column).
|
@groeneai, check the review comments. Also, for all new errors found by CI: Stateless tests (arm_binary, sequential) failure Link the existing fixes if any, or create new PRs with fixes. |
|
@alexey-milovidov — addressed the review comments and pushed a fix (commit 3bbf514): Review blocker fixed: Moved Also added a regression test ( CI failures — both are pre-existing and tracked:
|
|
Thanks! Now investigate the hung check failure. If it is fixed by any open or recently merged PRs, link them here. |
|
@alexey-milovidov — investigated the hung check failure. It is a pre-existing trunk issue, not caused by this PR. CIDB evidence (30-day window):
Breakdown by stress test type (last 7 days):
Recently merged fix PRs that partially address this:
Open PR that may address remaining cases:
Despite the merged fixes, hung check failures continue at a high rate (~18 master hits/week). The root cause appears to be multiple independent deadlock/hang paths in stress tests, not a single issue. |
|
@groeneai Resolve conflicts. |
When MergedData::initialize() runs, it wraps destination columns as ColumnReplicated only if initial inputs have ColumnReplicated columns. However, inputs that are null (no chunk) at initialization time are skipped. When those inputs later arrive via consume(), the previous code only materialized sort columns via removeReplicatedFromSortingColumns(), leaving non-sort ColumnReplicated columns intact. This caused a type mismatch when insertRow/insertRows tried to insert from ColumnReplicated source into a regular destination column, triggering the assertTypeEquality assertion at IColumn.h:862. The fix has two layers: 1. Primary: All merge algorithm consume() methods now call removeReplicatedFromAllColumns() which materializes ALL ColumnReplicated columns, not just sort columns. This prevents the mismatch from occurring. 2. Defense-in-depth: MergedData::insertRow, insertRows, and insertChunk now check for ColumnReplicated/regular mismatches and materialize the source when needed. This addresses STID 2508 family crashes (2508-37b9, 2508-35cc, 2508-394e, 2508-32f6, 2508-35de, 2508-33f1, 2508-2c31) which collectively account for 47+ hits across 40+ PRs and 4+ master hits in 30 days, accelerating to 15+ hits in the last 3 days. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move ColumnReplicated materialization to the top of the per-column loop in insertChunk, before hasDynamicStructure/hasStatistics branches. This prevents ColumnDynamic::insertRangeFrom from receiving a ColumnReplicated source (causing assert_cast crash in debug/sanitizer, UB in release). The previous code had the replicated guard after dynamic/statistics branches, so ColumnReplicated(ColumnDynamic) would enter the hasDynamicStructure path first and crash. Also adds a regression test with ColumnReplicated(ColumnDynamic) in insertChunk to cover this specific code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cb439a3 to
50df338
Compare
|
@nikitamikhaylov Conflicts resolved — rebased onto latest master. Clean 2-commit history:
The only conflict was in |
Status update — PR ready for review (new evidence: same bug just hit master in release build)New data point: On 2026-04-18 23:14 UTC, master commit
This is the release-build manifestation of the same bug: when CIDB evidence (last 30 days, STID 2508 family on master):
~65 master hits + hundreds of PR hits in the 30 days this PR has been waiting. Failing CI checks — all confirmed pre-existing / unrelated to this PR
My 4 added unit tests ( Ask@Avogar @CurtizJ @alexey-milovidov @nikitamikhaylov — could one of you approve this? Master is taking hits from this bug daily and the branch is clean (rebased 2026-04-16, 100% changed-line coverage, all genuine failures addressed). Happy to rebase again if needed. |
There was a problem hiding this comment.
I don't understand the fix. First, there is no real SQL test that reproduced the issue. Adding a unit test that creates some classes with bad arguments migth be unrealistic scenario and not good enough. Second, the changes kill optimization from ColumnReplicated by removing them from all inputs except for the initial ones.
If the problem with late arriving Replicated columns is real and can be confirmed by some SQL/integration test, we can just catch it and wrap the column in the merged data to Replicated.
If it's impossible to reproduce it with the real SQL/integration tests, we can first add an additional check in MergedData when input column is Replicated but the accumulated column is not and throw a specific error message about it, so we can ensure this is the problem and try to reproduce it
…stination Based on @Avogar's review feedback on PR ClickHouse#101036, this commit redesigns the fix for the `STID 2508` family `MergedData` type mismatch exception (`isConst() || isSparse() || isReplicated()` assertion at `IColumn.h:862`). Previous fix (reverted here) materialized all `ColumnReplicated` columns in the merge algorithms' `consume` methods, which killed the lazy replication optimization for every late-arriving chunk. It also materialized sources in `MergedData::insertRow` / `insertRows`, losing the optimization on every row insertion. New approach: detect the mismatch in `MergedData` and WRAP the destination in `ColumnReplicated` using the identity-indexes constructor. After wrapping, `ColumnReplicated::insertFrom` / `insertRangeFrom` consume both regular and replicated sources through the same optimized path, preserving the lazy replication optimization for the entire merged output. Changes: 1. Revert `removeReplicatedFromAllColumns` in all 5 merge algorithm `consume` methods. Restore `removeReplicatedFromSortingColumns`, which only materializes sort columns — non-sort `ColumnReplicated` columns now flow through to `MergedData` unchanged. 2. In `MergedData::insertRow` and `insertRows`: when the source is `ColumnReplicated` but the destination is regular, wrap the destination in `ColumnReplicated` via `ColumnReplicated::create(std::move(columns[i]))`. The identity-indexes constructor is a structural wrap — existing data is preserved in `nested_column` with indexes `[0, 1, ..., N-1]`. 3. In `MergedData::insertChunk`: drop the eager up-front materialization. Handle each branch on its own terms: - `hasDynamicStructure` wraps the empty destination in `ColumnReplicated` when the chunk is replicated, so `ColumnReplicated::insertRangeFrom` handles the source (not `ColumnDynamic::insertRangeFrom`, which would `assert_cast` a `ColumnReplicated` source). - `hasStatistics` works unchanged — `takeOrCalculateStatisticsFrom` is delegated through `ColumnReplicated` to the nested column. - `columns[i]->isReplicated()` handles a replicated chunk via `std::move` instead of re-wrapping. - The simple else branch lets `ColumnReplicated` chunks flow through, so the merged destination inherits the optimization. 4. Rewrite `gtest_merged_data_replicated.cpp`: - Keep regression tests for `insertRow` / `insertRows` / `insertChunk` covering regular destination + `ColumnReplicated` source, including `ColumnReplicated(ColumnDynamic)`. - Add `InsertRowsWrapsDestinationPreservingOptimization` which asserts the merged destination is `ColumnReplicated` after mismatch handling (not materialized as regular). Reproduction note: over ~3 weeks of attempts I could not reproduce the assertion failure through SQL or integration tests — it requires the stress-test-scale parallelism + sanitizer timing that CI provides. The gtests exercise the exact conditions that arrive at `MergedData` under that environment.
|
Thanks for the review @Avogar — you were right on both points. Pushed On the optimization loss: Reverted On the fix approach: Took the "wrap the column in the merged data to
On SQL reproduction: Being transparent — over ~3 weeks of attempts I could not reproduce the assertion through real SQL or integration tests. I tried the exact failing queries from CIDB (LEFT JOIN + ORDER BY + LIMIT shapes, Session: |
LLVM Coverage Report
Changed lines: 100.00% (137/137) · Uncovered code |
4f3b256
|
Hi — this PR may need backporting to Affected code: Why: Other supported branches ( If this should be backported, consider adding |
Cherry pick #101036 to 26.3: Fix ColumnReplicated type mismatch crash in merge algorithms
Backport #101036 to 26.3: Fix ColumnReplicated type mismatch crash in merge algorithms
Cherry pick #101036 to 26.2: Fix ColumnReplicated type mismatch crash in merge algorithms
Cherry pick #101036 to 26.1: Fix ColumnReplicated type mismatch crash in merge algorithms
Backport #101036 to 26.2: Fix ColumnReplicated type mismatch crash in merge algorithms
Backport #101036 to 26.1: Fix ColumnReplicated type mismatch crash in merge algorithms
Problem
When
enable_lazy_columns_replication=1(randomized in CI), JOIN queries produceColumnReplicatedcolumns for lazy column replication. These columns flow throughmerge-sort pipelines (ORDER BY, MergeTree merges). During
MergedData::initialize(),destination columns are wrapped as
ColumnReplicatedonly if initial inputs containColumnReplicatedcolumns — but inputs that are null (no chunk) at initializationtime are skipped.
When those late-arriving inputs provide chunks via
consume(), the previous codeonly materialized sort columns via
removeReplicatedFromSortingColumns(), leavingnon-sort
ColumnReplicatedcolumns intact. WhenMergedData::insertRow/insertRowstried to insert from a
ColumnReplicatedsource into a regular destination column,it triggered the
assertTypeEqualityassertion atIColumn.h:862:This is STID 2508 family: 47+ hits across 40+ PRs and 4+ master hits in 30 days,
accelerating to 15+ hits in the last 3 days alone.
Fix
Two-layer fix:
Primary: All merge algorithm
consume()methods now callremoveReplicatedFromAllColumns()instead ofremoveReplicatedFromSortingColumns().This materializes ALL
ColumnReplicatedcolumns in new chunks, not just sortcolumns, preventing the type mismatch from reaching
MergedData.Defense-in-depth:
MergedData::insertRow,insertRows, andinsertChunknowcheck for
ColumnReplicated/regular mismatches and materialize the source whenneeded.
The
initialize()path is unchanged — it still detectsColumnReplicatedininitial inputs and wraps destinations, preserving the lazy replication optimization
for the initial chunks. The
consume()materialization only affects late-arrivingchunks where the optimization can't be safely applied.
Affected merge algorithms
MergingSortedAlgorithm(ORDER BY merge)IMergingAlgorithmWithSharedChunks(Replacing, Collapsing, VersionedCollapsing, Graphite)SummingSortedAlgorithmAggregatingSortedAlgorithmFinishAggregatingInOrderAlgorithmTesting
insertRow,insertRows, andinsertChunkpathsColumnReplicatedtests pass (7/7)enable_lazy_columns_replication=1produce correct resultsChangelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix crash (
Logical error: isConst/isSparse/isReplicated assertTypeEquality) in merge algorithms when lazy column replication (enable_lazy_columns_replication) producesColumnReplicatedcolumns that flow into merge-sort pipelines with late-arriving inputs.Documentation entry for user-facing changes