
Fix exception when reading .size subcolumn of sparse Nullable(String) with PREWHERE#97264

Merged
alexey-milovidov merged 3 commits into master from fix-sparse-string-size-subcolumn-prewhere on Feb 19, 2026

Conversation

@alexey-milovidov
Member

Summary

  • Fix a LOGICAL_ERROR exception in SerializationSparse::deserializeBinaryBulkWithMultipleStreams (offsets_column->size() + 1 != values_column->size()) when reading .size subcolumn of a sparse Nullable(String) inside a Tuple together with the full Tuple via PREWHERE.
  • The root cause: SerializationStringSize::deserializeWithStringData cached the accumulated string_state.column (which grows across marks) instead of only the current range's data. On mark 1+, insertDataFromCachedColumn saw the size mismatch and replaced ColumnSparse's values column entirely, breaking the sparse invariant.
  • The fix caches only the current range via cut(prev_size, num_read_rows), preserving the sizes-only optimization for queries that need only string sizes (length(s), empty(s), etc.).

Found by AST fuzzer: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=97242&sha=60722e5f8a3edd99281b6bd858dee3f0f9bf84d7&name_0=PR&name_1=AST%20fuzzer%20%28amd_ubsan%29

Test plan

  • Verified fix with the minimal reproducer from the fuzzer
  • Verified sizes-only optimization still works (length(), empty(), standalone .size reads)
  • Added regression test 03924_sparse_string_size_subcolumn_prewhere

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix LOGICAL_ERROR exception when reading .size subcolumn of a sparse Nullable(String) in a Tuple with PREWHERE.

🤖 Generated with Claude Code

alexey-milovidov and others added 2 commits February 18, 2026 10:11
…ng)` with PREWHERE

When reading `.size` subcolumn of a sparse `Nullable(String)` inside a Tuple
together with the full Tuple via PREWHERE, a LOGICAL_ERROR was thrown in
`SerializationSparse::deserializeBinaryBulkWithMultipleStreams` because
`offsets_column->size() + 1 != values_column->size()`.

Two issues in `SerializationStringSize`:

1. `deserializeBinaryBulkStatePrefix` only set `need_string_data = true` when
   there was no state cache. When both `t.a.size` and `t` share a
   `SubstreamsCache` (keyed by column name), reading only sizes cached a
   `ColumnUInt64` under the `Substream::Regular` key, poisoning the cache for
   `SerializationString` which expects a `ColumnString`.

2. `deserializeWithStringData` cached the accumulated `string_state.column`
   directly. This column grows across marks (persistent state), so on mark 1+
   it contained elements from all previous marks. When `insertDataFromCachedColumn`
   saw `cached_column->size() != num_read_rows`, it replaced `ColumnSparse`'s
   values column entirely, breaking the sparse invariant.

The fix always sets `need_string_data = true` and caches only the current
range via `cut(prev_size, num_read_rows)` instead of the full accumulated
column.

Found by AST fuzzer: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=97242&sha=60722e5f8a3edd99281b6bd858dee3f0f9bf84d7&name_0=PR&name_1=AST%20fuzzer%20%28amd_ubsan%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert need_string_data back to conditional (only when !cache).

SerializationString::deserializeBinaryBulkStatePrefix already upgrades
the shared state to need_string_data=true when both the .size subcolumn
and the full column are read. Forcing it unconditionally would disable
the optimization for queries that only need string sizes (e.g.,
`length(s)`, `empty(s)`, `s != ''`).

The actual bug fix (caching only the current range via cut) is
sufficient on its own.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 18, 2026

Workflow [PR], commit [e646986]


@clickhouse-gh clickhouse-gh bot added the pr-critical-bugfix and pr-must-backport (Pull request should be backported intentionally. Use this label with great care!) labels Feb 18, 2026
-- a Tuple together with the full Tuple via PREWHERE used to cause a LOGICAL_ERROR
-- because the cached accumulated ColumnString broke ColumnSparse invariants.

DROP TABLE IF EXISTS t_sparse_string_size;
Member Author
Yes, it reproduces the issue.

/// If we cache the full accumulated column with num_read_rows < column->size(),
/// insertDataFromCachedColumn will see the size mismatch and replace the result
/// column entirely (e.g. ColumnSparse's values), breaking invariants.
auto column_for_cache = string_state.column->cut(prev_size, num_read_rows);
Member Author

Does it affect performance?

The test used `ORDER BY ()` in the table and no `ORDER BY` in the
SELECT query. With randomized insert settings, the INSERT may produce
multiple parts, and `OPTIMIZE TABLE FINAL` merges them with undefined
row order (empty sorting key). Adding `ORDER BY id` to the SELECT
makes the output deterministic regardless of storage order.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alexey-milovidov
Member Author

Re: performance question on cut() at line 135 — the cut() only runs when need_string_data is true, which happens when both the .size subcolumn and the full string column are read in the same query (e.g., SELECT t.a.size ... WHERE toString(t)). In the common case — reading only length(s) / empty(s) / .size without the full string — the sizes-only path (deserializeWithoutStringData) is taken and there is no cut(). The cost is proportional to index_granularity rows per mark, not accumulated data.

@alexey-milovidov alexey-milovidov added the pr-performance (Pull request with some performance improvements) label Feb 19, 2026
@alexey-milovidov alexey-milovidov self-assigned this Feb 19, 2026
@alexey-milovidov alexey-milovidov merged commit 0d3b93d into master Feb 19, 2026
146 of 147 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-sparse-string-size-subcolumn-prewhere branch February 19, 2026 23:24
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud (The PR is synced to the cloud repo) and pr-must-backport-synced (The `*-must-backport` labels are synced into the cloud Sync PR) labels Feb 19, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Feb 20, 2026
Cherry pick #97264 to 25.11: Fix exception when reading .size subcolumn of sparse Nullable(String) with PREWHERE
robot-clickhouse added a commit that referenced this pull request Feb 20, 2026
…of sparse Nullable(String) with PREWHERE
robot-ch-test-poll2 added a commit that referenced this pull request Feb 20, 2026
Cherry pick #97264 to 25.12: Fix exception when reading .size subcolumn of sparse Nullable(String) with PREWHERE
robot-clickhouse added a commit that referenced this pull request Feb 20, 2026
…of sparse Nullable(String) with PREWHERE
robot-ch-test-poll2 added a commit that referenced this pull request Feb 20, 2026
Cherry pick #97264 to 26.1: Fix exception when reading .size subcolumn of sparse Nullable(String) with PREWHERE
robot-clickhouse added a commit that referenced this pull request Feb 20, 2026
clickhouse-gh bot added a commit that referenced this pull request Feb 20, 2026
Backport #97264 to 26.1: Fix exception when reading .size subcolumn of sparse Nullable(String) with PREWHERE

-- The bug manifested when reading t.a.size and t in the same readRows call
-- across multiple granules with PREWHERE.
SELECT t.a.size, id FROM t_sparse_string_size PREWHERE id % 11 = 0 WHERE toString(t) != '' ORDER BY id LIMIT 3;
Member

It doesn't reproduce the issue; you need to remove the LIMIT 3 to reproduce it.

@Avogar
Member

Avogar commented Feb 20, 2026

It's a bad fix. The problem is not in SerializationStringSize; it is in SerializationSparse. Before deserializing sparse elements we need to set settings.insert_only_rows_in_current_range_from_substreams_cache to true. The issue happens because the values in ColumnSparse always have one additional row with the default value at the beginning, so we cannot use the column from the substreams cache as-is: it doesn't contain that row.

I will revert this PR, close all the backports, and create a proper fix. @alexey-milovidov, please wait for review before merging such fixes.

@Avogar
Member

Avogar commented Feb 20, 2026

Better fix: #97515

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created (Backport PRs are successfully created, it won't be processed by CI script anymore) label Feb 20, 2026