Revert "Revert "Parallelize reads from a single Parquet file in StorageFile"" by alexey-milovidov · Pull Request #104431 · ClickHouse/ClickHouse

alexey-milovidov · 2026-05-08T22:04:53Z

…geFile""

clickhouse-gh · 2026-05-08T22:05:30Z

Workflow [PR], commit [fe78f35]

Summary: ✅

AI Review

Summary

This revert restores single-file Parquet parallel read splitting in StorageFile and adds safeguards for parallelize_output_from_storages, bucketed-count correctness, and Parquet metadata reuse. I reviewed the current diff and prior discussion threads; previously raised bot concerns are addressed in the current code and tests, and I did not find unresolved correctness, safety, or contract issues that require new review comments.

Final Verdict

Status: ✅ Approve

…t-104251-parquet-single-file-parallelism

alexey-milovidov · 2026-05-10T10:32:23Z

Good: the newly added performance tests prevent this change:

…t-104251-parquet-single-file-parallelism

…-file-parallelism

When a single Parquet file is split into multiple bucketed sources by `StorageFile` (the path re-introduced in this PR), the file-level count cache must be bypassed: it is keyed by file path, so consulting or writing it from the bucketed read path would have every source report the file's full row count and multiply the result by the number of buckets. Addresses review feedback on PR #104431 asking for explicit regression coverage of this invariant.

Three gates on the bucketed single-file read path in `StorageFile`, addressing the `clickbench_parquet_short` regression observed on ARM in the CI of #104431 (#104431) and an open review comment. 1. `parallelize_output_from_storages = 0` now disables the split. The per-bucket sources are exactly the kind of read parallelism the setting's contract is about, but the existing check fired only after the sources had been created. Review feedback from #104431 (comment). 2. `need_only_count` queries skip the split. They consult only the file's metadata, so splitting them across N sources just multiplies the metadata-parse cost N-fold without any read-side benefit. This was the largest single contributor to the `Q1` (`SELECT COUNT(*)`) regression in `clickbench_parquet_short`. 3. The Parquet splitter (`ParquetBucketSplitter::splitToBucketsByCount`) now requires each chunk to cover at least 8 row groups. For a file with a small number of row groups, parallelising across all available threads multiplies per-bucket metadata-parse and prefetcher-setup overhead without giving each source enough work to amortise it. Large files (many row groups) still get full parallelism. Updates `02725_parquet_preserve_order.reference`: the 2-row-group test file falls below the new row-group floor so the pipeline goes back to a single `File` source followed by `Resize 1 → 2`, matching pre-#104251 behaviour. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=78ecefef098eedd55ab0a0ce350082364c8c23be&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29

alexey-milovidov · 2026-05-13T12:19:26Z

Not needed anymore: https://benchmark.clickhouse.com/#system=+curp|H(u,s&type=-&machine=-ca4e|6t|ca2|ae-|g4e|6ax|6ale|3al&cluster_size=-&opensource=+s&hardware=+c&tuned=+n&metric=combined&queries=-

alexey-milovidov · 2026-05-13T12:20:07Z

However, there is still a chance to speed it up: https://benchmark.clickhouse.com/#system=+curp|H(u,s&type=-&machine=-ca4e|6t|ca2|ae-|g4e|6ax|6ale|3al&cluster_size=-&opensource=+s&hardware=+c&tuned=+n&metric=hot&queries=-

alexey-milovidov · 2026-05-14T16:43:43Z

From now on, we will be interested in accelerating those few queries that show the difference between the "single" and "partitioned" variants. If we can make it without degradations of other queries, we can merge this PR.

… single-file split This test pins the contract requested in #104431 (comment): when `parallelize_output_from_storages = 0` is set, the single-file Parquet split path in `StorageFile` must not fan out into multiple per-bucket sources, even if `max_threads > 1` and the file has enough row groups to otherwise be split. The existing `04230_parquet_single_file_parallel_count.sql` only exercises `parallelize_output_from_storages = 1`. The previous regression was specifically that `parallelize_output_from_storages = 0` was ignored for this branch, so a dedicated test is needed to prevent silent regressions. The test creates a 40-row-group Parquet file (well above the 8-row-groups-per-chunk floor in `ParquetBucketSplitter`) and uses `EXPLAIN PIPELINE` to assert that with `parallelize_output_from_storages = 0` there is no `File ×` multiplier and no `Resize` step, and with `parallelize_output_from_storages = 1` the file IS split (the pipeline contains `File × N`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t-104251-parquet-single-file-parallelism

alexey-milovidov · 2026-05-15T08:11:25Z

@groeneai, three stateless amd_msan, WasmEdge failures here are the standard DB::SystemLogQueue<DB::QueryLogElement>::waitFlush Timeout exceeded (180 s) pattern:

01939_network_receive_bytes_metrics
02703_max_local_read_bandwidth
03753_join_runtime_filter_dynamically_disable (tracked in #104925)

Same exception across all three: Timeout exceeded (180 s) while flushing system log 'DB::SystemLogQueue<DB::QueryLogElement>'. CIDB shows 12 hits on 2026-05-14 and 14 on 2026-05-13 across many unrelated PRs (and master), so this is not related to the parallel Parquet read path in this PR. Please investigate the underlying SystemLog flush hang and provide a fix in a separate PR.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=6956a6a1a05203b43c57d3e3fc259668147ea190&name_0=PR&name_1=Stateless%20tests%20%28amd_msan%2C%20WasmEdge%2C%20parallel%2C%201%2F2%29

alexey-milovidov · 2026-05-15T08:11:31Z

@groeneai, the Install packages (amd_release) and Install packages (arm_release) failures are an infrastructure issue: yum localinstall is being given two RPMs of the same package at different version strings and refusing to install both:

Problem 1: cannot install both clickhouse-server-26.5.1.651-1.x86_64 and clickhouse-server-26.5.1.594-1.x86_64

The exact same failure mode hit PR #104445 on the same day with a different version pair (26.5.1.465 vs 26.5.1.641), so this is the install-test script picking up RPMs from two build jobs that produced different .594 / .651 (or .465 / .641) version strings. Not related to the Parquet changes in this PR. Please investigate and fix in a separate PR.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=6956a6a1a05203b43c57d3e3fc259668147ea190&name_0=PR&name_1=Install%20packages%20%28amd_release%29

`clickbench_parquet_short` on the synthetic 20-row-group file still regressed 18-37% on `arm_release` even after the `8-row-groups-per-chunk` floor in `ParquetBucketSplitter`: 20/8 = 2 chunks, and each extra bucket added ~1-3 ms of per-source setup (metadata parse, prefetcher init, schema conversion, hyperrectangle build, bloom-filter init) on top of the ~5 ms baseline. For "short" queries on small files the parallelism win does not pay for that fixed cost. Bump the floor to 16. The synthetic test now reads with a single source (20/16 = 1 chunk), while `04238_parquet_single_file_parallelize_output_off` still exercises a 2-chunk split (40/16 = 2) and a real `hits.parquet` (hundreds of row groups) still fans out up to `max_threads`. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=5f02d3d3200c9d6572d8848eb639c034a0fc0c62&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29 PR: #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

groeneai · 2026-05-15T22:18:20Z

@alexey-milovidov — confirmed, all three failures (01939_network_receive_bytes_metrics, 02703_max_local_read_bandwidth, 03753_join_runtime_filter_dynamically_disable) are the chronic DB::SystemLogQueue<DB::QueryLogElement>::waitFlush 180s family unrelated to the parallel Parquet read path in this PR.

Status: root cause identified, short-term fix already merged

@Algunenano root-caused this to the recent x86-64-v3 ISA baseline flip (Change default x86 build target from x86-64-v2 (SSE4.2) to x86-64-v3 (AVX2) #90043). On MSan builds with AVX2 enabled, LLVM keeps YMM registers live across hot loops; X86VZeroUpperInserter then emits a vzeroupper before every external call from such a block. With -fsanitize-memory-track-origins=2, __msan_chain_origin is called at every uninit-value propagation step (565,983 such guards in the rebuilt binary), each now paying a vzeroupper. That added enough hot-path overhead to push long tests over the 180s SYSTEM FLUSH LOGS wait. ASan/TSan/UBSan/Debug unaffected.
Fix merged 2026-05-15T08:57:18Z: PR Revert "Change default x86 build target from x86-64-v2 (SSE4.2) to x86-64-v3 (AVX2)" #104971 "Revert change default x86 build target from x86-64-v2 to x86-64-v3" (Algunenano) reverts the v3 baseline as the short-term remediation.
Algunenano's narrower track-origins=1 attempt (PR Drop MSan track-origins to 1 on x86-64-v3 to fix CI timeouts #104947) covered most of the slowdown but log flushing was still slower; he closed it in favor of PR Revert "Revert "Change default x86 build target from x86-64-v2 (SSE4.2) to x86-64-v3 (AVX2)"" #105019 (re-enable v3 after a proper fix lands).
Umbrella tracker Flaky test: 04039_merge_tree_snapshot_teardown_race #104866 (04039_merge_tree_snapshot_teardown_race) + sub-issue Flaky test: 03753_join_runtime_filter_dynamically_disable #104925 (03753_join_runtime_filter_dynamically_disable) both closed at 2026-05-15T08:57Z when Revert "Change default x86 build target from x86-64-v2 (SSE4.2) to x86-64-v3 (AVX2)" #104971 merged.

CIDB verification — WasmEdge MSan SystemLog flush timeouts on master, hourly (UTC):

hour	master	PR
2026-05-15 07:00	8	9
2026-05-15 08:00	0	9
2026-05-15 09:00	0	9
2026-05-15 10:00	0	1
2026-05-15 13:00	0	2
2026-05-15 14:00	0	1

Master goes to zero immediately after #104971 merges at 08:57Z. PR branches predate the fix and still hit it (expected — branches don't auto-rebase). Failure family count over 14 days: 14 distinct test names, ~720 hits across 200+ PRs — the 3 hits on this PR are part of that.

For this PR: the head commit 6956a6a1a05 is from 2026-05-13 (two days before the fix landed). A rebase onto current master + re-run should clear the 3 amd_msan, WasmEdge failures. The Install packages (amd_release/arm_release) RPM-version-mismatch is a separate infrastructure issue and is tracked in our internal queue.

Let me know if you'd like me to file a follow-up issue once Algunenano re-enables v3 in #105019 (to track whether the proper fix holds up), or treat the family as resolved once the master baseline stays clean for 48-72h.

— @groeneai (session cron:clickhouse-ci-task-worker:20260515-221500)

After raising the per-chunk row-group floor to 16 in `ParquetBucketSplitter`, the test file (`numbers(1000)` at row-group size 50 = 20 row groups) stayed below the 2-chunk threshold (`32` row groups), so the bucketed read path was no longer exercised — the test was no longer a regression guard for the count-cache-on-bucketed-source bug it was added to pin. Bump the file to 64 row groups (`numbers(3200)`) so the splitter produces 4 chunks and the bucketed branch is actually hit. The asserted counts are updated to match. This addresses the open review comment on #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ache The single-file Parquet split path in `StorageFile` parses the file's metadata once in `ParquetBucketSplitter::splitToBucketsByCount` and then the per-bucket `ParquetV3BlockInputFormat` sources each parse it again in `getFileMetadata`. The splitter's parse went through Arrow's `parquet::ReadMetaData` (uncacheable) while the source's parse went through `Parquet::Reader::readFileMetaData` keyed in `ParquetMetadataCache` by `(file_path, etag)` — so the two paths could not share work. For "short" queries on a single Parquet file (`clickbench_parquet_short` on the 828 MB synthetic ARM run), the extra footer parse cost 1-1.5 ms on top of a 5-8 ms baseline — 12-30 % of the runtime — and remained after raising the per-chunk row-group floor to 16 because the splitter still parsed metadata before deciding not to split. Route the splitter through the same cache: * `ParquetBucketSplitter::splitToBucketsByCount` now uses `Parquet::Reader::readFileMetaData` via a `Parquet::Prefetcher` (the same code path `NativeParquetSchemaReader::initializeIfNeeded` uses), producing a `parquet::format::FileMetaData` storable in `ParquetMetadataCache`. * New `splitParquetFileWithCache` helper accepts a cache pointer and the same `(file_path, etag)` key the source builds (sub-second mtime + inode + size), populating the cache on miss and reusing the parse on hit. * `ReadFromFile::initializePipeline` calls the cache-aware helper for Parquet, so the per-bucket sources hit the cache and skip the footer parse. The single-bucket case (no fan-out) benefits too: the surviving single source still hits the cache. Net effect: cold cache → 1 parse total (same as master); warm cache → 0 parses on both paths. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29 PR: #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexey-milovidov · 2026-05-17T00:15:40Z

@groeneai, two CI failures in this PR look unrelated to the Parquet changes — please verify and link any in-progress fix here, otherwise file as separate issues.

Integration tests (amd_msan, 3/6) → test_keeper_force_recovery/test.py::test_cluster_recovery — same 'This instance is not currently serving requests' mntr-output assertion that closed #78474 tracked. This PR touches StorageFile/Parquet only; Keeper is not in the changed paths.
Unit tests (asan_ubsan, function_prop_fuzzer) → FunctionsStress.stress and AllTests — tracked at #104877. The failing example is indexOf determinism, unrelated to Parquet.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR

groeneai · 2026-05-17T01:03:29Z

@alexey-milovidov — confirmed, both failures are unrelated to the Parquet changes in this PR. Cross-linking:

1. test_keeper_force_recovery/test.py::test_cluster_recovery — rare residual flake of closed #78474

The assertion shape ('This instance is not currently serving requests' vs the zk_version ... mntr response, gap between start_cluster()/leader handover and the test issuing mntr) matches the pattern that closed #78474 exactly.

CIDB (30-day): only 4 hits across 4 distinct PRs, 0 on master. No in-progress fix PR. Failures are sparse and re-distributed across unrelated PRs, so the underlying race is still present at very low rate (~0.1%) even after #78474's cleanup. Suggest reopening #78474 (or filing a follow-up) rather than a fresh issue — I can do that next worker run.

SELECT toStartOfDay(check_start_time) AS day, count() AS hits, count(DISTINCT pull_request_number) AS prs
FROM default.checks
WHERE check_start_time > now() - INTERVAL 30 DAY
  AND test_name LIKE '%test_keeper_force_recovery%cluster_recovery%'
  AND test_status IN ('FAIL','ERROR')
GROUP BY day ORDER BY day DESC
-- 2026-05-15: 1 hit (this PR)   2026-05-11: 1 (different shape)   2026-04-19: 2 hits

2. FunctionsStress.stress / AllTests — indexOf non-determinism (tracked in #104877 family)

The failing shape on 3c70a3490a is indexOf(materialize(CAST([... null-byte-padded strings ...] AS Array(...))), <IPv6-like String>) returning 1 vs 0 between two executions of the same query when arg {1} is made const. This is the same String↔FixedString / null-padding non-determinism root pattern as the rest of the #104877 chronic family.

indexOfAssumeSorted is already explicitly tracked under #104877 (PR #103786 sighting 2026-05-14). Pure indexOf is a new shape variant within the same family — currently 9 hits across 9 distinct PRs in 7 days (PRs #103545, #103786, #104431, #104493, #104705, #104751, #105041, #100377, #96886). I'll add indexOf to the family-shape tracker.

Fixes in flight covering the family:

#104858 (@Algunenano, OPEN) — stringVectorConstantFixedString in FunctionsComparison.h + castForIf (covers isDistinctFrom/isNotDistinctFrom/if/nullIf String↔FixedString shapes)
#104804 (groeneai, OPEN) — cache->default_column wrapper-layer castColumn fix (covers caseWithExpression/transform)

indexOf and indexOfAssumeSorted are array-functions, not the binary-comparison family — likely a sibling fix needed for the array-position lookup path (binary search dispatch and/or the constant-vs-materialize argument promotion). I'll investigate the indexOf/indexOfAssumeSorted array-function path in a separate PR. Tracking under the #104877 umbrella.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR

alexey-milovidov · 2026-05-17T23:20:13Z

Still degraded:

The single-file Parquet split path in `ReadFromFile::initializePipeline` opens the file and runs `Parquet::Prefetcher::init` on every query, even when `splitParquetFileWithCache` ends up doing nothing but a hash-table lookup in `ParquetMetadataCache`. That fixed cost is ~0.3 ms — small absolute, but visibly slows "short" Parquet queries: on the `clickbench_parquet_short` ARM run for #104431, six of the seven queries were still 5-12 % slower than master after the row-group-floor and metadata-cache-sharing commits, with `#5` (`SELECT UserID ... WHERE UserID = ...`) over its +5.1 % threshold. Add a warm-cache fast path, `trySplitParquetFileFromCacheOnly`, that peeks at `ParquetMetadataCache` via `CacheBase::get` (no I/O, no prefetcher, no `FormatSettings` construction) and returns the bucket layout when the `(file_path, etag)` entry is already present. The caller in `StorageFile.cpp` runs that first; only on a real cache miss does it fall back to `createReadBuffer` + `splitParquetFileWithCache` and parse the footer. The first query against a file is unchanged; every subsequent query against the same file skips the redundant open and prefetcher init. Verified by `clickhouse-local --send_logs_level=trace` against a 1000- row-group file: the first SELECT logs one `cache miss` + N `cache hit` (one per per-bucket source), and the second / third SELECTs log only the N per-source `cache hit`s — the splitter no longer participates in the cache hit/miss path on the warm-cache common case. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=e7d8afc6a61112cc487394334e6b20e0d95ba75a&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29 PR: #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t-104251-parquet-single-file-parallelism

…t-104251-parquet-single-file-parallelism # Conflicts: # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h

clickhouse-gh · 2026-05-20T21:24:21Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.10%	84.10%	+0.00%
Functions	91.40%	91.40%	+0.00%
Branches	76.50%	76.50%	+0.00%

Changed lines: 88.69% (149/168) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

Revert "Revert "Parallelize reads from a single Parquet file in Stora…

b7c0c71

…geFile""

clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label May 8, 2026

alexey-milovidov added 2 commits May 9, 2026 22:12

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

44be53f

…t-104251-parquet-single-file-parallelism

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

b7848f9

…t-104251-parquet-single-file-parallelism

alexey-milovidov added 2 commits May 11, 2026 06:58

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

6b7a884

…t-104251-parquet-single-file-parallelism

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

a70704c

…t-104251-parquet-single-file-parallelism

clickhouse-gh Bot reviewed May 11, 2026

View reviewed changes

Comment thread src/Storages/StorageFile.cpp

alexey-milovidov added 2 commits May 12, 2026 18:47

Merge branch 'master' into revert-104359-revert-104251-parquet-single…

4899196

…-file-parallelism

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Storages/StorageFile.cpp

alexey-milovidov closed this May 13, 2026

alexey-milovidov reopened this May 14, 2026

clickhouse-gh Bot reviewed May 14, 2026

View reviewed changes

Comment thread src/Storages/StorageFile.cpp

alexey-milovidov and others added 2 commits May 15, 2026 08:05

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

5f02d3d

…t-104251-parquet-single-file-parallelism

alexey-milovidov and others added 2 commits May 17, 2026 00:14

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

alexey-milovidov and others added 3 commits May 18, 2026 13:27

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

3e89d1b

…t-104251-parquet-single-file-parallelism

Merge remote-tracking branch 'origin/master' into revert-104359-rever…

fe78f35

…t-104251-parquet-single-file-parallelism # Conflicts: # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h

Conversation

alexey-milovidov commented May 8, 2026

Uh oh!

clickhouse-gh Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

alexey-milovidov commented May 10, 2026

Uh oh!

Uh oh!

Uh oh!

alexey-milovidov commented May 13, 2026

Uh oh!

alexey-milovidov commented May 13, 2026

Uh oh!

alexey-milovidov commented May 14, 2026

Uh oh!

Uh oh!

alexey-milovidov commented May 15, 2026

Uh oh!

alexey-milovidov commented May 15, 2026

Uh oh!

groeneai commented May 15, 2026

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

groeneai commented May 17, 2026

Uh oh!

alexey-milovidov commented May 17, 2026

Uh oh!

clickhouse-gh Bot commented May 20, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

clickhouse-gh Bot commented May 8, 2026 •

edited

Loading