Skip to content

Revert "Revert "Parallelize reads from a single Parquet file in StorageFile""#104431

Draft
alexey-milovidov wants to merge 16 commits into
masterfrom
revert-104359-revert-104251-parquet-single-file-parallelism
Draft

Revert "Revert "Parallelize reads from a single Parquet file in StorageFile""#104431
alexey-milovidov wants to merge 16 commits into
masterfrom
revert-104359-revert-104251-parquet-single-file-parallelism

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

Reverts #104359

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 8, 2026

Workflow [PR], commit [fe78f35]

Summary:


AI Review

Summary

This revert restores single-file Parquet parallel read splitting in StorageFile and adds safeguards for parallelize_output_from_storages, bucketed-count correctness, and Parquet metadata reuse. I reviewed the current diff and prior discussion threads; previously raised bot concerns are addressed in the current code and tests, and I did not find unresolved correctness, safety, or contract issues that require new review comments.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label May 8, 2026
@alexey-milovidov
Copy link
Copy Markdown
Member Author

Good: the newly added performance tests prevent this change:

Screenshot_20260510_123149

Comment thread src/Storages/StorageFile.cpp
When a single Parquet file is split into multiple bucketed sources by
`StorageFile` (the path re-introduced in this PR), the file-level count
cache must be bypassed: it is keyed by file path, so consulting or
writing it from the bucketed read path would have every source report
the file's full row count and multiply the result by the number of
buckets.

Addresses review feedback on PR #104431 asking for explicit regression
coverage of this invariant.
Comment thread src/Storages/StorageFile.cpp
Three gates on the bucketed single-file read path in `StorageFile`,
addressing the `clickbench_parquet_short` regression observed on ARM in
the CI of #104431 (#104431)
and an open review comment.

1. `parallelize_output_from_storages = 0` now disables the split. The
   per-bucket sources are exactly the kind of read parallelism the
   setting's contract is about, but the existing check fired only after
   the sources had been created. Review feedback from
   #104431 (comment).

2. `need_only_count` queries skip the split. They consult only the
   file's metadata, so splitting them across N sources just multiplies
   the metadata-parse cost N-fold without any read-side benefit. This
   was the largest single contributor to the `Q1` (`SELECT COUNT(*)`)
   regression in `clickbench_parquet_short`.

3. The Parquet splitter (`ParquetBucketSplitter::splitToBucketsByCount`)
   now requires each chunk to cover at least 8 row groups. For a file
   with a small number of row groups, parallelising across all available
   threads multiplies per-bucket metadata-parse and prefetcher-setup
   overhead without giving each source enough work to amortise it. Large
   files (many row groups) still get full parallelism.

Updates `02725_parquet_preserve_order.reference`: the 2-row-group test
file falls below the new row-group floor so the pipeline goes back to a
single `File` source followed by `Resize 1 → 2`, matching pre-#104251
behaviour.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=78ecefef098eedd55ab0a0ce350082364c8c23be&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29
@alexey-milovidov
Copy link
Copy Markdown
Member Author

@alexey-milovidov
Copy link
Copy Markdown
Member Author

@alexey-milovidov
Copy link
Copy Markdown
Member Author

From now on, we will be interested in accelerating those few queries that show the difference between the "single" and "partitioned" variants. If we can make it without degradations of other queries, we can merge this PR.

Comment thread src/Storages/StorageFile.cpp
alexey-milovidov and others added 2 commits May 15, 2026 08:05
… single-file split

This test pins the contract requested in
#104431 (comment): when
`parallelize_output_from_storages = 0` is set, the single-file Parquet
split path in `StorageFile` must not fan out into multiple per-bucket
sources, even if `max_threads > 1` and the file has enough row groups
to otherwise be split.

The existing `04230_parquet_single_file_parallel_count.sql` only
exercises `parallelize_output_from_storages = 1`. The previous
regression was specifically that `parallelize_output_from_storages = 0`
was ignored for this branch, so a dedicated test is needed to prevent
silent regressions.

The test creates a 40-row-group Parquet file (well above the
8-row-groups-per-chunk floor in `ParquetBucketSplitter`) and uses
`EXPLAIN PIPELINE` to assert that with `parallelize_output_from_storages
= 0` there is no `File ×` multiplier and no `Resize` step, and with
`parallelize_output_from_storages = 1` the file IS split (the pipeline
contains `File × N`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov
Copy link
Copy Markdown
Member Author

@groeneai, three stateless amd_msan, WasmEdge failures here are the standard DB::SystemLogQueue<DB::QueryLogElement>::waitFlush Timeout exceeded (180 s) pattern:

  • 01939_network_receive_bytes_metrics
  • 02703_max_local_read_bandwidth
  • 03753_join_runtime_filter_dynamically_disable (tracked in #104925)

Same exception across all three: Timeout exceeded (180 s) while flushing system log 'DB::SystemLogQueue<DB::QueryLogElement>'. CIDB shows 12 hits on 2026-05-14 and 14 on 2026-05-13 across many unrelated PRs (and master), so this is not related to the parallel Parquet read path in this PR. Please investigate the underlying SystemLog flush hang and provide a fix in a separate PR.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=6956a6a1a05203b43c57d3e3fc259668147ea190&name_0=PR&name_1=Stateless%20tests%20%28amd_msan%2C%20WasmEdge%2C%20parallel%2C%201%2F2%29

@alexey-milovidov
Copy link
Copy Markdown
Member Author

@groeneai, the Install packages (amd_release) and Install packages (arm_release) failures are an infrastructure issue: yum localinstall is being given two RPMs of the same package at different version strings and refusing to install both:

Problem 1: cannot install both clickhouse-server-26.5.1.651-1.x86_64 and clickhouse-server-26.5.1.594-1.x86_64

The exact same failure mode hit PR #104445 on the same day with a different version pair (26.5.1.465 vs 26.5.1.641), so this is the install-test script picking up RPMs from two build jobs that produced different .594 / .651 (or .465 / .641) version strings. Not related to the Parquet changes in this PR. Please investigate and fix in a separate PR.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=6956a6a1a05203b43c57d3e3fc259668147ea190&name_0=PR&name_1=Install%20packages%20%28amd_release%29

`clickbench_parquet_short` on the synthetic 20-row-group file still
regressed 18-37% on `arm_release` even after the
`8-row-groups-per-chunk` floor in `ParquetBucketSplitter`: 20/8 = 2
chunks, and each extra bucket added ~1-3 ms of per-source setup
(metadata parse, prefetcher init, schema conversion, hyperrectangle
build, bloom-filter init) on top of the ~5 ms baseline. For "short"
queries on small files the parallelism win does not pay for that
fixed cost.

Bump the floor to 16. The synthetic test now reads with a single
source (20/16 = 1 chunk), while `04238_parquet_single_file_parallelize_output_off`
still exercises a 2-chunk split (40/16 = 2) and a real `hits.parquet`
(hundreds of row groups) still fans out up to `max_threads`.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=5f02d3d3200c9d6572d8848eb639c034a0fc0c62&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29
PR: #104431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@groeneai
Copy link
Copy Markdown
Contributor

@alexey-milovidov — confirmed, all three failures (01939_network_receive_bytes_metrics, 02703_max_local_read_bandwidth, 03753_join_runtime_filter_dynamically_disable) are the chronic DB::SystemLogQueue<DB::QueryLogElement>::waitFlush 180s family unrelated to the parallel Parquet read path in this PR.

Status: root cause identified, short-term fix already merged

CIDB verification — WasmEdge MSan SystemLog flush timeouts on master, hourly (UTC):

hour master PR
2026-05-15 07:00 8 9
2026-05-15 08:00 0 9
2026-05-15 09:00 0 9
2026-05-15 10:00 0 1
2026-05-15 13:00 0 2
2026-05-15 14:00 0 1

Master goes to zero immediately after #104971 merges at 08:57Z. PR branches predate the fix and still hit it (expected — branches don't auto-rebase). Failure family count over 14 days: 14 distinct test names, ~720 hits across 200+ PRs — the 3 hits on this PR are part of that.

For this PR: the head commit 6956a6a1a05 is from 2026-05-13 (two days before the fix landed). A rebase onto current master + re-run should clear the 3 amd_msan, WasmEdge failures. The Install packages (amd_release/arm_release) RPM-version-mismatch is a separate infrastructure issue and is tracked in our internal queue.

Let me know if you'd like me to file a follow-up issue once Algunenano re-enables v3 in #105019 (to track whether the proper fix holds up), or treat the family as resolved once the master baseline stays clean for 48-72h.

@groeneai (session cron:clickhouse-ci-task-worker:20260515-221500)

alexey-milovidov and others added 2 commits May 17, 2026 00:14
After raising the per-chunk row-group floor to 16 in `ParquetBucketSplitter`,
the test file (`numbers(1000)` at row-group size 50 = 20 row groups)
stayed below the 2-chunk threshold (`32` row groups), so the bucketed
read path was no longer exercised — the test was no longer a regression
guard for the count-cache-on-bucketed-source bug it was added to pin.

Bump the file to 64 row groups (`numbers(3200)`) so the splitter
produces 4 chunks and the bucketed branch is actually hit. The asserted
counts are updated to match.

This addresses the open review comment on
#104431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache

The single-file Parquet split path in `StorageFile` parses the file's
metadata once in `ParquetBucketSplitter::splitToBucketsByCount` and then
the per-bucket `ParquetV3BlockInputFormat` sources each parse it again
in `getFileMetadata`. The splitter's parse went through Arrow's
`parquet::ReadMetaData` (uncacheable) while the source's parse went
through `Parquet::Reader::readFileMetaData` keyed in
`ParquetMetadataCache` by `(file_path, etag)` — so the two paths could
not share work. For "short" queries on a single Parquet file
(`clickbench_parquet_short` on the 828 MB synthetic ARM run), the
extra footer parse cost 1-1.5 ms on top of a 5-8 ms baseline — 12-30 %
of the runtime — and remained after raising the per-chunk row-group
floor to 16 because the splitter still parsed metadata before deciding
not to split.

Route the splitter through the same cache:

* `ParquetBucketSplitter::splitToBucketsByCount` now uses
  `Parquet::Reader::readFileMetaData` via a `Parquet::Prefetcher` (the
  same code path `NativeParquetSchemaReader::initializeIfNeeded` uses),
  producing a `parquet::format::FileMetaData` storable in
  `ParquetMetadataCache`.

* New `splitParquetFileWithCache` helper accepts a cache pointer and
  the same `(file_path, etag)` key the source builds (sub-second mtime
  + inode + size), populating the cache on miss and reusing the parse
  on hit.

* `ReadFromFile::initializePipeline` calls the cache-aware helper for
  Parquet, so the per-bucket sources hit the cache and skip the footer
  parse. The single-bucket case (no fan-out) benefits too: the
  surviving single source still hits the cache.

Net effect: cold cache → 1 parse total (same as master); warm cache →
0 parses on both paths.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29
PR: #104431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov
Copy link
Copy Markdown
Member Author

@groeneai, two CI failures in this PR look unrelated to the Parquet changes — please verify and link any in-progress fix here, otherwise file as separate issues.

  1. Integration tests (amd_msan, 3/6)test_keeper_force_recovery/test.py::test_cluster_recovery — same 'This instance is not currently serving requests' mntr-output assertion that closed #78474 tracked. This PR touches StorageFile/Parquet only; Keeper is not in the changed paths.
  2. Unit tests (asan_ubsan, function_prop_fuzzer)FunctionsStress.stress and AllTests — tracked at #104877. The failing example is indexOf determinism, unrelated to Parquet.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR

@groeneai
Copy link
Copy Markdown
Contributor

@alexey-milovidov — confirmed, both failures are unrelated to the Parquet changes in this PR. Cross-linking:

1. test_keeper_force_recovery/test.py::test_cluster_recovery — rare residual flake of closed #78474

The assertion shape ('This instance is not currently serving requests' vs the zk_version ... mntr response, gap between start_cluster()/leader handover and the test issuing mntr) matches the pattern that closed #78474 exactly.

CIDB (30-day): only 4 hits across 4 distinct PRs, 0 on master. No in-progress fix PR. Failures are sparse and re-distributed across unrelated PRs, so the underlying race is still present at very low rate (~0.1%) even after #78474's cleanup. Suggest reopening #78474 (or filing a follow-up) rather than a fresh issue — I can do that next worker run.

SELECT toStartOfDay(check_start_time) AS day, count() AS hits, count(DISTINCT pull_request_number) AS prs
FROM default.checks
WHERE check_start_time > now() - INTERVAL 30 DAY
  AND test_name LIKE '%test_keeper_force_recovery%cluster_recovery%'
  AND test_status IN ('FAIL','ERROR')
GROUP BY day ORDER BY day DESC
-- 2026-05-15: 1 hit (this PR)   2026-05-11: 1 (different shape)   2026-04-19: 2 hits

2. FunctionsStress.stress / AllTestsindexOf non-determinism (tracked in #104877 family)

The failing shape on 3c70a3490a is indexOf(materialize(CAST([... null-byte-padded strings ...] AS Array(...))), <IPv6-like String>) returning 1 vs 0 between two executions of the same query when arg {1} is made const. This is the same String↔FixedString / null-padding non-determinism root pattern as the rest of the #104877 chronic family.

indexOfAssumeSorted is already explicitly tracked under #104877 (PR #103786 sighting 2026-05-14). Pure indexOf is a new shape variant within the same family — currently 9 hits across 9 distinct PRs in 7 days (PRs #103545, #103786, #104431, #104493, #104705, #104751, #105041, #100377, #96886). I'll add indexOf to the family-shape tracker.

Fixes in flight covering the family:

  • #104858 (@Algunenano, OPEN) — stringVectorConstantFixedString in FunctionsComparison.h + castForIf (covers isDistinctFrom/isNotDistinctFrom/if/nullIf String↔FixedString shapes)
  • #104804 (groeneai, OPEN) — cache->default_column wrapper-layer castColumn fix (covers caseWithExpression/transform)

indexOf and indexOfAssumeSorted are array-functions, not the binary-comparison family — likely a sibling fix needed for the array-position lookup path (binary search dispatch and/or the constant-vs-materialize argument promotion). I'll investigate the indexOf/indexOfAssumeSorted array-function path in a separate PR. Tracking under the #104877 umbrella.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR

@alexey-milovidov
Copy link
Copy Markdown
Member Author

Still degraded:

Screenshot_20260518_011943

alexey-milovidov and others added 3 commits May 18, 2026 13:27
The single-file Parquet split path in `ReadFromFile::initializePipeline`
opens the file and runs `Parquet::Prefetcher::init` on every query, even
when `splitParquetFileWithCache` ends up doing nothing but a hash-table
lookup in `ParquetMetadataCache`. That fixed cost is ~0.3 ms — small
absolute, but visibly slows "short" Parquet queries: on the
`clickbench_parquet_short` ARM run for #104431, six of the seven queries
were still 5-12 % slower than master after the row-group-floor and
metadata-cache-sharing commits, with `#5` (`SELECT UserID ... WHERE
UserID = ...`) over its +5.1 % threshold.

Add a warm-cache fast path, `trySplitParquetFileFromCacheOnly`, that
peeks at `ParquetMetadataCache` via `CacheBase::get` (no I/O, no
prefetcher, no `FormatSettings` construction) and returns the bucket
layout when the `(file_path, etag)` entry is already present. The caller
in `StorageFile.cpp` runs that first; only on a real cache miss does it
fall back to `createReadBuffer` + `splitParquetFileWithCache` and parse
the footer. The first query against a file is unchanged; every
subsequent query against the same file skips the redundant open and
prefetcher init.

Verified by `clickhouse-local --send_logs_level=trace` against a 1000-
row-group file: the first SELECT logs one `cache miss` + N `cache hit`
(one per per-bucket source), and the second / third SELECTs log only
the N per-source `cache hit`s — the splitter no longer participates in
the cache hit/miss path on the warm-cache common case.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=e7d8afc6a61112cc487394334e6b20e0d95ba75a&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29
PR: #104431

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-104251-parquet-single-file-parallelism

# Conflicts:
#	src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh Bot commented May 20, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 91.40% 91.40% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 88.69% (149/168) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants