Revert "Revert "Parallelize reads from a single Parquet file in StorageFile""#104431
Revert "Revert "Parallelize reads from a single Parquet file in StorageFile""#104431alexey-milovidov wants to merge 16 commits into
Conversation
|
Workflow [PR], commit [fe78f35] Summary: ✅ AI ReviewSummaryThis revert restores single-file Parquet parallel read splitting in Final VerdictStatus: ✅ Approve |
…t-104251-parquet-single-file-parallelism
…t-104251-parquet-single-file-parallelism
…t-104251-parquet-single-file-parallelism
…t-104251-parquet-single-file-parallelism
…-file-parallelism
When a single Parquet file is split into multiple bucketed sources by `StorageFile` (the path re-introduced in this PR), the file-level count cache must be bypassed: it is keyed by file path, so consulting or writing it from the bucketed read path would have every source report the file's full row count and multiply the result by the number of buckets. Addresses review feedback on PR #104431 asking for explicit regression coverage of this invariant.
Three gates on the bucketed single-file read path in `StorageFile`, addressing the `clickbench_parquet_short` regression observed on ARM in the CI of #104431 (#104431) and an open review comment. 1. `parallelize_output_from_storages = 0` now disables the split. The per-bucket sources are exactly the kind of read parallelism the setting's contract is about, but the existing check fired only after the sources had been created. Review feedback from #104431 (comment). 2. `need_only_count` queries skip the split. They consult only the file's metadata, so splitting them across N sources just multiplies the metadata-parse cost N-fold without any read-side benefit. This was the largest single contributor to the `Q1` (`SELECT COUNT(*)`) regression in `clickbench_parquet_short`. 3. The Parquet splitter (`ParquetBucketSplitter::splitToBucketsByCount`) now requires each chunk to cover at least 8 row groups. For a file with a small number of row groups, parallelising across all available threads multiplies per-bucket metadata-parse and prefetcher-setup overhead without giving each source enough work to amortise it. Large files (many row groups) still get full parallelism. Updates `02725_parquet_preserve_order.reference`: the 2-row-group test file falls below the new row-group floor so the pipeline goes back to a single `File` source followed by `Resize 1 → 2`, matching pre-#104251 behaviour. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=78ecefef098eedd55ab0a0ce350082364c8c23be&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29
|
However, there is still a chance to speed it up: https://benchmark.clickhouse.com/#system=+curp|H(u,s&type=-&machine=-ca4e|6t|ca2|ae-|g4e|6ax|6ale|3al&cluster_size=-&opensource=+s&hardware=+c&tuned=+n&metric=hot&queries=- |
|
From now on, we will be interested in accelerating those few queries that show the difference between the "single" and "partitioned" variants. If we can make it without degradations of other queries, we can merge this PR. |
… single-file split This test pins the contract requested in #104431 (comment): when `parallelize_output_from_storages = 0` is set, the single-file Parquet split path in `StorageFile` must not fan out into multiple per-bucket sources, even if `max_threads > 1` and the file has enough row groups to otherwise be split. The existing `04230_parquet_single_file_parallel_count.sql` only exercises `parallelize_output_from_storages = 1`. The previous regression was specifically that `parallelize_output_from_storages = 0` was ignored for this branch, so a dedicated test is needed to prevent silent regressions. The test creates a 40-row-group Parquet file (well above the 8-row-groups-per-chunk floor in `ParquetBucketSplitter`) and uses `EXPLAIN PIPELINE` to assert that with `parallelize_output_from_storages = 0` there is no `File ×` multiplier and no `Resize` step, and with `parallelize_output_from_storages = 1` the file IS split (the pipeline contains `File × N`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-104251-parquet-single-file-parallelism
|
@groeneai, three stateless
Same exception across all three: |
|
@groeneai, the The exact same failure mode hit PR #104445 on the same day with a different version pair ( |
`clickbench_parquet_short` on the synthetic 20-row-group file still regressed 18-37% on `arm_release` even after the `8-row-groups-per-chunk` floor in `ParquetBucketSplitter`: 20/8 = 2 chunks, and each extra bucket added ~1-3 ms of per-source setup (metadata parse, prefetcher init, schema conversion, hyperrectangle build, bloom-filter init) on top of the ~5 ms baseline. For "short" queries on small files the parallelism win does not pay for that fixed cost. Bump the floor to 16. The synthetic test now reads with a single source (20/16 = 1 chunk), while `04238_parquet_single_file_parallelize_output_off` still exercises a 2-chunk split (40/16 = 2) and a real `hits.parquet` (hundreds of row groups) still fans out up to `max_threads`. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=5f02d3d3200c9d6572d8848eb639c034a0fc0c62&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29 PR: #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@alexey-milovidov — confirmed, all three failures ( Status: root cause identified, short-term fix already merged
CIDB verification — WasmEdge MSan SystemLog flush timeouts on master, hourly (UTC):
Master goes to zero immediately after #104971 merges at 08:57Z. PR branches predate the fix and still hit it (expected — branches don't auto-rebase). Failure family count over 14 days: 14 distinct test names, ~720 hits across 200+ PRs — the 3 hits on this PR are part of that. For this PR: the head commit Let me know if you'd like me to file a follow-up issue once Algunenano re-enables v3 in #105019 (to track whether the proper fix holds up), or treat the family as resolved once the master baseline stays clean for 48-72h. — @groeneai (session |
After raising the per-chunk row-group floor to 16 in `ParquetBucketSplitter`, the test file (`numbers(1000)` at row-group size 50 = 20 row groups) stayed below the 2-chunk threshold (`32` row groups), so the bucketed read path was no longer exercised — the test was no longer a regression guard for the count-cache-on-bucketed-source bug it was added to pin. Bump the file to 64 row groups (`numbers(3200)`) so the splitter produces 4 chunks and the bucketed branch is actually hit. The asserted counts are updated to match. This addresses the open review comment on #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ache The single-file Parquet split path in `StorageFile` parses the file's metadata once in `ParquetBucketSplitter::splitToBucketsByCount` and then the per-bucket `ParquetV3BlockInputFormat` sources each parse it again in `getFileMetadata`. The splitter's parse went through Arrow's `parquet::ReadMetaData` (uncacheable) while the source's parse went through `Parquet::Reader::readFileMetaData` keyed in `ParquetMetadataCache` by `(file_path, etag)` — so the two paths could not share work. For "short" queries on a single Parquet file (`clickbench_parquet_short` on the 828 MB synthetic ARM run), the extra footer parse cost 1-1.5 ms on top of a 5-8 ms baseline — 12-30 % of the runtime — and remained after raising the per-chunk row-group floor to 16 because the splitter still parsed metadata before deciding not to split. Route the splitter through the same cache: * `ParquetBucketSplitter::splitToBucketsByCount` now uses `Parquet::Reader::readFileMetaData` via a `Parquet::Prefetcher` (the same code path `NativeParquetSchemaReader::initializeIfNeeded` uses), producing a `parquet::format::FileMetaData` storable in `ParquetMetadataCache`. * New `splitParquetFileWithCache` helper accepts a cache pointer and the same `(file_path, etag)` key the source builds (sub-second mtime + inode + size), populating the cache on miss and reusing the parse on hit. * `ReadFromFile::initializePipeline` calls the cache-aware helper for Parquet, so the per-bucket sources hit the cache and skip the footer parse. The single-bucket case (no fan-out) benefits too: the surviving single source still hits the cache. Net effect: cold cache → 1 parse total (same as master); warm cache → 0 parses on both paths. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=3c70a3490a9353117f2ecc65ca1458dd7285378c&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29 PR: #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@groeneai, two CI failures in this PR look unrelated to the Parquet changes — please verify and link any in-progress fix here, otherwise file as separate issues.
|
|
@alexey-milovidov — confirmed, both failures are unrelated to the Parquet changes in this PR. Cross-linking: 1. The assertion shape ( CIDB (30-day): only 4 hits across 4 distinct PRs, 0 on master. No in-progress fix PR. Failures are sparse and re-distributed across unrelated PRs, so the underlying race is still present at very low rate (~0.1%) even after #78474's cleanup. Suggest reopening #78474 (or filing a follow-up) rather than a fresh issue — I can do that next worker run. SELECT toStartOfDay(check_start_time) AS day, count() AS hits, count(DISTINCT pull_request_number) AS prs
FROM default.checks
WHERE check_start_time > now() - INTERVAL 30 DAY
AND test_name LIKE '%test_keeper_force_recovery%cluster_recovery%'
AND test_status IN ('FAIL','ERROR')
GROUP BY day ORDER BY day DESC
-- 2026-05-15: 1 hit (this PR) 2026-05-11: 1 (different shape) 2026-04-19: 2 hits2. The failing shape on
Fixes in flight covering the family:
|
The single-file Parquet split path in `ReadFromFile::initializePipeline` opens the file and runs `Parquet::Prefetcher::init` on every query, even when `splitParquetFileWithCache` ends up doing nothing but a hash-table lookup in `ParquetMetadataCache`. That fixed cost is ~0.3 ms — small absolute, but visibly slows "short" Parquet queries: on the `clickbench_parquet_short` ARM run for #104431, six of the seven queries were still 5-12 % slower than master after the row-group-floor and metadata-cache-sharing commits, with `#5` (`SELECT UserID ... WHERE UserID = ...`) over its +5.1 % threshold. Add a warm-cache fast path, `trySplitParquetFileFromCacheOnly`, that peeks at `ParquetMetadataCache` via `CacheBase::get` (no I/O, no prefetcher, no `FormatSettings` construction) and returns the bucket layout when the `(file_path, etag)` entry is already present. The caller in `StorageFile.cpp` runs that first; only on a real cache miss does it fall back to `createReadBuffer` + `splitParquetFileWithCache` and parse the footer. The first query against a file is unchanged; every subsequent query against the same file skips the redundant open and prefetcher init. Verified by `clickhouse-local --send_logs_level=trace` against a 1000- row-group file: the first SELECT logs one `cache miss` + N `cache hit` (one per per-bucket source), and the second / third SELECTs log only the N per-source `cache hit`s — the splitter no longer participates in the cache hit/miss path on the warm-cache common case. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104431&sha=e7d8afc6a61112cc487394334e6b20e0d95ba75a&name_0=PR&name_1=Performance%20Comparison%20%28arm_release%2C%20master_head%2C%201%2F6%29 PR: #104431 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-104251-parquet-single-file-parallelism
…t-104251-parquet-single-file-parallelism # Conflicts: # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h
LLVM Coverage Report
Changed lines: 88.69% (149/168) | lost baseline coverage: 1 line(s) · Uncovered code |

Reverts #104359