Antalya 26.3: Parallelize reads from a single Parquet file in StorageFile #1806
…solution in next commit) --- Original cherry-pick message follows: Merge pull request ClickHouse#104251 from alexey-milovidov/parquet-single-file-parallelism Parallelize reads from a single Parquet file in StorageFile # Conflicts: # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.cpp # src/Processors/Formats/Impl/ParquetV3BlockInputFormat.h
RelEasy
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Reading a single large local Parquet file via `file()`/`File` engine is now parallelised across multiple sources, each handling a subset of row groups. This eliminates a `Resize 1 → N` bottleneck in the pipeline and brings single-file ClickBench performance close to the partitioned variant: Q23 goes from ~1.4s to ~0.55s, Q22 from ~0.9s to ~0.48s, Q27 from ~1.6s to ~0.54s on 96 vCPUs (ClickHouse#104251 by @alexey-milovidov).

Cherry-picked from ClickHouse#104251.

On ClickBench, single-file Parquet runs are 3–9× slower than the 100-file partitioned runs on the same data (e.g. on `c7a.metal-48xl`, Q23 is `8.90s` vs `0.99s`, Q22 `1.82s` vs `0.41s`, Q27 `1.21s` vs `0.45s`). The cause is in `StorageFile`: when reading a single splittable file it creates exactly one `ParquetV3BlockInputFormat` source, so the pipeline becomes `File 0 → 1` followed by `Resize 1 → 96`. That fan-out is a serialization point: every chunk has to leave the single source through one `read` before any of the 96 aggregators can touch it, so most cores sit idle.

The bucket-splitting machinery (`ParquetBucketSplitter`, `setBucketsToRead`, `FileBucketInfo`) already existed for cluster mode but was never wired into `StorageFile`. This PR wires it in:

- `IBucketSplitter::splitToBucketsByCount` returning roughly N contiguous row-group ranges; Parquet implements it.
- `FormatFactory::checkFormatHasSplitter` so callers can probe without throwing.
- `StorageFile::ReadFromFile::initializePipeline`, when reading exactly one local splittable file, asks the splitter for `max_num_streams` buckets and creates one `StorageFileSource` per bucket. Each source carries `fixed_file_path` + `file_bucket_info` and skips the shared `FilesIterator`.
- `ParquetV3BlockInputFormat::read` honours `buckets_to_read` in the trivial-count path so each bucket only reports its own row count.

Pipeline becomes `File × N 0 → 1` straight into the aggregators, matching the partitioned variant.

Results
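For intuition about how row groups end up distributed across sources, the contiguous split can be sketched as below. This is a minimal, standalone illustration and not the code from this PR: `splitRowGroupsByCount` and `countRowsInBucket` are hypothetical helpers, and the sketch assumes an even split by row-group count, whereas the real `splitToBucketsByCount` may choose boundaries differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not ClickHouse's IBucketSplitter API).
// Splits row groups [0, num_row_groups) into at most `max_buckets` contiguous,
// non-empty half-open ranges [begin, end) whose sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>>
splitRowGroupsByCount(std::size_t num_row_groups, std::size_t max_buckets)
{
    std::vector<std::pair<std::size_t, std::size_t>> buckets;
    if (num_row_groups == 0 || max_buckets == 0)
        return buckets;

    // Never create more buckets than there are row groups.
    const std::size_t num_buckets = std::min(num_row_groups, max_buckets);
    const std::size_t base = num_row_groups / num_buckets;
    const std::size_t remainder = num_row_groups % num_buckets;

    std::size_t begin = 0;
    for (std::size_t i = 0; i < num_buckets; ++i)
    {
        // The first `remainder` buckets take one extra row group.
        const std::size_t size = base + (i < remainder ? 1 : 0);
        buckets.emplace_back(begin, begin + size);
        begin += size;
    }
    return buckets;
}

// Hypothetical analogue of the trivial-count path: a source sums the row
// counts of the row groups in its own bucket only, so per-bucket counts
// add up to the file total without double counting.
std::size_t countRowsInBucket(
    const std::vector<std::size_t> & rows_per_row_group,
    std::pair<std::size_t, std::size_t> bucket)
{
    std::size_t rows = 0;
    for (std::size_t i = bucket.first; i < bucket.second; ++i)
        rows += rows_per_row_group[i];
    return rows;
}
```

Under this assumed even split, the 226 row groups of the `hits.parquet` run in this PR, divided across 96 streams (assuming `max_num_streams` is 96), would give 34 buckets of 3 row groups and 62 buckets of 2.
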
96-vCPU box, `hits.parquet` (14 GiB, 226 row groups):

CPU utilisation on Q23 jumped from ~6× to ~18× of 96 cores. Aggregate results (`count`, `sum(UserID)`, `sum(length(URL))`, Q21, Q23) match the partitioned variant exactly. The remaining ~1.3× gap to partitioned is per-source initialization overhead: each bucket source still reads the 14 GB file's footer separately. Sharing parsed metadata for local files is the obvious next step but a much bigger change.

Documentation entry for user-facing changes