Bug Report
Summary: Both Parquet reader implementations (ParquetBlockInputFormat and native V3 ReadManager) report "bytes read" using the entire row group's total_compressed_size (all columns), not just the columns selected by the query. This inflates reported read_bytes and bytes/s statistics by a factor of roughly total_columns / selected_columns (exact when all columns have the same compressed size).
Severity: Progress bar, system.query_log.read_bytes, and client output show significantly wrong values — especially for wide tables and Iceberg/DataLake workloads with many Parquet files (reporting TB/s when actual I/O is GB/s or less).
Reproduction
- Create a Parquet file with 50 columns, ~200MB total compressed
- Run: SELECT col1, col2 FROM file('test.parquet') (reading 2 of 50 columns)
- Check system.query_log: read_bytes reports ~200MB instead of ~8MB
- Overcount is ~25x (50 / 2)
For Iceberg tables with 1000+ Parquet files, the per-file overcount adds up across files and produces absurd TB/s statistics.
Root Cause
Arrow-based reader (ParquetBlockInputFormat.cpp, line ~901):
auto row_group_size = metadata->RowGroup(row_group)->total_compressed_size();
// ^^^ ALL columns in the row group, not just selected ones
row_group_batches.back().total_bytes_compressed += row_group_size;
Then used in get_approx_original_chunk_size (line ~1110):
return static_cast<size_t>(std::ceil(
    static_cast<double>(row_group_batch.total_bytes_compressed) /
    static_cast<double>(row_group_batch.total_rows) *
    static_cast<double>(num_rows)));
Native V3 reader (ReadManager.cpp, line ~1130):
size_t virtual_bytes_read = size_t(row_group.meta->total_compressed_size) *
    row_subgroup.filter.rows_total /
    std::max(size_t(1), size_t(row_group.meta->num_rows));
// ^^^ Same: total_compressed_size is for ALL columns
The existing comment at ReadManager.cpp:1112 acknowledges this: "This is a terrible hack to make progress indication kind of work."
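To make the size of the overcount concrete, here is a minimal, self-contained sketch of the arithmetic. It does not use the Arrow or ClickHouse types: `RowGroupModel` and the three functions are hypothetical names introduced only for illustration, and the model assumes the per-column compressed sizes are known.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Parquet row group metadata:
// one compressed size per column chunk.
struct RowGroupModel
{
    std::vector<uint64_t> column_compressed_bytes;
};

// What both readers currently report: the sum over ALL column chunks.
uint64_t totalCompressedBytes(const RowGroupModel & row_group)
{
    uint64_t total = 0;
    for (uint64_t bytes : row_group.column_compressed_bytes)
        total += bytes;
    return total;
}

// What the query actually needs: the sum over the selected column chunks only.
uint64_t selectedCompressedBytes(const RowGroupModel & row_group, const std::vector<size_t> & column_indices)
{
    uint64_t selected = 0;
    for (size_t index : column_indices)
    {
        assert(index < row_group.column_compressed_bytes.size());
        selected += row_group.column_compressed_bytes[index];
    }
    return selected;
}

// Ratio of reported bytes to needed bytes; returns 0 when nothing is selected.
double overcountFactor(const RowGroupModel & row_group, const std::vector<size_t> & column_indices)
{
    const uint64_t selected = selectedCompressedBytes(row_group, column_indices);
    if (selected == 0)
        return 0.0;
    return static_cast<double>(totalCompressedBytes(row_group)) / static_cast<double>(selected);
}
```

With 50 columns of 4,000,000 bytes each and columns 0 and 1 selected, the total is 200,000,000 bytes, the selected sum is 8,000,000 bytes, and the factor is 25, matching the reproduction above. With uneven column sizes the factor depends on which columns are selected.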
Proposed Fix
For ParquetBlockInputFormat — sum ColumnChunk(col)->total_compressed_size() for only the selected column_indices (which are already computed earlier in the same function):
size_t row_group_size = 0;
auto row_group_meta = metadata->RowGroup(row_group);
for (int column_index : column_indices)
    row_group_size += row_group_meta->ColumnChunk(column_index)->total_compressed_size();
For ReadManager — sum per-column compressed sizes from row_group.columns (parallel to primitive_columns, contains only selected columns):
size_t selected_columns_compressed_bytes = 0;
for (const auto & col : row_group.columns)
    selected_columns_compressed_bytes += col.meta->meta_data.total_compressed_size;
size_t virtual_bytes_read = selected_columns_compressed_bytes *
    row_subgroup.filter.rows_total /
    std::max(size_t(1), size_t(row_group.meta->num_rows));
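The row-proportional scaling in the ReadManager fix can be checked in isolation. The sketch below is a standalone model of that expression, not the actual ClickHouse code: `estimateVirtualBytesRead` and its parameters are hypothetical names, and the selected columns' compressed sizes are passed in as plain integers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper mirroring the proposed arithmetic: scale the selected
// columns' compressed bytes by the fraction of the row group's rows that
// belong to the subgroup being read.
size_t estimateVirtualBytesRead(
    const std::vector<uint64_t> & selected_column_compressed_bytes,
    size_t subgroup_rows,
    size_t row_group_rows)
{
    size_t selected_bytes = 0;
    for (uint64_t bytes : selected_column_compressed_bytes)
        selected_bytes += static_cast<size_t>(bytes);

    // Same guard as the original expression: avoid dividing by zero
    // when the row group reports no rows.
    return selected_bytes * subgroup_rows / std::max(size_t(1), row_group_rows);
}
```

For example, two selected columns of 4,000,000 bytes each and a subgroup covering 250,000 of 1,000,000 rows give an estimate of 2,000,000 bytes. The result is still an approximation, because compressed bytes are not evenly distributed across rows, but it no longer counts columns the query never reads.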
Additional: IcebergIterator never invokes file_progress_callback
IcebergIterator receives a FileProgressCallback in its constructor (line ~358) and stores it (line ~390), but never calls it in next(). The generic KeysIterator in IDataLakeMetadata.cpp:46 correctly invokes the callback. This means Iceberg tables have broken file-level progress tracking.
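The expected behaviour is that every file handed out by next() is also reported to the stored callback. The following is a simplified sketch of that pattern only. All types here (`FileInfo`, `FileProgressCallbackSketch`, `FileIteratorSketch`) are hypothetical, and the callback signature is an assumption that may not match the real FileProgressCallback.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified file descriptor.
struct FileInfo
{
    std::string path;
    size_t size_bytes;
};

// Assumed callback shape: receives the size of each file that is handed out.
using FileProgressCallbackSketch = std::function<void(size_t /*file_size_bytes*/)>;

class FileIteratorSketch
{
public:
    FileIteratorSketch(std::vector<FileInfo> files_, FileProgressCallbackSketch callback_)
        : files(std::move(files_)), callback(std::move(callback_))
    {
    }

    // Returns the next file and reports it to the progress callback, if one was provided.
    std::optional<FileInfo> next()
    {
        if (position >= files.size())
            return std::nullopt;

        const FileInfo & file = files[position++];
        if (callback)
            callback(file.size_bytes);
        return file;
    }

private:
    std::vector<FileInfo> files;
    FileProgressCallbackSketch callback;
    size_t position = 0;
};
```

The point of the sketch is the call inside next(): storing the callback in the constructor is not enough, since file-level progress only advances when the callback is invoked for each returned file.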
Impact
| Scenario | Overcount factor |
|---|---|
| 50 columns, select 2 | 25x |
| 100 columns, select 1 | 100x |
| 200 columns, select 3 | 67x |
| Iceberg 1000 files × 50 cols, select 2 | 25x per file, cumulative |
Version
Tested on ClickHouse master (commit 05d6ca90) and 26.5 builds.
Environment
Linux x86_64, Parquet files on local filesystem and NFS-mounted Iceberg tables via REST catalog.