Parquet reader reports inflated bytes/s by counting all columns instead of selected columns #105333

@minguyen9988

Description

Bug Report

Summary: Both Parquet reader implementations (ParquetBlockInputFormat and native V3 ReadManager) report "bytes read" using the entire row group's total_compressed_size (all columns), not just the columns selected by the query. This inflates reported read_bytes and bytes/s statistics by a factor of roughly total_columns / selected_columns, assuming columns of similar compressed size.

Severity: Progress bar, system.query_log.read_bytes, and client output show significantly wrong values — especially for wide tables and Iceberg/DataLake workloads with many Parquet files (reporting TB/s when actual I/O is GB/s or less).

Reproduction

  1. Create a Parquet file with 50 columns, ~200MB total compressed
  2. Run: SELECT col1, col2 FROM file('test.parquet') (reading 2 of 50 columns)
  3. Check system.query_log: read_bytes reports ~200MB instead of ~8MB
  4. Overcount is ~25x (50 / 2)

For Iceberg tables with 1000+ Parquet files, this compounds to absurd TB/s statistics.
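The arithmetic behind the reproduction numbers can be sketched as follows. This is a minimal illustration, not reader code: the function names are hypothetical, and it assumes every column has the same compressed size, which real files rarely satisfy exactly.

```cpp
#include <cassert>
#include <cstdint>

// Expected read_bytes if only the selected columns were counted.
// Assumption (not taken from the reader code): all columns have equal compressed size.
inline std::uint64_t expectedSelectedBytes(std::uint64_t total_compressed_bytes,
                                           std::uint64_t total_columns,
                                           std::uint64_t selected_columns)
{
    assert(total_columns > 0 && selected_columns <= total_columns);
    return total_compressed_bytes * selected_columns / total_columns;
}

// Ratio between the reported byte count (all columns) and the expected one
// (selected columns only), under the same equal-size assumption.
inline double overcountFactor(std::uint64_t total_columns, std::uint64_t selected_columns)
{
    assert(selected_columns > 0 && selected_columns <= total_columns);
    return static_cast<double>(total_columns) / static_cast<double>(selected_columns);
}
```

With the numbers from the steps above, `expectedSelectedBytes(200000000, 50, 2)` gives 8000000 (about 8MB) and `overcountFactor(50, 2)` gives 25.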

Root Cause

Arrow-based reader (ParquetBlockInputFormat.cpp, line ~901):

auto row_group_size = metadata->RowGroup(row_group)->total_compressed_size();
// ^^^ ALL columns in the row group, not just selected ones
row_group_batches.back().total_bytes_compressed += row_group_size;

Then used in get_approx_original_chunk_size (line ~1110):

return static_cast<size_t>(std::ceil(
    static_cast<double>(row_group_batch.total_bytes_compressed) /
    static_cast<double>(row_group_batch.total_rows) *
    static_cast<double>(num_rows)));
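To see how the inflated total propagates into each chunk's estimate, here is a standalone sketch of the same proportional formula. `RowGroupBatchSketch` and `approxChunkBytes` are hypothetical stand-ins for the reader's own types, and the zero-row guard and the precondition are additions made for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical, simplified stand-in for the per-batch counters named above.
struct RowGroupBatchSketch
{
    std::size_t total_bytes_compressed;
    std::size_t total_rows;
};

// Attribute a share of the batch's compressed bytes to a chunk of num_rows rows.
// Whatever went into total_bytes_compressed is scaled linearly, so an
// all-columns total yields an all-columns estimate for every chunk.
std::size_t approxChunkBytes(const RowGroupBatchSketch & batch, std::size_t num_rows)
{
    assert(num_rows <= batch.total_rows); // sketch precondition: a chunk is part of the batch
    if (batch.total_rows == 0)
        return 0; // guard added for the sketch
    return static_cast<std::size_t>(std::ceil(
        static_cast<double>(batch.total_bytes_compressed) /
        static_cast<double>(batch.total_rows) *
        static_cast<double>(num_rows)));
}
```

For an assumed batch of 1,000,000 rows and a chunk of 8192 rows, a total of 200,000,000 bytes (all columns) yields 1,638,400 bytes per chunk, while 8,000,000 bytes (selected columns only) yields 65,536.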

Native V3 reader (ReadManager.cpp, line ~1130):

size_t virtual_bytes_read = size_t(row_group.meta->total_compressed_size) *
    row_subgroup.filter.rows_total /
    std::max(size_t(1), size_t(row_group.meta->num_rows));
// ^^^ Same: total_compressed_size is for ALL columns

The existing comment at ReadManager.cpp:1112 acknowledges this: "This is a terrible hack to make progress indication kind of work."

Proposed Fix

For ParquetBlockInputFormat — sum ColumnChunk(col)->total_compressed_size() for only the selected column_indices (which are already computed earlier in the same function):

size_t row_group_size = 0;
auto row_group_meta = metadata->RowGroup(row_group);
for (int column_index : column_indices)
    row_group_size += row_group_meta->ColumnChunk(column_index)->total_compressed_size();

For ReadManager — sum per-column compressed sizes from row_group.columns (parallel to primitive_columns, contains only selected columns):

size_t selected_columns_compressed_bytes = 0;
for (const auto & col : row_group.columns)
    selected_columns_compressed_bytes += col.meta->meta_data.total_compressed_size;
size_t virtual_bytes_read = selected_columns_compressed_bytes *
    row_subgroup.filter.rows_total /
    std::max(size_t(1), size_t(row_group.meta->num_rows));
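Both proposed fixes follow the same pattern: sum per-column-chunk sizes for the selected columns, then scale by the fraction of rows read. The sketch below models that pattern without any Arrow or ClickHouse dependency. `ColumnChunkSketch` and both function names are hypothetical, and skipping non-positive sizes is an extra safeguard assumed here, not part of the proposal.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for per-column-chunk Parquet metadata.
struct ColumnChunkSketch
{
    std::int64_t total_compressed_size; // bytes, as stored in the file metadata
};

// Sum the compressed sizes of the selected column chunks only.
std::size_t selectedCompressedBytes(const std::vector<ColumnChunkSketch> & row_group_columns,
                                    const std::vector<int> & column_indices)
{
    std::size_t sum = 0;
    for (int idx : column_indices)
    {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < row_group_columns.size());
        const std::int64_t size = row_group_columns[static_cast<std::size_t>(idx)].total_compressed_size;
        if (size > 0) // extra safeguard: ignore zero or negative sizes from malformed metadata
            sum += static_cast<std::size_t>(size);
    }
    return sum;
}

// Scale the selected bytes by the fraction of the row group's rows being read.
std::size_t virtualBytesRead(std::size_t selected_bytes,
                             std::size_t rows_in_subgroup,
                             std::size_t rows_in_row_group)
{
    return selected_bytes * rows_in_subgroup / std::max<std::size_t>(1, rows_in_row_group);
}
```

With column chunks of 1000, 100, 100 and 50000 bytes and columns 1 and 2 selected, the selected sum is 200 bytes against a row-group total of 51200. The real overcount in that case is 256x, not the 2x a plain column-count ratio suggests, which is why summing actual chunk sizes is preferable to dividing by the number of columns.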

Additional: IcebergIterator never invokes file_progress_callback

IcebergIterator receives a FileProgressCallback in its constructor (line ~358) and stores it (line ~390), but never calls it in next(). The generic KeysIterator in IDataLakeMetadata.cpp:46 correctly invokes the callback. This means Iceberg tables have broken file-level progress tracking.
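The missing behavior can be illustrated with a small model of an iterator that reports each file to a progress callback as it is handed out. Every name below is a hypothetical stand-in rather than a ClickHouse type, and the callback signature (a single file size in bytes) is an assumption made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are ClickHouse types.
struct FileEntrySketch
{
    std::string path;
    std::size_t size_bytes;
};

using FileProgressCallbackSketch = std::function<void(std::size_t /* file_size_bytes */)>;

class FileIteratorSketch
{
public:
    FileIteratorSketch(std::vector<FileEntrySketch> files_, FileProgressCallbackSketch callback_)
        : files(std::move(files_)), callback(std::move(callback_))
    {
    }

    // Returns the next file, or nullptr when exhausted.
    // The stored callback is invoked once for every file handed out.
    const FileEntrySketch * next()
    {
        assert(pos <= files.size());
        if (pos == files.size())
            return nullptr;
        const FileEntrySketch & entry = files[pos++];
        if (callback) // tolerate an empty callback
            callback(entry.size_bytes);
        return &entry;
    }

private:
    std::vector<FileEntrySketch> files;
    FileProgressCallbackSketch callback;
    std::size_t pos = 0;
};
```

An iterator that stores the callback but never calls it in `next()` leaves the caller's totals at zero, which matches the symptom described for IcebergIterator.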

Impact

| Scenario | Overcount factor |
|---|---|
| 50 columns, select 2 | 25x |
| 100 columns, select 1 | 100x |
| 200 columns, select 3 | ~67x |
| Iceberg 1000 files × 50 cols, select 2 | 25x per file, cumulative |

Version

Tested on ClickHouse master (commit 05d6ca90) and 26.5 builds.

Environment

Linux x86_64, Parquet files on local filesystem and NFS-mounted Iceberg tables via REST catalog.

Labels: bug (Confirmed user-visible misbehaviour in official release)
