PARQUET-139: Avoid reading footers when using task-side metadata

This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:

1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
2. To avoid reading file footers on the client side, which is a performance bottleneck.
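The first reason can be illustrated with a toy model. This is only a sketch: a "schema" is reduced to a map from column name to type name, and the `ProjectionCheck` class, its `merge` and `isValid` methods, and the validity rule itself are hypothetical stand-ins, not Parquet's API or its actual compatibility rules. It shows how a projection can pass a check against the merged schema yet fail against one of the individual files.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: a "schema" is a map from column name to type name.
// None of these names come from Parquet; they are hypothetical.
class ProjectionCheck {

    // Union of all file schemas, as a job-level merge would produce.
    static Map<String, String> merge(List<Map<String, String>> fileSchemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> schema : fileSchemas) {
            merged.putAll(schema);
        }
        return merged;
    }

    // Assumed rule: every projected column must exist with the same type.
    static boolean isValid(Map<String, String> projection, Map<String, String> schema) {
        for (Map.Entry<String, String> col : projection.entrySet()) {
            if (!col.getValue().equals(schema.get(col.getKey()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> fileA = Map.of("id", "int64", "name", "binary");
        Map<String, String> fileB = Map.of("id", "int64");
        Map<String, String> projection = Map.of("name", "binary");

        // Checked once against the merged schema, the projection looks fine.
        Map<String, String> merged = merge(List.of(fileA, fileB));
        System.out.println("merged: " + isValid(projection, merged));

        // Checked per file, as a task would do, fileB is caught.
        System.out.println("fileA:  " + isValid(projection, fileA));
        System.out.println("fileB:  " + isValid(projection, fileB));
    }
}
```

In this toy, the per-file check is the only one that notices that the second file lacks the requested column, which is why the validation belongs in the task that actually opens the file.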

Because the read context is reinitialized in every task, it is no longer necessary to pass its contents to each task in ParquetInputSplit. The corresponding fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat, that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. Keeping it separate avoids accidental behavior changes in ParquetInputFormat.

Author: Ryan Blue <blue@apache.org>

Closes #91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits:

cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes.
09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews.
3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat.
8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.
87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods.
057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks.
latest commit ce65dfb394
@rdblue rdblue authored
parquet-hive-binding
parquet-hive-storage-handler PARQUET-139: Avoid reading footers when using task-side metadata
REVIEWERS.md PARQUET-111: Updates for apache release
pom.xml PARQUET-111: Updates for apache release