This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:

1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
2. To avoid reading file footers on the client side, which is a performance bottleneck.

Because the read context is reinitialized in every task, it is no longer necessary to pass its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat, that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.

Author: Ryan Blue <firstname.lastname@example.org>

Closes #91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits:

cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes.
09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews.
3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat.
8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.
87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods.
057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks.
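
The first reason given above, validating the requested projection against each file schema rather than once against the merged schema, can be illustrated with a small standalone sketch. It does not use the Parquet API: schemas are modeled as plain column-name-to-type maps, and `PerFileProjectionCheck`, `schema`, `merge`, and `mismatches` are hypothetical names invented for this illustration. Treating a missing column as a mismatch is also an assumption of the sketch, not a statement about how Parquet handles it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerFileProjectionCheck {

    // Hypothetical helper: builds a schema from alternating name/type pairs.
    public static Map<String, String> schema(String... nameTypePairs) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameTypePairs.length; i += 2) {
            columns.put(nameTypePairs[i], nameTypePairs[i + 1]);
        }
        return columns;
    }

    // Hypothetical stand-in for a job-wide merged schema:
    // the union of all columns, keeping the first type seen for each name.
    public static Map<String, String> merge(List<Map<String, String>> fileSchemas) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> fileSchema : fileSchemas) {
            for (Map.Entry<String, String> column : fileSchema.entrySet()) {
                merged.putIfAbsent(column.getKey(), column.getValue());
            }
        }
        return merged;
    }

    // Lists requested columns that are absent from the given schema
    // or present with a different type (assumed policy for this sketch).
    public static List<String> mismatches(Map<String, String> requested,
                                          Map<String, String> schema) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> column : requested.entrySet()) {
            String actualType = schema.get(column.getKey());
            if (!column.getValue().equals(actualType)) {
                problems.add(column.getKey());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> fileA = schema("id", "int64", "name", "binary");
        Map<String, String> fileB = schema("id", "int64");
        Map<String, String> requested = schema("name", "binary");

        // One check against the merged schema (job-wide validation).
        Map<String, String> merged = merge(Arrays.asList(fileA, fileB));
        System.out.println("merged schema: " + mismatches(requested, merged));

        // One check per file (task-side validation).
        System.out.println("file A: " + mismatches(requested, fileA));
        System.out.println("file B: " + mismatches(requested, fileB));
    }
}
```

In this toy setup the merged schema contains every requested column, so a single job-wide check reports nothing, while the per-file check flags the column that one file lacks. That is the kind of discrepancy a per-task check can surface.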
|parquet-hive-storage-handler||PARQUET-139: Avoid reading footers when using task-side metadata|
|REVIEWERS.md||PARQUET-111: Updates for apache release|
|pom.xml||PARQUET-111: Updates for apache release|