TupleReadSupport fails when used directly with ParquetReader #195

Closed
prashantkommireddi opened this Issue Oct 10, 2013 · 2 comments


@prashantkommireddi

I noticed TupleReadSupport is tightly coupled with Pig. ParquetLoader sets a conf property that makes it work, but using ParquetReader with TupleReadSupport directly does not have this conf property set.

import org.apache.hadoop.fs.Path;
import org.apache.pig.data.Tuple;
import parquet.hadoop.ParquetReader;
import parquet.hadoop.api.ReadSupport;
import parquet.pig.TupleReadSupport;

ReadSupport<Tuple> readSupport = new TupleReadSupport();
Path file = new Path("/home/pkommireddi/Downloads/test.gz.parquet");
ParquetReader<Tuple> reader = new ParquetReader<Tuple>(file, readSupport);

This fails with the following exception:
parquet.io.ParquetDecodingException: Missing Pig schema: ParquetLoader sets the schema in the job conf
at parquet.pig.TupleReadSupport.prepareForRead(TupleReadSupport.java:160)
at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
at parquet.hadoop.ParquetReader.initReader(ParquetReader.java:114)
at parquet.hadoop.ParquetReader.read(ParquetReader.java:98)

The issue here is that ParquetLoader creates a schema and sets it in the configuration, and that step is missing in the case above.

From TupleReadSupport.java

if (requestedPigSchema == null) {
     throw new ParquetDecodingException("Missing Pig schema: ParquetLoader sets the schema in the job conf");
}

requestedPigSchema should fall back to the Parquet schema when it has not been set.
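The suggested fallback could look like the pattern below; this is a minimal, self-contained sketch only (the real fix would presumably derive the schema via getPigSchemaFromFile rather than this stub, and all names here are illustrative):

```java
// Sketch of the suggested fallback: prefer the schema that
// ParquetLoader put in the job conf; otherwise derive one from the
// Parquet file schema instead of throwing.
class PigSchemaFallback {
    static String resolve(String requestedPigSchema, String parquetFileSchema) {
        if (requestedPigSchema != null) {
            return requestedPigSchema;  // set by ParquetLoader in the job conf
        }
        // Fallback: generate a Pig schema from the file's own Parquet schema
        return deriveFromParquet(parquetFileSchema);
    }

    // Stub standing in for a getPigSchemaFromFile-style conversion
    static String deriveFromParquet(String parquetFileSchema) {
        return parquetFileSchema;
    }
}
```

With this shape, the direct ParquetReader+TupleReadSupport usage above would no longer need the conf property to be present.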

@prashantkommireddi

@julienledem's thoughts

There are two things.

  1. We need to be able to communicate the schema to the ReadSupport when the user provides the schema in the constructor or when Pig pushes down a projection. This is why we set it in the configuration.
  2. The read support should be able to read with the Pig schema saved in the file metadata, or fall back to generating one from the Parquet schema.

It sounds like, inside that if block, it should call getPigSchemaFromFile(fileSchema, keyValueMetaData).
Alternatively, it could do this in init and pass the result to the tasks through the read context.
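That second option could be sketched like this; a toy model only, using plain maps in place of parquet's ReadSupport.ReadContext (all names are hypothetical, not the actual parquet-pig API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the init/read-context option: resolve the schema once
// in init() and hand it to each task through the read context, so
// prepareForRead() never sees a missing schema.
class ReadContextSketch {
    // init() runs once on the client: prefer the conf schema, else
    // derive one from the file's Parquet schema (derivation stubbed).
    static Map<String, String> init(String confPigSchema, String parquetFileSchema) {
        Map<String, String> readContext = new HashMap<>();
        String resolved = (confPigSchema != null)
                ? confPigSchema
                : deriveFromParquet(parquetFileSchema);
        readContext.put("pig.schema", resolved);
        return readContext;
    }

    // prepareForRead() runs in each task and just reads the context.
    static String prepareForRead(Map<String, String> readContext) {
        return readContext.get("pig.schema");
    }

    // Stand-in for a getPigSchemaFromFile-style conversion
    static String deriveFromParquet(String parquetFileSchema) {
        return parquetFileSchema;
    }
}
```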

The other thing to be careful about is that we use the right schema, since we want all files to return the same schema.

@aniket486
Contributor

Fixed in #175.

@aniket486 aniket486 closed this Jan 13, 2014
@cloudera-hudson cloudera-hudson pushed a commit to cloudera/parquet-mr that referenced this issue Nov 23, 2015
@tsdeng @rdblue tsdeng + rdblue PARQUET-278 : enforce non empty group on MessageType level
As a columnar format, Parquet currently does not support an empty struct/group without leaves. We should throw when constructing an empty GroupType to give a clear message.

Author: Tianshuo Deng <tdeng@twitter.com>

Closes #195 from tsdeng/message_type_enforce_non_empty_group and squashes the following commits:

a286c58 [Tianshuo Deng] revert change to merge_parquet_pr
a09f6ba [Tianshuo Deng] fix test
ac63567 [Tianshuo Deng] fix tests
aa2633c [Tianshuo Deng] enforce non empty group on MessageType level

Conflicts:
	parquet-thrift/src/main/java/parquet/thrift/ThriftSchemaConverter.java
Resolution:
    Fixed Pig test imports
    Thrift schema converter: conflict with removed assertion because
      PARQUET-162 was not backported. also fixed a test removed in
      PARQUET-162: the filter works, but selects an empty group.
c7e44e9
@julienledem julienledem pushed a commit to julienledem/old-parquet-mr that referenced this issue Jul 30, 2016
@tsdeng tsdeng PARQUET-278 : enforce non empty group on MessageType level
60edcf9