Filtered Reader Implementation, Avro Specific Support #68

Closed
wants to merge 11 commits

Contributor

JacobMetcalf commented Jun 23, 2013

As discussed on groups I have implemented:

  • The ability to supply a filter to a ParquetReader, enabling it to pull out only a select set of records based on a predicate, a page specifier, or both.
  • To support the above, a skip() operation has been added to the Plain and Dictionary readers.
  • Support for Avro Specific, in addition to the current Avro Generic support.
  • Tests for the above (a usage sketch follows this list).
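
A minimal usage sketch of the new filter API, assembled from the tests in this pull request (Car is the Avro-generated test class used in TestSpecificReadWrite; the page(start, size) argument order is assumed from the PagedRecordFilter constructor discussed below):

import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetReader;
import parquet.hadoop.ParquetReader;
import static parquet.filter.AndRecordFilter.and;
import static parquet.filter.ColumnRecordFilter.column;
import static parquet.filter.ColumnRecordFilter.equalTo;
import static parquet.filter.PagedRecordFilter.page;

public class FilteredReadSketch {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);

    // Predicate filter: only records whose "make" column equals "Volkswagen".
    ParquetReader<Car> reader =
        new AvroParquetReader<Car>(path, column("make", equalTo("Volkswagen")));

    // Filters compose: the same predicate restricted to a window of
    // 100 records starting at position 0 (page specifier).
    ParquetReader<Car> paged =
        new AvroParquetReader<Car>(path,
            and(column("make", equalTo("Volkswagen")), page(0, 100)));

    Car car;
    while ((car = reader.read()) != null) { // read() returns null once exhausted
      System.out.println(car);
    }
    reader.close();
  }
}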

@dvryaboy dvryaboy and 2 others commented on an outdated diff Jun 23, 2013

...vro/src/main/java/parquet/avro/AvroParquetWriter.java
@@ -44,6 +45,23 @@ public AvroParquetWriter(Path file, Schema avroSchema,
compressionCodecName, blockSize, pageSize);
}
+ /** Create a new {@link AvroParquetWriter}.
+ *
+ * @param file
+ * @param avroSchema
+ * @param compressionCodecName
+ * @param blockSize
+ * @param pageSize
+ * @throws IOException
+ */
+ public AvroParquetWriter(Path file, Schema avroSchema,
+ CompressionCodecName compressionCodecName, int blockSize,
+ int pageSize, boolean enableDictionary) throws IOException {
+ super(file, (WriteSupport<T>)
+ new AvroWriteSupport(new AvroSchemaConverter().convert(avroSchema), avroSchema),
+ compressionCodecName, blockSize, pageSize,enableDictionary,false);
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

style nit: spaces after commas

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed

@julienledem

julienledem Jun 29, 2013

Owner

I would also add the last parameter instead of a default of false.

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

I have two constructors, one with the parameter and one without - is that what you mean?

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...-avro/src/main/java/parquet/avro/AvroReadSupport.java
@@ -18,6 +18,7 @@
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

Here and in a few other files you converted from using GenericRecord to IndexedRecord, so this import becomes unnecessary. Also please fix javadocs where appropriate (including at the top of this file).

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed in coming push.

@dvryaboy dvryaboy commented on the diff Jun 23, 2013

...c/main/java/parquet/avro/AvroParquetOutputFormat.java
@@ -25,7 +26,7 @@
/**
* A Hadoop {@link org.apache.hadoop.mapreduce.OutputFormat} for Parquet files.
*/
-public class AvroParquetOutputFormat extends ParquetOutputFormat<GenericRecord> {
+public class AvroParquetOutputFormat extends ParquetOutputFormat<IndexedRecord> {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

Briefly looking at the Avro APIs (I am not fluent with Avro), this seems to be a change to a less specific interface -- allow indexing by index, and do not require (but still allow, the user would have to cast) indexing by name. Correct?
Is that likely to mess up existing users? Should we provide some convenience method that does the casting for them so they don't have to deal with it? I don't feel strongly one way or another, just want to hear your and other Avro people's thoughts.

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Basically, Avro Generic provides serialization for dynamic data structures. Avro Specific, on the other hand, generates typed Java (and other-language) objects from the schema at compile time, with getters, setters and serialization built in. This is similar to Protobuf - I don't know Thrift that well. IndexedRecord is the parent interface of both GenericRecord and SpecificRecord, so by moving the support up a level I aim to cover both communities.

AvroParquetReader is a Java generic, so I assume Avro Generic users would be able to instantiate an AvroParquetReader<GenericRecord> and Avro Specific users an AvroParquetReader<SpecificRecord>, for example. If you are suggesting I inherit from this to create an AvroGenericParquetReader, we could do this for convenience.

Would be happy to adapt to Tom White or anyone else's input on the matter.
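
To illustrate the distinction, the two usages side by side (a sketch assuming the pre-existing no-filter constructor; Car stands in for any schema-generated Avro Specific class):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetReader;
import parquet.hadoop.ParquetReader;

// Avro Generic: dynamic records, fields accessed by name or index at runtime.
ParquetReader<GenericRecord> generic = new AvroParquetReader<GenericRecord>(path);

// Avro Specific: the schema-compiled class, with typed getters and setters.
ParquetReader<Car> specific = new AvroParquetReader<Car>(path);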

@dvryaboy

dvryaboy Jun 30, 2013

Contributor

@tomwhite can you comment?

@JacobMetcalf

JacobMetcalf Jul 1, 2013

Contributor

@tomwhite also note I have introduced support for unions with multiple types. Would appreciate some feedback on this.

@tomwhite

tomwhite Jul 2, 2013

Contributor

Looks good. Can you add a test to TestAvroSchemaConverter for this. I noticed that you removed the test for a failure, but having a positive test would be good.

@JacobMetcalf

JacobMetcalf Jul 6, 2013

Contributor

Good point - added. Note that since there is no underlying representation of unions in parquet I have modelled this as separate "memberN" sub-records. This results in a sparsely populated column set but is about the only option I see.

We use unions heavily to replicate inheritance - so I would draw parallels between what I am doing here and a one table per class hierarchy in an ORM.
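
To make the "memberN" encoding concrete, here is a sketch of how a union of int and string could land in Parquet (field names follow the memberN convention described above; the exact type mapping is illustrative):

message Example {
  optional group myunion {
    optional int32 member0;   // populated when the union holds an int
    optional binary member1;  // populated when the union holds a string
  }
}

Only one memberN field is non-null for any given record, hence the sparsely populated column set.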

@julienledem

julienledem Jul 8, 2013

Owner

@JacobMetcalf Using a group with one field per type in the union (minus null) sounds good to me. Only one field being non-null. Parquet stores nulls efficiently so sparse columns are not a problem.

@dvryaboy dvryaboy commented on the diff Jun 23, 2013

...vro/src/main/java/parquet/avro/AvroParquetWriter.java
@@ -25,7 +26,7 @@
/**
* Write Avro records to a Parquet file.
*/
-public class AvroParquetWriter<T> extends ParquetWriter<T> {
+public class AvroParquetWriter<T extends IndexedRecord> extends ParquetWriter<T> {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

does this limit where this Writer is applicable?

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Avro Generic and Specific records both extend IndexedRecord. If you look at the original version of AvroWriteSupport, its write method was expecting a GenericRecord, so I believe this (a) expands the range of possibilities and (b) is more explicit about what is actually supported.

Would appreciate Tom White's views on this however.

@tomwhite

tomwhite Jul 2, 2013

Contributor

Changing from GenericRecord to IndexedRecord is an improvement - nice work Jacob! +1

It would be nice to support Avro Reflect, but that can be done separately.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...src/test/java/parquet/avro/TestSpecificReadWrite.java
+import org.junit.Test;
+import parquet.hadoop.ParquetReader;
+import parquet.hadoop.ParquetWriter;
+import parquet.hadoop.metadata.CompressionCodecName;
+
+import java.io.File;
+import java.io.IOException;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+import static parquet.filter.ColumnRecordFilter.column;
+import static parquet.filter.ColumnRecordFilter.equalTo;
+import static parquet.hadoop.ParquetWriter.DEFAULT_BLOCK_SIZE;
+import static parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE;
+
+public class TestSpecificReadWrite {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

could you add some comments explaining what "specific" means in this case and what you are testing? (are other tests testing a general read write?)

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed in coming push. Short explanation for Specific vs Generic above.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...vro/src/main/java/parquet/avro/AvroParquetWriter.java
@@ -44,6 +45,23 @@ public AvroParquetWriter(Path file, Schema avroSchema,
compressionCodecName, blockSize, pageSize);
}
+ /** Create a new {@link AvroParquetWriter}.
+ *
+ * @param file
+ * @param avroSchema
+ * @param compressionCodecName
+ * @param blockSize
+ * @param pageSize
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

missing the enableDictionary param and what that does

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...src/test/java/parquet/avro/TestSpecificReadWrite.java
+ }
+
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
+ File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
+ tmp.deleteOnExit();
+ tmp.delete();
+ Path path = new Path(tmp.getPath());
+
+ Car vwPolo = getVwPolo();
+ Car vwPassat = getVwPassat();
+ Car bmwMini = getBmwMini();
+
+ ParquetWriter<Car> writer = new AvroParquetWriter<Car>(path, Car.SCHEMA$, compression,
+ DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE,enableDictionary);
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

space after comma :)

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...src/test/java/parquet/avro/TestSpecificReadWrite.java
+
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
+ File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
+ tmp.deleteOnExit();
+ tmp.delete();
+ Path path = new Path(tmp.getPath());
+
+ Car vwPolo = getVwPolo();
+ Car vwPassat = getVwPassat();
+ Car bmwMini = getBmwMini();
+
+ ParquetWriter<Car> writer = new AvroParquetWriter<Car>(path, Car.SCHEMA$, compression,
+ DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE,enableDictionary);
+ for ( int i =0; i < num; i++ ) {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

space after =

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...src/test/java/parquet/avro/TestSpecificReadWrite.java
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
+ File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
+ tmp.deleteOnExit();
+ tmp.delete();
+ Path path = new Path(tmp.getPath());
+
+ Car vwPolo = getVwPolo();
+ Car vwPassat = getVwPassat();
+ Car bmwMini = getBmwMini();
+
+ ParquetWriter<Car> writer = new AvroParquetWriter<Car>(path, Car.SCHEMA$, compression,
+ DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE,enableDictionary);
+ for ( int i =0; i < num; i++ ) {
+ if (varyYear ) {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

no space before closed paren (or space after open paren)

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...src/test/java/parquet/avro/TestSpecificReadWrite.java
+ @Test
+ public void testFilterMatchesMultiple() throws IOException {
+
+ Path path = writeCarsToParquetFile(10, false, CompressionCodecName.UNCOMPRESSED, false);
+
+ ParquetReader<Car> reader = new AvroParquetReader<Car>(path, column("make", equalTo("Volkswagen")));
+ for ( int i =0; i < 10; i++ ) {
+ assertEquals(reader.read().toString(), getVwPolo().toString());
+ assertEquals(reader.read().toString(), getVwPassat().toString());
+ }
+ assertNull( reader.read());
+ }
+
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

you never call this with enableDictionary = true. Is there a reason you parametrized this? Should that be tested as well? Same for varyYear

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Have added a test for dictionary encoding. varyYear was for a perf test - removed.

@dvryaboy dvryaboy and 2 others commented on an outdated diff Jun 23, 2013

parquet-column/pom.xml
@@ -25,6 +25,11 @@
<version>1.7</version>
<scope>compile</scope>
</dependency>
+ <dependency>
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

(edited after reading more of the code) we should use maven shade to namespace guava.
(edited after thinking about it) can't use maven shade to namespace guava, since this is used in a user-facing API. Maybe we can pull out Predicate and friends out of Guava, and either include then in their proper namespace under com.google (con: risk breaking self or a user if Guava introduces a breaking change to Promises; pro: existing 3rd party implementations of Predicates work) or under a new namespace (swap the previous pros and cons -- no breakage, but requires new Predicates, which may be weird to explain).

@JacobMetcalf

JacobMetcalf Jun 29, 2013

Contributor

Not quite sure what you wanted me to do here. I did see the different versions of Guava and wondered.

@julienledem

julienledem Jun 29, 2013

Owner

We want to minimize the dependencies of this component. Pig, Hive, etc. depend on guava too, and guava does have regular incompatible releases. Dependency conflicts quickly become problematic.
We either use shading (see thrift, jackson) to remove the dependency conflict or re-implement simple utilities (see Preconditions).

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Agreed - you are working with a lot of libraries here, so minimal dependencies are wanted. I have re-implemented this and replaced a few other guava uses with core libraries. My only performance concern was replacing the use of Guava's Splitter with String.split().
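
For context on the Splitter concern: String.split() takes a regular expression and, in the general case, compiles it on every call, whereas a Splitter is built once. Precompiling a java.util.regex.Pattern keeps a core-library replacement close to Guava on hot paths. A sketch (the field names are hypothetical):

import com.google.common.base.Splitter;
import java.util.regex.Pattern;

// Guava: the delimiter is parsed once when the Splitter is built.
private static final Splitter DOT_SPLITTER = Splitter.on('.');
// usage: Iterable<String> parts = DOT_SPLITTER.split(columnPath);

// Core libraries only: compile the pattern once rather than calling
// columnPath.split("\\."), which may recompile the regex per call.
private static final Pattern DOT = Pattern.compile("\\.");
// usage: String[] segments = DOT.split(columnPath);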

@dvryaboy

dvryaboy Jun 30, 2013

Contributor

I think we can lift Guava's Splitter verbatim and put it into our own namespace (the license allows this, we just need to credit them appropriately).

@dvryaboy dvryaboy commented on the diff Jun 23, 2013

...c/main/java/parquet/column/impl/ColumnReaderImpl.java
* reads the current value
*/
public void readCurrentValue() {
binding.read();
}
- protected void checkValueRead() {
+ /**
+ * Reads the value into the binding or skips forwards.
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

This is a little confusing, since checkValueRead doesn't scream to me "this will actually do something to the stream" (that was a problem before you got here, too). Maybe something like "readIfPossible(boolean skipDeserialization)" would be more clear?

@julienledem

julienledem Jun 29, 2013

Owner

agreed, we should now rename this method

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

.../src/main/java/parquet/filter/ColumnRecordFilter.java
+
+ return ( filterOnColumn.isFullyConsumed()) ? false : filterPredicate.apply( filterOnColumn );
+ }
+
+ /**
+ * @return true if the column we are filtering on has no more values.
+ */
+ @Override
+ public boolean isFullyConsumed() {
+ return filterOnColumn.isFullyConsumed();
+ }
+
+ /**
+ * Predicate for string equality
+ */
+ public static final Predicate<ColumnReader> equalTo( final String value ) {
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

I'm not entirely sure why we only have equalTo for two specific types... is the idea to fill out a collection of useful predicates? Should we in that case break the provided predicates into their own class?

Not certain I have a total picture of how everything hooks together, but can this break if dictionary encoding is in use? If this is applied after dictionary decoding, that's fine, but we should create an issue to add the ability to apply dictionary filters before decoding for further gains. We should do that for both equality and regexes.

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Yes. It was to provide a much simpler way of setting up predicates on specific column types. I have now provided equalTo for all value types. Perhaps in a future push I can add other types of predicates, but then it could be left to the caller.

I have added a test to check equality in presence of dictionary encoding.
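
So under this change a predicate on a non-string column would look like the string case (a sketch: the "doors" column and the int overload of equalTo are assumptions based on "equalTo for all value types"):

import parquet.filter.UnboundRecordFilter;
import static parquet.filter.ColumnRecordFilter.column;
import static parquet.filter.ColumnRecordFilter.equalTo;

// String column, as exercised in the tests:
UnboundRecordFilter byMake = column("make", equalTo("Volkswagen"));

// Other value types get matching overloads, e.g. an int column:
UnboundRecordFilter byDoors = column("doors", equalTo(5));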

@dvryaboy dvryaboy commented on an outdated diff Jun 23, 2013

...n/src/main/java/parquet/filter/PagedRecordFilter.java
+ return new PagedRecordFilter( startPos, pageSize );
+ }
+ };
+ }
+
+ /**
+ * Private constructor, use column() instead.
+ */
+ private PagedRecordFilter(long startPos, long pageSize) {
+ this.startPos = startPos;
+ this.endPos = startPos + pageSize;
+ }
+
+ /**
+ * Terminate early when we have got our page. Later we will want a row count and this
+ * will be no good.
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

This comment is a little terse (and I am a little slow.. have pity :)).

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...-column/src/main/java/parquet/io/MessageColumnIO.java
@@ -58,8 +60,19 @@
this,
recordMaterializer,
validating,
- new ColumnReadStoreImpl(columns, recordMaterializer.getRootConverter(), getType())
- );
+ new ColumnReadStoreImpl(columns, recordMaterializer.getRootConverter(), getType()),
+ RecordFilter.NULL_FILTER
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

does this add any overhead? would it be better to special-case the null filter so that we don't have to invoke anything at all?
I suppose it's possible the null filter gets JITed out... would be nice to know for sure.

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

In next push this is moved to FilteredRecordReader so only gets called when there is a filter.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

.../main/java/parquet/io/RecordReaderImplementation.java
@@ -356,6 +361,10 @@ public int compare(Case o1, Case o2) {
Collections.sort(state.definedCases, caseComparator);
Collections.sort(state.undefinedCases, caseComparator);
}
+
+ // We need to make defensive copy to stop interference but as an optimisation don't bother if null
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

what's the interference you are worried about? Seems like this can get pricey, and we'd need some sort of a selectivity rule-of-thumb for when applying filters in Parquet becomes more trouble than it is worth.

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

In the coming push I have moved this to a FilteredRecordReader.

I am allowing people to implement their own filters, so if we presented the column array to the filter for binding, that would allow them to possibly swap the ordering.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...uet-column/src/test/java/parquet/io/TestFiltered.java
+import parquet.example.data.simple.convert.GroupRecordConverter;
+import parquet.io.api.RecordMaterializer;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+import static parquet.example.Paper.r1;
+import static parquet.example.Paper.r2;
+import static parquet.example.Paper.schema;
+import static parquet.filter.AndRecordFilter.and;
+import static parquet.filter.PagedRecordFilter.page;
+import static parquet.filter.ColumnRecordFilter.equalTo;
+import static parquet.filter.ColumnRecordFilter.column;
+
+public class TestFiltered {
+
+ private static final Log LOG = Log.getLog(TestColumnIO.class);
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

wrong class

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Fixed in latest push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013

...adoop/src/main/java/parquet/hadoop/ParquetReader.java
@@ -53,7 +59,7 @@ public ParquetReader(Path file, ReadSupport<T> readSupport) throws IOException {
MessageType schema = fileMetaData.getSchema();
Map<String, String> extraMetadata = fileMetaData.getKeyValueMetaData();
final ReadContext readContext = readSupport.init(conf, extraMetadata, schema);
- reader = new ParquetRecordReader<T>(readSupport);
+ reader = new ParquetRecordReader<T>(readSupport,filter);
@dvryaboy

dvryaboy Jun 23, 2013

Contributor

space after comma

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Fixed in next push.

Contributor

dvryaboy commented Jun 23, 2013

This is very nice, Jacob!

@julienledem julienledem and 2 others commented on an outdated diff Jun 29, 2013

...ain/java/parquet/avro/AvroIndexedRecordConverter.java
@@ -111,7 +113,9 @@ public Converter getConverter(int fieldIndex) {
@Override
public void start() {
- this.currentRecord = new GenericData.Record(avroSchema);
+ // Should do the right thing whether it is generic or specific
+ this.currentRecord =
+ (IndexedRecord) SpecificData.get().newRecord((IndexedRecord)null, avroSchema);
@julienledem

julienledem Jun 29, 2013

Owner

Possibly premature optimization but we could also have 2 subclasses with a different implementation of start().
Just a thought.

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

I am happy to implement separate writers & readers. My personal preference is to have one writer & reader that supports both as Avro does a lot to make the two formats interchangeable and in fact you can mix them.

In terms of performance I will implement a change to slightly speed this up for Generic users by caching the fact that there is no Avro Specific class for this record.

@dvryaboy

dvryaboy Jun 30, 2013

Contributor

we can leave this for another pull, doesn't make sense to block on this I think

@julienledem julienledem commented on the diff Jun 29, 2013

...column/src/main/java/parquet/column/ColumnReader.java
@@ -98,4 +98,14 @@
* @return the current value
*/
double getDouble();
+
+ /**
+ * @return Descriptor of the column.
+ */
+ ColumnDescriptor getDescriptor();
+
+ /**
+ * Skip the current value
+ */
+ void skip();
@julienledem

julienledem Jun 29, 2013

Owner

did you see a use for skip(int n)? (skip n values)

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Since I have to iterate through each skipped record I am only skipping one at a time. But if you feel it's useful I can add it.

@julienledem

julienledem Jul 8, 2013

Owner

Keep it for your next pull request :)

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013

...c/main/java/parquet/column/impl/ColumnReaderImpl.java
* reads the current value
*/
public void readCurrentValue() {
binding.read();
}
- protected void checkValueRead() {
+ /**
+ * Reads the value into the binding or skips forwards.
+ * @param skip If true don;t deserialize just skip forwards
@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Fixed in next push.

@julienledem julienledem commented on the diff Jun 29, 2013

...va/parquet/column/values/plain/PlainValuesReader.java
@@ -46,6 +46,15 @@ public float readFloat() {
}
@Override
+ public void skipFloat() {
@julienledem

julienledem Jun 29, 2013

Owner

It looks like splitting the PlainValuesReader into one class per type would simplify the API.
That way you could just define skip() and not have to specify the type in the layers above.

@JacobMetcalf

JacobMetcalf Jul 1, 2013

Contributor

I would tend to agree, but is this not an extensive change to the ValuesReader interface?

@julienledem

julienledem Jul 8, 2013

Owner

That does not change the interface too much, but it requires the Reader to know what type it is reading.
That way you just add skip() without needing one per type.

@julienledem

julienledem Jul 8, 2013

Owner

I just submitted a pull request that does the split:
#79
That way ValuesReader has just skip() and the individual subclasses of PlainValuesReader implement it in the right way.
ValuesReader still needs a read...() method for each type because Java treats primitive types differently from objects.
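
A sketch of the shape this gives the API (trimmed to the relevant members; the subclass name and method bodies are illustrative):

public abstract class ValuesReader {
  // one typed read method per primitive, since Java cannot generify primitives
  public int readInteger() { throw new UnsupportedOperationException(); }
  public float readFloat() { throw new UnsupportedOperationException(); }
  // ...

  // a single untyped skip: each subclass knows the width of its own values
  public abstract void skip();
}

class PlainFloatValuesReader extends ValuesReader {
  @Override
  public float readFloat() { /* read 4 bytes, little endian */ return 0f; }

  @Override
  public void skip() { /* advance the offset by 4 bytes */ }
}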

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013

...src/main/java/parquet/filter/UnboundRecordFilter.java
@@ -0,0 +1,19 @@
+package parquet.filter;
+
+import com.google.common.base.Predicate;
@julienledem

julienledem Jun 29, 2013

Owner

it does not look like we use this import

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Removed in next push.

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013

.../main/java/parquet/io/RecordReaderImplementation.java
@@ -375,6 +384,12 @@ private RecordConsumer wrap(RecordConsumer recordConsumer) {
*/
@Override
public T read() {
+ // Skip forwards until the filter matches a record
+ if ( !skipToMatch()) {
+ return null;
+ }
@julienledem

julienledem Jun 29, 2013

Owner

It looks like we could have a subclass that adds the skip, to avoid the skip overhead when we're not filtering.

@JacobMetcalf

JacobMetcalf Jun 30, 2013

Contributor

Agreed. Have implemented a separate FilteredRecordReader. It required me to add a couple of extra getters to RecordReaderImplementation, but since this is a package-level class I hope that's okay.
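
Roughly the shape that takes (a sketch only: skipToMatch() and the class names come from the diff hunks in this PR; constructor plumbing is elided):

// FilteredRecordReader keeps all skip logic out of the unfiltered read path.
class FilteredRecordReader<T> extends RecordReaderImplementation<T> {

  // ... constructor binds the caller's UnboundRecordFilter to the columns,
  // yielding a bound RecordFilter ...

  @Override
  public T read() {
    // Skip forward until the filter matches a record; a false return means
    // the input was exhausted before any match was found.
    if (!skipToMatch()) {
      return null;
    }
    return super.read();
  }
}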

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013

...uet-column/src/test/java/parquet/io/TestFiltered.java
+import parquet.example.data.Group;
+import parquet.example.data.GroupWriter;
+import parquet.example.data.simple.convert.GroupRecordConverter;
+import parquet.io.api.RecordMaterializer;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+import static parquet.example.Paper.r1;
+import static parquet.example.Paper.r2;
+import static parquet.example.Paper.schema;
+import static parquet.filter.AndRecordFilter.and;
+import static parquet.filter.PagedRecordFilter.page;
+import static parquet.filter.ColumnRecordFilter.equalTo;
+import static parquet.filter.ColumnRecordFilter.column;
+
+public class TestFiltered {
@julienledem

julienledem Jun 29, 2013

Owner

you can use "mvn cobertura:cobertura" to check your test coverage if you'd like.

@JacobMetcalf

JacobMetcalf Jul 1, 2013

Contributor

Have run coverage. One weakness is that I test all the different filter predicate types in parquet-avro rather than parquet-column. This was purely because I have more complex schemas there, whereas a lot of the tests in parquet-column just use the Dremel example schema in Paper.java. If you want, I am happy to add more testing in parquet-column.

Owner

julienledem commented Jun 29, 2013

Great work Jacob.
Thanks

@dvryaboy dvryaboy commented on the diff Jun 30, 2013

.../src/main/java/parquet/filter/ColumnRecordFilter.java
@@ -0,0 +1,96 @@
+package parquet.filter;
@dvryaboy

dvryaboy Jun 30, 2013

Contributor

please add APL header

@JacobMetcalf

JacobMetcalf Jul 6, 2013

Contributor

Done.

@dvryaboy dvryaboy commented on the diff Jun 30, 2013

...umn/src/main/java/parquet/filter/AndRecordFilter.java
@@ -0,0 +1,51 @@
+package parquet.filter;
@dvryaboy

dvryaboy Jun 30, 2013

Contributor

please add APL header... actually now that I look, we need to do that for all new files -- I think @julienledem has a helper task that does this automatically?

@JacobMetcalf

JacobMetcalf Jul 1, 2013

Contributor

Let me know the task - will run and push again.

@julienledem

julienledem Jul 1, 2013

Owner

mvn license:format

@JacobMetcalf

JacobMetcalf Jul 6, 2013

Contributor

Done. It modified a few other files apart from mine but I figured that's what you want.

@julienledem

julienledem Jul 8, 2013

Owner

yep, those had been missed in a previous PR

Contributor

JacobMetcalf commented Jul 1, 2013

Have tried to incorporate all your comments. Let me know what you think.

Contributor

JacobMetcalf commented Jul 6, 2013

Have updated for the above comments, included APL headers and caught up with master. Let me know if there is anything else needed to pull this into master.

@julienledem julienledem commented on the diff Jul 8, 2013

...ain/java/parquet/avro/AvroIndexedRecordConverter.java
@@ -239,6 +249,26 @@ final public void addBinary(Binary value) {
}
+ static final class FieldEnumConverter extends PrimitiveConverter {
+
+ private final ParentValueContainer parent;
+ private final Class<? extends Enum> enumClass;
+
+ public FieldEnumConverter(ParentValueContainer parent, Schema enumSchema) {
+ this.parent = parent;
+ this.enumClass = SpecificData.get().getClass(enumSchema);
+ }
+
+ @Override
+ final public void addBinary(Binary value) {
+ Object enumValue = value.toStringUsingUTF8();
+ if (enumClass != null) {
@julienledem

julienledem Jul 8, 2013

Owner

in what case is enumClass null? Please add a comment about this.

@julienledem julienledem commented on the diff Jul 8, 2013

...o/src/main/java/parquet/avro/AvroSchemaConverter.java
}
throw new UnsupportedOperationException("Cannot convert Avro type " + type);
}
+ private Type convertUnion(String fieldName, Schema schema, Type.Repetition repetition) {
+ List<Schema> nonNullSchemas = new ArrayList(schema.getTypes().size());
+ for (Schema childSchema : schema.getTypes()) {
+ if (childSchema.getType().equals(Schema.Type.NULL)) {
+ repetition = Type.Repetition.OPTIONAL;
+ } else {
+ nonNullSchemas.add(childSchema);
+ }
+ }
+ // If we only get a null and one other type then its a simple optional field
+ // otherwise construct a union container
+ switch (nonNullSchemas.size()) {
+ case 0:
+ throw new UnsupportedOperationException("Cannot convert Avro union of only nulls");
@julienledem

julienledem Jul 8, 2013

Owner

We could probably support that too by just skipping the field. That can wait if you don't feel like doing it.

@julienledem julienledem commented on the diff Jul 8, 2013

...c/test/java/parquet/avro/TestAvroSchemaConverter.java
- new AvroSchemaConverter().convert(schema);
+ // Avro union is modelled using optional data members of the different types;
+ testConversion(
+ schema,
+ "message record2 {\n" +
+ " optional group myunion {\n" +
@julienledem

julienledem Jul 8, 2013

Owner

isn't the union itself required in that case?
(that would save one definition level)

@julienledem julienledem commented on the diff Jul 8, 2013

...mn/src/main/java/parquet/io/FilteredRecordReader.java
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package parquet.io;
+
+import parquet.column.ColumnReader;
+import parquet.column.impl.ColumnReadStoreImpl;
+import parquet.filter.RecordFilter;
+import parquet.filter.UnboundRecordFilter;
+import parquet.io.api.RecordMaterializer;
+
+import java.util.Arrays;
+
+/**
+ * Extends the
@julienledem

julienledem Jul 8, 2013

Owner

missing end of sentence

Owner

julienledem commented Jul 8, 2013

Great contribution @JacobMetcalf !

Contributor

dvryaboy commented Jul 11, 2013

Folks, what's left to be done to merge this? It looks like some refactoring has diverged it from master.

Owner

julienledem commented Jul 11, 2013

The only missing part is to integrate the change to PlainValuesReader so that ValuesReader has a single skip() method. Also I had a comment about making the parent of the union required.
@JacobMetcalf: I could take a look at integrating master if you want.

Contributor

JacobMetcalf commented Jul 11, 2013

@julienledem Please integrate with master if you have time. If not, I will do it at the weekend.

The comment for FilteredRecordReader should be: "Extends the record reader to add filtering capabilities. The user supplies a filter; if the current record being read does not match the filter, the reader skips to the next record." The comment for FieldEnumConverter would be: "enumClass will be blank in the case of AvroGeneric".

Owner

julienledem commented Jul 13, 2013

I just merged master into @JacobMetcalf's contribution there: #89

Owner

julienledem commented Jul 13, 2013

I just merged. Thank you @JacobMetcalf, I'm looking forward to your next contribution!

Contributor

dvryaboy commented Jul 13, 2013

Thanks @JacobMetcalf !

@julienledem julienledem added a commit that referenced this pull request Oct 2, 2014

rdblue authored and julienledem committed: PARQUET-107: Add option to disable summary metadata.
This adds an option to the commitJob phase of the MR OutputCommitter,
parquet.enable.summary-metadata (default true), that can be used to
disable the summary metadata files generated from the footers of all of
the files produced. This enables more control over when those summary
files are produced and makes it possible to rename MR outputs and then
generate the summaries.

Author: Ryan Blue <rblue@cloudera.com>

Closes #68 from rdblue/PARQUET-107-add-summary-metadata-option and squashes the following commits:

261e5e4 [Ryan Blue] PARQUET-107: Add option to disable summary metadata.
be1222e

The same PARQUET-107 commit was subsequently cherry-picked into cloudera/parquet-mr: pushed by @abayer on Nov 24, 2014 (5e6f922, cherry-picked from 2d63011 and 97961fd73d636aa3a31f1050672adbac7ad43c18) and on Dec 21, 2014 (2d63011), and by @cloudera-hudson on Apr 27, 2015 (70ddf00, with a conflict in static final constants resolved in parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java).