Filtered Reader Implementation, Avro Specific Support #68

Closed · wants to merge 11 commits

3 participants

@JacobMetcalf

As discussed on the groups list, I have implemented:

  • Ability for a filter to be supplied to a ParquetReader, enabling it to pull out only a select set of records based on a predicate, a page specifier, or both (a usage sketch follows this list).

  • To support this, a skip() operation has been added to the Plain and Dictionary readers.

  • Support for Avro Specific, in addition to the current Avro Generic support.

  • Tests for the above.
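
A rough sketch of the reader-side API this adds, based on the tests in this PR (Car is the generated Avro Specific class used in TestSpecificReadWrite; the path argument is assumed to point at a Parquet file written with the matching schema):

    import org.apache.hadoop.fs.Path;
    import parquet.avro.AvroParquetReader;
    import parquet.hadoop.ParquetReader;

    import static parquet.filter.ColumnRecordFilter.column;
    import static parquet.filter.ColumnRecordFilter.equalTo;

    public class FilteredReadExample {
      public static void main(String[] args) throws java.io.IOException {
        Path path = new Path(args[0]);
        // Pull out only the records whose "make" column equals "Volkswagen".
        ParquetReader<Car> reader =
            new AvroParquetReader<Car>(path, column("make", equalTo("Volkswagen")));
        for (Car car = reader.read(); car != null; car = reader.read()) {
          System.out.println(car); // read() returns null once the filtered file is exhausted
        }
      }
    }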

@dvryaboy dvryaboy and 2 others commented on an outdated diff Jun 23, 2013
...vro/src/main/java/parquet/avro/AvroParquetWriter.java
@@ -44,6 +45,23 @@ public AvroParquetWriter(Path file, Schema avroSchema,
compressionCodecName, blockSize, pageSize);
}
+ /** Create a new {@link AvroParquetWriter}.
+ *
+ * @param file
+ * @param avroSchema
+ * @param compressionCodecName
+ * @param blockSize
+ * @param pageSize
+ * @throws IOException
+ */
+ public AvroParquetWriter(Path file, Schema avroSchema,
+ CompressionCodecName compressionCodecName, int blockSize,
+ int pageSize, boolean enableDictionary) throws IOException {
+ super(file, (WriteSupport<T>)
+ new AvroWriteSupport(new AvroSchemaConverter().convert(avroSchema), avroSchema),
+ compressionCodecName, blockSize, pageSize,enableDictionary,false);
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

style nit: spaces after commas

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed

@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

I would also expose the last parameter instead of defaulting it to false.

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

I have two constructors, one with the last parameter and one without. Is that what you mean?

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...-avro/src/main/java/parquet/avro/AvroReadSupport.java
@@ -18,6 +18,7 @@
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

Here and in a few other files you converted from using GenericRecord to IndexedRecord, this import becomes unnecessary. Also please fix javadocs where appropriate (including at the top of this file)

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed in coming push.

@dvryaboy dvryaboy commented on the diff Jun 23, 2013
...c/main/java/parquet/avro/AvroParquetOutputFormat.java
@@ -25,7 +26,7 @@
/**
* A Hadoop {@link org.apache.hadoop.mapreduce.OutputFormat} for Parquet files.
*/
-public class AvroParquetOutputFormat extends ParquetOutputFormat<GenericRecord> {
+public class AvroParquetOutputFormat extends ParquetOutputFormat<IndexedRecord> {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

Briefly looking at the Avro APIs (I am not fluent with Avro), this seems to be a change to a less specific interface -- allow indexing by index, and do not require (but still allow, the user would have to cast) indexing by name. Correct?
Is that likely to mess up existing users? Should we provide some convenience method that does the casting for them so they don't have to deal with it? I don't feel strongly one way or another, just want to hear your and other Avro people's thoughts.

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Basically Avro Generic provides serialization for dynamic data structures. Avro Specific, on the other hand, generates typed Java (and other language) objects from the schema at compile time, with getters and setters built into the SerDe. This is similar to Protobuf - I don't know Thrift that well. IndexedRecord is the parent interface of both GenericRecord and SpecificRecord, so by moving the support up a level I aim to cover both communities.

AvroParquetReader is a Java generic, so I assume Avro Generic users would be able to instantiate an AvroParquetReader<GenericRecord> and Avro Specific users an AvroParquetReader<Car>, for example. If you are suggesting I inherit from this to create an AvroGenericParquetReader, we could do this for convenience.

Would be happy to adapt to Tom White or anyone else's input on the matter.
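
To make the distinction concrete, a hedged sketch (the Car class and its setMake method are illustrative stand-ins for any generated Avro Specific class):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.IndexedRecord;

    // Avro Generic: the schema is interpreted at runtime, fields are set by name.
    Schema carSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Car\",\"fields\":"
      + "[{\"name\":\"make\",\"type\":\"string\"}]}");
    GenericRecord generic = new GenericData.Record(carSchema);
    generic.put("make", "Volkswagen");

    // Avro Specific: Car is generated from the same schema at compile time,
    // so fields have typed getters and setters.
    Car specific = new Car();
    specific.setMake("Volkswagen");

    // Both implement IndexedRecord, which accesses fields by position, so a
    // reader/writer bound to IndexedRecord covers both styles.
    IndexedRecord byIndex = generic;
    Object make = byIndex.get(carSchema.getField("make").pos());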

@dvryaboy
dvryaboy added a line comment Jun 30, 2013

@tomwhite can you comment?

@JacobMetcalf
JacobMetcalf added a line comment Jul 1, 2013

@tomwhite also note I have introduced support for unions with multiple types. Would appreciate some feedback on this.

@tomwhite
tomwhite added a line comment Jul 2, 2013

Looks good. Can you add a test to TestAvroSchemaConverter for this? I noticed that you removed the test for a failure, but having a positive test would be good.

@JacobMetcalf
JacobMetcalf added a line comment Jul 6, 2013

Good point - added. Note that since there is no underlying representation of unions in parquet I have modelled this as separate "memberN" sub-records. This results in a sparsely populated column set but is about the only option I see.

We use unions heavily to replicate inheritance - so I would draw parallels between what I am doing here and a one table per class hierarchy in an ORM.

@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

@JacobMetcalf Using a group with one field per type in the union (minus null) sounds good to me. Only one field being non-null. Parquet stores nulls efficiently so sparse columns are not a problem.
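
For concreteness, a sketch of the mapping being agreed here (field names follow the "memberN" convention described above; the branch types are illustrative, with the converter mapping Avro strings to binary):

    // An Avro union with more than one non-null branch, e.g.
    //   ["null", "string", "int"]
    // becomes an optional Parquet group with one optional field per non-null
    // branch; for any given record at most one member is populated:
    optional group myunion {
      optional binary member0;   // the string branch
      optional int32 member1;    // the int branch
    }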

@dvryaboy dvryaboy commented on the diff Jun 23, 2013
...vro/src/main/java/parquet/avro/AvroParquetWriter.java
@@ -25,7 +26,7 @@
/**
* Write Avro records to a Parquet file.
*/
-public class AvroParquetWriter<T> extends ParquetWriter<T> {
+public class AvroParquetWriter<T extends IndexedRecord> extends ParquetWriter<T> {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

does this limit where this Writer is applicable?

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Avro Generic and Specific records extend IndexedRecord. If you look at the original version of AvroWriteSupport, its write method was expecting a GenericRecord, so I believe this (a) expands the range of possibilities and (b) is more explicit about what is actually supported.

Would appreciate Tom White's views on this however.

@tomwhite
tomwhite added a line comment Jul 2, 2013

Changing from GenericRecord to IndexedRecord is an improvement - nice work Jacob! +1

It would be nice to support Avro Reflect, but that can be done separately.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...src/test/java/parquet/avro/TestSpecificReadWrite.java
+import org.junit.Test;
+import parquet.hadoop.ParquetReader;
+import parquet.hadoop.ParquetWriter;
+import parquet.hadoop.metadata.CompressionCodecName;
+
+import java.io.File;
+import java.io.IOException;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+import static parquet.filter.ColumnRecordFilter.column;
+import static parquet.filter.ColumnRecordFilter.equalTo;
+import static parquet.hadoop.ParquetWriter.DEFAULT_BLOCK_SIZE;
+import static parquet.hadoop.ParquetWriter.DEFAULT_PAGE_SIZE;
+
+public class TestSpecificReadWrite {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

could you add some comments explaining what "specific" means in this case and what you are testing? (are other tests testing a general read write?)

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed in coming push. Short explanation for Specific vs Generic above.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...vro/src/main/java/parquet/avro/AvroParquetWriter.java
@@ -44,6 +45,23 @@ public AvroParquetWriter(Path file, Schema avroSchema,
compressionCodecName, blockSize, pageSize);
}
+ /** Create a new {@link AvroParquetWriter}.
+ *
+ * @param file
+ * @param avroSchema
+ * @param compressionCodecName
+ * @param blockSize
+ * @param pageSize
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

missing the enableDictionary param and what that does

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...src/test/java/parquet/avro/TestSpecificReadWrite.java
+ }
+
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
+ File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
+ tmp.deleteOnExit();
+ tmp.delete();
+ Path path = new Path(tmp.getPath());
+
+ Car vwPolo = getVwPolo();
+ Car vwPassat = getVwPassat();
+ Car bmwMini = getBmwMini();
+
+ ParquetWriter<Car> writer = new AvroParquetWriter<Car>(path, Car.SCHEMA$, compression,
+ DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE,enableDictionary);
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

space after comma :)

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...src/test/java/parquet/avro/TestSpecificReadWrite.java
+
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
+ File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
+ tmp.deleteOnExit();
+ tmp.delete();
+ Path path = new Path(tmp.getPath());
+
+ Car vwPolo = getVwPolo();
+ Car vwPassat = getVwPassat();
+ Car bmwMini = getBmwMini();
+
+ ParquetWriter<Car> writer = new AvroParquetWriter<Car>(path, Car.SCHEMA$, compression,
+ DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE,enableDictionary);
+ for ( int i =0; i < num; i++ ) {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

space after =

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...src/test/java/parquet/avro/TestSpecificReadWrite.java
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
+ File tmp = File.createTempFile(getClass().getSimpleName(), ".tmp");
+ tmp.deleteOnExit();
+ tmp.delete();
+ Path path = new Path(tmp.getPath());
+
+ Car vwPolo = getVwPolo();
+ Car vwPassat = getVwPassat();
+ Car bmwMini = getBmwMini();
+
+ ParquetWriter<Car> writer = new AvroParquetWriter<Car>(path, Car.SCHEMA$, compression,
+ DEFAULT_BLOCK_SIZE, DEFAULT_PAGE_SIZE,enableDictionary);
+ for ( int i =0; i < num; i++ ) {
+ if (varyYear ) {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

no space before closed paren (or space after open paren)

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Fixed in coming push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...src/test/java/parquet/avro/TestSpecificReadWrite.java
+ @Test
+ public void testFilterMatchesMultiple() throws IOException {
+
+ Path path = writeCarsToParquetFile(10, false, CompressionCodecName.UNCOMPRESSED, false);
+
+ ParquetReader<Car> reader = new AvroParquetReader<Car>(path, column("make", equalTo("Volkswagen")));
+ for ( int i =0; i < 10; i++ ) {
+ assertEquals(reader.read().toString(), getVwPolo().toString());
+ assertEquals(reader.read().toString(), getVwPassat().toString());
+ }
+ assertNull( reader.read());
+ }
+
+
+ private Path writeCarsToParquetFile( int num, boolean varyYear,
+ CompressionCodecName compression, boolean enableDictionary) throws IOException {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

you never call this with enableDictionary = true. Is there a reason you parametrized this? Should that be tested as well? Same for varyYear

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Have added a test for dictionary encoding. varyYear was for a perf test - removed.

@dvryaboy dvryaboy and 2 others commented on an outdated diff Jun 23, 2013
parquet-column/pom.xml
@@ -25,6 +25,11 @@
<version>1.7</version>
<scope>compile</scope>
</dependency>
+ <dependency>
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

(edited after reading more of the code) we should use maven shade to namespace guava.
(edited after reading more of the code) we should use maven shade to namespace guava.
(edited after thinking about it) can't use maven shade to namespace guava, since this is used in a user-facing API. Maybe we can pull Predicate and friends out of Guava, and either include them in their proper namespace under com.google (con: risk breaking ourselves or a user if Guava introduces a breaking change to Predicates; pro: existing 3rd-party implementations of Predicates work) or under a new namespace (swap the previous pros and cons -- no breakage, but requires new Predicates, which may be weird to explain).

@JacobMetcalf
JacobMetcalf added a line comment Jun 29, 2013

Not quite sure what you wanted me to do here. I did see the different versions of Guava and wondered.

@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

We want to minimize the dependencies of this component. Pig, Hive, etc. depend on guava too, and guava does have regular incompatible releases. Dependency conflicts quickly become problematic.
We either use shading (see thrift, jackson) to remove the dependency conflict or re-implement simple utilities (see Preconditions).

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Agreed - you are working with a lot of libraries here, so minimal dependencies are wanted. I have re-implemented and replaced a few other guava uses with core libraries. My only performance concern was replacing the use of Guava's Splitter with String.split().

@dvryaboy
dvryaboy added a line comment Jun 30, 2013

I think we can lift Guava's Splitter verbatim and put it into our own namespace (the license allows this, we just need to credit them appropriately).
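
To illustrate the performance concern: String.split() takes a regex and may recompile it on every call, while a Splitter-style utility just scans for a literal character. A minimal stand-in under that assumption (hypothetical class, not the Guava source being discussed):

    import java.util.ArrayList;
    import java.util.List;

    public final class SimpleSplitter {
      private SimpleSplitter() {}

      /** Split on a literal character without any regex machinery. */
      public static List<String> split(String input, char delimiter) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i < input.length(); i++) {
          if (input.charAt(i) == delimiter) {
            parts.add(input.substring(start, i));
            start = i + 1;
          }
        }
        parts.add(input.substring(start)); // trailing segment (possibly empty)
        return parts;
      }
    }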

@dvryaboy dvryaboy commented on the diff Jun 23, 2013
...c/main/java/parquet/column/impl/ColumnReaderImpl.java
* reads the current value
*/
public void readCurrentValue() {
binding.read();
}
- protected void checkValueRead() {
+ /**
+ * Reads the value into the binding or skips forwards.
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

This is a little confusing, since checkValueRead doesn't scream to me "this will actually do something to the stream" (that was a problem before you got here, too). Maybe something like "readIfPossible(boolean skipDeserialization)" would be more clear?

@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

agreed, we should now rename this method

@JacobMetcalf
JacobMetcalf added a line comment Jul 6, 2013

Done

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
.../src/main/java/parquet/filter/ColumnRecordFilter.java
+
+ return ( filterOnColumn.isFullyConsumed()) ? false : filterPredicate.apply( filterOnColumn );
+ }
+
+ /**
+ * @return true if the column we are filtering on has no more values.
+ */
+ @Override
+ public boolean isFullyConsumed() {
+ return filterOnColumn.isFullyConsumed();
+ }
+
+ /**
+ * Predicate for string equality
+ */
+ public static final Predicate<ColumnReader> equalTo( final String value ) {
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

I'm not entirely sure why we only have equalTo for 2 specific types... is the idea to fill out a collection of useful predicates? Should we in that case break the provided predicates into their own class?

Not certain I have a total picture of how everything hooks together, but can this break if dictionary encoding is in use? If this is applied after dictionary decoding, that's fine, but we should create an issue to add the ability to apply dictionary filters before decoding for further gains. We should do that for both equality and regexes.

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Yes. It was to provide a much simpler way of setting up predicates on specific column types. I have now provided equalTo for all value types. Perhaps in a future push I can add other kinds of predicates, or it could be left to the caller.

I have added a test to check equality in the presence of dictionary encoding.
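
A hedged sketch of what one of these predicates looks like against the signature in the diff above (assuming ColumnReader exposes the current value via getBinary(), consistent with the getDouble()/read...() family seen elsewhere in this PR):

    import com.google.common.base.Predicate;
    import parquet.column.ColumnReader;

    /** String-equality predicate over the current value of the bound column. */
    public static Predicate<ColumnReader> equalTo(final String value) {
      return new Predicate<ColumnReader>() {
        @Override
        public boolean apply(ColumnReader input) {
          // Compares the column's current (already dictionary-decoded) value.
          return input.getBinary().toStringUsingUTF8().equals(value);
        }
      };
    }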

@dvryaboy dvryaboy commented on an outdated diff Jun 23, 2013
...n/src/main/java/parquet/filter/PagedRecordFilter.java
+ return new PagedRecordFilter( startPos, pageSize );
+ }
+ };
+ }
+
+ /**
+ * Private constructor, use column() instead.
+ */
+ private PagedRecordFilter(long startPos, long pageSize) {
+ this.startPos = startPos;
+ this.endPos = startPos + pageSize;
+ }
+
+ /**
+ * Terminate early when we have got our page. Later we will want a row count and this
+ * will be no good.
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

This comment is a little terse (and I am a little slow... have pity :)).
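
A usage sketch of the paging filter, composed with a column predicate (factory names come from the static imports in TestFiltered further down; that the composition yields an UnboundRecordFilter is an assumption):

    import parquet.filter.UnboundRecordFilter;

    import static parquet.filter.AndRecordFilter.and;
    import static parquet.filter.ColumnRecordFilter.column;
    import static parquet.filter.ColumnRecordFilter.equalTo;
    import static parquet.filter.PagedRecordFilter.page;

    // Keep 20 records starting at position 100, among those matching the predicate.
    UnboundRecordFilter filter =
        and(column("make", equalTo("Volkswagen")), page(100, 20));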

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...-column/src/main/java/parquet/io/MessageColumnIO.java
@@ -58,8 +60,19 @@
this,
recordMaterializer,
validating,
- new ColumnReadStoreImpl(columns, recordMaterializer.getRootConverter(), getType())
- );
+ new ColumnReadStoreImpl(columns, recordMaterializer.getRootConverter(), getType()),
+ RecordFilter.NULL_FILTER
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

does this add any overhead? would it be better to special-case the null filter so that we don't have to invoke anything at all?
I suppose it's possible the null filter gets JITed out... would be nice to know for sure.

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

In next push this is moved to FilteredRecordReader so only gets called when there is a filter.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
.../main/java/parquet/io/RecordReaderImplementation.java
@@ -356,6 +361,10 @@ public int compare(Case o1, Case o2) {
Collections.sort(state.definedCases, caseComparator);
Collections.sort(state.undefinedCases, caseComparator);
}
+
+ // We need to make defensive copy to stop interference but as an optimisation don't bother if null
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

what's the interference you are worried about? Seems like this can get pricey, and we'd need some sort of a selectivity rule-of-thumb for when applying filters in Parquet becomes more trouble than it is worth.

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

In the coming push I have moved this to a FilteredRecordReader.

I am allowing people to implement their own filters, so if we presented the column array to the filter for binding without making a defensive copy, a filter implementation could possibly swap the ordering.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...uet-column/src/test/java/parquet/io/TestFiltered.java
+import parquet.example.data.simple.convert.GroupRecordConverter;
+import parquet.io.api.RecordMaterializer;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+import static parquet.example.Paper.r1;
+import static parquet.example.Paper.r2;
+import static parquet.example.Paper.schema;
+import static parquet.filter.AndRecordFilter.and;
+import static parquet.filter.PagedRecordFilter.page;
+import static parquet.filter.ColumnRecordFilter.equalTo;
+import static parquet.filter.ColumnRecordFilter.column;
+
+public class TestFiltered {
+
+ private static final Log LOG = Log.getLog(TestColumnIO.class);
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

wrong class

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Fixed in latest push.

@dvryaboy dvryaboy and 1 other commented on an outdated diff Jun 23, 2013
...adoop/src/main/java/parquet/hadoop/ParquetReader.java
@@ -53,7 +59,7 @@ public ParquetReader(Path file, ReadSupport<T> readSupport) throws IOException {
MessageType schema = fileMetaData.getSchema();
Map<String, String> extraMetadata = fileMetaData.getKeyValueMetaData();
final ReadContext readContext = readSupport.init(conf, extraMetadata, schema);
- reader = new ParquetRecordReader<T>(readSupport);
+ reader = new ParquetRecordReader<T>(readSupport,filter);
@dvryaboy
dvryaboy added a line comment Jun 23, 2013

space after comma

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Fixed in next push.

@dvryaboy

This is very nice, Jacob!

@julienledem julienledem and 2 others commented on an outdated diff Jun 29, 2013
...ain/java/parquet/avro/AvroIndexedRecordConverter.java
@@ -111,7 +113,9 @@ public Converter getConverter(int fieldIndex) {
@Override
public void start() {
- this.currentRecord = new GenericData.Record(avroSchema);
+ // Should do the right thing whether it is generic or specific
+ this.currentRecord =
+ (IndexedRecord) SpecificData.get().newRecord((IndexedRecord)null, avroSchema);
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

Possibly premature optimization but we could also have 2 subclasses with a different implementation of start().
Just a thought.

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

I am happy to implement separate writers & readers. My personal preference is to have one writer & reader that supports both, as Avro does a lot to make the two formats interchangeable, and in fact you can mix them.

In terms of performance I will implement a change to slightly speed this up for Generic users by caching the fact that there is no Avro Specific class for this record.
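
A hedged sketch of that caching (field and method names are illustrative; only SpecificData.get().getClass(...)/newRecord(...) and GenericData.Record come from the diffs in this thread):

    // Look up the generated class once per converter instead of per record;
    // getClass(...) returns null when no Avro Specific class exists.
    private Class<?> specificClass;
    private boolean lookedUp = false;

    private IndexedRecord newRecord(Schema avroSchema) {
      if (!lookedUp) {
        specificClass = SpecificData.get().getClass(avroSchema);
        lookedUp = true;
      }
      if (specificClass == null) {
        return new GenericData.Record(avroSchema);        // Avro Generic path
      }
      try {
        return (IndexedRecord) specificClass.newInstance(); // Avro Specific path
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }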

@dvryaboy
dvryaboy added a line comment Jun 30, 2013

we can leave this for another pull, doesn't make sense to block on this i think

@julienledem julienledem commented on the diff Jun 29, 2013
...column/src/main/java/parquet/column/ColumnReader.java
@@ -98,4 +98,14 @@
* @return the current value
*/
double getDouble();
+
+ /**
+ * @return Descriptor of the column.
+ */
+ ColumnDescriptor getDescriptor();
+
+ /**
+ * Skip the current value
+ */
+ void skip();
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

did you see a use for skip(int n)? (skip n values)

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Since I have to iterate through each skipped record I am only skipping one at a time. But if you feel it's useful I can add it.

@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

Keep it for your next pull request :)

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013
...c/main/java/parquet/column/impl/ColumnReaderImpl.java
* reads the current value
*/
public void readCurrentValue() {
binding.read();
}
- protected void checkValueRead() {
+ /**
+ * Reads the value into the binding or skips forwards.
+ * @param skip If true don;t deserialize just skip forwards
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

don't

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Fixed in next push.

@julienledem julienledem commented on the diff Jun 29, 2013
...va/parquet/column/values/plain/PlainValuesReader.java
@@ -46,6 +46,15 @@ public float readFloat() {
}
@Override
+ public void skipFloat() {
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

It looks like splitting the PlainValuesReader into one class per type would simplify the API.
That way you could just define skip() and not have to specify the type in the layers above.

@JacobMetcalf
JacobMetcalf added a line comment Jul 1, 2013

I would tend to agree, but is this not an extensive change to the ValuesReader interface?

@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

That does not change the interface too much, but it requires the Reader to know what type it is reading.
That way you just add skip() without needing one per type.

@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

I just submitted a pull request that does the split:
#79
That way ValuesReader has just skip() and the individual subclasses of PlainValuesReader implement it in the right way.
ValuesReader still needs a read...() method for each type because Java treats primitive types differently from objects.
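
A hedged sketch of the shape after the split (illustrative names, not the #79 code): each plain reader knows its fixed value width, so skip() is just an offset bump and no per-type skipFloat()/skipDouble() methods are needed in the layers above.

    // Fixed-width PLAIN doubles: reading decodes 8 bytes, skipping just moves on.
    class PlainDoubleValuesReaderSketch {
      private byte[] page;   // page bytes, set when the page is loaded
      private int offset;

      public double readDouble() {
        long bits = 0;
        for (int i = 7; i >= 0; i--) {                 // little-endian 8-byte value
          bits = (bits << 8) | (page[offset + i] & 0xffL);
        }
        offset += 8;
        return Double.longBitsToDouble(bits);
      }

      public void skip() {
        offset += 8;  // no deserialization needed to pass over a value
      }
    }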

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013
...src/main/java/parquet/filter/UnboundRecordFilter.java
@@ -0,0 +1,19 @@
+package parquet.filter;
+
+import com.google.common.base.Predicate;
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

it does not look like we use this import

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Removed in next push.

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013
.../main/java/parquet/io/RecordReaderImplementation.java
@@ -375,6 +384,12 @@ private RecordConsumer wrap(RecordConsumer recordConsumer) {
*/
@Override
public T read() {
+ // Skip forwards until the filter matches a record
+ if ( !skipToMatch()) {
+ return null;
+ }
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

It looks like we could have a subclass that adds the skip, to avoid the skip overhead when we're not filtering.

@JacobMetcalf
JacobMetcalf added a line comment Jun 30, 2013

Agreed. Have implemented a separate FilteredRecordReader. It required me to add a couple of extra getters to RecordReaderImplementation, but since this is a package-level class I hope that's okay.
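
A self-contained model of that split (only isFullyConsumed() and the skip-until-match loop come from the diffs in this thread; every other name is illustrative):

    // The filtering behaviour lives in a subclass, so the plain reader pays
    // no per-record check when no filter is supplied.
    interface RecordFilterSketch {
      boolean isMatch();          // does the current record satisfy the filter?
      boolean isFullyConsumed();  // has the filtered column run out of values?
    }

    abstract class RecordReaderSketch<T> {
      abstract T assembleCurrentRecord();
      abstract void skipCurrentRecord();  // drives ColumnReader.skip() underneath

      public T read() {
        return assembleCurrentRecord();   // unfiltered path: no checks at all
      }
    }

    abstract class FilteredRecordReaderSketch<T> extends RecordReaderSketch<T> {
      private final RecordFilterSketch filter;

      FilteredRecordReaderSketch(RecordFilterSketch filter) {
        this.filter = filter;
      }

      @Override
      public T read() {
        while (!filter.isMatch()) {       // skip forwards until a record matches
          if (filter.isFullyConsumed()) {
            return null;                  // nothing left that could match
          }
          skipCurrentRecord();
        }
        return assembleCurrentRecord();
      }
    }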

@julienledem julienledem and 1 other commented on an outdated diff Jun 29, 2013
...uet-column/src/test/java/parquet/io/TestFiltered.java
+import parquet.example.data.Group;
+import parquet.example.data.GroupWriter;
+import parquet.example.data.simple.convert.GroupRecordConverter;
+import parquet.io.api.RecordMaterializer;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNull;
+import static parquet.example.Paper.r1;
+import static parquet.example.Paper.r2;
+import static parquet.example.Paper.schema;
+import static parquet.filter.AndRecordFilter.and;
+import static parquet.filter.PagedRecordFilter.page;
+import static parquet.filter.ColumnRecordFilter.equalTo;
+import static parquet.filter.ColumnRecordFilter.column;
+
+public class TestFiltered {
@julienledem
Apache Parquet member
julienledem added a line comment Jun 29, 2013

you can use "mvn cobertura:cobertura" to check your test coverage if you'd like.

@JacobMetcalf
JacobMetcalf added a line comment Jul 1, 2013

Have run coverage. One weakness is that I test all the different filter predicate types in parquet-avro rather than parquet-column. This was purely because I have more complex schemas there, whereas a lot of the tests in parquet-column just use the Dremel example schema in Paper.java. If you want, I am happy to add more testing to parquet-column.

@julienledem
Apache Parquet member

Great work Jacob.
Thanks

@dvryaboy dvryaboy commented on the diff Jun 30, 2013
.../src/main/java/parquet/filter/ColumnRecordFilter.java
@@ -0,0 +1,96 @@
+package parquet.filter;
@dvryaboy
dvryaboy added a line comment Jun 30, 2013

please add APL header

@JacobMetcalf
JacobMetcalf added a line comment Jul 6, 2013

Done.

@dvryaboy dvryaboy commented on the diff Jun 30, 2013
...umn/src/main/java/parquet/filter/AndRecordFilter.java
@@ -0,0 +1,51 @@
+package parquet.filter;
@dvryaboy
dvryaboy added a line comment Jun 30, 2013

please add APL header... actually now that I look, we need to do that for all new files -- I think @julienledem has a helper task that does this automatically?

@JacobMetcalf
JacobMetcalf added a line comment Jul 1, 2013

Let me know the task - will run and push again.

@julienledem
Apache Parquet member
julienledem added a line comment Jul 1, 2013

mvn license:format

@JacobMetcalf
JacobMetcalf added a line comment Jul 6, 2013

Done. It modified a few other files apart from mine but I figured that's what you want.

@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

yep, those had been missed in a previous PR

@JacobMetcalf

Have tried to incorporate all your comments. Let me know what you think.

@JacobMetcalf

Have updated for the above comments, included APL headers and caught up with master. Let me know if there is anything else needed to pull this into master.

@julienledem julienledem commented on the diff Jul 8, 2013
...ain/java/parquet/avro/AvroIndexedRecordConverter.java
@@ -239,6 +249,26 @@ final public void addBinary(Binary value) {
}
+ static final class FieldEnumConverter extends PrimitiveConverter {
+
+ private final ParentValueContainer parent;
+ private final Class<? extends Enum> enumClass;
+
+ public FieldEnumConverter(ParentValueContainer parent, Schema enumSchema) {
+ this.parent = parent;
+ this.enumClass = SpecificData.get().getClass(enumSchema);
+ }
+
+ @Override
+ final public void addBinary(Binary value) {
+ Object enumValue = value.toStringUsingUTF8();
+ if (enumClass != null) {
@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

In what case is enumClass null?
Please add a comment about this.

@julienledem julienledem commented on the diff Jul 8, 2013
...o/src/main/java/parquet/avro/AvroSchemaConverter.java
}
throw new UnsupportedOperationException("Cannot convert Avro type " + type);
}
+ private Type convertUnion(String fieldName, Schema schema, Type.Repetition repetition) {
+ List<Schema> nonNullSchemas = new ArrayList(schema.getTypes().size());
+ for (Schema childSchema : schema.getTypes()) {
+ if (childSchema.getType().equals(Schema.Type.NULL)) {
+ repetition = Type.Repetition.OPTIONAL;
+ } else {
+ nonNullSchemas.add(childSchema);
+ }
+ }
+ // If we only get a null and one other type then its a simple optional field
+ // otherwise construct a union container
+ switch (nonNullSchemas.size()) {
+ case 0:
+ throw new UnsupportedOperationException("Cannot convert Avro union of only nulls");
@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

We could probably support that too by just skipping the field. That can wait if you don't feel like doing it.

@julienledem julienledem commented on the diff Jul 8, 2013
...c/test/java/parquet/avro/TestAvroSchemaConverter.java
- new AvroSchemaConverter().convert(schema);
+ // Avro union is modelled using optional data members of thw different types;
+ testConversion(
+ schema,
+ "message record2 {\n" +
+ " optional group myunion {\n" +
@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

isn't the union itself required in that case?
(that would save one definition level)

@julienledem julienledem commented on the diff Jul 8, 2013
...mn/src/main/java/parquet/io/FilteredRecordReader.java
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package parquet.io;
+
+import parquet.column.ColumnReader;
+import parquet.column.impl.ColumnReadStoreImpl;
+import parquet.filter.RecordFilter;
+import parquet.filter.UnboundRecordFilter;
+import parquet.io.api.RecordMaterializer;
+
+import java.util.Arrays;
+
+/**
+ * Extends the
@julienledem
Apache Parquet member
julienledem added a line comment Jul 8, 2013

missing end of sentence

@julienledem
Apache Parquet member

Great contribution @JacobMetcalf !

@dvryaboy

Folks, what's left to be done to merge this? Looks like some refactor diverged it from master.

@julienledem
Apache Parquet member

The only missing part is to integrate the change to PlainValuesReader so that ValuesReader has a single skip() method. Also I had a comment about making the parent of the union required.
@JacobMetcalf: I could take a look at integrating master if you want.

@JacobMetcalf

@julienledem Please integrate with master if you have time. If not I will do it at the weekend.

The comment for FilteredRecordReader should be: "Extends the record reader to add filtering capabilities. The user supplies a filter; if the current record being read does not match it, the reader skips to the next record." The comment for FieldEnumConverter would be: "enumClass will be blank in the case of AvroGeneric."

@julienledem
Apache Parquet member

I just merged master into @JacobMetcalf 's contribution there: #89

@julienledem
Apache Parquet member

I just merged. Thank you @JacobMetcalf, I'm looking forward to your next contribution!

@julienledem julienledem added a commit that referenced this pull request Oct 2, 2014
@rdblue rdblue PARQUET-107: Add option to disable summary metadata.
This adds an option to the commitJob phase of the MR OutputCommitter,
parquet.enable.summary-metadata (default true), that can be used to
disable the summary metadata files generated from the footers of all of
the files produced. This enables more control over when those summary
files are produced and makes it possible to rename MR outputs and then
generate the summaries.

Author: Ryan Blue <rblue@cloudera.com>

Closes #68 from rdblue/PARQUET-107-add-summary-metadata-option and squashes the following commits:

261e5e4 [Ryan Blue] PARQUET-107: Add option to disable summary metadata.
be1222e
@abayer abayer pushed a commit to cloudera/parquet-mr that referenced this pull request Nov 24, 2014
@rdblue rdblue PARQUET-107: Add option to disable summary metadata. (cherry picked from commit 2d63011) (cherry picked from commit 97961fd73d636aa3a31f1050672adbac7ad43c18) 5e6f922

@abayer abayer pushed a commit to cloudera/parquet-mr that referenced this pull request Dec 21, 2014
@rdblue rdblue PARQUET-107: Add option to disable summary metadata. 2d63011

@cloudera-hudson cloudera-hudson pushed a commit to cloudera/parquet-mr that referenced this pull request Apr 27, 2015
@rdblue rdblue PARQUET-107: Add option to disable summary metadata. Conflict in static final constants in parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java resolved. 70ddf00