As we have moved to Apache, please open your pull requests on: https://github.com/apache/parquet-mr


Parquet MR

Parquet-MR contains the Java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop that provides efficient storage and encoding of data. Parquet uses the record shredding and assembly algorithm described in the Dremel paper to represent nested structures.
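As a toy illustration of the shredding idea (the schema and all names below are invented for this example and are not Parquet's API): for a column reached through a chain of optional fields, a per-value definition level records how many of those fields were actually present, which is what lets a flat column stream represent nested, nullable data.

```java
// Toy model of Dremel-style definition levels for a path of optional fields.
// For a schema like: optional group a { optional int32 b }, the column a.b
// has max definition level 2. Each value stores how deep its path was
// defined; the value itself is only present at the max level.
public class DefinitionLevels {
    // path holds each optional ancestor's value along the column's path
    // (null = absent). Returns how many levels were defined before the
    // first missing one.
    public static int definitionLevel(Object[] path) {
        int level = 0;
        for (Object o : path) {
            if (o == null) break;
            level++;
        }
        return level;
    }
    // Examples:
    //   a = { b = 7 }    -> definitionLevel({a, 7})       == 2
    //   a = { b = null } -> definitionLevel({a, null})    == 1
    //   a = null         -> definitionLevel({null, null}) == 0
}
```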

You can find some details about the format and intended use cases in our Hadoop Summit 2013 presentation.

Features

Parquet is a very active project, and new features are being added quickly; below is the state as of June 2013.

| Feature | Status | Expected release |
|---|---|---|
| Type-specific encoding | YES | 1.0 |
| Hive integration | YES (28) | 1.0 |
| Pig integration | YES | 1.0 |
| Cascading integration | YES | 1.0 |
| Crunch integration | YES (CRUNCH-277) | 1.0 |
| Impala integration | YES (non-nested) | 1.0 |
| Java Map/Reduce API | YES | 1.0 |
| Native Avro support | YES | 1.0 |
| Native Thrift support | YES | 1.0 |
| Complex structure support | YES | 1.0 |
| Future-proofed versioning | YES | 1.0 |
| RLE | YES | 1.0 |
| Bit packing | YES | 1.0 |
| Adaptive dictionary encoding | YES | 1.0 |
| Predicate pushdown | YES (68) | 1.0 |
| Column stats | YES | 2.0 |
| Delta encoding | YES | 2.0 |
| Native Protocol Buffers support | YES | 1.0 |
| Index pages | YES | 2.0 |
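Some of these encodings are simple to illustrate. Below is a minimal run-length encoding sketch; note that Parquet's actual encoding is a hybrid of RLE and bit packing with a specific byte layout, so this only demonstrates the run-length idea, not the real format.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal run-length encoding sketch (illustrative only; not Parquet's
// actual RLE/bit-packing hybrid).
public class RunLength {
    // Encodes values as (count, value) pairs.
    public static List<int[]> encode(int[] values) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < values.length) {
            int j = i;
            while (j < values.length && values[j] == values[i]) j++;
            runs.add(new int[]{j - i, values[i]});
            i = j;
        }
        return runs;
    }

    // Expands (count, value) pairs back into a flat array.
    public static int[] decode(List<int[]> runs) {
        int n = 0;
        for (int[] r : runs) n += r[0];
        int[] out = new int[n];
        int k = 0;
        for (int[] r : runs)
            for (int c = 0; c < r[0]; c++) out[k++] = r[1];
        return out;
    }
}
```

Long runs of repeated values (common in sorted or low-cardinality columns) collapse to a handful of pairs, which is why columnar layouts compress so well.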

Map/Reduce integration

Input and output formats are provided. Note that to use them, you need to implement a WriteSupport or ReadSupport class, which handles the conversion of your objects to and from a Parquet schema.

We have implemented this for two popular data formats to provide a clean migration path:

Thrift

Thrift integration is provided by the parquet-thrift sub-project. If you are using Thrift through Scala, you may be using Twitter's Scrooge. If that's the case, not to worry -- we took care of the Scrooge/Apache Thrift glue for you in the parquet-scrooge sub-project.

Avro

Avro conversion is implemented via the parquet-avro sub-project.

Create your own objects

  • The ParquetOutputFormat can be given a WriteSupport to write your own objects to an event-based RecordConsumer.
  • The ParquetInputFormat can be given a ReadSupport to materialize your own objects by implementing a RecordMaterializer.

See the WriteSupport, ReadSupport, RecordConsumer, and RecordMaterializer APIs.
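The event-based contract above can be sketched with simplified stand-in types. To be clear, every name in this snippet is an illustrative stand-in, not Parquet's actual WriteSupport/RecordConsumer API: a write-support object walks your record and emits field events to a consumer, which is free to buffer, encode, or render them.

```java
import java.util.Map;

// Simplified stand-ins for the event-based write pattern. In real Parquet
// you would extend the WriteSupport class and emit events to a
// RecordConsumer; these types only mirror the shape of that interaction.
public class EventWriteSketch {
    // Stand-in for a RecordConsumer: receives field-level events.
    interface Consumer {
        void startRecord();
        void field(String name, Object value);
        void endRecord();
    }

    // Stand-in for a WriteSupport: converts one object into events.
    static void write(Map<String, Object> record, Consumer consumer) {
        consumer.startRecord();
        for (Map.Entry<String, Object> e : record.entrySet())
            consumer.field(e.getKey(), e.getValue());
        consumer.endRecord();
    }

    // A Consumer that just renders events to a string, for demonstration.
    static class Printing implements Consumer {
        final StringBuilder out = new StringBuilder();
        public void startRecord() { out.append("{"); }
        public void field(String name, Object value) {
            out.append(name).append("=").append(value).append(";");
        }
        public void endRecord() { out.append("}"); }
    }
}
```

The point of the split is that the format machinery never needs to know your object model; it only sees the event stream.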

Apache Pig integration

A Loader and a Storer are provided to read and write Parquet files with Apache Pig.

Storing data into Parquet in Pig is simple:

```
-- options you might want to tune
SET parquet.page.size 1048576;     -- default; this is your min read/write unit
SET parquet.block.size 134217728;  -- default; your memory budget for buffering data
SET parquet.compression lzo;       -- or you can use none, gzip, snappy
STORE mydata INTO '/some/path' USING parquet.pig.ParquetStorer;
```

Reading in Pig is also simple:

```
mydata = LOAD '/some/path' USING parquet.pig.ParquetLoader();
```

If the data was stored using Pig, things will "just work". If the data was stored using another method, you will need to provide the Pig schema equivalent to the data you stored (you can also write the schema to the file footer while writing it -- but that's pretty advanced). We will provide a basic automatic schema conversion soon.

Hive integration

Hive integration is provided via the parquet-hive sub-project.

Build

To run the unit tests: `mvn test`

To build the jars: `mvn package`

The build runs in Travis CI.

Add Parquet as a dependency in Maven

Snapshot releases

```xml
  <repositories>
    <repository>
      <id>sonatype-nexus-snapshots</id>
      <url>https://oss.sonatype.org/content/repositories/snapshots</url>
      <releases>
        <enabled>false</enabled>
      </releases>
      <snapshots>
        <enabled>true</enabled>
      </snapshots>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-common</artifactId>
      <version>1.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-encoding</artifactId>
      <version>1.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.0.0-SNAPSHOT</version>
    </dependency>
  </dependencies>
```

Official releases

1.0.0

```xml
  <dependencies>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-common</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-encoding</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-column</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>com.twitter</groupId>
      <artifactId>parquet-hadoop</artifactId>
      <version>1.0.0</version>
    </dependency>
  </dependencies>
```

How To Contribute

If you are looking for some ideas on what to contribute, check out GitHub issues for this project labeled "Pick me up!". Comment on the issue and/or contact the parquet-dev group with your questions and ideas.

We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:

  • Please make sure to add the license headers to all new files. You can do this automatically with `mvn license:format`.
  • Use 2 spaces for whitespace. Not tabs, not 4 spaces. The number of the spacing shall be 2.
  • Give your operators some room. Not a+b but a + b and not foo(int a,int b) but foo(int a, int b).
  • Generally speaking, stick to the Sun Java Code Conventions.
  • Make sure tests pass!

Authors and contributors

Code of Conduct

We hold ourselves and the Parquet developer community to a code of conduct as described by Twitter OSS: https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md.

Discussions

License

Copyright 2012-2013 Twitter, Inc.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0