Commits on Mar 4, 2015
  1. PARQUET-187: Replace JavaConversions.asJavaList with JavaConversions.seqAsJavaList
    
    The former was removed in 2.11, but the latter exists in 2.9, 2.10 and 2.11. With this change, I can build on 2.11 without any issue.
    
    Author: Colin Marc <colinmarc@gmail.com>
    
    Closes #121 from colinmarc/build-211 and squashes the following commits:
    
    8a29319 [Colin Marc] Replace JavaConversions.asJavaList with JavaConversions.seqAsJavaList.
    colinmarc committed with rdblue Mar 4, 2015
  2. PARQUET-188: Change column ordering to match the field order.

    This was the behavior before the V2 pages were added.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #129 from rdblue/PARQUET-188-fix-column-metadata-order and squashes the following commits:
    
    3c9fa5d [Ryan Blue] PARQUET-188: Change column ordering to match the field order.
    rdblue committed Mar 4, 2015
  3. PARQUET-192: Fix map null encoding

    This depends on PARQUET-191 for the correct schema representation.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #127 from rdblue/PARQUET-192-fix-map-null-encoding and squashes the following commits:
    
    fffde82 [Ryan Blue] PARQUET-192: Fix parquet-avro maps with null values.
    rdblue committed Mar 4, 2015
  4. PARQUET-191: Fix map Type to Avro Schema conversion.

    Author: Ryan Blue <blue@apache.org>
    
    Closes #126 from rdblue/PARQUET-191-fix-map-value-conversion and squashes the following commits:
    
    33f6bbc [Ryan Blue] PARQUET-191: Fix map Type to Avro Schema conversion.
    rdblue committed Mar 4, 2015
Commits on Feb 26, 2015
  1. PARQUET-190: fix an inconsistent Javadoc comment of ReadSupport.prepareForRead
    
    ReadSupport.prepareForRead does not return RecordConsumer but RecordMaterializer
    
    Author: choplin <choplin.choplin@gmail.com>
    
    Closes #125 from choplin/fix-javadoc-comment and squashes the following commits:
    
    c3574f3 [choplin] fix an inconsistent Javadoc comment of ReadSupport.prepareForRead
    choplin committed with rdblue Feb 26, 2015
Commits on Feb 10, 2015
  1. PARQUET-164: Add warning when scaling row group sizes.

    Author: Ryan Blue <blue@apache.org>
    
    Closes #119 from rdblue/PARQUET-164-add-memory-manager-warning and squashes the following commits:
    
    241144f [Ryan Blue] PARQUET-164: Add warning when scaling row group sizes.
    rdblue committed Feb 10, 2015
  2. PARQUET-116: Pass a filter object to user defined predicate in filter2 api
    
    Currently, when creating a user defined predicate with the new filter api, no value can be passed in to build a dynamic filter at runtime. This limits the usefulness of the user defined predicate, since meaningful predicates cannot be created. We can add a generic Object value that is passed through the api and used internally in the keep function of the user defined predicate to create many different types of filters.
    For example, in spark sql, we can pass in a list of filter values for a WHERE IN clause query and filter the row values based on that list.
    
    Author: Yash Datta <Yash.Datta@guavus.com>
    Author: Alex Levenson <alexlevenson@twitter.com>
    Author: Yash Datta <saucam@gmail.com>
    
    Closes #73 from saucam/master and squashes the following commits:
    
    7231a3b [Yash Datta] Merge pull request #3 from isnotinvain/alexlevenson/fix-binary-compat
    dcc276b [Alex Levenson] Ignore binary incompatibility in private filter2 class
    7bfa5ad [Yash Datta] Merge pull request #2 from isnotinvain/alexlevenson/simplify-udp-state
    0187376 [Alex Levenson] Resolve merge conflicts
    25aa716 [Alex Levenson] Simplify user defined predicates with state
    51952f8 [Yash Datta] PARQUET-116: Fix whitespace
    d7b7159 [Yash Datta] PARQUET-116: Make UserDefined abstract, add two subclasses, one accepting udp class, other accepting serializable udp instance
    40d394a [Yash Datta] PARQUET-116: Fix whitespace
    9a63611 [Yash Datta] PARQUET-116: Fix whitespace
    7caa4dc [Yash Datta] PARQUET-116: Add ConfiguredUserDefined that takes a serialiazble udp directly
    0eaabf4 [Yash Datta] PARQUET-116: Move the config object from keep method to a configure method in udp predicate
    f51a431 [Yash Datta] PARQUET-116: Adding type safety for the filter object to be passed to user defined predicate
    d5a2b9e [Yash Datta] PARQUET-116: Enforce that the filter object to be passed must be Serializable
    dfd0478 [Yash Datta] PARQUET-116: Add a test case for passing a filter object to user defined predicate
    4ab46ec [Yash Datta] PARQUET-116: Pass a filter object to user defined predicate in filter2 api
    Yash Datta committed with isnotinvain Feb 10, 2015
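A stateful predicate like the IN-clause example above can be sketched as follows. This is a toy illustration: the class and method names are invented and are not filter2's actual UserDefinedPredicate API.

```java
import java.io.Serializable;
import java.util.Set;

// Toy illustration of the kind of stateful predicate PARQUET-116 enables;
// names are invented, not parquet-mr's filter2 API.
class InPredicateSketch implements Serializable {
    private final Set<Integer> allowed;   // runtime state carried into the predicate

    InPredicateSketch(Set<Integer> allowed) {
        this.allowed = allowed;
    }

    // Analogous to filter2's keep(): decide per value using the carried state.
    boolean keep(int value) {
        return allowed.contains(value);
    }
}
```

The requirement that the state be Serializable mirrors the commit list above, since the predicate must be shipped to tasks.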
Commits on Feb 5, 2015
  1. PARQUET-139: Avoid reading footers when using task-side metadata

    This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:
    
    1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
    2. To avoid reading file footers on the client side, which is a performance bottleneck.
    
    Because the read context is reinitialized in every task, it is no longer necessary to pass its contents to each task in ParquetInputSplit. The fields and accessors have been removed.
    
    This also adds a new InputFormat, ParquetFileInputFormat that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits:
    
    cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes.
    09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews.
    3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat.
    8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.
    87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods.
    057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks.
    rdblue committed Feb 5, 2015
  2. PARQUET-177: Added lower bound to memory manager resize

    PARQUET-177
    
    Author: Daniel Weeks <dweeks@netflix.com>
    
    Closes #115 from danielcweeks/memory-manager-limit and squashes the following commits:
    
    b2e4708 [Daniel Weeks] Updated to base memory allocation off estimated chunk size
    09d7aa3 [Daniel Weeks] Updated property name and default value
    8f6cff1 [Daniel Weeks] Added low bound to memory manager resize
    danielcweeks committed Feb 5, 2015
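The scaling behavior these two memory-manager commits describe (PARQUET-164's warning when row groups are scaled down, PARQUET-177's lower bound) can be sketched roughly as follows; the names and the exact formula are assumptions for illustration, not parquet-mr's actual implementation.

```java
// Rough sketch of memory-manager scaling with a lower bound; the names and
// formula are assumptions, not parquet-mr's implementation.
class MemoryManagerSketch {
    // Scale each writer's row-group allocation so the total fits the pool,
    // but never below the configured minimum (the PARQUET-177 lower bound).
    static long allocation(long poolSize, long rowGroupSize, int writerCount, long minAllocation) {
        long requested = rowGroupSize * writerCount;
        if (requested <= poolSize) {
            return rowGroupSize;                    // everything fits; no scaling
        }
        // This is where PARQUET-164's warning about scaled row groups would be logged.
        double scale = (double) poolSize / requested;
        long scaled = (long) (rowGroupSize * scale);
        return Math.max(scaled, minAllocation);
    }
}
```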
  3. PARQUET-181: Scrooge Write Support (take two)

    This is similar to https://github.com/apache/incubator-parquet-mr/pull/43, but instead of making `ThriftWriteSupport` abstract, it keeps it around (but deprecated) and adds `AbstractThriftWriteSupport`. This is a little less elegant, but it seems to appease the semver overlords.
    
    Author: Colin Marc <colinmarc@gmail.com>
    
    Closes #58 from colinmarc/scrooge-write-support-2 and squashes the following commits:
    
    e2a0abd [Colin Marc] add write support to ParquetScroogeScheme
    19cf1a8 [Colin Marc] Add ScroogeWriteSupport and ParquetScroogeOutputFormat.
    colinmarc committed with tsdeng Feb 5, 2015
Commits on Feb 3, 2015
  1. PARQUET-173: Fixes `StatisticsFilter` for `And` filter predicate

    Author: Cheng Lian <lian@databricks.com>
    
    Closes #108 from liancheng/PARQUET-173 and squashes the following commits:
    
    d188f0b [Cheng Lian] Fixes test case
    be2c8a1 [Cheng Lian] Fixes `StatisticsFilter` for `And` filter predicate
    liancheng committed with isnotinvain Feb 3, 2015
  2. PARQUET-111: Updates for apache release

    Updates for first Apache release of parquet-mr.
    
    Author: Ryan Blue <blue@apache.org>
    
    Closes #109 from rdblue/PARQUET-111-update-for-apache-release and squashes the following commits:
    
    bf19849 [Ryan Blue] PARQUET-111: Add ARRIS copyright header to parquet-tools.
    f1a5c28 [Ryan Blue] PARQUET-111: Update headers in parquet-protobuf.
    ee4ea88 [Ryan Blue] PARQUET-111: Remove leaked LICENSE and NOTICE files.
    5bf178b [Ryan Blue] PARQUET-111: Update module names, urls, and binary LICENSE files.
    6736320 [Ryan Blue] PARQUET-111: Add RAT exclusion for auto-generated POM files.
    7db4553 [Ryan Blue] PARQUET-111: Add attribution for Spark dev script to LICENSE.
    45e29f2 [Ryan Blue] PARQUET-111: Update LICENSE and NOTICE.
    516c058 [Ryan Blue] PARQUET-111: Update license headers to pass RAT check.
    da688e3 [Ryan Blue] PARQUET-111: Update NOTICE with Apache boilerplate.
    234715d [Ryan Blue] PARQUET-111: Add DISCLAIMER and KEYS.
    f1d3601 [Ryan Blue] PARQUET-111: Update to use Apache parent POM.
    rdblue committed Feb 3, 2015
Commits on Jan 30, 2015
  1. PARQUET-157: Divide by zero fix

    There is a divide-by-zero error in the logging code inside the InternalParquetRecordReader. I've been running with this fixed for a while, but every time I revert I hit the problem again. I can't believe no one else has hit this problem. I submitted a Jira ticket a few weeks ago but didn't hear anything on the list, so here's the fix.
    
    This also avoids compiling log statements in some cases where it's unnecessary inside the checkRead method of InternalParquetRecordReader.
    
    Also added a .gitignore entry to clean up a build artifact.
    
    Author: Jim Carroll <jim@dontcallme.com>
    
    Closes #102 from jimfcarroll/divide-by-zero-fix and squashes the following commits:
    
    423200c [Jim Carroll] Filter out parquet-scrooge build artifact from git.
    22337f3 [Jim Carroll] PARQUET-157: Fix a divide by zero error when Parquet runs quickly. Also avoid compiling log statements in some cases where it's unnecessary.
    Jim Carroll committed with rdblue Jan 30, 2015
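The fix pattern is simply to guard the rate computation. A minimal sketch (names invented for illustration) of a progress-logging helper that cannot divide by zero when a file is read very quickly:

```java
// Sketch of a guarded rate computation; only divide when elapsed time is
// non-zero. Names are invented, not InternalParquetRecordReader's code.
class RateLogSketch {
    static String progress(long recordsRead, long elapsedMillis) {
        if (elapsedMillis == 0) {
            return recordsRead + " records read";   // too fast to compute a rate
        }
        long perSecond = recordsRead * 1000 / elapsedMillis;
        return recordsRead + " records read at " + perSecond + " rec/s";
    }
}
```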
  2. PARQUET-142: add path filter in ParquetReader

    Currently the parquet-tools command fails when the input is a directory containing a _SUCCESS file from mapreduce. Filtering those out, as ParquetFileReader does, fixes the problem.
    
    ```
    $ parquet-cat /tmp/parquet_write_test
    Could not read footer: java.lang.RuntimeException: file:/tmp/parquet_write_test/_SUCCESS is not a Parquet file (too small)
    
    $ tree /tmp/parquet_write_test
    /tmp/parquet_write_test
    ├── part-m-00000.parquet
    └── _SUCCESS
    ```
    
    Author: Neville Li <neville@spotify.com>
    
    Closes #89 from nevillelyh/gh/path-filter and squashes the following commits:
    
    7377a20 [Neville Li] PARQUET-142: add path filter in ParquetReader
    nevillelyh committed with rdblue Jan 30, 2015
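The hidden-file convention being filtered out is names starting with `_` or `.`; the core check can be sketched as a standalone helper (an illustration, not the actual PathFilter class):

```java
// Sketch of a hidden-file check like the one applied when listing Parquet
// input directories: skip names starting with '_' (e.g. _SUCCESS) or '.'.
class HiddenFileFilterSketch {
    static boolean accept(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }
}
```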
  3. PARQUET-124: normalize path checking to prevent mismatch between URI and path
    
    Author: Chris Albright <calbright@cj.com>
    
    Closes #79 from chrisalbright/master and squashes the following commits:
    
    b1b0086 [Chris Albright] Merge remote-tracking branch 'upstream/master'
    9669427 [Chris Albright] PARQUET-124: Adding test (Thanks Ryan Blue) that proves mergeFooters was failing
    8e342ed [Chris Albright] PARQUET-124: normalize path checking to prevent mismatch between URI and path
    Chris Albright committed with rdblue Jan 30, 2015
  4. PARQUET-133: Upgrade snappy-java to 1.1.1.6

    Upgrade snappy-java to 1.1.1.6 (the latest version), since 1.0.5 is no longer maintained in https://github.com/xerial/snappy-java, and 1.1.1.6 supports broader platforms including PowerPC, IBM-AIX 6.4, SunOS, etc. It also has a better native code loading mechanism (allowing snappy-java to be used from multiple class loaders).
    
    Author: Taro L. Saito <leo@xerial.org>
    
    Closes #85 from xerial/PARQUET-133 and squashes the following commits:
    
    01d7b78 [Taro L. Saito] PARQUET-133: Upgrade snappy-java to 1.1.1.6
    xerial committed with rdblue Jan 30, 2015
Commits on Jan 29, 2015
  1. PARQUET-174: Replaces AssertionError constructor introduced in Java7

    AssertionError(String, Throwable) was introduced in Java7. Replacing it with AssertionError(String) + initCause(Throwable)
    
    Author: Laurent Goujon <laurentgo@users.noreply.github.com>
    
    Closes #101 from laurentgo/fix-java7ism and squashes the following commits:
    
    c00fb7c [Laurent Goujon] Replaces AssertionError constructor introduced in Java7
    laurentgo committed with julienledem Jan 29, 2015
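The replacement described above works because AssertionError(String) leaves the cause uninitialized, so a subsequent initCause(Throwable) is legal. As a small helper:

```java
// Java-6-compatible construction of an AssertionError with a cause:
// AssertionError(String) leaves the cause unset, so initCause is permitted.
class AssertionErrorCompat {
    static AssertionError withCause(String message, Throwable cause) {
        AssertionError error = new AssertionError(message);
        error.initCause(cause);
        return error;
    }
}
```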
Commits on Jan 27, 2015
  1. PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column chunk are null
    
    When all values in a binary column are null, the statistics object read from the file metadata is empty, and the all-nulls check for the column should return true. Even if a column has no values, it can be ignored.
    
    The other way is to fix this behaviour in the writer, but is that what we want?
    
    Author: Yash Datta <Yash.Datta@guavus.com>
    Author: Alex Levenson <alexlevenson@twitter.com>
    Author: Yash Datta <saucam@gmail.com>
    
    Closes #99 from saucam/npe and squashes the following commits:
    
    5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
    b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
    82209e6 [Yash Datta] PARQUET-161: Trigger tests
    aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
    2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
    c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
    97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
    a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136
    b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
    0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
    1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
    e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
    8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
    c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
    974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
    Yash Datta committed with isnotinvain Jan 27, 2015
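The conservative rule the commit list converges on, never drop a chunk whose statistics object is empty, can be sketched like this (a simplified standalone illustration, not the actual StatisticsFilter code):

```java
// Simplified sketch of a statistics-based drop decision: an empty statistics
// object (all-null column, or a pre-statistics file) gives no information,
// so the chunk must be kept rather than filtered out.
class StatsFilterSketch {
    static boolean canDrop(boolean statsEmpty, int min, int max, int lookingFor) {
        if (statsEmpty) {
            return false;                   // unknown range: keep the chunk
        }
        return lookingFor < min || lookingFor > max;
    }
}
```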
Commits on Jan 24, 2015
  1. PARQUET-168: Fixes parquet-tools command line option description

    Author: Cheng Lian <lian@databricks.com>
    
    Closes #106 from liancheng/PARQUET-168 and squashes the following commits:
    
    4524f2d [Cheng Lian] Fixes command line option description
    liancheng committed with rdblue Jan 24, 2015
Commits on Jan 13, 2015
  1. PARQUET-141: upgrade to scrooge 3.17.0, remove reflection based field info inspection...
    
    upgrade to scrooge 3.17.0, remove reflection based field info inspection, support enum and requirement type correctly
    
    This PR is essential for scrooge write support https://github.com/apache/incubator-parquet-mr/pull/58
    
    Author: Tianshuo Deng <tdeng@twitter.com>
    
    Closes #88 from tsdeng/scrooge_schema_converter_upgrade and squashes the following commits:
    
    77cc12a [Tianshuo Deng] delete empty line, retrigger jenkins
    80d61ad [Tianshuo Deng] format
    26e1fe1 [Tianshuo Deng] fix exception handling
    706497d [Tianshuo Deng] support union
    1b51f0f [Tianshuo Deng] upgrade to scrooge 3.17.0, remove reflection based field info inspection, support enum and requirement type correctly
    tsdeng committed Jan 13, 2015
Commits on Dec 29, 2014
  1. PARQUET-108: Parquet Memory Management in Java

    PARQUET-108: Parquet Memory Management in Java.
    When Parquet tries to write very large "row groups", it may cause tasks to run out of memory during dynamic partition writes, when a reducer may have many Parquet files open at a given time.
    
    This patch implements a memory manager to control the total memory size used by writers and balance their memory usage, which ensures that we don't run out of memory due to writing too many row groups within a single JVM.
    
    Author: dongche1 <dong1.chen@intel.com>
    
    Closes #80 from dongche/master and squashes the following commits:
    
    e511f85 [dongche1] Merge remote branch 'upstream/master'
    60a96b5 [dongche1] Merge remote branch 'upstream/master'
    2d17212 [dongche1] improve MemoryManger instantiation, change access level
    6e9333e [dongche1] change blocksize type from int to long
    e07b16e [dongche1] Refine updateAllocation(), addWriter(). Remove redundant getMemoryPoolRatio
    9a0a831 [dongche1] log the inconsistent ratio config instead of thowing an exception
    3a35d22 [dongche1] Move the creation of MemoryManager. Throw exception instead of logging it
    aeda7bc [dongche1] PARQUET-108: Parquet Memory Management in Java" ;
    c883bba [dongche1] PARQUET-108: Parquet Memory Management in Java
    7b45b2c [dongche1] PARQUET-108: Parquet Memory Management in Java
    6d766aa [dongche1] PARQUET-108: Parquet Memory Management in Java --- address some comments
    3abfe2b [dongche1] parquet 108
    dongche committed with Brock Noland Dec 29, 2014
Commits on Dec 16, 2014
  1. PARQUET-150 Update merge script issue id matching.

    This matches a word boundary after the issue id rather than a colon.
    
    Author: Ryan Blue <blue@apache.org>
    
    This patch had conflicts when merged, resolved by
    Committer: Ryan Blue <blue@apache.org>
    
    Closes #94 from rdblue/PARQUET-150-update-merge-script and squashes the following commits:
    
    cc39713 [Ryan Blue] PARQUET-150: Update merge script issue id matching.
    rdblue committed Dec 16, 2014
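The merge script itself is a dev script, but the matching change, a word boundary after the issue id instead of a literal colon, can be illustrated in Java (the regex and helper are invented for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of matching an issue id followed by a word boundary, so both
// "PARQUET-150 Update ..." and "PARQUET-150: Update ..." titles match.
class IssueIdSketch {
    private static final Pattern ISSUE = Pattern.compile("^(PARQUET-\\d+)\\b");

    static String issueId(String title) {
        Matcher m = ISSUE.matcher(title);
        return m.find() ? m.group(1) : null;
    }
}
```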
Commits on Dec 11, 2014
  1. PARQUET-145 InternalParquetRecordReader.close() should not throw an exception if initialization has failed
    
    PARQUET-145 InternalParquetRecordReader.close() should not throw an exception if initialization has failed
    
    Author: Wolfgang Hoschek <whoschek@cloudera.com>
    
    Closes #93 from whoschek/PARQUET-145-3 and squashes the following commits:
    
    52a6acb [Wolfgang Hoschek] PARQUET-145 InternalParquetRecordReader.close() should not throw an exception if initialization has failed
    whoschek committed with rdblue Dec 11, 2014
Commits on Dec 4, 2014
  1. PARQUET-117: implement the new page format for Parquet 2.0

    The new page format was defined some time ago:
    Parquet/parquet-format#64
    Parquet/parquet-format#44
    The goals are the following:
     - cut pages on record boundaries to facilitate skipping pages in predicate push down
     - read rl and dl independently of data
     - optionally not compress data
    
    Author: julien <julien@twitter.com>
    
    Closes #75 from julienledem/new_page_format and squashes the following commits:
    
    fbbc23a [julien] make mvn install display output only if it fails
    4189383 [julien] save output lines as travis cuts after 10000
    44d3684 [julien] fix parquet-tools for new page format
    0fb8c15 [julien] Merge branch 'master' into new_page_format
    5880cbb [julien] Merge branch 'master' into new_page_format
    6ee7303 [julien] make parquet.column package not semver compliant
    42f6c9f [julien] add tests and fix bugs
    266302b [julien] fix write path
    4e76369 [julien] read path
    050a487 [julien] fix compilation
    e0e9d00 [julien] better ColumnWriterStore definition
    ecf04ce [julien] remove unnecessary change
    2bc4d01 [julien] first stab at write path for the new page format
    julienledem committed with tsdeng Dec 4, 2014
Commits on Dec 2, 2014
  1. PARQUET-140: Allow clients to control the GenericData instance used to read Avro records
    
    Author: Josh Wills <jwills@cloudera.com>
    
    Closes #90 from jwills/master and squashes the following commits:
    
    044cf54 [Josh Wills] PARQUET-140: Allow clients to control the GenericData object that is used to read Avro records
    jwills committed with tomwhite Dec 2, 2014
Commits on Nov 25, 2014
  1. PARQUET-52: refactor fallback mechanism

    See: https://issues.apache.org/jira/browse/PARQUET-52
    Context:
    In the ValuesWriter API there is a mechanism to return the Encoding actually used, which allows falling back to a different encoding.
    For example, dictionary encoding may fail if there are too many distinct values and the dictionary grows too big. In such cases the DictionaryValuesWriter was falling back to the Plain encoding.
    This can also happen if the space savings are not satisfactory when writing the first page and we prefer to fall back to a more lightweight encoding.
    With Parquet 2.0 we are adding new encodings, and the fallback is not necessarily Plain anymore.
    This pull request decouples the fallback mechanism from the Dictionary and Plain encodings and allows the fallback logic to be reused with other encodings.
    One could imagine more than one level of fallback in the future by chaining FallBackValuesWriters.
    
    Author: julien <julien@twitter.com>
    
    Closes #74 from julienledem/fallback and squashes the following commits:
    
    b74a4ca [julien] Merge branch 'master' into fallback
    d9abd62 [julien] better naming
    aa90caf [julien] exclude values encoding from SemVer
    10f295e [julien] better test setup
    c516bd9 [julien] improve test
    780c4c3 [julien] license header
    f16311a [julien] javadoc
    aeb8084 [julien] add more test; fix dic decoding
    0793399 [julien] Merge branch 'master' into fallback
    2638ec9 [julien] fix dictionary encoding labelling
    2fd9372 [julien] consistent naming
    cf7a734 [julien] rewrite ParquetProperties to enable proper fallback
    bf1474a [julien] refactor fallback mechanism
    julienledem committed Nov 25, 2014
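The decoupled fallback can be sketched as a wrapper that switches writers when the initial one reports it should fall back; all names here are simplified assumptions, not parquet-mr's FallBackValuesWriter API.

```java
// Simplified sketch of a fallback values writer; names are assumptions.
class FallBackSketch {
    interface Writer {
        void write(int value);
        boolean shouldFallBack();   // e.g. the dictionary grew too big
        String encoding();
    }

    private Writer current;
    private final Writer fallback;

    FallBackSketch(Writer initial, Writer fallback) {
        this.current = initial;
        this.fallback = fallback;
    }

    void write(int value) {
        current.write(value);
        // Switch once the initial writer gives up; chaining FallBackSketch
        // instances would give more than one level of fallback.
        if (current != fallback && current.shouldFallBack()) {
            current = fallback;
        }
    }

    String encoding() {
        return current.encoding();
    }

    // Toy "dictionary" writer that overflows after two distinct values.
    static Writer toyDictionary() {
        return new Writer() {
            private final java.util.Set<Integer> distinct = new java.util.HashSet<>();
            public void write(int v) { distinct.add(v); }
            public boolean shouldFallBack() { return distinct.size() > 2; }
            public String encoding() { return "DICTIONARY"; }
        };
    }

    // Toy "plain" writer that never falls back.
    static Writer toyPlain() {
        return new Writer() {
            public void write(int v) { }
            public boolean shouldFallBack() { return false; }
            public String encoding() { return "PLAIN"; }
        };
    }
}
```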
Commits on Nov 20, 2014
  1. PARQUET-114: Sample NanoTime class serializes and deserializes Timestamp incorrectly
    
    I ran the Parquet Column tests and they passed.
    
    FYI @rdblue
    
    Author: Brock Noland <brock@apache.org>
    
    Closes #71 from brockn/master and squashes the following commits:
    
    69ba484 [Brock Noland] PARQUET-114 - Sample NanoTime class serializes and deserializes Timestamp incorrectly
    Brock Noland committed Nov 20, 2014
Commits on Nov 19, 2014
  1. PARQUET-132: Add type parameter to AvroParquetInputFormat.

    Author: Ryan Blue <blue@apache.org>
    
    Closes #84 from rdblue/PARQUET-132-parameterize-avro-inputformat and squashes the following commits:
    
    63114b0 [Ryan Blue] PARQUET-132: Add type parameter to AvroParquetInputFormat.
    rdblue committed Nov 19, 2014
Commits on Nov 18, 2014
  1. PARQUET-135: Input location is not getting set for the getStatistics in ParquetLoader when using two different loaders within a Pig script.
    
    Author: elif dede <edede@twitter.com>
    
    Closes #86 from elifdd/parquetLoader_error_PARQUET-135 and squashes the following commits:
    
    b0150ee [elif dede] fixed white space
    bdb381a [elif dede] PARQUET-135: Call setInput from getStatistics in ParquetLoader to fix ReduceEstimator errors in pig jobs
    elifdd committed with julienledem Nov 18, 2014
Commits on Nov 7, 2014
  1. PARQUET-122: make task side metadata true by default

    Author: julien <julien@twitter.com>
    
    Closes #78 from julienledem/task_side_metadata_default_true and squashes the following commits:
    
    32451a7 [julien] make task side metadata true by default
    julienledem committed Nov 7, 2014
Commits on Nov 3, 2014
  1. PARQUET-121: Allow Parquet to build with Java 8

    There are test failures when running with Java 8 due to http://openjdk.java.net/jeps/180, which changed the retrieval order for HashMap.
    
    Here's how I tested this:
    
    ```bash
    use-java8
    mvn clean install -DskipTests -Dmaven.javadoc.skip=true
    mvn test
    mvn test -P hadoop-2
    ```
    
    I also compiled the main code with Java 7 (target=1.6 bytecode), and compiled the tests with Java 8, and ran them with Java 8. The idea here is to simulate users who want to run Parquet with JRE 8.
    ```bash
    use-java7
    mvn clean install -DskipTests -Dmaven.javadoc.skip=true
    use-java8
    find . -name test-classes | grep target/test-classes | grep -v 'parquet-scrooge' | xargs rm -rf
    mvn test -DtargetJavaVersion=1.8 -Dmaven.main.skip=true -Dscala.maven.test.skip=true
    ```
    A couple of notes about this:
    * The targetJavaVersion property is used since other Hadoop projects use the same name.
    * I couldn’t get parquet-scrooge to compile with target=1.8, which is why I introduced scala.maven.test.skip (and updated scala-maven-plugin to the latest version which supports the property). Compiling with target=1.8 should be fixed in another JIRA as it looks pretty involved.
    
    Author: Tom White <tom@cloudera.com>
    
    Closes #77 from tomwhite/PARQUET-121-java8 and squashes the following commits:
    
    8717e13 [Tom White] Fix tests to run under Java 8.
    35ea670 [Tom White] PARQUET-121. Allow Parquet to build with Java 8.
    tomwhite committed Nov 3, 2014
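The pitfall behind those test failures: HashMap iteration order is unspecified and changed between Java 7 and Java 8, so tests must not depend on it; a LinkedHashMap keeps insertion order across JDKs by contract. A minimal illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// HashMap iteration order is unspecified and may differ between Java 7 and 8
// (JEP 180 changed the internals); LinkedHashMap fixes the order by contract.
class MapOrderSketch {
    static String joinedKeys(Map<String, Integer> map) {
        StringBuilder sb = new StringBuilder();
        for (String key : map.keySet()) {
            sb.append(key);
        }
        return sb.toString();
    }

    static Map<String, Integer> insertionOrdered() {
        Map<String, Integer> map = new LinkedHashMap<>();  // order: insertion
        map.put("b", 2);
        map.put("a", 1);
        map.put("c", 3);
        return map;
    }
}
```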
  2. PARQUET-123: Enable dictionary support in AvroIndexedRecordConverter

    If consumers are loading Parquet records into an immutable structure
    like an Apache Spark RDD, being able to configure string reuse in
    AvroIndexedRecordConverter can drastically reduce the overall memory
    footprint of strings.
    
    NOTE: This isn't meant to be a merge-able PR (yet). I want to use
    this PR as a way to discuss: (1) if this is a reasonable approach
    and (2) to learn if PrimitiveConverter needs to be thread-safe as
    I'm currently using a ConcurrentHashMap. If there's agreement
    that this would be worthwhile, I'll create a JIRA and write some
    unit tests.
    
    Author: Matt Massie <massie@cs.berkeley.edu>
    
    Closes #76 from massie/immutable-strings and squashes the following commits:
    
    88ce5bf [Matt Massie] PARQUET-123: Enable dictionary support in AvroIndexedRecordConverter
    massie committed with tomwhite Nov 3, 2014
Commits on Oct 29, 2014
  1. PARQUET-106: Relax InputSplit Protections

    https://issues.apache.org/jira/browse/PARQUET-106
    
    Author: Daniel Weeks <dweeks@netflix.com>
    
    Closes #67 from dcw-netflix/input-split2 and squashes the following commits:
    
    2f2c0c7 [Daniel Weeks] Update ParquetInputSplit.java
    12bd3c1 [Daniel Weeks] Update ParquetInputSplit.java
    6c662ee [Daniel Weeks] Update ParquetInputSplit.java
    5f9f02e [Daniel Weeks] Update ParquetInputSplit.java
    d19e1ac [Daniel Weeks] Merge branch 'master' into input-split2
    c4172bb [Daniel Weeks] Merge remote-tracking branch 'upstream/master'
    01a5e8f [Daniel Weeks] Relaxed protections on input split class
    d37a6de [Daniel Weeks] Resetting pom to main
    0c1572e [Daniel Weeks] Merge remote-tracking branch 'upstream/master'
    98c6607 [Daniel Weeks] Merge remote-tracking branch 'upstream/master'
    96ba602 [Daniel Weeks] Disabled projects that don't compile
    danielcweeks committed with julienledem Oct 29, 2014
Commits on Oct 21, 2014
  1. PARQUET-105: use mvn shade plugin to create uber jar, support meta on a folder
    
    1. Make the hadoop dependency of parquet-tools provided, so it can be used against different versions of hadoop
    2. Use the maven shade plugin to create an all-in-one jar, which can be used both locally and in hadoop
    3. Make the parquet-meta command support both a folder (reading the summary file) and a single file
    
    Author: Tianshuo Deng <tdeng@twitter.com>
    
    Closes #69 from tsdeng/bundle_parquet_tools and squashes the following commits:
    
    d8dcd3e [Tianshuo Deng] print file offset, file path, and cancel autoCrop
    a2d1399 [Tianshuo Deng] support local mode
    5009a85 [Tianshuo Deng] fix README
    0756f81 [Tianshuo Deng] remove semver check for parquet_tools
    78c7f4b [Tianshuo Deng] use mvn shade plugin to create uber jar, support meta on a folder
    tsdeng committed with julienledem Oct 21, 2014