Decouple Parquet from Hadoop API #305

Closed
julienledem opened this Issue Feb 14, 2014 · 12 comments


@julienledem
Member

To allow reading and writing Parquet files independently of the Hadoop APIs.

@fangjian601

I'm interested in this project. But I have several questions:

  1. Are ParquetFileReader and ParquetFileWriter the implementations of the readers and writers for Parquet files (using the Hadoop APIs)?
  2. This project is trying to create another reader and writer that reads and writes Parquet files directly, without using the Hadoop API. Does this mean the reader and writer should handle files using local file system APIs?
  3. I don't understand the relation between this issue and the referenced issue #313 — could you give me some explanation?

I'd appreciate your assistance.

@fangjian601

Could anyone answer my questions?

@julienledem
Member
  1. Yes, those classes implement the low level of the file format on top of the Hadoop FileSystem API.
  2. The goal is to decouple those implementations from the Hadoop APIs.
    To do that we need to create interfaces that provide the following abstractions:
    • For write: an OutputStream that exposes the current position, so that we can record offsets for the footer.
    • For read: an InputStream that provides
      • a seek(offset) method to seek to the footer and to individual column chunks,
      • the size of the stream, to locate the footer,
      • the current position, to determine whether we need to seek.
    • We also need a general notion of resolving a path to a stream, and of configuration.
      All of these are present in the Hadoop API, which is why it was convenient to use them in Parquet, but that ties the implementation unnecessarily to Hadoop.
      We can then provide Local and Hadoop implementations of these interfaces.
  3. We also need a way to abstract out configuration. With Hadoop it would come from the Configuration object; outside of Hadoop it would be something similar to a simple Map<String, String>.
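The abstractions described above can be sketched in Java. The interface and class names below are hypothetical placeholders for illustration (not the actual Parquet API): a read-side stream with seek/position/length, a write-side stream that tracks its position for footer offsets, and local-filesystem implementations that need no Hadoop classes on the classpath.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical read-side abstraction: seek, current position, and total
// length are exactly what a reader needs to find the footer and then jump
// to individual column chunks.
interface SeekableInput extends AutoCloseable {
    void seek(long offset) throws IOException;  // jump to the footer or a column chunk
    long getPos() throws IOException;           // current position, to decide whether to seek
    long length() throws IOException;           // stream size, to locate the footer
    int read(byte[] buf, int off, int len) throws IOException;
    void close() throws IOException;
}

// Hypothetical write-side abstraction: an output stream that tracks the
// current position so the writer can record column-chunk offsets in the footer.
interface PositionOutput extends AutoCloseable {
    long getPos() throws IOException;
    void write(byte[] buf, int off, int len) throws IOException;
    void close() throws IOException;
}

// Local-filesystem implementation of the read side, backed by RandomAccessFile.
class LocalSeekableInput implements SeekableInput {
    private final RandomAccessFile file;
    LocalSeekableInput(String path) throws IOException { this.file = new RandomAccessFile(path, "r"); }
    public void seek(long offset) throws IOException { file.seek(offset); }
    public long getPos() throws IOException { return file.getFilePointer(); }
    public long length() throws IOException { return file.length(); }
    public int read(byte[] buf, int off, int len) throws IOException { return file.read(buf, off, len); }
    public void close() throws IOException { file.close(); }
}

// Local-filesystem implementation of the write side; the position is simply
// a running count of bytes written.
class LocalPositionOutput implements PositionOutput {
    private final FileOutputStream out;
    private long pos = 0;
    LocalPositionOutput(String path) throws IOException { this.out = new FileOutputStream(path); }
    public long getPos() { return pos; }
    public void write(byte[] buf, int off, int len) throws IOException { out.write(buf, off, len); pos += len; }
    public void close() throws IOException { out.close(); }
}
```

A Hadoop-backed implementation of the same interfaces would presumably delegate to FSDataInputStream and FSDataOutputStream, which already expose seek and position, so the file-format code would depend only on the interfaces.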
@shevek
shevek commented Apr 7, 2014

Strong, strong support for this request. After reviewing the code, there seems to be no strong reason for requiring Hadoop on the classpath. We desperately want a columnar file store that does NOT use Hadoop APIs.

@julienledem
Member

@fangjian601 your proposal is great but unfortunately we did not have enough slots for all the projects and could not have one for this. Parquet will have one GSOC student this year.
You are very welcome to contribute anyway and we are happy to help you, but it won't be backed by GSOC.

@julienledem julienledem removed the GSoC-2014 label Apr 24, 2014
@fangjian601

@julienledem I'll still try to work on this issue this summer. I'm quite interested in Parquet and hope I can become a contributor.

@julienledem
Member

@fangjian601 great to hear. Let us know if you have questions.

@jmd1011
jmd1011 commented Jun 28, 2016

Has this effort been abandoned, or is this still on the radar?

@julienledem
Member

@jmd1011 Nobody is working on it that I know of. You're welcome to propose something.
Please use https://issues.apache.org/jira/browse/PARQUET as this repo is deprecated

@julienledem julienledem pushed a commit to julienledem/old-parquet-mr that referenced this issue Jul 30, 2016
@rdblue rdblue PARQUET-415: Fix ByteBuffer Binary serialization.
This also adds a test to validate that serialization works for all
Binary objects that are already test cases.

Author: Ryan Blue <blue@apache.org>

Closes #305 from rdblue/PARQUET-415-fix-bytebuffer-binary-serialization and squashes the following commits:

4e75d54 [Ryan Blue] PARQUET-415: Fix ByteBuffer Binary serialization.
0a711eb
@l15k4
l15k4 commented Dec 8, 2016

@jmd1011 Hey, is this effort going anywhere, or did you give up? :-) https://github.com/jmd1011/parquet-readers

It would be so nice to use Hadoop-less Parquet.

@jmd1011
jmd1011 commented Dec 22, 2016

I am still working on this, yup! Things are just coming along slower than anticipated.
