incorporate parquet files? #9897

Open
wibeasley opened this issue Sep 8, 2023 · 2 comments

@wibeasley

Has there been any discussion of using parquet files at some level of dataverse? (I see it mentioned in only one issue.)

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads.

I've used them some, and I love how well they work with R, Python, DuckDB, Spark, and others.

Several R programmers (like @kuriwaki) have advocated for rds files over RData files. From my recent experience with parquet files, they have all the advertised advantages of rds files (e.g., compression, strong typing, and factor levels), plus the appeal of interoperability with other platforms.

I haven't thought much beyond this. But when I read about problems with RData files and the messiness of Rserve described by @landreev, I see parquet as an improvement for many reasons, not least the ability to replace a flaky remote instance with a local parquet library.

cc: @pdurbin

@pdurbin
Member

pdurbin commented Sep 8, 2023

Yes! Recently 2020 data from the US Census was published in Harvard Dataverse in parquet format: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5LAVKV

(2010 data was published as well: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6 )

This work with the US Census is ongoing and being tracked here:

That said, no, Dataverse doesn't have any particular support for parquet files. In the examples above the parquet files are in a zip file. Here's a preview of the 2020 zip:

[Screenshot: preview of the 2020 zip file (Screen Shot 2023-09-08 at 3 49 20 PM)]

@kuriwaki
Member

kuriwaki commented Jul 2, 2024

One note here is that Dataverse does not seem to "unzip" a compressed parquet collection in a way that respects the file hierarchy. In the example I just made, it says it "failed to unzip the file properly". The file itself is still intact: the user can unzip it themselves after downloading, and Phil's screenshot above shows that the file hierarchy can be viewed as metadata, which may be the best way forward. But just a note.

parquet-dataverse.mp4
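
For anyone who ends up unzipping such an archive by hand after downloading, here is a minimal sketch in Python using only the standard library. It is not anything Dataverse provides: the helper names `list_hierarchy` and `extract_preserving_hierarchy` and the archive layout in the demo are made up for illustration, and the demo files contain placeholder bytes rather than real parquet data.

```python
import tempfile
import zipfile
from pathlib import Path


def list_hierarchy(zip_path):
    """Return the archive's member paths, sorted, without extracting anything."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


def extract_preserving_hierarchy(zip_path, dest_dir):
    """Extract every member under dest_dir, keeping the archive's folder layout.

    Returns the extracted .parquet files as POSIX-style paths relative to dest_dir.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)  # recreates the nested folders stored in the zip
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.parquet"))


if __name__ == "__main__":
    # Build a tiny stand-in archive (hypothetical names, placeholder contents).
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "collection.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("census/MA/part-0.parquet", b"placeholder")
            zf.writestr("census/NY/part-0.parquet", b"placeholder")
        print(list_hierarchy(archive))
        print(extract_preserving_hierarchy(archive, Path(tmp) / "out"))
```

Once the folders are back on disk, the extracted directory can be handed to whichever parquet reader you already use; that step is left out here because it depends on the library.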
