Converter-Loader communicates via Parquet files #48

Merged
merged 14 commits into from Sep 15, 2023

vlofgren
Contributor

Previously, the converter populated a large directory structure with compressed JSON files, one per domain. This was inefficient and slow, since file systems tend to cope poorly with millions of files. This pull request replaces that directory structure with a series of Parquet files. A side effect is that parts of the loading step become trivially parallelizable, so loading is about twice as fast.
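To illustrate why a handful of independent files makes loading easy to parallelize, here is a minimal sketch rather than the code in this pull request. The class `ParallelLoaderSketch`, the `FileLoader` interface, and the `.parquet` file-name filter are all hypothetical names chosen for the example, and the per-file loader is a stand-in for whatever actually reads the records.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch: load a set of independent files in parallel. */
final class ParallelLoaderSketch {

    /** Hypothetical per-file loader; returns the number of records it loaded. */
    interface FileLoader {
        long load(Path file) throws IOException;
    }

    /** Lists the files in dir whose names end in ".parquet" (an assumed naming scheme). */
    static List<Path> listParquetFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().endsWith(".parquet"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /**
     * Runs the loader over every file using a parallel stream and sums the
     * record counts. This is safe only because each file is processed
     * independently of the others.
     */
    static long loadAll(List<Path> files, FileLoader loader) {
        return files.parallelStream()
                .mapToLong(file -> {
                    try {
                        return loader.load(file);
                    } catch (IOException e) {
                        // Lambdas in streams cannot throw checked exceptions.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
    }
}
```

A caller would pass a loader that reads one file and writes its records to the database. With millions of tiny per-domain files, the same approach would spend most of its time on file system overhead rather than on loading.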

The pull request incorporates a version of the parquet-floor library to enable Parquet support without pulling a significant portion of Hadoop into the project's dependency tree. The library has been modified to support zstd compression, as well as fields that are lists of objects or primitives via Trove.

@vlofgren vlofgren marked this pull request as ready for review September 15, 2023 11:31
@vlofgren vlofgren merged commit 46232c7 into master Sep 15, 2023
@vlofgren vlofgren deleted the parquet branch March 21, 2024 13:43