Converter-Loader communicates via Parquet files #48
Previously the converter populated a large directory structure with compressed JSON files, one per domain. This was fairly inefficient and slow, since file systems tend to perform poorly when holding millions of small files. This pull request replaces that directory structure with a series of Parquet files. A side effect is that parts of the loading are trivially parallelizable, making loading roughly twice as fast.
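Since each Parquet file is an independent unit of work, the loader can fan the files out across threads with no coordination between them. A minimal sketch of that idea (the file names and the `loadOne` stand-in are hypothetical, not the actual loader code):

```java
import java.util.List;

class ParallelLoadSketch {
    // Stand-in for the real per-file loading logic; here each "file"
    // just yields a record count derived from its name length.
    static int loadOne(String file) {
        return file.length();
    }

    // Each file is independent, so a parallel stream over the file list
    // is all the coordination that is needed.
    static int loadAll(List<String> files) {
        return files.parallelStream()
                    .mapToInt(ParallelLoadSketch::loadOne)
                    .sum();
    }
}
```

The results are the same as a sequential loop; only wall-clock time changes, which is what makes this kind of per-file parallelism "trivial".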
The pull request incorporates a version of the parquet-floor library to enable Parquet support without pulling a significant portion of Hadoop into the project's dependency tree. The library is modified to support zstd compression, as well as fields that are lists of objects or primitives, backed by Trove collections.
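The list-valued fields can be pictured as a repeated column that is flattened to one value per write call and reassembled into a list on read. The sketch below is purely illustrative (the interface and method names are hypothetical, not the actual parquet-floor API):

```java
import java.util.ArrayList;
import java.util.List;

class ListFieldSketch {
    // Hypothetical callback, standing in for a column writer.
    interface ValueSink {
        void accept(String field, long value);
    }

    // "Dehydrate": a list-valued field is written as one repeated
    // value per element.
    static void writeList(String field, List<Long> values, ValueSink sink) {
        for (long v : values) {
            sink.accept(field, v);
        }
    }

    // "Hydrate": the repeated values read back for one record are
    // collected into a list again.
    static List<Long> readList(List<Long> rawColumnValues) {
        return new ArrayList<>(rawColumnValues);
    }
}
```

In the actual library the per-element values would go through Parquet's repeated-field encoding rather than a plain callback; the point is only that list fields round-trip as flattened sequences.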