Converter-Loader communicates via Parquet files #48

vlofgren · 2023-09-14T12:18:00Z

Previously the converter populated a large directory structure with compressed JSON files, one for each domain. This was fairly inefficient and slow (file systems don't tend to like having millions of files in them). This pull request replaces that directory structure with a series of parquet files. A side effect of this is that parts of the loading are trivially parallelizable, and so loading is about twice as fast.

The pull request incorporates a version of the parquet-floor library to enable parquet support without pulling a significant portion of Hadoop into the project's dependency tree. This library is modified to support zstd compression as well as fields that are lists of objects or primitives via trove.

This small library, while great, will require some modifications to fit the project's needs, so it goes into third-party directly.
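To illustrate why loading becomes easy to parallelize once the converter output is a set of independent files, here is a minimal sketch using only the JDK. It is not the loader code from this pull request: the class `ParallelLoadSketch`, the method `loadFile`, the file names and the row counts are all hypothetical, and the actual parquet reading is replaced by a placeholder that just returns a count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLoadSketch {

    /** Hypothetical stand-in for loading one file; returns the number of rows "loaded". */
    public static int loadFile(String fileName, int rows) {
        // A real loader would stream records from the file and write them to storage here.
        return rows;
    }

    /** Runs independent load tasks on a fixed thread pool and sums their row counts. */
    public static int loadAll(List<Callable<Integer>> tasks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int total = 0;
            // invokeAll blocks until every submitted task has finished.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                // get() rethrows a failure inside a task as an ExecutionException.
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file names; each task is independent of the others.
        List<Callable<Integer>> tasks = new ArrayList<>();
        tasks.add(() -> loadFile("domains.parquet", 100));
        tasks.add(() -> loadFile("links.parquet", 250));
        tasks.add(() -> loadFile("documents.parquet", 50));

        System.out.println("rows loaded: " + loadAll(tasks, 2));
    }
}
```

The sketch only works because no task depends on another task's output. Steps that do have an ordering requirement would still need to run sequentially, which is presumably why only parts of the loading benefit.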


vlofgren added 14 commits September 5, 2023 10:38

(parquet) Add parquet library

a284682

This small library, while great, will require some modifications to fit the project's needs, so it goes into third-party directly.

(parquet) Use ZSTD compression by default.

dbe974f

(parquet-floor) Patch in support for writing and reading repeated values

a00cabe

(work-log) New batching work log

a52d78c

(processed-data) New parquet-serializable models for converter output

064bc5e

(parquet-floor) Modify the parquet library to permit list-fields.

9f672a0

(converter,loader) Converter outputs parquet files instead of compressed json.

24b4606

(converting) WIP begin to remove converting-model and the old InstructionsCompiler

4799dd7

(work-log) Fix bug where items weren't added to the current batch on logItem

87a8593

(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup

c71f6ad

(refactor) Remove converting-model package completely

eaeb23d

(docs) Update the documentation up-to-date information

35996d0

(converter, control) Re-enable sideloading encyclopedia data

5e5aaf9

(converter) Write dummy processor log when sideloading

c67d95c

vlofgren marked this pull request as ready for review September 15, 2023 11:31

vlofgren merged commit 46232c7 into master Sep 15, 2023

vlofgren deleted the parquet branch March 21, 2024 13:43
