
Support bulk export to parquet format #1340

Closed
lmsurpre opened this issue Jul 20, 2020 · 0 comments
Assignees: lmsurpre
Labels: showcase (Used to Identify End-of-Sprint Demos)
Milestone: Sprint 15

Comments

@lmsurpre (Member)

Is your feature request related to a problem? Please describe.
Parquet format is desirable for two reasons:

  1. smaller size (vs. NDJSON)
  2. more analytics-ready

Describe the solution you'd like
Eventually I'd like to support a SQL-on-FHIR view of the data, backed by Parquet files in an object store. For now, I just want to get a simple flattened Parquet export working.
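
As a rough illustration of that eventual goal (not in scope for this issue), a SQL-on-FHIR-style query over exported Parquet files might look like the Spark SQL sketch below; the export path and the query itself are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FhirParquetViewSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fhir-parquet-view")
                .master("local[*]")
                .getOrCreate();

        // Read exported Patient resources from object storage (path is hypothetical)
        Dataset<Row> patients = spark.read().parquet("s3a://fhir-export/Patient");
        patients.createOrReplaceTempView("patient");

        // Query top-level Patient elements with plain SQL
        spark.sql("SELECT id, gender, birthDate FROM patient WHERE gender = 'female'")
             .show();
    }
}
```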

Describe alternatives you've considered
The main alternative is to export in NDJSON format and then transform from NDJSON to Parquet.
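
For reference, that two-step alternative is nearly trivial with Spark, since its JSON reader consumes newline-delimited JSON by default. A minimal sketch, with hypothetical file names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class NdjsonToParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ndjson-to-parquet")
                .master("local[*]")
                .getOrCreate();

        // Spark's JSON reader consumes newline-delimited JSON by default,
        // so each FHIR resource (one per line) becomes a row
        Dataset<Row> resources = spark.read().json("export/Patient.ndjson");

        // Rewrite the same data in Parquet format
        resources.write().mode(SaveMode.Overwrite).parquet("export/Patient.parquet");
    }
}
```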

Additional context
See https://medium.com/@gidon_16942/apache-parquet-for-hl7-fhir-c23610131f8c

@lmsurpre lmsurpre self-assigned this Jul 22, 2020
lmsurpre added a commit that referenced this issue Jul 29, 2020
1. Updates to fhir-operation-bulkdata and fhir-bulkimportexport-webapp
to accept and pass alternative export formats (application/fhir+parquet)

2. Introduction of SparkParquetWriter, which uses Apache Spark to introspect a list of JSON strings (e.g. FHIR resources), infer a schema, and write the schema and data into a Parquet file (a sketch follows below). This approach writes individual part-files under a single logical file, which is actually a directory, allowing the schema to change over time.

3. Updates to the System ChunkReader and ChunkWriter for Parquet; instead of writing to a buffer in the reader, we simply add to the context info and return the Resource objects themselves. Then, in the writer, we always write a file with whatever resources were passed. This works because Spark manages the writes to COS for us, and Parquet has the notion of a multi-part logical file, so we don't need the tight control over multi-part uploads that NDJSON requires.

Also includes miscellaneous cleanup, debug, and style changes.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
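
For illustration, here is a minimal sketch of the SparkParquetWriter approach from point 2 above: infer a schema from a batch of serialized resources and append the rows as part-files under one logical Parquet "file". Class and method names here are illustrative, not the actual implementation:

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkParquetWriterSketch {
    private final SparkSession spark = SparkSession.builder()
            .appName("fhir-parquet-export")
            .master("local[*]")
            .getOrCreate();

    /** Append one batch of serialized FHIR resources to the logical Parquet file at outUri. */
    public void writeParquet(List<String> jsonResources, String outUri) {
        // Let Spark introspect the JSON strings and infer a schema
        Dataset<String> json = spark.createDataset(jsonResources, Encoders.STRING());
        Dataset<Row> rows = spark.read().json(json);

        // Append mode adds new part-files under the same directory (the "logical file"),
        // so the schema is allowed to evolve from batch to batch
        rows.write().mode(SaveMode.Append).parquet(outUri);
    }
}
```

When reading such a directory back, Spark can reconcile part-files written with different schemas by setting .option("mergeSchema", "true") on the Parquet reader.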
@lmsurpre lmsurpre added this to the Sprint 15 milestone Jul 31, 2020
@lmsurpre lmsurpre added the showcase Used to Identify End-of-Sprint Demos label Jul 31, 2020
lmsurpre added a commit that referenced this issue Aug 7, 2020
issue #1340 - support bulk export to parquet format
@lmsurpre lmsurpre closed this as completed Aug 7, 2020