
Support bulk export to parquet format #1340

Closed
lmsurpre opened this issue Jul 20, 2020 · 0 comments
Assignees: lmsurpre
Labels: showcase (Used to Identify End-of-Sprint Demos)
Milestone: Sprint 15

Comments

@lmsurpre (Member)

Is your feature request related to a problem? Please describe.
Parquet format is desirable for two reasons:

  1. smaller size (vs. NDJSON)
  2. more analytics-ready

Describe the solution you'd like
Eventually I'd like to support a SQL-on-FHIR view of the data, backed by Parquet files in an object store. For now, I just want to get a simple flattened Parquet export working.
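
As a rough illustration of that eventual goal (not in scope for this issue), a SQL-on-FHIR-style query over exported Parquet files might look like the Spark SQL sketch below; the export path and the query itself are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FhirParquetViewSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fhir-parquet-view")
                .master("local[*]")
                .getOrCreate();

        // Read exported Patient resources from object storage (path is hypothetical)
        Dataset<Row> patients = spark.read().parquet("s3a://fhir-export/Patient");
        patients.createOrReplaceTempView("patient");

        // Query top-level Patient elements with plain SQL
        spark.sql("SELECT id, gender, birthDate FROM patient WHERE gender = 'female'")
             .show();
    }
}
```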

Describe alternatives you've considered
The main alternative is to export in NDJSON format and then transform from NDJSON to Parquet.
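
For reference, that two-step alternative is nearly trivial with Spark, since its JSON reader consumes newline-delimited JSON by default. A minimal sketch, with hypothetical file names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class NdjsonToParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ndjson-to-parquet")
                .master("local[*]")
                .getOrCreate();

        // Spark's JSON reader consumes newline-delimited JSON by default,
        // so each FHIR resource (one per line) becomes a row
        Dataset<Row> resources = spark.read().json("export/Patient.ndjson");

        // Rewrite the same data in Parquet format
        resources.write().mode(SaveMode.Overwrite).parquet("export/Patient.parquet");
    }
}
```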

Additional context
See https://medium.com/@gidon_16942/apache-parquet-for-hl7-fhir-c23610131f8c

@lmsurpre lmsurpre self-assigned this Jul 22, 2020
lmsurpre added a commit that referenced this issue Jul 29, 2020
1. Updates to fhir-operation-bulkdata and fhir-bulkimportexport-webapp
to accept and pass alternative export formats (application/fhir+parquet)

2. Introduction of SparkParquetWriter, which uses Apache Spark to introspect a list of JSON strings (e.g. FHIR resources), infer a schema, and write the schema and data into a Parquet file (a sketch follows below). This approach writes individual part-files under a single logical file, which is actually a directory, allowing the schema to change over time.

3. Updates to the System ChunkReader and ChunkWriter for Parquet; instead of writing to a buffer in the reader, we simply add to the context info and return the Resource objects themselves. Then, in the writer, we always write a file with whatever resources were passed. This works because Spark manages the writes to COS for us, and Parquet has the notion of a multi-part logical file, so we don't need the tight control over multi-part uploads that NDJSON requires.

Also includes miscellaneous cleanup, debug, and style changes.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
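
For illustration, here is a minimal sketch of the SparkParquetWriter approach from point 2 above: infer a schema from a batch of serialized resources and append the rows as part-files under one logical Parquet "file". Class and method names here are illustrative, not the actual implementation:

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkParquetWriterSketch {
    private final SparkSession spark = SparkSession.builder()
            .appName("fhir-parquet-export")
            .master("local[*]")
            .getOrCreate();

    /** Append one batch of serialized FHIR resources to the logical Parquet file at outUri. */
    public void writeParquet(List<String> jsonResources, String outUri) {
        // Let Spark introspect the JSON strings and infer a schema
        Dataset<String> json = spark.createDataset(jsonResources, Encoders.STRING());
        Dataset<Row> rows = spark.read().json(json);

        // Append mode adds new part-files under the same directory (the "logical file"),
        // so the schema is allowed to evolve from batch to batch
        rows.write().mode(SaveMode.Append).parquet(outUri);
    }
}
```

When reading such a directory back, Spark can reconcile part-files written with different schemas by setting .option("mergeSchema", "true") on the Parquet reader.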
@lmsurpre lmsurpre added this to the Sprint 15 milestone Jul 31, 2020
@lmsurpre lmsurpre added the showcase Used to Identify End-of-Sprint Demos label Jul 31, 2020
lmsurpre added a commit that referenced this issue Aug 7, 2020
issue #1340 - support bulk export to parquet format
@lmsurpre lmsurpre closed this as completed Aug 7, 2020