Support bulk export to parquet format #1340
lmsurpre added a commit that referenced this issue on Jul 29, 2020:

1. Updates to fhir-operation-bulkdata and fhir-bulkimportexport-webapp to accept and pass alternative export formats (application/fhir+parquet).
2. Introduction of SparkParquetWriter for using Apache Spark to introspect a list of JSON strings (e.g. FHIR resources), infer a schema, and write the schema and data into a parquet file. This approach writes specific files under a single logical file which is actually a directory, allowing the schema to change over time.
3. Updates to the System ChunkReader and ChunkWriter for parquet; instead of writing to a buffer in the reader, we simply add to the context info and return the Resource objects themselves. Then, in the writer, we always write a file with whatever resources were passed. This works because Spark is managing the writes to COS for us and Parquet has this notion of a multi-part logical file, so we don't need the tight control over multi-part uploads like in the case of NDJSON.

Also includes miscellaneous cleanup, debug, and style changes.

Signed-off-by: Lee Surprenant <lmsurpre@us.ibm.com>
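The schema-inference approach described in the commit message can be illustrated with a short, hypothetical Spark sketch. This is not the project's SparkParquetWriter; the class name, sample resources, and output path below are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch only; not the project's SparkParquetWriter.
public class ParquetWriteSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fhir-parquet-sketch")
                .master("local[*]")
                .getOrCreate();

        // Serialized FHIR resources as JSON strings (placeholder data)
        List<String> jsonResources = Arrays.asList(
                "{\"resourceType\":\"Patient\",\"id\":\"a\",\"gender\":\"female\"}",
                "{\"resourceType\":\"Patient\",\"id\":\"b\",\"gender\":\"male\",\"birthDate\":\"1970-01-01\"}");

        // Spark introspects the JSON documents and infers a schema
        Dataset<Row> df = spark.read().json(spark.createDataset(jsonResources, Encoders.STRING()));

        // The "file" is really a directory of part-files; appending new parts
        // (possibly with additional columns) is what lets the schema evolve over time
        df.write().mode(SaveMode.Append).parquet("/tmp/bulkexport/Patient.parquet");

        spark.stop();
    }
}
```

On the request side, the new media type would presumably be selected through the Bulk Data $export operation's _outputFormat parameter, e.g. `$export?_outputFormat=application/fhir+parquet`.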
lmsurpre added further commits that referenced this issue on Jul 29 and Aug 3, 2020, each with the same commit message.
lmsurpre added a commit that referenced this issue on Aug 7, 2020:

issue #1340 - support bulk export to parquet format
Is your feature request related to a problem? Please describe.
Parquet format is desirable for two reasons:
Describe the solution you'd like
Eventually I'd like to support a SQL-on-FHIR view of the data, backed by Parquet files in an object store, but for now I just want to get a simple flattened Parquet export working.
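As a rough illustration of that eventual goal (querying the exported Parquet with SQL), here is a hypothetical Spark SQL sketch. The bucket, path, and column names are placeholders, and the cos:// scheme assumes a configured object-store connector such as Stocator.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of querying an exported Patient.parquet with Spark SQL.
public class SqlOnParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-on-fhir-sketch")
                .master("local[*]")
                .getOrCreate();

        // mergeSchema reconciles part-files that were written with evolving schemas
        Dataset<Row> patients = spark.read()
                .option("mergeSchema", "true")
                .parquet("cos://my-bucket.service/exports/Patient.parquet");

        patients.createOrReplaceTempView("patient");
        spark.sql("SELECT gender, count(*) AS n FROM patient GROUP BY gender").show();

        spark.stop();
    }
}
```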
Describe alternatives you've considered
The main alternative is to export to NDJSON format and then transform from that format to Parquet.
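For comparison, that alternative path (NDJSON export followed by a separate conversion step) could look roughly like the sketch below; the input and output paths are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch: convert an existing NDJSON export to Parquet as a post-processing step.
public class NdjsonToParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ndjson-to-parquet")
                .master("local[*]")
                .getOrCreate();

        // Spark's JSON reader consumes newline-delimited JSON by default
        Dataset<Row> resources = spark.read().json("/tmp/bulkexport/Patient.ndjson");

        resources.write().mode(SaveMode.Overwrite).parquet("/tmp/bulkexport/Patient.parquet");

        spark.stop();
    }
}
```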
Additional context
See https://medium.com/@gidon_16942/apache-parquet-for-hl7-fhir-c23610131f8c