
Support Parquet as export file format for COPY TO #15955

Open
proddata opened this issue May 2, 2024 · 2 comments

proddata (Member) commented May 2, 2024

Problem Statement

CrateDB's current export functionality is limited to JSON, which loses type information and leads to suboptimal data handling. The COPY TO command lacks support for exporting data in Parquet, a standard format that preserves type information and enables efficient processing. This gap limits interoperability with data engineering workflows and libraries that rely on Parquet's schema structure. Adding Parquet support to COPY TO would enable seamless integration with such tools and frameworks. If read support for Parquet files were added as well, either directly or via e.g. third-party FDW implementations, this could also cover archival and cold-storage use cases.

Possible Solutions

Support `COPY quotes TO DIRECTORY '/tmp/' with (FORMAT PARQUET);` to export table partitions into Parquet files, similar to the existing JSON export.

It should also be possible to set the compression codec (zstd, gzip, snappy) and the row group size.
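For reference, these are the knobs a typical Parquet writer already exposes and that the proposed COPY TO options could map to. A minimal pyarrow sketch (the table contents and output file name are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data standing in for one exported shard/partition.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int32()),
    "price": pa.array([10.5, 11.0, 9.75], type=pa.float64()),
})

# Compression codec and row group size are the two settings the
# proposed COPY TO options would correspond to.
pq.write_table(
    table,
    "quotes-shard-0.parquet",  # hypothetical output file name
    compression="zstd",        # alternatives: "gzip", "snappy"
    row_group_size=100_000,    # rows per row group
)
```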

The CrateDB data types should be mapped to appropriate Parquet data types, e.g. (incomplete):

| CrateDB Data Type | Parquet Data Type |
| --- | --- |
| INTEGER | INT32 |
| BIGINT | INT64 |
| SMALLINT | INT16 |
| REAL | FLOAT |
| DOUBLE PRECISION | DOUBLE |
| BOOLEAN | BOOLEAN |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMP WITH TIME ZONE | TIMESTAMP |
| STRING | STRING / BYTE_ARRAY |
| TEXT | STRING / BYTE_ARRAY |
| OBJECT (DYNAMIC) | STRUCT (with nested fields representing the object schema) |
| OBJECT (STRICT) | STRUCT (with fixed fields representing the object schema) |
| OBJECT (IGNORED) | JSON |
| ARRAY | LIST |
| GEO_POINT | STRUCT (with two fields representing latitude and longitude) |
| GEO_SHAPE | STRUCT (with fields representing different components of the shape) |
| IP | BYTE_ARRAY (representing the IP address in binary form) |
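As an illustration of the mapping above, an Arrow schema for a hypothetical table could look roughly like this. This is only a sketch: the column names, nested field layouts, and the exact logical types (e.g. timestamp resolution) are assumptions, not a proposed implementation.

```python
import pyarrow as pa

# Hypothetical schema mirroring the mapping table above.
schema = pa.schema([
    ("id", pa.int32()),                       # INTEGER -> INT32
    ("counter", pa.int64()),                  # BIGINT -> INT64
    ("flag", pa.bool_()),                     # BOOLEAN -> BOOLEAN
    ("price", pa.float64()),                  # DOUBLE PRECISION -> DOUBLE
    ("ts", pa.timestamp("ms", tz="UTC")),     # TIMESTAMP WITH TIME ZONE -> TIMESTAMP
    ("name", pa.string()),                    # TEXT -> STRING / BYTE_ARRAY
    ("payload", pa.struct([                   # OBJECT (DYNAMIC/STRICT) -> STRUCT
        ("a", pa.string()),
        ("b", pa.int64()),
    ])),
    ("tags", pa.list_(pa.string())),          # ARRAY -> LIST
    ("location", pa.struct([                  # GEO_POINT -> STRUCT
        ("lon", pa.float64()),
        ("lat", pa.float64()),
    ])),
    ("client_ip", pa.binary()),               # IP -> BYTE_ARRAY
])
```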

Considered Alternatives

  • Custom application using Apache Arrow
  • Export as .json, then transform to .parquet (see the sketch below)
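The second alternative is already possible today, but type information has to be reconstructed by hand. A minimal sketch with pandas, assuming a line-delimited JSON export (the file names and column names are hypothetical):

```python
import pandas as pd

# Assumes a line-delimited JSON export, e.g. produced by the existing COPY TO.
df = pd.read_json("quotes.json", lines=True)

# Types inferred by pandas may not match the original CrateDB schema
# and may need manual fixing, e.g.:
df["id"] = df["id"].astype("int32")

# Requires pyarrow (or fastparquet) to be installed.
df.to_parquet("quotes.parquet", compression="zstd")
```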
mfussenegger added the labels "needs info or feedback", "needs concrete use-case", and "need refined description" on May 6, 2024
mfussenegger (Member) commented

Can you give some more context on concrete use cases?

COPY TO is currently designed to do concurrent per-shard exports. If you're going to use the Parquet files with other tools, I'm not sure if having 1 file per shard is appropriate?

proddata (Member, Author) commented May 6, 2024

> Can you give some more context on concrete use cases?

  • Long-term storage / archiving: efficient long-term storage of data that retains metadata and can be queried or processed on demand much faster. Gzipped JSON covers this to a degree, but such files cannot be processed easily and do not preserve the table schema or data type information.
  • Integration with data engineering ecosystems: simplifies data integration with systems like Databricks / Apache Spark, Hadoop, and cloud services (AWS S3, Azure Data Lake).
  • Machine learning: enables quick access to large datasets for feature selection and model training, reducing load on the database and preventing system slowdowns.

> If you're going to use the parquet files with other tools, I'm not sure if having 1 file per shard is appropriate?

One Parquet file per shard would be perfect (one might even consider splitting very large shards). It is very common for a dataset to be composed of multiple Parquet files (see the sketch below). One could go a step further and also think about supporting Hive-like partitioning.
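To illustrate why one file per shard is unproblematic for consumers: most Parquet readers treat a directory of files as a single dataset. A pyarrow sketch, where the directory layout, file names, and partition column are assumptions:

```python
import pyarrow.dataset as ds

# A flat directory of per-shard files,
# e.g. /tmp/quotes/shard-0.parquet, /tmp/quotes/shard-1.parquet, ...
dataset = ds.dataset("/tmp/quotes/", format="parquet")
table = dataset.to_table()

# If the export instead used a Hive-like layout,
# e.g. /tmp/quotes/month=2024-05/shard-0.parquet, the partition column
# becomes queryable and enables partition pruning.
partitioned = ds.dataset("/tmp/quotes/", format="parquet", partitioning="hive")
may_2024 = partitioned.to_table(filter=ds.field("month") == "2024-05")
```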
