
Support Parquet as export file format for COPY TO #15955

Open
proddata opened this issue May 2, 2024 · 2 comments

proddata (Member) commented May 2, 2024

Problem Statement

CrateDB's current export functionality is limited to JSON, which loses type information and leads to suboptimal data handling. The COPY TO command lacks support for exporting data in Parquet, a standard format that preserves type information and enables efficient processing. This gap limits interoperability with data engineering workflows and libraries that rely on Parquet's schema structure. Adding Parquet support to COPY TO would enable seamless integration with such tools and frameworks. If read support for Parquet files were added as well, either directly or via e.g. third-party FDW implementations, this could also cover archival and cold-storage use cases.

Possible Solutions

Support `COPY quotes TO DIRECTORY '/tmp/' with (FORMAT PARQUET);` to export table partitions into Parquet files, similar to the existing JSON export.

It should also be possible to set the compression codec (zstd, gzip, snappy) and the row group size.
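For reference, these are the knobs a typical Parquet writer already exposes and that the proposed COPY TO options could map to. A minimal pyarrow sketch (the table contents and output file name are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical sample data standing in for one exported shard/partition.
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int32()),
    "price": pa.array([10.5, 11.0, 9.75], type=pa.float64()),
})

# Compression codec and row group size are the two settings the
# proposed COPY TO options would correspond to.
pq.write_table(
    table,
    "quotes-shard-0.parquet",  # hypothetical output file name
    compression="zstd",        # alternatives: "gzip", "snappy"
    row_group_size=100_000,    # rows per row group
)
```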

The CrateDB data types should be mapped to appropriate Parquet data types, e.g. (incomplete):

| CrateDB Data Type | Parquet Data Type |
| --- | --- |
| INTEGER | INT32 |
| BIGINT | INT64 |
| SMALLINT | INT16 |
| REAL | FLOAT |
| DOUBLE PRECISION | DOUBLE |
| BOOLEAN | BOOLEAN |
| TIMESTAMP | TIMESTAMP |
| TIMESTAMP WITH TIME ZONE | TIMESTAMP |
| STRING | STRING / BYTE_ARRAY |
| TEXT | STRING / BYTE_ARRAY |
| OBJECT (DYNAMIC) | STRUCT (with nested fields representing the object schema) |
| OBJECT (STRICT) | STRUCT (with fixed fields representing the object schema) |
| OBJECT (IGNORED) | JSON |
| ARRAY | LIST |
| GEO_POINT | STRUCT (with two fields representing latitude and longitude) |
| GEO_SHAPE | STRUCT (with fields representing different components of the shape) |
| IP | BYTE_ARRAY (representing the IP address in binary form) |
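As an illustration of the mapping above, an Arrow schema for a hypothetical table could look roughly like this. This is only a sketch: the column names, nested field layouts, and the exact logical types (e.g. timestamp resolution) are assumptions, not a proposed implementation.

```python
import pyarrow as pa

# Hypothetical schema mirroring the mapping table above.
schema = pa.schema([
    ("id", pa.int32()),                       # INTEGER -> INT32
    ("counter", pa.int64()),                  # BIGINT -> INT64
    ("flag", pa.bool_()),                     # BOOLEAN -> BOOLEAN
    ("price", pa.float64()),                  # DOUBLE PRECISION -> DOUBLE
    ("ts", pa.timestamp("ms", tz="UTC")),     # TIMESTAMP WITH TIME ZONE -> TIMESTAMP
    ("name", pa.string()),                    # TEXT -> STRING / BYTE_ARRAY
    ("payload", pa.struct([                   # OBJECT (DYNAMIC/STRICT) -> STRUCT
        ("a", pa.string()),
        ("b", pa.int64()),
    ])),
    ("tags", pa.list_(pa.string())),          # ARRAY -> LIST
    ("location", pa.struct([                  # GEO_POINT -> STRUCT
        ("lon", pa.float64()),
        ("lat", pa.float64()),
    ])),
    ("client_ip", pa.binary()),               # IP -> BYTE_ARRAY
])
```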

Considered Alternatives

  • Custom application using Apache Arrow
  • Export as .json, then transform to .parquet (see the sketch below)
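The second alternative is already possible today, but type information has to be reconstructed by hand. A minimal sketch with pandas, assuming a line-delimited JSON export (the file names and column names are hypothetical):

```python
import pandas as pd

# Assumes a line-delimited JSON export, e.g. produced by the existing COPY TO.
df = pd.read_json("quotes.json", lines=True)

# Types inferred by pandas may not match the original CrateDB schema
# and may need manual fixing, e.g.:
df["id"] = df["id"].astype("int32")

# Requires pyarrow (or fastparquet) to be installed.
df.to_parquet("quotes.parquet", compression="zstd")
```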
mfussenegger added the labels "needs info or feedback", "needs concrete use-case", and "need refined description" on May 6, 2024
mfussenegger (Member) commented

Can you give some more context on concrete use cases?

COPY TO is currently designed to do concurrent per-shard exports. If you're going to use the Parquet files with other tools, I'm not sure if having 1 file per shard is appropriate?

proddata (Member, Author) commented May 6, 2024

> Can you give some more context on concrete use cases?

  • Long-term storage / archiving: efficient long-term storage of data that retains metadata and can be queried or processed on demand much faster. Gzipped JSON covers this to a degree, but such files cannot be processed easily and do not preserve the table schema or data type information.
  • Integration with data engineering ecosystems: simplifies data integration with systems like Databricks / Apache Spark, Hadoop, and cloud services (AWS S3, Azure Data Lake).
  • Machine learning: enables quick access to large datasets for feature selection and model training, reducing load on the database and preventing system slowdowns.

> If you're going to use the parquet files with other tools, I'm not sure if having 1 file per shard is appropriate?

One Parquet file per shard would be perfect (one might even consider splitting very large shards). It is very common for a dataset to be composed of multiple Parquet files (see the sketch below). One could go a step further and also think about supporting Hive-like partitioning.
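To illustrate why one file per shard is unproblematic for consumers: most Parquet readers treat a directory of files as a single dataset. A pyarrow sketch, where the directory layout, file names, and partition column are assumptions:

```python
import pyarrow.dataset as ds

# A flat directory of per-shard files,
# e.g. /tmp/quotes/shard-0.parquet, /tmp/quotes/shard-1.parquet, ...
dataset = ds.dataset("/tmp/quotes/", format="parquet")
table = dataset.to_table()

# If the export instead used a Hive-like layout,
# e.g. /tmp/quotes/month=2024-05/shard-0.parquet, the partition column
# becomes queryable and enables partition pruning.
partitioned = ds.dataset("/tmp/quotes/", format="parquet", partitioning="hive")
may_2024 = partitioned.to_table(filter=ds.field("month") == "2024-05")
```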
