Open
Labels: enhancement (New feature or request)
Description
What is the problem the feature request solves?
When using the native_datafusion or native_iceberg_compat Parquet readers, which are based on DataFusion's DataSourceExec, the schemas that Comet passes in result in dictionaries being unpacked immediately.
Describe the potential solution
Arrow-rs will use a provided schema as a hint and, in the case of dictionary-encoded columns, preserve the encoding:
https://github.com/apache/arrow-rs/blob/880be2f0a0b9675d8b42206e70543472a58792aa/parquet/src/arrow/schema/primitive.rs#L91
The challenge is similar to the one with int96: the native side doesn't really have the Parquet schema when generating the DataSourceExec. We'd either need to pass this information from the Spark side, early on when the schema is first read, or add a coercion rule to DataFusion.
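
To make the first option concrete, here is a minimal, std-only sketch of the schema rewriting it implies. Everything in it is hypothetical: ColumnType stands in for Arrow's data types, ParquetColumn for whatever per-column information could be forwarded from the Spark side, and hinted_schema is not a Comet, DataFusion, or arrow-rs function. The Int32 key type is also an assumption made for illustration.

```rust
/// Simplified stand-in for an Arrow data type (hypothetical, not the arrow-rs type).
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    Int32,
    Utf8,
    /// Dictionary-encoded column: (key type, value type).
    Dictionary(Box<ColumnType>, Box<ColumnType>),
}

/// Per-column information that would have to be known when the scan is planned.
/// Assumption: the Spark side can forward whether a column is dictionary encoded.
struct ParquetColumn {
    name: String,
    logical_type: ColumnType,
    dictionary_encoded: bool,
}

/// Build the schema to pass to the reader as a hint. Columns flagged as
/// dictionary encoded are wrapped in a Dictionary type so that the hint asks
/// the reader to keep the encoding; all other columns keep their plain type.
fn hinted_schema(columns: &[ParquetColumn]) -> Vec<(String, ColumnType)> {
    columns
        .iter()
        .map(|col| {
            let ty = if col.dictionary_encoded {
                // Int32 keys are an illustrative choice, not a requirement.
                ColumnType::Dictionary(
                    Box::new(ColumnType::Int32),
                    Box::new(col.logical_type.clone()),
                )
            } else {
                col.logical_type.clone()
            };
            (col.name.clone(), ty)
        })
        .collect()
}
```

In this sketch a Utf8 column flagged as dictionary encoded comes out as Dictionary(Int32, Utf8), while an unflagged Int32 column is left unchanged. The hard part described above, getting the dictionary_encoded flag to the native side before the DataSourceExec is built, is deliberately not modeled here.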