[BUG] Fix reading partition key columns in DeltaLake #2118
Conversation
    .pushdowns
    .columns
    .as_ref()
-   .map(|v| v.iter().map(|s| s.as_ref()).collect::<Vec<&str>>());
+   .map(|v| v.iter().map(|s| s.as_str()).collect::<Vec<&str>>());
let file_column_names =
You likely need to apply this logic to the eager code path as well, e.g. if you passed in a predicate.
This should be fixed in this section: https://github.com/Eventual-Inc/Daft/pull/2118/files#diff-1d03b58906d515cc349a8a414595727e787d60af7adb3ab9fb3db3558c9e7e27R789-R808
Unfortunately our code is a little jank right now, with 2 main codepaths:
- MicroPartition::from_scan_task either eagerly produces loaded micropartitions using read_parquet_into_micropartition, or lazily produces unloaded micropartitions
- Lazy micropartitions then load data using materialize_scan_task

So any logic often has to be duplicated across those two functions (read_parquet_into_micropartition and materialize_scan_task). Perhaps there is some opportunity to consolidate this by having read_parquet_into_micropartition call materialize_scan_task under the hood, but that might be a refactor for another PR?
I think a simpler way would be to just wrap read_parquet_bulk in this crate.
Could you elaborate? Both read_parquet_into_micropartition and materialize_scan_task do currently wrap read_parquet_bulk.
I think that @samster25 was talking about creating a read_parquet_bulk_with_virtual (or similarly named) utility in this micropartition.rs module that does all of the virtual column + pushdown handling before calling read_parquet_bulk, shared (called) by both the eager and lazy paths, so we don't have to duplicate this handling logic. However, that would require duplicating the read_parquet_bulk API, which is pretty large, and we also wouldn't have coverage for other file types (e.g. if we added Avro or ORC support to Iceberg reads). @samster25 am I misunderstanding your suggestion?
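For concreteness, a rough std-only sketch of the shape being suggested; everything below is a stand-in rather than Daft's real API (Table and read_parquet_bulk here are stubs, and read_parquet_bulk_with_virtual is only the hypothetical helper name from the comment above):

```rust
use std::collections::HashMap;

// Stand-in types; the real read_parquet_bulk has a much larger parameter list,
// which is the duplication concern raised above.
struct Table;

fn read_parquet_bulk(uris: &[&str], columns: Option<&[&str]>) -> Vec<Table> {
    // Stub: pretend we read the requested physical columns from each file.
    let _ = columns;
    uris.iter().map(|_| Table).collect()
}

// Hypothetical shared helper: do the virtual-column + pushdown handling once,
// then delegate to read_parquet_bulk. Both the eager path
// (read_parquet_into_micropartition) and the lazy path (materialize_scan_task)
// would call this instead of duplicating the handling.
fn read_parquet_bulk_with_virtual(
    uris: &[&str],
    requested_columns: Option<&[&str]>,
    partition_values: &HashMap<String, String>, // partition column -> value for these files
) -> Vec<Table> {
    // Partition-key columns are not present in the physical Parquet files,
    // so they must not be pushed down to the reader.
    let physical_columns: Option<Vec<&str>> = requested_columns.map(|cols| {
        cols.iter()
            .copied()
            .filter(|c| !partition_values.contains_key(*c))
            .collect()
    });
    let tables = read_parquet_bulk(uris, physical_columns.as_deref());
    // ...the partition columns would then be appended to each table as
    // constant-valued "virtual" columns before returning...
    tables
}
```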
We do already have a shared _read_parquet_into_loaded_micropartition utility for read_parquet_into_micropartition; maybe we should refactor materialize_scan_task to call out to similar helpers for each file format.
> I think that @samster25 was talking about creating a read_parquet_bulk_with_virtual (or similarly named) utility in this micropartition.rs module that does all of the virtual column + pushdown handling before calling read_parquet_bulk, shared (called) by both the eager and lazy paths, so we don't have to duplicate this handling logic.

Yeah exactly!
@@ -82,3 +82,40 @@ def test_daft_iceberg_table_renamed_column_pushdown_collect_correct(local_iceber
    iceberg_pandas = tab.scan().to_arrow().to_pandas()
    iceberg_pandas = iceberg_pandas[["idx_renamed"]]
    assert_df_equals(daft_pandas, iceberg_pandas, sort_key=[])


@pytest.mark.integration()
add 2 tests:
- select only the partition column with filter on it
- select only the partition column with filter on another column
Added the two requested tests:
- This fails, because reading just the partition column still fails after this PR (we don't have a great way of "reading [] columns from a file" at the moment)
- This passes
Fixes pushdowns for column selection on partition keys in DeltaLake and Iceberg.
When table formats such as Iceberg and Delta Lake store the data for a partition column, they strip that column from the actual Parquet data files that they write out. When we then want to select only specific columns, our Parquet reader fails because it cannot find those columns in the file.
NOTE: Iceberg seems to only do this for identity-transformed partition columns
Follow-on issue for selection of only the partition keys: #2129
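To illustrate the mechanism, here is a minimal std-only sketch; this is toy code rather than Daft's actual reader, and the read_columns helper, column names, and values are all made up:

```rust
use std::collections::HashMap;

// Toy model of reading columns from a file in an identity-partitioned table:
// the Parquet file only stores the non-partition columns, so a requested
// partition-key column has to be reconstructed as a constant from the
// partition values in the table metadata instead of being read from the file.
fn read_columns(
    requested: &[&str],
    file_columns: &HashMap<String, Vec<i64>>,   // columns physically in the Parquet file
    partition_values: &HashMap<String, String>, // partition key -> value for this file
    num_rows: usize,
) -> Result<Vec<(String, Vec<String>)>, String> {
    requested
        .iter()
        .map(|&name| {
            if let Some(values) = file_columns.get(name) {
                // Column exists in the file: read it as usual.
                Ok((name.to_string(), values.iter().map(|v| v.to_string()).collect()))
            } else if let Some(value) = partition_values.get(name) {
                // Partition key stripped from the file: backfill a constant column.
                Ok((name.to_string(), vec![value.clone(); num_rows]))
            } else {
                // This is the failure mode described above when partition keys
                // are naively pushed down to the Parquet reader.
                Err(format!("column {name} not found in file"))
            }
        })
        .collect()
}

fn main() {
    // A file in partition part_day=2024-03-01: the "part_day" column is not
    // written into the Parquet file itself.
    let file_columns = HashMap::from([("idx".to_string(), vec![1_i64, 2, 3])]);
    let partition_values = HashMap::from([("part_day".to_string(), "2024-03-01".to_string())]);

    let result = read_columns(&["idx", "part_day"], &file_columns, &partition_values, 3);
    println!("{result:?}");
}
```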