-
Notifications
You must be signed in to change notification settings - Fork 310
Leverage Iceberg-Rust for all the transforms #1833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fab99e3
to
edbac4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I fixed CI and added pyiceberg-core
to the pyarrow
install group for poetry
[tool.poetry.extras]
pyarrow = ["pyarrow", "pyiceberg-core"]
@@ -289,7 +289,7 @@ generate-setup-file = false | |||
script = "build-module.py" | |||
|
|||
[tool.poetry.extras] | |||
pyarrow = ["pyarrow"] | |||
pyarrow = ["pyarrow", "pyiceberg-core"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically this is not required, because we only use the transforms when writing to a partitioned table. But I think that it might lead to a lot of confusion if we don't do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed i was considering this too and came to the same conclusion :)
Also I think the pyarrow_transform
functions are used on the read path too
iceberg-python/pyiceberg/io/pyarrow.py
Line 2696 in a67c559
name, partition.transform.pyarrow_transform(source_field.field_type)(arrow_table[source_field.name]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think the pyarrow_transform functions are used on the read path too
For completeness, I don't think that's true. _determine_partitions
is only used by _dataframe_to_data_files
. When reading, we often have a single value SELECT * FROM tbl WHERE created_at > '2025-01-01 19:25:00'
, so there performance is not that important, and we use non-Arrow transform (example for the MonthPartitioning).
Rationale for this change
Testing out to use Iceberg Rust for all of the transforms. I think we have some rounding error in apache/iceberg-rust#1128
Closes #1591
Are these changes tested?
Are there any user-facing changes?