Skip to content

Leverage Iceberg-Rust for all the transforms #1833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jun 7, 2025

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Mar 23, 2025

Rationale for this change

Testing out to use Iceberg Rust for all of the transforms. I think we have some rounding error in apache/iceberg-rust#1128

Closes #1591

Are these changes tested?

Are there any user-facing changes?

@Fokko Fokko changed the title Test with Iceberg-Rust Leverage Iceberg-Rust for all the transforms Mar 28, 2025
@Fokko Fokko marked this pull request as ready for review June 6, 2025 09:18
@Fokko Fokko force-pushed the fd-use-iceberg-rust branch from fab99e3 to edbac4d Compare June 6, 2025 09:19
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I fixed CI and added pyiceberg-core to the pyarrow install group for poetry

[tool.poetry.extras]
pyarrow = ["pyarrow", "pyiceberg-core"]

@@ -289,7 +289,7 @@ generate-setup-file = false
script = "build-module.py"

[tool.poetry.extras]
pyarrow = ["pyarrow"]
pyarrow = ["pyarrow", "pyiceberg-core"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically this is not required, because we only use the transforms when writing to a partitioned table. But I think that it might lead to a lot of confusion if we don't do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed i was considering this too and came to the same conclusion :)

Also I think the pyarrow_transform functions are used on the read path too

name, partition.transform.pyarrow_transform(source_field.field_type)(arrow_table[source_field.name])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think the pyarrow_transform functions are used on the read path too

For completeness, I don't think that's true. _determine_partitions is only used by _dataframe_to_data_files. When reading, we often have a single value SELECT * FROM tbl WHERE created_at > '2025-01-01 19:25:00', so there performance is not that important, and we use non-Arrow transform (example for the MonthPartitioning).

@kevinjqliu kevinjqliu merged commit f507dbd into apache:main Jun 7, 2025
10 checks passed
@Fokko Fokko deleted the fd-use-iceberg-rust branch June 8, 2025 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[test] Improve partition transform test
2 participants