Upgrade to Arrow 13.0.0 #368

Merged
adamreeve merged 6 commits into G-Research:master from the arrow-13 branch
Aug 30, 2023

Contributor

@adamreeve adamreeve commented Aug 30, 2023

This upgrades the version of the Arrow C++ library to 13.0.0.

The notable changes in this version are:

  • Writing the page index can now be controlled per column
  • The Parquet version written now defaults to 2.6. This version adds nanosecond timestamp precision, but ParquetSharp was already using it, so the change doesn't seem to affect us beyond the written metadata being different
  • Changes to how the column encodings are computed affect the order in which they are returned, but that order should never have been relied on
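
Since the order of the returned encodings isn't guaranteed, any check on them is safer as an order-insensitive comparison. A minimal sketch of the idea in Python; the encoding lists are made-up illustrative values and `same_encodings` is a hypothetical helper, neither is taken from ParquetSharp or Arrow:

```python
# Hypothetical encoding lists for the same column, as they might be reported
# before and after the upgrade. The values are illustrative only.
encodings_before = ["PLAIN", "RLE", "RLE_DICTIONARY"]
encodings_after = ["RLE_DICTIONARY", "PLAIN", "RLE"]


def same_encodings(a, b):
    """Compare two encoding lists while ignoring order and duplicates."""
    return set(a) == set(b)


# A plain list comparison depends on the order and so breaks after the upgrade.
order_sensitive = encodings_before == encodings_after

# A set comparison only checks which encodings are present.
order_insensitive = same_encodings(encodings_before, encodings_after)
```

With these values the list comparison is false and the set comparison is true, which is the behaviour a test of the encodings would want.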

Comparing benchmark results between the current master and this branch, the timings are mostly the same. Some of the "chunked" writing tests slow down a little (5 to 10%), which suggests there may be slightly more per-row-group overhead. The nested write test slows down considerably though, going from 47 ms to 85 ms per iteration on my machine. I looked into this and have opened a PR to fix it in Arrow (apache/arrow#37454), so hopefully performance will recover in the next version. I don't think this regression is bad enough to block upgrading to 13.0.0, as it looks like it should only affect writing nested data, where we make lots of calls to WriteBatch with small batches of data. (Possibly this could be refactored to avoid so many WriteBatch calls by buffering the data first?)
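
To illustrate the buffering idea, here is a rough sketch in Python rather than the actual ParquetSharp code. `CountingWriter` and `BufferedWriter` are hypothetical classes invented for this example; a real nested column writer would also need to buffer definition and repetition levels alongside the values.

```python
from typing import List


class CountingWriter:
    """Hypothetical stand-in for a column writer; it only records calls."""

    def __init__(self) -> None:
        self.calls = 0
        self.values: List[int] = []

    def write_batch(self, batch: List[int]) -> None:
        self.calls += 1
        self.values.extend(batch)


class BufferedWriter:
    """Accumulates small batches and forwards them in larger chunks."""

    def __init__(self, inner: CountingWriter, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.inner = inner
        self.capacity = capacity
        self.buffer: List[int] = []

    def write_batch(self, batch: List[int]) -> None:
        self.buffer.extend(batch)
        # Forward full chunks only, keeping any remainder for later.
        while len(self.buffer) >= self.capacity:
            self.inner.write_batch(self.buffer[: self.capacity])
            del self.buffer[: self.capacity]

    def flush(self) -> None:
        # Forward whatever is left; must be called once writing is done.
        if self.buffer:
            self.inner.write_batch(self.buffer)
            self.buffer = []


# Writing 100 single-value batches directly costs 100 underlying calls.
direct = CountingWriter()
for i in range(100):
    direct.write_batch([i])

# The same writes through a buffer of 32 values cost 4 underlying calls.
inner = CountingWriter()
buffered = BufferedWriter(inner, capacity=32)
for i in range(100):
    buffered.write_batch([i])
buffered.flush()
```

The values reach the underlying writer in the same order either way; only the number of calls changes. Whether this helps in practice would depend on the per-call overhead in Arrow, which this sketch does not measure.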

The native method name for getting whether page index writing is enabled
was changed in Arrow 13, and it can now be enabled and disabled per column
Contributor

@marcin-krystianc marcin-krystianc left a comment

Looks good 👍

@adamreeve adamreeve merged commit 05e4a7f into G-Research:master Aug 30, 2023
26 checks passed
@adamreeve adamreeve deleted the arrow-13 branch August 30, 2023 22:17

2 participants