Improve upsert memory pressure #1995
Conversation
…M when updating large tables
…all filter expressions. Prevents memory pressure due to large filters
fwiw I think we should try to get this merged in at some point. Some ideas:
I've been thinking about what I (as a developer) want. The answer is: set max memory usage. Some ideas:
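One way to read the "set max memory usage" idea is to cap how much data is materialized at a time by grouping record batches under a byte budget. This is only a sketch of that idea, not anything in this PR; `bounded_batches` and `max_bytes` are hypothetical names, while `pyarrow.RecordBatchReader` and `RecordBatch.nbytes` are real pyarrow APIs.

```python
import pyarrow as pa

def bounded_batches(reader: pa.RecordBatchReader, max_bytes: int):
    """Yield groups of record batches whose combined size stays under max_bytes."""
    chunk: list[pa.RecordBatch] = []
    size = 0
    for batch in reader:
        if chunk and size + batch.nbytes > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(batch)
        size += batch.nbytes
    if chunk:
        yield chunk
```

Each yielded group could then be turned into a small table with pa.Table.from_batches(...) and processed on its own, which keeps peak memory roughly proportional to the budget rather than to the table size.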
Did an update and ran a quick benchmark with different `concurrent_tasks` values:

```python
import pyarrow as pa
import tqdm

pool = pa.default_memory_pool()           # assumed setup, not shown in the original snippet
table = catalog.get_table("some_table")   # catalog is loaded elsewhere in the application

# Benchmark loop
p = table.scan().to_arrow_batch_reader(concurrent_tasks=100)
for batch in tqdm.tqdm(p):
    pass

print(pool.max_memory())
```

Results (including …
Did another update to get rid of the … I also refactored …
Typically, I'm not a big fan of this kind of flag, since it will delegate the responsibility to the user, and not every user might know the implications. Instead, let's optimize for the best experience out of the box.
@koenvo Looks like there are some issues with the …
I was kind of scared by the following:

```python
if property_as_bool(self._io.properties, PYARROW_USE_LARGE_TYPES_ON_READ, False):
    deprecation_message(
        deprecated_in="0.10.0",
        removed_in="0.11.0",
        help_message=f"Property `{PYARROW_USE_LARGE_TYPES_ON_READ}` will be removed.",
    )
    result = result.cast(arrow_schema)
```

Currently, I do the cast in my own application code, as I ran into the same issue but thought it was just my implementation. Had to add the cast. And happy to look into this.
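For context, a minimal sketch of what such an application-side cast can look like, assuming made-up column names and a table that came back with large_* types on read (pa.Table.cast is the real pyarrow call):

```python
import pyarrow as pa

# Hypothetical result of a read that produced large_string instead of string.
result = pa.table({
    "id": pa.array([1, 2], pa.int64()),
    "name": pa.array(["a", "b"], pa.large_string()),
})

# The schema the application actually expects.
expected_schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Cast the whole table back; raises if a column cannot be converted.
result = result.cast(expected_schema)
```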
Two minor comments, this looks great 👍
Use arrow_schema.empty_table() Co-authored-by: Fokko Driesprong <fokko@apache.org>
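A quick illustration of the empty_table() suggestion from the commit above, with a made-up schema: it yields a zero-row table that still carries the correct columns and types, which is a convenient fallback when no rows are produced.

```python
import pyarrow as pa

arrow_schema = pa.schema([("id", pa.int64()), ("name", pa.string())])
empty = arrow_schema.empty_table()  # 0-row table with the same schema
assert empty.num_rows == 0 and empty.schema == arrow_schema
```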
Moving this forward, this looks great @koenvo 🙌
Summary
This PR updates the upsert logic to use batch processing. The main goal is to prevent out-of-memory (OOM) issues when updating large tables by avoiding loading all data at once.
Note: This has only been tested against the unit tests; no real-world datasets have been evaluated yet.
This PR partially depends on functionality introduced in #1817.
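A simplified sketch of the batch-based idea, to make the summary concrete. This is not the PR's actual code: `batch_filters`, `join_cols`, and `batch_size` are assumptions, while the pyiceberg expression classes and pyarrow calls are real.

```python
import pyarrow as pa
from pyiceberg.expressions import And, EqualTo, Or

def batch_filters(source: pa.Table, join_cols: list[str], batch_size: int = 10_000):
    """Yield (batch, row_filter) pairs so matching target rows can be read per batch."""
    for batch in source.to_batches(max_chunksize=batch_size):
        row_exprs = []
        for row in batch.to_pylist():
            preds = [EqualTo(col, row[col]) for col in join_cols]
            row_exprs.append(And(*preds) if len(preds) > 1 else preds[0])
        if not row_exprs:
            continue  # skip empty batches
        yield batch, Or(*row_exprs) if len(row_exprs) > 1 else row_exprs[0]
```

Each filter could then drive a scan over just the rows that might match the current batch, e.g. something like target.scan(row_filter=expr).to_arrow(), so memory stays proportional to the batch rather than to the whole target table.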
Notes
All data is read sequentially, which may be slower than the parallel read used by `to_arrow`. Fixed using the `concurrent_tasks` parameter.
Performance Comparison
In setups with many small files, network and metadata overhead become the dominant factor. This impacts batch reading performance, as each file contributes relatively more overhead than payload. In the test setup used here, metadata access was the largest cost.
Using `to_arrow_batch_reader` (sequential):

Using `to_arrow` (parallel):
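The measured numbers are not preserved above. For reference, a comparison along these lines can be reproduced roughly as follows; `table` is a pyiceberg table loaded elsewhere, and each variant is best run in its own process because pyarrow's max_memory() is a process-wide high-water mark.

```python
import time
import pyarrow as pa

start = time.perf_counter()

# Variant A: sequential, batch-by-batch read.
for batch in table.scan().to_arrow_batch_reader():
    pass

# Variant B (run in a separate process): parallel read that materializes everything.
# tbl = table.scan().to_arrow()

elapsed = time.perf_counter() - start
peak_mb = pa.default_memory_pool().max_memory() / 1e6
print(f"{elapsed:.1f}s, peak Arrow allocation {peak_mb:.0f} MB")
```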