### Summary
This PR updates the upsert logic to use batch processing. The main goal
is to prevent out-of-memory (OOM) errors when upserting into large tables
by avoiding loading all of the table's data into memory at once.
**Note:** This has only been tested against the unit tests—no real-world
datasets have been evaluated yet.
This PR partially depends on functionality introduced in
[#1817](apache/iceberg#1817).
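
For illustration, this is the shape of the change: instead of materializing
the whole scan with `to_arrow`, the read path iterates an Arrow record batch
reader so only one batch is resident at a time. This is a minimal sketch,
not the PR's code; the catalog name, table name, and filter below are
hypothetical, while `load_catalog`, `Table.scan`, and
`to_arrow_batch_reader` are existing PyIceberg APIs.

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # hypothetical catalog
table = catalog.load_table("db.events")    # hypothetical table

scan = table.scan(row_filter="id >= 0")    # hypothetical filter
# Before: scan.to_arrow() materializes the full table in memory.
# After: iterate record batches, keeping one batch resident at a time.
for batch in scan.to_arrow_batch_reader():
    # Only this batch is in memory; the upsert's match/update
    # logic would operate on it here.
    print(batch.num_rows)
```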
---
### Notes
- Duplicate detection across multiple batches is **not** possible with
this approach: each batch is inspected independently, so duplicates that
span batch boundaries go undetected.
- ~All data is read sequentially, which may be slower than the parallel
read used by `to_arrow`.~ Fixed using the `concurrent_tasks` parameter
(see the sketch after this list).
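
As a rough illustration of what a `concurrent_tasks` knob controls (this is
an assumption-laden sketch, not the PR's implementation; `file_tasks` and
`read_file` are hypothetical placeholders, not PyIceberg APIs): per-file
reads can be overlapped with a bounded thread pool, improving throughput
without unbounded memory growth.

```python
from concurrent.futures import ThreadPoolExecutor

def read_batches(file_tasks, read_file, concurrent_tasks=4):
    """Yield batches from `file_tasks`, reading up to `concurrent_tasks`
    files in parallel. Both arguments are placeholders for this sketch."""
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
        # map() preserves input order while overlapping the I/O;
        # at most `concurrent_tasks` reads are in flight at once.
        yield from pool.map(read_file, file_tasks)
```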
---
### Performance Comparison
In setups with many small files, network and metadata overhead become
the dominant cost. This hurts batch reading in particular, since each
file contributes relatively more overhead than payload. In the test
setup used here, metadata access was the largest single cost.
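
For reference, numbers like the ones below can be gathered with a simple
wall-clock harness along these lines (the harness itself is an assumption,
not code from the PR; `table` is assumed to be a loaded PyIceberg table):

```python
import time

# Batch-reader path (sequential):
t0 = time.perf_counter()
reader = table.scan().to_arrow_batch_reader()  # plan the scan, open the reader
print(f"Scan:    {(time.perf_counter() - t0) * 1000:.2f} ms")

t0 = time.perf_counter()
batches = list(reader)  # drain the reader; the actual row reads happen here
print(f"To list: {(time.perf_counter() - t0) * 1000:.2f} ms")

# Full-materialization path (parallel):
t0 = time.perf_counter()
full_table = table.scan().to_arrow()
print(f"Scan:    {(time.perf_counter() - t0) * 1000:.2f} ms")
```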
#### Using `to_arrow_batch_reader` (sequential):
- **Scan:** 9993.50 ms
- **To list:** 19811.09 ms
#### Using `to_arrow` (parallel):
- **Scan:** 10607.88 ms
---------
Co-authored-by: Fokko Driesprong <fokko@apache.org>