[incremental scan] Support for equality deletes #60
Conversation
| /// Spawns a concurrent task to collect all manifest entries from the `from_snapshot`.
| /// Returns a receiver that will yield the manifest entries as they are collected.
| /// Errors are sent through `error_tx`.
| fn spawn_baseline_file_collection(
@vustef this is the code that reads metadata about the `from_snapshot`. I could not really reuse code from the full scan here, because the manifest reading logic is tightly coupled to `plan_files`. It should not hurt here, though.
I don't see, though, how this function differs from any other function that reads metadata about a snapshot.
Could `from_snapshot` be a parameter, so that the function is not special-cased and tied to the "baseline file collection"?
As said, the way metadata is read in the full scan is tightly coupled to the specific manifest file context types, etc.
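The spawn-and-stream shape described in the doc comment can be sketched roughly as follows. This is a minimal illustration using std threads and channels rather than the crate's async runtime; `ManifestEntry`, `spawn_entry_collection`, and the pre-collected `entries` input are stand-ins, not the real types or API.

```rust
use std::sync::mpsc;
use std::thread;

#[derive(Debug, PartialEq)]
struct ManifestEntry {
    path: String,
}

/// Spawns a task that streams entries to a receiver; errors go out of band
/// through `error_tx`, mirroring the pattern in the doc comment above.
fn spawn_entry_collection(
    entries: Vec<Result<ManifestEntry, String>>,
    error_tx: mpsc::Sender<String>,
) -> mpsc::Receiver<ManifestEntry> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for entry in entries {
            match entry {
                Ok(e) => {
                    if tx.send(e).is_err() {
                        break; // receiver dropped; stop producing
                    }
                }
                Err(err) => {
                    let _ = error_tx.send(err); // report the error and stop
                    break;
                }
            }
        }
    });
    rx
}

fn main() {
    let (error_tx, error_rx) = mpsc::channel();
    let entries = vec![
        Ok(ManifestEntry { path: "m1.avro".into() }),
        Ok(ManifestEntry { path: "m2.avro".into() }),
    ];
    let rx = spawn_entry_collection(entries, error_tx);
    let collected: Vec<ManifestEntry> = rx.iter().collect();
    assert_eq!(collected.len(), 2);
    assert!(error_rx.try_recv().is_err()); // no errors were reported
    println!("collected {} entries", collected.len());
}
```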
| if !equality_deletes.is_empty() {
| // The predicate from build_combined_equality_delete_predicate is a "survival"
| // filter (keeps non-deleted rows). Negate it to select rows TO DELETE.
| let survival_predicate = delete_filter
Notice that for the full scan, the equality-delete predicate is built at read time in the Arrow reader. Since the incremental scan builds the delete index beforehand, however, we can compile the predicate here at plan time.
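The negation described in the code comment can be illustrated with a toy predicate type. This is a hypothetical sketch, not the iceberg-rust predicate API: a "survival" filter keeps rows that are not deleted, and negating it at plan time yields the set of rows to delete.

```rust
// Toy predicate type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    Eq(&'static str, i64),
    Not(Box<Predicate>),
}

impl Predicate {
    /// Logical negation; a double negation cancels out.
    fn negate(self) -> Predicate {
        match self {
            Predicate::Not(inner) => *inner,
            other => Predicate::Not(Box::new(other)),
        }
    }
}

fn main() {
    // Survival filter: keep rows where NOT (id = 42), i.e. non-deleted rows.
    let survival = Predicate::Not(Box::new(Predicate::Eq("id", 42)));
    // At plan time, flip it to select exactly the rows TO DELETE.
    let delete = survival.negate();
    assert_eq!(delete, Predicate::Eq("id", 42));
    println!("{:?}", delete);
}
```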
vustef
left a comment
First round; I will review the tests later. Thanks, Gerald.
vustef
left a comment
Went through everything now. My main concern is the code duplication between all these different tasks (the full scan and all the incremental ones). Let's discuss after you go through the comments.
| // Three-branch strategy matching Java's ReadConf constructor:
| // Branch 1: file has embedded field IDs → use as-is
| // Branch 2: name_mapping present → apply name mapping, reopen
| // Branch 3: fallback → assign position-based IDs, reopen
Not sure if this comment is needed, since it's already stated on the function. Otherwise, I expected to see this logic here.
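The three-branch selection the comment describes can be sketched as a simple decision function. This is illustrative only; the enum and function names below are not the actual iceberg-rust (or Java ReadConf) types.

```rust
// Hypothetical stand-in for the three projection strategies.
#[derive(Debug, PartialEq)]
enum FieldIdStrategy {
    EmbeddedIds,   // Branch 1: the Parquet file carries Iceberg field IDs, use as-is
    NameMapping,   // Branch 2: map column names to field IDs, then reopen
    PositionBased, // Branch 3: fallback, assign IDs by column position, then reopen
}

fn choose_strategy(has_field_ids: bool, has_name_mapping: bool) -> FieldIdStrategy {
    if has_field_ids {
        FieldIdStrategy::EmbeddedIds
    } else if has_name_mapping {
        FieldIdStrategy::NameMapping
    } else {
        FieldIdStrategy::PositionBased
    }
}

fn main() {
    // Embedded IDs win even when a name mapping is also present.
    assert_eq!(choose_strategy(true, true), FieldIdStrategy::EmbeddedIds);
    assert_eq!(choose_strategy(false, true), FieldIdStrategy::NameMapping);
    assert_eq!(choose_strategy(false, false), FieldIdStrategy::PositionBased);
    println!("ok");
}
```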
| ) -> Result<ParquetRecordBatchStreamBuilder<ArrowFileReader>> {
| // Metadata fields (e.g. _file, _pos) are virtual — they don't exist as Parquet columns.
| // Filter them out so get_arrow_projection_mask only sees real schema field IDs.
| let real_field_ids: Vec<i32> = field_ids
nit: perhaps we should keep the names, in this case `project_field_ids_without_metadata`, so that it's easier to correlate this with the previous code / upstream.
I know this is the difference, but I think we can control this difference through a parameter, which might allow us to skip this step. We either get `delete_predicate` as a parameter in a common function, or we get a receiver.
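The filtering step being discussed can be sketched as follows. The two reserved IDs come from the Iceberg table spec (`_file` = 2147483646, `_pos` = 2147483645); the helper name is illustrative, not the crate's API, and real code would match against the full set of reserved metadata IDs.

```rust
// Reserved Iceberg field IDs for virtual metadata columns (per the table spec).
const FILE_PATH_FIELD_ID: i32 = 2147483646; // `_file`
const ROW_POSITION_FIELD_ID: i32 = 2147483645; // `_pos`

/// Drops virtual metadata field IDs so the projection mask only ever sees
/// IDs that exist as real Parquet columns.
fn real_field_ids(field_ids: &[i32]) -> Vec<i32> {
    field_ids
        .iter()
        .copied()
        .filter(|id| *id != FILE_PATH_FIELD_ID && *id != ROW_POSITION_FIELD_ID)
        .collect()
}

fn main() {
    let requested = vec![1, 2, FILE_PATH_FIELD_ID, ROW_POSITION_FIELD_ID];
    assert_eq!(real_field_ids(&requested), vec![1, 2]);
    println!("{:?}", real_field_ids(&requested));
}
```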
| let (iceberg_field_ids, field_id_map) =
|     Self::build_field_id_set_and_map(builder.parquet_schema(), bound_predicate)?;
|
| if let Some(use_fallback) = projection {
I don't get this: so incremental always uses the fallback? Why? And why do we call this a fallback?
Incremental should not always use this fallback; it's parameterized by `has_missing_field_ids`. From Claude:

So "fallback" is not about incremental code at all; it refers to position-based projection for migrated tables that lack embedded Parquet field IDs, as opposed to the normal field-ID-based projection. It corresponds directly to `has_missing_field_ids`.
Breaking it down:
- `projection = None` → skip projection entirely (append task: projection is handled separately afterwards via `apply_projection`)
- `projection = Some(false)` → apply field-ID-based projection (the file has embedded field IDs, the normal case)
- `projection = Some(true)` → apply position-based projection (the file lacks field IDs, i.e. `has_missing_field_ids = true`, a migrated table)
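The three-way `Option<bool>` semantics above can be made concrete with a small match. This is a hypothetical sketch; the function name and the descriptions are illustrative, not the crate's API.

```rust
/// Describes what each value of the `Option<bool>` projection flag means.
fn describe_projection(projection: Option<bool>) -> &'static str {
    match projection {
        // Append task: projection is applied separately via apply_projection.
        None => "skip projection here",
        // Normal case: the file has embedded Iceberg field IDs.
        Some(false) => "field-ID-based projection",
        // Migrated table: has_missing_field_ids = true, fall back to positions.
        Some(true) => "position-based fallback projection",
    }
}

fn main() {
    assert_eq!(describe_projection(None), "skip projection here");
    assert_eq!(describe_projection(Some(false)), "field-ID-based projection");
    assert_eq!(describe_projection(Some(true)), "position-based fallback projection");
    println!("ok");
}
```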
| }
| };
|
| // There are three possible sources for potential lists of selected RowGroup indices,
TODO: we lost this comment; will reinsert.
https://relationalai.atlassian.net/browse/RAI-44214
+
https://relationalai.atlassian.net/browse/RAI-44483