@gbrgr gbrgr commented Oct 23, 2025

Closes RAI-43289.
Closes RAI-43292.

Incremental Scan Implementation

Summary

This PR introduces Incremental Scan functionality to the Iceberg Rust implementation, enabling efficient querying of changes between table snapshots. Incremental scans return the net changes (appends and deletes) between two snapshots, which is essential for incremental data processing workflows, change data capture (CDC), and efficient data pipeline operations.

Key Features

Incremental Scan API

  • New IncrementalScan builder with fluent API for configuring scans between snapshots
  • Support for scanning changes from a starting snapshot to an ending snapshot
  • Returns separate streams for appended data and positional deletes
  • Implements column projection via .select() for efficient data retrieval
  • Configurable batch size via .with_batch_size() for memory optimization

File Path Tracking

  • Automatically adds a reserved _file column, holding the source Parquet file path, to every delete record batch
  • Uses RunEndEncoded arrays for memory-efficient file path storage in non-empty batches
  • Includes proper field metadata with reserved field ID (-2048)
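
To illustrate why run-end encoding suits a column whose value is the same for every row of a batch, here is a minimal std-only sketch. It does not use the Arrow crate or the PR's add_file_path_column(); the RunEndEncoded struct and its methods below are hypothetical stand-ins. Note that, per the test list in this PR, empty batches use a plain StringArray rather than this encoding.

```rust
/// Std-only stand-in for a run-end encoded string column (hypothetical type,
/// not Arrow's RunArray).
#[derive(Debug, PartialEq)]
pub struct RunEndEncoded {
    /// Exclusive end index of each run, in ascending order.
    pub run_ends: Vec<i32>,
    /// One stored value per run.
    pub values: Vec<String>,
}

impl RunEndEncoded {
    /// Encodes a column in which all `num_rows` rows hold the same `value`.
    /// The path string is stored once, regardless of the row count.
    pub fn constant(value: &str, num_rows: usize) -> Self {
        if num_rows == 0 {
            return Self { run_ends: Vec::new(), values: Vec::new() };
        }
        let end = i32::try_from(num_rows).expect("row count must fit in i32");
        Self { run_ends: vec![end], values: vec![value.to_string()] }
    }

    /// Logical number of rows represented by the encoding.
    pub fn len(&self) -> usize {
        self.run_ends.last().map_or(0, |end| *end as usize)
    }

    /// Looks up the logical value at `row`, or `None` if out of range.
    pub fn get(&self, row: usize) -> Option<&str> {
        // Index of the first run whose exclusive end lies beyond `row`.
        let run = self.run_ends.partition_point(|end| (*end as usize) <= row);
        self.values.get(run).map(String::as_str)
    }
}
```

A batch of 1024 delete rows from one file therefore stores a single run end and a single path string instead of 1024 copies of the path.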

Net Change Computation

  • Computes net effects between snapshots: rows added then deleted within the range don't appear in results
  • Efficiently processes only the data files and delete files that changed between snapshots
  • Supports complex scenarios including partial deletes, cross-file operations, and multiple snapshots
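
The net-change rule described above can be sketched as a small std-only model. This is an illustration under stated assumptions, not the PR's implementation: the Change enum, the NetChanges struct, and net_changes() are hypothetical names, rows are identified by (file, position), and every file appended inside the range is assumed to be new.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One change recorded inside the snapshot range (hypothetical model).
#[derive(Debug, Clone)]
pub enum Change {
    /// A data file containing `rows` rows was appended.
    Append { file: String, rows: u64 },
    /// Row `pos` of `file` was removed by a positional delete.
    Delete { file: String, pos: u64 },
}

/// Net result of replaying all changes in the range.
#[derive(Debug, Default, PartialEq)]
pub struct NetChanges {
    /// Surviving rows of files appended inside the range.
    pub appended: BTreeSet<(String, u64)>,
    /// Deleted rows of files that already existed before the range.
    pub deleted: BTreeSet<(String, u64)>,
}

pub fn net_changes(changes: &[Change]) -> NetChanges {
    let mut appended_files: BTreeMap<&str, u64> = BTreeMap::new();
    let mut deletes: BTreeSet<(&str, u64)> = BTreeSet::new();
    for change in changes {
        match change {
            Change::Append { file, rows } => {
                appended_files.insert(file.as_str(), *rows);
            }
            Change::Delete { file, pos } => {
                deletes.insert((file.as_str(), *pos));
            }
        }
    }

    let mut net = NetChanges::default();
    // Rows added and then deleted within the range cancel out.
    for (file, rows) in &appended_files {
        for pos in 0..*rows {
            if !deletes.contains(&(*file, pos)) {
                net.appended.insert((file.to_string(), pos));
            }
        }
    }
    // Only deletes against pre-existing files are reported as deletes.
    for (file, pos) in &deletes {
        if !appended_files.contains_key(file) {
            net.deleted.insert((file.to_string(), *pos));
        }
    }
    net
}
```

For example, appending a three-row file and deleting its row 1 in the same range yields two appended rows and no reported delete for that file, while a delete against an older file is reported as a delete.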

Implementation Details

Core Components

  1. IncrementalScan Builder (scan/incremental/mod.rs)

    • Validates snapshot ranges and table state
    • Generates file scan tasks for appends and deletes
    • Integrates with existing ArrowReader infrastructure
  2. Streaming Implementation (arrow/incremental.rs)

    • StreamsInto trait with .stream() method for converting scan tasks to Arrow record streams
    • Separate processing for append tasks (data files) and delete tasks (positional deletes)
    • Optimized delete task processing with schema reuse (O(1) allocations instead of O(n_batches))
  3. File Path Column Addition (arrow/reader.rs)

    • add_file_path_column() function adds _file column to record batches
    • Handles both empty and non-empty batches correctly
    • Maintains proper Arrow schema metadata for Iceberg field IDs
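
The schema-reuse point in item 2 can be pictured with a std-only sketch: the schema is built once and shared by reference counting, so producing n batches allocates one schema rather than n. The Schema and Batch types and build_delete_batches() are hypothetical placeholders, not the Arrow or Iceberg types used in the PR.

```rust
use std::sync::Arc;

/// Placeholder schema type (hypothetical; stands in for an Arrow schema).
#[derive(Debug)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Placeholder batch of deleted positions sharing one schema.
#[derive(Debug)]
pub struct Batch {
    pub schema: Arc<Schema>,
    pub positions: Vec<u64>,
}

/// Builds one batch per chunk of positions while allocating the schema once.
pub fn build_delete_batches(chunks: Vec<Vec<u64>>) -> Vec<Batch> {
    // Single allocation, performed before the loop.
    let schema = Arc::new(Schema {
        fields: vec!["pos".to_string(), "_file".to_string()],
    });
    chunks
        .into_iter()
        .map(|positions| Batch {
            // Arc::clone only bumps a reference count; no schema is rebuilt.
            schema: Arc::clone(&schema),
            positions,
        })
        .collect()
}
```

Every batch returned by this sketch points at the same schema allocation, which can be checked with Arc::ptr_eq.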

Restrictions

  • Does not yet support deleting entire Parquet files or overwriting Parquet files.

Testing

Comprehensive test suite added in scan/incremental/tests.rs:

Test Fixture

  • IncrementalTestFixture - Helper for creating test tables with controlled snapshots
  • Supports Add operations with custom file names and data
  • Supports Delete operations with position and file tracking
  • Verification helper verify_incremental_scan() for asserting expected results

Test Coverage

  1. test_incremental_fixture_simple - Basic append and delete operations
  2. test_incremental_fixture_complex - Multiple snapshots with overlapping operations
    • Tests 6 different snapshot range combinations
    • Verifies net change computation (e.g., data added then deleted doesn't appear)
  3. test_incremental_scan_edge_cases - Edge cases across 7 snapshots and 3 data files
    • Partial deletes from multiple files
    • Cross-file operations
    • Empty result sets
  4. test_incremental_scan_builder_options - Builder API functionality
    • Column projection (.select())
    • Batch size configuration
    • Multiple batch size scenarios
  5. test_add_file_path_column - Unit tests for file path column addition
    • Normal case with RunEndEncoded arrays
    • Empty batch handling with StringArray
    • Special characters in file paths

All tests passing: ✅ 4 incremental scan tests, ✅ 3 file path column tests

API Example

use futures::StreamExt; // assumed import: brings .next() into scope for the streams below

// Scan changes between snapshot 2 and snapshot 5
let incremental_scan = table
    .scan()
    .incremental()
    .from_snapshot_id(2)
    .to_snapshot_id(5)
    .select(vec!["id", "name"])  // Column projection
    .with_batch_size(Some(1024)) // Configure batch size
    .build()?;

// Both streams must be mutable because .next() takes &mut self
let (mut appends_stream, mut deletes_stream) = incremental_scan.to_unzipped_arrow()?;

// Process appended rows
while let Some(batch) = appends_stream.next().await {
    let batch = batch?;
    // batch contains appended data
}

// Process deleted positions
while let Some(batch) = deletes_stream.next().await {
    let batch = batch?;
    // batch contains (pos, _file) tuples for deleted positions
}
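
As a possible follow-up step for consumers, the (pos, _file) pairs from the delete stream could be grouped per data file before being applied. The sketch below is std-only and works on already-extracted (position, path) pairs; group_deletes_by_file() is a hypothetical helper and not part of the PR's API.

```rust
use std::collections::BTreeMap;

/// Groups deleted positions by their source file path.
/// Positions within each file are sorted and de-duplicated.
pub fn group_deletes_by_file<'a>(
    rows: impl IntoIterator<Item = (u64, &'a str)>,
) -> BTreeMap<String, Vec<u64>> {
    let mut by_file: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    for (pos, file) in rows {
        by_file.entry(file.to_string()).or_default().push(pos);
    }
    for positions in by_file.values_mut() {
        positions.sort_unstable();
        positions.dedup();
    }
    by_file
}
```

Sorting per file keeps the result deterministic regardless of the order in which delete batches arrive.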

@github-actions github-actions bot left a comment

license-eye has checked 363 files.

Valid  Invalid  Ignored  Fixed
296    2        65       0

Invalid files:
  • crates/playground/Cargo.toml
  • crates/playground/src/main.rs
Use this command to fix any missing license headers:
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
```

@github-actions github-actions bot left a comment

license-eye has checked 366 files.

Valid  Invalid  Ignored  Fixed
299    2        65       0

Invalid files:
  • crates/playground/Cargo.toml
  • crates/playground/src/main.rs

@github-actions github-actions bot left a comment

license-eye has checked 367 files.

Valid  Invalid  Ignored  Fixed
299    3        65       0

Invalid files:
  • crates/iceberg/src/scan/incremental/tests.rs
  • crates/playground/Cargo.toml
  • crates/playground/src/main.rs

@vustef vustef left a comment

Thanks Gerald, looking good to me, just a few last follow ups.

})?
.clone();

// TODO: What properties do we need to verify about the snapshots? What about

Let's remove this comment and track it separately (in Jira or a Google Doc of open questions).

gbrgr and others added 3 commits October 30, 2025 14:44
Co-authored-by: Vukasin Stefanovic <vukasin.stefanovic92@gmail.com>
@gbrgr gbrgr enabled auto-merge (squash) October 30, 2025 14:01
@gbrgr gbrgr merged commit 2a5a365 into main Oct 30, 2025
16 checks passed
@gbrgr gbrgr deleted the gb/incremental-bootstrap branch October 30, 2025 14:08
gbrgr added a commit that referenced this pull request Nov 3, 2025
* WIP, initial draft of incremental scan

* .

* .

* cargo fmt

* Implement unzipped stream

* Remove printlns

* Add API method for unzipped stream

* .

* Remove comment

* Rename var

* Add import

* Measure time

* Fix typo

* Undo some changes

* Change type name

* Add comment header

* Fail when encountering equality deletes

* Add comments

* Add some preliminary tests

* Format

* Remove playground

* Add more tests

* Clippy

* .

* .

* Adapt tests

* .

* Add test

* Add tests

* Add tests

* Format

* Add test

* Format

* .

* Rm newline

* Rename trait function

* Reuse schema

* .

* remove clone

* Add test for adding file_path column

* Make `from_snapshot` mandatory

* Error out if incremental scan encounters neither Append nor Delete

* .

* Add materialized variant of add_file_path_column

* .

* Allow dead code

* Some PR comments

* .

* More PR comments

* .

* Add comments

* Avoid cloning

* Add reference to PR

* Some PR comments

* .

* format

* Allow overwrite operation for now

* Fix file_path column

* Add overwrite test

* Unwrap delete vector

* .

* Add assertion

* Avoid cloning the mutex guard

* Abort when encountering a deleted delete file

* Adjust comment

* Update crates/iceberg/src/arrow/reader.rs

Co-authored-by: Vukasin Stefanovic <vukasin.stefanovic92@gmail.com>

* Add check

* Update crates/iceberg/src/scan/incremental/mod.rs

---------

Co-authored-by: Vukasin Stefanovic <vukasin.stefanovic92@gmail.com>
vustef added a commit to RelationalAI/RustyIceberg.jl that referenced this pull request Nov 4, 2025
FFI and Julia Bindings for incremental scan changes that were introduced
with RelationalAI/iceberg-rust#3.

I refactored the file structure a bit for both the Rust and Julia code.
The Rust code reuses some common parts through macros; for Julia I
didn't bother (mostly out of concern that the interaction between macros
and ccall would be a rabbit hole with little benefit).
Note that I had to use a struct instead of a const for `ScanRef`: with
the additional scan type we now have method overloads, and if both types
were `ScanRef` const aliases they would resolve to the same type, so the
methods would overwrite each other instead of overloading.

There is also new test data and a new test that exercises positional
deletes and inserts.

---------

Co-authored-by: Gerald Berger <59661379+gbrgr@users.noreply.github.com>