
feat: OverwriteAction support (replace all / partial overwrite)#106

Merged: gbrgr merged 19 commits into main from gb/overwrite-action on May 29, 2026
@gbrgr gbrgr commented May 28, 2026

Summary

Adds atomic overwrite snapshot support to RustyIceberg.jl, enabling callers to replace all (or a subset of) existing Parquet files with a new set in a single Iceberg Operation::Overwrite snapshot.

Depends on: RelationalAI/iceberg-rust#76 (cherry-pick of upstream apache/iceberg-rust#2185, which adds OverwriteAction to iceberg-rust).

Changes

FFI (iceberg_rust_ffi/src/transaction.rs)

  • IcebergOverwriteAction — accumulates added + deleted DataFile lists
  • iceberg_overwrite_action_new / _free
  • iceberg_overwrite_action_add_data_files — move new files into action
  • iceberg_overwrite_action_delete_data_files — move files-to-delete into action
  • iceberg_overwrite_action_apply — calls Transaction::overwrite().apply()
  • iceberg_table_list_data_files — async walk of manifest list to collect all live DataFile records from the current snapshot
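
The accumulate-then-apply shape of the FFI handle can be sketched in plain Rust. This is a minimal illustration, not the code in transaction.rs: `DataFile` and `DataFiles` are stand-in types, and the `overwrite_action_new` / `overwrite_action_free` names are hypothetical. It shows files being moved out of a caller-owned batch with `std::mem::take` (the batch stays allocated and must still be freed by its owner) and the usual `Box::into_raw` / `Box::from_raw` pairing for the handle itself.

```rust
// Stand-in for iceberg's DataFile; the real type carries far more metadata.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub file_path: String,
}

// Hypothetical mirror of the handle that owns a batch of data files.
#[derive(Default)]
pub struct DataFiles {
    pub files: Vec<DataFile>,
}

// Accumulates the files to add and the files to delete until apply time.
#[derive(Default)]
pub struct OverwriteAction {
    pub added: Vec<DataFile>,
    pub deleted: Vec<DataFile>,
}

impl OverwriteAction {
    // Move the files out of `batch`; the batch is left empty but still allocated.
    pub fn add_data_files(&mut self, batch: &mut DataFiles) {
        self.added.extend(std::mem::take(&mut batch.files));
    }

    // Same move semantics for the files that the snapshot should remove.
    pub fn delete_data_files(&mut self, batch: &mut DataFiles) {
        self.deleted.extend(std::mem::take(&mut batch.files));
    }
}

// C-ABI style constructor: hands ownership of the action to the caller.
pub extern "C" fn overwrite_action_new() -> *mut OverwriteAction {
    Box::into_raw(Box::new(OverwriteAction::default()))
}

// C-ABI style destructor. The null check makes freeing a null handle a no-op;
// it only guards against a double free if the caller nulls its pointer after
// the first call.
pub unsafe extern "C" fn overwrite_action_free(ptr: *mut OverwriteAction) {
    if !ptr.is_null() {
        // SAFETY: the caller guarantees `ptr` came from overwrite_action_new
        // and has not been freed already.
        unsafe { drop(Box::from_raw(ptr)) };
    }
}
```

Because the batch is drained rather than consumed, the caller still owns an empty `DataFiles` afterwards and is responsible for releasing it.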

Julia bindings (src/transaction.jl)

  • OverwriteAction struct + constructor / free_overwrite_action!
  • add_data_files(action, files) / delete_data_files(action, files)
  • apply(action, tx) / with_overwrite(f, tx) convenience helper
  • list_data_files(table) -> DataFiles
  • All new symbols exported from RustyIceberg

Tests (test/overwrite_tests.jl)

Self-contained, no Docker — all tests use mktempdir + catalog_create_memory:

  • OverwriteAction lifecycle (new / free / double-free)
  • list_data_files on empty table
  • list_data_files after append
  • Overwrite replaces all existing files
  • Overwrite deletes only explicitly listed files; others survive intact
  • Overwrite add-only (no deletes) produces a new snapshot
  • Two sequential overwrites converge correctly
  • Error handling: freed action, null DataFiles, committed (consumed) transaction
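
The semantics these testsets exercise can be modelled in a few lines of Rust. This is a simplified sketch, not the test code: a snapshot is reduced to a set of file paths, the `overwrite` function and the file names are hypothetical, and any validation that iceberg-rust performs on the delete list is ignored.

```rust
use std::collections::BTreeSet;

// Simplified model: a snapshot is just the set of live data file paths.
pub type Snapshot = BTreeSet<String>;

// Produce the next snapshot: drop the listed files, then add the new ones.
// Files that are not named in `deleted` carry over unchanged.
pub fn overwrite(current: &Snapshot, added: &[&str], deleted: &[&str]) -> Snapshot {
    let mut next = current.clone();
    for path in deleted {
        next.remove(*path);
    }
    for path in added {
        next.insert((*path).to_string());
    }
    next
}
```

In this model, deleting every file in the current snapshot gives the "replace all" case, deleting a subset gives the partial overwrite where unlisted files survive, and an empty delete list gives the add-only case.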

Usage

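# Replace all existing files atomically.
# with_data_file_writer and with_transaction are not exported, so both are qualified.
old_files = list_data_files(table)
new_files = RustyIceberg.with_data_file_writer(table) do w
    write(w, new_data)
end
updated_table = RustyIceberg.with_transaction(table, catalog) do tx
    with_overwrite(tx) do action
        add_data_files(action, new_files)      # files that become live in the new snapshot
        delete_data_files(action, old_files)   # files removed by the same snapshot
    end
end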

Test plan

  • make run-containers && make test passes (all overwrite testsets green)
  • Existing test suite unaffected (27875 pre-existing tests still pass)

🤖 Generated with Claude Code

gbrgr and others added 18 commits May 28, 2026 08:08
Update iceberg-rust rev to a4a353577ad7414b065770ba970c1353325a3adb
(RelationalAI/iceberg-rust#76, adds OverwriteAction).

Fix three API breaks exposed by the rev bump:
- UnzippedIncrementalBatchRecordStream renamed to UnzippedIncrementalScanResult
  (struct with .appends/.deletes fields instead of a tuple alias)
- to_unzipped_arrow() now returns that struct, not a bare tuple
- ArrowReader::read() now returns ScanResult; call .stream() to unwrap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New FFI surface for atomic table overwrites:
- IcebergOverwriteAction: accumulates added + deleted DataFile lists
- iceberg_overwrite_action_new / _free
- iceberg_overwrite_action_add_data_files: move new files into action
- iceberg_overwrite_action_delete_data_files: move files-to-delete into action
- iceberg_overwrite_action_apply: calls Transaction::overwrite().apply()
- iceberg_table_list_data_files: async walk of manifest list to collect all
  live DataFile records from the current snapshot (needed so Julia can
  supply the delete list for a full-table replace)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- OverwriteAction struct + constructor/free
- add_data_files(action, files) / delete_data_files(action, files)
- apply(action, tx) / with_overwrite(f, tx) convenience helper
- list_data_files(table) -> DataFiles (async, walks manifest list)
- All new symbols exported from RustyIceberg

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Seven testsets, all self-contained with mktempdir + catalog_create_memory:
- OverwriteAction lifecycle (new/free/double-free)
- list_data_files on empty table
- list_data_files after append
- Overwrite replaces all existing files (2 appended files → 2 new rows)
- Overwrite add-only produces a new snapshot
- Two sequential overwrites converge correctly
- OverwriteAction error handling (freed action, null DataFiles, consumed tx)

All tables freed in finally blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
with_data_file_writer is not exported from RustyIceberg, so bare usage
in test file caused UndefVarError at runtime.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l overwrite

- Qualify with_transaction as RustyIceberg.with_transaction (not exported)
- Fix "apply on consumed transaction" test: apply() doesn't consume tx,
  only commit() does; now we commit first then try apply
- Add "partial overwrite" testset: delete all + re-add kept rows + new rows,
  verifies mixed add_data_files calls and selective deletion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Instead of awkwardly re-adding "kept" rows, list_data_files on an earlier
snapshot (v1) gives just the first file. The overwrite deletes only those
files; the second file (appended in v2) is not in the delete list and
survives intact — directly testing the expected semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous code used bare IcebergException(UNEXPECTED, ...) which
references an undefined constant, causing UndefVarError at runtime.
Switch to parse_and_throw (same pattern as FastAppendAction) which
extracts the error code from the Rust-encoded message string, and
use Ref{Ptr{Cchar}} (not Ptr{Ptr{Cchar}}) to match the calling convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er-overwrite test

- Add #[derive(Default)] to IcebergOverwriteAction to satisfy clippy
- cargo fmt reformatting of transaction.rs and incremental_pipeline.rs
- Bump crate version 0.8.1 → 0.8.2
- Add testset: fast append after full overwrite clears table then re-populates it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both iceberg_overwrite_action_add/delete_data_files drain the Vec<DataFile>
via std::mem::take but leave the IcebergDataFiles box alive. Wrap the ccall
in try/finally and call free_data_files! to match the FastAppendAction pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Picks up the fix that relaxes SnapshotProducer's precondition check to
allow Overwrite snapshots that only delete files without adding new ones.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
read_table_data returns nothing when there are no record batches, which
is the expected state after clearing a table via delete-only overwrite.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rust FFI:
- iceberg_data_files_len: return file count without consuming the handle
- iceberg_data_files_to_json: serialize all DataFile metadata fields to a
  JSON array (content, file_path, file_format, record_count,
  file_size_in_bytes, column/value/null/nan counts, bounds, split_offsets,
  sort_order_id, equality_ids, first_row_id, referenced_data_file,
  content_offset, content_size_in_bytes)

Julia:
- Base.length(df::DataFiles): wraps iceberg_data_files_len
- data_file_info(df::DataFiles): returns Vector{Dict{String,Any}} via JSON

Tests updated to assert file counts and metadata instead of just checking
for non-null handles.
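
A rough sketch of the two accessors, under stated assumptions: `DataFile` here is a stand-in with only three of the fields listed above, the function names are illustrative, and the JSON is built by hand to keep the example dependency-free (the actual FFI may well use a serialization crate). Both functions borrow the collection, so the handle can still be used afterwards.

```rust
// Stand-in with three of the listed fields; the real DataFile has many more.
pub struct DataFile {
    pub file_path: String,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

// Length by shared reference: the collection is not consumed.
pub fn data_files_len(files: &[DataFile]) -> usize {
    files.len()
}

// Escape the characters that may not appear raw inside a JSON string.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// Serialize every file to one JSON object and wrap the objects in an array.
pub fn data_files_to_json(files: &[DataFile]) -> String {
    let items: Vec<String> = files
        .iter()
        .map(|f| {
            format!(
                "{{\"file_path\":\"{}\",\"record_count\":{},\"file_size_in_bytes\":{}}}",
                escape_json(&f.file_path),
                f.record_count,
                f.file_size_in_bytes
            )
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

A JSON array of objects maps directly onto the `Vector{Dict{String,Any}}` that `data_file_info` returns on the Julia side.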

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread src/data_file.jl Outdated
Co-authored-by: Richard Gankema <richardgankema@gmail.com>
@gbrgr gbrgr enabled auto-merge (squash) May 29, 2026 08:58
@gbrgr gbrgr merged commit 96c2b19 into main May 29, 2026
6 checks passed
@gbrgr gbrgr deleted the gb/overwrite-action branch May 29, 2026 09:11
2 participants