
Fix Iceberg write retry loop stuck on explicit metadata file path#101548

Merged
alesapin merged 4 commits into ClickHouse:master from groeneai:fix-iceberg-write-retry-loop on Apr 8, 2026

Conversation

@groeneai (Contributor) commented Apr 1, 2026

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix Iceberg INSERT retry loop failing when the table was created with iceberg_metadata_file_path and the target metadata version already exists.

What this PR does

When an Iceberg table is created with iceberg_metadata_file_path (e.g. for time-travel reads), the metadata write retry loop in INSERT and mutation operations would always re-read the explicitly specified metadata version instead of discovering the actual latest version.

This caused the retry loop to regenerate the same target metadata version on every attempt. If that version already existed (e.g. from a previous successful write, or a concurrent writer), all 100 retries would fail with:

Code: 736. DB::Exception: Write into iceberg was not successful. (DATALAKE_DATABASE_ERROR)

Root cause: getLatestOrExplicitMetadataFileAndVersion() unconditionally returns the explicit metadata file path when iceberg_metadata_file_path is set, even during retry loops that need the actual latest version to advance past conflicts.

Evidence from server logs (before fix): The retry loop re-reads v2 and tries to write v3 on every attempt:

Rereading metadata file issue87414/test/t0/metadata/v2.metadata.json with version 2
Writing new metadata file s3a://test/issue87414/test/t0/metadata/v3.metadata.json
Failed to write metadata s3a://test/issue87414/test/t0/metadata/v3.metadata.json, retrying
Rereading metadata file issue87414/test/t0/metadata/v2.metadata.json with version 2  ← same!
Writing new metadata file s3a://test/issue87414/test/t0/metadata/v3.metadata.json     ← same!
... (100 times)

After fix: The retry reads the actual latest version (v3) and advances to v4:

Writing new metadata file s3a://test/issue87414/test/t0/metadata/v3.metadata.json
Failed to write metadata s3a://test/issue87414/test/t0/metadata/v3.metadata.json, retrying
Rereading metadata file issue87414/test/t0/metadata/v3.metadata.json with version 3  ← correct!
Writing new metadata file s3a://test/issue87414/test/t0/metadata/v4.metadata.json     ← advances!
Metadata file s3a://test/issue87414/test/t0/metadata/v4.metadata.json written         ← success!

The fix

Add ignore_explicit_metadata_file_path parameter to getLatestOrExplicitMetadataFileAndVersion(). Write retry paths (in IcebergWrites.cpp and Mutations.cpp) pass true to bypass the explicit path and discover the real latest metadata version via version hint or directory listing.
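The before/after behavior of the retry loop can be sketched in a few lines. The following is a hypothetical Python simulation, not ClickHouse code: storage is modeled as a set of existing version numbers, and the function name only loosely mirrors getLatestOrExplicitMetadataFileAndVersion().

```python
# Minimal simulation of the retry loop described above. The 100-retry
# limit matches the PR; everything else is a simplified model.
MAX_RETRIES = 100

def get_latest_or_explicit(existing, explicit_version, ignore_explicit=False):
    """Return the metadata version to base the next write on."""
    if explicit_version is not None and not ignore_explicit:
        return explicit_version      # old behavior: always the pinned version
    return max(existing)             # discover the actual latest version

def write_metadata(existing, explicit_version, ignore_explicit_on_retry):
    # The first attempt reads the pinned version, as before the fix.
    base = get_latest_or_explicit(existing, explicit_version)
    for _ in range(MAX_RETRIES):
        target = base + 1
        if target not in existing:   # create-if-not-exists succeeds
            existing.add(target)
            return target
        # Conflict: re-read metadata before retrying. The fix makes this
        # re-read ignore the explicit path so the base can advance.
        base = get_latest_or_explicit(
            existing, explicit_version,
            ignore_explicit=ignore_explicit_on_retry)
    raise RuntimeError("Write into iceberg was not successful")
```

Starting from existing = {1, 2, 3} with the table pinned to v2: with ignore_explicit_on_retry=False every attempt targets v3 and the loop exhausts its retries, while with True the first conflict re-reads v3 and the write lands on v4 — matching the log excerpts above.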

CI impact

This fixes test 03758_fix_isssue_87414 which has been failing across 23 distinct PRs with 89 total hits in the last 30 days. The test writes to an Iceberg table created with an explicit metadata path; when run multiple times (flaky check), subsequent runs fail because v3.metadata.json already exists from the first run.

🤖 Generated with Claude Code

@groeneai (Contributor, Author) commented Apr 1, 2026

Pre-PR Validation Gate

Session: cron:clickhouse-ci-task-worker:20260401-184500

a) Deterministic repro? ✅ Yes.

  1. Upload test data to MinIO: mc cp --recursive data_minio/issue87414/ minio/test/issue87414/
  2. Run: CREATE TABLE t ENGINE = IcebergS3(s3_conn, filename='issue87414/test/t0') SETTINGS iceberg_metadata_file_path='metadata/v2.metadata.json'; INSERT INTO t (c0) SETTINGS write_full_path_in_iceberg_metadata=1, allow_insert_into_iceberg=1 VALUES (1)
  3. Run the INSERT again → Code: 736. DATALAKE_DATABASE_ERROR — 100% deterministic on second run.

b) Root cause explained?
When a table has iceberg_metadata_file_path set, the retry cleanup in initializeMetadata() calls getLatestOrExplicitMetadataFileAndVersion(), which always returns the explicit version (v2). So filename_generator.setVersion(2+1) always targets v3. If v3 already exists, the exists() check in writeMetadataFileAndVersionHint() fails on every attempt, so all 100 retries are exhausted → DATALAKE_DATABASE_ERROR.

c) Fix matches root cause?
The fix adds ignore_explicit_metadata_file_path to bypass the explicit path during retry, so the function discovers the actual latest version via version hint/listing. The retry then advances to v(latest+1).

d) Test intent preserved?
No test changes — the existing test 03758_fix_isssue_87414 now passes correctly on repeated runs. The test's purpose (verify Iceberg INSERT creates correct metadata) is fully preserved.

e) Both directions demonstrated?

  • Before fix: Second run fails with DATALAKE_DATABASE_ERROR — server log shows 100 retries all targeting v3
  • After fix: Second run succeeds — server log shows retry reads v3 and advances to v4. Verified 5 consecutive runs, all producing correct output (95 rows from v2, 96 rows from v3).

f) Fix is general?
Fixed the same pattern in all 4 call sites: IcebergWrites.cpp (retry cleanup) and Mutations.cpp (3 mutation retry loops: mutate, alter, expireSnapshots). All write retry paths that call getLatestOrExplicitMetadataFileAndVersion now correctly ignore the explicit metadata file path.
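The "version hint/listing" discovery that the retry now falls back to can be illustrated with a small, self-contained sketch. This is a hypothetical helper, not the actual ClickHouse implementation: it extracts N from vN.metadata.json names in a directory listing and takes the maximum.

```python
import re

# Hypothetical helper for "discover the latest version by directory
# listing": parse vN.metadata.json file names and return the highest N.
METADATA_NAME = re.compile(r"v(\d+)\.metadata\.json")

def latest_metadata_version(listing):
    versions = [int(m.group(1))
                for name in listing
                if (m := METADATA_NAME.fullmatch(name))]
    if not versions:
        raise FileNotFoundError("no vN.metadata.json files in listing")
    return max(versions)
```

For the layout in the logs above (a metadata/ directory holding v2.metadata.json and v3.metadata.json), this returns 3, so the retry would target v4 next.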

@groeneai (Contributor, Author) commented Apr 1, 2026

cc @scanhex12 @divanik — could you review this? It fixes a bug where the Iceberg write retry loop gets stuck when the table was created with iceberg_metadata_file_path — the retry always re-reads the explicit metadata version instead of discovering the actual latest, causing all 100 retries to target the same already-existing file.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Apr 1, 2026
@clickhouse-gh Bot commented Apr 1, 2026

Workflow [PR], commit [d82943e]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (amd_asan_ubsan, flaky check) | 00746_sql_fuzzy | FAIL | cidb | IGNORED |
| Stateless tests (amd_tsan, s3 storage, parallel, 2/2) | 03237_avro_union_in_complex_types | FAIL | cidb, issue | ISSUE EXISTS |
| Integration tests (amd_binary, 2/5) | test_storage_kafka/test_batch_fast.py::test_kafka_commit_on_block_write[generate_old_create_table_query] | FAIL | cidb, issue | ISSUE EXISTS |
| Integration tests (arm_binary, distributed plan, 1/4) | test_keeper_snapshot_chunked_transfer/test.py::test_recover_after_interrupted_transfer[remote_disk] | FAIL | cidb, issue | ISSUE CREATED |

AI Review

Summary
This PR fixes an Iceberg write retry issue when iceberg_metadata_file_path is set: retry paths now bypass the explicit metadata path and re-read the actual latest metadata version before generating the next target version. The change is focused, consistent across INSERT retry and mutation/alter/expire write paths, and includes a regression test that reproduces the conflict scenario and verifies successful progression. I did not find high-confidence correctness, safety, or performance issues requiring inline comments.

ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time

Final Verdict

  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Apr 1, 2026
@alesapin alesapin self-assigned this Apr 2, 2026
@alesapin (Member) left a comment


Dear @groeneai, can you please add a functional test with IcebergLocal to reproduce the issue and prove that your fix works.

Regression test per @alesapin's review request. Reproduces the retry loop
bug: creates an IcebergLocal table, adds data to produce multiple metadata
versions, then re-creates with iceberg_metadata_file_path pointing to an
old version. The INSERT triggers a metadata conflict (target version already
exists), and the retry must discover the actual latest version to advance.

Without the fix: DATALAKE_DATABASE_ERROR after 100 retries (all targeting
the same already-existing version). With the fix: retry ignores the
explicit path, discovers the latest version, and creates the next one.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@groeneai (Contributor, Author) commented Apr 2, 2026

Done — added test 04079_iceberg_write_retry_explicit_metadata in be71c7e.

It reproduces the retry loop bug with IcebergLocal:

  1. Creates a table, inserts 2 rows (producing metadata v1→v3)
  2. Drops the table, re-creates with iceberg_metadata_file_path = 'metadata/v1.metadata.json'
  3. Inserts — first attempt reads v1, targets v2, but v2 already exists → conflict → retry
  4. Without fix: retry re-reads v1 (explicit path) → targets v2 again → 100 retries → DATALAKE_DATABASE_ERROR
  5. With fix: retry discovers v3 as latest → creates v4 → success
  6. Verifies all 3 rows are in the final table

Both directions confirmed locally: fails without fix, 10/10 passes with fix.

@alexey-milovidov (Member) commented:

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@groeneai (Contributor, Author) commented Apr 7, 2026

Done — merged latest master (includes the SimSIMD SVE MSan fix from #101239) into the branch. CI should re-run with the fix included.

@clickhouse-gh Bot commented Apr 7, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.10% | 84.10% | +0.00% |
| Functions | 90.90% | 90.90% | +0.00% |
| Branches | 76.60% | 76.60% | +0.00% |

Changed lines: 65.22% (30/46)

@alesapin alesapin added this pull request to the merge queue Apr 8, 2026
Merged via the queue into ClickHouse:master with commit 3424d0e Apr 8, 2026
158 of 163 checks passed
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 8, 2026
