
Antalya 26.1 Backport of #96191 - Introduce async prefetch and staleness for Iceberg metadata#1575

Merged
zvonand merged 2 commits intoantalya-26.1from
backports/antalya-26.1/96191
Mar 26, 2026

Conversation

@mkmkme
Collaborator

@mkmkme mkmkme commented Mar 25, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  1. Iceberg tables gain the ability to asynchronously pre-populate metadata into the cache. This can be enabled by setting iceberg_metadata_async_prefetch_period_ms at table creation, e.g.:
CREATE TABLE X (...)
ENGINE = Iceberg*(...)
SETTINGS iceberg_metadata_async_prefetch_period_ms = 60_000
  2. SELECT queries from Iceberg tables can now specify the iceberg_metadata_staleness_ms parameter, which allows ClickHouse to rely on the cached version of the metadata if it is fresher than the specified staleness. Otherwise, the remote Iceberg catalog is queried for the latest metadata before processing the request (the previous behaviour). With this change, calls to the Iceberg catalog during request processing can be eliminated entirely, which is expected to bring a visible performance gain. Example:
SELECT count() FROM {TABLE_NAME} SETTINGS iceberg_metadata_staleness_ms=600000

Similar functionality is available at:

Benchmarks:

(ClickHouse#96191 by @arsenmuk)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

arsenmuk and others added 2 commits March 25, 2026 09:57
…ache-preheat-and-staleness-for-iceberg

Introduce async prefetch and staleness for Iceberg metadata
@github-actions

Workflow [PR], commit [e5e2e24]

@mkmkme
Collaborator Author

mkmkme commented Mar 25, 2026

The diff has two function calls that differ from upstream:

1. getBackgroundContext() → Context::createCopy(Context::getGlobalContextInstance())

Upstream code (line 268):

auto ctx = Context::getGlobalContextInstance()->getBackgroundContext();

Our replacement (line 261):

auto ctx = Context::createCopy(Context::getGlobalContextInstance());

Upstream's getBackgroundContext() returns a ContextMutablePtr that was created during server initialization in makeBackgroundContext(). It does exactly three things:

  • Context::createCopy(shared_from_this()) — copies the global context
  • setCurrentProfile(background_profile_name) — applies an optional "background" settings profile (falls back to the system profile if none is configured)
  • Sets is_background_operation = true — a flag used by upstream to mark the context for background ops.

We can (and should) switch to upstream's code once upstream's ClickHouse#93905 is backported.

2. getManifestFileEntriesHandle(...) → Iceberg::getManifestFile(...)

Upstream code (lines 275–276):

auto manifest_file_ptr = getManifestFileEntriesHandle(
    object_storage, persistent_components, ctx, log, entry, actual_table_state_snapshot.schema_id);

Our replacement (lines 268–273):

auto manifest_file_ptr = Iceberg::getManifestFile(
    object_storage, persistent_components, ctx, log,
    entry.manifest_file_absolute_path,
    entry.added_sequence_number,
    entry.added_snapshot_id,
    *secondary_storages);

What upstream getManifestFileEntriesHandle does:

  1. Calls getManifestFile(...) to download/deserialize the manifest file (or get it from cache)
  2. Creates a ManifestFileIterator, drains it (while (iterator->next()) {})
  3. Returns a handle to the drained entries

Steps 2–3 exist because upstream refactored manifest files into an iterator-based API (ManifestFileIterator). The Antalya branch hasn't been refactored — it still uses the older ManifestFileContent/ManifestFilePtr model.

The entire purpose of this loop is cache warming — the return value manifest_file_ptr is immediately discarded. The goal is just to trigger the download-and-cache side effect. Both getManifestFileEntriesHandle and Iceberg::getManifestFile achieve exactly this: they read the manifest file from remote storage and store it in the metadata cache.

Parameter mapping:

  • Upstream passes the whole ManifestFileCacheKey entry as a single struct. Antalya's getManifestFile takes the fields individually, so we unpack entry.manifest_file_absolute_path, entry.added_sequence_number, entry.added_snapshot_id.
  • Antalya's signature also requires SecondaryStorages & (for multi-storage resolution). We pass *secondary_storages which is a member of IcebergMetadata.
  • Upstream's extra parameter actual_table_state_snapshot.schema_id is only used inside the ManifestFileIterator, which we don't need because Antalya's getManifestFile handles schema processing internally in ManifestFileContent.

Both changes should be correct and safe. I have also verified the added/changed integration tests are passing.

@mkmkme
Collaborator Author

mkmkme commented Mar 25, 2026

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1575 (Antalya 26.1 backport of ClickHouse#96191 — Iceberg async metadata prefetch + read staleness):

Confirmed defects

No confirmed defects in reviewed scope.

Coverage summary

  • Scope reviewed: PR #1575 (Antalya 26.1 Backport of #96191 - Introduce async prefetch and staleness for Iceberg metadata): table setting iceberg_metadata_async_prefetch_period_ms (background BackgroundSchedulePool task in IcebergMetadata), session setting iceberg_metadata_staleness_ms + IcebergMetadataFilesCache::getOrSetLatestMetadataVersion, getLatestMetadataFileAndVersion / getLatestOrExplicitMetadataFileAndVersion wiring, cache invalidation on successful writes, Context::getIcebergSchedulePool, DataLakeConfiguration member order, docs/tests touchpoints.
  • Call graph (summary): Table attach → IcebergMetadata ctor optionally schedules backgroundMetadataPrefetcherThread → getRelevantState(..., true) → getLatestOrExplicitMetadataFileAndVersion → getOrSetLatestMetadataVersion (force path: tolerated_staleness_ms = 0) → load_fn (list + optional parse) → manifest prefetch via getManifestFile; read paths → getRelevantState(..., false) → staleness-aware cache probe → getState; writes/mutations/compaction/Glue catalog paths rely on default force_fetch_latest_metadata = true or explicit true → catalog refresh.
  • Concurrency / shutdown: BackgroundSchedulePoolTaskInfo::deactivate() takes exec_mutex before marking deactivated; execute() holds the same mutex while function() runs, so destructor deactivate() blocks until the current prefetch body finishes (no confirmed use-after-free on IcebergMetadata*). SecondaryStorages uses a mutex in resolveObjectStorageForPath (shared map).
  • Fault categories (logical): (1) Stale reads vs writes — pass (writes pass force true or default true; post-write remove on path/uuid keys). (2) Shutdown vs background task — pass (mutex ordering as above). (3) Cache key / variant safety — no collision reproduced; std::get<LatestMetadataVersionPtr> would only throw if the same cache key were ever populated with a non–latest-metadata cell — not shown reachable from distinct keying (table_path / bare uuid vs getKey(uuid, file_path)).
  • Categories failed: none.
  • Categories passed: staleness vs force-fetch semantics, write-path cache invalidation, background scheduling and destructor interaction, secondary storage resolution locking, default parameter audit on call sites.
  • Assumptions/limits: Static code review only (no integration run); “confirmed” is limited to what the code paths prove; residual risk remains if CacheBase keys were ever reused for another cell type in a future change without namespace separation.

Collaborator

@arthurpassos arthurpassos left a comment


LGTM, brownie points for the summary of conflicts

@alsugiliazova
Member

PR #1575 CI Verification Report

CI Results Overview

| Category | Count |
| --- | --- |
| Success | 54 |
| Failure | 1 (BuzzHouse) |
| Skipped | 41 (excluded sanitizer suites) |
| Total (result_pr.json) | 96 |

PR's New Test Validation

All 4 new test_async_metadata_refresh.py tests were executed and passed in two integration test configurations:

| Job | Shard | Tests | Result |
| --- | --- | --- | --- |
| Integration tests (amd_binary, 4/5) | 4/5 | 4 tests | All OK |
| Integration tests (amd_asan, db disk, old analyzer, 5/6) | 5/6 | 4 tests | All OK |

Tests validated:

  • test_selecting_with_stale_vs_latest_metadata[s3] — OK
  • test_default_async_metadata_refresh[s3] — OK
  • test_async_metadata_refresh[s3] — OK
  • test_insert_updates_metadata_cache[s3] — OK

Additionally, all existing Iceberg tests passed across all shards (~470 Iceberg tests in amd_binary, ~470 in amd_asan).

CI Failures

1. BuzzHouse (amd_debug) — Known Flaky Fuzzer

Job: BuzzHouse (amd_debug)

Server crash during random SQL fuzzing ("Lost connection to server"). BuzzHouse has a ~5.3% failure rate in upstream CI (228 failures out of 4,266 runs in the last 30 days). BuzzHouse (arm_asan) passed.

Related to PR: No — Known flaky fuzzer unrelated to Iceberg metadata changes

2. DCO (Developer Certificate of Origin)

Missing Signed-off-by line on one or more commits. This also fails on other PRs on the same branch (e.g., PR #1552).

Related to PR: Potentially — Commit signing requirement, but appears to be a branch-wide issue

3. Integration tests (amd_asan, targeted) — CI Framework Bug

Job: Integration tests (amd_asan, targeted)

Error: AssertionError: Invalid status [skipped] to be set as GH commit status.state in praktika/gh.py. The targeted test runner found no previously-failed tests to re-run, attempted to report "skipped" status, and the CI framework crashed because GitHub's API doesn't accept "skipped" as a commit status state.

Related to PR: No — CI framework bug in Praktika

4. Stateless tests (arm_asan, targeted) — CI Framework Bug

Job: Stateless tests (arm_asan, targeted)

Same AssertionError: Invalid status [skipped] as above. The PR has no stateless tests, so the targeted runner correctly skipped but crashed trying to report the status.

Related to PR: No — CI framework bug in Praktika

5. GrypeScan (-alpine) — CVE in Base Image

CVE-2026-2673 (High) in Alpine base image OpenSSL packages. Same failure on PR #1552.

Related to PR: No — Base image vulnerability

6. Regression Parquet (x86 + aarch64) — Pre-existing Failure

Jobs: RegressionTestsRelease / Parquet, RegressionTestsAarch64 / Parquet

Failed test: /parquet/datatypes/unsupportednull — DB::Exception: Unsupported Parquet type 'null'. Same failure on PR #1552 on the same branch.

Related to PR: No — Pre-existing Parquet subsystem issue, unrelated to Iceberg

7. Regression Swarms (x86 + aarch64) — Pre-existing Failure

8. Regression S3Export (part) (x86 + aarch64) — Timeout/Cancelled

Related to PR: No — Pre-existing S3 export stress test timeouts

Conclusion

| Failure | Root Cause | PR-Related |
| --- | --- | --- |
| BuzzHouse (amd_debug) | Known flaky fuzzer (~5.3% fail rate) | No |
| DCO | Missing Signed-off-by | Potentially (branch-wide) |
| Integration tests (targeted) | Praktika CI framework bug | No |
| Stateless tests (targeted) | Praktika CI framework bug | No |
| GrypeScan (-alpine) | CVE-2026-2673 in Alpine OpenSSL | No |
| Regression Parquet | Pre-existing unsupportednull failure | No |
| Regression Swarms | Pre-existing node failure instability | No |
| Regression S3Export (part) | Pre-existing concurrent ALTER timeouts | No |

Verdict: Ready to merge — No PR-related failures detected.

@alsugiliazova alsugiliazova added the verified Verified by QA label Mar 26, 2026
@zvonand zvonand merged commit a7d6a1a into antalya-26.1 Mar 26, 2026
258 of 277 checks passed
