observe(activation): log each step of iceberg-only catalog activation #642
Merged
Iceberg-only (no-DuckLake) tenant activation can exceed the control-plane activate-tenant deadline (~60s) and fail with DeadlineExceeded, but the worker logs show only warm-up and then silence. Every step of the iceberg attach is a CGO db.Exec / LoadExtensions call that emits no Go log until it returns, so a stall in any one of them is invisible. The first existing log line is 'Attaching Iceberg catalog', which comes well after the count query, LoadExtensions(iceberg), and the secret statements, so a hang before it can't be localized from logs.

Add a lightweight 'Iceberg activation step starting' INFO log (with elapsed-since-start) before each step in attachLakekeeperCatalog (count-catalogs, load-iceberg-extension, create-s3-data-secret, create-catalog-auth-secret, attach-catalog, create-default-schema) and around setIcebergDefault's 'USE iceberg.<schema>' (the sole iceberg-only-specific step after the attach). The last 'step starting' line that is not followed by another step's line then pinpoints the blocking call.

Pure instrumentation, no behavior change. This lets the next activation of an iceberg-only tenant in mw-dev localize the stall (worker reachability to Lakekeeper is already confirmed via HTTP 200), so the actual fix can target the right call.
benben added a commit that referenced this pull request Jun 1, 2026
The iceberg extension was downloaded on demand at first use, unlike httpfs, ducklake, json, and postgres_scanner, which the Dockerfiles pre-seed into the bundled extension cache. That on-demand INSTALL silently blocks the iceberg-only tenant activation past the ~60s activate-tenant deadline (observed on mw-dev with the per-step logging from #642: count-catalogs completes in ~1ms, load-iceberg-extension never returns, worker is retired at 60.7s with no DuckDB-level error). Iceberg+DuckLake tenants don't hit it because LoadExtensions(delta) runs first and primes DuckDB's extension subsystem.

Bundle iceberg the same way as the others: curl the .duckdb_extension.gz from ${DUCKDB_EXTENSION_REPOSITORY} at build time, gunzip into /build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/, and add it to the size-check loop. This applies to both the standalone Dockerfile and Dockerfile.worker, so worker pods get a local cache hit on LoadExtensions("iceberg") instead of a CDN fetch.

This eliminates the iceberg-only activation timeout and brings the activation cost of LoadExtensions("iceberg") in line with the other bundled extensions for every iceberg-using tenant (lakekeeper + s3tables backends alike).
Why
Iceberg-only (no-DuckLake) tenant activation can exceed the control-plane activate-tenant deadline (~60s) → DeadlineExceeded, failing the session and churning the worker. Confirmed on mw-dev (ben-{cnpg,ext,aur}-ice); the iceberg + DuckLake combos and the S3Tables backend are unaffected.

The worker logs are useless for localizing it: they show warm-up, then silence for the whole ~60s. Every step of the Lakekeeper iceberg attach is a CGO db.Exec / LoadExtensions that emits no Go log until it returns, and the first existing log line ("Attaching Iceberg catalog") is already past the count query, LoadExtensions(iceberg), and the secret statements. Worker→Lakekeeper reachability is already ruled out (HTTP 200 from a born-as-worker pod; the :8181 egress gap is fixed in charts #11444).

What
Pure instrumentation: an "Iceberg activation step starting" INFO log (with elapsed) before each step in attachLakekeeperCatalog: count-catalogs → load-iceberg-extension → create-s3-data-secret → create-catalog-auth-secret → attach-catalog → create-default-schema, plus a before/after log around setIcebergDefault's USE iceberg.<schema> (the only iceberg-only-specific step after the attach, and the leading suspect).

No behavior change. The last "step starting" line without a following step pinpoints the blocking call, so the next iceberg-only activation in mw-dev localizes the stall and the real fix can target the exact db.Exec.

🤖 Generated with Claude Code