observe(activation): log each step of iceberg-only catalog activation #642

Merged
benben merged 1 commit into main from ben/fix-iceberg-only-activation-timeout
May 29, 2026

Conversation

@benben benben commented May 29, 2026

Why

Iceberg-only (no-DuckLake) tenant activation can exceed the control-plane activate-tenant deadline (~60s) → DeadlineExceeded, failing the session and churning the worker. Confirmed on mw-dev (ben-{cnpg,ext,aur}-ice); the iceberg + DuckLake combos and the S3Tables backend are unaffected.

The worker logs are useless for localizing it: they show warm-up, then silence for the whole ~60s. Every step of the Lakekeeper iceberg attach is a CGO db.Exec / LoadExtensions that emits no Go log until it returns, and the first existing log line ("Attaching Iceberg catalog") is already past the count query, LoadExtensions(iceberg), and the secret statements. Worker→Lakekeeper reachability is already ruled out (HTTP 200 from a born-as-worker pod; the :8181 egress gap is fixed in charts #11444).

What

Pure instrumentation: an "Iceberg activation step starting" INFO log (with elapsed) before each step in attachLakekeeperCatalog (count-catalogs, load-iceberg-extension, create-s3-data-secret, create-catalog-auth-secret, attach-catalog, create-default-schema), plus a before/after log around setIcebergDefault's USE iceberg.<schema> (the only iceberg-only-specific step after the attach, and the leading suspect).

No behavior change. The last "step starting" line without a following step pinpoints the blocking call, so the next iceberg-only activation in mw-dev localizes the stall and the real fix can target the exact db.Exec.

🤖 Generated with Claude Code

Iceberg-only (no-DuckLake) tenant activation can exceed the control-plane
activate-tenant deadline (~60s) and fail with DeadlineExceeded, but the
worker logs show only warm-up then silence: every step of the iceberg
attach is a CGO db.Exec / LoadExtensions that emits no Go log until it
returns, so a stall in any one of them is invisible. The first existing
log line is 'Attaching Iceberg catalog', well past the count query,
LoadExtensions(iceberg), and the secret statements — so a hang before it
can't be localized from logs.

Add a lightweight 'Iceberg activation step starting' INFO log (with
elapsed-since-start) before each step in attachLakekeeperCatalog
(count-catalogs, load-iceberg-extension, create-s3-data-secret,
create-catalog-auth-secret, attach-catalog, create-default-schema) and
around setIcebergDefault's 'USE iceberg.<schema>' (the sole
iceberg-only-specific step after the attach). The last 'step starting'
line without a following step then pinpoints the blocking call.

Pure instrumentation, no behavior change. Lets the next activation of an
iceberg-only tenant in mw-dev localize the stall (worker reachability to
Lakekeeper is already confirmed via HTTP 200), so the actual fix can target
the right call.
@benben benben merged commit eff4f67 into main May 29, 2026
22 checks passed
@benben benben deleted the ben/fix-iceberg-only-activation-timeout branch May 29, 2026 16:58
benben added a commit that referenced this pull request Jun 1, 2026
Iceberg extension was downloaded on-demand at first use, unlike httpfs,
ducklake, json, and postgres_scanner which the Dockerfiles pre-seed into
the bundled extension cache. That on-demand INSTALL silently blocks the
iceberg-only tenant activation past the ~60s activate-tenant deadline
(observed on mw-dev with the per-step logging from #642: count-catalogs
completes in ~1ms, load-iceberg-extension never returns, worker is
retired at 60.7s with no DuckDB-level error). Iceberg+DuckLake tenants
don't hit it because LoadExtensions(delta) runs first and primes
DuckDB's extension subsystem.

Bundle iceberg the same way as the others: curl the .duckdb_extension.gz
from ${DUCKDB_EXTENSION_REPOSITORY} at build time, gunzip into
/build/duckdb-extensions/v${DUCKDB_EXTENSION_VERSION}/linux_${TARGETARCH}/,
and add it to the size-check loop. Applies to both the standalone
Dockerfile and Dockerfile.worker so worker pods get a local cache hit on
LoadExtensions("iceberg") instead of a CDN fetch.

Eliminates the iceberg-only activation timeout and brings the activation
cost of LoadExtensions("iceberg") in line with the other bundled
extensions for every iceberg-using tenant (lakekeeper + s3tables
backends alike).