Skip to content

feat(provisioner): Lakekeeper-as-Iceberg-catalog — PR3 (controller trigger)#579

Merged
fuziontech merged 4 commits into
mainfrom
lakekeeper-pr3-provisioning-trigger
May 20, 2026
Merged

feat(provisioner): Lakekeeper-as-Iceberg-catalog — PR3 (controller trigger)#579
fuziontech merged 4 commits into
mainfrom
lakekeeper-pr3-provisioning-trigger

Conversation

@fuziontech
Copy link
Copy Markdown
Member

Summary

Third of the Lakekeeper-as-Iceberg-catalog series. Adds the trigger that turns PR1's manual `EnsureForOrg` into automatic per-org provisioning by wiring it into the existing `Controller` reconcile loop. Targets `lakekeeper-pr2-worker-wiring` as the base — stacked PR; merge #574 and #576 first.

Without this PR, an org with `Iceberg.Backend="lakekeeper"` would sit indefinitely with an empty `LakekeeperEndpoint` and the worker's `AttachIcebergCatalog` would no-op forever. PR3 closes that gap by running the provisioner once per reconcile tick for each `Ready` warehouse that needs it.

What's in this PR

  • `Controller` gains optional `lakekeeperProvisioner` + `LakekeeperInputsResolver` dependencies via `WithLakekeeperProvisioner`. Both nil → `reconcileLakekeeper` is a silent no-op, so existing deployments (operator not installed) are unaffected.
  • `reconcileLakekeeper` runs from `reconcileReady`, gated on `Iceberg.Enabled` + `Backend=lakekeeper` + empty `LakekeeperEndpoint`. Already-provisioned warehouses hit the early-return path.
  • `ErrBootstrapPending` is logged at debug and treated as "requeue next tick". Any other provisioner error is logged at warn so the loop continues with the next warehouse.
  • `LakekeeperInputsResolver` is a function the controller calls to build `ProvisioningInputs`. The convention for sourcing the admin DSN from a Crossplane-managed Secret is deliberately not baked in here — see "What's NOT in this PR" below.

What's NOT in this PR

  • The prod implementation of `LakekeeperInputsResolver`. It needs the Crossplane composition's master-credential Secret name pattern, which lives outside this repo. Wiring it is a small follow-up — the resolver function shape is the only thing this PR locks in.
  • OIDC SA-token auth. Orgs provisioned through this trigger get Lakekeeper in `allowall` mode + NetworkPolicy isolation. PR4 will swap that for OIDC.

Test plan

  • 5 unit tests covering the gate logic:
    • Silent no-op when provisioner is unconfigured (existing deployments unaffected)
    • Silent no-op when `Iceberg.Enabled=false`
    • Silent no-op when `Backend=s3_tables`
    • Silent no-op when `LakekeeperEndpoint` is already populated (idempotent)
    • Graceful continue when the inputs resolver errors (logs warn, doesn't crash)
  • `go test ./...` + `-tags kubernetes` sweeps green
  • `golangci-lint run` clean
  • CI

Stacked

main
└── lakekeeper-pr1 (#574)
    └── lakekeeper-pr2-worker-wiring (#576)
        └── lakekeeper-pr3-provisioning-trigger (this PR)

🤖 Generated with Claude Code

@fuziontech fuziontech force-pushed the lakekeeper-pr2-worker-wiring branch from 64d3b7f to 17c2df5 Compare May 19, 2026 22:22
@fuziontech fuziontech force-pushed the lakekeeper-pr3-provisioning-trigger branch from e42ecd8 to d91a322 Compare May 19, 2026 22:22
Base automatically changed from lakekeeper-pr2-worker-wiring to main May 20, 2026 00:24
@fuziontech fuziontech enabled auto-merge (squash) May 20, 2026 00:25
fuziontech and others added 4 commits May 20, 2026 11:15
…le (PR3)

Adds the trigger that turns PR1's manual EnsureForOrg into automatic
per-org Lakekeeper provisioning. Until now, an org with
Iceberg.Backend=lakekeeper would sit indefinitely with an empty
LakekeeperEndpoint and the worker's AttachIcebergCatalog would no-op.
This PR closes that gap by running the provisioner once per reconcile
tick for each Ready warehouse that needs it.

  * Controller gains optional lakekeeperProvisioner +
    LakekeeperInputsResolver dependencies, set via
    WithLakekeeperProvisioner. Both nil → reconcileLakekeeper is a
    silent no-op so existing deployments (operator not installed) are
    unaffected.
  * reconcileLakekeeper runs from reconcileReady, after the iceberg
    status propagation so the controller sees the freshest view of
    w.Iceberg. Gated on Iceberg.Enabled + Backend=lakekeeper +
    LakekeeperEndpoint=="" so the early-return path is the common
    case for already-provisioned warehouses.
  * ErrBootstrapPending is treated as a normal "requeue next tick"
    signal (logged at debug). Any other error is logged at warn and
    the loop continues with the next warehouse.
  * LakekeeperInputsResolver is a function the controller calls to
    build ProvisioningInputs from the warehouse + a K8s Secret managed
    by Crossplane. The convention for sourcing the admin DSN isn't
    baked in here — it's a per-deployment concern that the caller
    plumbs at controller-construction time.

What's NOT in this PR:
  * The actual prod implementation of LakekeeperInputsResolver. That
    needs the Crossplane composition's master-credential Secret name
    pattern, which lives outside this repo. Wiring it is a follow-up.
  * OIDC SA-token auth — orgs provisioned through this trigger get
    Lakekeeper in allowall mode + NetworkPolicy isolation. PR4 swaps
    that for OIDC.

Tests: 5 unit tests covering the controller's gate logic — silent
no-op when provisioner is unconfigured, when Iceberg.Enabled=false,
when Backend=s3_tables, when LakekeeperEndpoint is already set, and
graceful continue when the inputs resolver returns an error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
  * **Stale w.Iceberg fixed.** The "Lakekeeper reconcile sees the most
    recent view" comment was wrong — the iceberg-propagation write went
    to the DB only; the in-memory pointer stayed stale. New
    applyIcebergUpdatesToWarehouse mirrors the persisted updates onto
    w right after the store write succeeds, so reconcileLakekeeper sees
    the freshest values without a re-read round-trip. Misleading
    comment removed.

  * **WithLakekeeperProvisioner now validates non-nil args** via panic.
    Partial wiring would silently disable the reconcile step; a panic
    surfaces the misconfiguration at startup instead. Two new tests
    cover the panic paths.

  * **Positive-path test added.** TestReconcileLakekeeper_HappyPath
    builds a real LakekeeperProvisioner with fakes (fake K8s client +
    httptest Lakekeeper + live prototype PG), seeds the CR with
    status.bootstrappedAt, and asserts that:
      - EnsureForOrg is actually invoked when all gates pass
      - The inputs resolver's ProvisioningInputs reach the provisioner
        field-for-field intact
      - The warehouse row gets the persisted Lakekeeper config
        (endpoint, warehouse name) at the end
    Closes the "tests only verify skip paths" gap from the review.

  * **TODO on duplicate Duckling Get round-trips.** reconcileReady now
    fetches the same CR three times when iceberg is enabled (pgbouncer
    drift, iceberg drift, status propagation) — flagged for
    consolidation when another caller adds a fourth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…A-token auth) (#580)

* feat(provisioner,worker): OIDC SA-token auth via in-process broker (PR4)

Adds defense-in-depth for the duckling → Lakekeeper path. Until this
PR, isolation between orgs depended entirely on NetworkPolicy +
allowall on the Lakekeeper side. PR4 turns on the operator's
authentication.kubernetes mode so Lakekeeper validates the duckling's
projected SA token against the K8s TokenReview API before accepting
any catalog request — even if a NetworkPolicy is misconfigured, only
ducklings with a valid SA token signed by the cluster CA can talk to
the catalog.

DuckDB's iceberg extension fetches the bearer via OAuth2 (POST
client_credentials), not by reading a file. The bridge is the new
in-process broker that the worker runs on a loopback port.

  * server/lakekeeperbroker (NEW) — tiny HTTP server, loopback-only,
    handles POST /token by re-reading a projected SA token from disk
    each request and wrapping it as an OAuth2 response. Kubelet
    rotates the file in place; no in-process caching. Refuses to bind
    to non-loopback (exposing the SA token to any other caller would
    be a real leak). 10 unit tests cover happy path, GET rejection,
    file missing/empty 503s, health endpoint, double-start protection,
    expires_in override, non-loopback refusal.

  * cmd/duckgres-worker — starts the broker when DUCKGRES_LAKEKEEPER_
    TOKEN_PATH is set. When unset (every existing duckling pod
    today), no broker starts and behavior is unchanged.

  * LakekeeperCRSpec gains KubernetesAuthAudiences. Non-empty
    populates spec.authentication.kubernetes on the CR (the operator
    turns this into LAKEKEEPER__K8S_AUTH_ENABLED=true +
    LAKEKEEPER__K8S_AUTH_AUDIENCES=<csv>). Empty omits the block
    entirely — Lakekeeper continues running in allowall mode.

  * ProvisioningInputs.KubernetesAuthAudiences threads the audience
    list through to the CR. When non-empty, EnsureForOrg also writes
    LakekeeperOAuth2ServerURI=http://127.0.0.1:9876/token (the
    worker-local broker) to the warehouse row, so the worker's
    server/iceberg ATTACH builder emits the OAuth2 secret + ATTACH
    instead of the AUTHORIZATION_TYPE 'none' form.

What's NOT in this PR:
  * Pod spec / chart changes adding the projected SA volume mount —
    that's a follow-up in the charts repo where the duckling pod
    template lives. The broker is dormant until the env var + token
    file are wired by ops.
  * Changes to the controller's InputsResolver to set
    KubernetesAuthAudiences. The flag lives in ProvisioningInputs and
    callers opt in when ready; the prod resolver implementation is
    still on the deferred list (task #24).

Tests: 4 new — 2 for the CR's authentication block (on/off), 2 for
the OAUTH2_SERVER_URI population (OIDC mode → 127.0.0.1:9876; allowall
mode → empty). Live-PG-gated. Plus the 10 broker unit tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup(provisioner,broker): address PR4 deep-review findings

  * **Pragma: no-cache** added to /token response. RFC 6749 §5.1 mandates
    both Cache-Control: no-store AND Pragma: no-cache; we only had the
    former. The HTTP/1.0 Pragma header is irrelevant on a loopback
    connection to DuckDB, but the fix is one line and the broker is
    otherwise spec-compliant. Test asserts both headers now.

  * **Cross-check on the OIDC test** —
    TestEnsureForOrg_PersistsOAuth2URIWhenKubernetesAuthOn now reads the
    Lakekeeper CR back from the fake dynamic client and asserts that
    spec.authentication.kubernetes.enabled is true with the right
    audiences IN THE SAME EnsureForOrg call. Without this, the DB row
    could carry the broker URL while the CR stayed in allowall mode —
    Lakekeeper would reject every token. A future refactor that splits
    or reorders the wiring would now fail the test instead of silently
    deploying broken auth.

  * **WithExpiresIn TODO** — added a TODO(PR5) noting the env-var wiring
    for the override is part of the same pod-spec work that lands the
    projected SA volume. The 60s default is intentional; the option is
    pre-staged for when the override actually has somewhere to live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fixup(provisioner): doc OIDC flag-day deploy ordering on KubernetesAuthAudiences

Cross-PR review caught that once any org gets a non-empty
KubernetesAuthAudiences value, the provisioner writes
LakekeeperOAuth2ServerURI=http://127.0.0.1:9876/token to the row, and
that value is permanent (no path clears it). Ducklings whose pod spec
hasn't yet been wired to (a) mount the projected SA token at
DUCKGRES_LAKEKEEPER_TOKEN_PATH and (b) start the broker on 9876 will
have iceberg ATTACH fail with connection refused.

Documents the required deploy ordering on the struct field comment:
ship the pod spec change first, then the operator chart change, then
flip the audiences in the inputs resolver.

Codified guardrail is a follow-up — the provisioner would need a
signal that the worker image has the broker compiled in (PR4 already
ensures that) AND that the runtime env has DUCKGRES_LAKEKEEPER_TOKEN_PATH
set, which only the cluster operator knows. For now, the comment is
the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…R5) (#583)

Enables the per-org Lakekeeper provisioning branch end-to-end. The
provisioning controller already had WithLakekeeperProvisioner (PR3) but
nothing called it, so reconcileLakekeeper was inert in every deployment.

- newLakekeeperInputsResolver resolves per-org ProvisioningInputs:
    1. Crossplane Duckling CR status (prod) — the metadata-store master
       creds double as the admin connection that CREATEs the
       lakekeeper_<orgid> db/role; data-store bucket is the S3 warehouse.
       Admin DDL + the Lakekeeper pod target the DIRECT Aurora endpoint,
       never the PgBouncer pooler (transaction pooling breaks CREATE
       DATABASE and Lakekeeper's own migrations). S3 uses pod IRSA.
    2. Env fallback (dev/orbstack + MinIO) when no usable Duckling CR.
- multitenant.go wires it after NewController, gated behind
  DUCKGRES_LAKEKEEPER_PROVISIONER_ENABLED (off by default; best-effort if
  the K8s client can't be built). S3-Tables warehouses are unaffected.
- KubernetesAuthAudiences left empty: this is the allowall + NetworkPolicy
  deployment shape. OIDC SA-token auth remains a separate flag-day change.

Builds clean with -tags kubernetes; resolver unit tests cover the CR path,
env fallback, CR-error fallback, and the no-source error.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@fuziontech fuziontech force-pushed the lakekeeper-pr3-provisioning-trigger branch from 8a76b67 to 5e5ac44 Compare May 20, 2026 18:16
@fuziontech fuziontech merged commit 7745593 into main May 20, 2026
16 checks passed
@fuziontech fuziontech deleted the lakekeeper-pr3-provisioning-trigger branch May 20, 2026 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant