
feat(controlplane): protect active sessions across CP shutdowns #472

Merged
benben merged 2 commits into main from ben/protect-active-sessions-from-cp-shutdown on Apr 30, 2026
Conversation


benben commented Apr 30, 2026

Summary

Closes the timing race that killed the customer query on worker 40761 yesterday: a short SELECT that happened to be in flight when the old CP gave up draining at its self-imposed 15-minute wall. Bumping the wall doesn't fix the race, because a query can be in flight at any wall. Three interlocking changes remove the wall instead, in a way that bounds the cost and self-cleans:

| Layer | What | Why |
| --- | --- | --- |
| 1 | CP drains until sessions finish (no internal timeout in remote mode) | Was the proximate cause: a self-imposed HandoverDrainTimeout (15m) ran ShutdownAll on a CP that was still serving traffic. K8s terminationGracePeriodSeconds (24h per chart) is the right wall. |
| 2 | ShutdownAll skips workers with activeSessions > 0 | Defense in depth — even if k8s SIGKILLs the CP, the worker pod survives so a Flight client can reconnect by session token and a peer CP can claim via TakeOverWorker. |
| 3 | ListOrphanedWorkers excludes workers with reclaimable Flight sessions | Without this, a peer CP's janitor would retire workers spared by Layer 2 the moment the dying CP's row was expired, defeating the protection. Bounded by ExpireFlightSessionRecords, so the protection self-clears within the session TTL (default 1h). |

Behavior matrix

| Scenario | Result |
| --- | --- |
| Customer mid-query during CP roll | Old CP waits for the query to finish (Layer 1). Clean exit. |
| Long query > k8s gracePeriod (24h) | k8s SIGKILLs the CP. Layer 2 keeps the worker pod alive. Layer 3 keeps the orphan janitor from retiring it while the Flight session record is active or reconnecting. Customer Flight client reconnects → peer CP claims via TakeOverWorker → query resumes. |
| Customer never reconnects | ExpireFlightSessionRecords moves the session record to expired (default after 1h). Next orphan-sweep tick retires the worker normally. No leak. |
| Pgwire customer on dying CP | Connection dies when the CP exits regardless — the protocol can't survive a CP restart. Layer 1 extends the CP's life so most short queries finish; the long-tail case is unrecoverable for pgwire by design. |
| Idle / warm worker during shutdown | Drained as before. Layer 2's check is activeSessions > 0, so warm-pool workers (0 sessions) still go through the drain chain cleanly. |

RED → GREEN

Two-commit history:

- 062fc2f (RED): four failing tests
  - TestShutdownAll_SparesWorkersWithActiveSessions (Layer 2)
  - TestListOrphanedWorkersExcludesWorkersWithActiveFlightSessions (Layer 3, primary)
  - TestListOrphanedWorkersIncludesWorkersWithReconnectingFlightSessions (Layer 3, reconnect window)
  - TestListOrphanedWorkersIncludesWorkersWithExpiredFlightSessions (Layer 3, bound — terminal sessions don't pin workers)
- cf8121a (GREEN): implementations land all three layers; existing ShutdownAll tests still pass; existing orphan-cleanup tests still pass.

Layer 1 isn't covered by a unit test (its surface — cfg.HandoverDrainTimeout default and waitForDrain's timeout==0 path — needs a fully constructed CP to exercise meaningfully). The change is small and the log line has been updated to make the unbounded behavior explicit ("Waiting for planned shutdown drain (unbounded — k8s SIGKILL is the wall).") so operators can verify visually post-deploy.

Operational notes

- A draining CP can now stay alive for a long time. With maxSurge: 0 already in the chart, you'll have N CPs steady-state plus draining old ones; the resource usage to watch is config-store connections (one per draining CP) and pod count.
- The previous ExpireDrainingControlPlaneInstances safety valve is now disabled in remote mode (gated on j.maxDrainTimeout > 0). A CP that's stuck mid-drain but heartbeating will not be force-expired by peers; it'll be either:
  - SIGKILL'd by k8s at gracePeriod (heartbeat stops → ExpireControlPlaneInstances handles it normally), or
  - manually killed by an operator (same path).
- Stale-but-still-heartbeating CPs are unusual; if they become a problem, fixing the leak at its source beats reinstating the wall.

Test plan

- Full controlplane suite passes with -tags kubernetes.
- All four new tests fail on 062fc2f and pass on cf8121a.
- CI runs the postgres-backed Layer 3 tests against the real schema.
- Post-deploy: roll a CP in mw-prod-us during a long query and verify no `flight client panic` / `worker likely crashed` errors at the moment of CP exit.

benben added 2 commits April 30, 2026 14:12

test(controlplane): protect active sessions from CP shutdown (RED)

Failing tests for the two pieces of state-store + worker-pool behavior
that need to change so a CP rollout doesn't kill in-flight customer
queries — extends the worker 40761 incident analysis to a regression
suite.

- TestShutdownAll_SparesWorkersWithActiveSessions: a worker with
  activeSessions > 0 must be skipped by ShutdownAll. Today the chain
  marks-draining and pod-deletes every owned worker, killing in-flight
  queries when the CP receives SIGTERM.
- TestListOrphanedWorkersExcludesWorkersWithActiveFlightSessions: a
  worker with a Flight session record in active state must be spared
  by ListOrphanedWorkers even if its owning CP has expired. Without
  this, peer CPs' janitors retire workers the customer can still
  reclaim by reconnect.
- TestListOrphanedWorkersIncludesWorkersWithReconnectingFlightSessions:
  same protection extends to records in reconnecting state — that's
  exactly when the customer is in the middle of picking the session
  back up.
- TestListOrphanedWorkersIncludesWorkersWithExpiredFlightSessions:
  once the Flight session record is terminal (expired/closed) the
  customer can't reclaim, so the worker should be retired normally.
  Bounds the protection so a stuck row can't pin a worker forever.

Tests fail with the current implementation; the next commit makes them
pass.

feat(controlplane): protect active sessions across CP shutdowns (GREEN)

Three interlocking changes that keep an in-flight customer query from
being killed by a CP rollout. Together they remove the timing race
behind the worker-40761 incident: a query running at the moment the
old CP gives up (whether at 15min, 8h, or any other timeout) survives.

Layer 1: CP doesn't exit while sessions are active.
- HandoverDrainTimeout default flips from 15m to 0 in remote mode
  (= unbounded; k8s terminationGracePeriodSeconds is the only wall).
  The previous self-imposed wall was the proximate cause of pod-deletes
  during long-tail drains. Process mode keeps 24h since there's no k8s
  safety net.
- waitForDrain treats timeout==0 as unbounded via context.Background().
- Janitor's ExpireDrainingControlPlaneInstances is now disabled in
  remote mode (it gates on j.maxDrainTimeout > 0). A draining CP that
  is still heartbeating stays "alive" to its peers; only stale
  heartbeat — handled by the existing ExpireControlPlaneInstances —
  marks a CP dead. This prevents peer CPs from forcibly expiring a
  draining CP and orphan-retiring its workers.

Layer 2: ShutdownAll skips workers with active sessions.
- Defense in depth for the case where the CP exits anyway (k8s
  SIGKILL after gracePeriod, or future code paths). A worker with
  activeSessions > 0 is left running in 'hot' state owned by the
  dying CP. The pod survives, the Flight client can reconnect by
  session token, and a peer CP can claim via TakeOverWorker.
- Workers preserved this way stay in the in-memory pool until process
  exit so any residual session bookkeeping during the shutdown window
  still finds them.

Layer 3: orphan janitor spares workers with reclaimable Flight sessions.
- ListOrphanedWorkers gains a NOT EXISTS filter against
  flight_session_records: a row with at least one session in active or
  reconnecting state is left alone. Bounds the protection — once
  ExpireFlightSessionRecords moves the session to expired/closed, the
  worker is retired normally on the next sweep.

Behavior with all three layers:
- Customer mid-query during CP roll → old CP's drainAndShutdown waits
  for sessions → if customer finishes, clean exit; if k8s SIGKILL
  fires, worker survives via Layer 2 + Layer 3.
- Customer Flight client reconnects within session TTL → peer CP
  claims via TakeOverWorker, query resumes.
- Customer never reconnects → flight_session_records expires (TTL
  default 1h) → next orphan sweep retires the worker normally.
- Pgwire customer connected to dying CP → connection dies when CP
  exits regardless. Worker preservation is moot for them but harmless;
  Layer 1 still extends the CP's life so most short queries finish.
benben merged commit 2fb1996 into main Apr 30, 2026
21 checks passed
benben deleted the ben/protect-active-sessions-from-cp-shutdown branch April 30, 2026 12:43
fuziontech added a commit that referenced this pull request May 1, 2026
…ges (#483)

* fix(controlplane): stamp warm idle workers with creating CP id to stop orphan churn (#469)

workerRecordFor used to clear OwnerCPInstanceID whenever state==Idle, so
every freshly-spawned warm worker landed in the runtime store with an
empty owner. That row matched ListOrphanedWorkers case (2)
(NULLIF(owner_cp_instance_id, '') IS NULL AND last_heartbeat_at <=
before) the moment it crossed the 30s orphan grace, because nothing
refreshes last_heartbeat_at for an idle row in the warm pool. The
janitor retired the worker, reconcileWarmCapacity replaced it, and the
loop ran continuously.

Persisting warm workers with the creating CP's instance id moves them to
case (1), which is bounded by the active CP's 5s heartbeat instead. If
the CP genuinely dies, the existing CP-expiry path still flips its rows
to expired and the orphan grace then applies as designed.

OrgID is still cleared on the idle transition.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache-proxy): forward origin status, body, and headers verbatim (#470)

* test(cache-proxy): cover origin status passthrough (RED)

Five new tests plus an update to the existing TestHandleProxyOriginError,
all asserting that the cache proxy must forward upstream status codes,
bodies, and response headers verbatim instead of collapsing every
non-2xx into a 502.

The user case that motivated this: S3 returns 400 with
<Code>ExpiredToken</Code> in an XML envelope. The proxy was rewriting
that to 502 with a Go-formatted error string, which (a) made DuckDB's
httpfs treat a terminal auth failure as transient and retry it
indefinitely, and (b) hid the real S3 error class from operators.

New cases:
- 5xx forwarded verbatim (replaces the old 502-asserting test)
- 400 forwarded verbatim, body + Content-Type + X-Amz-Request-Id preserved
- 404 forwarded verbatim
- 416 forwarded verbatim with Content-Range preserved
- error responses are NOT cached (the next request hits origin)
- pure network errors (no HTTP response) still get 502 (the only case
  where 502 is correct, since there's no upstream status to forward)

Tests fail with the current implementation; the next commit makes them
pass.

* fix(cache-proxy): forward origin status, body, and headers verbatim (GREEN)

The proxy was previously translating every non-2xx upstream response
into a 502 Bad Gateway with a Go-formatted error string. That:
  1. Hid the real S3 error class — DuckDB's httpfs treats 5xx as
     transient and retries, so a terminal 4xx (e.g. an ExpiredToken
     auth failure) was being retried indefinitely instead of failing
     fast and surfacing the real cause.
  2. Stripped the XML error envelope DuckLake parses, replacing it
     with a Go error string DuckLake doesn't understand.
  3. Dropped headers (Content-Type, X-Amz-Request-Id, Content-Range)
     that DuckDB and operators rely on.

This change introduces `originStatusError`, a typed error returned by
fetchOrigin whenever the upstream responds with status >= 400. The
caller in HandleProxy detects it via errors.As and forwards the
captured status code, body (up to 1 MiB), and headers (minus
hop-by-hop) back to the client unchanged.

Pure transport errors (DNS, connection refused, TLS, timeout) keep
returning 502 — there's no upstream status to forward in that case,
and 5xx is what httpfs's transient-retry policy is designed for.

Tests in the previous commit covered:
  - 5xx forwarded verbatim (replaces the old 502-asserting test)
  - 400 forwarded verbatim with XML body + Content-Type + amz headers
  - 404 forwarded verbatim
  - 416 forwarded verbatim with Content-Range preserved
  - error responses are NOT cached (the next request hits origin)
  - true network errors still get 502

* feat(controlplane): persist STS expiry and refresh credentials before they go stale (#471)

Closes the worker credential expiration bug that surfaced in mw-prod-us
(ExpiredToken errors after the 1h STS session-duration boundary).

Schema:
- worker_records gets s3_credentials_expires_at (nullable, indexed). It's
  set by the activator after a successful STS AssumeRole + ActivateTenant
  and consulted by the credential refresh scheduler to pick workers near
  expiry. NULL is treated as 'due now' so legacy / unstamped rows get
  refreshed eagerly.
- UpsertWorkerRecord's DoUpdates list includes the new column.

Configstore methods:
- ListWorkersDueForCredentialRefresh(ownerCPInstanceID, cutoff) returns
  workers we own in active org-bound states whose creds expire by the
  cutoff or have NULL expiry.
- MarkCredentialsRefreshed(workerID, cpInstanceID, expectedEpoch,
  newExpiresAt) is a conditional UPDATE that only stamps when ownership
  is still ours.
- BumpWorkerEpoch(workerID, cpInstanceID, expectedEpoch) atomically
  increments owner_epoch on a worker we already own (used before
  re-sending ActivateTenant for refresh — the worker's
  reuseExistingActivation guard requires payload.OwnerEpoch > current).
- Postgres-backed regression coverage for all three (owner-scoping,
  NULL-as-due, healthy/neutral/terminal exclusions, conditional update
  failure modes).

Worker side:
- Skip the in-process StartCredentialRefresh ticker in shared-warm
  (multi-tenant) mode. That ticker (a) ran on the session's pinned
  *sql.Conn and serialized behind user queries — a 1h+ query starved
  the refresh until creds had already expired — and (b) hit an else
  branch that swapped the org's STS-brokered config secret for a
  credential_chain one DuckDB can't satisfy on EKS Pod Identity. The
  control plane drives refreshes via re-activation now; the standalone
  single-tenant path keeps the ticker.

Activator (shared_worker_activator.go):
- TenantActivationPayload.S3CredentialsExpiresAt carries the STS
  Expiration through from buildDuckLakeConfigFromDuckling. Static-cred
  warehouses (config-store path) leave it nil.
- ActivateReservedWorker stamps the expiry on the worker_records row
  via runtimeStore.MarkCredentialsRefreshed after a successful
  activation. Best-effort: failure here doesn't fail activation.
- New RefreshCredentials method bumps owner_epoch atomically, sends
  ActivateTenant with the bumped epoch and freshly-brokered creds, and
  stamps the new expiry. Skips static-cred orgs.

Janitor scheduler:
- New refreshExpiringCredentials lambda (leader-only) lists workers
  whose creds expire within stsSessionDuration/2 (= 30 min today),
  resolves the in-memory ManagedWorker, and calls activator.RefreshCredentials.
- Persists state in worker_records, so a CP failover doesn't lose
  scheduling — the next leader queries the same column.
- A worker not in the leader's local pool (mid-takeover, mid-retire) is
  skipped; it'll be picked up on the next tick if it's still ours.

* feat(controlplane): protect active sessions across CP shutdowns (#472)

* test(controlplane): protect active sessions from CP shutdown (RED)

Failing tests for the two pieces of state-store + worker-pool behavior
that need to change so a CP rollout doesn't kill in-flight customer
queries — extends the worker 40761 incident analysis to a regression
suite.

- TestShutdownAll_SparesWorkersWithActiveSessions: a worker with
  activeSessions > 0 must be skipped by ShutdownAll. Today the chain
  marks-draining and pod-deletes every owned worker, killing in-flight
  queries when the CP receives SIGTERM.
- TestListOrphanedWorkersExcludesWorkersWithActiveFlightSessions: a
  worker with a Flight session record in active state must be spared
  by ListOrphanedWorkers even if its owning CP has expired. Without
  this, peer CPs' janitors retire workers the customer can still
  reclaim by reconnect.
- TestListOrphanedWorkersIncludesWorkersWithReconnectingFlightSessions:
  same protection extends to records in reconnecting state — that's
  exactly when the customer is in the middle of picking the session
  back up.
- TestListOrphanedWorkersIncludesWorkersWithExpiredFlightSessions:
  once the Flight session record is terminal (expired/closed) the
  customer can't reclaim, so the worker should be retired normally.
  Bounds the protection so a stuck row can't pin a worker forever.

Tests fail with the current implementation; the next commit makes them
pass.

* feat(controlplane): protect active sessions across CP shutdowns (GREEN)

Three interlocking changes that keep an in-flight customer query from
being killed by a CP rollout. Together they remove the timing race
behind the worker-40761 incident: a query running at the moment the
old CP gives up (whether at 15min, 8h, or any other timeout) survives.

Layer 1: CP doesn't exit while sessions are active.
- HandoverDrainTimeout default flips from 15m to 0 in remote mode
  (= unbounded; k8s terminationGracePeriodSeconds is the only wall).
  The previous self-imposed wall was the proximate cause of pod-deletes
  during long-tail drains. Process mode keeps 24h since there's no k8s
  safety net.
- waitForDrain treats timeout==0 as unbounded via context.Background().
- Janitor's ExpireDrainingControlPlaneInstances is now disabled in
  remote mode (it gates on j.maxDrainTimeout > 0). A draining CP that
  is still heartbeating stays "alive" to its peers; only stale
  heartbeat — handled by the existing ExpireControlPlaneInstances —
  marks a CP dead. This prevents peer CPs from forcibly expiring a
  draining CP and orphan-retiring its workers.

Layer 2: ShutdownAll skips workers with active sessions.
- Defense in depth for the case where the CP exits anyway (k8s
  SIGKILL after gracePeriod, or future code paths). A worker with
  activeSessions > 0 is left running in 'hot' state owned by the
  dying CP. The pod survives, the Flight client can reconnect by
  session token, and a peer CP can claim via TakeOverWorker.
- Workers preserved this way stay in the in-memory pool until process
  exit so any residual session bookkeeping during the shutdown window
  still finds them.

Layer 3: orphan janitor spares workers with reclaimable Flight sessions.
- ListOrphanedWorkers gains a NOT EXISTS filter against
  flight_session_records: a row with at least one session in active or
  reconnecting state is left alone. Bounds the protection — once
  ExpireFlightSessionRecords moves the session to expired/closed, the
  worker is retired normally on the next sweep.

Behavior with all three layers:
- Customer mid-query during CP roll → old CP's drainAndShutdown waits
  for sessions → if customer finishes, clean exit; if k8s SIGKILL
  fires, worker survives via Layer 2 + Layer 3.
- Customer Flight client reconnects within session TTL → peer CP
  claims via TakeOverWorker, query resumes.
- Customer never reconnects → flight_session_records expires (TTL
  default 1h) → next orphan sweep retires the worker normally.
- Pgwire customer connected to dying CP → connection dies when CP
  exits regardless. Worker preservation is moot for them but harmless;
  Layer 1 still extends the CP's life so most short queries finish.

* feat(server): split query-error logs by SQLSTATE class (user vs infra) (#473)

* feat(server): split query-error logs by SQLSTATE class (user vs infra)

The single "Query execution failed." Error log line was alerting-
hostile: a customer typo'ing a column name produced the same Error
that a worker crash did, drowning real infra failures in user-error
noise. Split into two distinguishable lines using the SQLSTATE we
already compute for the pgwire error response — no new string matching:

  Info  "Query execution failed."  for SQLSTATE classes
                                    {0A, 22, 23, 25, 28, 2B, 3D, 3F, 42, 44}
                                    plus 57014 (client cancellation)
  Error "Query execution errored." for everything else (08, 53, 54,
                                    57 except 57014, 58, XX, …)

Mechanically:
- New isUserQueryError(err) inspects the existing classifyErrorCode
  output and matches the class against a closed allow-list. Closed-
  list semantics so future SQLSTATEs we haven't catalogued err toward
  Error (the safe direction for alerting).
- 57014 query_canceled short-circuits to user (its parent class 57 is
  otherwise infra; client-pressed-Ctrl-C is user-attributable).
- DuckLake transaction conflict and metadata-connection-lost paths are
  unchanged — they keep their own Warn lines because retry tooling
  cares about those signals specifically.

classifyErrorCode default fallback flips from "42000" to "XX000":
unknown errors (no DuckDB prefix matched) are typically infra (gRPC,
network, internal panics), not syntax errors. The previous default
made every unknown error look like a user error in the alert path.
Two existing test cases that asserted the old fallback are updated
accordingly with comments explaining why.

New tests in TestIsUserQueryError pin every user-class branch plus
representative infra cases and edge cases (nil error, 57014 short-
circuit, 40001 sitting in infra-side after the early-return).

* feat(server): log query start/finish on the worker boundary with trace_id

Two new Info-level slog lines around the worker dispatch in
executeSelectQuery so an operator can correlate logs with traces and
filter to a specific worker:

  Info "Query started."    user=… query=… worker=N worker_pod=…
                              trace_id=<otel-hex>
  Info "Query finished."   user=… duration_ms=… rows=N
                              worker=N worker_pod=… trace_id=<otel-hex>
                              [error=… on failures]

The worker / worker_pod attributes match what "Query execution failed."
already emits — same shape so a search like worker=40761 surfaces the
full lifecycle on a single worker.

trace_id matches the OTEL trace ID exported for the same query (see
server/tracing.go's existing traceIDFromContext helper), so a Loki
query with trace_id=abc123 lines up directly with the trace view.

logQueryFinished stays Info even on error so the start/finish pair is
always balanced in the log stream — severity routing for failures is
done by logQueryError separately (Info for user errors, Error for
infra). Operators following one trace see started + finished + an
optional separate severity line.

Initial use sites are limited to executeSelectQuery (the SELECT path
that's the bulk of customer queries and the one that fired in the
worker-40761 incident). The non-returning path in executeSimpleQuery
and the prepared-statement path still emit only the existing
logQuery (structured query log) — they can be wired up incrementally
if the lifecycle lines prove useful in production.

* fix(controlplane): refresh STS credentials per CP, not only on the janitor leader (#474)

The credential-refresh scheduler added in #471 was wired into the janitor's
leader-only loop. Workers owned by non-leader CPs therefore never had their
S3 session tokens re-brokered, and any long-running query on those workers
hit ExpiredToken once the original 1h STS session lapsed.

Move the scheduler out of the janitor and into a per-CP background goroutine
spawned from SetupMultiTenant. Each CP refreshes only the workers it owns
(filtered by cpInstanceID in the SQL), regardless of leader status. The new
credentialRefreshScheduler type wraps the existing tick logic so it's
independently testable.

* Bypass transpiler for file COPY statements (#475)

* docs: clarify k8s worker reuse policy (#476)

* fix(server): skip DuckLake index ensure when all indexes already exist (#478)

Replaces the 9 sequential CREATE INDEX IF NOT EXISTS round-trips on
every fresh worker pod's first catalog attach with a single pg_indexes
existence check. CREATE INDEX IF NOT EXISTS is a no-op at the storage
layer once the index exists, but still costs a server round-trip; under
pgbouncer transaction pooling that round-trip can take 1-2s during burst
load (server-conn handover + TLS handshake). Empirically 16-24s in the
prod posthog metastore. Collapsing to one round-trip cuts the
post-attach window to ~150-300ms in the steady state, which matters
because the control plane's session-init context is bounded at 5s — the
slow path was overrunning the budget and surfacing as
"failed to detect ducklake catalog attachment" FATALs on every
warm-pool replenishment.

The slow path is unchanged and still self-heals if any index goes
missing. Indexes are now declared as a name+stmt slice so the
fast-path lookup names can't drift from the slow-path statements.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix timestamp aliases in Arrow schema mapping (#479)

* fix(duckdbservice): isolate session-cleanup contexts and discard poisoned conns (#481)

Three related changes to cleanupSessionState that together fix a worker
poisoning chain after cancelled queries.

When a query is cancelled mid-execution, DuckDB can leave the underlying
connection in an aborted/INTERRUPT'd state where every subsequent statement
returns "INTERRUPT Error: Interrupted!" until ROLLBACK runs. The cleanup
path called by DestroySession then runs SELECT + DROP statements against
this poisoned connection inside a single 5-second context. The SELECT
hits the aborted-state error or burns the whole budget; every following
DROP then fails instantly with context deadline exceeded. The connection
is returned to the *sql.DB pool dirty, and the next session that picks
it up fails its first metadata operation (e.g. "USE memory") under the
control plane's 5s session-init context. From the client this surfaces as
"failed to initialize session database metadata" or "failed to detect
ducklake catalog attachment" — same symptoms as the warm-pool churn case
fixed in #478, but a different root cause that the previous PR doesn't
address.

Three changes:
- ROLLBACK first to clear any aborted-transaction state before running
  the cleanup queries. Same pattern that initSearchPath already uses.
- Per-step contexts: 3s for the enumeration SELECT, 1s for each DROP,
  separately. A slow SELECT no longer eats the budget for all DROPs;
  one slow DROP no longer poisons the rest.
- cleanupSessionState now returns a clean/dirty signal. If cleanup
  didn't complete cleanly, DestroySession marks the conn bad via
  Conn.Raw returning driver.ErrBadConn, so database/sql discards it
  from the pool instead of handing it to the next session.

Reproduced the poisoning chain on the dev cluster: a clickbench scan
cancelled mid-run caused a downstream session to fail with
"failed to initialize session database metadata: switch to memory
catalog: flight execute update: deadline exceeded", followed by the
worker being marked unresponsive and reaped.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(duckdbservice): discard pooled conn after every session in cluster mode (#485)

* fix(duckdbservice): discard pooled conn after every session in cluster mode

In cluster mode every worker is bound to a single org via activateTenant,
so the security boundary aligns with the worker lifecycle and there's no
need to scrub per-conn state on session teardown to protect against
cross-org leakage.

The existing cleanup loop is two things:
- Mostly wasted work in cluster mode. A typical billing-style session
  (open / SELECT 1 / close) creates zero user temp objects, but the
  cleanup still enumerates duckdb_views() and issues ~46 DROP IF EXISTS
  no-ops against system views that live in main / information_schema /
  pg_catalog (the DROP statements target the temp schema, so they're
  lookups against an empty namespace).
- Incomplete. It handles temp tables and temp views but not temp macros,
  temp types, or temp sequences. Those still leak across pooled-conn
  reuse — verified on dev: a TEMP MACRO created in session A was visible
  in session B when the pool happened to hand back the same driver conn.

Cluster mode now skips the cleanup loop entirely and always marks the
conn bad via Conn.Raw → driver.ErrBadConn so database/sql evicts it from
the pool. The next session opens a fresh DuckDB connection — no temp
leakage of any kind.

Standalone mode keeps the existing cleanupSessionState path because
pooled conns can be reused across orgs and scrubbing is required. The
post-#481 conn-discard fallback is preserved when cleanup fails.

Empirically motivated: prod was reaping ~29 workers/hour because bursts
of session-end events (e.g. 16 billing queries arriving at once) ran
hundreds of cleanup round-trips concurrently on a single worker, blocking
the gRPC health-check long enough to fail 3× and trip the unresponsive-
worker reaper. The cleanup itself was correct (post-#481) but the
aggregate throughput was the bottleneck. Skipping the loop in cluster
mode collapses ~48 round-trips per session-end to 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(duckdbservice): wrap conn-pool eviction behind a named helper

driver.ErrBadConn is the standard idiom for evicting a *sql.Conn from
the *sql.DB pool, but the name is misleading at call sites — the conn
isn't actually broken, we just don't want it reused. Go acknowledges
the API gap (golang/go#40722) but hasn't fixed it.

Wrap the Conn.Raw + ErrBadConn dance behind a named helper
(evictConnFromPool) with a comment explaining why this is the standard
idiom despite the awkward name. Update both call sites in
DestroySession plus the log key (discarded -> evicted) to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(k8s): include hot_idle in findActiveOrgWorkerPodSince

The helper polls cp_runtime.worker_records for workers bound to a given
org. The state filter previously excluded hot_idle, on the assumption
that the test would always catch a worker mid-session. With faster
session teardown (cluster-mode conn eviction), workers can transition
hot -> hot_idle before the polling loop runs, leaving the test unable
to find the worker that just served the org.

hot_idle workers are still bound to the org — org_id stays set on the
runtime record while the worker is parked for fast re-claim by the same
org — so they're the correct answer to 'which worker handled this org
just now?' Adding hot_idle to the state list makes the helper robust to
session-teardown speed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(duckdbservice): extract DuckDB-free arrow helpers into arrowmap subpackage

Step 1 of a 4-step plan to split duckgres into separate control-plane and
worker binaries so the control plane can stop linking libduckdb (and the
worker image can ship per-DuckDB-version variants).

This change moves the DuckDB-free helpers (DuckDBTypeToArrow,
QualifyTableName, QuoteIdent, SupportsLimit, and the struct/map/decimal
type-string parsers) out of duckdbservice/arrow_helpers.go into a new
duckdbservice/arrowmap package. arrowmap has zero dependency on
github.com/duckdb/duckdb-go; it's plain string-to-arrow.DataType mapping.

The duckdbservice package keeps the helpers callable as before via thin
re-export shims so existing call sites are unaffected. server/flightsqlingress
is updated to import arrowmap directly for the three call sites that don't
need DuckDB types (the AppendValue call site stays on duckdbservice for now;
splitting AppendValue's duckdb-bound branches lands in a follow-up).

This PR is purely a refactor — no behavior changes, no new binary, no CI
changes. It establishes the package boundary and pattern that subsequent
PRs build on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(arrowmap): split AppendValue into duckdb-free core + driver hooks

Step 2 of the binary-split plan. Moves AppendValue out of duckdbservice
into arrowmap so callers that don't need duckdb-go's driver value types
(duckdb.Interval, Decimal, UUID, OrderedMap, Map) can use it without
pulling libduckdb into their import graph.

The split uses a registration hook rather than a build tag:

  - arrowmap.AppendValue handles all arrow-native and Go-native value
    types directly. It first consults any registered Appender hooks
    before falling back to the built-in switch.
  - duckdbservice/appender_init.go's init() registers a single hook
    that handles duckdb.Interval / Decimal / UUID / OrderedMap / Map.
  - When duckdbservice is linked into a binary (worker, standalone),
    the duckdb types get full coverage automatically. When it isn't
    (a future controlplane-only binary), the duckdb cases are dead
    code paths that wouldn't fire anyway because duckdb-go is the
    only thing producing those typed values.

OrderedMapValue moved from server/flight_executor.go into arrowmap so
the MAP branch in arrowmap.AppendValue can switch on it without
arrowmap depending on the server package. server.OrderedMapValue is
preserved as a type alias for backward compatibility with existing
call sites.

server/flightsqlingress no longer imports duckdbservice — it uses
arrowmap.AppendValue directly. The package still transitively pulls
duckdb-go via its `server` import; getting `server` itself duckdb-free
is the next chunk of work.

Verified:
  - go build ./... clean
  - go build -tags kubernetes ./... clean
  - go test -short ./duckdbservice/... ./server/flightsqlingress/...
    all green
  - go list -deps ./duckdbservice/arrowmap | grep duckdb-go is empty
    (arrowmap remains a pure leaf even with AppendValue + OrderedMapValue
    living in it)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(server): extract auth and sysinfo helpers into pure subpackages

Step 3 of the binary-split plan. Carves the duckdb-free pieces of the
server package into focused subpackages so the eventual server-package
split (PR #4+) has less tangled state to deal with.

  server/auth/    — RateLimiter, RateLimitConfig, ValidateUserPassword,
                    BeginRateLimitedAuthAttempt, RecordFailed/SuccessfulAuth,
                    plus the auth-related Prometheus metrics
                    (auth_failures, rate_limit_rejects, rate_limited_ips).

  server/sysinfo/ — SystemMemoryBytes, AutoMemoryLimit, ParseMemoryBytes,
                    ValidateMemoryLimit. The /proc/meminfo reader and the
                    DuckDB-style "4GB"/"512MB" string parser/validator.

Both new packages have zero dependency on github.com/duckdb/duckdb-go
(verified with `go list -deps`), so the eventual control-plane-only
binary will be able to use them without linking libduckdb.

Backward compatibility is preserved via type aliases and re-export `var`s
in server/auth_aliases.go and server/sysinfo_aliases.go. Existing
references to server.RateLimiter, server.NewRateLimiter,
server.ValidateUserPassword, server.ParseMemoryBytes, etc. continue to
compile without touching their call sites. New code should import the
subpackages directly.

Three Prometheus metrics moved with the rate-limit code:
  - duckgres_auth_failures_total
  - duckgres_rate_limit_rejects_total
  - duckgres_rate_limited_ips
The two used outside the auth package itself (RateLimitRejectsCounter,
AuthFailuresCounter) are exported; rateLimitedIPsGauge stays private.

Verified:
  - go build ./... clean
  - go build -tags kubernetes ./... clean
  - go test -short ./server/auth/... ./server/sysinfo/... ./server/...
    ./controlplane/... — all green (the pre-existing testcontainer
    Postgres failures in controlplane/admin/ are unrelated; they
    require Docker)
  - go list -deps ./server/auth | grep duckdb-go is empty
  - go list -deps ./server/sysinfo | grep duckdb-go is empty

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Eric Duong <eric@posthog.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Benjamin Knofe-Vider <benben@users.noreply.github.com>
Co-authored-by: Bill Guowei Yang <bill@posthog.com>
fuziontech added a commit that referenced this pull request May 1, 2026
…ges (#483)


fuziontech added a commit that referenced this pull request May 1, 2026

* refactor(arrowmap): split AppendValue into duckdb-free core + driver hooks

Step 2 of the binary-split plan. Moves AppendValue out of duckdbservice
into arrowmap so callers that don't need duckdb-go's driver value types
(duckdb.Interval, Decimal, UUID, OrderedMap, Map) can use it without
pulling libduckdb into their import graph.

The split uses a registration hook rather than a build tag:

  - arrowmap.AppendValue handles all arrow-native and Go-native value
    types directly. It first consults any registered Appender hooks
    before falling back to the built-in switch.
  - duckdbservice/appender_init.go's init() registers a single hook
    that handles duckdb.Interval / Decimal / UUID / OrderedMap / Map.
  - When duckdbservice is linked into a binary (worker, standalone),
    the duckdb types get full coverage automatically. When it isn't
    (a future controlplane-only binary), the duckdb cases are dead
    code paths that wouldn't fire anyway because duckdb-go is the
    only thing producing those typed values.

OrderedMapValue moved from server/flight_executor.go into arrowmap so
the MAP branch in arrowmap.AppendValue can switch on it without
arrowmap depending on the server package. server.OrderedMapValue is
preserved as a type alias for backward compatibility with existing
call sites.

server/flightsqlingress no longer imports duckdbservice — it uses
arrowmap.AppendValue directly. The package still transitively pulls
duckdb-go via its `server` import; getting `server` itself duckdb-free
is the next chunk of work.

Verified:
  - go build ./... clean
  - go build -tags kubernetes ./... clean
  - go test -short ./duckdbservice/... ./server/flightsqlingress/...
    all green
  - go list -deps ./duckdbservice/arrowmap | grep duckdb-go is empty
    (arrowmap remains a pure leaf even with AppendValue + OrderedMapValue
    living in it)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(server): extract auth and sysinfo helpers into pure subpackages (#483)

* fix(controlplane): stamp warm idle workers with creating CP id to stop orphan churn (#469)

workerRecordFor used to clear OwnerCPInstanceID whenever state==Idle, so
every freshly-spawned warm worker landed in the runtime store with an
empty owner. That row matched ListOrphanedWorkers case (2)
(NULLIF(owner_cp_instance_id, '') IS NULL AND last_heartbeat_at <=
before) the moment it crossed the 30s orphan grace, because nothing
refreshes last_heartbeat_at for an idle row in the warm pool. The
janitor retired the worker, reconcileWarmCapacity replaced it, and the
loop ran continuously.

Persisting warm workers with the creating CP's instance id moves them to
case (1), which is bounded by the active CP's 5s heartbeat instead. If
the CP genuinely dies, the existing CP-expiry path still flips its rows
to expired and the orphan grace then applies as designed.

OrgID is still cleared on the idle transition.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache-proxy): forward origin status, body, and headers verbatim (#470)

* test(cache-proxy): cover origin status passthrough (RED)

Five new tests plus an update to the existing TestHandleProxyOriginError,
all asserting that the cache proxy must forward upstream status codes,
bodies, and response headers verbatim instead of collapsing every
non-2xx into a 502.

The user case that motivated this: S3 returns 400 with
<Code>ExpiredToken</Code> in an XML envelope. The proxy was rewriting
that to 502 with a Go-formatted error string, which (a) made DuckDB's
httpfs treat a terminal auth failure as transient and retry it
indefinitely, and (b) hid the real S3 error class from operators.

New cases:
- 5xx forwarded verbatim (replaces the old 502-asserting test)
- 400 forwarded verbatim, body + Content-Type + X-Amz-Request-Id preserved
- 404 forwarded verbatim
- 416 forwarded verbatim with Content-Range preserved
- error responses are NOT cached (the next request hits origin)
- pure network errors (no HTTP response) still get 502 (the only case
  where 502 is correct, since there's no upstream status to forward)

Tests fail with the current implementation; the next commit makes them
pass.

* fix(cache-proxy): forward origin status, body, and headers verbatim (GREEN)

The proxy was previously translating every non-2xx upstream response
into a 502 Bad Gateway with a Go-formatted error string. That:
  1. Hid the real S3 error class — DuckDB's httpfs treats 5xx as
     transient and retries, so a terminal 4xx (e.g. an ExpiredToken
     auth failure) was being retried indefinitely instead of failing
     fast and surfacing the real cause.
  2. Stripped the XML error envelope DuckLake parses, replacing it
     with a Go error string DuckLake doesn't understand.
  3. Dropped headers (Content-Type, X-Amz-Request-Id, Content-Range)
     that DuckDB and operators rely on.

This change introduces `originStatusError`, a typed error returned by
fetchOrigin whenever the upstream responds with status >= 400. The
caller in HandleProxy detects it via errors.As and forwards the
captured status code, body (up to 1 MiB), and headers (minus
hop-by-hop) back to the client unchanged.

Pure transport errors (DNS, connection refused, TLS, timeout) keep
returning 502 — there's no upstream status to forward in that case,
and 5xx is what httpfs's transient-retry policy is designed for.

Tests in the previous commit covered:
  - 5xx forwarded verbatim (replaces the old 502-asserting test)
  - 400 forwarded verbatim with XML body + Content-Type + amz headers
  - 404 forwarded verbatim
  - 416 forwarded verbatim with Content-Range preserved
  - error responses are NOT cached (the next request hits origin)
  - true network errors still get 502

* feat(controlplane): persist STS expiry and refresh credentials before they go stale (#471)

Closes the worker credential expiration bug that surfaced in mw-prod-us
(ExpiredToken errors after the 1h STS session-duration boundary).

Schema:
- worker_records gets s3_credentials_expires_at (nullable, indexed). It's
  set by the activator after a successful STS AssumeRole + ActivateTenant
  and consulted by the credential refresh scheduler to pick workers near
  expiry. NULL is treated as 'due now' so legacy / unstamped rows get
  refreshed eagerly.
- UpsertWorkerRecord's DoUpdates list includes the new column.

Configstore methods:
- ListWorkersDueForCredentialRefresh(ownerCPInstanceID, cutoff) returns
  workers we own in active org-bound states whose creds expire by the
  cutoff or have NULL expiry.
- MarkCredentialsRefreshed(workerID, cpInstanceID, expectedEpoch,
  newExpiresAt) is a conditional UPDATE that only stamps when ownership
  is still ours.
- BumpWorkerEpoch(workerID, cpInstanceID, expectedEpoch) atomically
  increments owner_epoch on a worker we already own (used before
  re-sending ActivateTenant for refresh — the worker's
  reuseExistingActivation guard requires payload.OwnerEpoch > current).
- Postgres-backed regression coverage for all three (owner-scoping,
  NULL-as-due, healthy/neutral/terminal exclusions, conditional update
  failure modes).

Worker side:
- Skip the in-process StartCredentialRefresh ticker in shared-warm
  (multi-tenant) mode. That ticker (a) ran on the session's pinned
  *sql.Conn and serialized behind user queries — a 1h+ query starved
  the refresh until creds had already expired — and (b) hit an else
  branch that swapped the org's STS-brokered config secret for a
  credential_chain one DuckDB can't satisfy on EKS Pod Identity. The
  control plane drives refreshes via re-activation now; the standalone
  single-tenant path keeps the ticker.

Activator (shared_worker_activator.go):
- TenantActivationPayload.S3CredentialsExpiresAt carries the STS
  Expiration through from buildDuckLakeConfigFromDuckling. Static-cred
  warehouses (config-store path) leave it nil.
- ActivateReservedWorker stamps the expiry on the worker_records row
  via runtimeStore.MarkCredentialsRefreshed after a successful
  activation. Best-effort: failure here doesn't fail activation.
- New RefreshCredentials method bumps owner_epoch atomically, sends
  ActivateTenant with the bumped epoch and freshly-brokered creds, and
  stamps the new expiry. Skips static-cred orgs.

Janitor scheduler:
- New refreshExpiringCredentials lambda (leader-only) lists workers
  whose creds expire within stsSessionDuration/2 (= 30 min today),
  resolves the in-memory ManagedWorker, and calls activator.RefreshCredentials.
- Persists state in worker_records, so a CP failover doesn't lose
  scheduling — the next leader queries the same column.
- A worker not in the leader's local pool (mid-takeover, mid-retire) is
  skipped; it'll be picked up on the next tick if it's still ours.

* feat(controlplane): protect active sessions across CP shutdowns (#472)

* test(controlplane): protect active sessions from CP shutdown (RED)

Failing tests for the two pieces of state-store + worker-pool behavior
that need to change so a CP rollout doesn't kill in-flight customer
queries — extends the worker 40761 incident analysis to a regression
suite.

- TestShutdownAll_SparesWorkersWithActiveSessions: a worker with
  activeSessions > 0 must be skipped by ShutdownAll. Today the chain
  marks-draining and pod-deletes every owned worker, killing in-flight
  queries when the CP receives SIGTERM.
- TestListOrphanedWorkersExcludesWorkersWithActiveFlightSessions: a
  worker with a Flight session record in active state must be spared
  by ListOrphanedWorkers even if its owning CP has expired. Without
  this, peer CPs' janitors retire workers the customer can still
  reclaim by reconnect.
- TestListOrphanedWorkersIncludesWorkersWithReconnectingFlightSessions:
  same protection extends to records in reconnecting state — that's
  exactly when the customer is in the middle of picking the session
  back up.
- TestListOrphanedWorkersIncludesWorkersWithExpiredFlightSessions:
  once the Flight session record is terminal (expired/closed) the
  customer can't reclaim, so the worker should be retired normally.
  Bounds the protection so a stuck row can't pin a worker forever.

Tests fail with the current implementation; the next commit makes them
pass.

* feat(controlplane): protect active sessions across CP shutdowns (GREEN)

Three interlocking changes that keep an in-flight customer query from
being killed by a CP rollout. Together they remove the timing race
behind the worker-40761 incident: a query running at the moment the
old CP gives up (whether at 15min, 8h, or any other timeout) survives.

Layer 1: CP doesn't exit while sessions are active.
- HandoverDrainTimeout default flips from 15m to 0 in remote mode
  (= unbounded; k8s terminationGracePeriodSeconds is the only wall).
  The previous self-imposed wall was the proximate cause of pod-deletes
  during long-tail drains. Process mode keeps 24h since there's no k8s
  safety net.
- waitForDrain treats timeout==0 as unbounded via context.Background().
- Janitor's ExpireDrainingControlPlaneInstances is now disabled in
  remote mode (it gates on j.maxDrainTimeout > 0). A draining CP that
  is still heartbeating stays "alive" to its peers; only stale
  heartbeat — handled by the existing ExpireControlPlaneInstances —
  marks a CP dead. This prevents peer CPs from forcibly expiring a
  draining CP and orphan-retiring its workers.

Layer 2: ShutdownAll skips workers with active sessions.
- Defense in depth for the case where the CP exits anyway (k8s
  SIGKILL after gracePeriod, or future code paths). A worker with
  activeSessions > 0 is left running in 'hot' state owned by the
  dying CP. The pod survives, the Flight client can reconnect by
  session token, and a peer CP can claim via TakeOverWorker.
- Workers preserved this way stay in the in-memory pool until process
  exit so any residual session bookkeeping during the shutdown window
  still finds them.

Layer 3: orphan janitor spares workers with reclaimable Flight sessions.
- ListOrphanedWorkers gains a NOT EXISTS filter against
  flight_session_records: a row with at least one session in active or
  reconnecting state is left alone. Bounds the protection — once
  ExpireFlightSessionRecords moves the session to expired/closed, the
  worker is retired normally on the next sweep.

Behavior with all three layers:
- Customer mid-query during CP roll → old CP's drainAndShutdown waits
  for sessions → if customer finishes, clean exit; if k8s SIGKILL
  fires, worker survives via Layer 2 + Layer 3.
- Customer Flight client reconnects within session TTL → peer CP
  claims via TakeOverWorker, query resumes.
- Customer never reconnects → flight_session_records expires (TTL
  default 1h) → next orphan sweep retires the worker normally.
- Pgwire customer connected to dying CP → connection dies when CP
  exits regardless. Worker preservation is moot for them but harmless;
  Layer 1 still extends the CP's life so most short queries finish.

* feat(server): split query-error logs by SQLSTATE class (user vs infra) (#473)

* feat(server): split query-error logs by SQLSTATE class (user vs infra)

The single "Query execution failed." Error log line was alerting-
hostile: a customer typo'ing a column name produced the same Error
that a worker crash did, drowning real infra failures in user-error
noise. Split into two distinguishable lines using the SQLSTATE we
already compute for the pgwire error response — no new string matching:

  Info  "Query execution failed."  for SQLSTATE classes
                                    {0A, 22, 23, 25, 28, 2B, 3D, 3F, 42, 44}
                                    plus 57014 (client cancellation)
  Error "Query execution errored." for everything else (08, 53, 54,
                                    57 except 57014, 58, XX, …)

Mechanically:
- New isUserQueryError(err) inspects the existing classifyErrorCode
  output and matches the class against a closed allow-list. Closed-
  list semantics so future SQLSTATEs we haven't catalogued err toward
  Error (the safe direction for alerting).
- 57014 query_canceled short-circuits to user (its parent class 57 is
  otherwise infra; client-pressed-Ctrl-C is user-attributable).
- DuckLake transaction conflict and metadata-connection-lost paths are
  unchanged — they keep their own Warn lines because retry tooling
  cares about those signals specifically.

classifyErrorCode default fallback flips from "42000" to "XX000":
unknown errors (no DuckDB prefix matched) are typically infra (gRPC,
network, internal panics), not syntax errors. The previous default
made every unknown error look like a user error in the alert path.
Two existing test cases that asserted the old fallback are updated
accordingly with comments explaining why.

New tests in TestIsUserQueryError pin every user-class branch plus
representative infra cases and edge cases (nil error, the 57014
short-circuit, and 40001 landing on the infra side after the early
return).

* feat(server): log query start/finish on the worker boundary with trace_id

Two new Info-level slog lines around the worker dispatch in
executeSelectQuery so an operator can correlate logs with traces and
filter to a specific worker:

  Info "Query started."    user=… query=… worker=N worker_pod=…
                              trace_id=<otel-hex>
  Info "Query finished."   user=… duration_ms=… rows=N
                              worker=N worker_pod=… trace_id=<otel-hex>
                              [error=… on failures]

The worker / worker_pod attributes match what "Query execution failed."
already emits — same shape so a search like worker=40761 surfaces the
full lifecycle on a single worker.

trace_id matches the OTEL trace ID exported for the same query (see
server/tracing.go's existing traceIDFromContext helper), so a Loki
query with trace_id=abc123 lines up directly with the trace view.

logQueryFinished stays Info even on error so the start/finish pair is
always balanced in the log stream — severity routing for failures is
done by logQueryError separately (Info for user errors, Error for
infra). Operators following one trace see started + finished + an
optional separate severity line.

Initial use sites are limited to executeSelectQuery (the SELECT path
that's the bulk of customer queries and the one that fired in the
worker-40761 incident). The non-returning path in executeSimpleQuery
and the prepared-statement path still emit only the existing
logQuery (structured query log) — they can be wired up incrementally
if the lifecycle lines prove useful in production.

* fix(controlplane): refresh STS credentials per CP, not only on the janitor leader (#474)

The credential-refresh scheduler added in #471 was wired into the janitor's
leader-only loop. Workers owned by non-leader CPs therefore never had their
S3 session tokens re-brokered, and any long-running query on those workers
hit ExpiredToken once the original 1h STS session lapsed.

Move the scheduler out of the janitor and into a per-CP background goroutine
spawned from SetupMultiTenant. Each CP refreshes only the workers it owns
(filtered by cpInstanceID in the SQL), regardless of leader status. The new
credentialRefreshScheduler type wraps the existing tick logic so it's
independently testable.

* Bypass transpiler for file COPY statements (#475)

* docs: clarify k8s worker reuse policy (#476)

* fix(server): skip DuckLake index ensure when all indexes already exist (#478)

Replaces the 9 sequential CREATE INDEX IF NOT EXISTS round-trips on
every fresh worker pod's first catalog attach with a single pg_indexes
existence check. CREATE INDEX IF NOT EXISTS is a no-op at the storage
layer once the index exists, but still costs a server round-trip; under
pgbouncer transaction pooling that round-trip can take 1-2s during burst
load (server-conn handover + TLS handshake); empirically the full
sequence took 16-24s in the prod posthog metastore. Collapsing to one
round-trip cuts the
post-attach window to ~150-300ms in the steady state, which matters
because the control plane's session-init context is bounded at 5s — the
slow path was overrunning the budget and surfacing as
"failed to detect ducklake catalog attachment" FATALs on every
warm-pool replenishment.

The slow path is unchanged and still self-heals if any index goes
missing. Indexes are now declared as a name+stmt slice so the
fast-path lookup names can't drift from the slow-path statements.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix timestamp aliases in Arrow schema mapping (#479)

* fix(duckdbservice): isolate session-cleanup contexts and discard poisoned conns (#481)

Three related changes to cleanupSessionState that together fix a worker
poisoning chain after cancelled queries.

When a query is cancelled mid-execution, DuckDB can leave the underlying
connection in an aborted/INTERRUPT'd state where every subsequent statement
returns "INTERRUPT Error: Interrupted!" until ROLLBACK runs. The cleanup
path called by DestroySession then runs SELECT + DROP statements against
this poisoned connection inside a single 5-second context. The SELECT
hits the aborted-state error or burns the whole budget; every following
DROP then fails instantly with context deadline exceeded. The connection
is returned to the *sql.DB pool dirty, and the next session that picks
it up fails its first metadata operation (e.g. "USE memory") under the
control plane's 5s session-init context. From the client this surfaces as
"failed to initialize session database metadata" or "failed to detect
ducklake catalog attachment" — same symptoms as the warm-pool churn case
fixed in #478, but a different root cause that the previous PR doesn't
address.

Three changes:
- ROLLBACK first to clear any aborted-transaction state before running
  the cleanup queries. Same pattern that initSearchPath already uses.
- Per-step contexts: 3s for the enumeration SELECT, 1s for each DROP,
  separately. A slow SELECT no longer eats the budget for all DROPs;
  one slow DROP no longer poisons the rest.
- cleanupSessionState now returns a clean/dirty signal. If cleanup
  didn't complete cleanly, DestroySession marks the conn bad via
  Conn.Raw returning driver.ErrBadConn, so database/sql discards it
  from the pool instead of handing it to the next session.

Reproduced the poisoning chain on the dev cluster: a clickbench scan
cancelled mid-run caused a downstream session to fail with
"failed to initialize session database metadata: switch to memory
catalog: flight execute update: deadline exceeded", followed by the
worker being marked unresponsive and reaped.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(duckdbservice): discard pooled conn after every session in cluster mode (#485)

* fix(duckdbservice): discard pooled conn after every session in cluster mode

In cluster mode every worker is bound to a single org via activateTenant,
so the security boundary aligns with the worker lifecycle and there's no
need to scrub per-conn state on session teardown to protect against
cross-org leakage.

The existing cleanup loop is two things:
- Mostly wasted work in cluster mode. A typical billing-style session
  (open / SELECT 1 / close) creates zero user temp objects, but the
  cleanup still enumerates duckdb_views() and issues ~46 DROP IF EXISTS
  no-ops against system views that live in main / information_schema /
  pg_catalog (the DROP statements target the temp schema, so they're
  lookups against an empty namespace).
- Incomplete. It handles temp tables and temp views but not temp macros,
  temp types, or temp sequences. Those still leak across pooled-conn
  reuse — verified on dev: a TEMP MACRO created in session A was visible
  in session B when the pool happened to hand back the same driver conn.

Cluster mode now skips the cleanup loop entirely and always marks the
conn bad via Conn.Raw → driver.ErrBadConn so database/sql evicts it from
the pool. The next session opens a fresh DuckDB connection — no temp
leakage of any kind.

Standalone mode keeps the existing cleanupSessionState path because
pooled conns can be reused across orgs and scrubbing is required. The
post-#481 conn-discard fallback is preserved when cleanup fails.

Empirically motivated: prod was reaping ~29 workers/hour because bursts
of session-end events (e.g. 16 billing queries arriving at once) ran
hundreds of cleanup round-trips concurrently on a single worker, blocking
the gRPC health-check long enough to fail 3× and trip the unresponsive-
worker reaper. The cleanup itself was correct (post-#481) but the
aggregate throughput was the bottleneck. Skipping the loop in cluster
mode collapses ~48 round-trips per session-end to 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(duckdbservice): wrap conn-pool eviction behind a named helper

driver.ErrBadConn is the standard idiom for evicting a *sql.Conn from
the *sql.DB pool, but the name is misleading at call sites — the conn
isn't actually broken, we just don't want it reused. Go acknowledges
the API gap (golang/go#40722) but hasn't fixed it.

Wrap the Conn.Raw + ErrBadConn dance behind a named helper
(evictConnFromPool) with a comment explaining why this is the standard
idiom despite the awkward name. Update both call sites in
DestroySession plus the log key (discarded -> evicted) to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(k8s): include hot_idle in findActiveOrgWorkerPodSince

The helper polls cp_runtime.worker_records for workers bound to a given
org. The state filter previously excluded hot_idle, on the assumption
that the test would always catch a worker mid-session. With faster
session teardown (cluster-mode conn eviction), workers can transition
hot -> hot_idle before the polling loop runs, leaving the test unable
to find the worker that just served the org.

hot_idle workers are still bound to the org — org_id stays set on the
runtime record while the worker is parked for fast re-claim by the same
org — so they're the correct answer to 'which worker handled this org
just now?' Adding hot_idle to the state list makes the helper robust to
session-teardown speed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(duckdbservice): extract DuckDB-free arrow helpers into arrowmap subpackage

Step 1 of a 4-step plan to split duckgres into separate control-plane and
worker binaries so the control plane can stop linking libduckdb (and the
worker image can ship per-DuckDB-version variants).

This change moves the DuckDB-free helpers (DuckDBTypeToArrow,
QualifyTableName, QuoteIdent, SupportsLimit, and the struct/map/decimal
type-string parsers) out of duckdbservice/arrow_helpers.go into a new
duckdbservice/arrowmap package. arrowmap has zero dependency on
github.com/duckdb/duckdb-go; it's plain string-to-arrow.DataType mapping.

The duckdbservice package keeps the helpers callable as before via thin
re-export shims so existing call sites are unaffected. server/flightsqlingress
is updated to import arrowmap directly for the three call sites that don't
need DuckDB types (the AppendValue call site stays on duckdbservice for now;
splitting AppendValue's duckdb-bound branches lands in a follow-up).

This PR is purely a refactor — no behavior changes, no new binary, no CI
changes. It establishes the package boundary and pattern that subsequent
PRs build on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Eric Duong <eric@posthog.com>
Co-authored-by: Benjamin Knofe-Vider <benben@users.noreply.github.com>
Co-authored-by: Bill Guowei Yang <bill@posthog.com>