
fix(worker): rotate STS secret on a side conn, off the activation lock #516

Merged
benben merged 2 commits into main from ben/worker-deadlock-control-db
May 4, 2026

Conversation

@benben benben commented May 4, 2026

Summary

  • Rotate the DuckDB S3 secret on a side connection that shares the same DuckDB instance as the main client-query DB but has its own Go connection pool — so CREATE OR REPLACE SECRET never queues behind a long-running client query.
  • Run the rotation outside p.mu.Lock in reuseExistingActivation, so a slow refresh can't starve concurrent p.mu.RLock acquirers (the gRPC health-check goroutine snapshotting sessions for stall detection).

Why

mw-prod-us, 2026-05-04: ~10 worker pods/hour killed with "K8s worker unresponsive, deleting pod", every kill ~10 s after a "Refreshing S3 credentials..." log. Workers were idle on CPU and memory; this was a pure deadlock.

Root cause:

| Step | What | Where |
| --- | --- | --- |
| 1 | Credential refresh sends ActivateTenant | controlplane/multitenant.go (now per-CP after #474) |
| 2 | Worker enters reuseExistingActivation, takes p.mu.Lock | duckdbservice/activation.go |
| 3 | Calls RefreshS3Secret → db.Exec("CREATE OR REPLACE SECRET ...") on the activation DB | server/server.go |
| 4 | Activation DB has MaxOpenConns=1; an active client query holds the conn → db.Exec queues indefinitely | |
| 5 | gRPC health check needs p.mu.RLock → starved behind p.mu.Lock | duckdbservice/flight_handler.go:212 |
| 6 | 3 × 3 s health-check timeouts = 9 s of unresponsive → CP force-deletes the pod | controlplane/worker_mgr.go |

Fix

Side-connection for control DDL

server/duckdb_pair.go introduces server.OpenDuckDBPair:

  • One *duckdb.Connector (one DuckDB Database)
  • Two *sql.DBs built via sql.OpenDB, each with MaxOpenConns(1)
  • The connector is wrapped in nonClosingConnector so *sql.DB.Close doesn't kill the underlying instance

SessionPool now stores a *DuckDBPair; controlDB is exposed for credential rotation. Both DBs see the same SecretManager and attached catalogs. The running query picks up the new STS token on its next S3 request — strictly better than today's "no rotation possible while a query holds the conn" behavior.
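
Roughly, the pair construction looks like the sketch below. This is a minimal approximation, not the real server/duckdb_pair.go: it assumes duckdb.NewConnector from duckdb-go-v2 with a nil connection-init hook, omits the threads/memory/extension configuration applied to the main DB, and (per the second commit) the builder ultimately lives in duckdbservice/ so the control-plane binary stays libduckdb-free.

```go
package server

import (
	"database/sql"
	"database/sql/driver"

	duckdb "github.com/duckdb/duckdb-go/v2"
)

// nonClosingConnector shadows the connector's Close so that closing either
// *sql.DB does not tear down the shared DuckDB instance.
type nonClosingConnector struct{ driver.Connector }

func (nonClosingConnector) Close() error { return nil }

// DuckDBPair is two Go connection pools over one DuckDB Database:
// Main serves client queries, Control serves short control DDL such as
// CREATE OR REPLACE SECRET.
type DuckDBPair struct {
	Main    *sql.DB
	Control *sql.DB
}

// OpenDuckDBPair opens a single connector and builds both pools on top of it.
func OpenDuckDBPair(dsn string) (*DuckDBPair, error) {
	connector, err := duckdb.NewConnector(dsn, nil)
	if err != nil {
		return nil, err
	}
	wrapped := nonClosingConnector{connector}

	main := sql.OpenDB(wrapped)
	main.SetMaxOpenConns(1)

	control := sql.OpenDB(wrapped)
	control.SetMaxOpenConns(1)

	return &DuckDBPair{Main: main, Control: control}, nil
}
```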

Lock-released refresh

reuseExistingActivation is now three phases:

  1. Locked: inspect activation, decide if refresh is needed, capture refresh fn / DB / sem.
  2. Unlocked: run RefreshS3Secret on controlDB.
  3. Locked: re-validate (activation may have changed) and commit the new payload.

Health-check RLock acquisition no longer waits for the secret to commit.
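
A rough sketch of that phase split, with stand-in types and field names (the real structures live in duckdbservice/activation.go and carry more state than shown here):

```go
package duckdbservice

import (
	"context"
	"database/sql"
	"sync"
	"time"
)

// activation is a stand-in for the real activation record.
type activation struct {
	generation int64
	expiresAt  time.Time
	payload    string // e.g. the STS credentials behind the DuckDB secret
}

type SessionPool struct {
	mu          sync.RWMutex
	activations map[string]*activation
	controlDB   *sql.DB
	// refreshS3Secret is a field so tests can inject a blocking stub.
	refreshS3Secret func(ctx context.Context, db *sql.DB) (string, error)
}

func (p *SessionPool) reuseExistingActivation(ctx context.Context, tenantID string) error {
	// Phase 1 (locked): inspect the activation, decide whether a refresh is
	// needed, and capture everything the slow work requires.
	p.mu.Lock()
	act, ok := p.activations[tenantID]
	if !ok || time.Until(act.expiresAt) > 5*time.Minute {
		p.mu.Unlock()
		return nil // nothing to refresh
	}
	refresh, controlDB, gen := p.refreshS3Secret, p.controlDB, act.generation
	p.mu.Unlock()

	// Phase 2 (unlocked): CREATE OR REPLACE SECRET runs on the control DB.
	// Health-check goroutines can take p.mu.RLock freely during this call.
	payload, err := refresh(ctx, controlDB)
	if err != nil {
		return err
	}

	// Phase 3 (locked): re-validate and commit; the activation may have been
	// replaced while the lock was released, in which case the stale payload
	// is dropped.
	p.mu.Lock()
	defer p.mu.Unlock()
	if cur, ok := p.activations[tenantID]; ok && cur.generation == gen {
		cur.payload = payload
	}
	return nil
}
```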

Test plan

  • TestReuseExistingActivationDoesNotBlockHealthChecks (sketched after this list) — injects a blocking refreshS3Secret stub, asserts a concurrent p.mu.RLock acquirer returns within 100 ms during the slow refresh, and asserts the rotation runs on controlDB (not Main).
  • Confirmed the test FAILS on the pre-fix code path (locally simulated by reverting the unlock — RLock blocked, test fired its deadlock-regression message).
  • All existing ./duckdbservice/, ./server/, and ./controlplane/ tests pass with -tags kubernetes.
  • Deploy to mw-dev, watch Refreshing S3 credentials... complete cleanly while a long query is in flight.
  • Deploy to mw-prod-us, confirm K8s worker unresponsive rate drops to zero on managed-warehouse workers.
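
A compressed sketch of the test's shape, reusing the stand-in SessionPool from the sketch above; the real test also asserts which DB the rotation ran on, which this sketch omits.

```go
package duckdbservice

import (
	"context"
	"database/sql"
	"testing"
	"time"
)

func TestReuseExistingActivationDoesNotBlockHealthChecks(t *testing.T) {
	// Seed one activation that is already due for refresh (stand-in types
	// from the sketch above).
	p := &SessionPool{
		activations: map[string]*activation{"tenant-a": {expiresAt: time.Now()}},
	}

	// Blocking stub: simulates CREATE OR REPLACE SECRET queued behind a
	// long-running client query.
	started, release := make(chan struct{}), make(chan struct{})
	p.refreshS3Secret = func(ctx context.Context, db *sql.DB) (string, error) {
		close(started)
		<-release
		return "rotated", nil
	}
	defer close(release)

	go func() { _ = p.reuseExistingActivation(context.Background(), "tenant-a") }()
	<-started // the slow refresh is now in flight

	// The health check only needs p.mu.RLock; it must not wait for the
	// rotation to commit.
	acquired := make(chan struct{})
	go func() {
		p.mu.RLock()
		p.mu.RUnlock()
		close(acquired)
	}()

	select {
	case <-acquired:
		// healthy: RLock was not gated on the in-flight secret rotation
	case <-time.After(100 * time.Millisecond):
		t.Fatal("p.mu.RLock blocked behind the S3 secret refresh (deadlock regression)")
	}
}
```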

benben added 2 commits May 4, 2026 12:21
In mw-prod-us 2026-05-04 we observed ~10 worker pods/hour killed with
"K8s worker unresponsive, deleting pod" — every kill ~10 s after a
"Refreshing S3 credentials..." log. Workers had ~0.1 cores CPU, ~1 GiB
of 360 GiB memory, no OOM. Pure deadlock.

Root cause:

- reuseExistingActivation held p.mu.Lock across server.RefreshS3Secret.
- RefreshS3Secret runs CREATE OR REPLACE SECRET on the activation *sql.DB,
  which has MaxOpenConns=1 — when a long-running client query was already
  on that connection, the Exec queued indefinitely.
- The gRPC health check needs p.mu.RLock to snapshot sessions for stall
  detection. With p.mu.Lock held by the stuck refresh, RLock starved.
- Three CP-side health-check timeouts at 3 s each = 9 s of unresponsive,
  then the CP force-deletes the pod.

Two changes pinning down both layers:

1. Side-connection for control DDL. server.OpenDuckDBPair opens one
   *duckdb.Connector and returns Main + Control *sql.DBs sharing the same
   DuckDB Database (same SecretManager, same attached catalogs) but with
   independent Go connection pools. SessionPool now owns this pair via
   p.activePair / p.controlDB; CREATE OR REPLACE SECRET runs on Control,
   so credential rotation never queues behind a client query. The
   running query picks up the new token on its next S3 request — strictly
   better than today's "no rotation possible while a query holds the
   conn" behavior.

2. Lock-released refresh. reuseExistingActivation now snapshots state
   under p.mu.Lock, releases the lock, runs RefreshS3Secret (now on
   Control), then re-acquires to re-validate and commit the new payload.
   Health-check RLock acquisition is no longer gated on the secret
   rotation finishing.

The tests/perf build failure visible in `go build ./...` is unrelated
and pre-exists this branch.

Tests:

- TestReuseExistingActivationDoesNotBlockHealthChecks: injects a slow
  refreshS3Secret stub via a new test field, asserts a concurrent
  p.mu.RLock acquirer returns within 100 ms while refresh is in flight,
  and asserts the slow refresh ran on controlDB (not the main DB).
  Confirmed it FAILS on the pre-fix code path (regression-probe revert
  showed RLock blocked) and PASSES with both changes applied.
CI's "Verify cmd/duckgres-controlplane does not link libduckdb" check
caught duckdb_pair.go pulling github.com/duckdb/duckdb-go/v2 into
server/, which is in the CP's import graph.

Move the pair builder to duckdbservice/ (worker-only package — the CP
binary doesn't import it), exposing two helpers from server/ that don't
themselves pull in duckdb-go-v2:

- server.DuckDBDSN — returns the openBaseDB DSN.
- server.ConfigureMainDB — applies threads / memory_limit / extensions /
  profiling on a *sql.DB.

The duckdbservice pair builder calls those two plus the existing
server.ConfigureDBConnection. Result: the CP binary stays libduckdb-free
(verified locally with `go list -deps ./cmd/duckgres-controlplane |
grep duckdb-go` empty), and the worker still gets the same
shared-connector Main + Control pair.

Also fix two errcheck lint failures: the deadlock-regression test's
deferred *sql.DB Close calls now drop the return value explicitly.
@benben benben requested a review from a team May 4, 2026 11:20
@benben benben merged commit f1eb12d into main May 4, 2026
22 checks passed
@benben benben deleted the ben/worker-deadlock-control-db branch May 4, 2026 11:20
benben added a commit that referenced this pull request May 4, 2026
…518)

isQueryCancelled matches any error string containing "context canceled",
which conflates two distinct cases:

- The user/CP cancelled the request: pgwire CancelRequest, ctx deadline,
  client TCP close. c.ctx.Err() is non-nil.
- gRPC bubbled up a Canceled status because the *server* side closed —
  worker died, takeover, retire. c.ctx is still healthy.

The shared call sites in conn.go (and isUserQueryError's shortcut)
treated both as user cancellations, so worker-kill failures were silently
suppressed:

- Line 1627 short-circuited logQueryError, so "Query execution errored."
  never fired.
- isUserQueryError returned true for any cancellation, so even when
  logQueryError did run it logged at Info "Query execution failed.".

Effect on observability: alerting queries like
  count_over_time({namespace="duckgres"} |~ "Query execution errored." [5m])
showed zero events during the mw-prod-us deadlock incident on 2026-05-04
even though dozens of in-flight queries failed because the CP killed
their workers. PR #516 fixes the deadlock; this fixes the visibility
blind spot so the next class of infra cancellations actually pages.

Changes:

- Add (*clientConn).isCallerCancellation(err) — true only when the err
  is cancel-shaped AND c.ctx.Err() != nil.
- Replace isQueryCancelled in 17 clientConn-method error sites with the
  new method. classifyErrorCode (no clientConn) keeps isQueryCancelled
  since SQLSTATE 57014 is the right wire response either way.
- Drop isUserQueryError's "isQueryCancelled → return true" shortcut. The
  caller-cancellation case is now filtered upstream; any cancel-shaped
  err that reaches isUserQueryError is infra by definition and class 57
  routes it to Error level via the existing SQLSTATE table.
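
A rough sketch of the new predicate (the cancel-shape check reuses the
existing isQueryCancelled helper; the body is illustrative, not verbatim):

    // isCallerCancellation reports whether err is a cancellation that the
    // caller actually asked for. A gRPC Canceled that bubbles up while
    // c.ctx is still healthy is infra-side and must not be suppressed.
    func (c *clientConn) isCallerCancellation(err error) bool {
        return isQueryCancelled(err) && c.ctx.Err() != nil
    }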

Tests:

- TestIsCallerCancellation: gRPC Canceled with healthy ctx is NOT a
  caller cancel; user Ctrl-C (cancelled ctx) is.
- TestLogQueryErrorRoutesInfraCancelToErrorLevel: captures slog output,
  asserts a worker-death-shaped error logs "Query execution errored."
  at Error.
- TestIsUserQueryError "client cancellation" case updated to reflect the
  new semantics: any cancel reaching isUserQueryError is infra → false.
benben added a commit that referenced this pull request May 5, 2026
This is the actual root cause of the upstream-S3 501 NotImplemented errors
on parquet writes that we've been chasing through three previous PRs.

forwardUncached built its outbound request via http.NewRequestWithContext
with the inbound r.Body as the body. Per Go's docs:

    When body is of type *bytes.Buffer, *bytes.Reader, or *strings.Reader,
    the returned request's ContentLength is set [...]. For other types,
    the default is left as 0; the body is then sent using chunked
    transfer encoding.

r.Body is a generic io.ReadCloser, so the outbound req had ContentLength=0
and Go's Transport fell back to Transfer-Encoding: chunked. AWS S3 returns
501 NotImplemented for chunked PUT/POST regardless of whether the client
intended chunked.

So even though DuckDB sent a perfectly clean Content-Length-bearing PUT,
the proxy was rewriting it as chunked and S3 rejected it. The chain of
prior PRs (#516 deadlock fix, #518 / #519 logging, #524 / #525 / #526
visibility) gave us the breadcrumbs to actually see this; the fix itself
is one line: mirror ContentLength + TransferEncoding + Trailer from the
inbound request so the wire shape is preserved.
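
Sketch of the fix's shape (surrounding proxy variables are illustrative):

    out, err := http.NewRequestWithContext(ctx, r.Method, targetURL, r.Body)
    if err != nil {
        return err
    }
    // Mirror the inbound wire shape so Go's Transport doesn't rewrite the
    // body as Transfer-Encoding: chunked, which S3 rejects with 501.
    out.ContentLength = r.ContentLength
    out.TransferEncoding = r.TransferEncoding
    out.Trailer = r.Trailer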

Verified the chunked / 501 chain is impossible to recreate now: AWS
Sigv4 signs `host;x-amz-content-sha256;x-amz-date` (per
duckdb-httpfs/src/s3fs.cpp:84), so neither Content-Length nor
Transfer-Encoding is in the signed headers list — meaning we're free to
set Content-Length without invalidating the signature.

Tests cover the proxy-as-transparent-forwarder invariant for the
non-cached path:

- TestForwardUncachedPropagatesContentLength: regression — outbound
  ContentLength matches inbound, no Transfer-Encoding: chunked. Verified
  this fails on the pre-fix code (origin sees ContentLength=-1 and
  Transfer-Encoding: chunked).
- TestForwardUncachedPreservesRequestHeaders: Authorization, x-amz-*,
  custom headers round-trip to origin verbatim.
- TestForwardUncachedStripsHopByHopBothDirections: Connection /
  Keep-Alive stripped per RFC 7230 in both request and response, while
  non-hop-by-hop headers pass through.
- TestForwardUncachedPreservesQueryString: AWS multipart-upload params
  (?uploads, ?partNumber, ?uploadId=...) round-trip exactly so Sigv4's
  canonical-request hash is preserved.
- TestForwardUncachedPreservesResponseBodyBytewise / Non2xx: response
  body bytes (binary, XML envelopes) are forwarded byte-for-byte. Locks
  in that the log_preview capture on non-2xx doesn't corrupt the body.
- TestForwardUncachedPreservesMethod: PUT/POST/DELETE/HEAD all reach
  the origin unchanged.