fix(worker): rotate STS secret on a side conn, off the activation lock #516
Merged
Conversation
In mw-prod-us on 2026-05-04 we observed ~10 worker pods/hour killed with "K8s worker unresponsive, deleting pod", every kill ~10 s after a "Refreshing S3 credentials..." log. Workers had ~0.1 cores CPU, ~1 GiB of 360 GiB memory, no OOM. Pure deadlock.

Root cause:

- reuseExistingActivation held p.mu.Lock across server.RefreshS3Secret.
- RefreshS3Secret runs CREATE OR REPLACE SECRET on the activation *sql.DB, which has MaxOpenConns=1 — when a long-running client query was already on that connection, the Exec queued indefinitely.
- The gRPC health check needs p.mu.RLock to snapshot sessions for stall detection. With p.mu.Lock held by the stuck refresh, RLock starved.
- Three CP-side health-check timeouts at 3 s each = 9 s unresponsive, then the CP force-deletes the pod.

Two changes, one per layer:

1. Side-connection for control DDL. server.OpenDuckDBPair opens one *duckdb.Connector and returns Main + Control *sql.DBs sharing the same DuckDB Database (same SecretManager, same attached catalogs) but with independent Go connection pools. SessionPool now owns this pair via p.activePair / p.controlDB; CREATE OR REPLACE SECRET runs on Control, so credential rotation never queues behind a client query. The running query picks up the new token on its next S3 request — strictly better than today's "no rotation possible while a query holds the conn" behavior. (A sketch of the pair builder follows after the test notes below.)

2. Lock-released refresh. reuseExistingActivation now snapshots state under p.mu.Lock, releases the lock, runs RefreshS3Secret (now on Control), then re-acquires the lock to re-validate and commit the new payload. Health-check RLock acquisition is no longer gated on the secret rotation finishing.

The tests/perf build failure visible in `go build ./...` is unrelated and pre-exists this branch.

Tests:

- TestReuseExistingActivationDoesNotBlockHealthChecks: injects a slow refreshS3Secret stub via a new test field, asserts a concurrent p.mu.RLock acquirer returns within 100 ms while the refresh is in flight, and asserts the slow refresh ran on controlDB (not the main DB). Confirmed it FAILS on the pre-fix code path (a regression-probe revert showed RLock blocking) and PASSES with both changes applied.
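A minimal sketch of the pair builder's shape, as it stood before the follow-up comment below moved it to duckdbservice/; duckdb-go-v2's NewConnector signature is an assumption, and the config plumbing plus the nonClosingConnector wrapper (see the Summary further down) are omitted:

```go
package server

import (
	"database/sql"

	duckdb "github.com/duckdb/duckdb-go/v2"
)

// DuckDBPair is two independent Go connection pools over ONE DuckDB database
// instance: same SecretManager, same attached catalogs, separate conn queues.
type DuckDBPair struct {
	Main    *sql.DB // client queries
	Control *sql.DB // control DDL, e.g. CREATE OR REPLACE SECRET
}

// OpenDuckDBPair opens a single connector and wraps it in two pools, so
// control DDL never waits behind a client query holding Main's only conn.
func OpenDuckDBPair(dsn string) (*DuckDBPair, error) {
	connector, err := duckdb.NewConnector(dsn, nil)
	if err != nil {
		return nil, err
	}
	main := sql.OpenDB(connector)
	main.SetMaxOpenConns(1)
	control := sql.OpenDB(connector) // same connector, same DuckDB instance
	control.SetMaxOpenConns(1)
	return &DuckDBPair{Main: main, Control: control}, nil
}
```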
CI's "Verify cmd/duckgres-controlplane does not link libduckdb" check caught duckdb_pair.go pulling github.com/duckdb/duckdb-go/v2 into server/, which is in the CP's import graph. Move the pair builder to duckdbservice/ (worker-only package — the CP binary doesn't import it), exposing two helpers from server/ that don't themselves pull in duckdb-go-v2: - server.DuckDBDSN — returns the openBaseDB DSN. - server.ConfigureMainDB — applies threads / memory_limit / extensions / profiling on a *sql.DB. The duckdbservice pair builder calls those two plus the existing server.ConfigureDBConnection. Result: the CP binary stays libduckdb-free (verified locally with `go list -deps ./cmd/duckgres-controlplane | grep duckdb-go` empty), and the worker still gets the same shared-connector Main + Control pair. Also fix two errcheck lint failures: the deadlock-regression test's deferred *sql.DB Close calls now drop the return value explicitly.
This was referenced May 4, 2026

benben added a commit that referenced this pull request on May 4, 2026:
…518) isQueryCancelled matches any error string containing "context canceled", which conflates two distinct cases:

- The user/CP cancelled the request: pgwire CancelRequest, ctx deadline, client TCP close. c.ctx.Err() is non-nil.
- gRPC bubbled up a Canceled status because the *server* side closed — worker died, takeover, retire. c.ctx is still healthy.

The shared call sites in conn.go (and isUserQueryError's shortcut) treated both as user cancellations, so worker-kill failures were silently suppressed:

- Line 1627 short-circuited logQueryError, so "Query execution errored." never fired.
- isUserQueryError returned true for any cancellation, so even when logQueryError did run it logged "Query execution failed." at Info.

Effect on observability: alerting queries like count_over_time({namespace="duckgres"} |~ "Query execution errored." [5m]) showed zero events during the mw-prod-us deadlock incident on 2026-05-04, even though dozens of in-flight queries failed because the CP killed their workers. PR #516 fixes the deadlock; this fixes the visibility blind spot so the next class of infra cancellations actually pages.

Changes:

- Add (*clientConn).isCallerCancellation(err) — true only when the err is cancel-shaped AND c.ctx.Err() != nil (a sketch follows below).
- Replace isQueryCancelled in 17 clientConn-method error sites with the new method. classifyErrorCode (no clientConn) keeps isQueryCancelled since SQLSTATE 57014 is the right wire response either way.
- Drop isUserQueryError's "isQueryCancelled → return true" shortcut. The caller-cancellation case is now filtered upstream; any cancel-shaped err that reaches isUserQueryError is infra by definition, and class 57 routes it to Error level via the existing SQLSTATE table.

Tests:

- TestIsCallerCancellation: gRPC Canceled with a healthy ctx is NOT a caller cancel; user Ctrl-C (cancelled ctx) is.
- TestLogQueryErrorRoutesInfraCancelToErrorLevel: captures slog output, asserts a worker-death-shaped error logs "Query execution errored." at Error.
- TestIsUserQueryError "client cancellation" case updated to the new semantics: any cancel reaching isUserQueryError is infra → false.
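A minimal sketch of the new predicate, assuming isQueryCancelled keeps its cancel-shaped string matching:

```go
// isCallerCancellation reports whether err is a cancellation that the caller
// (user or CP) actually initiated. A cancel-shaped error while c.ctx is still
// healthy means the *server* side went away: infra, not a user cancel.
func (c *clientConn) isCallerCancellation(err error) bool {
	return isQueryCancelled(err) && c.ctx.Err() != nil
}
```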
benben added a commit that referenced this pull request on May 5, 2026:
This is the actual root cause of the upstream-S3 501 NotImplemented errors
on parquet writes that we've been chasing through three previous PRs.
forwardUncached built its outbound request via http.NewRequestWithContext
with the inbound r.Body as the body. Per Go's docs:
When body is of type *bytes.Buffer, *bytes.Reader, or *strings.Reader,
the returned request's ContentLength is set [...]. For other types,
the default is left as 0; the body is then sent using chunked
transfer encoding.
r.Body is a generic io.ReadCloser, so the outbound req had ContentLength=0
and Go's Transport fell back to Transfer-Encoding: chunked. AWS S3 returns
501 NotImplemented for chunked PUT/POST regardless of whether the client
intended chunked.
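The fallback is easy to reproduce outside the proxy; a self-contained demonstration (stdlib only, not project code):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

func main() {
	origin := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// A chunked request reaches the server with ContentLength == -1.
		fmt.Printf("origin saw ContentLength=%d TransferEncoding=%v\n",
			r.ContentLength, r.TransferEncoding)
	}))
	defer origin.Close()

	// Wrapping the reader hides its concrete type, like the proxy's inbound
	// r.Body: ContentLength stays 0 ("unknown") and the Transport goes chunked.
	body := io.NopCloser(strings.NewReader("0123456789"))
	req, err := http.NewRequest(http.MethodPut, origin.URL, body)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close() // origin saw ContentLength=-1 TransferEncoding=[chunked]

	// Mirroring the inbound length, as the fix does, preserves the wire shape.
	req2, _ := http.NewRequest(http.MethodPut, origin.URL,
		io.NopCloser(strings.NewReader("0123456789")))
	req2.ContentLength = 10
	resp2, err := http.DefaultClient.Do(req2)
	if err != nil {
		panic(err)
	}
	resp2.Body.Close() // origin saw ContentLength=10 TransferEncoding=[]
}
```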
So even though DuckDB sent a perfectly clean Content-Length-bearing PUT,
the proxy was rewriting it as chunked and S3 rejected it. The chain of
prior PRs (#516 deadlock fix, #518 / #519 logging, #524 / #525 / #526
visibility) gave us the breadcrumbs to actually see this; the fix itself
is one line: mirror ContentLength + TransferEncoding + Trailer from the
inbound request so the wire shape is preserved.
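In context the mirroring looks roughly like the following sketch; outReq and targetURL are stand-in names, since forwardUncached's body isn't shown here:

```go
// The outbound request reuses the inbound body. Because r.Body is a generic
// io.ReadCloser, NewRequestWithContext leaves ContentLength at 0, which the
// Transport treats as unknown and sends with Transfer-Encoding: chunked.
outReq, err := http.NewRequestWithContext(r.Context(), r.Method, targetURL, r.Body)
if err != nil {
	return err
}
// The fix: mirror the inbound wire shape so a Content-Length-bearing PUT
// leaves the proxy as a Content-Length-bearing PUT (S3 501s on chunked).
outReq.ContentLength = r.ContentLength
outReq.TransferEncoding = r.TransferEncoding
outReq.Trailer = r.Trailer
```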
Verified the fix is signature-safe, so the chunked / 501 chain can't recur: AWS
Sigv4 signs `host;x-amz-content-sha256;x-amz-date` (per
duckdb-httpfs/src/s3fs.cpp:84), so neither Content-Length nor
Transfer-Encoding is in the signed headers list — meaning we're free to
set Content-Length without invalidating the signature.
Tests cover the proxy-as-transparent-forwarder invariant for the
non-cached path:
- TestForwardUncachedPropagatesContentLength: regression — outbound
ContentLength matches inbound, no Transfer-Encoding: chunked. Verified
this fails on the pre-fix code (origin sees ContentLength=-1 and
Transfer-Encoding: chunked).
- TestForwardUncachedPreservesRequestHeaders: Authorization, x-amz-*,
custom headers round-trip to origin verbatim.
- TestForwardUncachedStripsHopByHopBothDirections: Connection /
Keep-Alive stripped per RFC 7230 in both request and response, while
non-hop-by-hop headers pass through.
- TestForwardUncachedPreservesQueryString: AWS multipart-upload params
(?uploads, ?partNumber, ?uploadId=...) round-trip exactly so Sigv4's
canonical-request hash is preserved.
- TestForwardUncachedPreservesResponseBodyBytewise / Non2xx: response
body bytes (binary, XML envelopes) are forwarded byte-for-byte. Locks
in that the log_preview capture on non-2xx doesn't corrupt the body.
- TestForwardUncachedPreservesMethod: PUT/POST/DELETE/HEAD all reach
the origin unchanged.
Summary

- Credential rotation now runs on a dedicated control connection, so CREATE OR REPLACE SECRET never queues behind a long-running client query.
- The refresh no longer runs under p.mu.Lock in reuseExistingActivation, so a slow refresh can't starve concurrent p.mu.RLock acquirers (the gRPC health-check goroutine snapshotting sessions for stall detection).

Why
mw-prod-us 2026-05-04: ~10 worker pods/hour killed with "K8s worker unresponsive, deleting pod", every kill ~10 s after a "Refreshing S3 credentials..." log. Workers were idle on CPU/memory — pure deadlock.

Root cause:

- ActivateTenant (controlplane/multitenant.go, now per-CP after #474) calls reuseExistingActivation, which takes p.mu.Lock (duckdbservice/activation.go).
- RefreshS3Secret → db.Exec("CREATE OR REPLACE SECRET ...") on the activation DB (server/server.go).
- That DB has MaxOpenConns=1; an active client query holds the conn → db.Exec queues indefinitely.
- The health check needs p.mu.RLock → starved behind p.mu.Lock (duckdbservice/flight_handler.go:212).
- Repeated CP-side health-check timeouts → the CP force-deletes the pod (controlplane/worker_mgr.go).
Fix

Side-connection for control DDL

server/duckdb_pair.go introduces server.OpenDuckDBPair:

- one *duckdb.Connector (one DuckDB Database);
- two *sql.DBs built via sql.OpenDB, each MaxOpenConns(1);
- a nonClosingConnector so *sql.DB.Close doesn't kill the underlying instance (sketched below).

SessionPool now stores a *DuckDBPair; controlDB is exposed for credential rotation. Both DBs see the same SecretManager and attached catalogs. The running query picks up the new STS token on its next S3 request — strictly better than today's "no rotation possible while a query holds the conn" behavior.
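A minimal sketch of the connector wrapper; the name follows the PR text, and it relies on database/sql only closing a connector that implements io.Closer:

```go
import "database/sql/driver"

// nonClosingConnector wraps the shared connector so that closing one of the
// two *sql.DB pools doesn't tear down the DuckDB instance both pools share:
// its no-op Close satisfies io.Closer without touching the Database.
type nonClosingConnector struct {
	driver.Connector
}

func (nonClosingConnector) Close() error { return nil }
```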
Lock-released refresh

reuseExistingActivation is now three phases (sketched below): snapshot state under p.mu.Lock; release the lock and run RefreshS3Secret on controlDB; re-acquire the lock to re-validate and commit.
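A minimal sketch of the three-phase shape (the real logic lives in reuseExistingActivation; the snapshot, re-validation, and commit helpers here are hypothetical stand-ins):

```go
func (p *SessionPool) refreshOffLock(ctx context.Context) error {
	// Phase 1: snapshot under the write lock, then let it go.
	p.mu.Lock()
	snap := p.currentState() // hypothetical accessor
	p.mu.Unlock()

	// Phase 2: the slow part runs lock-free on the control pool, so the
	// health check's p.mu.RLock is never gated on this Exec.
	payload, err := p.refreshS3Secret(ctx, p.controlDB, snap)
	if err != nil {
		return err
	}

	// Phase 3: re-acquire, re-validate that the activation didn't change
	// while unlocked, and commit the new payload.
	p.mu.Lock()
	defer p.mu.Unlock()
	if !p.stillCurrent(snap) { // hypothetical re-validation
		return errActivationChanged
	}
	p.commitSecret(payload) // hypothetical commit
	return nil
}
```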
Health-check RLock acquisition no longer waits for the secret to commit.

Test plan

- TestReuseExistingActivationDoesNotBlockHealthChecks — injects a blocking refreshS3Secret stub, asserts a concurrent p.mu.RLock acquirer returns within 100 ms during the slow refresh, and asserts the rotation runs on controlDB (not Main).
- ./duckdbservice/, ./server/, ./controlplane/ tests pass with -tags kubernetes.
- "Refreshing S3 credentials..." rotations complete cleanly while a long query is in flight.
- "K8s worker unresponsive" kill rate drops to zero on managed-warehouse workers.
TestReuseExistingActivationDoesNotBlockHealthChecks— injects a blockingrefreshS3Secretstub, asserts a concurrentp.mu.RLockacquirer returns within 100 ms during the slow refresh, and asserts the rotation runs oncontrolDB(not Main)../duckdbservice/,./server/,./controlplane/tests pass with-tags kubernetes.Refreshing S3 credentials...complete cleanly while a long query is in flight.K8s worker unresponsiverate drops to zero on managed-warehouse workers.