
fix(worker): rotate STS secret on a side conn, off the activation lock #516

Merged
benben merged 2 commits into main from ben/worker-deadlock-control-db
May 4, 2026

Conversation

@benben benben commented May 4, 2026

Summary

  • Rotate the DuckDB S3 secret on a side connection that shares the same DuckDB instance as the main client-query DB but has its own Go connection pool — so CREATE OR REPLACE SECRET never queues behind a long-running client query.
  • Run the rotation outside p.mu.Lock in reuseExistingActivation, so a slow refresh can't starve concurrent p.mu.RLock acquirers (the gRPC health-check goroutine snapshotting sessions for stall detection).

Why

mw-prod-us, 2026-05-04: ~10 worker pods/hour killed with "K8s worker unresponsive, deleting pod", every kill ~10 s after a "Refreshing S3 credentials..." log. Workers were idle on CPU and memory; this was a pure deadlock.

Root cause:

| Step | What | Where |
| --- | --- | --- |
| 1 | Credential refresh sends ActivateTenant | controlplane/multitenant.go (now per-CP after #474) |
| 2 | Worker enters reuseExistingActivation, takes p.mu.Lock | duckdbservice/activation.go |
| 3 | Calls RefreshS3Secret → db.Exec("CREATE OR REPLACE SECRET ...") on the activation DB | server/server.go |
| 4 | Activation DB has MaxOpenConns=1; an active client query holds the conn → db.Exec queues indefinitely | |
| 5 | gRPC health check needs p.mu.RLock → starved behind p.mu.Lock | duckdbservice/flight_handler.go:212 |
| 6 | 3 × 3 s health-check timeouts = 9 s of unresponsive → CP force-deletes the pod | controlplane/worker_mgr.go |

Fix

Side-connection for control DDL

server/duckdb_pair.go introduces server.OpenDuckDBPair:

  • One *duckdb.Connector (one DuckDB Database)
  • Two *sql.DBs built via sql.OpenDB, each with MaxOpenConns(1)
  • The connector is wrapped in nonClosingConnector so *sql.DB.Close doesn't kill the underlying instance

SessionPool now stores a *DuckDBPair; controlDB is exposed for credential rotation. Both DBs see the same SecretManager and attached catalogs. The running query picks up the new STS token on its next S3 request — strictly better than today's "no rotation possible while a query holds the conn" behavior.
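
Roughly, the pair construction looks like the sketch below. This is a minimal approximation, not the real server/duckdb_pair.go: it assumes duckdb.NewConnector from duckdb-go-v2 with a nil connection-init hook, omits the threads/memory/extension configuration applied to the main DB, and (per the second commit) the builder ultimately lives in duckdbservice/ so the control-plane binary stays libduckdb-free.

```go
package server

import (
	"database/sql"
	"database/sql/driver"

	duckdb "github.com/duckdb/duckdb-go/v2"
)

// nonClosingConnector shadows the connector's Close so that closing either
// *sql.DB does not tear down the shared DuckDB instance.
type nonClosingConnector struct{ driver.Connector }

func (nonClosingConnector) Close() error { return nil }

// DuckDBPair is two Go connection pools over one DuckDB Database:
// Main serves client queries, Control serves short control DDL such as
// CREATE OR REPLACE SECRET.
type DuckDBPair struct {
	Main    *sql.DB
	Control *sql.DB
}

// OpenDuckDBPair opens a single connector and builds both pools on top of it.
func OpenDuckDBPair(dsn string) (*DuckDBPair, error) {
	connector, err := duckdb.NewConnector(dsn, nil)
	if err != nil {
		return nil, err
	}
	wrapped := nonClosingConnector{connector}

	main := sql.OpenDB(wrapped)
	main.SetMaxOpenConns(1)

	control := sql.OpenDB(wrapped)
	control.SetMaxOpenConns(1)

	return &DuckDBPair{Main: main, Control: control}, nil
}
```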

Lock-released refresh

reuseExistingActivation is now three phases:

  1. Locked: inspect activation, decide if refresh is needed, capture refresh fn / DB / sem.
  2. Unlocked: run RefreshS3Secret on controlDB.
  3. Locked: re-validate (activation may have changed) and commit the new payload.

Health-check RLock acquisition no longer waits for the secret to commit.
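
A rough sketch of that phase split, with stand-in types and field names (the real structures live in duckdbservice/activation.go and carry more state than shown here):

```go
package duckdbservice

import (
	"context"
	"database/sql"
	"sync"
	"time"
)

// activation is a stand-in for the real activation record.
type activation struct {
	generation int64
	expiresAt  time.Time
	payload    string // e.g. the STS credentials behind the DuckDB secret
}

type SessionPool struct {
	mu          sync.RWMutex
	activations map[string]*activation
	controlDB   *sql.DB
	// refreshS3Secret is a field so tests can inject a blocking stub.
	refreshS3Secret func(ctx context.Context, db *sql.DB) (string, error)
}

func (p *SessionPool) reuseExistingActivation(ctx context.Context, tenantID string) error {
	// Phase 1 (locked): inspect the activation, decide whether a refresh is
	// needed, and capture everything the slow work requires.
	p.mu.Lock()
	act, ok := p.activations[tenantID]
	if !ok || time.Until(act.expiresAt) > 5*time.Minute {
		p.mu.Unlock()
		return nil // nothing to refresh
	}
	refresh, controlDB, gen := p.refreshS3Secret, p.controlDB, act.generation
	p.mu.Unlock()

	// Phase 2 (unlocked): CREATE OR REPLACE SECRET runs on the control DB.
	// Health-check goroutines can take p.mu.RLock freely during this call.
	payload, err := refresh(ctx, controlDB)
	if err != nil {
		return err
	}

	// Phase 3 (locked): re-validate and commit; the activation may have been
	// replaced while the lock was released, in which case the stale payload
	// is dropped.
	p.mu.Lock()
	defer p.mu.Unlock()
	if cur, ok := p.activations[tenantID]; ok && cur.generation == gen {
		cur.payload = payload
	}
	return nil
}
```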

Test plan

  • TestReuseExistingActivationDoesNotBlockHealthChecks (sketched after this list) — injects a blocking refreshS3Secret stub, asserts a concurrent p.mu.RLock acquirer returns within 100 ms during the slow refresh, and asserts the rotation runs on controlDB (not Main).
  • Confirmed the test FAILS on the pre-fix code path (locally simulated by reverting the unlock — RLock blocked, test fired its deadlock-regression message).
  • All existing ./duckdbservice/, ./server/, and ./controlplane/ tests pass with -tags kubernetes.
  • Deploy to mw-dev, watch Refreshing S3 credentials... complete cleanly while a long query is in flight.
  • Deploy to mw-prod-us, confirm K8s worker unresponsive rate drops to zero on managed-warehouse workers.
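
A compressed sketch of the test's shape, reusing the stand-in SessionPool from the sketch above; the real test also asserts which DB the rotation ran on, which this sketch omits.

```go
package duckdbservice

import (
	"context"
	"database/sql"
	"testing"
	"time"
)

func TestReuseExistingActivationDoesNotBlockHealthChecks(t *testing.T) {
	// Seed one activation that is already due for refresh (stand-in types
	// from the sketch above).
	p := &SessionPool{
		activations: map[string]*activation{"tenant-a": {expiresAt: time.Now()}},
	}

	// Blocking stub: simulates CREATE OR REPLACE SECRET queued behind a
	// long-running client query.
	started, release := make(chan struct{}), make(chan struct{})
	p.refreshS3Secret = func(ctx context.Context, db *sql.DB) (string, error) {
		close(started)
		<-release
		return "rotated", nil
	}
	defer close(release)

	go func() { _ = p.reuseExistingActivation(context.Background(), "tenant-a") }()
	<-started // the slow refresh is now in flight

	// The health check only needs p.mu.RLock; it must not wait for the
	// rotation to commit.
	acquired := make(chan struct{})
	go func() {
		p.mu.RLock()
		p.mu.RUnlock()
		close(acquired)
	}()

	select {
	case <-acquired:
		// healthy: RLock was not gated on the in-flight secret rotation
	case <-time.After(100 * time.Millisecond):
		t.Fatal("p.mu.RLock blocked behind the S3 secret refresh (deadlock regression)")
	}
}
```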

benben added 2 commits May 4, 2026 12:21
In mw-prod-us 2026-05-04 we observed ~10 worker pods/hour killed with
"K8s worker unresponsive, deleting pod" — every kill ~10 s after a
"Refreshing S3 credentials..." log. Workers had ~0.1 cores CPU, ~1 GiB
of 360 GiB memory, no OOM. Pure deadlock.

Root cause:

- reuseExistingActivation held p.mu.Lock across server.RefreshS3Secret.
- RefreshS3Secret runs CREATE OR REPLACE SECRET on the activation *sql.DB,
  which has MaxOpenConns=1 — when a long-running client query was already
  on that connection, the Exec queued indefinitely.
- The gRPC health check needs p.mu.RLock to snapshot sessions for stall
  detection. With p.mu.Lock held by the stuck refresh, RLock starved.
- Three CP-side health-check timeouts at 3 s each = 9 s of unresponsive,
  then the CP force-deletes the pod.

Two changes pinning down both layers:

1. Side-connection for control DDL. server.OpenDuckDBPair opens one
   *duckdb.Connector and returns Main + Control *sql.DBs sharing the same
   DuckDB Database (same SecretManager, same attached catalogs) but with
   independent Go connection pools. SessionPool now owns this pair via
   p.activePair / p.controlDB; CREATE OR REPLACE SECRET runs on Control,
   so credential rotation never queues behind a client query. The
   running query picks up the new token on its next S3 request — strictly
   better than today's "no rotation possible while a query holds the
   conn" behavior.

2. Lock-released refresh. reuseExistingActivation now snapshots state
   under p.mu.Lock, releases the lock, runs RefreshS3Secret (now on
   Control), then re-acquires to re-validate and commit the new payload.
   Health-check RLock acquisition is no longer gated on the secret
   rotation finishing.

The tests/perf build failure visible in `go build ./...` is unrelated
and pre-exists this branch.

Tests:

- TestReuseExistingActivationDoesNotBlockHealthChecks: injects a slow
  refreshS3Secret stub via a new test field, asserts a concurrent
  p.mu.RLock acquirer returns within 100 ms while refresh is in flight,
  and asserts the slow refresh ran on controlDB (not the main DB).
  Confirmed it FAILS on the pre-fix code path (regression-probe revert
  showed RLock blocked) and PASSES with both changes applied.
CI's "Verify cmd/duckgres-controlplane does not link libduckdb" check
caught duckdb_pair.go pulling github.com/duckdb/duckdb-go/v2 into
server/, which is in the CP's import graph.

Move the pair builder to duckdbservice/ (worker-only package — the CP
binary doesn't import it), exposing two helpers from server/ that don't
themselves pull in duckdb-go-v2:

- server.DuckDBDSN — returns the openBaseDB DSN.
- server.ConfigureMainDB — applies threads / memory_limit / extensions /
  profiling on a *sql.DB.

The duckdbservice pair builder calls those two plus the existing
server.ConfigureDBConnection. Result: the CP binary stays libduckdb-free
(verified locally with `go list -deps ./cmd/duckgres-controlplane |
grep duckdb-go` empty), and the worker still gets the same
shared-connector Main + Control pair.

Also fix two errcheck lint failures: the deadlock-regression test's
deferred *sql.DB Close calls now drop the return value explicitly.
@benben benben requested a review from a team May 4, 2026 11:20
@benben benben merged commit f1eb12d into main May 4, 2026
22 checks passed
@benben benben deleted the ben/worker-deadlock-control-db branch May 4, 2026 11:20
benben added a commit that referenced this pull request May 4, 2026
…518)

isQueryCancelled matches any error string containing "context canceled",
which conflates two distinct cases:

- The user/CP cancelled the request: pgwire CancelRequest, ctx deadline,
  client TCP close. c.ctx.Err() is non-nil.
- gRPC bubbled up a Canceled status because the *server* side closed —
  worker died, takeover, retire. c.ctx is still healthy.

The shared call sites in conn.go (and isUserQueryError's shortcut)
treated both as user cancellations, so worker-kill failures were silently
suppressed:

- Line 1627 short-circuited logQueryError, so "Query execution errored."
  never fired.
- isUserQueryError returned true for any cancellation, so even when
  logQueryError did run it logged at Info "Query execution failed.".

Effect on observability: alerting queries like
  count_over_time({namespace="duckgres"} |~ "Query execution errored." [5m])
showed zero events during the mw-prod-us deadlock incident on 2026-05-04
even though dozens of in-flight queries failed because the CP killed
their workers. PR #516 fixes the deadlock; this fixes the visibility
blind spot so the next class of infra cancellations actually pages.

Changes:

- Add (*clientConn).isCallerCancellation(err) — true only when the err
  is cancel-shaped AND c.ctx.Err() != nil.
- Replace isQueryCancelled in 17 clientConn-method error sites with the
  new method. classifyErrorCode (no clientConn) keeps isQueryCancelled
  since SQLSTATE 57014 is the right wire response either way.
- Drop isUserQueryError's "isQueryCancelled → return true" shortcut. The
  caller-cancellation case is now filtered upstream; any cancel-shaped
  err that reaches isUserQueryError is infra by definition and class 57
  routes it to Error level via the existing SQLSTATE table.
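
A rough sketch of the new predicate (the cancel-shape check reuses the
existing isQueryCancelled helper; the body is illustrative, not verbatim):

    // isCallerCancellation reports whether err is a cancellation that the
    // caller actually asked for. A gRPC Canceled that bubbles up while
    // c.ctx is still healthy is infra-side and must not be suppressed.
    func (c *clientConn) isCallerCancellation(err error) bool {
        return isQueryCancelled(err) && c.ctx.Err() != nil
    }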

Tests:

- TestIsCallerCancellation: gRPC Canceled with healthy ctx is NOT a
  caller cancel; user Ctrl-C (cancelled ctx) is.
- TestLogQueryErrorRoutesInfraCancelToErrorLevel: captures slog output,
  asserts a worker-death-shaped error logs "Query execution errored."
  at Error.
- TestIsUserQueryError "client cancellation" case updated to reflect the
  new semantics: any cancel reaching isUserQueryError is infra → false.
benben added a commit that referenced this pull request May 5, 2026
This is the actual root cause of the upstream-S3 501 NotImplemented errors
on parquet writes that we've been chasing through three previous PRs.

forwardUncached built its outbound request via http.NewRequestWithContext
with the inbound r.Body as the body. Per Go's docs:

    When body is of type *bytes.Buffer, *bytes.Reader, or *strings.Reader,
    the returned request's ContentLength is set [...]. For other types,
    the default is left as 0; the body is then sent using chunked
    transfer encoding.

r.Body is a generic io.ReadCloser, so the outbound req had ContentLength=0
and Go's Transport fell back to Transfer-Encoding: chunked. AWS S3 returns
501 NotImplemented for chunked PUT/POST regardless of whether the client
intended chunked.

So even though DuckDB sent a perfectly clean Content-Length-bearing PUT,
the proxy was rewriting it as chunked and S3 rejected it. The chain of
prior PRs (#516 deadlock fix, #518 / #519 logging, #524 / #525 / #526
visibility) gave us the breadcrumbs to actually see this; the fix itself
is one line: mirror ContentLength + TransferEncoding + Trailer from the
inbound request so the wire shape is preserved.
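
Sketch of the fix's shape (surrounding proxy variables are illustrative):

    out, err := http.NewRequestWithContext(ctx, r.Method, targetURL, r.Body)
    if err != nil {
        return err
    }
    // Mirror the inbound wire shape so Go's Transport doesn't rewrite the
    // body as Transfer-Encoding: chunked, which S3 rejects with 501.
    out.ContentLength = r.ContentLength
    out.TransferEncoding = r.TransferEncoding
    out.Trailer = r.Trailer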

Verified the chunked / 501 chain is impossible to recreate now: AWS
Sigv4 signs `host;x-amz-content-sha256;x-amz-date` (per
duckdb-httpfs/src/s3fs.cpp:84), so neither Content-Length nor
Transfer-Encoding is in the signed headers list — meaning we're free to
set Content-Length without invalidating the signature.

Tests cover the proxy-as-transparent-forwarder invariant for the
non-cached path:

- TestForwardUncachedPropagatesContentLength: regression — outbound
  ContentLength matches inbound, no Transfer-Encoding: chunked. Verified
  this fails on the pre-fix code (origin sees ContentLength=-1 and
  Transfer-Encoding: chunked).
- TestForwardUncachedPreservesRequestHeaders: Authorization, x-amz-*,
  custom headers round-trip to origin verbatim.
- TestForwardUncachedStripsHopByHopBothDirections: Connection /
  Keep-Alive stripped per RFC 7230 in both request and response, while
  non-hop-by-hop headers pass through.
- TestForwardUncachedPreservesQueryString: AWS multipart-upload params
  (?uploads, ?partNumber, ?uploadId=...) round-trip exactly so Sigv4's
  canonical-request hash is preserved.
- TestForwardUncachedPreservesResponseBodyBytewise / Non2xx: response
  body bytes (binary, XML envelopes) are forwarded byte-for-byte. Locks
  in that the log_preview capture on non-2xx doesn't corrupt the body.
- TestForwardUncachedPreservesMethod: PUT/POST/DELETE/HEAD all reach
  the origin unchanged.