fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410) #1111
Merged
… (KEEP-410) Wrap both `fetchCompletedStepOutputStep` (single-row) and `fetchCompletedStepOutputsBatchStep` (batch) in a Drizzle transaction. The first statement in each transaction is `SET LOCAL statement_timeout = '2s'`, ensuring the timeout applies only to authority reads and reverts automatically on commit/rollback without leaking across pooled connections.

Add an `isStatementTimeout()` helper that checks SQLSTATE 57014 both on the thrown error and on `error.cause` (`DrizzleQueryError` wraps the driver error on `cause`). Both the single and batch catch blocks now emit `outcome=timeout` instead of `outcome=error` for SQLSTATE 57014, making timeouts observable in metrics independently of other DB failures.
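As a rough illustration of the transaction shape described above (assuming Drizzle's `db.transaction(fn)` API; the fake `db` below just records statements so the ordering is checkable without Postgres, and the table/function names are hypothetical, not the project's real identifiers):

```typescript
// Minimal sketch: SET LOCAL must be the first statement inside the
// transaction so the 2s bound applies to the authority read and then
// reverts at commit/rollback, never leaking to other pooled queries.
type Tx = { execute: (sql: string) => Promise<void> };

const executed: string[] = [];

const db = {
  // Real Drizzle wraps fn in BEGIN/COMMIT on a pooled connection; this fake
  // only records the SQL text that would be issued.
  async transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T> {
    return fn({
      execute: async (sql: string) => {
        executed.push(sql);
      },
    });
  },
};

// Hypothetical stand-in for the single-row authority read.
async function fetchCompletedStepOutput(stepId: string): Promise<void> {
  await db.transaction(async (tx) => {
    await tx.execute("SET LOCAL statement_timeout = '2s'"); // first statement
    await tx.execute("SELECT output FROM step_outputs WHERE step_id = $1");
  });
}
```

Because `SET LOCAL` is scoped to the transaction, no `RESET` is needed on either the success or the error path.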
Add 9 new tests covering SQLSTATE 57014 statement timeout detection in both the single-row (`getCompletedStepOutput`) and batch (`getCompletedStepOutputs`) paths:

- timeout returns null
- timeout increments the `outcome=timeout` counter, not `outcome=error`
- cache evicts on timeout so the next call retries the DB
- `DrizzleQueryError` wrapping (the `error.cause` pattern) is also detected
- generic errors still increment `outcome=error` (regression guard)
- the same four checks mirrored for the batch convergence path
Summary
Without a `statement_timeout`, a single slow query holds a pool slot indefinitely; under an RDS hiccup or planner regression, all 10 pool slots can stall, cascading into workflow failures across every concurrent execution on the pod. The kill-switch (`KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false`) is the only relief and requires a pod restart cycle.

This PR bounds both authority reads (`fetchCompletedStepOutputStep` and `fetchCompletedStepOutputsBatchStep`) by wrapping each in a Drizzle transaction with `SET LOCAL statement_timeout = '2s'` as the first statement. The pattern mirrors existing precedent at `keeperhub-executor/workflow-runner.ts:99`.

On SQLSTATE `57014` (canceling statement due to statement timeout), the helper returns null and increments `tracker_db_fallback.total{outcome='timeout'}` (a new label, distinct from `error`). Same containment shape as the existing error path; this just adds a hard upper bound.

`isStatementTimeout()` follows `lib/db/connection-utils.ts:113-119` and checks both `error.code` and `error.cause.code` for postgres-js' wrapped error shape.

Notes for review
Generic DB errors still increment `outcome='error'`.

The `db.transaction` + `SET LOCAL` SQL path isn't exercised by mocks. Suggested smoke test: run a workflow against `pnpm dev` with an `EXPLAIN`-traceable query.

With `KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false`, the helper returns null before `db.transaction()` is opened. No DB connection is acquired in the off state.

Open follow-up
`statement_timeout` bounds each query but NOT the connection-pool queue-wait: the 11th caller still waits up to 30s on a saturated pool. The kill-switch is the only escape and requires a pod restart. A circuit breaker would give sub-second relief. Filed as a follow-up; out of scope for this PR.

Test plan
- `pnpm dev` builds cleanly (Workflow DevKit bundler accepts the changes; verified locally)
- Against `pnpm dev`: run a fan-out workflow, verify `tracker_db_fallback.total{outcome='hit'|'miss'|'error'|'timeout'}` is wired and increments per call
- Verify no `outcome='timeout'` increments under steady state; if any fire, investigate before prod
- Alert on the `tracker_db_fallback.total{outcome='timeout'}` rate exceeding zero

Closes KEEP-410.
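For the record, the circuit breaker filed as a follow-up could be as small as this sketch (entirely hypothetical; not part of this PR). After `threshold` consecutive timeouts the breaker opens and callers fail fast, skipping the DB and therefore the pool queue, until `cooldownMs` elapses:

```typescript
// Minimal consecutive-failure circuit breaker. Unlike statement_timeout,
// which only bounds a query once it holds a pool slot, an open breaker
// rejects callers before they ever queue for a connection.
class DbBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private cooldownMs = 10_000,
  ) {}

  // Should the caller attempt the DB read?
  allow(now = Date.now()): boolean {
    if (this.failures < this.threshold) return true; // closed: proceed
    if (now - this.openedAt >= this.cooldownMs) {
      this.failures = this.threshold - 1; // half-open: let one probe through
      return true;
    }
    return false; // open: fail fast, sub-second relief
  }

  recordSuccess(): void {
    this.failures = 0; // any success fully closes the breaker
  }

  recordTimeout(now = Date.now()): void {
    this.failures += 1;
    if (this.failures === this.threshold) this.openedAt = now;
  }
}
```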