Skip to content

fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410)#1111

Merged
eskp merged 2 commits intostagingfrom
simon/keep-410-stmt-timeout
May 4, 2026
Merged

fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410)#1111
eskp merged 2 commits intostagingfrom
simon/keep-410-stmt-timeout

Conversation

@eskp
Copy link
Copy Markdown

@eskp eskp commented May 4, 2026

Summary

  • The DB-backed step-success authority added in PR fix(executor): cross-process step-success authority (KEEP-395+398) #1110 issues Postgres reads from inside the workflow body. Without a statement_timeout, a single slow query holds a pool slot indefinitely; under RDS hiccup or planner regression, all 10 pool slots can stall, cascading into workflow failures across every concurrent execution on the pod. The kill-switch (KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false) is the only relief and requires a pod restart cycle.
  • This PR bounds both helpers (fetchCompletedStepOutputStep and fetchCompletedStepOutputsBatchStep) by wrapping each in a Drizzle transaction with SET LOCAL statement_timeout = '2s' as the first statement. Pattern mirrors existing precedent at keeperhub-executor/workflow-runner.ts:99.
  • On Postgres SQLSTATE 57014 (canceling statement due to statement timeout), the helper returns null and increments tracker_db_fallback.total{outcome='timeout'} (new label distinct from error). Same containment shape as existing error path; just adds a hard upper bound.
  • Detection mirrors lib/db/connection-utils.ts:113-119 and checks both error.code and error.cause.code for postgres-js' wrapped error shape.

Notes for review

  • 9 new unit tests cover timeout detection (single + batch), counter labeling, cache eviction on timeout, and regression-canary that non-timeout errors still land on outcome='error'.
  • Manual local-pg verification recommended before merge — unit tests mock at the step boundary, so the actual db.transaction + SET LOCAL SQL path isn't exercised by mocks. Suggested smoke test: run a workflow against pnpm dev with EXPLAIN-traceable query.
  • Latency cost: BEGIN + SET LOCAL + SELECT + COMMIT triples round-trip count for the cross-pod tracker-miss case. Estimated 2-5ms intra-region overhead, well within the 50ms p99 convergence-merge budget. Tracker-hit fast path is unaffected.
  • Kill-switch short-circuit confirmed: KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false returns null before db.transaction() is opened. No DB connection acquired in the off state.

Open follow-up

  • KEEP-413: app-level circuit breaker. statement_timeout bounds each query but NOT the connection-pool queue-wait — the 11th caller still waits up to 30s on a saturated pool. The kill-switch is the only escape and requires pod restart. A circuit breaker would give sub-second relief. Filed as a follow-up; out of scope for this PR.

Test plan

  • CI: 33/33 unit tests pass (24 pre-existing + 9 new)
  • pnpm dev builds cleanly (Workflow DevKit bundler accepts the changes — verified locally)
  • Manual smoke against pnpm dev: run a fan-out workflow, verify tracker_db_fallback.total{outcome='hit'|'miss'|'error'|'timeout'} is wired and increments per call
  • Staging UAT after merge: confirm zero outcome='timeout' increments under steady state; if any fire, investigate before prod
  • Prod monitoring: alert on tracker_db_fallback.total{outcome='timeout'} rate exceeding zero

Closes KEEP-410.

eskp added 2 commits May 4, 2026 13:06
… (KEEP-410)

Wrap both fetchCompletedStepOutputStep (single-row) and
fetchCompletedStepOutputsBatchStep (batch) in a Drizzle transaction.
The first statement in each transaction is SET LOCAL statement_timeout = '2s',
ensuring the timeout applies only to authority reads and reverts automatically
on commit/rollback without leaking across pooled connections.

Add isStatementTimeout() helper that checks SQLSTATE 57014 at both the
thrown error level and error.cause (DrizzleQueryError wraps the driver
error on cause). Both single and batch catch blocks now emit
outcome=timeout instead of outcome=error for SQLSTATE 57014, making
timeouts observable in metrics independently of other DB failures.
Add 9 new tests covering SQLSTATE 57014 statement timeout detection in
both the single-row (getCompletedStepOutput) and batch
(getCompletedStepOutputs) paths:

- timeout returns null
- timeout increments outcome=timeout counter, not outcome=error
- cache evicts on timeout so the next call retries the DB
- DrizzleQueryError wrapping (error.cause pattern) is also detected
- generic errors still increment outcome=error (regression guard)
- same four checks mirrored for the batch convergence path
@eskp eskp requested review from a team, OleksandrUA, joelorzet and suisuss and removed request for a team May 4, 2026 03:21
@eskp eskp merged commit 6ed5397 into staging May 4, 2026
24 checks passed
@eskp eskp deleted the simon/keep-410-stmt-timeout branch May 4, 2026 07:06
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1111
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully

@eskp eskp mentioned this pull request May 4, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant