fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410) by eskp · Pull Request #1111 · KeeperHub/keeperhub

eskp · 2026-05-04T03:21:56Z

Summary

The DB-backed step-success authority added in PR fix(executor): cross-process step-success authority (KEEP-395+398) #1110 issues Postgres reads from inside the workflow body. Without a statement_timeout, a single slow query holds a pool slot indefinitely; under RDS hiccup or planner regression, all 10 pool slots can stall, cascading into workflow failures across every concurrent execution on the pod. The kill-switch (KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false) is the only relief and requires a pod restart cycle.
This PR bounds both helpers (fetchCompletedStepOutputStep and fetchCompletedStepOutputsBatchStep) by wrapping each in a Drizzle transaction with SET LOCAL statement_timeout = '2s' as the first statement. Pattern mirrors existing precedent at keeperhub-executor/workflow-runner.ts:99.
On Postgres SQLSTATE 57014 (canceling statement due to statement timeout), the helper returns null and increments tracker_db_fallback.total{outcome='timeout'} (new label distinct from error). Same containment shape as existing error path; just adds a hard upper bound.
Detection mirrors lib/db/connection-utils.ts:113-119 and checks both error.code and error.cause.code for postgres-js' wrapped error shape.

Notes for review

9 new unit tests cover timeout detection (single + batch), counter labeling, cache eviction on timeout, and regression-canary that non-timeout errors still land on outcome='error'.
Manual local-pg verification recommended before merge — unit tests mock at the step boundary, so the actual db.transaction + SET LOCAL SQL path isn't exercised by mocks. Suggested smoke test: run a workflow against pnpm dev with EXPLAIN-traceable query.
Latency cost: BEGIN + SET LOCAL + SELECT + COMMIT triples round-trip count for the cross-pod tracker-miss case. Estimated 2-5ms intra-region overhead, well within the 50ms p99 convergence-merge budget. Tracker-hit fast path is unaffected.
Kill-switch short-circuit confirmed: KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false returns null before db.transaction() is opened. No DB connection acquired in the off state.

Open follow-up

KEEP-413: app-level circuit breaker. statement_timeout bounds each query but NOT the connection-pool queue-wait — the 11th caller still waits up to 30s on a saturated pool. The kill-switch is the only escape and requires pod restart. A circuit breaker would give sub-second relief. Filed as a follow-up; out of scope for this PR.

Test plan

CI: 33/33 unit tests pass (24 pre-existing + 9 new)
pnpm dev builds cleanly (Workflow DevKit bundler accepts the changes — verified locally)
Manual smoke against pnpm dev: run a fan-out workflow, verify tracker_db_fallback.total{outcome='hit'|'miss'|'error'|'timeout'} is wired and increments per call
Staging UAT after merge: confirm zero outcome='timeout' increments under steady state; if any fire, investigate before prod
Prod monitoring: alert on tracker_db_fallback.total{outcome='timeout'} rate exceeding zero

Closes KEEP-410.

… (KEEP-410) Wrap both fetchCompletedStepOutputStep (single-row) and fetchCompletedStepOutputsBatchStep (batch) in a Drizzle transaction. The first statement in each transaction is SET LOCAL statement_timeout = '2s', ensuring the timeout applies only to authority reads and reverts automatically on commit/rollback without leaking across pooled connections. Add isStatementTimeout() helper that checks SQLSTATE 57014 at both the thrown error level and error.cause (DrizzleQueryError wraps the driver error on cause). Both single and batch catch blocks now emit outcome=timeout instead of outcome=error for SQLSTATE 57014, making timeouts observable in metrics independently of other DB failures.

Add 9 new tests covering SQLSTATE 57014 statement timeout detection in both the single-row (getCompletedStepOutput) and batch (getCompletedStepOutputs) paths: - timeout returns null - timeout increments outcome=timeout counter, not outcome=error - cache evicts on timeout so the next call retries the DB - DrizzleQueryError wrapping (error.cause pattern) is also detected - generic errors still increment outcome=error (regression guard) - same four checks mirrored for the batch convergence path

github-actions · 2026-05-04T07:07:08Z

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

Namespace: pr-1111
All Helm releases (Keeperhub, Scheduler, Event services)
PostgreSQL Database (including data)
LocalStack, Redis
All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

github-actions · 2026-05-04T07:07:09Z

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

The PR never had the deploy-pr-environment label
The environment was already cleaned up
The deployment never completed successfully

eskp added 2 commits May 4, 2026 13:06

eskp requested review from a team, OleksandrUA, joelorzet and suisuss and removed request for a team May 4, 2026 03:21

eskp temporarily deployed to staging May 4, 2026 03:22 — with GitHub Actions Inactive

eskp merged commit 6ed5397 into staging May 4, 2026
24 checks passed

eskp deleted the simon/keep-410-stmt-timeout branch May 4, 2026 07:06

eskp temporarily deployed to staging May 4, 2026 07:06 — with GitHub Actions Inactive

eskp mentioned this pull request May 4, 2026

release: to prod #1117

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410)#1111

fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410)#1111
eskp merged 2 commits intostagingfrom
simon/keep-410-stmt-timeout

eskp commented May 4, 2026

Uh oh!

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eskp commented May 4, 2026

Summary

Notes for review

Open follow-up

Test plan

Uh oh!

Uh oh!

github-actions Bot commented May 4, 2026

🧹 PR Environment Cleaned Up

Uh oh!

github-actions Bot commented May 4, 2026

ℹ️ No PR Environment to Clean Up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant