fix(executor): cross-process step-success authority (KEEP-395+398) by eskp · Pull Request #1110 · KeeperHub/keeperhub

eskp · 2026-05-04T00:02:00Z

Summary

Convergence merges and spurious-max-retries recovery were silently failing in production when Workflow DevKit checkpoint-resumed the workflow body on a different process from where its predecessors ran. The in-process step-success-tracker Map was empty on the consumer side, so KEEP-395 (PR fix(executor): resolve convergence/template race + bounded downstream drain #1104) returned stale closure outputs and KEEP-398 (PR fix(executor): reconcile spurious max-retries via step-success-tracker #1103) refused to demote spurious "exceeded max retries" errors. Cross-pod resume is the actual prod failure mode; local UAT cannot reach it.
This PR replaces the Map oracle with a durable DB-backed authority via a new output_raw column on workflow_execution_logs. Tracker stays as the fast path; on miss, a single batched query (WHERE node_id = ANY(...)) hydrates from the persisted unredacted step output. Both convergence merge and the post-drain spurious-recovery pass consume the same helper (getCompletedStepOutput / getCompletedStepOutputs).
output_raw is necessary because the existing output column is redactSensitiveData(...)'d for observability/UI; reading from it would inject [REDACTED] into downstream templates on cross-pod resume but not on single-process resume -- a divergent semantics regression worse than the bug being fixed. Surfaced during code review.
DB-fallback path is gated by KH_EXECUTOR_AUTHORITY_DB_FALLBACK (default on, read at module load -- toggling requires a pod restart) for ops emergency rollback. Errors in the DB path are contained inside getCompletedStepOutput -- they evict the cache key, log a structured warning, increment tracker_db_fallback.total{outcome=error}, and return null, falling back to closure value without cascading into a workflow failure.
Adds workflow.executor.tracker_db_fallback.total{source,outcome} counter, tracker_db_fallback.duration_ms histogram, and source in {in_catch, post_drain} label on existing spurious_recovery.total for observability.
Pre-migration rows (written before 0066 lands) have output_raw IS NULL. Both findFirst and findMany WHERE clauses include isNotNull(outputRaw) so these rows are treated as a miss -- the caller falls back to its closure value, identical to pre-PR cross-process resume behaviour, with no regression during the deploy window.

Notes for review

New migrations 0066 (column add, transactional, instant metadata-only since PG 11) and 0067 (partial composite index on (execution_id, node_id) WHERE status='success'). 0067 originally used CREATE INDEX CONCURRENTLY but drizzle-kit migrate wraps each migration in a transaction and CONCURRENTLY cannot run inside one -- confirmed empirically via pnpm db:migrate dry-run. Accepted brief AccessExclusive lock during partial-index build (sub-second on prod-sized data; pre-create with CONCURRENTLY manually before deploy if the table exceeds 10M rows -- see DBA runbook in REVIEW-D-DBA.md).
Supersedes fix(executor): reconcile spurious max-retries via step-success-tracker #1103 and fix(executor): resolve convergence/template race + bounded downstream drain #1104 in architecture but does not revert them; those merges are the baseline this PR builds on.
Adjacent open: executionTrace truncation in incrementCompletedSteps (lib/workflow/executor/logging.ts:310-343) is a separate unguarded read-modify-write race confirmed during investigation (8-of-11 steps in trace on prod execution vdij1o81aos88rx4pve0q). Not fixed here -- suggest a follow-up Linear ticket.

Test plan

CI: 54 new tests in this PR pass; 56 tests across the four KEEP-395/398 files pass; 84 tests across the full executor module pass
Migration applies cleanly: pnpm db:migrate against a staging-state local DB shows output_raw column on workflow_execution_logs and idx_exec_logs_success_lookup index present
Staging UAT: call defi-position-aggregator-ethereum 10x via wallet MCP, confirm output _race_warnings is empty and no exceeded max retries errors surface
Staging UAT: call stablecoin-yield-compare-base 5x via wallet MCP, same checks
Verify KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false kill-switch: no tracker_db_fallback.total counter increments while flag is off (requires pod restart to take effect)
Verify new metrics exported and visible in staging dashboard: tracker_db_fallback.total{source,outcome}, tracker_db_fallback.duration_ms, spurious_recovery.total{source}
Compare staging log volume of tracker_degraded warns before/after deploy -- should approach zero (each remaining one is a healed cross-pod resume, not silent data loss)

Closes KEEP-395, KEEP-398.

Add failing/canary unit tests for KEEP-395 and KEEP-398: - get-completed-step-output.test.ts: tracker hit, DB fallback, single-flight - convergence-merge-authority.test.ts: 9-fan-in cross-process repro - spurious-recovery-post-drain.test.ts: post-drain reconciliation pass - cross-process-tracker-simulation.test.ts: regression canary for both bugs Fix executor-final-success and executor-spurious-max-retries to import from lightweight helper modules instead of executor.workflow.ts, removing pre-existing test-environment failures caused by the step-registry import chain.

Introduce DB-backed authority for step completion that survives cross-pod SDK checkpoint resumes, where the in-process step-success-tracker is empty on a fresh worker that did not run the predecessor steps. New modules: - runner-error-patterns.ts: spurious SDK error regexes extracted from executor.workflow.ts so spurious-recovery.ts can import them without pulling the full plugin registry chain - final-success.ts: computeFinalSuccess and ExecutionResult extracted for the same reason (test-importable without executor.workflow.ts) - get-completed-step-output.ts: tracker-then-DB dual-read with single-flight coalescing for concurrent convergence node queries on the same predecessor - spurious-recovery.ts: post-drain reconciliation pass for KEEP-398 convergence-tracker-merge.ts: add mergeFromAuthority (async, DB fallback) alongside the existing synchronous mergeFromTracker executor.workflow.ts: - Replace mergeFromTracker call-site with await mergeFromAuthority for KEEP-395 - Add post-drain reconcileSpuriousFailures pass before computeFinalSuccess for KEEP-398 - Add clearOutputCache to finally block alongside clearExecution - Re-export computeFinalSuccess, ExecutionResult, and error regexes from their new lightweight modules for backward compat

…execution_logs Adds output_raw (nullable JSONB) to store the unredacted step payload alongside the existing output column which receives redactSensitiveData(). The executor reads output_raw for cross-process resume; the UI/observability layer continues reading output. Also adds a partial composite index on (execution_id, node_id) WHERE status = 'success' via CONCURRENTLY to cover the executor authority lookup without blocking writes on deploy. Migration 0066: ADD COLUMN output_raw jsonb Migration 0067: CREATE INDEX CONCURRENTLY idx_exec_logs_success_lookup

…s authority logStepCompleteDb now writes two output columns per row: - output: redactSensitiveData(output) for observability/UI display (unchanged) - output_raw: raw unredacted payload, authoritative source for executor resume This closes the C1 redaction-divergence: cross-pod resume via DB now reads the same raw value that same-process tracker returns, eliminating silent "[REDACTED]" injection into downstream template values.

…kill-switch, observability Addresses C1 (read side), C2, C3 (partial via batch helper), M1, M3, M4: - Reads output_raw instead of output so cross-process resume returns raw values not "[REDACTED]" strings (C1) - Wraps queryDb rejection in .catch: evicts cache key, logs structured warning, increments tracker_db_fallback.total{outcome=error}, returns null so callers fall back to closure value -- rejection cannot poison the cache (C2) - Adds getCompletedStepOutputs() batch helper: single WHERE node_id = ANY($1) query for all tracker-miss predecessors, reducing N serial round-trips to one for cold-pod fan-in convergence (C3) - Kill-switch: KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false disables DB fallback entirely at module load, reverting to tracker-only behaviour (M1) - Filters iterationIndex IS NULL + forEachNodeId IS NULL to select only canonical convergence rows; orderBy completedAt DESC as defense-in-depth (M3) - Emits tracker_db_fallback.total{source,outcome} counter and tracker_db_fallback.duration_ms latency histogram (M4)

@deprecated

…hority Replaces the sequential per-predecessor await loop with a single call to getCompletedStepOutputs(), which issues one WHERE node_id = ANY($1) query for all tracker-miss predecessors. For a 9-fan-in cold-pod resume this reduces ~45-225ms of serial RTTs to a single round-trip. Also marks mergeFromTracker @deprecated (KEEP-395) -- retained for the regression-canary unit test that demonstrates pre-fix behaviour.

…e labels reconcileSpuriousFailures (M2): each per-node DB lookup is now wrapped in try/catch. An error on one candidate logs a structured warning, increments spurious_recovery.error.total, and continues to the next candidate -- a single transient DB failure cannot abort the full reconcile pass. executor.workflow.ts in-catch fast-path (M4): adds source="in_catch" label to the existing spurious_recovery.total counter so in-catch vs post_drain recoveries can be split in dashboards. spurious-recovery.ts post-drain counter (M4): adds source="post_drain" label to match, completing the {source} label on all spurious_recovery.total calls.

Updated existing tests to match new mock shape (outputRaw instead of output, findMany for batch path, findFirst for single-node path). New test coverage added: - output_raw flows through unredacted (apiKey field not replaced) - DB rejection evicts cache; subsequent call retries cleanly - Batch helper issues exactly 1 DB call for 9 missing predecessors - Iteration rows excluded by IS NULL filter - Latency histogram recorded on batch DB call - Error containment in reconcileSpuriousFailures: first candidate DB error skipped, second candidate recovered, correct counters incremented - spurious_recovery.total carries source="post_drain" label - 9-fan-in mergeFromAuthority issues exactly 1 DB call

drizzle-kit migrate wraps each migration file in a transaction and CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Confirmed via local pnpm db:migrate dry-run. Accept the brief AccessExclusive lock during build of the partial status='success' index. If lock duration becomes an issue at prod scale, a follow-up can rebuild CONCURRENTLY out-of-band and drop this index.

…ll safety Rows written before migration 0066 have output_raw IS NULL. Without this fix, getCompletedStepOutput returned { output: null, source: 'db' } for these rows, and the convergence merge overwrote the closure value with null -- reincarnating KEEP-395 for any workflow that started before the migration landed and resumed on a post-migration pod. Add isNotNull(workflowExecutionLogs.outputRaw) to both the findFirst WHERE clause (single-row path) and the findMany WHERE clause (batch convergence path). The DB now returns no row for pre-migration records; the caller falls back to its closure value, which is identical to pre-PR cross-process resume behaviour. No regression on same-process resume (tracker hit, no DB query). Update the test that previously enshrined the buggy behaviour: null outputRaw is now a miss, not a { output: null, source: 'db' } hit. Add a batch-path companion test asserting the result Map omits the key for pre-migration rows.

…ic import The previous kill-switch describe block mutated process.env at runtime, which has no effect on the module-level constant read at import time. It then asserted the default-on path (DB IS queried), not the off-state -- giving zero coverage of the emergency rollback path. Replace with vi.stubEnv("KH_EXECUTOR_AUTHORITY_DB_FALLBACK", "false") + vi.resetModules() + dynamic import() to re-load the module with the env var present at load time. Two tests now exercise the actual off-state: - getCompletedStepOutput returns null without calling findFirst - getCompletedStepOutputs returns empty Map without calling findMany vi.unstubAllEnvs() + vi.resetModules() after each test restores the default-on import for subsequent describe blocks.

… workflow bundler Workflow function imports were pulling lib/db (postgres) and nanoid into the workflow bundle via get-completed-step-output.ts, which Workflow DevKit forbids. Extracted the Drizzle queries into get-completed-step-output.step.ts with "use step" directives. The SWC plugin in workflow mode replaces "use step" function bodies with runtime stubs and strips their module-level imports, so postgres and nanoid never appear in the workflow VM bundle. The in-process helper now calls those step functions instead of querying directly. Updated all four unit test files to mock fetchCompletedStepOutputStep and fetchCompletedStepOutputsBatchStep from the new step file instead of the lib/db mock target.

github-actions · 2026-05-04T02:53:49Z

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

Namespace: pr-1110
All Helm releases (Keeperhub, Scheduler, Event services)
PostgreSQL Database (including data)
LocalStack, Redis
All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

github-actions · 2026-05-04T02:53:50Z

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

The PR never had the deploy-pr-environment label
The environment was already cleaned up
The deployment never completed successfully

eskp added 9 commits May 4, 2026 04:48

eskp requested review from a team, OleksandrUA, joelorzet and suisuss and removed request for a team May 4, 2026 00:02

eskp had a problem deploying to staging May 4, 2026 00:02 — with GitHub Actions Failure

eskp added 2 commits May 4, 2026 10:33

eskp had a problem deploying to staging May 4, 2026 00:38 — with GitHub Actions Failure

eskp temporarily deployed to staging May 4, 2026 00:38 — with GitHub Actions Inactive

eskp temporarily deployed to staging May 4, 2026 01:30 — with GitHub Actions Inactive

eskp merged commit 7040e29 into staging May 4, 2026
33 checks passed

eskp deleted the simon/keep-395-keep-398-fix-cross-process-step-tracker branch May 4, 2026 02:53

eskp temporarily deployed to staging May 4, 2026 02:53 — with GitHub Actions Inactive

eskp mentioned this pull request May 4, 2026

fix(executor): bound DB-authority queries with 2s statement_timeout (KEEP-410) #1111

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(executor): cross-process step-success authority (KEEP-395+398)#1110

fix(executor): cross-process step-success authority (KEEP-395+398)#1110
eskp merged 12 commits intostagingfrom
simon/keep-395-keep-398-fix-cross-process-step-tracker

eskp commented May 4, 2026 •

edited

Loading

Uh oh!

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

eskp commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Notes for review

Test plan

Uh oh!

Uh oh!

github-actions Bot commented May 4, 2026

🧹 PR Environment Cleaned Up

Uh oh!

github-actions Bot commented May 4, 2026

ℹ️ No PR Environment to Clean Up

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

eskp commented May 4, 2026 •

edited

Loading