fix(executor): cross-process step-success authority (KEEP-395+398)#1110
Merged
fix(executor): cross-process step-success authority (KEEP-395+398)#1110
Conversation
Add failing/canary unit tests for KEEP-395 and KEEP-398: - get-completed-step-output.test.ts: tracker hit, DB fallback, single-flight - convergence-merge-authority.test.ts: 9-fan-in cross-process repro - spurious-recovery-post-drain.test.ts: post-drain reconciliation pass - cross-process-tracker-simulation.test.ts: regression canary for both bugs Fix executor-final-success and executor-spurious-max-retries to import from lightweight helper modules instead of executor.workflow.ts, removing pre-existing test-environment failures caused by the step-registry import chain.
Introduce DB-backed authority for step completion that survives cross-pod SDK checkpoint resumes, where the in-process step-success-tracker is empty on a fresh worker that did not run the predecessor steps. New modules: - runner-error-patterns.ts: spurious SDK error regexes extracted from executor.workflow.ts so spurious-recovery.ts can import them without pulling the full plugin registry chain - final-success.ts: computeFinalSuccess and ExecutionResult extracted for the same reason (test-importable without executor.workflow.ts) - get-completed-step-output.ts: tracker-then-DB dual-read with single-flight coalescing for concurrent convergence node queries on the same predecessor - spurious-recovery.ts: post-drain reconciliation pass for KEEP-398 convergence-tracker-merge.ts: add mergeFromAuthority (async, DB fallback) alongside the existing synchronous mergeFromTracker executor.workflow.ts: - Replace mergeFromTracker call-site with await mergeFromAuthority for KEEP-395 - Add post-drain reconcileSpuriousFailures pass before computeFinalSuccess for KEEP-398 - Add clearOutputCache to finally block alongside clearExecution - Re-export computeFinalSuccess, ExecutionResult, and error regexes from their new lightweight modules for backward compat
…execution_logs Adds output_raw (nullable JSONB) to store the unredacted step payload alongside the existing output column which receives redactSensitiveData(). The executor reads output_raw for cross-process resume; the UI/observability layer continues reading output. Also adds a partial composite index on (execution_id, node_id) WHERE status = 'success' via CONCURRENTLY to cover the executor authority lookup without blocking writes on deploy. Migration 0066: ADD COLUMN output_raw jsonb Migration 0067: CREATE INDEX CONCURRENTLY idx_exec_logs_success_lookup
…s authority logStepCompleteDb now writes two output columns per row: - output: redactSensitiveData(output) for observability/UI display (unchanged) - output_raw: raw unredacted payload, authoritative source for executor resume This closes the C1 redaction-divergence: cross-pod resume via DB now reads the same raw value that same-process tracker returns, eliminating silent "[REDACTED]" injection into downstream template values.
…kill-switch, observability
Addresses C1 (read side), C2, C3 (partial via batch helper), M1, M3, M4:
- Reads output_raw instead of output so cross-process resume returns raw
values not "[REDACTED]" strings (C1)
- Wraps queryDb rejection in .catch: evicts cache key, logs structured
warning, increments tracker_db_fallback.total{outcome=error}, returns null
so callers fall back to closure value -- rejection cannot poison the cache (C2)
- Adds getCompletedStepOutputs() batch helper: single WHERE node_id = ANY($1)
query for all tracker-miss predecessors, reducing N serial round-trips to
one for cold-pod fan-in convergence (C3)
- Kill-switch: KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false disables DB fallback
entirely at module load, reverting to tracker-only behaviour (M1)
- Filters iterationIndex IS NULL + forEachNodeId IS NULL to select only
canonical convergence rows; orderBy completedAt DESC as defense-in-depth (M3)
- Emits tracker_db_fallback.total{source,outcome} counter and
tracker_db_fallback.duration_ms latency histogram (M4)
…hority Replaces the sequential per-predecessor await loop with a single call to getCompletedStepOutputs(), which issues one WHERE node_id = ANY($1) query for all tracker-miss predecessors. For a 9-fan-in cold-pod resume this reduces ~45-225ms of serial RTTs to a single round-trip. Also marks mergeFromTracker @deprecated (KEEP-395) -- retained for the regression-canary unit test that demonstrates pre-fix behaviour.
…e labels
reconcileSpuriousFailures (M2): each per-node DB lookup is now wrapped in
try/catch. An error on one candidate logs a structured warning, increments
spurious_recovery.error.total, and continues to the next candidate -- a
single transient DB failure cannot abort the full reconcile pass.
executor.workflow.ts in-catch fast-path (M4): adds source="in_catch" label
to the existing spurious_recovery.total counter so in-catch vs post_drain
recoveries can be split in dashboards.
spurious-recovery.ts post-drain counter (M4): adds source="post_drain" label
to match, completing the {source} label on all spurious_recovery.total calls.
Updated existing tests to match new mock shape (outputRaw instead of output, findMany for batch path, findFirst for single-node path). New test coverage added: - output_raw flows through unredacted (apiKey field not replaced) - DB rejection evicts cache; subsequent call retries cleanly - Batch helper issues exactly 1 DB call for 9 missing predecessors - Iteration rows excluded by IS NULL filter - Latency histogram recorded on batch DB call - Error containment in reconcileSpuriousFailures: first candidate DB error skipped, second candidate recovered, correct counters incremented - spurious_recovery.total carries source="post_drain" label - 9-fan-in mergeFromAuthority issues exactly 1 DB call
drizzle-kit migrate wraps each migration file in a transaction and CREATE INDEX CONCURRENTLY cannot run inside a transaction block. Confirmed via local pnpm db:migrate dry-run. Accept the brief AccessExclusive lock during build of the partial status='success' index. If lock duration becomes an issue at prod scale, a follow-up can rebuild CONCURRENTLY out-of-band and drop this index.
…ll safety
Rows written before migration 0066 have output_raw IS NULL. Without this
fix, getCompletedStepOutput returned { output: null, source: 'db' } for
these rows, and the convergence merge overwrote the closure value with null
-- reincarnating KEEP-395 for any workflow that started before the migration
landed and resumed on a post-migration pod.
Add isNotNull(workflowExecutionLogs.outputRaw) to both the findFirst WHERE
clause (single-row path) and the findMany WHERE clause (batch convergence
path). The DB now returns no row for pre-migration records; the caller falls
back to its closure value, which is identical to pre-PR cross-process resume
behaviour. No regression on same-process resume (tracker hit, no DB query).
Update the test that previously enshrined the buggy behaviour: null outputRaw
is now a miss, not a { output: null, source: 'db' } hit. Add a batch-path
companion test asserting the result Map omits the key for pre-migration rows.
…ic import
The previous kill-switch describe block mutated process.env at runtime, which
has no effect on the module-level constant read at import time. It then
asserted the default-on path (DB IS queried), not the off-state -- giving zero
coverage of the emergency rollback path.
Replace with vi.stubEnv("KH_EXECUTOR_AUTHORITY_DB_FALLBACK", "false") +
vi.resetModules() + dynamic import() to re-load the module with the env var
present at load time. Two tests now exercise the actual off-state:
- getCompletedStepOutput returns null without calling findFirst
- getCompletedStepOutputs returns empty Map without calling findMany
vi.unstubAllEnvs() + vi.resetModules() after each test restores the default-on
import for subsequent describe blocks.
… workflow bundler Workflow function imports were pulling lib/db (postgres) and nanoid into the workflow bundle via get-completed-step-output.ts, which Workflow DevKit forbids. Extracted the Drizzle queries into get-completed-step-output.step.ts with "use step" directives. The SWC plugin in workflow mode replaces "use step" function bodies with runtime stubs and strips their module-level imports, so postgres and nanoid never appear in the workflow VM bundle. The in-process helper now calls those step functions instead of querying directly. Updated all four unit test files to mock fetchCompletedStepOutputStep and fetchCompletedStepOutputsBatchStep from the new step file instead of the lib/db mock target.
🧹 PR Environment Cleaned UpThe PR environment has been successfully deleted. Deleted Resources:
All resources have been cleaned up and will no longer incur costs. |
ℹ️ No PR Environment to Clean UpNo PR environment was found for this PR. This is expected if:
|
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
step-success-trackerMap was empty on the consumer side, so KEEP-395 (PR fix(executor): resolve convergence/template race + bounded downstream drain #1104) returned stale closure outputs and KEEP-398 (PR fix(executor): reconcile spurious max-retries via step-success-tracker #1103) refused to demote spurious "exceeded max retries" errors. Cross-pod resume is the actual prod failure mode; local UAT cannot reach it.output_rawcolumn onworkflow_execution_logs. Tracker stays as the fast path; on miss, a single batched query (WHERE node_id = ANY(...)) hydrates from the persisted unredacted step output. Both convergence merge and the post-drain spurious-recovery pass consume the same helper (getCompletedStepOutput/getCompletedStepOutputs).output_rawis necessary because the existingoutputcolumn isredactSensitiveData(...)'d for observability/UI; reading from it would inject[REDACTED]into downstream templates on cross-pod resume but not on single-process resume -- a divergent semantics regression worse than the bug being fixed. Surfaced during code review.KH_EXECUTOR_AUTHORITY_DB_FALLBACK(default on, read at module load -- toggling requires a pod restart) for ops emergency rollback. Errors in the DB path are contained insidegetCompletedStepOutput-- they evict the cache key, log a structured warning, incrementtracker_db_fallback.total{outcome=error}, and return null, falling back to closure value without cascading into a workflow failure.workflow.executor.tracker_db_fallback.total{source,outcome}counter,tracker_db_fallback.duration_mshistogram, andsource in {in_catch, post_drain}label on existingspurious_recovery.totalfor observability.output_raw IS NULL. BothfindFirstandfindManyWHERE clauses includeisNotNull(outputRaw)so these rows are treated as a miss -- the caller falls back to its closure value, identical to pre-PR cross-process resume behaviour, with no regression during the deploy window.Notes for review
(execution_id, node_id) WHERE status='success'). 0067 originally usedCREATE INDEX CONCURRENTLYbutdrizzle-kit migratewraps each migration in a transaction and CONCURRENTLY cannot run inside one -- confirmed empirically viapnpm db:migratedry-run. Accepted brief AccessExclusive lock during partial-index build (sub-second on prod-sized data; pre-create with CONCURRENTLY manually before deploy if the table exceeds 10M rows -- see DBA runbook in REVIEW-D-DBA.md).executionTracetruncation inincrementCompletedSteps(lib/workflow/executor/logging.ts:310-343) is a separate unguarded read-modify-write race confirmed during investigation (8-of-11 steps in trace on prod executionvdij1o81aos88rx4pve0q). Not fixed here -- suggest a follow-up Linear ticket.Test plan
pnpm db:migrateagainst a staging-state local DB showsoutput_rawcolumn onworkflow_execution_logsandidx_exec_logs_success_lookupindex presentdefi-position-aggregator-ethereum10x via wallet MCP, confirm output_race_warningsis empty and noexceeded max retrieserrors surfacestablecoin-yield-compare-base5x via wallet MCP, same checksKH_EXECUTOR_AUTHORITY_DB_FALLBACK=falsekill-switch: notracker_db_fallback.totalcounter increments while flag is off (requires pod restart to take effect)tracker_db_fallback.total{source,outcome},tracker_db_fallback.duration_ms,spurious_recovery.total{source}tracker_degradedwarns before/after deploy -- should approach zero (each remaining one is a healed cross-pod resume, not silent data loss)Closes KEEP-395, KEEP-398.