Skip to content

fix(executor): cross-process step-success authority (KEEP-395+398)#1110

Merged
eskp merged 12 commits intostagingfrom
simon/keep-395-keep-398-fix-cross-process-step-tracker
May 4, 2026
Merged

fix(executor): cross-process step-success authority (KEEP-395+398)#1110
eskp merged 12 commits intostagingfrom
simon/keep-395-keep-398-fix-cross-process-step-tracker

Conversation

@eskp
Copy link
Copy Markdown

@eskp eskp commented May 4, 2026

Summary

  • Convergence merges and spurious-max-retries recovery were silently failing in production when Workflow DevKit checkpoint-resumed the workflow body on a different process from where its predecessors ran. The in-process step-success-tracker Map was empty on the consumer side, so KEEP-395 (PR fix(executor): resolve convergence/template race + bounded downstream drain #1104) returned stale closure outputs and KEEP-398 (PR fix(executor): reconcile spurious max-retries via step-success-tracker #1103) refused to demote spurious "exceeded max retries" errors. Cross-pod resume is the actual prod failure mode; local UAT cannot reach it.
  • This PR replaces the Map oracle with a durable DB-backed authority via a new output_raw column on workflow_execution_logs. Tracker stays as the fast path; on miss, a single batched query (WHERE node_id = ANY(...)) hydrates from the persisted unredacted step output. Both convergence merge and the post-drain spurious-recovery pass consume the same helper (getCompletedStepOutput / getCompletedStepOutputs).
  • output_raw is necessary because the existing output column is redactSensitiveData(...)'d for observability/UI; reading from it would inject [REDACTED] into downstream templates on cross-pod resume but not on single-process resume -- a divergent semantics regression worse than the bug being fixed. Surfaced during code review.
  • DB-fallback path is gated by KH_EXECUTOR_AUTHORITY_DB_FALLBACK (default on, read at module load -- toggling requires a pod restart) for ops emergency rollback. Errors in the DB path are contained inside getCompletedStepOutput -- they evict the cache key, log a structured warning, increment tracker_db_fallback.total{outcome=error}, and return null, falling back to closure value without cascading into a workflow failure.
  • Adds workflow.executor.tracker_db_fallback.total{source,outcome} counter, tracker_db_fallback.duration_ms histogram, and source in {in_catch, post_drain} label on existing spurious_recovery.total for observability.
  • Pre-migration rows (written before 0066 lands) have output_raw IS NULL. Both findFirst and findMany WHERE clauses include isNotNull(outputRaw) so these rows are treated as a miss -- the caller falls back to its closure value, identical to pre-PR cross-process resume behaviour, with no regression during the deploy window.

Notes for review

  • New migrations 0066 (column add, transactional, instant metadata-only since PG 11) and 0067 (partial composite index on (execution_id, node_id) WHERE status='success'). 0067 originally used CREATE INDEX CONCURRENTLY but drizzle-kit migrate wraps each migration in a transaction and CONCURRENTLY cannot run inside one -- confirmed empirically via pnpm db:migrate dry-run. Accepted brief AccessExclusive lock during partial-index build (sub-second on prod-sized data; pre-create with CONCURRENTLY manually before deploy if the table exceeds 10M rows -- see DBA runbook in REVIEW-D-DBA.md).
  • Supersedes fix(executor): reconcile spurious max-retries via step-success-tracker #1103 and fix(executor): resolve convergence/template race + bounded downstream drain #1104 in architecture but does not revert them; those merges are the baseline this PR builds on.
  • Adjacent open: executionTrace truncation in incrementCompletedSteps (lib/workflow/executor/logging.ts:310-343) is a separate unguarded read-modify-write race confirmed during investigation (8-of-11 steps in trace on prod execution vdij1o81aos88rx4pve0q). Not fixed here -- suggest a follow-up Linear ticket.

Test plan

  • CI: 54 new tests in this PR pass; 56 tests across the four KEEP-395/398 files pass; 84 tests across the full executor module pass
  • Migration applies cleanly: pnpm db:migrate against a staging-state local DB shows output_raw column on workflow_execution_logs and idx_exec_logs_success_lookup index present
  • Staging UAT: call defi-position-aggregator-ethereum 10x via wallet MCP, confirm output _race_warnings is empty and no exceeded max retries errors surface
  • Staging UAT: call stablecoin-yield-compare-base 5x via wallet MCP, same checks
  • Verify KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false kill-switch: no tracker_db_fallback.total counter increments while flag is off (requires pod restart to take effect)
  • Verify new metrics exported and visible in staging dashboard: tracker_db_fallback.total{source,outcome}, tracker_db_fallback.duration_ms, spurious_recovery.total{source}
  • Compare staging log volume of tracker_degraded warns before/after deploy -- should approach zero (each remaining one is a healed cross-pod resume, not silent data loss)

Closes KEEP-395, KEEP-398.

eskp added 9 commits May 4, 2026 04:48
Add failing/canary unit tests for KEEP-395 and KEEP-398:
- get-completed-step-output.test.ts: tracker hit, DB fallback, single-flight
- convergence-merge-authority.test.ts: 9-fan-in cross-process repro
- spurious-recovery-post-drain.test.ts: post-drain reconciliation pass
- cross-process-tracker-simulation.test.ts: regression canary for both bugs

Fix executor-final-success and executor-spurious-max-retries to import
from lightweight helper modules instead of executor.workflow.ts, removing
pre-existing test-environment failures caused by the step-registry import chain.
Introduce DB-backed authority for step completion that survives cross-pod
SDK checkpoint resumes, where the in-process step-success-tracker is empty
on a fresh worker that did not run the predecessor steps.

New modules:
- runner-error-patterns.ts: spurious SDK error regexes extracted from
  executor.workflow.ts so spurious-recovery.ts can import them without
  pulling the full plugin registry chain
- final-success.ts: computeFinalSuccess and ExecutionResult extracted
  for the same reason (test-importable without executor.workflow.ts)
- get-completed-step-output.ts: tracker-then-DB dual-read with single-flight
  coalescing for concurrent convergence node queries on the same predecessor
- spurious-recovery.ts: post-drain reconciliation pass for KEEP-398

convergence-tracker-merge.ts: add mergeFromAuthority (async, DB fallback)
alongside the existing synchronous mergeFromTracker

executor.workflow.ts:
- Replace mergeFromTracker call-site with await mergeFromAuthority for KEEP-395
- Add post-drain reconcileSpuriousFailures pass before computeFinalSuccess for KEEP-398
- Add clearOutputCache to finally block alongside clearExecution
- Re-export computeFinalSuccess, ExecutionResult, and error regexes from
  their new lightweight modules for backward compat
…execution_logs

Adds output_raw (nullable JSONB) to store the unredacted step payload
alongside the existing output column which receives redactSensitiveData().
The executor reads output_raw for cross-process resume; the UI/observability
layer continues reading output.

Also adds a partial composite index on (execution_id, node_id) WHERE
status = 'success' via CONCURRENTLY to cover the executor authority lookup
without blocking writes on deploy.

Migration 0066: ADD COLUMN output_raw jsonb
Migration 0067: CREATE INDEX CONCURRENTLY idx_exec_logs_success_lookup
…s authority

logStepCompleteDb now writes two output columns per row:
- output: redactSensitiveData(output) for observability/UI display (unchanged)
- output_raw: raw unredacted payload, authoritative source for executor resume

This closes the C1 redaction-divergence: cross-pod resume via DB now reads
the same raw value that same-process tracker returns, eliminating silent
"[REDACTED]" injection into downstream template values.
…kill-switch, observability

Addresses C1 (read side), C2, C3 (partial via batch helper), M1, M3, M4:

- Reads output_raw instead of output so cross-process resume returns raw
  values not "[REDACTED]" strings (C1)
- Wraps queryDb rejection in .catch: evicts cache key, logs structured
  warning, increments tracker_db_fallback.total{outcome=error}, returns null
  so callers fall back to closure value -- rejection cannot poison the cache (C2)
- Adds getCompletedStepOutputs() batch helper: single WHERE node_id = ANY($1)
  query for all tracker-miss predecessors, reducing N serial round-trips to
  one for cold-pod fan-in convergence (C3)
- Kill-switch: KH_EXECUTOR_AUTHORITY_DB_FALLBACK=false disables DB fallback
  entirely at module load, reverting to tracker-only behaviour (M1)
- Filters iterationIndex IS NULL + forEachNodeId IS NULL to select only
  canonical convergence rows; orderBy completedAt DESC as defense-in-depth (M3)
- Emits tracker_db_fallback.total{source,outcome} counter and
  tracker_db_fallback.duration_ms latency histogram (M4)
…hority

Replaces the sequential per-predecessor await loop with a single call to
getCompletedStepOutputs(), which issues one WHERE node_id = ANY($1) query
for all tracker-miss predecessors. For a 9-fan-in cold-pod resume this
reduces ~45-225ms of serial RTTs to a single round-trip.

Also marks mergeFromTracker @deprecated (KEEP-395) -- retained for the
regression-canary unit test that demonstrates pre-fix behaviour.
…e labels

reconcileSpuriousFailures (M2): each per-node DB lookup is now wrapped in
try/catch. An error on one candidate logs a structured warning, increments
spurious_recovery.error.total, and continues to the next candidate -- a
single transient DB failure cannot abort the full reconcile pass.

executor.workflow.ts in-catch fast-path (M4): adds source="in_catch" label
to the existing spurious_recovery.total counter so in-catch vs post_drain
recoveries can be split in dashboards.

spurious-recovery.ts post-drain counter (M4): adds source="post_drain" label
to match, completing the {source} label on all spurious_recovery.total calls.
Updated existing tests to match new mock shape (outputRaw instead of output,
findMany for batch path, findFirst for single-node path).

New test coverage added:
- output_raw flows through unredacted (apiKey field not replaced)
- DB rejection evicts cache; subsequent call retries cleanly
- Batch helper issues exactly 1 DB call for 9 missing predecessors
- Iteration rows excluded by IS NULL filter
- Latency histogram recorded on batch DB call
- Error containment in reconcileSpuriousFailures: first candidate DB error
  skipped, second candidate recovered, correct counters incremented
- spurious_recovery.total carries source="post_drain" label
- 9-fan-in mergeFromAuthority issues exactly 1 DB call
drizzle-kit migrate wraps each migration file in a transaction and
CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
Confirmed via local pnpm db:migrate dry-run.

Accept the brief AccessExclusive lock during build of the partial
status='success' index. If lock duration becomes an issue at prod
scale, a follow-up can rebuild CONCURRENTLY out-of-band and drop
this index.
@eskp eskp requested review from a team, OleksandrUA, joelorzet and suisuss and removed request for a team May 4, 2026 00:02
eskp added 2 commits May 4, 2026 10:33
…ll safety

Rows written before migration 0066 have output_raw IS NULL. Without this
fix, getCompletedStepOutput returned { output: null, source: 'db' } for
these rows, and the convergence merge overwrote the closure value with null
-- reincarnating KEEP-395 for any workflow that started before the migration
landed and resumed on a post-migration pod.

Add isNotNull(workflowExecutionLogs.outputRaw) to both the findFirst WHERE
clause (single-row path) and the findMany WHERE clause (batch convergence
path). The DB now returns no row for pre-migration records; the caller falls
back to its closure value, which is identical to pre-PR cross-process resume
behaviour. No regression on same-process resume (tracker hit, no DB query).

Update the test that previously enshrined the buggy behaviour: null outputRaw
is now a miss, not a { output: null, source: 'db' } hit. Add a batch-path
companion test asserting the result Map omits the key for pre-migration rows.
…ic import

The previous kill-switch describe block mutated process.env at runtime, which
has no effect on the module-level constant read at import time. It then
asserted the default-on path (DB IS queried), not the off-state -- giving zero
coverage of the emergency rollback path.

Replace with vi.stubEnv("KH_EXECUTOR_AUTHORITY_DB_FALLBACK", "false") +
vi.resetModules() + dynamic import() to re-load the module with the env var
present at load time. Two tests now exercise the actual off-state:

- getCompletedStepOutput returns null without calling findFirst
- getCompletedStepOutputs returns empty Map without calling findMany

vi.unstubAllEnvs() + vi.resetModules() after each test restores the default-on
import for subsequent describe blocks.
… workflow bundler

Workflow function imports were pulling lib/db (postgres) and nanoid into the
workflow bundle via get-completed-step-output.ts, which Workflow DevKit forbids.

Extracted the Drizzle queries into get-completed-step-output.step.ts with
"use step" directives. The SWC plugin in workflow mode replaces "use step"
function bodies with runtime stubs and strips their module-level imports, so
postgres and nanoid never appear in the workflow VM bundle. The in-process
helper now calls those step functions instead of querying directly.

Updated all four unit test files to mock fetchCompletedStepOutputStep and
fetchCompletedStepOutputsBatchStep from the new step file instead of the
lib/db mock target.
@eskp eskp merged commit 7040e29 into staging May 4, 2026
33 checks passed
@eskp eskp deleted the simon/keep-395-keep-398-fix-cross-process-step-tracker branch May 4, 2026 02:53
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1110
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

ℹ️ No PR Environment to Clean Up

No PR environment was found for this PR. This is expected if:

  • The PR never had the deploy-pr-environment label
  • The environment was already cleaned up
  • The deployment never completed successfully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant