feat(core): checkpoint/resume (1.R) + human-gate suspend/resume & timeout (1.Q)#22
Conversation
…uisite)
A skip-propagated vertex emitted NOTHING, so the persisted event stream could not record which nodes a
condition dimmed — checkpoint/resume (1.R) reconstructs run state by replaying that stream, so resume
after a condition would mis-route. This adds `node:skipped` to make the log a complete, replayable
record (and it closes a real observability gap — surfaces never saw a node get skipped before).
- shared: `NodeSkippedEventSchema` ({ nodeId, reason: 'branch_not_taken' | 'upstream_unreachable' }) +
`NodeSkippedReason`; added to RUN_EVENT_TYPES + the RunEvent union; the contract-parity test now pins
19 names with a valid + reject fixture.
- engine: `#propagateSkips` collects the vertices it newly dims (+ a derived reason via `#skipReason`);
`#step` emits a durable `node:skipped` for each BEFORE any terminal settle (persist-before-deliver,
gap-free) so the log stays a complete record.
- docs: documented `node:skipped` in its canonical home (sse-event-schema.md).
- test: the 1.P condition e2e now asserts the dimmed branch emits node:skipped{branch_not_taken}.
Decided (per the 1.R Understand pass, maintainer-approved): a new `node:skipped` event over adding a
`selected` field to node:completed — it persists the skip decisions directly (no selectedTargets needed
on resume) and surfaces skips. Additive within ADR-0036; no new ADR.
pnpm turbo run lint typecheck test build format:check: green (579 core, 245 shared). Leakwatch: 0.
Refs: ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The read side that rebuilds a run's state from its persisted event stream so an interrupted run (crash,
or suspended at a gate) can resume — no checkpoint table; the state is DERIVED from `run_events`
(ADR-0003; execution-model.md §5). In-memory reference now; the SQLite/cloud store is Phase-2/CLI.
- checkpoint.ts: `Checkpointer { load(runId) }` + `CheckpointState` (schemaVersion, runStatus, nodeStates,
completedNodeIds, pendingGates, lastSequenceNumber) + the pure `reconstructCheckpointState(events)` —
a deterministic fold of the ordered stream. Trap (b) baked in: a node that emitted `node:started` but
no terminal event is ABSENT from nodeStates, so the rehydrating engine seeds it `pending` and re-runs
it (bounded by the idempotency key, not by skipping). A condition's `selectedTargets` is restored from
`node:completed.selected`; dimmed branches from `node:skipped`; a gate-parked run yields `pendingGates`
+ a `paused` node; a resumed gate records the decision as the node output.
- run-event.ts: `node:completed.selected?` — the authoritative record of a condition's branch selection
(the reconstruction needs it; `node:skipped` alone can't survive a crash between the condition's
completion and the dimmed branches' skip-emission). engine `#settleCompleted` sets it for a branch outcome.
- execution-host.ts: `ExecutionHost.checkpointer` (a SEPARATE read port from the write `RunStore`) +
`createInMemoryCheckpointer` reconstructing from an `InMemoryRunStore`; wired into `createInMemoryHost`.
- index.ts: export the checkpoint surface. Tests: 11 reconstruction + in-memory-checkpointer cases.
pnpm turbo run lint typecheck test build format:check: green (590 core). Leakwatch: 0.
Refs: ADR-0036, ADR-0003
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Complete the 1.R resume path on top of the Checkpointer read-side: a run
suspended at a gate in a prior process is rehydrated from its reconstructed
CheckpointState and driven to completion behind the one engine loop.
- WorkflowEngine.resumeFromCheckpoint({runId, workflow, gateId, decision}):
the cross-process resume entry. Loads the checkpoint, rehydrates a fresh
RunExecution (seeds per-node states, pending/resolved gates, token+cost
tallies, and the bus sequence so post-resume events stay gap-free), applies
the decision, and returns a RunHandle. No run:started is re-emitted.
- RunExecution: a checkpoint constructor arm (#seedFromCheckpoint), prepareResume
(clock only), kick (drive without re-applying), and #resolvedGates so a
re-delivered decision is an idempotent no-op rather than advancing the run twice.
- Idempotent re-delivery, three arms: an already-terminal checkpoint returns a
closed handle (nothing re-emitted/re-persisted, createClosedRunHandle); an
already-resolved gate on a live run drives remaining work without re-applying;
a still-pending gate applies the decision. The residual concurrent TOCTOU
(two processes loading the same pending gate before either persists) is closed
by a Phase-2 store-level uniqueness constraint, documented in checkpoint.ts.
- Identity guard: the surrogate workflowId reconstructed from run:started must
match the workflow handed to resume, else a typed EngineStateError
'workflow_mismatch'. The stronger same-slug-edited guard rides on the Phase-2
runs.workflow_definition_snapshot column (database-schema.md), not run:started.
- event-bus: seedSequence(key, next) — seed the per-run counter on rehydration,
never lowering an advanced one.
- CheckpointState gains workflowId (from run:started) for the identity guard.
- Tests: 7 resume-from-checkpoint e2e cases (cross-process resume gap-free,
idempotent re-delivery to a terminal run, workflow_mismatch, unknown_run,
already-in-memory, invalid_decision); checkpoint workflowId capture.
- Docs: the canonical Checkpoint-and-resume section in shared-core-engine.md now
describes the derived CheckpointState, the reconstruction trap (started-but-
unfinished node re-runs), what is NOT checkpointed (the resolved ctx, with the
structuredClone transport rule), the two resume entries, and the idempotency
+ identity boundaries — pointing to checkpoint.ts for the exact field set.
Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fill the `paused`/`GateRequest` arm 1.N/1.P reserved: the `human_in_the_loop`
node handler plus the engine-side timeout lifecycle, on top of 1.R.
- node-handlers/human-gate.ts: the gate handler resolves `message_template` /
`assignee` against inputs + run.outputs and returns `{ kind: 'paused', gate }`.
Raw resolution is safe — a `secret` reference in either field is rejected at
parse (secret-taint `node-text` category), mirroring the agent's prompt. It is
thin and clock-free; deadlines are the engine's job. Wired into
createStandardNodeExecutor (the type no longer fails loud).
- Timer port: ExecutionHost.setTimer (one-shot, returns disarm) — injected so core
never names the ambient setTimeout (purity lib). createInMemoryHost ships a manual,
deterministic timer (createManualTimerController) fired by hand in tests
(fireTimers/armedCount); a real surface injects a setTimeout-backed one.
- Engine timeout lifecycle: #settlePaused computes expiresAt from the host clock and
arms the timer; a decision (human or timeout-approve) disarms it; a terminal settle
disarms all. On fire, `approve` auto-resolves the gate as approved
(decidedBy: 'timeout', run continues); `reject` (the safe default) fails the run with
run_timeout (the AwaitingGate→Failed edge) — never routed through resume(), which
would wrongly complete the gate. A human decision that beats the timer disarms it
(single resolution).
- GateRequest gains timeoutAction ('approve'|'reject'); the handler supplies it from
the node's timeout_action (default reject). Re-arming a still-pending gate's timer on
rehydration is deferred to Phase-2 crash-reconciliation (needs timeout_action
persisted on human_gate:paused) — documented in #seedFromCheckpoint.
- Tests: 7 handler unit tests (template resolution, default/explicit timeout_action,
no-timeout, cancel, validation, wrong-node) + 4 engine timeout e2e (approve
auto-resolve, reject→run_timeout, disarm-on-human-decision, no-timer-without-timeout)
+ the dispatcher gate-wiring assertion.
- Docs: execution-model.md §4 made precise on the decision-continues vs the two
timeout outcomes.
Refs: ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the confirmed findings from the adversarially-verified review of the
1.R + 1.Q diff (21/26 survived refutation).
Correctness:
- resume(): mark the gate vertex completed SYNCHRONOUSLY before the durable emit
(mirroring #settleCompleted), closing a multi-gate stall race where a sibling
gate's timeout firing during the persist saw the gate as deleted-but-paused and
mis-read the run as stalled (spurious run:failed{internal}). [HIGH]
- #failGateOnTimeout now adds the gateId to #resolvedGates (symmetry with the
approve/human path) so a late re-delivery of a reject-timed-out gate's decision
is an idempotent no-op, not a run_already_terminal throw.
Clarity / contracts:
- New EngineStateError code `run_already_active` for resumeFromCheckpoint on a run
already in memory (was the contradictory `unknown_run`); unknown_run comment fixed.
- human_gate:paused gains optional `timeoutAction` (reuses TimeoutActionSchema),
populated by the engine — immediate observability + pre-captures the data a
Phase-2 crash-resume needs to re-arm a gate timer (no future backfill).
- human-gate.ts header corrected: distinguishes parse-time taint (inputs/ctx) from
the runtime masking that keeps run.outputs secret-free.
Docs (one canonical home):
- sse-event-schema.md: add NodeSkippedEvent to the RunEvent union + interface,
node:completed.selected, and human_gate:paused.timeoutAction (interfaces + table).
- run-event.test.ts: fixture carries timeoutAction/expiresAt; stale "18" -> "19".
Tests (+13): the multi-gate stall-race regression (two timeout-approve gates settled
in one timer sweep), the kick() path (gate already resolved in a prior process drives
the remaining work without re-applying), reject-timeout re-delivery no-op, skip-before-
fail ordering, expiresAt deadline value, post-terminal timer no-op, no-rearm-on-
rehydration, token/cost tally restoration, and ManualTimerController unit tests.
Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second adversarially-verified review pass (9/20 findings survived; no blockers/ highs — the round-1 fixes held). Fold the confirmed items: - #settlePaused now emits the EFFECTIVE timeoutAction (default `reject`) used for both the armed timer and the persisted human_gate:paused event, so the log always reflects the exact policy the engine acts on — even when a handler set timeoutMs but left timeoutAction implicit (a Phase-2 crash-resume reads it back to re-arm). - shared: add the missing `export type NodeSkippedEvent` (restores the per-variant type-export pattern alongside NodeCompletedEvent/NodeFailedEvent). - resumeFromCheckpoint: a comment marking the single point a future engine guards/ migrates an older checkpoint.schemaVersion (the field's purpose; inert at v1). - docs: execution-model.md paragraph break before the cross-reference sentence. Tests (+3, strengthened 2): - a human `rejected` decision completes the gate and CONTINUES the run (the documented "rejection is not a failure" path), the decision reaching run.outputs. - an armed gate timer is disarmed by #settle when the run terminates for an unrelated reason (cancel) — the disarm-by-settle path (vs disarm-by-resume). - the kick-path test now also asserts gap-free sequence continuation, and the no-rearm-on-rehydration test spies on setTimer to prove it is NEVER called (distinguishing "never armed" from "armed then disarmed"). Deferred (documented, not a code change): docs/roadmap/current.md still names 1.Q as the next workstream — the roadmap status page is updated in the post-merge commit (project pattern; ADR/roadmap "done after merge" rule), not pre-merge. Refs: ADR-0003, ADR-0036 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewer's GuideImplements event-derived checkpoint/resume for the workflow engine and adds human-gate suspend/resume with one-shot timeouts, wiring them through the core engine, execution host, node handlers, and shared run-event contracts with comprehensive tests and documentation updates. Sequence diagram for resumeFromCheckpoint cross-process gate resumesequenceDiagram
participant Caller
participant WorkflowEngine
participant ExecutionHost
participant Checkpointer
participant RunExecution
participant RunEventBus
Caller->>WorkflowEngine: resumeFromCheckpoint(input)
WorkflowEngine->>WorkflowEngine: GateDecisionSchema.safeParse(input.decision)
WorkflowEngine->>ExecutionHost: access checkpointer
ExecutionHost->>Checkpointer: load(input.runId)
Checkpointer-->>ExecutionHost: CheckpointState | undefined
ExecutionHost-->>WorkflowEngine: CheckpointState | undefined
alt no checkpoint
WorkflowEngine-->>Caller: throw EngineStateError unknown_run
else checkpoint exists
WorkflowEngine->>ExecutionHost: store.resolveWorkflowId(input.workflow.workflow.id)
ExecutionHost-->>WorkflowEngine: workflowId
alt workflowId mismatch
WorkflowEngine-->>Caller: throw EngineStateError workflow_mismatch
else workflowId matches
alt checkpoint.runStatus in TERMINAL_RUN_STATUSES
WorkflowEngine->>WorkflowEngine: createClosedRunHandle(input.runId)
WorkflowEngine-->>Caller: RunHandle(events completes immediately)
else non-terminal checkpoint
WorkflowEngine->>RunEventBus: new RunEventBus
WorkflowEngine->>RunExecution: new RunExecution({checkpoint,...})
RunExecution->>RunExecution: #seedFromCheckpoint(plan, checkpoint, bus, runId)
RunExecution->>RunEventBus: seedSequence(runId, checkpoint.lastSequenceNumber + 1)
RunExecution->>RunExecution: prepareResume()
WorkflowEngine->>WorkflowEngine: #runs.set(runId, execution)
alt input.gateId in checkpoint.resolvedGateIds
WorkflowEngine->>RunExecution: kick()
else gate still pending
WorkflowEngine->>RunExecution: resume(input.gateId, decision)
end
WorkflowEngine-->>Caller: RunHandle(events from resumed run)
end
end
end
Sequence diagram for human_gate pause, timeout, and resume with one-shot timersequenceDiagram
participant RunExecution
participant ExecutionHost
participant ManualTimerController as ManualTimer
%% Gate pause path
RunExecution->>RunExecution: #settlePaused(vertex, gate)
RunExecution->>RunExecution: #states.set(vertex.id, {status paused})
RunExecution->>RunExecution: #pendingGates.set(gateId, {vertexId})
RunExecution->>RunExecution: compute effectiveAction, expiresAt
alt gate.timeoutMs defined
RunExecution->>ExecutionHost: setTimer(gate.timeoutMs, onGateTimeout)
ExecutionHost-->>RunExecution: disarm()
RunExecution->>RunExecution: #gateTimers.set(gateId, disarm)
end
RunExecution->>RunExecution: #emitDurable(human_gate:paused)
%% Human decision arrives before timeout
RunExecution->>RunExecution: resume(gateId, decision)
RunExecution->>RunExecution: check #resolvedGates.has(gateId)
RunExecution->>RunExecution: #resolvedGates.add(gateId)
RunExecution->>RunExecution: #pendingGates.delete(gateId)
RunExecution->>RunExecution: #disarmTimer(gateId)
RunExecution->>RunExecution: update vertex.state to completed
RunExecution->>RunExecution: #emitDurable(human_gate:resumed)
RunExecution->>RunExecution: #schedule()
%% Timer fires first
ManualTimer->>RunExecution: #onGateTimeout(gateId, vertexId, action)
RunExecution->>RunExecution: #disarmTimer(gateId)
alt action == approve
RunExecution->>RunExecution: resume(gateId, {decision approved, decidedBy timeout})
else action == reject
RunExecution->>RunExecution: #failGateOnTimeout(gateId, vertexId)
RunExecution->>RunExecution: #pendingGates.delete(gateId)
RunExecution->>RunExecution: #resolvedGates.add(gateId)
RunExecution->>RunExecution: #settleFailed(vertex, run_timeout)
RunExecution->>RunExecution: #schedule()
end
%% Terminal settle disarms any remaining timers
RunExecution->>RunExecution: #settle(type)
RunExecution->>RunExecution: for gateId in #gateTimers.keys()
RunExecution->>RunExecution: #disarmTimer(gateId)
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (7)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughThis PR adds a ChangesHuman Gate, Checkpoint Resume & node:skipped
Sequence Diagram(s)sequenceDiagram
rect rgba(70, 130, 180, 0.5)
note over Caller, RunEventBus: Cross-process gate resumption via checkpoint
end
participant Caller
participant Engine as WorkflowEngine
participant Checkpointer
participant Store as RunStore
participant Exec as RunExecution
participant Bus as RunEventBus
Caller->>Engine: resumeFromCheckpoint({runId, workflow, gateId, decision})
Engine->>Checkpointer: load(runId)
Checkpointer->>Store: getEvents(runId)
Store-->>Checkpointer: RunEvent[]
Checkpointer->>Checkpointer: reconstructCheckpointState(events)
Checkpointer-->>Engine: CheckpointState
Engine->>Engine: enforce workflow identity guard
alt run is terminal
Engine-->>Caller: createClosedRunHandle(runId)
else run paused at gate
Engine->>Exec: new RunExecution(checkpoint)
Exec->>Exec: `#seedFromCheckpoint` (vertices, gates, tallies)
Exec->>Bus: seedSequence(lastSequenceNumber + 1)
alt gateId already resolved
Engine->>Exec: kick()
else gateId pending
Engine->>Exec: resume(gateId, decision)
Exec->>Exec: mark gate completed, disarm timer
Exec->>Store: persist human_gate:resumed
Exec->>Exec: schedule next step
end
Engine-->>Caller: RunHandle (active)
end
sequenceDiagram
rect rgba(200, 150, 50, 0.5)
note over RunExecution, RunStore: Gate timeout lifecycle with one-shot timer
end
participant RunExecution
participant SetTimer as setTimer/ManualTimerController
participant RunStore
participant Scheduler
RunExecution->>RunExecution: `#settlePaused` (gate node)
RunExecution->>RunExecution: compute expiresAt from clock.now() + timeoutMs
RunExecution->>SetTimer: setTimer(timeoutMs, onFire)
SetTimer-->>RunExecution: disarm callback → store in `#gateTimers`
RunExecution->>RunStore: persist human_gate:paused {timeoutAction, expiresAt}
par Human decision arrives first
RunExecution->>RunExecution: resume(gateId, decision)
RunExecution->>RunExecution: `#disarmTimer`(gateId)
and Timer fires
SetTimer->>RunExecution: `#onGateTimeout`(gateId, timeoutAction)
alt timeoutAction = approve
RunExecution->>RunExecution: resolve gate, mark approved
RunExecution->>Scheduler: schedule next step
else timeoutAction = reject
RunExecution->>RunExecution: `#failGateOnTimeout`
RunExecution->>RunStore: persist run:failed (run_timeout)
end
end
RunExecution->>RunExecution: on terminal: disarm/clear all `#gateTimers`
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
resumeFromCheckpoint, the ability to pass newinputs/executionMode/planOptionsfor an already-started run could diverge the rehydrated execution from the originalrun:startedstate; consider either ignoring these in favor of checkpointed values (once available) or enforcing/clarifying the intended invariants so callers can’t accidentally change execution characteristics on resume. - The skip-propagation reason in
#skipReasonis determined by the first condition dependency encountered; if a node has multiple upstream conditions and mixed reasons, consider making the selection rule explicit (e.g. preferbranch_not_takenonly when all relevant deps are conditions) or documenting this precedence to avoid surprisingnode:skipped.reasonvalues.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `resumeFromCheckpoint`, the ability to pass new `inputs`/`executionMode`/`planOptions` for an already-started run could diverge the rehydrated execution from the original `run:started` state; consider either ignoring these in favor of checkpointed values (once available) or enforcing/clarifying the intended invariants so callers can’t accidentally change execution characteristics on resume.
- The skip-propagation reason in `#skipReason` is determined by the first condition dependency encountered; if a node has multiple upstream conditions and mixed reasons, consider making the selection rule explicit (e.g. prefer `branch_not_taken` only when all relevant deps are conditions) or documenting this precedence to avoid surprising `node:skipped.reason` values.
## Individual Comments
### Comment 1
<location path="packages/core/src/engine/engine.ts" line_range="304-305" />
<code_context>
+
+ /** Prepare a checkpoint-seeded run to resume — set the lifecycle clock. State was seeded in the
+ * constructor; NO `run:started` is re-emitted (it is already in the persisted log). */
+ prepareResume(): void {
+ this.#startEpochMs = Date.parse(this.#host.clock.now());
+ }
+
</code_context>
<issue_to_address>
**question (bug_risk):** Resumed runs lose pre-crash wall-clock duration in `run:completed.durationMs`.
In `prepareResume` you set `#startEpochMs` to `clock.now()`, so `durationMs` for terminal events only covers time after resume. If `durationMs` is expected to represent total wall-clock run time, this will under-report resumed runs. Consider deriving `#startEpochMs` from the original `run:started.timestamp` stored in the checkpoint, or, if the new behavior is intentional, verify that downstream consumers don’t assume `durationMs` is total duration across resumes.
</issue_to_address>
### Comment 2
<location path="packages/core/src/engine/engine.ts" line_range="1096-1097" />
<code_context>
+ executor: this.#executor,
+ bus,
+ capacity: this.#capacity,
+ onSettled: () => {
+ /* retained like a started run (see start) */
+ },
+ checkpoint,
</code_context>
<issue_to_address>
**issue (bug_risk):** `resumeFromCheckpoint` executions are never evicted from `#runs`, which can leak memory.
`start` wires `onSettled` to remove executions from `#runs`, but `resumeFromCheckpoint` uses a no-op instead. This leaves resumed runs in the map indefinitely in long‑lived processes. Unless there’s a specific need to retain them, consider reusing the same `onSettled` handler as `start` so resumed runs are also removed from `#runs` on completion.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| onSettled: () => { | ||
| /* retained like a started run (see start) */ |
There was a problem hiding this comment.
issue (bug_risk): resumeFromCheckpoint executions are never evicted from #runs, which can leak memory.
start wires onSettled to remove executions from #runs, but resumeFromCheckpoint uses a no-op instead. This leaves resumed runs in the map indefinitely in long‑lived processes. Unless there’s a specific need to retain them, consider reusing the same onSettled handler as start so resumed runs are also removed from #runs on completion.
There was a problem hiding this comment.
Code Review
This pull request implements the Checkpoint/Resume (1.R) and Human Gate Timeout (1.Q) execution model features, allowing runs to resume from a prior process using event-log reconstruction and introducing one-shot timers for gate timeouts. It also adds a new node:skipped event to ensure a complete, replayable event log. The review feedback highlights three key areas for improvement: a potential memory state leak in resumeFromCheckpoint if the resume execution throws an error, a map mutation safety issue during event iteration in reconstructCheckpointState, and a bug in the human gate handler where an abort signal during template resolution could be incorrectly caught and reported as a fatal validation failure instead of a clean cancellation.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| this.#runs.set(input.runId, execution); | ||
| if (checkpoint.resolvedGateIds.includes(input.gateId)) { | ||
| // The gate was already resolved in the prior process (double-delivery); do not re-apply the | ||
| // decision — just drive any unfinished downstream work (or re-pause on a remaining gate). | ||
| execution.kick(); | ||
| } else { | ||
| // Apply the decision + drive the loop (events buffer on the handle for the returned consumer). | ||
| await execution.resume(input.gateId, parsed.data); | ||
| } | ||
| return execution.handle; |
There was a problem hiding this comment.
If execution.resume throws an error (e.g., due to an unknown_gate or invalid_decision error), the run remains registered in this.#runs. Because of this, any subsequent attempt to call resumeFromCheckpoint for this run will fail with a run_already_active error, leaving the run permanently stuck in memory in a broken state. Wrapping the execution in a try...catch block to clean up this.#runs on failure prevents this state leak.
| this.#runs.set(input.runId, execution); | |
| if (checkpoint.resolvedGateIds.includes(input.gateId)) { | |
| // The gate was already resolved in the prior process (double-delivery); do not re-apply the | |
| // decision — just drive any unfinished downstream work (or re-pause on a remaining gate). | |
| execution.kick(); | |
| } else { | |
| // Apply the decision + drive the loop (events buffer on the handle for the returned consumer). | |
| await execution.resume(input.gateId, parsed.data); | |
| } | |
| return execution.handle; | |
| this.#runs.set(input.runId, execution); | |
| try { | |
| if (checkpoint.resolvedGateIds.includes(input.gateId)) { | |
| // The gate was already resolved in the prior process (double-delivery); do not re-apply the | |
| // decision — just drive any unfinished downstream work (or re-pause on a remaining gate). | |
| execution.kick(); | |
| } else { | |
| // Apply the decision + drive the loop (events buffer on the handle for the returned consumer). | |
| await execution.resume(input.gateId, parsed.data); | |
| } | |
| } catch (error) { | |
| this.#runs.delete(input.runId); | |
| throw error; | |
| } | |
| return execution.handle; |
| for (const [gateId, nodeId] of pendingGates) { | ||
| if (nodeId === event.nodeId) { | ||
| pendingGates.delete(gateId); | ||
| resolvedGateIds.add(gateId); | ||
| } | ||
| } |
There was a problem hiding this comment.
Mutating a Map (pendingGates) while iterating over it using a for...of loop can lead to subtle bugs or maintainability issues if the iteration logic is refactored. It is safer and more declarative to collect the keys to resolve first, and then perform the deletions.
| for (const [gateId, nodeId] of pendingGates) { | |
| if (nodeId === event.nodeId) { | |
| pendingGates.delete(gateId); | |
| resolvedGateIds.add(gateId); | |
| } | |
| } | |
| const toResolve: string[] = []; | |
| for (const [gateId, nodeId] of pendingGates) { | |
| if (nodeId === event.nodeId) { | |
| toResolve.push(gateId); | |
| } | |
| } | |
| for (const gateId of toResolve) { | |
| pendingGates.delete(gateId); | |
| resolvedGateIds.add(gateId); | |
| } |
| } catch (err) { | ||
| // An interpolation failure is an authoring/data fault, not a transient one — fatal `validation`, | ||
| // matching the agent handler's prompt-resolution failure mapping (agent-runner.ts). | ||
| return failed( | ||
| 'validation', | ||
| err instanceof Error ? err.message : 'gate template interpolation failed', | ||
| false, | ||
| ); | ||
| } |
There was a problem hiding this comment.
If ctx.signal is aborted while resolveTemplate is executing (e.g., during a slow file read or network call), resolveTemplate will throw an error. The catch block will intercept this and return a fatal validation failure instead of a clean cancelled() outcome. Checking ctx.signal.aborted inside the catch block ensures cancellation is handled correctly.
} catch (err) {
if (ctx.signal.aborted) {
return cancelled();
}
// An interpolation failure is an authoring/data fault, not a transient one — fatal `validation`,
// matching the agent handler's prompt-resolution failure mapping (agent-runner.ts).
return failed(
'validation',
err instanceof Error ? err.message : 'gate template interpolation failed',
false,
);
}There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/shared/src/run-event.ts (1)
235-249:⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce
timeoutActiononly when a timeout exists.
HumanGatePausedEventSchemacurrently acceptstimeoutActionwithouttimeoutMs, which creates an invalid persisted state for pause/timeout resume semantics.Suggested schema guard
-export const HumanGatePausedEventSchema = z.object({ +export const HumanGatePausedEventSchema = z.object({ type: z.literal('human_gate:paused'), ...runBase, nodeId: nonEmptyString, gateId: nonEmptyString, gateType: GateTypeSchema, message: z.string(), assignee: z.string().optional(), timeoutMs: nonNegativeInt.optional(), timeoutAction: TimeoutActionSchema.optional(), expiresAt: z.string().datetime({ offset: true }).optional(), -}); +}).superRefine((event, ctx) => { + if (event.timeoutAction !== undefined && event.timeoutMs === undefined) { + ctx.addIssue({ + code: z.ZodIssueCode.custom, + path: ['timeoutAction'], + message: 'timeoutAction requires timeoutMs', + }); + } +});🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/shared/src/run-event.ts` around lines 235 - 249, The HumanGatePausedEventSchema currently allows timeoutAction to be present without timeoutMs, which violates the intended pause/timeout resume semantics where timeoutAction should only exist when a timeout is configured. Add a conditional validation constraint to the HumanGatePausedEventSchema object using Zod's refine or superRefine method to ensure that if timeoutAction is provided, timeoutMs must also be present, preventing invalid persisted states.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/core/src/engine/checkpoint.ts`:
- Around line 80-181: The `reconstructCheckpointState` function exceeds the
cognitive complexity threshold (19 > 15) due to its large switch statement
handling multiple event types. Extract the event-application logic into separate
helper functions organized by category: one for handling run events
(run:started, run:paused, run:completed, run:failed, run:cancelled), one for
node events (node:completed, node:failed, node:skipped), one for gate events
(human_gate:paused, human_gate:resumed), and one for accounting (cost:updated).
Call these helpers from the main loop instead of inline switch cases, preserving
all current behavior and state mutations.
In `@packages/core/src/engine/engine.ts`:
- Around line 302-306: The prepareResume() method currently reinitializes the
`#startEpochMs` field to the current host time, which causes the run duration to
be reset on resume rather than continuing from the original start. Modify the
checkpoint serialization to persist the original `#startEpochMs` value (or the
accumulated elapsed duration) when creating a checkpoint, and update the
prepareResume() method to restore that persisted value instead of calling
Date.parse(this.#host.clock.now()). This ensures that run:completed.durationMs
accurately reflects the total elapsed time across the entire run including both
the pre-resume and post-resume segments.
- Around line 1102-1110: The execution is registered in this.#runs at line 1102
before validation occurs in execution.resume() at line 1109. If the resume call
throws a validation error, the half-initialized execution remains in `#runs`,
causing subsequent retries to fail with run_already_active instead of the
original error. Move the this.#runs.set(input.runId, execution) registration to
after both the execution.kick() path (for already-resolved gates) and the
execution.resume() path have completed successfully, or alternatively wrap the
entire resume/kick logic in a try-catch that deletes the execution from `#runs`
before rethrowing any validation errors.
In `@packages/core/src/engine/node-handlers/human-gate.ts`:
- Around line 60-67: In the catch block handling errors from resolveTemplate in
the human-gate.ts file, add logic to distinguish abort errors from other
interpolation failures. First, import the InterpolationError class at the top of
the file. Then, in the catch block (around lines 60-67), add a conditional check
before the existing failed call: if the caught error is an instance of
InterpolationError and its code property equals 'aborted', return cancelled() to
properly indicate the abort status; otherwise, proceed with the existing
failed('validation') logic for other interpolation errors. Additionally, add a
test case that verifies abort signals during template resolution are correctly
handled by returning cancelled() instead of failed().
In `@packages/shared/src/run-event.ts`:
- Around line 205-208: The `selected` field in the `node:completed` schema
currently allows empty arrays via `z.array(nonEmptyString).optional()`, which
creates an ambiguous branch outcome state that should not be permitted. Modify
the schema validation for the `selected` field to ensure that when the array is
present, it must contain at least one element. Use the `.min(1)` method on the
array validation to enforce that empty arrays are rejected, or alternatively add
a `.refine()` constraint that validates the array is non-empty when it is
defined.
---
Outside diff comments:
In `@packages/shared/src/run-event.ts`:
- Around line 235-249: The HumanGatePausedEventSchema currently allows
timeoutAction to be present without timeoutMs, which violates the intended
pause/timeout resume semantics where timeoutAction should only exist when a
timeout is configured. Add a conditional validation constraint to the
HumanGatePausedEventSchema object using Zod's refine or superRefine method to
ensure that if timeoutAction is provided, timeoutMs must also be present,
preventing invalid persisted states.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ec0189e8-fe90-45ed-a3a2-cf9962a20d5a
📒 Files selected for processing (22)
docs/architecture/execution-model.mddocs/architecture/shared-core-engine.mddocs/reference/contracts/sse-event-schema.mdpackages/core/src/engine/checkpoint.test.tspackages/core/src/engine/checkpoint.tspackages/core/src/engine/engine.test.tspackages/core/src/engine/engine.tspackages/core/src/engine/errors.tspackages/core/src/engine/event-bus.tspackages/core/src/engine/execution-host.test.tspackages/core/src/engine/execution-host.tspackages/core/src/engine/node-executor.tspackages/core/src/engine/node-handlers/dispatcher.tspackages/core/src/engine/node-handlers/human-gate.test.tspackages/core/src/engine/node-handlers/human-gate.tspackages/core/src/engine/node-handlers/node-handlers.e2e.test.tspackages/core/src/engine/node-handlers/node-handlers.test.tspackages/core/src/engine/run-handle.tspackages/core/src/index.tspackages/shared/src/constants.tspackages/shared/src/run-event.test.tspackages/shared/src/run-event.ts
| // The immediate downstream ids a `condition` kept live (its branch selection). Present ONLY for a | ||
| // condition's branch outcome — it is the authoritative record checkpoint/resume (1.R) reconstructs | ||
| // `selectedTargets` from, so a selected branch that was mid-flight at a crash re-runs (not skipped). | ||
| selected: z.array(nonEmptyString).optional(), |
There was a problem hiding this comment.
Reject empty selected arrays in node:completed.
The branch-selection field should be non-empty when present; allowing selected: [] admits an impossible/ambiguous branch outcome.
Suggested tightening
- selected: z.array(nonEmptyString).optional(),
+ selected: z.array(nonEmptyString).min(1).optional(),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // The immediate downstream ids a `condition` kept live (its branch selection). Present ONLY for a | |
| // condition's branch outcome — it is the authoritative record checkpoint/resume (1.R) reconstructs | |
| // `selectedTargets` from, so a selected branch that was mid-flight at a crash re-runs (not skipped). | |
| selected: z.array(nonEmptyString).optional(), | |
| // The immediate downstream ids a `condition` kept live (its branch selection). Present ONLY for a | |
| // condition's branch outcome — it is the authoritative record checkpoint/resume (1.R) reconstructs | |
| // `selectedTargets` from, so a selected branch that was mid-flight at a crash re-runs (not skipped). | |
| selected: z.array(nonEmptyString).min(1).optional(), |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/shared/src/run-event.ts` around lines 205 - 208, The `selected`
field in the `node:completed` schema currently allows empty arrays via
`z.array(nonEmptyString).optional()`, which creates an ambiguous branch outcome
state that should not be permitted. Modify the schema validation for the
`selected` field to ensure that when the array is present, it must contain at
least one element. Use the `.min(1)` method on the array validation to enforce
that empty arrays are rejected, or alternatively add a `.refine()` constraint
that validates the array is non-empty when it is defined.
Verified each finding against current code; fixed the still-valid ones, reverted
one that contradicts engine semantics, skipped two with reasons.
Fixed:
- checkpoint.ts: reduce reconstructCheckpointState cognitive complexity (19→under
15) by extracting per-category appliers (applyRunEvent / applyNodeEvent /
applyGateEvent) over a shared accumulator — behavior identical. The gate-resolve
arm now collects gate ids first, then deletes (no mutate-while-iterating the Map).
- Resumed-run durationMs: the checkpoint now carries the original start epoch
(`startedAtMs`, from run:started.timestamp); a rehydrated run measures durationMs
from it (seeded in #seedFromCheckpoint), so a terminal reports total wall-clock
across pre-/post-resume — not just the post-resume segment. prepareResume removed.
- resumeFromCheckpoint: wrap resume()/kick() in try/catch that deletes the run from
#runs on a validation throw (unknown_gate / run_not_paused), so a retry isn't
wrongly rejected with run_already_active and no broken run is stranded in memory.
- human-gate.ts: an abort DURING template resolution now returns cancelled() (a
deliberate fatal reason) rather than failed('validation') — checked via
ctx.signal.aborted in the catch; +unit test.
- run-event.ts: HumanGatePausedEvent — timeoutAction is now refused without
timeoutMs (union-level superRefine; a discriminatedUnion member can't self-refine).
- engine.ts #settle: disarm gate timers via values()+clear() (no array spread);
document #skipReason precedence (branch_not_taken wins over upstream_unreachable);
document the ResumeFromCheckpointInput invariant (caller passes the original
inputs/executionMode until the checkpoint persists them).
- execution-host.ts fireTimers: snapshot the armed set as a named array (keeps the
required snapshot — a fired callback may arm/disarm timers — without the inline
spread Sonar flags).
Reverted / skipped (with reason):
- selected .min(1) (REVERTED): an empty `selected` is a VALID outcome — a condition
that routes to no branch, which the engine skip-propagates downstream
(engine.ts #hasLiveEdge); .min(1) would reject that legitimate node:completed.
- onSettled "#runs leak" on resume (SKIP): start() also retains settled runs via a
no-op onSettled by design (for run_already_terminal reporting; TTL prune is future
scope) — resumeFromCheckpoint is consistent, not divergent.
- execution-host for-of "unnecessary array" (addressed, not removed): the snapshot
is load-bearing; restructured to a named array rather than dropped.
Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@coderabbitai review all |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/architecture/shared-core-engine.md (1)
167-179:⚠️ Potential issue | 🔴 CriticalRemove
step_executionsfrom the checkpoint reconstruction description.Lines 168–169 incorrectly list
step_executionsrows as an input to checkpoint reconstruction. The code is definitive:reconstructCheckpointStatetakes onlyrun_events(seepackages/core/src/engine/checkpoint.ts:176andexecution-host.ts:308). The header comment in checkpoint.ts explicitly states "There is no checkpoint table — theCheckpointStateis derived from the orderedrun_events." Rewrite lines 168–169 to namerun_eventsalone as the source, and keep lines 172–173's "pure fold" statement.step_executionsis separate run-history metadata, not part of checkpoint reconstruction.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/architecture/shared-core-engine.md` around lines 167 - 179, The checkpoint reconstruction description incorrectly includes step_executions rows as input data when in fact only run_events should be listed as the source. Remove the references to step_executions (status, attempt_number, output_json, error_json) from lines 168–169 and rewrite that sentence to state that the Checkpointer reconstructs CheckpointState solely from the ordered, replayable run_events log. Keep the description of the messages field and preserve the subsequent explanation of reconstructCheckpointState as a pure fold operation over the ordered event stream that derives CheckpointState fields. This aligns the documentation with the actual code implementation where reconstructCheckpointState takes only the events parameter, not step_executions data.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/contracts/sse-event-schema.md`:
- Line 76: The `selected?` field documentation for the `node:completed` event at
line 76 in docs/reference/contracts/sse-event-schema.md currently implies it
always contains at least one target id, but a condition can route to no branch
making it an empty array. Update the description of the `selected?` field to
explicitly clarify that it can be an empty array. Apply the same clarification
to the other affected location at lines 122-123 in the same file where
`selected` is documented.
In `@packages/core/src/engine/checkpoint.test.ts`:
- Around line 103-127: The test for the resumed gate scenario in the `'a resumed
gate clears the pending gate + records the decision as the node output'` test
case is missing an assertion to verify that the gate id is correctly moved to
`resolvedGateIds` after the gate is resumed. Add an expect statement after the
existing assertions to verify that state?.resolvedGateIds includes the gate id
'g1' that was paused and then resumed, ensuring the gate tracking is correct for
idempotent re-delivery detection.
In `@packages/core/src/engine/node-handlers/human-gate.test.ts`:
- Line 1: The file human-gate.test.ts has code formatting violations detected by
Prettier. Run prettier --write on this file to automatically fix all formatting
issues according to the project's Prettier configuration.
In `@packages/core/src/index.ts`:
- Around line 93-116: The export statements in packages/core/src/index.ts are
not properly formatted according to prettier standards and are causing the CI
prettier --check to fail. Run prettier formatting on all export blocks in the
file at lines 93-116 (the StartInput, ResumeFromCheckpointInput, and related
engine exports), lines 124-126 (additional exports), and lines 169-170
(checkpoint-related exports) to ensure they all comply with the project's
formatting standards before merging.
---
Outside diff comments:
In `@docs/architecture/shared-core-engine.md`:
- Around line 167-179: The checkpoint reconstruction description incorrectly
includes step_executions rows as input data when in fact only run_events should
be listed as the source. Remove the references to step_executions (status,
attempt_number, output_json, error_json) from lines 168–169 and rewrite that
sentence to state that the Checkpointer reconstructs CheckpointState solely from
the ordered, replayable run_events log. Keep the description of the messages
field and preserve the subsequent explanation of reconstructCheckpointState as a
pure fold operation over the ordered event stream that derives CheckpointState
fields. This aligns the documentation with the actual code implementation where
reconstructCheckpointState takes only the events parameter, not step_executions
data.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cf6bba9f-c688-44de-9840-91d2379b3df0
📒 Files selected for processing (22)
docs/architecture/execution-model.mddocs/architecture/shared-core-engine.mddocs/reference/contracts/sse-event-schema.mdpackages/core/src/engine/checkpoint.test.tspackages/core/src/engine/checkpoint.tspackages/core/src/engine/engine.test.tspackages/core/src/engine/engine.tspackages/core/src/engine/errors.tspackages/core/src/engine/event-bus.tspackages/core/src/engine/execution-host.test.tspackages/core/src/engine/execution-host.tspackages/core/src/engine/node-executor.tspackages/core/src/engine/node-handlers/dispatcher.tspackages/core/src/engine/node-handlers/human-gate.test.tspackages/core/src/engine/node-handlers/human-gate.tspackages/core/src/engine/node-handlers/node-handlers.e2e.test.tspackages/core/src/engine/node-handlers/node-handlers.test.tspackages/core/src/engine/run-handle.tspackages/core/src/index.tspackages/shared/src/constants.tspackages/shared/src/run-event.test.tspackages/shared/src/run-event.ts
| export type { | ||
| StartInput, | ||
| ResumeFromCheckpointInput, | ||
| WorkflowEngineDeps, | ||
| } from './engine/engine.js'; | ||
| export { RunEventBus } from './engine/event-bus.js'; | ||
| export type { RunEventBusOptions, RunEventListener, RunEventDraft } from './engine/event-bus.js'; | ||
| export type { RunHandle } from './engine/run-handle.js'; | ||
| export { | ||
| InMemoryRunStore, | ||
| createInMemoryHost, | ||
| createInMemoryCheckpointer, | ||
| createAbortController, | ||
| createManualTimerController, | ||
| } from './engine/execution-host.js'; | ||
| // Checkpointer + resume (1.R) — reconstruct a run's state from its persisted event stream (no checkpoint | ||
| // table; ADR-0003). The in-memory reference ships here; the SQLite/cloud one is Phase-2/CLI. | ||
| export { reconstructCheckpointState, CHECKPOINT_SCHEMA_VERSION } from './engine/checkpoint.js'; | ||
| export type { | ||
| Checkpointer, | ||
| CheckpointState, | ||
| CheckpointNodeState, | ||
| CheckpointPendingGate, | ||
| } from './engine/checkpoint.js'; |
There was a problem hiding this comment.
Reformat the updated export surface.
CI is already failing prettier --check here, so the new export blocks need to be formatted before merge.
Also applies to: 124-126, 169-170
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/core/src/index.ts` around lines 93 - 116, The export statements in
packages/core/src/index.ts are not properly formatted according to prettier
standards and are causing the CI prettier --check to fail. Run prettier
formatting on all export blocks in the file at lines 93-116 (the StartInput,
ResumeFromCheckpointInput, and related engine exports), lines 124-126
(additional exports), and lines 169-170 (checkpoint-related exports) to ensure
they all comply with the project's formatting standards before merging.
Source: Pipeline failures
…#22 review) - Run Prettier on the four files the CI format:check flagged (engine.ts, engine.test.ts, index.ts, human-gate.test.ts) — formatting only, no logic change. - sse-event-schema.md: clarify node:completed.selected MAY be an empty array (a condition routing to no branch), matching the reverted .min(1) and the engine's skip-propagation — both the event table and the interface block. - shared-core-engine.md: align the checkpoint-reconstruction description with the implementation — CheckpointState is folded from the ordered run_events log alone (each node's output/error rides node:completed/node:failed); step_executions / messages are denormalized persistence for the run-trace UI, not inputs the fold requires (reconstructCheckpointState takes only events). - checkpoint.test.ts: assert the resumed gate id moves into resolvedGateIds (the idempotent re-delivery guard), not just that pendingGates clears. Refs: ADR-0003, ADR-0036 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Post-merge roadmap + status update now that checkpoint/resume (1.R) and the human gate (1.Q) have merged. - phase-1-engine-and-llm.md: ✅ Done markers on §1.Q and §1.R; top status block records the PR #22 landing and the remaining 1.m4 lane (1.S, 1.AC). - current.md: status narrative + next-workstream pointer advanced to node retry (1.S); last-updated 2026-06-15. - CLAUDE.md: status paragraph + detailed status reflect 1.R/1.Q landed, 1.S next. - deferred-tasks.md: re-point the now-landed-context items — the structuredClone `ctx`-transport obligation moves off 1.R (the checkpoint carries no resolved ctx) to the ctx-threading work; mid-tool-loop resume noted as Phase-2 (1.R resumes at gate boundaries only); the ctx-threading fold-into-1.Q/1.R window noted closed (now its own task). New "Checkpoint/resume + human gate (1.R/1.Q) follow-ups" section captures the three confirmed Phase-2 deferrals (gate-timer re-arm on rehydration, content-hash workflow-snapshot identity guard, cross-process gate-resolve TOCTOU → store-level uniqueness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>



Lands the two 1.m4 critical-path workstreams toward M2, both in
@relavium/core(engine-only; zero platform imports). The whole diff is green onpnpm turbo run lint typecheck test build(622 core tests) and Leakwatch-clean, and was put through two adversarially-verified multi-agent review passes (round 1: 21 findings fixed incl. a real HIGH concurrency bug; round 2: 9 findings fixed, no blockers/highs).1.R —
Checkpointer+ resume (critical path)reconstructCheckpointState(events)is a pure, total fold over the persistedrun_events— run status, surrogateworkflowId, per-node settled/paused states (acondition's branch fromnode:completed.selected, dimmed branches from the newnode:skipped), pending + already-resolved gate ids, lastsequenceNumber, token/cost tallies. The exact shape lives only incheckpoint.ts.WorkflowEngine.resumeFromCheckpoint({runId, workflow, gateId, decision})— rehydrates a freshRunExecutionfrom the reconstructed state (seeds node states / pending gates / tallies / thesequenceNumberso post-resume events stay gap-free; norun:startedre-emit) and returns aRunHandle.pending→ re-run (bounded by therunId+nodeId+retryCountidempotency key).workflowIdmust match the workflow handed to resume, else a typedworkflow_mismatch. The stronger same-slug-edited guard rides on the Phase-2runs.workflow_definition_snapshotcolumn (its canonical home).1.Q — Human-gate suspend/resume + timeout
human_in_the_loophandler (node-handlers/human-gate.ts): resolvesmessage_template/assigneeand returns{ kind: 'paused', gate }; wired intocreateStandardNodeExecutor(the type no longer fails loud). Secrets are parse-gated (inputs/ctx) + runtime-masked (run.outputs).ExecutionHost.setTimer(injected — core never names the ambientsetTimeout);createManualTimerControlleris the deterministic test timer.approveauto-resolves the gate as approved (decidedBy: 'timeout', run continues);reject(the safe default) fails the run withrun_timeout(theAwaitingGate → Failededge) — never routed throughresume(). A human decision (incl.rejected) continues the run; a human decision that beats the timer disarms it.human_gate:pausedcarriestimeoutAction(the effective policy) so a surface can show how a gate auto-resolves and a Phase-2 crash-resume can re-arm from the log.Contracts & docs (one canonical home)
@relavium/shared:node:skipped(+NodeSkippedReason),node:completed.selected,human_gate:paused.timeoutAction— schemas,RUN_EVENT_TYPES, per-variant type exports, andsse-event-schema.mdall updated.execution-model.md§4 (decision-continues vs the two timeout outcomes) andshared-core-engine.md(the derivedCheckpointState, the reconstruction trap, the two resume entries, the idempotency + identity boundaries) updated.Review trail
resume()mutated the gate vertex state after the durable await, so a sibling gate's timeout firing mid-persist could mis-read the run as stalled → spuriousrun:failed{internal}. Fixed (synchronous pre-emit mutation, mirroring#settleCompleted) with a deterministic multi-gate regression test.Deferred (documented, intentional)
docs/roadmap/current.md"next workstream" pointer + marking 1.R/1.Q Done happen in the post-merge roadmap commit (project pattern; "done after merge" rule).human_gate:paused; no backfill).runs.workflow_definition_snapshot.Refs: ADR-0003, ADR-0036
🤖 Generated with Claude Code
Summary by Sourcery
Add event-derived checkpoint/replay support with a cross-process resume API and implement human gate suspend/resume with one-shot timeouts in the core workflow engine.
New Features:
Enhancements:
Documentation:
Tests:
Summary by CodeRabbit
resumeFromCheckpoint, including cross-process continuation and deterministic checkpoint reconstruction exports.human_in_the_loopnode support plus human-gate timeout policies (approve/reject), with resumed gate completion.node:skippedevents for conditional branches not taken, including skip reasons.