
feat(core): checkpoint/resume (1.R) + human-gate suspend/resume & timeout (1.Q) #22

Merged
cemililik merged 8 commits into main from development
Jun 14, 2026

Conversation

@cemililik

@cemililik cemililik commented Jun 14, 2026

Contributor

Lands the two 1.m4 critical-path workstreams toward M2, both in @relavium/core (engine-only; zero platform imports). The whole diff is green on pnpm turbo run lint typecheck test build (622 core tests) and Leakwatch-clean, and was put through two adversarially-verified multi-agent review passes (round 1: 21 findings fixed incl. a real HIGH concurrency bug; round 2: 9 findings fixed, no blockers/highs).

1.R — Checkpointer + resume (critical path)

  • Derived checkpoint, no table (ADR-0003): reconstructCheckpointState(events) is a pure, total fold over the persisted run_events — run status, surrogate workflowId, per-node settled/paused states (a condition's branch from node:completed.selected, dimmed branches from the new node:skipped), pending + already-resolved gate ids, last sequenceNumber, token/cost tallies. The exact shape lives only in checkpoint.ts.
  • WorkflowEngine.resumeFromCheckpoint({runId, workflow, gateId, decision}) — rehydrates a fresh RunExecution from the reconstructed state (seeds node states / pending gates / tallies / the sequenceNumber so post-resume events stay gap-free; no run:started re-emit) and returns a RunHandle.
  • Reconstruction trap (b): a started-but-unfinished node is absent → seeded pending → re-run (bounded by the runId+nodeId+retryCount idempotency key).
  • Idempotent re-delivery (3 arms): already-terminal → closed handle (nothing re-emitted/re-persisted); already-resolved gate on a live run → drive the remaining work without re-applying; pending gate → apply. The residual concurrent-double-resolve window is closed by a Phase-2 store-level uniqueness constraint (documented).
  • Identity guard: the surrogate workflowId must match the workflow handed to resume, else a typed workflow_mismatch. The stronger same-slug-edited guard rides on the Phase-2 runs.workflow_definition_snapshot column (its canonical home).
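
The identity guard and the three re-delivery arms above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the engine's code: `chooseResumeArm`, `CheckpointSketch`, `WorkflowMismatchError`, and the status spellings are hypothetical names for illustration; only the ordering (identity check, then terminal, then already-resolved, then apply) follows the description.

```typescript
// Hypothetical, simplified shapes for illustration only.
// The real CheckpointState lives in checkpoint.ts and has more fields.
type RunStatus = 'running' | 'awaiting_gate' | 'completed' | 'failed' | 'cancelled';

interface CheckpointSketch {
  runStatus: RunStatus;
  workflowId: string;
  resolvedGateIds: ReadonlySet<string>;
}

// 'closed': already terminal, return a closed handle and re-emit nothing.
// 'kick':   gate already resolved on a live run, drive the remaining work only.
// 'apply':  gate still pending, apply the decision.
type ResumeArm = 'closed' | 'kick' | 'apply';

class WorkflowMismatchError extends Error {
  readonly code = 'workflow_mismatch';
}

// Assumption: which statuses count as terminal.
const TERMINAL_STATUSES: ReadonlySet<RunStatus> = new Set<RunStatus>([
  'completed',
  'failed',
  'cancelled',
]);

function chooseResumeArm(
  checkpoint: CheckpointSketch,
  workflowId: string,
  gateId: string,
): ResumeArm {
  // Identity guard first: a resume against the wrong workflow is always an error.
  if (checkpoint.workflowId !== workflowId) {
    throw new WorkflowMismatchError(
      `run belongs to workflow ${checkpoint.workflowId}, not ${workflowId}`,
    );
  }
  if (TERMINAL_STATUSES.has(checkpoint.runStatus)) return 'closed';
  if (checkpoint.resolvedGateIds.has(gateId)) return 'kick';
  return 'apply';
}
```

Keeping the choice in one pure function makes each arm easy to test in isolation; the concurrent double-resolve window is not addressed here, since the PR assigns it to a store-level constraint.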

1.Q — Human-gate suspend/resume + timeout

  • human_in_the_loop handler (node-handlers/human-gate.ts): resolves message_template / assignee and returns { kind: 'paused', gate }; wired into createStandardNodeExecutor (the type no longer fails loud). Secrets are parse-gated (inputs/ctx) + runtime-masked (run.outputs).
  • One-shot timer port ExecutionHost.setTimer (injected — core never names the ambient setTimeout); createManualTimerController is the deterministic test timer.
  • Timeout lifecycle: arm on pause, disarm on resume / terminal settle. approve auto-resolves the gate as approved (decidedBy: 'timeout', run continues); reject (the safe default) fails the run with run_timeout (the AwaitingGate → Failed edge) — never routed through resume(). A human decision (incl. rejected) continues the run; a human decision that beats the timer disarms it.
  • human_gate:paused carries timeoutAction (the effective policy) so a surface can show how a gate auto-resolves and a Phase-2 crash-resume can re-arm from the log.
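
To make the timeout lifecycle concrete, here is a hedged sketch of a single parked gate: it arms a one-shot timer on pause, resolves at most once, and maps a timer fire to either a continuing approval or a `run_timeout` failure. `parkGate`, `GateSketch`, and `GateOutcome` are hypothetical names; the `SetTimer` signature (arm, get a disarm function back) is an assumption based on the description of the timer port.

```typescript
type TimeoutAction = 'approve' | 'reject';

// Assumed shape of the one-shot timer port: arm a timer, get a disarm function back.
type SetTimer = (ms: number, onFire: () => void) => () => void;

type GateOutcome =
  | { kind: 'continue'; decision: 'approved' | 'rejected'; decidedBy: string }
  | { kind: 'fail'; code: 'run_timeout' };

interface GateSketch {
  timeoutMs?: number;
  timeoutAction?: TimeoutAction;
}

function parkGate(
  gate: GateSketch,
  setTimer: SetTimer,
  settle: (outcome: GateOutcome) => void,
) {
  // The effective policy defaults to the safe choice, 'reject'.
  const effectiveAction: TimeoutAction = gate.timeoutAction ?? 'reject';
  let resolved = false;
  let disarm: () => void = () => {};

  // Single resolution: whichever of (human decision, timer) arrives first wins.
  const resolveOnce = (outcome: GateOutcome): void => {
    if (resolved) return;
    resolved = true;
    disarm();
    settle(outcome);
  };

  // Arm only when the gate actually has a timeout.
  if (gate.timeoutMs !== undefined) {
    disarm = setTimer(gate.timeoutMs, () =>
      resolveOnce(
        effectiveAction === 'approve'
          ? { kind: 'continue', decision: 'approved', decidedBy: 'timeout' }
          : { kind: 'fail', code: 'run_timeout' },
      ),
    );
  }

  return {
    effectiveAction,
    // A human decision, including 'rejected', continues the run.
    decide: (decision: 'approved' | 'rejected', decidedBy: string): void =>
      resolveOnce({ kind: 'continue', decision, decidedBy }),
  };
}
```

Note that a human `rejected` decision and a `reject` timeout take different paths in this sketch: the first continues the run, the second fails it, which mirrors the distinction drawn above.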

Contracts & docs (one canonical home)

  • @relavium/shared: node:skipped (+ NodeSkippedReason), node:completed.selected, human_gate:paused.timeoutAction — schemas, RUN_EVENT_TYPES, per-variant type exports, and sse-event-schema.md all updated.
  • execution-model.md §4 (decision-continues vs the two timeout outcomes) and shared-core-engine.md (the derived CheckpointState, the reconstruction trap, the two resume entries, the idempotency + identity boundaries) updated.

Review trail

  • Round 1 headline: resume() mutated the gate vertex state after the durable await, so a sibling gate's timeout firing mid-persist could mis-read the run as stalled → spurious run:failed{internal}. Fixed (synchronous pre-emit mutation, mirroring #settleCompleted) with a deterministic multi-gate regression test.
  • Round 2: consistency/test-fidelity tightening; confirmed no regressions.

Deferred (documented, intentional)

  • docs/roadmap/current.md "next workstream" pointer + marking 1.R/1.Q Done happen in the post-merge roadmap commit (project pattern; "done after merge" rule).
  • Cross-process gate-timer re-arm on rehydration → Phase-2 crash-reconciliation (the data it needs is now persisted on human_gate:paused; no backfill).
  • Content-hash workflow-snapshot identity guard → Phase-2 runs.workflow_definition_snapshot.

Refs: ADR-0003, ADR-0036

🤖 Generated with Claude Code

Summary by Sourcery

Add event-derived checkpoint/replay support with a cross-process resume API and implement human gate suspend/resume with one-shot timeouts in the core workflow engine.

New Features:

  • Introduce a checkpoint reconstruction module and checkpointer interface that derive run state from persisted events, plus a cross-process WorkflowEngine.resumeFromCheckpoint API.
  • Add a human_in_the_loop node handler and gate timeout policy that arm one-shot timers for human gates, including auto-approve and run-timeout behaviors with idempotent decision handling.

Enhancements:

  • Emit node:skipped events with explicit reasons and extend node:completed and human_gate:paused payloads to support accurate checkpoint reconstruction and observability.
  • Extend the in-memory execution host with a deterministic manual timer controller and in-memory checkpointer, and expose new core engine types and utilities from the public index.
  • Tighten the run loop to persist skip propagation, track resolved gates, seed sequence numbers on resume, and guard workflow identity and already-active runs during checkpoint-based resumption.

Documentation:

  • Update execution model, shared core engine architecture, and SSE event schema docs to describe checkpoint-derived state, node:skipped semantics, human gate decisions and timeout behavior, and the new resumeFromCheckpoint flow.

Tests:

  • Add comprehensive tests for gate timeout behavior, skip propagation, checkpoint reconstruction, in-memory checkpointer, manual timer controller, human gate handler behavior, and resumeFromCheckpoint idempotency and error cases.

Summary by CodeRabbit

  • New Features
    • Added checkpoint-based run resumption via resumeFromCheckpoint, including cross-process continuation and deterministic checkpoint reconstruction exports.
    • Added human_in_the_loop node support plus human-gate timeout policies (approve/reject), with resumed gate completion.
    • Added node:skipped events for conditional branches not taken, including skip reasons.
  • Bug Fixes
    • Improved idempotent gate timeout/resume behavior (one-shot timers, correct disarming, and safe handling for terminal runs).
  • Documentation
    • Expanded human-gate decision lifecycle and checkpoint reconstruction/resume semantics; updated SSE/run-event contracts for skips, branch selection, and timeout metadata.

cemililik and others added 6 commits June 14, 2026 20:41
…uisite)

A skip-propagated vertex emitted NOTHING, so the persisted event stream could not record which nodes a
condition dimmed — checkpoint/resume (1.R) reconstructs run state by replaying that stream, so resume
after a condition would mis-route. This adds `node:skipped` to make the log a complete, replayable
record (and it closes a real observability gap — surfaces never saw a node get skipped before).

- shared: `NodeSkippedEventSchema` ({ nodeId, reason: 'branch_not_taken' | 'upstream_unreachable' }) +
  `NodeSkippedReason`; added to RUN_EVENT_TYPES + the RunEvent union; the contract-parity test now pins
  19 names with a valid + reject fixture.
- engine: `#propagateSkips` collects the vertices it newly dims (+ a derived reason via `#skipReason`);
  `#step` emits a durable `node:skipped` for each BEFORE any terminal settle (persist-before-deliver,
  gap-free) so the log stays a complete record.
- docs: documented `node:skipped` in its canonical home (sse-event-schema.md).
- test: the 1.P condition e2e now asserts the dimmed branch emits node:skipped{branch_not_taken}.
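
As an illustration of how skip reasons could be derived, the sketch below dims every vertex whose incoming edges are all dead after a condition selects its targets. This is an assumption about the semantics, not the engine's `#propagateSkips`: the rule "skipped when every incoming edge is dead" and the names `propagateSkipsSketch` and `Edge` are hypothetical; only the event shape and the two reason values come from the commit message.

```typescript
type NodeSkippedReason = 'branch_not_taken' | 'upstream_unreachable';

interface NodeSkippedEvent {
  type: 'node:skipped';
  nodeId: string;
  reason: NodeSkippedReason;
}

interface Edge {
  from: string;
  to: string;
}

// Assumed rule: a vertex is dimmed when every incoming edge is dead. An edge is
// dead if it leaves a skipped vertex, or leaves the condition toward a target
// that the condition did not select.
function propagateSkipsSketch(
  edges: readonly Edge[],
  conditionId: string,
  selected: ReadonlySet<string>,
): NodeSkippedEvent[] {
  const skipped = new Map<string, NodeSkippedReason>();
  const isDead = (e: Edge): boolean =>
    skipped.has(e.from) || (e.from === conditionId && !selected.has(e.to));
  const targets = Array.from(new Set(edges.map((e) => e.to)));

  // Iterate to a fixpoint so skips flow through chains of any length.
  let changed = true;
  while (changed) {
    changed = false;
    for (const id of targets) {
      if (skipped.has(id)) continue;
      const incoming = edges.filter((e) => e.to === id);
      if (!incoming.every(isDead)) continue;
      const direct = incoming.some((e) => e.from === conditionId);
      skipped.set(id, direct ? 'branch_not_taken' : 'upstream_unreachable');
      changed = true;
    }
  }

  return Array.from(skipped, ([nodeId, reason]) => ({
    type: 'node:skipped' as const,
    nodeId,
    reason,
  }));
}
```

In this sketch a join with at least one live incoming edge is never dimmed, which is one plausible reading of the two reasons; the real derivation in `#skipReason` may differ.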

Decided (per the 1.R Understand pass, maintainer-approved): a new `node:skipped` event over adding a
`selected` field to node:completed — it persists the skip decisions directly (no selectedTargets needed
on resume) and surfaces skips. Additive within ADR-0036; no new ADR.

pnpm turbo run lint typecheck test build format:check: green (579 core, 245 shared). Leakwatch: 0.

Refs: ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The read side that rebuilds a run's state from its persisted event stream so an interrupted run (crash,
or suspended at a gate) can resume — no checkpoint table; the state is DERIVED from `run_events`
(ADR-0003; execution-model.md §5). In-memory reference now; the SQLite/cloud store is Phase-2/CLI.

- checkpoint.ts: `Checkpointer { load(runId) }` + `CheckpointState` (schemaVersion, runStatus, nodeStates,
  completedNodeIds, pendingGates, lastSequenceNumber) + the pure `reconstructCheckpointState(events)` —
  a deterministic fold of the ordered stream. Trap (b) baked in: a node that emitted `node:started` but
  no terminal event is ABSENT from nodeStates, so the rehydrating engine seeds it `pending` and re-runs
  it (bounded by the idempotency key, not by skipping). A condition's `selectedTargets` is restored from
  `node:completed.selected`; dimmed branches from `node:skipped`; a gate-parked run yields `pendingGates`
  + a `paused` node; a resumed gate records the decision as the node output.
- run-event.ts: `node:completed.selected?` — the authoritative record of a condition's branch selection
  (the reconstruction needs it; `node:skipped` alone can't survive a crash between the condition's
  completion and the dimmed branches' skip-emission). engine `#settleCompleted` sets it for a branch outcome.
- execution-host.ts: `ExecutionHost.checkpointer` (a SEPARATE read port from the write `RunStore`) +
  `createInMemoryCheckpointer` reconstructing from an `InMemoryRunStore`; wired into `createInMemoryHost`.
- index.ts: export the checkpoint surface. Tests: 11 reconstruction + in-memory-checkpointer cases.
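
The fold described above can be sketched as follows. This is a simplified illustration, not the contents of checkpoint.ts: the event payload fields (`gateId` and `nodeId` on the gate events, for example), the `-1` sentinel for an empty stream, and the names `EventSketch`, `StateSketch`, and `reconstructSketch` are assumptions. What it does take from the description is trap (b): `node:started` records nothing, so a started-but-unfinished node stays absent.

```typescript
// Hypothetical, trimmed-down event shapes; the real schemas live in @relavium/shared.
type EventSketch =
  | { type: 'run:started'; sequenceNumber: number; workflowId: string }
  | { type: 'node:started'; sequenceNumber: number; nodeId: string }
  | { type: 'node:completed'; sequenceNumber: number; nodeId: string; selected?: string[] }
  | { type: 'node:skipped'; sequenceNumber: number; nodeId: string }
  | { type: 'human_gate:paused'; sequenceNumber: number; nodeId: string; gateId: string }
  | { type: 'human_gate:resumed'; sequenceNumber: number; nodeId: string; gateId: string };

interface StateSketch {
  workflowId: string | undefined;
  nodeStates: Map<string, 'completed' | 'skipped' | 'paused'>;
  selectedTargets: Map<string, string[]>;
  pendingGates: Map<string, string>; // gateId -> nodeId
  resolvedGateIds: Set<string>;
  lastSequenceNumber: number; // -1 is a hypothetical sentinel for an empty stream
}

function reconstructSketch(events: readonly EventSketch[]): StateSketch {
  const state: StateSketch = {
    workflowId: undefined,
    nodeStates: new Map(),
    selectedTargets: new Map(),
    pendingGates: new Map(),
    resolvedGateIds: new Set(),
    lastSequenceNumber: -1,
  };
  for (const event of events) {
    state.lastSequenceNumber = event.sequenceNumber; // the stream is ordered
    switch (event.type) {
      case 'run:started':
        state.workflowId = event.workflowId;
        break;
      case 'node:started':
        // Trap (b): deliberately record nothing, so an unfinished node stays
        // absent and the rehydrating engine seeds it as pending.
        break;
      case 'node:completed':
        state.nodeStates.set(event.nodeId, 'completed');
        if (event.selected) state.selectedTargets.set(event.nodeId, event.selected);
        break;
      case 'node:skipped':
        state.nodeStates.set(event.nodeId, 'skipped');
        break;
      case 'human_gate:paused':
        state.nodeStates.set(event.nodeId, 'paused');
        state.pendingGates.set(event.gateId, event.nodeId);
        break;
      case 'human_gate:resumed':
        state.pendingGates.delete(event.gateId);
        state.resolvedGateIds.add(event.gateId);
        state.nodeStates.set(event.nodeId, 'completed');
        break;
    }
  }
  return state;
}
```

Because the function only reads its input and builds a fresh state, replaying the same stream always yields the same result, which is the property the derived-checkpoint design relies on.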

pnpm turbo run lint typecheck test build format:check: green (590 core). Leakwatch: 0.

Refs: ADR-0036, ADR-0003
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Complete the 1.R resume path on top of the Checkpointer read-side: a run
suspended at a gate in a prior process is rehydrated from its reconstructed
CheckpointState and driven to completion behind the one engine loop.

- WorkflowEngine.resumeFromCheckpoint({runId, workflow, gateId, decision}):
  the cross-process resume entry. Loads the checkpoint, rehydrates a fresh
  RunExecution (seeds per-node states, pending/resolved gates, token+cost
  tallies, and the bus sequence so post-resume events stay gap-free), applies
  the decision, and returns a RunHandle. No run:started is re-emitted.
- RunExecution: a checkpoint constructor arm (#seedFromCheckpoint), prepareResume
  (clock only), kick (drive without re-applying), and #resolvedGates so a
  re-delivered decision is an idempotent no-op rather than advancing the run twice.
- Idempotent re-delivery, three arms: an already-terminal checkpoint returns a
  closed handle (nothing re-emitted/re-persisted, createClosedRunHandle); an
  already-resolved gate on a live run drives remaining work without re-applying;
  a still-pending gate applies the decision. The residual concurrent TOCTOU
  (two processes loading the same pending gate before either persists) is closed
  by a Phase-2 store-level uniqueness constraint, documented in checkpoint.ts.
- Identity guard: the surrogate workflowId reconstructed from run:started must
  match the workflow handed to resume, else a typed EngineStateError
  'workflow_mismatch'. The stronger same-slug-edited guard rides on the Phase-2
  runs.workflow_definition_snapshot column (database-schema.md), not run:started.
- event-bus: seedSequence(key, next) — seed the per-run counter on rehydration,
  never lowering an advanced one.
- CheckpointState gains workflowId (from run:started) for the identity guard.
- Tests: 7 resume-from-checkpoint e2e cases (cross-process resume gap-free,
  idempotent re-delivery to a terminal run, workflow_mismatch, unknown_run,
  already-in-memory, invalid_decision); checkpoint workflowId capture.
- Docs: the canonical Checkpoint-and-resume section in shared-core-engine.md now
  describes the derived CheckpointState, the reconstruction trap (started-but-
  unfinished node re-runs), what is NOT checkpointed (the resolved ctx, with the
  structuredClone transport rule), the two resume entries, and the idempotency
  + identity boundaries — pointing to checkpoint.ts for the exact field set.
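
The `seedSequence(key, next)` behaviour (seed on rehydration, never lower an advanced counter) can be illustrated with a small per-key counter. This is a sketch under assumptions: `SequenceCounterSketch` and `take` are hypothetical names, and starting unseeded counters at 0 is a guess; the real counter lives in event-bus.ts.

```typescript
// Hypothetical stand-in for the per-run sequence counter on the event bus.
class SequenceCounterSketch {
  private readonly nextByKey = new Map<string, number>();

  // Seed the counter for a rehydrated run. A seed lower than the current
  // value is ignored, so an already-advanced counter is never lowered.
  seedSequence(key: string, next: number): void {
    const current = this.nextByKey.get(key) ?? 0;
    if (next > current) this.nextByKey.set(key, next);
  }

  // Hand out the next sequence number for this key and advance the counter.
  take(key: string): number {
    const value = this.nextByKey.get(key) ?? 0;
    this.nextByKey.set(key, value + 1);
    return value;
  }
}
```

A resume would seed with the checkpoint's `lastSequenceNumber + 1`, so the first post-resume event continues the persisted stream without a gap or a duplicate.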

Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fill the `paused`/`GateRequest` arm 1.N/1.P reserved: the `human_in_the_loop`
node handler plus the engine-side timeout lifecycle, on top of 1.R.

- node-handlers/human-gate.ts: the gate handler resolves `message_template` /
  `assignee` against inputs + run.outputs and returns `{ kind: 'paused', gate }`.
  Raw resolution is safe — a `secret` reference in either field is rejected at
  parse (secret-taint `node-text` category), mirroring the agent's prompt. It is
  thin and clock-free; deadlines are the engine's job. Wired into
  createStandardNodeExecutor (the type no longer fails loud).
- Timer port: ExecutionHost.setTimer (one-shot, returns disarm) — injected so core
  never names the ambient setTimeout (purity lib). createInMemoryHost ships a manual,
  deterministic timer (createManualTimerController) fired by hand in tests
  (fireTimers/armedCount); a real surface injects a setTimeout-backed one.
- Engine timeout lifecycle: #settlePaused computes expiresAt from the host clock and
  arms the timer; a decision (human or timeout-approve) disarms it; a terminal settle
  disarms all. On fire, `approve` auto-resolves the gate as approved
  (decidedBy: 'timeout', run continues); `reject` (the safe default) fails the run with
  run_timeout (the AwaitingGate→Failed edge) — never routed through resume(), which
  would wrongly complete the gate. A human decision that beats the timer disarms it
  (single resolution).
- GateRequest gains timeoutAction ('approve'|'reject'); the handler supplies it from
  the node's timeout_action (default reject). Re-arming a still-pending gate's timer on
  rehydration is deferred to Phase-2 crash-reconciliation (needs timeout_action
  persisted on human_gate:paused) — documented in #seedFromCheckpoint.
- Tests: 7 handler unit tests (template resolution, default/explicit timeout_action,
  no-timeout, cancel, validation, wrong-node) + 4 engine timeout e2e (approve
  auto-resolve, reject→run_timeout, disarm-on-human-decision, no-timer-without-timeout)
  + the dispatcher gate-wiring assertion.
- Docs: execution-model.md §4 made precise on the decision-continues vs the two
  timeout outcomes.
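
A deterministic, hand-fired timer of the kind described above might look like the sketch below. It is not the actual `createManualTimerController`: the function name `createManualTimerSketch` is hypothetical, and the sweep semantics (one `fireTimers()` call fires every armed timer regardless of its delay) are an assumption.

```typescript
// Assumed shape of the one-shot timer port: arm a timer, get a disarm function back.
type SetTimer = (ms: number, onFire: () => void) => () => void;

function createManualTimerSketch() {
  const armed = new Map<number, () => void>();
  let nextId = 0;

  // Arming stores the callback; nothing ever fires on its own.
  const setTimer: SetTimer = (_ms, onFire) => {
    const id = nextId++;
    armed.set(id, onFire);
    return () => {
      armed.delete(id);
    };
  };

  // Fire every timer that was armed when the sweep started, each at most once.
  // Returns how many callbacks actually ran.
  const fireTimers = (): number => {
    const batch = Array.from(armed.entries());
    let fired = 0;
    for (const [id, onFire] of batch) {
      // Skip timers disarmed by an earlier callback in the same sweep.
      if (!armed.delete(id)) continue;
      onFire();
      fired += 1;
    }
    return fired;
  };

  return { setTimer, fireTimers, armedCount: (): number => armed.size };
}
```

Removing each entry before invoking it is what makes the timers one-shot, and it also lets a callback disarm a sibling timer mid-sweep without that sibling firing afterwards.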

Refs: ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fold the confirmed findings from the adversarially-verified review of the
1.R + 1.Q diff (21/26 survived refutation).

Correctness:
- resume(): mark the gate vertex completed SYNCHRONOUSLY before the durable emit
  (mirroring #settleCompleted), closing a multi-gate stall race where a sibling
  gate's timeout firing during the persist saw the gate as deleted-but-paused and
  mis-read the run as stalled (spurious run:failed{internal}). [HIGH]
- #failGateOnTimeout now adds the gateId to #resolvedGates (symmetry with the
  approve/human path) so a late re-delivery of a reject-timed-out gate's decision
  is an idempotent no-op, not a run_already_terminal throw.
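
The ordering fix for the stall race can be illustrated in isolation. The sketch below is a simplified model, not the engine's `resume()`: `RunStateSketch`, `looksStalled`, and `resumeGate` are hypothetical, and the stall rule (a paused vertex with no pending gate left to wake it) is an assumption drawn from the "deleted-but-paused" wording. The point it shows is that all in-memory mutation happens before the first `await`, so anything that runs while the persist is in flight sees a consistent state.

```typescript
type VertexStatus = 'paused' | 'completed';

interface RunStateSketch {
  vertexStatus: Map<string, VertexStatus>;
  pendingGates: Map<string, string>; // gateId -> vertexId
}

// Assumed stall rule: some vertex is still paused but no pending gate can wake it.
function looksStalled(state: RunStateSketch): boolean {
  const awaited = new Set(Array.from(state.pendingGates.values()));
  return Array.from(state.vertexStatus.entries()).some(
    ([vertexId, status]) => status === 'paused' && !awaited.has(vertexId),
  );
}

async function resumeGate(
  state: RunStateSketch,
  gateId: string,
  persist: () => Promise<void>,
): Promise<void> {
  const vertexId = state.pendingGates.get(gateId);
  if (vertexId === undefined) return; // unknown or already resolved: no-op

  // Mutate synchronously, BEFORE the durable emit. If the vertex were only
  // marked completed after the await, a sibling timeout firing mid-persist
  // would observe a paused vertex with no pending gate and report a stall.
  state.pendingGates.delete(gateId);
  state.vertexStatus.set(vertexId, 'completed');

  await persist();
}
```

Moving the `vertexStatus.set` call below the `await` reproduces the inconsistent window in this model, which is a convenient way to see why a deterministic regression test is possible.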

Clarity / contracts:
- New EngineStateError code `run_already_active` for resumeFromCheckpoint on a run
  already in memory (was the contradictory `unknown_run`); unknown_run comment fixed.
- human_gate:paused gains optional `timeoutAction` (reuses TimeoutActionSchema),
  populated by the engine — immediate observability + pre-captures the data a
  Phase-2 crash-resume needs to re-arm a gate timer (no future backfill).
- human-gate.ts header corrected: distinguishes parse-time taint (inputs/ctx) from
  the runtime masking that keeps run.outputs secret-free.

Docs (one canonical home):
- sse-event-schema.md: add NodeSkippedEvent to the RunEvent union + interface,
  node:completed.selected, and human_gate:paused.timeoutAction (interfaces + table).
- run-event.test.ts: fixture carries timeoutAction/expiresAt; stale "18" -> "19".

Tests (+13): the multi-gate stall-race regression (two timeout-approve gates settled
in one timer sweep), the kick() path (gate already resolved in a prior process drives
the remaining work without re-applying), reject-timeout re-delivery no-op, skip-before-
fail ordering, expiresAt deadline value, post-terminal timer no-op, no-rearm-on-
rehydration, token/cost tally restoration, and ManualTimerController unit tests.

Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Second adversarially-verified review pass (9/20 findings survived; no blockers/
highs — the round-1 fixes held). Fold the confirmed items:

- #settlePaused now emits the EFFECTIVE timeoutAction (default `reject`) used for
  both the armed timer and the persisted human_gate:paused event, so the log always
  reflects the exact policy the engine acts on — even when a handler set timeoutMs
  but left timeoutAction implicit (a Phase-2 crash-resume reads it back to re-arm).
- shared: add the missing `export type NodeSkippedEvent` (restores the per-variant
  type-export pattern alongside NodeCompletedEvent/NodeFailedEvent).
- resumeFromCheckpoint: a comment marking the single point a future engine guards/
  migrates an older checkpoint.schemaVersion (the field's purpose; inert at v1).
- docs: execution-model.md paragraph break before the cross-reference sentence.

Tests (+3, strengthened 2):
- a human `rejected` decision completes the gate and CONTINUES the run (the
  documented "rejection is not a failure" path), the decision reaching run.outputs.
- an armed gate timer is disarmed by #settle when the run terminates for an unrelated
  reason (cancel) — the disarm-by-settle path (vs disarm-by-resume).
- the kick-path test now also asserts gap-free sequence continuation, and the
  no-rearm-on-rehydration test spies on setTimer to prove it is NEVER called
  (distinguishing "never armed" from "armed then disarmed").

Deferred (documented, not a code change): docs/roadmap/current.md still names 1.Q as
the next workstream — the roadmap status page is updated in the post-merge commit
(project pattern; ADR/roadmap "done after merge" rule), not pre-merge.

Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sourcery-ai

sourcery-ai Bot commented Jun 14, 2026


Reviewer's Guide

Implements event-derived checkpoint/resume for the workflow engine and adds human-gate suspend/resume with one-shot timeouts, wiring them through the core engine, execution host, node handlers, and shared run-event contracts with comprehensive tests and documentation updates.

Sequence diagram for resumeFromCheckpoint cross-process gate resume

sequenceDiagram
  participant Caller
  participant WorkflowEngine
  participant ExecutionHost
  participant Checkpointer
  participant RunExecution
  participant RunEventBus

  Caller->>WorkflowEngine: resumeFromCheckpoint(input)
  WorkflowEngine->>WorkflowEngine: GateDecisionSchema.safeParse(input.decision)
  WorkflowEngine->>ExecutionHost: access checkpointer
  ExecutionHost->>Checkpointer: load(input.runId)
  Checkpointer-->>ExecutionHost: CheckpointState | undefined
  ExecutionHost-->>WorkflowEngine: CheckpointState | undefined
  alt no checkpoint
    WorkflowEngine-->>Caller: throw EngineStateError unknown_run
  else checkpoint exists
    WorkflowEngine->>ExecutionHost: store.resolveWorkflowId(input.workflow.workflow.id)
    ExecutionHost-->>WorkflowEngine: workflowId
    alt workflowId mismatch
      WorkflowEngine-->>Caller: throw EngineStateError workflow_mismatch
    else workflowId matches
      alt checkpoint.runStatus in TERMINAL_RUN_STATUSES
        WorkflowEngine->>WorkflowEngine: createClosedRunHandle(input.runId)
        WorkflowEngine-->>Caller: RunHandle(events completes immediately)
      else non-terminal checkpoint
        WorkflowEngine->>RunEventBus: new RunEventBus
        WorkflowEngine->>RunExecution: new RunExecution({checkpoint,...})
        RunExecution->>RunExecution: #seedFromCheckpoint(plan, checkpoint, bus, runId)
        RunExecution->>RunEventBus: seedSequence(runId, checkpoint.lastSequenceNumber + 1)
        RunExecution->>RunExecution: prepareResume()
        WorkflowEngine->>WorkflowEngine: #runs.set(runId, execution)
        alt input.gateId in checkpoint.resolvedGateIds
          WorkflowEngine->>RunExecution: kick()
        else gate still pending
          WorkflowEngine->>RunExecution: resume(input.gateId, decision)
        end
        WorkflowEngine-->>Caller: RunHandle(events from resumed run)
      end
    end
  end

Sequence diagram for human_gate pause, timeout, and resume with one-shot timer

sequenceDiagram
  participant RunExecution
  participant ExecutionHost
  participant ManualTimerController as ManualTimer

  %% Gate pause path
  RunExecution->>RunExecution: #settlePaused(vertex, gate)
  RunExecution->>RunExecution: #states.set(vertex.id, {status paused})
  RunExecution->>RunExecution: #pendingGates.set(gateId, {vertexId})
  RunExecution->>RunExecution: compute effectiveAction, expiresAt
  alt gate.timeoutMs defined
    RunExecution->>ExecutionHost: setTimer(gate.timeoutMs, onGateTimeout)
    ExecutionHost-->>RunExecution: disarm()
    RunExecution->>RunExecution: #gateTimers.set(gateId, disarm)
  end
  RunExecution->>RunExecution: #emitDurable(human_gate:paused)

  %% Human decision arrives before timeout
  RunExecution->>RunExecution: resume(gateId, decision)
  RunExecution->>RunExecution: check #resolvedGates.has(gateId)
  RunExecution->>RunExecution: #resolvedGates.add(gateId)
  RunExecution->>RunExecution: #pendingGates.delete(gateId)
  RunExecution->>RunExecution: #disarmTimer(gateId)
  RunExecution->>RunExecution: update vertex.state to completed
  RunExecution->>RunExecution: #emitDurable(human_gate:resumed)
  RunExecution->>RunExecution: #schedule()

  %% Timer fires first
  ManualTimer->>RunExecution: #onGateTimeout(gateId, vertexId, action)
  RunExecution->>RunExecution: #disarmTimer(gateId)
  alt action == approve
    RunExecution->>RunExecution: resume(gateId, {decision approved, decidedBy timeout})
  else action == reject
    RunExecution->>RunExecution: #failGateOnTimeout(gateId, vertexId)
    RunExecution->>RunExecution: #pendingGates.delete(gateId)
    RunExecution->>RunExecution: #resolvedGates.add(gateId)
    RunExecution->>RunExecution: #settleFailed(vertex, run_timeout)
    RunExecution->>RunExecution: #schedule()
  end

  %% Terminal settle disarms any remaining timers
  RunExecution->>RunExecution: #settle(type)
  RunExecution->>RunExecution: for gateId in #gateTimers.keys()
  RunExecution->>RunExecution: #disarmTimer(gateId)

File-Level Changes

Change: Add derived checkpoint reconstruction and cross-process resumeFromCheckpoint entrypoint to WorkflowEngine, including idempotent re-delivery and workflow identity guarding.
Details:
  • Introduce CheckpointState model, reconstructCheckpointState() fold, and Checkpointer interface to derive run state from ordered RunEvent streams.
  • Extend RunEventBus to support seeding sequence numbers so resumed runs continue with gap-free sequenceNumber values.
  • Augment RunExecution to seed internal vertex state, pending/resolved gates, tallies, and sequence counters from a checkpoint, and add prepareResume() and kick() paths.
  • Implement WorkflowEngine.resumeFromCheckpoint() to load a checkpoint via ExecutionHost.checkpointer, enforce workflow identity and active-run guards, no-op on terminal checkpoints via a closed RunHandle, and either apply the gate decision or just drive remaining work.
  • Extend EngineStateError codes with run_already_active, workflow_mismatch, and reuse unknown_run/invalid_decision for the new resumeFromCheckpoint flow.
Files:
  • packages/core/src/engine/checkpoint.ts
  • packages/core/src/engine/checkpoint.test.ts
  • packages/core/src/engine/event-bus.ts
  • packages/core/src/engine/engine.ts
  • packages/core/src/engine/errors.ts
  • packages/core/src/engine/engine.test.ts
  • packages/core/src/engine/run-handle.ts
  • packages/core/src/index.ts
  • docs/architecture/shared-core-engine.md

Change: Introduce human-in-the-loop gate handler and one-shot timeout lifecycle, integrating gate timeouts into the run loop with proper disarming and failure semantics.
Details:
  • Add createHumanGateNodeExecutor to resolve message_template and assignee with template interpolation, enforce secret-handling contracts, and surface GateRequest with timeoutMs/timeoutAction.
  • Wire human_in_the_loop into createStandardNodeExecutor so human_gate nodes suspend instead of failing, and export the handler/deps from the core index.
  • Extend RunExecution to track resolved gates and per-gate timeout timers, disarming timers on resume and terminal settle, and make resume() idempotent on already-resolved gates while synchronously updating gate vertex state before durable emit.
  • Implement #settlePaused gate timeout wiring using injected ExecutionHost.setTimer, computing expiresAt, deriving effective timeoutAction, and emitting human_gate:paused that carries timeoutMs, timeoutAction, and expiresAt.
  • Add #onGateTimeout and #failGateOnTimeout to auto-approve gates or fail runs with run_timeout on timeout_action: reject, ensuring late decisions become no-ops and that timers never fire after run termination.
Files:
  • packages/core/src/engine/node-handlers/human-gate.ts
  • packages/core/src/engine/node-handlers/human-gate.test.ts
  • packages/core/src/engine/node-handlers/dispatcher.ts
  • packages/core/src/engine/node-handlers/node-handlers.test.ts
  • packages/core/src/engine/engine.ts
  • packages/core/src/engine/engine.test.ts
  • packages/core/src/index.ts
  • docs/architecture/execution-model.md

Change: Extend ExecutionHost with a platform-free timer port and in-memory checkpointer/timer implementations to support timeouts and checkpoint-based resume in tests and the reference engine.
Details:
  • Define SetTimer type and add setTimer and checkpointer to ExecutionHost, keeping timer and checkpoint responsibilities distinct from RunStore.
  • Implement createManualTimerController as a deterministic one-shot timer with fireTimers and armedCount helpers for tests, including race/edge-case coverage.
  • Extend createInMemoryHost to provide a ManualTimerController-backed setTimer, expose fireTimers/armedCount for tests, and wire in an in-memory Checkpointer using reconstructCheckpointState over InMemoryRunStore.
  • Add createInMemoryCheckpointer helper that only reconstructs checkpoints when the underlying RunStore is InMemoryRunStore, returning undefined for opaque/custom stores.
  • Update engine tests to use the manual timer host helpers to exercise gate timeout behavior, multi-gate races, and ensure timers are disarmed on resume and terminal closure.
Files:
  • packages/core/src/engine/execution-host.ts
  • packages/core/src/engine/execution-host.test.ts
  • packages/core/src/engine/checkpoint.ts
  • packages/core/src/engine/checkpoint.test.ts
  • packages/core/src/engine/engine.test.ts
  • packages/core/src/index.ts

Change: Enrich shared run-event contracts with node:skipped, condition-branch selection, and timeoutAction metadata to make the event log fully replayable for checkpoint reconstruction and gate timeout UX.
Details:
  • Add NodeSkippedEvent and NodeSkippedReason (branch_not_taken/upstream_unreachable) to shared run-event schemas and constants, plus tests and SSE contract docs, and include it in RunEventUnion/RunEventType.
  • Update NodeCompletedEvent to optionally carry selected target ids for condition nodes, and document this in the SSE schema as the authoritative branch record.
  • Modify RunExecution skip propagation to compute a structured skip reason per vertex, return newly skipped nodes, and emit node:skipped events before terminals so reconstruction and UIs can see dimmed branches.
  • Extend HumanGatePausedEvent to include timeoutAction alongside timeoutMs and expiresAt, and update tests and SSE docs accordingly.
  • Update execution-model and architecture docs to describe gate decision semantics vs timeout outcomes, and how node:skipped and selected are used in checkpoint/resume.
Files:
  • packages/shared/src/run-event.ts
  • packages/shared/src/run-event.test.ts
  • packages/shared/src/constants.ts
  • packages/core/src/engine/engine.ts
  • packages/core/src/engine/node-handlers/node-handlers.e2e.test.ts
  • docs/reference/contracts/sse-event-schema.md
  • docs/architecture/shared-core-engine.md
  • docs/architecture/execution-model.md


@coderabbitai

coderabbitai Bot commented Jun 14, 2026



No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a4a8457d-3b90-4092-890a-85a853fe28cd

📥 Commits

Reviewing files that changed from the base of the PR and between 012b2bb and 8e8cd9c.

📒 Files selected for processing (7)
  • docs/architecture/shared-core-engine.md
  • docs/reference/contracts/sse-event-schema.md
  • packages/core/src/engine/checkpoint.test.ts
  • packages/core/src/engine/engine.test.ts
  • packages/core/src/engine/engine.ts
  • packages/core/src/engine/node-handlers/human-gate.test.ts
  • packages/core/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/core/src/index.ts
  • docs/architecture/shared-core-engine.md
  • packages/core/src/engine/engine.test.ts
  • docs/reference/contracts/sse-event-schema.md

Walkthrough

This PR adds a human_in_the_loop node handler with one-shot timeout support (approve/reject), a checkpoint read-side (reconstructCheckpointState) enabling cross-process gate resumption via a new WorkflowEngine.resumeFromCheckpoint method, durable node:skipped events with branch_not_taken/upstream_unreachable reasons, deterministic test timer infrastructure, and updates shared event schemas, public API surface, and architecture documentation throughout.

Changes

Human Gate, Checkpoint Resume & node:skipped

Layer / File(s) Summary
Shared event schema extensions
packages/shared/src/constants.ts, packages/shared/src/run-event.ts, packages/shared/src/run-event.test.ts, docs/reference/contracts/sse-event-schema.md
Adds node:skipped to RUN_EVENT_TYPES and RunEventUnionSchema. Extends NodeCompletedEventSchema with optional selected array (branch target ids). Adds optional timeoutAction to HumanGatePausedEventSchema with cross-field validation (requires timeoutMs when present). Introduces NodeSkippedReasonSchema and NodeSkippedEventSchema with branch_not_taken and upstream_unreachable reasons. Updates contract documentation and validates all changes via expanded test matrix.
human_in_the_loop node handler and dispatcher wiring
packages/core/src/engine/node-executor.ts, packages/core/src/engine/node-handlers/human-gate.ts, packages/core/src/engine/node-handlers/dispatcher.ts, packages/core/src/engine/node-handlers/human-gate.test.ts, packages/core/src/engine/node-handlers/node-handlers.test.ts
Adds timeoutAction?: 'approve' | 'reject' to GateRequest to control gate timeout behavior. Implements createHumanGateNodeExecutor: validates node kind, handles aborts (including during template resolution), resolves message_template and assignee via RunScope with inputs/outputs, maps interpolation failures to validation errors, constructs GateRequest with optional timeout fields (defaults timeoutAction to 'reject' when timeout_ms is set). Wires handler into createStandardNodeExecutor via optional humanGate dependency in StandardNodeExecutorDeps. Comprehensive tests validate interpolation, timeout defaults, abort handling, non-gate errors, and integration with standard executor.
Checkpoint read-side: reconstructCheckpointState
packages/core/src/engine/checkpoint.ts, packages/core/src/engine/checkpoint.test.ts
Introduces complete checkpoint.ts module: CheckpointNodeState, CheckpointPendingGate, CheckpointState types, Checkpointer interface, CHECKPOINT_SCHEMA_VERSION constant. Core reconstructCheckpointState deterministically folds persisted RunEvent stream (in order) into derived state: returns undefined if no run:started; reconstructs run identity, status, and per-node terminal/paused states (omitting nodes with only node:started so they re-run); restores branch selections via selectedTargets; reconstructs gate-parked runs with pendingGates; handles gate resume by completing gate node and moving to resolvedGateIds; accumulates token totals and cumulative cost from cost:updated. Tests cover all scenarios including completed runs, in-flight nodes, branch/skip restoration, gate pause/resume cycles, cost accounting, typed failure preservation, and createInMemoryCheckpointer integration with InMemoryRunStore.
ExecutionHost: timer seam, ManualTimerController, utilities
packages/core/src/engine/execution-host.ts, packages/core/src/engine/execution-host.test.ts, packages/core/src/engine/event-bus.ts, packages/core/src/engine/run-handle.ts, packages/core/src/engine/errors.ts
Extends ExecutionHost contract with checkpointer: Checkpointer port and setTimer: (ms, onFire) => disarm one-shot timer port. Adds ManualTimerController interface and createManualTimerController for deterministic test-time timer control: setTimer arms timers, fireTimers fires all currently-armed timers exactly once (idempotent across sweeps), armedCount reports remaining timers. Expands createInMemoryHost to optionally accept injected checkpointer, wire manual timer controller as setTimer, and expose fireTimers/armedCount test controls. Adds createInMemoryCheckpointer (reconstructs checkpoint state from InMemoryRunStore, returns undefined for opaque stores). Adds seedSequence to RunEventBus for idempotent monotonic sequence counter advance during rehydration. Adds createClosedRunHandle for already-terminal runs (closed event stream, inert cancel/subscribe, resolved whenConsumersReady). Extends EngineStateErrorCode with 'run_already_active' and 'workflow_mismatch' discriminants.
Engine core: checkpoint seeding, skip propagation, gate timeouts, resumeFromCheckpoint
packages/core/src/engine/engine.ts, packages/core/src/engine/engine.test.ts, packages/core/src/engine/node-handlers/node-handlers.e2e.test.ts
Adds ResumeFromCheckpointInput interface for cross-process checkpoint resumption. Introduces WorkflowEngine.resumeFromCheckpoint(...): loads CheckpointState via checkpointer, enforces workflow identity guard (rejects mismatch), returns createClosedRunHandle for terminal checkpoints, otherwise creates checkpoint-seeded RunExecution and either kicks (gate already resolved) or applies decision via resume. Extends RunExecution with #resolvedGates for idempotent gate tracking and #gateTimers for timer disarm callbacks. Adds optional checkpoint constructor parameter and #seedFromCheckpoint method to rehydrate vertex states, pending/resolved gates, token/cost tallies, and sequence counter. Adds kick() to continue without re-applying a gate decision. Makes resume idempotent: skips re-application if gate already in #resolvedGates. During gate resume: marks gate resolved, disarms timer, clears pause state, synchronously completes gate vertex with stored output, emits human_gate:resumed, schedules continuation. Refactors skip propagation: #propagateSkips now returns newly skipped vertices with NodeSkippedReason; scheduler emits durable node:skipped events with reasons before checking terminal conditions. Persists branch outcomes: node:completed now includes selected targets when outcome is branch. Enhances gate pause: #settlePaused computes timeoutAction and expiresAt, arms one-shot timer via setTimer, stores disarm callback. Implements #disarmTimer (idempotent), #onGateTimeout (approve auto-resumes; reject fails run with run_timeout), #failGateOnTimeout (idempotent gate resolution with run failure). On terminal settlement, disarms all remaining gate timers and clears #gateTimers. 
Comprehensive test coverage validates timeout metadata, auto-resolve vs auto-fail, timer disarm on early decision, no timer when timeoutMs absent, rejection continuation, timer disarm on cancel, idempotent late re-delivery, correct skip event ordering, concurrent multi-gate timeout in single sweep, cross-process rehydration with gap-free sequencing, terminal run no-op, workflow mismatch guard, already-active rejection, invalid decision validation, kick-path regression, and no timer arm during rehydration. E2E tests assert node:skipped events with branch_not_taken reason for unselected branches.
Public API surface and documentation
packages/core/src/index.ts, docs/architecture/execution-model.md, docs/architecture/shared-core-engine.md
Re-exports ResumeFromCheckpointInput, checkpoint types (Checkpointer, CheckpointState, CheckpointNodeState, CheckpointPendingGate), the checkpoint function and constant (reconstructCheckpointState, CHECKPOINT_SCHEMA_VERSION), timer types (SetTimer, ManualTimerController) and function (createManualTimerController), the checkpointer factory (createInMemoryCheckpointer), and the human-gate executor (createHumanGateNodeExecutor, HumanGateNodeExecutorDeps). Updates execution-model.md to specify the full human-gate decision lifecycle (emit human_gate:resumed, continue the run, checkpoint-idempotent resolution, allow parallel pending gates) and to expand timeout behavior (one-shot timer from the injected clock; reject fails the run with a run-timeout failure while approve auto-resolves; the first-arriving decision disarms the timer). Updates shared-core-engine.md with a detailed deterministic reconstructCheckpointState event-fold model, clarifies the derived state contents and exclusions (ctx.* is not reconstructed), specifies the structuredClone requirement for checkpoint boundaries, and expands gate-resume semantics (in-process vs restart paths, workflow identity guard with definition snapshot, idempotent decision re-delivery, Phase-2 store uniqueness constraint for concurrency race closure).
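
The checkpoint read-side row above describes reconstructCheckpointState as a deterministic fold over the persisted event stream. The following is only a minimal sketch of that idea, not the code in checkpoint.ts: `SketchEvent`, `SketchState`, and `foldEvents` are hypothetical names, the event union is heavily simplified, and cost accounting and failure states are left out.

```typescript
// Hypothetical, simplified event union. The real RunEvent schema carries more
// event types and fields; only what this sketch needs is modelled here.
type SketchEvent =
  | { type: 'run:started'; runId: string; sequenceNumber: number }
  | { type: 'node:started'; nodeId: string; sequenceNumber: number }
  | { type: 'node:completed'; nodeId: string; sequenceNumber: number; selected?: string[] }
  | { type: 'node:skipped'; nodeId: string; sequenceNumber: number }
  | { type: 'human_gate:paused'; nodeId: string; gateId: string; sequenceNumber: number }
  | { type: 'human_gate:resumed'; nodeId: string; gateId: string; sequenceNumber: number };

type SketchNodeStatus = 'completed' | 'skipped' | 'paused';

// Hypothetical derived state; the actual CheckpointState shape may differ.
interface SketchState {
  runId: string;
  nodes: Map<string, SketchNodeStatus>;
  selectedTargets: Map<string, string[]>;
  pendingGates: Map<string, string>; // gateId -> nodeId
  resolvedGateIds: Set<string>;
  lastSequenceNumber: number;
}

// Pure fold: same events in, same state out. Returns undefined without run:started.
function foldEvents(events: readonly SketchEvent[]): SketchState | undefined {
  let state: SketchState | undefined;
  for (const event of events) {
    if (event.type === 'run:started') {
      state = {
        runId: event.runId,
        nodes: new Map(),
        selectedTargets: new Map(),
        pendingGates: new Map(),
        resolvedGateIds: new Set(),
        lastSequenceNumber: event.sequenceNumber,
      };
      continue;
    }
    if (state === undefined) continue; // ignore anything before run:started
    state.lastSequenceNumber = event.sequenceNumber;
    switch (event.type) {
      case 'node:started':
        // Deliberately not recorded: a started-but-unfinished node stays absent
        // from the derived state, so a resumed run would re-run it.
        break;
      case 'node:completed':
        state.nodes.set(event.nodeId, 'completed');
        if (event.selected !== undefined) {
          state.selectedTargets.set(event.nodeId, event.selected.slice());
        }
        break;
      case 'node:skipped':
        state.nodes.set(event.nodeId, 'skipped');
        break;
      case 'human_gate:paused':
        state.nodes.set(event.nodeId, 'paused');
        state.pendingGates.set(event.gateId, event.nodeId);
        break;
      case 'human_gate:resumed':
        state.nodes.set(event.nodeId, 'completed');
        state.pendingGates.delete(event.gateId);
        state.resolvedGateIds.add(event.gateId);
        break;
    }
  }
  return state;
}
```

Because the sketch keeps no state outside its arguments, folding the same event log twice yields equal results, which is the property that lets a checkpoint be derived instead of stored.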

Sequence Diagram(s)

sequenceDiagram
  rect rgba(70, 130, 180, 0.5)
    note over Caller, RunEventBus: Cross-process gate resumption via checkpoint
  end
  participant Caller
  participant Engine as WorkflowEngine
  participant Checkpointer
  participant Store as RunStore
  participant Exec as RunExecution
  participant Bus as RunEventBus
  Caller->>Engine: resumeFromCheckpoint({runId, workflow, gateId, decision})
  Engine->>Checkpointer: load(runId)
  Checkpointer->>Store: getEvents(runId)
  Store-->>Checkpointer: RunEvent[]
  Checkpointer->>Checkpointer: reconstructCheckpointState(events)
  Checkpointer-->>Engine: CheckpointState
  Engine->>Engine: enforce workflow identity guard
  alt run is terminal
    Engine-->>Caller: createClosedRunHandle(runId)
  else run paused at gate
    Engine->>Exec: new RunExecution(checkpoint)
    Exec->>Exec: `#seedFromCheckpoint` (vertices, gates, tallies)
    Exec->>Bus: seedSequence(lastSequenceNumber + 1)
    alt gateId already resolved
      Engine->>Exec: kick()
    else gateId pending
      Engine->>Exec: resume(gateId, decision)
      Exec->>Exec: mark gate completed, disarm timer
      Exec->>Store: persist human_gate:resumed
      Exec->>Exec: schedule next step
    end
    Engine-->>Caller: RunHandle (active)
  end
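
The branching in the diagram above (terminal run, already-resolved gate, pending gate) can be read as a small pure decision. This is a hedged sketch only: `SketchCheckpoint`, `ResumeArm`, and `chooseResumeArm` are hypothetical names, and the real engine additionally validates the workflow identity, the decision payload, and whether the gate is known at all.

```typescript
// Hypothetical names throughout; this mirrors the diagram, not the engine's code.
type ResumeArm = 'closed_handle' | 'kick' | 'apply_decision';

interface SketchCheckpoint {
  status: 'running' | 'paused' | 'completed' | 'failed' | 'cancelled';
  resolvedGateIds: readonly string[];
}

function chooseResumeArm(checkpoint: SketchCheckpoint, gateId: string): ResumeArm {
  const terminal =
    checkpoint.status === 'completed' ||
    checkpoint.status === 'failed' ||
    checkpoint.status === 'cancelled';
  // Already-terminal run: hand back a closed handle, emit and persist nothing.
  if (terminal) return 'closed_handle';
  // Gate resolved by an earlier delivery: continue without re-applying the decision.
  if (checkpoint.resolvedGateIds.includes(gateId)) return 'kick';
  // Otherwise apply the decision (the real resume path still validates the gate).
  return 'apply_decision';
}
```

Keeping this choice a function of the reconstructed state alone is what makes a re-delivered decision harmless in the sketch: the second delivery observes the gate id in `resolvedGateIds` and takes the `kick` arm.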
sequenceDiagram
  rect rgba(200, 150, 50, 0.5)
    note over RunExecution, RunStore: Gate timeout lifecycle with one-shot timer
  end
  participant RunExecution
  participant SetTimer as setTimer/ManualTimerController
  participant RunStore
  participant Scheduler
  RunExecution->>RunExecution: `#settlePaused` (gate node)
  RunExecution->>RunExecution: compute expiresAt from clock.now() + timeoutMs
  RunExecution->>SetTimer: setTimer(timeoutMs, onFire)
  SetTimer-->>RunExecution: disarm callback → store in `#gateTimers`
  RunExecution->>RunStore: persist human_gate:paused {timeoutAction, expiresAt}
  par Human decision arrives first
    RunExecution->>RunExecution: resume(gateId, decision)
    RunExecution->>RunExecution: `#disarmTimer`(gateId)
  and Timer fires
    SetTimer->>RunExecution: `#onGateTimeout`(gateId, timeoutAction)
    alt timeoutAction = approve
      RunExecution->>RunExecution: resolve gate, mark approved
      RunExecution->>Scheduler: schedule next step
    else timeoutAction = reject
      RunExecution->>RunExecution: `#failGateOnTimeout`
      RunExecution->>RunStore: persist run:failed (run_timeout)
    end
  end
  RunExecution->>RunExecution: on terminal: disarm/clear all `#gateTimers`
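
The timeout diagram above relies on a one-shot timer port that tests can drive by hand. A minimal sketch of such a manual controller follows, assuming the `(ms, onFire) => disarm` shape quoted in the walkthrough. `createManualTimersSketch` and its member names are hypothetical and are not the exported createManualTimerController; in particular, whether `armedCount` is a method or a property in the real API is not stated in this thread.

```typescript
type Disarm = () => void;
type SetTimerSketch = (ms: number, onFire: () => void) => Disarm;

interface ManualTimersSketch {
  setTimer: SetTimerSketch;
  fireTimers: () => void;
  armedCount: () => number;
}

function createManualTimersSketch(): ManualTimersSketch {
  // Each armed timer is a distinct wrapper, so two timers sharing one callback
  // are still tracked separately.
  const armed = new Set<() => void>();
  return {
    // The delay is ignored: time only "passes" when the test calls fireTimers.
    setTimer: (_ms, onFire) => {
      const entry = (): void => onFire();
      armed.add(entry);
      return () => {
        armed.delete(entry); // disarming twice is a no-op
      };
    },
    fireTimers: () => {
      // Snapshot first, so timers armed by a callback wait for the next sweep.
      const snapshot = Array.from(armed);
      for (const entry of snapshot) {
        // delete() returns false if an earlier callback in this sweep disarmed it.
        if (armed.delete(entry)) entry();
      }
    },
    armedCount: () => armed.size,
  };
}
```

Removing an entry before invoking it is what makes each timer one-shot: a second sweep finds nothing left to fire.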

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • HodeTech/Relavium#20: Both PRs modify the node-handler dispatch wiring to extend supported node types—main PR adds human_in_the_loop/humanGate while the retrieved PR introduced the standard per-type handler composition framework.
  • HodeTech/Relavium#17: Both PRs extend the core engine substrate around ExecutionHost contracts and run-loop control flow—main PR's checkpoint/resume and gate-timeout additions build on the run-loop foundation introduced in retrieved PR.

Poem

🐇 Hop, hop! The gate swings wide or closes tight,
A timer ticks—approve or reject by night.
From frozen events, the state is rebuilt anew,
node:skipped now echoes with branch_not_taken too.
Cross-process or in-mem, the run finds its way,
This bunny checkpointed every carrot today! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.59% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the two main features: checkpoint/resume (1.R) and human-gate suspend/resume with timeout (1.Q), aligning precisely with the PR objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • In resumeFromCheckpoint, the ability to pass new inputs/executionMode/planOptions for an already-started run could diverge the rehydrated execution from the original run:started state; consider either ignoring these in favor of checkpointed values (once available) or enforcing/clarifying the intended invariants so callers can’t accidentally change execution characteristics on resume.
  • The skip-propagation reason in #skipReason is determined by the first condition dependency encountered; if a node has multiple upstream conditions and mixed reasons, consider making the selection rule explicit (e.g. prefer branch_not_taken only when all relevant deps are conditions) or documenting this precedence to avoid surprising node:skipped.reason values.
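
One way to make the precedence in the second point explicit is sketched below. It implements the rule the comment floats (report branch_not_taken only when every dependency that caused the skip is a condition), and it is an assumption about a possible fix, not a description of the current #skipReason. `SkippingDep` and `chooseSkipReason` are hypothetical names.

```typescript
type NodeSkippedReason = 'branch_not_taken' | 'upstream_unreachable';

// Hypothetical description of one upstream dependency that caused the skip.
interface SkippingDep {
  nodeId: string;
  // true when the dependency is a condition node that did not select this node
  isUnselectedCondition: boolean;
}

function chooseSkipReason(deps: readonly SkippingDep[]): NodeSkippedReason {
  // Order-independent rule: branch_not_taken only if ALL skipping deps are
  // unselected conditions; any mix (or no deps at all) is upstream_unreachable.
  const allConditions = deps.length > 0 && deps.every((dep) => dep.isUnselectedCondition);
  return allConditions ? 'branch_not_taken' : 'upstream_unreachable';
}
```

Because the rule looks at all dependencies instead of the first one encountered, the reported reason no longer depends on edge iteration order.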
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `resumeFromCheckpoint`, the ability to pass new `inputs`/`executionMode`/`planOptions` for an already-started run could diverge the rehydrated execution from the original `run:started` state; consider either ignoring these in favor of checkpointed values (once available) or enforcing/clarifying the intended invariants so callers can’t accidentally change execution characteristics on resume.
- The skip-propagation reason in `#skipReason` is determined by the first condition dependency encountered; if a node has multiple upstream conditions and mixed reasons, consider making the selection rule explicit (e.g. prefer `branch_not_taken` only when all relevant deps are conditions) or documenting this precedence to avoid surprising `node:skipped.reason` values.

## Individual Comments

### Comment 1
<location path="packages/core/src/engine/engine.ts" line_range="304-305" />
<code_context>
+
+  /** Prepare a checkpoint-seeded run to resume — set the lifecycle clock. State was seeded in the
+   *  constructor; NO `run:started` is re-emitted (it is already in the persisted log). */
+  prepareResume(): void {
+    this.#startEpochMs = Date.parse(this.#host.clock.now());
+  }
+
</code_context>
<issue_to_address>
**question (bug_risk):** Resumed runs lose pre-crash wall-clock duration in `run:completed.durationMs`.

In `prepareResume` you set `#startEpochMs` to `clock.now()`, so `durationMs` for terminal events only covers time after resume. If `durationMs` is expected to represent total wall-clock run time, this will under-report resumed runs. Consider deriving `#startEpochMs` from the original `run:started.timestamp` stored in the checkpoint, or, if the new behavior is intentional, verify that downstream consumers don’t assume `durationMs` is total duration across resumes.
</issue_to_address>

### Comment 2
<location path="packages/core/src/engine/engine.ts" line_range="1096-1097" />
<code_context>
+      executor: this.#executor,
+      bus,
+      capacity: this.#capacity,
+      onSettled: () => {
+        /* retained like a started run (see start) */
+      },
+      checkpoint,
</code_context>
<issue_to_address>
**issue (bug_risk):** `resumeFromCheckpoint` executions are never evicted from `#runs`, which can leak memory.

`start` wires `onSettled` to remove executions from `#runs`, but `resumeFromCheckpoint` uses a no-op instead. This leaves resumed runs in the map indefinitely in long‑lived processes. Unless there’s a specific need to retain them, consider reusing the same `onSettled` handler as `start` so resumed runs are also removed from `#runs` on completion.
</issue_to_address>


Comment thread packages/core/src/engine/engine.ts Outdated
Comment on lines +1096 to +1097
onSettled: () => {
/* retained like a started run (see start) */


issue (bug_risk): resumeFromCheckpoint executions are never evicted from #runs, which can leak memory.

start wires onSettled to remove executions from #runs, but resumeFromCheckpoint uses a no-op instead. This leaves resumed runs in the map indefinitely in long‑lived processes. Unless there’s a specific need to retain them, consider reusing the same onSettled handler as start so resumed runs are also removed from #runs on completion.
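
The eviction pattern the comment asks for can be sketched in isolation. This is an illustration under assumptions, not the engine's code: `RunRegistrySketch` is a hypothetical stand-in for the engine's #runs map, and the `create` callback stands in for constructing a RunExecution with an onSettled hook.

```typescript
// Hypothetical registry: every registered execution removes itself on settle.
class RunRegistrySketch<E> {
  private readonly runs = new Map<string, E>();

  register(runId: string, create: (onSettled: () => void) => E): E {
    if (this.runs.has(runId)) throw new Error('run_already_active');
    let settled = false;
    const execution = create(() => {
      settled = true;
      this.runs.delete(runId); // evict so long-lived processes do not accumulate runs
    });
    // Guard against an execution that settles synchronously during creation.
    if (!settled) this.runs.set(runId, execution);
    return execution;
  }

  get size(): number {
    return this.runs.size;
  }
}
```

Routing both the start path and the resume path through one registration helper like this would give them the same eviction behaviour by construction.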

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements the Checkpoint/Resume (1.R) and Human Gate Timeout (1.Q) execution model features, allowing runs to resume from a prior process using event-log reconstruction and introducing one-shot timers for gate timeouts. It also adds a new node:skipped event to ensure a complete, replayable event log. The review feedback highlights three key areas for improvement: a potential memory state leak in resumeFromCheckpoint if the resume execution throws an error, a map mutation safety issue during event iteration in reconstructCheckpointState, and a bug in the human gate handler where an abort signal during template resolution could be incorrectly caught and reported as a fatal validation failure instead of a clean cancellation.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +1102 to +1111
this.#runs.set(input.runId, execution);
if (checkpoint.resolvedGateIds.includes(input.gateId)) {
// The gate was already resolved in the prior process (double-delivery); do not re-apply the
// decision — just drive any unfinished downstream work (or re-pause on a remaining gate).
execution.kick();
} else {
// Apply the decision + drive the loop (events buffer on the handle for the returned consumer).
await execution.resume(input.gateId, parsed.data);
}
return execution.handle;


high

If execution.resume throws an error (e.g., due to an unknown_gate or invalid_decision error), the run remains registered in this.#runs. Because of this, any subsequent attempt to call resumeFromCheckpoint for this run will fail with a run_already_active error, leaving the run permanently stuck in memory in a broken state. Wrapping the execution in a try...catch block to clean up this.#runs on failure prevents this state leak.

Suggested change
 this.#runs.set(input.runId, execution);
-if (checkpoint.resolvedGateIds.includes(input.gateId)) {
-  // The gate was already resolved in the prior process (double-delivery); do not re-apply the
-  // decision — just drive any unfinished downstream work (or re-pause on a remaining gate).
-  execution.kick();
-} else {
-  // Apply the decision + drive the loop (events buffer on the handle for the returned consumer).
-  await execution.resume(input.gateId, parsed.data);
-}
+try {
+  if (checkpoint.resolvedGateIds.includes(input.gateId)) {
+    // The gate was already resolved in the prior process (double-delivery); do not re-apply the
+    // decision — just drive any unfinished downstream work (or re-pause on a remaining gate).
+    execution.kick();
+  } else {
+    // Apply the decision + drive the loop (events buffer on the handle for the returned consumer).
+    await execution.resume(input.gateId, parsed.data);
+  }
+} catch (error) {
+  this.#runs.delete(input.runId);
+  throw error;
+}
 return execution.handle;

Comment thread packages/core/src/engine/checkpoint.ts Outdated
Comment on lines +149 to +154
for (const [gateId, nodeId] of pendingGates) {
if (nodeId === event.nodeId) {
pendingGates.delete(gateId);
resolvedGateIds.add(gateId);
}
}


medium

Mutating a Map (pendingGates) while iterating over it using a for...of loop can lead to subtle bugs or maintainability issues if the iteration logic is refactored. It is safer and more declarative to collect the keys to resolve first, and then perform the deletions.

Suggested change
-for (const [gateId, nodeId] of pendingGates) {
-  if (nodeId === event.nodeId) {
-    pendingGates.delete(gateId);
-    resolvedGateIds.add(gateId);
-  }
-}
+const toResolve: string[] = [];
+for (const [gateId, nodeId] of pendingGates) {
+  if (nodeId === event.nodeId) {
+    toResolve.push(gateId);
+  }
+}
+for (const gateId of toResolve) {
+  pendingGates.delete(gateId);
+  resolvedGateIds.add(gateId);
+}
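
For context on this suggestion: in JavaScript, deleting entries from a Map while iterating it with for...of is well-defined, and the iterator simply does not visit entries removed before it reaches them. The sketch below puts the two variants side by side so their equivalence can be checked; the function names are hypothetical and neither function is the code in checkpoint.ts. The choice between them is then a readability call, not a correctness fix.

```typescript
// Variant 1: delete while iterating (well-defined for Map in JavaScript).
function resolveInPlace(pending: Map<string, string>, resolved: Set<string>, nodeId: string): void {
  for (const [gateId, owner] of pending) {
    if (owner === nodeId) {
      pending.delete(gateId);
      resolved.add(gateId);
    }
  }
}

// Variant 2: collect the matching gate ids first, then mutate.
function resolveCollected(pending: Map<string, string>, resolved: Set<string>, nodeId: string): void {
  const toResolve: string[] = [];
  for (const [gateId, owner] of pending) {
    if (owner === nodeId) toResolve.push(gateId);
  }
  for (const gateId of toResolve) {
    pending.delete(gateId);
    resolved.add(gateId);
  }
}
```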

Comment on lines +60 to +68
} catch (err) {
// An interpolation failure is an authoring/data fault, not a transient one — fatal `validation`,
// matching the agent handler's prompt-resolution failure mapping (agent-runner.ts).
return failed(
'validation',
err instanceof Error ? err.message : 'gate template interpolation failed',
false,
);
}


medium

If ctx.signal is aborted while resolveTemplate is executing (e.g., during a slow file read or network call), resolveTemplate will throw an error. The catch block will intercept this and return a fatal validation failure instead of a clean cancelled() outcome. Checking ctx.signal.aborted inside the catch block ensures cancellation is handled correctly.

  } catch (err) {
    if (ctx.signal.aborted) {
      return cancelled();
    }
    // An interpolation failure is an authoring/data fault, not a transient one — fatal `validation`,
    // matching the agent handler's prompt-resolution failure mapping (agent-runner.ts).
    return failed(
      'validation',
      err instanceof Error ? err.message : 'gate template interpolation failed',
      false,
    );
  }

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/shared/src/run-event.ts (1)

235-249: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce timeoutAction only when a timeout exists.

HumanGatePausedEventSchema currently accepts timeoutAction without timeoutMs, which creates an invalid persisted state for pause/timeout resume semantics.

Suggested schema guard
 export const HumanGatePausedEventSchema = z.object({
   type: z.literal('human_gate:paused'),
   ...runBase,
   nodeId: nonEmptyString,
   gateId: nonEmptyString,
   gateType: GateTypeSchema,
   message: z.string(),
   assignee: z.string().optional(),
   timeoutMs: nonNegativeInt.optional(),
   timeoutAction: TimeoutActionSchema.optional(),
   expiresAt: z.string().datetime({ offset: true }).optional(),
-});
+}).superRefine((event, ctx) => {
+  if (event.timeoutAction !== undefined && event.timeoutMs === undefined) {
+    ctx.addIssue({
+      code: z.ZodIssueCode.custom,
+      path: ['timeoutAction'],
+      message: 'timeoutAction requires timeoutMs',
+    });
+  }
+});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/src/run-event.ts` around lines 235 - 249, The
HumanGatePausedEventSchema currently allows timeoutAction to be present without
timeoutMs, which violates the intended pause/timeout resume semantics where
timeoutAction should only exist when a timeout is configured. Add a conditional
validation constraint to the HumanGatePausedEventSchema object using Zod's
refine or superRefine method to ensure that if timeoutAction is provided,
timeoutMs must also be present, preventing invalid persisted states.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/core/src/engine/checkpoint.ts`:
- Around line 80-181: The `reconstructCheckpointState` function exceeds the
cognitive complexity threshold (19 > 15) due to its large switch statement
handling multiple event types. Extract the event-application logic into separate
helper functions organized by category: one for handling run events
(run:started, run:paused, run:completed, run:failed, run:cancelled), one for
node events (node:completed, node:failed, node:skipped), one for gate events
(human_gate:paused, human_gate:resumed), and one for accounting (cost:updated).
Call these helpers from the main loop instead of inline switch cases, preserving
all current behavior and state mutations.

In `@packages/core/src/engine/engine.ts`:
- Around line 302-306: The prepareResume() method currently reinitializes the
`#startEpochMs` field to the current host time, which causes the run duration to
be reset on resume rather than continuing from the original start. Modify the
checkpoint serialization to persist the original `#startEpochMs` value (or the
accumulated elapsed duration) when creating a checkpoint, and update the
prepareResume() method to restore that persisted value instead of calling
Date.parse(this.#host.clock.now()). This ensures that run:completed.durationMs
accurately reflects the total elapsed time across the entire run including both
the pre-resume and post-resume segments.
- Around line 1102-1110: The execution is registered in this.#runs at line 1102
before validation occurs in execution.resume() at line 1109. If the resume call
throws a validation error, the half-initialized execution remains in `#runs`,
causing subsequent retries to fail with run_already_active instead of the
original error. Move the this.#runs.set(input.runId, execution) registration to
after both the execution.kick() path (for already-resolved gates) and the
execution.resume() path have completed successfully, or alternatively wrap the
entire resume/kick logic in a try-catch that deletes the execution from `#runs`
before rethrowing any validation errors.
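
For the first engine.ts item above (run duration across a resume), one option is to compute the duration from the original run:started timestamp instead of from the resume time. The helper below is a sketch under that assumption: `totalDurationMs` is a hypothetical name, and it presumes the ISO-8601 timestamp of run:started is available from the reconstructed checkpoint, which this thread does not confirm.

```typescript
// Hypothetical helper: wall-clock duration measured from the ORIGINAL run start.
// runStartedTimestamp: ISO-8601 timestamp of the persisted run:started event.
// nowIso: the host clock's current time as an ISO-8601 string.
function totalDurationMs(runStartedTimestamp: string, nowIso: string): number {
  const startEpochMs = Date.parse(runStartedTimestamp);
  const nowEpochMs = Date.parse(nowIso);
  if (Number.isNaN(startEpochMs) || Number.isNaN(nowEpochMs)) {
    throw new RangeError('expected ISO-8601 timestamps');
  }
  // Clamp at zero in case the resuming host's clock is behind the original one.
  return Math.max(0, nowEpochMs - startEpochMs);
}
```

Note that a duration computed this way includes the time the run spent parked between processes; whether durationMs should count that interval is the question the review leaves open.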

In `@packages/core/src/engine/node-handlers/human-gate.ts`:
- Around line 60-67: In the catch block handling errors from resolveTemplate in
the human-gate.ts file, add logic to distinguish abort errors from other
interpolation failures. First, import the InterpolationError class at the top of
the file. Then, in the catch block (around lines 60-67), add a conditional check
before the existing failed call: if the caught error is an instance of
InterpolationError and its code property equals 'aborted', return cancelled() to
properly indicate the abort status; otherwise, proceed with the existing
failed('validation') logic for other interpolation errors. Additionally, add a
test case that verifies abort signals during template resolution are correctly
handled by returning cancelled() instead of failed().

In `@packages/shared/src/run-event.ts`:
- Around line 205-208: The `selected` field in the `node:completed` schema
currently allows empty arrays via `z.array(nonEmptyString).optional()`, which
creates an ambiguous branch outcome state that should not be permitted. Modify
the schema validation for the `selected` field to ensure that when the array is
present, it must contain at least one element. Use the `.min(1)` method on the
array validation to enforce that empty arrays are rejected, or alternatively add
a `.refine()` constraint that validates the array is non-empty when it is
defined.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec0189e8-fe90-45ed-a3a2-cf9962a20d5a

📥 Commits

Reviewing files that changed from the base of the PR and between 0a0019b and f912ce0.

📒 Files selected for processing (22)
  • docs/architecture/execution-model.md
  • docs/architecture/shared-core-engine.md
  • docs/reference/contracts/sse-event-schema.md
  • packages/core/src/engine/checkpoint.test.ts
  • packages/core/src/engine/checkpoint.ts
  • packages/core/src/engine/engine.test.ts
  • packages/core/src/engine/engine.ts
  • packages/core/src/engine/errors.ts
  • packages/core/src/engine/event-bus.ts
  • packages/core/src/engine/execution-host.test.ts
  • packages/core/src/engine/execution-host.ts
  • packages/core/src/engine/node-executor.ts
  • packages/core/src/engine/node-handlers/dispatcher.ts
  • packages/core/src/engine/node-handlers/human-gate.test.ts
  • packages/core/src/engine/node-handlers/human-gate.ts
  • packages/core/src/engine/node-handlers/node-handlers.e2e.test.ts
  • packages/core/src/engine/node-handlers/node-handlers.test.ts
  • packages/core/src/engine/run-handle.ts
  • packages/core/src/index.ts
  • packages/shared/src/constants.ts
  • packages/shared/src/run-event.test.ts
  • packages/shared/src/run-event.ts

Comment thread packages/core/src/engine/checkpoint.ts
Comment thread packages/core/src/engine/engine.ts Outdated
Comment thread packages/core/src/engine/engine.ts
Comment thread packages/core/src/engine/node-handlers/human-gate.ts
Comment on lines +205 to +208
// The immediate downstream ids a `condition` kept live (its branch selection). Present ONLY for a
// condition's branch outcome — it is the authoritative record checkpoint/resume (1.R) reconstructs
// `selectedTargets` from, so a selected branch that was mid-flight at a crash re-runs (not skipped).
selected: z.array(nonEmptyString).optional(),


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject empty selected arrays in node:completed.

The branch-selection field should be non-empty when present; allowing selected: [] admits an impossible/ambiguous branch outcome.

Suggested tightening
-  selected: z.array(nonEmptyString).optional(),
+  selected: z.array(nonEmptyString).min(1).optional(),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/shared/src/run-event.ts` around lines 205 - 208, The `selected`
field in the `node:completed` schema currently allows empty arrays via
`z.array(nonEmptyString).optional()`, which creates an ambiguous branch outcome
state that should not be permitted. Modify the schema validation for the
`selected` field to ensure that when the array is present, it must contain at
least one element. Use the `.min(1)` method on the array validation to enforce
that empty arrays are rejected, or alternatively add a `.refine()` constraint
that validates the array is non-empty when it is defined.

Verified each finding against current code; fixed the still-valid ones, reverted
one that contradicts engine semantics, skipped two with reasons.

Fixed:
- checkpoint.ts: reduce reconstructCheckpointState cognitive complexity (19→under
  15) by extracting per-category appliers (applyRunEvent / applyNodeEvent /
  applyGateEvent) over a shared accumulator — behavior identical. The gate-resolve
  arm now collects gate ids first, then deletes (no mutate-while-iterating the Map).
- Resumed-run durationMs: the checkpoint now carries the original start epoch
  (`startedAtMs`, from run:started.timestamp); a rehydrated run measures durationMs
  from it (seeded in #seedFromCheckpoint), so a terminal reports total wall-clock
  across pre-/post-resume — not just the post-resume segment. prepareResume removed.
- resumeFromCheckpoint: wrap resume()/kick() in try/catch that deletes the run from
  #runs on a validation throw (unknown_gate / run_not_paused), so a retry isn't
  wrongly rejected with run_already_active and no broken run is stranded in memory.
- human-gate.ts: an abort DURING template resolution now returns cancelled() (a
  deliberate fatal reason) rather than failed('validation') — checked via
  ctx.signal.aborted in the catch; +unit test.
- run-event.ts: HumanGatePausedEvent — timeoutAction is now refused without
  timeoutMs (union-level superRefine; a discriminatedUnion member can't self-refine).
- engine.ts #settle: disarm gate timers via values()+clear() (no array spread);
  document #skipReason precedence (branch_not_taken wins over upstream_unreachable);
  document the ResumeFromCheckpointInput invariant (caller passes the original
  inputs/executionMode until the checkpoint persists them).
- execution-host.ts fireTimers: snapshot the armed set as a named array (keeps the
  required snapshot — a fired callback may arm/disarm timers — without the inline
  spread Sonar flags).
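
The resumeFromCheckpoint cleanup listed above can be pictured with a minimal TypeScript sketch. Everything in it is a simplified, hypothetical stand-in: EngineSketch, RunSketch, and the `runs` map (mirroring the engine's #runs registry) are illustrative names, and only the error codes are taken from the commit message. The real engine.ts signature is not shown in this thread and likely differs.

```typescript
// Hypothetical, simplified stand-ins; not the real @relavium/core types.
class EngineError extends Error {
  code: string;
  constructor(code: string) {
    super(code);
    this.code = code;
  }
}

class RunSketch {
  private readonly pendingGates: Set<string>;
  constructor(pendingGates: string[]) {
    this.pendingGates = new Set(pendingGates);
  }

  // Validates the gate before applying a decision; throws on an unknown gate.
  resume(gateId: string): void {
    if (!this.pendingGates.has(gateId)) throw new EngineError('unknown_gate');
    this.pendingGates.delete(gateId);
  }
}

class EngineSketch {
  // Plays the role of the engine's #runs registry (assumed shape).
  private readonly runs = new Map<string, RunSketch>();

  resumeFromCheckpoint(runId: string, pendingGates: string[], gateId: string): RunSketch {
    if (this.runs.has(runId)) throw new EngineError('run_already_active');
    const run = new RunSketch(pendingGates);
    this.runs.set(runId, run);
    try {
      run.resume(gateId);
    } catch (err) {
      // Without this delete the failed attempt stays registered, and every
      // retry is rejected with run_already_active.
      this.runs.delete(runId);
      throw err;
    }
    return run;
  }

  isTracked(runId: string): boolean {
    return this.runs.has(runId);
  }
}
```

With the delete in place, a first call with a wrong gate id fails with unknown_gate and a retry with the right id goes through; dropping the catch block would make that retry fail with run_already_active instead.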

Reverted / skipped (with reason):
- selected .min(1) (REVERTED): an empty `selected` is a VALID outcome — a condition
  that routes to no branch, which the engine skip-propagates downstream
  (engine.ts #hasLiveEdge); .min(1) would reject that legitimate node:completed.
- onSettled "#runs leak" on resume (SKIP): start() also retains settled runs via a
  no-op onSettled by design (for run_already_terminal reporting; TTL prune is future
  scope) — resumeFromCheckpoint is consistent, not divergent.
- execution-host for-of "unnecessary array" (addressed, not removed): the snapshot
  is load-bearing; restructured to a named array rather than dropped.
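
Why the snapshot is load-bearing: a JavaScript Set iterator also visits entries added while the iteration is running, so firing straight off the live set would run a timer that a callback armed during the same pass. The sketch below illustrates that under stated assumptions and is not the code in execution-host.ts; ManualTimers, arm, disarm, and the has() re-check are hypothetical names and choices.

```typescript
type TimerCallback = () => void;

// Hypothetical manual timer controller; the real one may differ.
class ManualTimers {
  private readonly armed = new Set<TimerCallback>();

  arm(cb: TimerCallback): void {
    this.armed.add(cb);
  }

  disarm(cb: TimerCallback): void {
    this.armed.delete(cb);
  }

  get size(): number {
    return this.armed.size;
  }

  // Fires every timer that was armed when the call started, at most once each.
  fireTimers(): number {
    // Named snapshot, taken before any callback runs.
    const snapshot = Array.from(this.armed);
    let fired = 0;
    for (const cb of snapshot) {
      // Assumed policy: a timer disarmed by an earlier callback in this pass is skipped.
      if (!this.armed.has(cb)) continue;
      this.armed.delete(cb);
      cb();
      fired += 1;
    }
    return fired;
  }
}
```

A callback that arms a new timer therefore sees it fire on the next fireTimers() call, not in the pass that armed it.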

Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cemililik

Contributor Author

@coderabbitai review all

@coderabbitai

coderabbitai Bot commented Jun 14, 2026


@cemililik Sure! I'll kick off a full review of all the changes in this PR right away.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/architecture/shared-core-engine.md (1)

167-179: ⚠️ Potential issue | 🔴 Critical

Remove step_executions from the checkpoint reconstruction description.

Lines 168–169 incorrectly list step_executions rows as an input to checkpoint reconstruction. The code is definitive: reconstructCheckpointState takes only run_events (see packages/core/src/engine/checkpoint.ts:176 and execution-host.ts:308). The header comment in checkpoint.ts explicitly states "There is no checkpoint table — the CheckpointState is derived from the ordered run_events." Rewrite lines 168–169 to name run_events alone as the source, and keep lines 172–173's "pure fold" statement. step_executions is separate run-history metadata, not part of checkpoint reconstruction.
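
As a rough illustration of what "a pure fold over run_events" means here, the following TypeScript sketch rebuilds a tiny state object from an ordered event array and nothing else. The event and state shapes are heavily reduced and hypothetical: the `gate:paused` / `gate:resumed` type strings, RunEventSketch, CheckpointSketch, and reconstructSketch are illustrative names, not the definitions in run-event.ts or checkpoint.ts.

```typescript
// Hypothetical, reduced event shapes (the real union has more members and fields).
type RunEventSketch =
  | { type: 'run:started'; sequenceNumber: number; timestamp: number }
  | { type: 'node:completed'; sequenceNumber: number; nodeId: string }
  | { type: 'gate:paused'; sequenceNumber: number; gateId: string }
  | { type: 'gate:resumed'; sequenceNumber: number; gateId: string };

interface CheckpointSketch {
  startedAtMs: number | undefined;
  lastSequenceNumber: number;
  completedNodes: Set<string>;
  pendingGateIds: Set<string>;
  resolvedGateIds: Set<string>;
}

// Pure: reads only `events`, builds a fresh accumulator, never touches a table.
function reconstructSketch(events: readonly RunEventSketch[]): CheckpointSketch {
  const acc: CheckpointSketch = {
    startedAtMs: undefined,
    lastSequenceNumber: 0,
    completedNodes: new Set<string>(),
    pendingGateIds: new Set<string>(),
    resolvedGateIds: new Set<string>(),
  };
  for (const event of events) {
    acc.lastSequenceNumber = event.sequenceNumber;
    switch (event.type) {
      case 'run:started':
        acc.startedAtMs = event.timestamp;
        break;
      case 'node:completed':
        acc.completedNodes.add(event.nodeId);
        break;
      case 'gate:paused':
        acc.pendingGateIds.add(event.gateId);
        break;
      case 'gate:resumed':
        // A resolved gate leaves the pending set and is remembered as resolved.
        acc.pendingGateIds.delete(event.gateId);
        acc.resolvedGateIds.add(event.gateId);
        break;
    }
  }
  return acc;
}
```

Because the function takes only the event array, replaying the same events always yields the same state, which is the property the documentation fix is meant to describe.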

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/architecture/shared-core-engine.md` around lines 167 - 179, The
checkpoint reconstruction description incorrectly includes step_executions rows
as input data when in fact only run_events should be listed as the source.
Remove the references to step_executions (status, attempt_number, output_json,
error_json) from lines 168–169 and rewrite that sentence to state that the
Checkpointer reconstructs CheckpointState solely from the ordered, replayable
run_events log. Keep the description of the messages field and preserve the
subsequent explanation of reconstructCheckpointState as a pure fold operation
over the ordered event stream that derives CheckpointState fields. This aligns
the documentation with the actual code implementation where
reconstructCheckpointState takes only the events parameter, not step_executions
data.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/contracts/sse-event-schema.md`:
- Line 76: The `selected?` field documentation for the `node:completed` event at
line 76 in docs/reference/contracts/sse-event-schema.md currently implies it
always contains at least one target id, but a condition can route to no branch
making it an empty array. Update the description of the `selected?` field to
explicitly clarify that it can be an empty array. Apply the same clarification
to the other affected location at lines 122-123 in the same file where
`selected` is documented.

In `@packages/core/src/engine/checkpoint.test.ts`:
- Around line 103-127: The test for the resumed gate scenario in the `'a resumed
gate clears the pending gate + records the decision as the node output'` test
case is missing an assertion to verify that the gate id is correctly moved to
`resolvedGateIds` after the gate is resumed. Add an expect statement after the
existing assertions to verify that state?.resolvedGateIds includes the gate id
'g1' that was paused and then resumed, ensuring the gate tracking is correct for
idempotent re-delivery detection.

In `@packages/core/src/engine/node-handlers/human-gate.test.ts`:
- Line 1: The file human-gate.test.ts has code formatting violations detected by
Prettier. Run prettier --write on this file to automatically fix all formatting
issues according to the project's Prettier configuration.

In `@packages/core/src/index.ts`:
- Around line 93-116: The export statements in packages/core/src/index.ts are
not properly formatted according to prettier standards and are causing the CI
prettier --check to fail. Run prettier formatting on all export blocks in the
file at lines 93-116 (the StartInput, ResumeFromCheckpointInput, and related
engine exports), lines 124-126 (additional exports), and lines 169-170
(checkpoint-related exports) to ensure they all comply with the project's
formatting standards before merging.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cf6bba9f-c688-44de-9840-91d2379b3df0

📥 Commits

Reviewing files that changed from the base of the PR and between 0a0019b and 012b2bb.


Comment thread docs/reference/contracts/sse-event-schema.md Outdated
Comment thread packages/core/src/engine/checkpoint.test.ts
Comment thread packages/core/src/engine/node-handlers/human-gate.test.ts
Comment thread packages/core/src/index.ts Outdated
Comment on lines +93 to +116
export type {
  StartInput,
  ResumeFromCheckpointInput,
  WorkflowEngineDeps,
} from './engine/engine.js';
export { RunEventBus } from './engine/event-bus.js';
export type { RunEventBusOptions, RunEventListener, RunEventDraft } from './engine/event-bus.js';
export type { RunHandle } from './engine/run-handle.js';
export {
  InMemoryRunStore,
  createInMemoryHost,
  createInMemoryCheckpointer,
  createAbortController,
  createManualTimerController,
} from './engine/execution-host.js';
// Checkpointer + resume (1.R) — reconstruct a run's state from its persisted event stream (no checkpoint
// table; ADR-0003). The in-memory reference ships here; the SQLite/cloud one is Phase-2/CLI.
export { reconstructCheckpointState, CHECKPOINT_SCHEMA_VERSION } from './engine/checkpoint.js';
export type {
  Checkpointer,
  CheckpointState,
  CheckpointNodeState,
  CheckpointPendingGate,
} from './engine/checkpoint.js';


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reformat the updated export surface.

CI is already failing prettier --check here, so the new export blocks need to be formatted before merge.

Also applies to: 124-126, 169-170


Source: Pipeline failures

…#22 review)

- Run Prettier on the four files the CI format:check flagged (engine.ts,
  engine.test.ts, index.ts, human-gate.test.ts) — formatting only, no logic change.
- sse-event-schema.md: clarify node:completed.selected MAY be an empty array (a
  condition routing to no branch), matching the reverted .min(1) and the engine's
  skip-propagation — both the event table and the interface block.
- shared-core-engine.md: align the checkpoint-reconstruction description with the
  implementation — CheckpointState is folded from the ordered run_events log alone
  (each node's output/error rides node:completed/node:failed); step_executions /
  messages are denormalized persistence for the run-trace UI, not inputs the fold
  requires (reconstructCheckpointState takes only events).
- checkpoint.test.ts: assert the resumed gate id moves into resolvedGateIds (the
  idempotent re-delivery guard), not just that pendingGates clears.
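
To make the empty-array case concrete: with selected = [] every edge out of the condition is dead, so its targets and anything reachable only through them are skipped. The sketch below is a hypothetical reduction of that idea; liveNodes, Edge, and the topological-order input are illustrative assumptions, and the engine's actual #hasLiveEdge logic is not shown in this thread.

```typescript
interface Edge {
  from: string;
  to: string;
}

// `order` must be a topological ordering of the node ids (assumed precondition).
// `selectedByCondition` maps a condition node id to the targets it kept live.
function liveNodes(
  order: readonly string[],
  edges: readonly Edge[],
  selectedByCondition: ReadonlyMap<string, readonly string[]>,
): Set<string> {
  const live = new Set<string>();
  for (const id of order) {
    const incoming = edges.filter((edge) => edge.to === id);
    const hasLiveEdge = incoming.some((edge) => {
      if (!live.has(edge.from)) return false; // a skipped source cannot keep a target live
      const selected = selectedByCondition.get(edge.from);
      // A non-condition source keeps all its edges live; a condition keeps only selected ones.
      return selected === undefined || selected.includes(id);
    });
    // Root nodes (no incoming edges) are live by definition in this sketch.
    if (incoming.length === 0 || hasLiveEdge) live.add(id);
  }
  return live;
}
```

In this model an empty `selected` is simply the case where no outgoing edge survives, which is why rejecting it at the schema level would turn a legitimate routing outcome into a validation error.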

Refs: ADR-0003, ADR-0036
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud


@cemililik cemililik merged commit 7013d49 into main Jun 14, 2026
7 checks passed
cemililik added a commit that referenced this pull request Jun 15, 2026
Post-merge roadmap + status update now that checkpoint/resume (1.R) and the human
gate (1.Q) have merged.

- phase-1-engine-and-llm.md: ✅ Done markers on §1.Q and §1.R; top status block
  records the PR #22 landing and the remaining 1.m4 lane (1.S, 1.AC).
- current.md: status narrative + next-workstream pointer advanced to node retry
  (1.S); last-updated 2026-06-15.
- CLAUDE.md: status paragraph + detailed status reflect 1.R/1.Q landed, 1.S next.
- deferred-tasks.md: re-point the now-landed-context items — the structuredClone
  `ctx`-transport obligation moves off 1.R (the checkpoint carries no resolved ctx)
  to the ctx-threading work; mid-tool-loop resume noted as Phase-2 (1.R resumes at
  gate boundaries only); the ctx-threading fold-into-1.Q/1.R window noted closed
  (now its own task). New "Checkpoint/resume + human gate (1.R/1.Q) follow-ups"
  section captures the three confirmed Phase-2 deferrals (gate-timer re-arm on
  rehydration, content-hash workflow-snapshot identity guard, cross-process
  gate-resolve TOCTOU → store-level uniqueness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>