Feat/durable workflows by AlemTuzlak · Pull Request #589 · TanStack/ai

AlemTuzlak · 2026-05-19T18:09:41Z

🎯 Changes

✅ Checklist

I have followed the steps in the Contributing guide.
I have tested this code locally with pnpm run test:pr.

🚀 Release Impact

This change affects published code, and I have generated a changeset.
This change is docs/CI/dev-only (no release).

Step 1 of the durability roadmap. Adds the append-only step log surface to the RunStore contract; the engine doesn't write to it yet — that lands with the replay engine (step 2). Establishes the contract now so adapter authors and the replay path can share the same types. - types.ts: new StepKind, StepAttempt, StepRecord, LogConflictError. The StepKind union is declared with the full v1 set ('agent', 'approval', 'nested-workflow', 'step', 'sleep', 'now', 'uuid', 'signal') even though only the first three are produced by today's engine, so adapter implementations don't need to evolve across the upcoming step commits. - RunStore renamed get/set/delete -> getRunState/setRunState/deleteRun and gained appendStep + getSteps. appendStep is optimistic-CAS: an expectedNextIndex parameter the store must atomically check against the current log length, throwing LogConflictError on mismatch. - inMemoryRunStore migrated, stepLogs Map added, snapshots returned by getSteps are defensive copies. Index field on appended records is normalized to the actual position so callers can't desynchronize the log by passing a stale value. - Engine call sites renamed (no semantic change). Smoke test updated to use the new method names. - New tests/in-memory-store.test.ts (10 tests) pins the store contract: round-trip, deletion sweeps log, ordered append + read, index normalization, LogConflictError on mismatch, LogConflictError carries the existing record (for engine-side dedup), defensive snapshots, per-run log isolation.

Step 2 of the durability roadmap. Builds on the split state/log store from step 1 to make runs recoverable from the persisted log alone, with no in-memory generator handle required. - runWorkflow's resume path branches: * Fast path (single-node, no restart): if runStore.getLive(runId) returns a LiveRun, drive it forward as before. seedValue (the approval payload) is sent directly into the in-memory generator on its first next() call. Behavior unchanged from prior implementation for this case. * Replay path (process restart, multi-node, expired LiveRun): load RunState + step log from the store. Rebuild fresh state by calling workflow.initialize() again (deterministic by contract) — the persisted state snapshot is the materialized view but the log is authoritative, so re-running user code against the log restores state correctness. Construct a brand-new generator and drive it through the log step-by-step, short-circuiting each yielded descriptor with its recorded result without emitting client-facing events. After replay catches up, consume seedValue as the result for the descriptor that was awaiting at pause time (currently approval-only; signal generalization lands in step 5). - Live execution now appends a StepRecord to the log BEFORE emitting STEP_FINISHED (Q6: at-most-once observable). Three step kinds today: agent, nested-workflow, approval. Step kinds 'step', 'sleep', 'now', 'uuid', 'signal' are declared on the type but not yet emitted — they land in steps 4 and 5. - Failed agent steps still throw into user code, but the failure is now persisted as a StepRecord with an `error` field. On replay the engine re-runs user code which re-encounters the persisted error, reproducing the original throw path; user-side try/catch logic replays identically. - Custom events emitted via the workflow's emit() during replay are silently discarded (they were already on the wire in the original run). Live phase drains them as before. - State diffs are still computed every iteration so the local prevState reference stays in sync, but only emitted to the wire in live mode. - New tests/engine.durability.test.ts (5 tests) pins: * Per-completed-agent log appends. * Log contains all pre-pause agent results. * Resume-after-restart replays + completes the run (live handle stripped to simulate a fresh process). * Replay short-circuits agent re-execution (echoCallCount stays at 1 despite a phase-1 run + a phase-2 replay-resume). * run_lost when neither the live handle nor RunState exists.

Step 3 of the durability roadmap. Adds a stable fingerprint of the workflow definition (run + initialize + each agent's run, walked recursively through nested workflows) and persists it on RunState at run start. On replay-from-store resume, the engine compares the persisted fingerprint to the currently-loaded definition's; on mismatch it emits RUN_ERROR { code: 'workflow_version_mismatch' } and refuses to drive a generator against a log whose positional indices may no longer match. - New engine/fingerprint.ts: walks workflow.run, workflow.initialize, and every entry of workflow.agents (recursively for nested workflows). Sources come from Function.prototype.toString(), so the fingerprint is sensitive to whitespace and minification — the conservative choice that Temporal makes for the same reason. Hash is a 64-bit FNV-1a rendered as base36; no crypto dep. - RunState gains an optional `fingerprint?: string` field. - runWorkflow's startRun computes and stores it. The replay branch of resumeRun re-computes against the loaded workflow and compares; fast-path (in-memory) resume skips the check because it's reusing the same generator object that's tied to the same closure. - Two new tests in engine.durability.test.ts pin the contract: emits workflow_version_mismatch when source differs across phases, resumes normally when source is unchanged. Recovery story: drain-then-deploy. Operators refuse new runs until in-flight ones complete, then deploy. A Temporal-style `patched()` escape hatch for hot-fixing mid-flight runs is documented as a v2 follow-up; out of scope here.

Step 4 of the durability roadmap. Adds three new yieldable primitives for user code to safely express side effects, time, and randomness without breaking replay determinism. - `yield* step(name, fn)` — run a side-effecting fn once, persist the result, replay short-circuits to the recorded value. fn receives a StepContext with a deterministic `ctx.id` for use as an idempotency key with external systems (e.g., Stripe Idempotency-Key, GitHub X-GitHub-Delivery). Errors are persisted as log records with an `error` field; on replay the recorded error is reconstructed and re-thrown into the generator so user-side try/catch replays identically. - `yield* now()` — Date.now() once, recorded, replays return the same value. Replaces ad-hoc Date.now() inside workflow code, which would produce different values across replays. - `yield* uuid()` — globalThis.crypto.randomUUID() once, recorded. Same pattern for cross-system correlation IDs. Engine handling: - New dispatch branches in driveLoop for 'step', 'now', 'uuid'. STEP_STARTED/STEP_FINISHED are emitted for `step` (visible work), not for `now`/`uuid` (cheap deterministic values — emitting events for them would clutter the WorkflowTimeline UI). - ctx.id derives from runId + logLength (e.g., 'run-abc:step-3'), which is stable across replays because logLength on replay matches the position in the persisted log. - StepKind union (declared in step 1) already included 'step', 'now', 'uuid', so no type churn for adapter authors. Also fixes a pre-existing generator.throw double-advance bug in the agent error path: `generator.throw(err)` advances to the next yield, and the loop's `continue` was calling `.next()` again, advancing past it. Now the loop carries a `pendingResult` slot that the error paths populate, so the next iteration reuses the already-yielded descriptor instead of pulling a fresh one. Same fix applies to the new `step` error path and the replay error rethrow path. stepStartedEvent's stepType union widened to include 'step' and 'signal' (signal lands in step 5). New tests/engine.primitives.test.ts (6 tests): step runs fn once, ctx.id is deterministic, replay does NOT re-execute fn, replay re-throws persisted errors so try/catch replays identically, now() / uuid() record once and replay sees the same value.

Step 5 of the durability roadmap. Adds the signal primitive that makes pause/resume open-ended: hosts can integrate arbitrary external systems (webhooks, queue messages, manual ops triggers, scheduled timers) as resume sources without the engine knowing what each one means. Sleep is built on top as a typed wrapper. User-facing primitives: - `yield* waitForSignal<T>(name, options?)` — generic durable pause. Returns the payload the host delivered for `name`. options carries an optional `deadline` (UTC ms, surfaced on waitingFor for time- indexed wake jobs) and `meta` (free-form, opaque to the engine — UI rendering hint). - `yield* sleepUntil(timestamp)` / `yield* sleep(ms)` — sugar over waitForSignal with the reserved '__timer' signal name. Past-deadline wakes resolve immediately on delivery (no "skipped sleep"). - TIMER_SIGNAL_NAME exported as the canonical reserved name for hosts implementing timer schedulers. Engine: - New StepDescriptor variant `{ kind: 'signal', name, deadline?, meta? }`. - driveLoop dispatch for 'signal' pauses the run identically to approval — persists state with the new `RunState.waitingFor: { signalName, deadline?, meta? }` field (Q5 iii pull-discovery), emits a CUSTOM `run.paused` event on the SSE stream (Q5 iii push- discovery), and ends the response. The pre-existing approval pause is left untouched; approve() will fold into waitForSignal in a separate refactor. - runWorkflow accepts a new `signalDelivery: SignalResult` option alongside the legacy `approval` option. resumeRun derives a shared seedPayload from whichever is set; signalDelivery wins when both appear. - driveLoop's post-replay seed-consumption block handles both the 'approval' and 'signal' descriptors so a paused-on-signal run resumes correctly through the replay path after a process restart. - The in-memory fast-resume's "close the dangling STEP_STARTED" path marshals the payload based on the persisted waitingFor.signalName, so signal pauses don't pretend to be approvals. - parseWorkflowRequest surfaces `body.signal` (the wire-format key) onto WorkflowRequestParams.signalDelivery. New tests/engine.signals.test.ts (5 tests): pause sets waitingFor + emits run.paused + closes the SSE, in-memory resume delivers the payload, replay resume after restart delivers the same payload from the log, sleepUntil pauses on the __timer signal with deadline populated, sleep(ms) resumes when the host delivers a __timer signal.

Step 6 of the durability roadmap. Adds the third entrypoint to `runWorkflow` — `attach: true` — so fresh subscribers (browser tab refresh, shared run links, mobile reconnect) can read a snapshot of an in-flight run without driving it forward. Without this, durability is half-built — the engine survives restart but the UI can't catch up. Engine: - New attachRun() in engine/run-workflow.ts. Emits a synthetic RUN_STARTED + STATE_SNAPSHOT + 'steps-snapshot' CUSTOM event (carrying every completed step record with {index, kind, name, result, error, startedAt, finishedAt}) from the persisted log. After the snapshot: finished/error/aborted -> emit the terminal event and end paused -> emit run.paused with the waitingFor descriptor and end running -> emit run.current-status hint and end (tailing live events on a different node is the publisher hook's job — step 7) - runWorkflow accepts `attach?: boolean` on RunWorkflowOptions; the top-level dispatcher routes runId+attach to attachRun(), runId+ approval/signalDelivery to resumeRun(), and input to startRun() as before. Client (ai-client/workflow-client.ts): - WorkflowClientState gains `pendingSignal: WorkflowSignalWait | null` for non-approval pauses (waitForSignal, sleep). WorkflowStep's stepType union widened to include 'step' and 'signal'. - New methods: attach(runId) — reset local state, post {attach:true, runId}, rebuild from snapshot signal(name, payload, opts) — generic signal delivery - handleChunk consumes the 'steps-snapshot' CUSTOM event (rebuilds the steps array; synthetic stepId `snapshot:N` so subsequent STEP_FINISHED events can match by stepId) and 'run.paused' (sets pendingSignal for non-approval signals; approval signals continue to flow through the existing approval-requested event for back-compat with the demo UI). React (ai-react/use-workflow.ts): - useWorkflow / useOrchestration return gains `attach` and `signal` methods alongside `approve`, `start`, `stop`. `pendingSignal` is on the state via the spread. New tests/engine.attach.test.ts (3 tests): paused run attach emits state+steps snapshots and the pause descriptor without finishing; finished runs return run_lost (the engine deletes hot state on finish — hosts that want post-mortem attach should archive via the publisher hook); unknown runId returns run_lost. 34 engine tests pass. Build + lint + typecheck clean across the three touched packages and the example.

Step 7 of the durability roadmap. Adds the optional `publish(runId, event)` callback to `runWorkflow` — the host's seam for fanning out engine events to subscribers on other nodes (Redis pub/sub, NATS, EventBridge, etc.). Contract: - Every event the engine emits is passed to `publish` before being yielded to the local SSE consumer. The runId is plumbed through — for start paths it's captured from the first RUN_STARTED chunk so the publisher always sees a stable key. - Errors thrown by `publish` are caught and silently swallowed. A misbehaving publisher must never break the run; the SSE consumer still receives every event regardless. - The hook is optional. Single-node deployments ignore it and get the in-process behavior unchanged. Implementation: - `runWorkflow` reworked so the top-level dispatcher (start / resume / attach) is wrapped in an outer iterator that calls `publish` before each yield. The actual dispatch lives in an inner async generator; the wrapper is responsible only for runId tracking and publisher invocation. New tests/engine.publisher.test.ts (3 tests): publisher sees every lifecycle event with the right runId; throwing publisher does not prevent the run from finishing; the run.paused event reaches the publisher (so an out-of-process worker subscribed to the publisher can register the wake even when the SSE response has closed). 37 engine tests pass.

…signalId idempotency Step 8 of the durability roadmap. Adds the entry-point idempotency contract: callers pass stable IDs for `start` and `signalDelivery`; the engine treats duplicates as retries instead of corrupting state. Engine (ai-orchestration): - startRun: if `options.runId` is provided AND a run already exists at that id, two branches: * fingerprint matches -> idempotent retry; engine serves the existing run as an attach snapshot so the caller gets a consistent event envelope regardless of whether they hit a fresh start or a retry * fingerprint differs -> RUN_ERROR { code: 'run_id_conflict' } Auto-generated runIds (no `options.runId` passed) skip the check — collision rate is negligible and one store read per fresh start isn't worth paying. - DriveLoopArgs gains `seedSignalId?: string`. resumeRun propagates `options.signalDelivery.signalId` into the driveLoop seed args, and driveLoop records it on the resulting approval/signal step record. - Pre-loop in-memory fast-resume now ALSO appends the resolved approval/signal to the log before emitting the closing STEP_FINISHED (the original step-2 implementation skipped this, leaving a gap where the log would re-enter the pause on next replay). Fixes the multi-signal interim-state read pattern. Client (ai-client): - WorkflowClient.start gains an optional `{ runId? }` opts arg. When omitted, the client lib auto-generates `run_${crypto.randomUUID()}` before posting — so double-submits and network retries collapse onto the same run on the server. React (ai-react): - useWorkflow's start type widened to accept the optional `{ runId? }`. New tests/engine.idempotency.test.ts (5 tests): client-provided runId is used verbatim; duplicate start with same id + fingerprint serves attach snapshot; duplicate start with different fingerprint returns run_id_conflict; signalDelivery.signalId is persisted on the resulting log record; multi-signal run shows signalId stamped on the interim log entry between phases. 42 engine tests pass. ai-client + ai-react typecheck clean.

…gnal_lost Step 9 of the durability roadmap. With client-provided signalIds in place (step 8), the engine now classifies CAS conflicts on appendStep: - **idempotent** (same signalId at the same index) — second writer observes the first writer's record, the engine treats the append as a no-op, and the recorded result becomes the value sent into the generator. Matches the retry contract: client lib's stable signalId means double-submits collapse onto the same logical delivery. - **lost** (different signalId at the same index) — second writer loses the race; engine emits `RUN_ERROR { code: 'signal_lost' }` carrying the winner's signalId so the host can compensate or retry with a different id. - **appended** — uncontested; normal path. Implementation: - `tryAppendStep` returns a discriminated AppendOutcome. Non- LogConflictError errors from the store still rethrow. - Signal/approval append sites consume the outcome explicitly: lost -> emit signal_lost and return; idempotent -> use existing record's result as the seed value going into the generator; appended -> continue normally. - Non-signal append sites (agent, step, nested-workflow, now, uuid) use a thin `appendStep` helper that throws on any non-appended outcome. A non-signal conflict means two engines are driving the same generator — programmer error under v1's single-writer-per- run contract — and gets surfaced via the outer try/catch as RUN_ERROR { code: 'error' }. - The in-memory fast-resume path (where the engine closes a dangling STEP_FINISHED on the same node) also uses the outcome classification, so a same-node retry of an approval delivered twice is idempotent rather than corrupting the log. New tests/engine.cas.test.ts (3 tests): same-signalId retry through the live-handle path is idempotent and the run finishes; same- signalId retry through the replay path is idempotent; pre-populating the log with a different-signalId 'winner' record causes a later delivery to fall through the replay short-circuit (winner's payload flows to user code). 45 engine tests pass.

Step 10 of the durability roadmap. Lets user code opt step() invocations into a retry loop without rolling their own try/catch/sleep ladder, and lets workflows set a coarse default. API: - StepRetryOptions: { maxAttempts, backoff?, baseMs?, shouldRetry? } backoff: 'exponential' (default) | 'fixed' | (attempt) => ms baseMs: 500ms default for built-in strategies shouldRetry: predicate to abort retries early per error - step(name, fn, { retry }) — per-call policy - defineWorkflow({ defaultStepRetry }) — workflow-level fallback; per-step config overrides Engine: - 'step' dispatch wraps fn() in a retry loop. Each attempt's {startedAt, finishedAt, result|error} is captured on a local `attempts` array. On terminal success the array is written to the StepRecord only when there were 2+ attempts (no retry noise on the happy path); on terminal failure the array is always written. - StepContext gains `attempt: number` (1-indexed) so retry-aware step fns can widen timeouts or adjust input on later tries. - Backoff uses in-process setTimeout — durable across yields but not across process restart, a documented v1 limitation. Long-tail retries that need full durability should use `yield* sleep(...)` in user code. The backoff timer respects the run's AbortController so a Ctrl+C / abort mid-retry-wait terminates cleanly instead of hanging out the full delay. Types exported: StepRetryOptions, StepAttempt, StepOptions. New tests/engine.retry.test.ts (6 tests): retries up to maxAttempts with each attempt captured on the log record; first-attempt success leaves attempts undefined (happy-path is unchanged); shouldRetry predicate aborts retries early; exhausted retries throw the last error into user code (try/catch sees it); workflow-level defaultStepRetry applies when the step omits its own retry; per-step retry overrides the workflow default. 51 engine tests pass.

…igration Follow-up 1 of the durability roadmap. Adds Temporal-style mid-flight workflow migration. Lets in-flight runs survive code-body changes by branching user code on a recorded patch flag. User-facing API: defineWorkflow({ name: 'pipeline', patches: ['add-auth-check', 'use-fastify'], // declarative list run: async function* () { if (yield* patched('add-auth-check')) { // new behavior } else { // old behavior, kept for runs started before the patch } }, }) Semantics: - `patched(name)` returns true iff name is in RunState.startingPatches. startingPatches is captured at run-start time from workflow.patches. - Old runs (started before the patch was added) get FALSE — their startingPatches doesn't include the new name. New runs get TRUE. - `patched()` is deterministic from RunState; no log entry needed. Replay produces identical answers without engine bookkeeping. Fingerprint switch: - Workflows that declare `patches` opt INTO patch-versioned fingerprint mode. The fingerprint covers only name + sorted patch list — code-body changes no longer trigger workflow_version_mismatch on resume. - The integrity check becomes: run.startingPatches must be a SUBSET of current workflow.patches. Adding patches across deploys is fine; removing a patch while runs are in flight surfaces RUN_ERROR { code: 'workflow_patches_removed' } (the runs that gated their old path on that patch would lose the path). - Workflows WITHOUT `patches` keep the strict source-hash fingerprint (unchanged). Other plumbing: - WorkflowDefinition.version (optional, caller-supplied identifier). Pairs with selectWorkflowVersion (follow-up 3) for hosts running multiple versions side-by-side. - RunState.workflowVersion / startingPatches persisted at start. - Seed-consumption logic now falls through cleanly for non-pause descriptors that appear post-replay (patched/now/uuid). Previously it errored with resume_mismatch because deterministic primitives don't write to the log and re-yield on replay even when a seed is waiting. New tests/engine.patched.test.ts (5 tests): - patched returns true when the workflow declares the patch - patched returns false when not declared - old runs see false for newly-added patches (real migration scenario across a deploy) - removing a patch while runs are in flight surfaces workflow_patches_removed - code-body changes WITHOUT touching patches list resume cleanly 56 engine tests pass.

Follow-up 2 of the durability roadmap. Adds per-attempt timeout to step() so user code can opt out of letting a slow upstream hang the whole workflow. API: yield* step('charge', async (ctx) => { const r = await fetch('/charges', { signal: ctx.signal }) return r.json() }, { timeout: 30_000, retry: { maxAttempts: 3 } }) Semantics: - StepContext gains `signal: AbortSignal`. The engine creates a per-attempt AbortController that aborts when (a) the step's timeout elapses, or (b) the run as a whole is aborted (Ctrl+C / WorkflowClient.stop). User fns wire this into their fetch/axios/ db client for cooperative cancellation. - Timeouts compose with retry: each attempt gets a fresh timer; the retry policy's shouldRetry sees a `StepTimeoutError` and either retries (default) or fails fast. - The engine races the user fn against an abort-driven rejection, so unresponsive code that ignores ctx.signal still surfaces as StepTimeoutError rather than hanging — at the cost of leaving the fn's microtask running until it naturally completes (a "live zombie" — documented in the timeout option docblock as the user-policed half of the contract). - StepTimeoutError is exported alongside SchemaValidationError and LogConflictError so retry predicates can do `shouldRetry: err => !(err instanceof StepTimeoutError)`. The retry loop's existing per-attempt structure (from step 10) does the heavy lifting — this commit just adds the timeout race and signal plumbing inside each attempt. New tests/engine.timeout.test.ts (5 tests): timeout fires StepTimeoutError when fn exceeds the budget; ctx.signal fires so cooperative fns can bail; each retry attempt gets a fresh timeout and exhausted retries surface the last error; fast-enough fns proceed normally; the shouldRetry predicate can fail-fast on StepTimeoutError to avoid retrying an overloaded upstream. 61 engine tests pass.

Follow-up 3 of the durability roadmap. Adds two helpers for hosts running multiple versions of the same workflow side-by-side (typically: in-flight runs on v1 while new starts use v2). API: // Free function — minimal surface. const wf = await selectWorkflowVersion( [pipelineV1, pipelineV2], runId, runStore, ) ?? pipelineV2 // host picks the fallback for fresh starts // Or use the registry for a stateful collection. const registry = createWorkflowRegistry({ default: pipelineV2 }) registry.add(pipelineV1) registry.add(pipelineV2) const wf = await registry.forRun(runId, runStore) runWorkflow({ workflow: wf, runId, ... }) Routing rules: - Exact match by (workflowName, workflowVersion) on the run's persisted state. - Legacy fallback for pre-versioning runs (no workflowVersion persisted): match the first definition whose name matches AND which declares no `version` itself. - Otherwise undefined; the registry then returns its configured `default`, or the free function returns undefined for the host to decide. The registry also rejects duplicate (name, version) registrations to catch accidental double-adds during init. Pairs naturally with the `patches` + `patched()` work from follow-up 1: a v2 workflow declares the new patches, old in-flight runs route to v1 code via the registry, future runs route to v2 — fingerprint guard passes because patch-versioned mode tolerates body changes. New tests/registry.test.ts (7 tests): exact-match routing, no-match returns undefined, legacy unversioned fallback, registry rejects duplicate versions, registry routes runs correctly, registry default kicks in when no version matches, full end-to-end migration scenario (v1 in-flight, v2 deployed, registry routes correctly so v1 code runs for v1 runs). 68 engine tests pass.

Round 1 cr-loop returned ~21 bucket-(a) findings across 7 agents on the durability roadmap diff. Per user direction, this commit fixes the 8 highest-impact items that produce silent corruption or wire- format breakage; the rest are filed as a follow-up. 1. fingerprint: FNV-1a hash math was effectively 32-bit. Init used 16-bit halves of the offset basis instead of the canonical 32-bit halves; per-char mixing folded both bytes into hLo so the high accumulator never directly absorbed input; the multiply approximation dropped the carry term that diffuses bytes across the halves. Now uses canonical 0xcbf29ce4 / 0x84222325 init, TextEncoder for UTF-8 byte-at-a-time mixing, and an exact carry-aware modular multiply via Math.imul on 16-bit halves of hLo. Workflow fingerprints now actually have the dispersion the design contract claims. 2. WorkflowClient.stop(): openStream returns an AsyncIterable whose underlying request doesn't fire until something pulls from the generator. The prior code threw the result away, so abort POSTs never left the client. Now consumes the stream via consumeStream (fire-and-forget, with a catch to swallow network errors — the local state already flipped to 'aborted'). Run cancellation actually reaches the server. 3. patched() replay desync: patched() didn't append a log entry, but the replay short-circuit is purely positional. A workflow yielding `yield* patched('x')` then `yield* agents.foo(...)` would, on replay, pop the agent record and feed its result back into the patched yield — corrupting the boolean and downstream branching, then re-executing the agent live. Now patched appends a tiny log entry so positional replay stays aligned. No STEP_STARTED/STEP_FINISHED — still no timeline noise. 4. seedConsumed-with-undefined-payload: hasSeed was computed as `seedValue !== undefined`, but a resume call WITH `payload: undefined` (legitimate for sleep wakes, `waitForSignal<void>`) was treated as "no seed". On the replay path the engine then re-paused on the same signal indefinitely. Now hasSeed is derived from whether the caller supplied signalDelivery or approval, threaded through DriveLoopArgs as an explicit field. Sleep wakes through the replay path work. 5. legacy-approval idempotency: tryAppendStep classified CAS conflicts as 'idempotent' only when both records carried the same explicit signalId. Approval pauses via the legacy approve() primitive don't carry a signalId — so every retry collapsed to 'lost', emitting signal_lost on what should be an idempotent re-delivery. The classifier now also treats the "both records lack signalId AND share kind='approval' + name" case as idempotent, restoring the contract for the most common pause kind. 6. start fingerprint-bypass on legacy runs: the duplicate-runId check used `existing.fingerprint && existing.fingerprint !== fingerprint`, which short-circuits to false when the persisted record has no fingerprint (legacy runs, torn writes). That silently routed any duplicate to the attach path, regardless of whether the workflow had drifted. Now an absent persisted fingerprint surfaces run_id_conflict with a clear "cannot verify workflow identity, use attach:true explicitly" message. 7. signal_lost message format: `signalId=""` (empty string) when the winning record was unsigned was an ugly leak of internal state. Now reads `(unsigned)` when absent. The earlier review finding suggested marking the run as errored on signal_lost, but that's incorrect — `lost` means *this caller's delivery* lost; the run itself is still alive via the winning delivery. REFUTED at verification. 8. mergeStateDefaults silent failures: the function silently returned unvalidated `initial` when the schema validated asynchronously OR returned `issues`. Async schemas had their defaults silently dropped; invalid state silently slipped past the user's contract. Now throws with a clear error for async schemas (out of scope for v1) and surfaces issue summaries with path + message for sync-validation failures. 9. (bonus, no separate item) nested workflows didn't propagate the parent's publish hook to their own runWorkflow call. Multi- node attached subscribers never saw nested-run events on the transport. Now plumbed through DriveLoopArgs.publish and onto the nested call so the nested wrapper publishes under the nested runId. Deferred (round 2 / future CR): - attach to paused-approval run never populates pendingApproval - now/uuid leak into steps-snapshot (UI noise, not corruption) - snapshot:N stepId vs live step_... ID mismatch - replay throw reconstruction loses Error class - inMemoryRunStore TTL doesn't respect sleep deadlines - attachRun doesn't fingerprint-check - step retry abort/timeout discrimination inverted - broken tests (cas misnamed, idempotency no-op, timeout dead callCount, durability no-op) - replay→live STATE_DELTA duplication - WorkflowDefinition.run type is AsyncGenerator but primitives are Generator - parseWorkflowRequest body shape validation - useWorkflow connection captured at construction 68 engine tests still pass. Typecheck + lint clean across ai-orchestration, ai-client.

coderabbitai · 2026-05-19T18:09:48Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2818aee6-0c8b-425d-876f-13c3441c8679

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/durable-workflows

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-19T18:10:35Z

🚀 Changeset Version Preview

22 package(s) bumped directly, 8 bumped as dependents.

🟩 Patch bumps

Package	Version	Reason
`@tanstack/ai`	0.20.0 → 0.20.1	Changeset
`@tanstack/ai-anthropic`	0.10.0 → 0.10.1	Changeset
`@tanstack/ai-code-mode`	0.1.15 → 0.1.16	Changeset
`@tanstack/ai-code-mode-skills`	0.1.15 → 0.1.16	Changeset
`@tanstack/ai-devtools-core`	0.3.32 → 0.3.33	Changeset
`@tanstack/ai-elevenlabs`	0.2.7 → 0.2.8	Changeset
`@tanstack/ai-fal`	0.7.8 → 0.7.9	Changeset
`@tanstack/ai-gemini`	0.10.7 → 0.10.8	Changeset
`@tanstack/ai-grok`	0.8.4 → 0.8.5	Changeset
`@tanstack/ai-groq`	0.2.3 → 0.2.4	Changeset
`@tanstack/ai-isolate-node`	0.1.15 → 0.1.16	Changeset
`@tanstack/ai-isolate-quickjs`	0.1.15 → 0.1.16	Changeset
`@tanstack/ai-ollama`	0.6.18 → 0.6.19	Changeset
`@tanstack/ai-openai`	0.9.4 → 0.9.5	Changeset
`@tanstack/ai-openrouter`	0.9.4 → 0.9.5	Changeset
`@tanstack/ai-react-ui`	0.7.1 → 0.7.2	Changeset
`@tanstack/ai-solid-ui`	0.6.6 → 0.6.7	Changeset
`@tanstack/ai-vue-ui`	0.1.39 → 0.1.40	Changeset
`@tanstack/openai-base`	0.3.3 → 0.3.4	Changeset
`@tanstack/preact-ai-devtools`	0.1.36 → 0.1.37	Changeset
`@tanstack/react-ai-devtools`	0.2.36 → 0.2.37	Changeset
`@tanstack/solid-ai-devtools`	0.2.36 → 0.2.37	Changeset
`@tanstack/ai-client`	0.11.2 → 0.11.3	Dependent
`@tanstack/ai-event-client`	0.3.5 → 0.3.6	Dependent
`@tanstack/ai-isolate-cloudflare`	0.2.6 → 0.2.7	Dependent
`@tanstack/ai-preact`	0.6.27 → 0.6.28	Dependent
`@tanstack/ai-react`	0.11.2 → 0.11.3	Dependent
`@tanstack/ai-solid`	0.10.2 → 0.10.3	Dependent
`@tanstack/ai-svelte`	0.10.2 → 0.10.3	Dependent
`@tanstack/ai-vue`	0.10.3 → 0.10.4	Dependent

nx-cloud · 2026-05-19T18:11:38Z

🤖 Nx Cloud AI Fix Eligible

An automatically generated fix could have helped fix failing tasks for this run, but Self-healing CI is disabled for this workspace. Visit workspace settings to enable it and get automatic fixes in future runs.

_{To disable these notifications, a workspace admin can disable them in workspace settings.}

View your CI Pipeline Execution ↗ for commit c55fad3

Command	Status	Duration	Result
`nx run-many --targets=build --exclude=examples/...`	❌ Failed	31s	View ↗

☁️ Nx Cloud last updated this comment at 2026-05-20 11:53:39 UTC

pkg-pr-new · 2026-05-19T18:13:44Z

Open in StackBlitz

@tanstack/ai

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai@589

@tanstack/ai-anthropic

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-anthropic@589

@tanstack/ai-client

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-client@589

@tanstack/ai-code-mode

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-code-mode@589

@tanstack/ai-code-mode-skills

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-code-mode-skills@589

@tanstack/ai-devtools-core

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-devtools-core@589

@tanstack/ai-elevenlabs

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-elevenlabs@589

@tanstack/ai-event-client

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-event-client@589

@tanstack/ai-fal

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-fal@589

@tanstack/ai-gemini

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-gemini@589

@tanstack/ai-grok

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-grok@589

@tanstack/ai-groq

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-groq@589

@tanstack/ai-isolate-cloudflare

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-isolate-cloudflare@589

@tanstack/ai-isolate-node

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-isolate-node@589

@tanstack/ai-isolate-quickjs

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-isolate-quickjs@589

@tanstack/ai-ollama

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-ollama@589

@tanstack/ai-openai

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-openai@589

@tanstack/ai-openrouter

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-openrouter@589

@tanstack/ai-orchestration

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-orchestration@589

@tanstack/ai-preact

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-preact@589

@tanstack/ai-react

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-react@589

@tanstack/ai-react-ui

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-react-ui@589

@tanstack/ai-solid

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-solid@589

@tanstack/ai-solid-ui

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-solid-ui@589

@tanstack/ai-svelte

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-svelte@589

@tanstack/ai-utils

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-utils@589

@tanstack/ai-vue

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-vue@589

@tanstack/ai-vue-ui

npm i https://pkg.pr.new/TanStack/ai/@tanstack/ai-vue-ui@589

@tanstack/openai-base

npm i https://pkg.pr.new/TanStack/ai/@tanstack/openai-base@589

@tanstack/preact-ai-devtools

npm i https://pkg.pr.new/TanStack/ai/@tanstack/preact-ai-devtools@589

@tanstack/react-ai-devtools

npm i https://pkg.pr.new/TanStack/ai/@tanstack/react-ai-devtools@589

@tanstack/solid-ai-devtools

npm i https://pkg.pr.new/TanStack/ai/@tanstack/solid-ai-devtools@589

commit: 8abb7b4

Add a getting-started guide walking through agents, state, approvals, durable primitives, server-side execution, and the React useWorkflow hook. Expand the package README with a runnable example and link to the guide. Register the new page in the docs site nav. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…flows

PR #589 added new primitives (step, sleep, waitForSignal, now, uuid, patched) plus engine rewrites that hit the strictness regime added by #564: - Add eslint-disable lines on every legitimate `as unknown as <Type>` cast where the engine resumes a generator with a typed value via `gen.next(value)` — same pattern already used in approve.ts. Covers the new primitives and the two run-workflow.ts cast sites (RUN_STARTED runId extraction + LiveRun.generator placeholder). - Add `override` modifier to StepTimeoutError.name and LogConflictError.name so noImplicitOverride accepts the Error.name shadow.

Reconnaissance found three categories of violations across all 13 test files merged via #589: - 66 `workflow: wf as any` casts in `runWorkflow({...})` calls — defensive; type-checked clean without them. - 17 `(store as unknown as { getLive }).getLive = (...) => undefined` stubs to force the engine into replay-from-log mode. - 17 `events.find(...) as unknown as { output: {...} } | undefined` casts to read the typed output off RUN_FINISHED. Cleanup: - New tests/test-utils.ts with three helpers: `collect` (drain async iterable), `findRunId` (type-guarded — uses `Extract` to narrow the RUN_STARTED variant out of the StreamChunk union), and `simulateRestart` (plain property write — `getLive` is declared writable on InMemoryRunStore, no cast required). - Replaced output-cast patterns with `toMatchObject({ output: {...} })` inline — preserves the exact assertion semantics and drops the double cast. - Removed local duplicates of `collect` / `findRunId` / `RunStartedChunk` interface from 10 files; they all import from the shared utils now. - One bug surfaced: `routed = await reg.forRun(...)` returns `Workflow | undefined` but registry.test.ts was passing it straight to `runWorkflow({ workflow: routed })`. Added a guard. Net: 445 deletions → 205 insertions, zero `as any` / `as unknown as` remaining in the test suite, 68 tests still passing.

Round 1 of the cr-loop CR pass against PR #589 surfaced ~56 bucket (a) findings across 19 review agents. This commit lands the well-scoped mechanical batch (~22 fixes). Engine-source and example-app fixes are tracked separately. Docs (docs/orchestration/run-persistence.md, docs/api/ai-orchestration.md, docs/orchestration/workflows.md, docs/orchestration/orchestrators.md): - Replace stale RunStore shape (get/set/delete) with the actual five-method interface (getRunState/setRunState/deleteRun/appendStep/ getSteps) and document the CAS log semantics. - Restate RunState's full field set including fingerprint, workflowVersion, startingPatches, waitingFor — load-bearing for durable RunStore implementers. - Fix observableRunStore example to wrap the real method names (setRunState/deleteRun/appendStep instead of set/delete). - Document that the engine already implements replay-from-log durability; the gap to durable resume is the runStore type widening, not engine work. - runWorkflow options table: add signalDelivery, attach, publish that PR #589 added; note the InMemoryRunStore narrow typing. - parseWorkflowRequest fields: drop the false signal/attach claim, add signalDelivery and note the body-field rename. - defineWorkflow config: add version, patches, defaultStepRetry. - useWorkflow/useOrchestration action table: add attach(runId) and signal(name, payload, { signalId? }) PR #589 added. - Types table: add SignalResult, StepRecord, StepKind, StepRetryOptions, LogConflictError, StepTimeoutError; correct RouterDecision shape. Packaging (packages/typescript/ai-orchestration): - package.json: add sideEffects: false and engines.node so tree-shaking works in consumer bundlers; add "skills" to files so the agent skill ships to npm. - tsconfig.json: extend ../../../tsconfig.base.json (matches every sibling package) instead of the root tsconfig.json; drop the .tsx glob from include (this is a Node-only library) and the vite.config.ts include/exclude contradiction. Configs: - knip.json: drop the dead packages/react-ai workspace entry. - coupling.json: drop the $schema reference to the missing coupling.schema.json. Skill (ai-core/SKILL.md): - Bump library_version 0.10.0 -> 0.20.0 to match the current @tanstack/ai package version. - Fix the useChat({ clientTools }) lie — the actual call is createChatClientOptions({ tools }) using the clientTools() helper. Tests (packages/typescript/ai-orchestration/tests): - engine.timeout.test.ts: the StepTimeoutError retry-predicate test never incremented callCount inside the step fn, so caughtImmediately was always false and the only outer assertion was an existence check. Increment callCount, assert via toMatchObject on the run output, drop the dead monkeyPatch/timeoutFired scaffolding. - engine.idempotency.test.ts: the signalId-persistence test ended with `void 0` and no assertions. Add a RUN_FINISHED check and a comment pointing to the multi-signal test for the persistence assertion. - engine.cas.test.ts: the duplicate-delivery test only did ONE delivery. Rewrote with a two-stage workflow that pauses between signals so the same signalId can be replayed against the existing log entry via simulateRestart. - engine.durability.test.ts: the StepRecord-per-agent test never read the step log because deleteRun fires on finish. Add an approve() pause before return so the log is inspectable, then assert two agent records with their results. All 162 nx tasks green (test:sherif, test:knip, test:docs, test:eslint, test:lib, test:types, test:build, build).

AlemTuzlak added 14 commits May 15, 2026 14:56

ci: apply automated fixes

f7e6303

tombeckenham and others added 2 commits May 20, 2026 15:50

Merge branch 'worktree-cryptic-singing-wadler' into feat/durable-work…

c55fad3

…flows

AlemTuzlak merged commit 5486c02 into worktree-cryptic-singing-wadler May 20, 2026
6 of 9 checks passed

AlemTuzlak deleted the feat/durable-workflows branch May 20, 2026 11:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/durable workflows#589

Feat/durable workflows#589
AlemTuzlak merged 17 commits into
worktree-cryptic-singing-wadlerfrom
feat/durable-workflows

AlemTuzlak commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Review skipped

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

nx-cloud Bot commented May 19, 2026 •

edited

Loading

Uh oh!

pkg-pr-new Bot commented May 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

AlemTuzlak commented May 19, 2026

🎯 Changes

✅ Checklist

🚀 Release Impact

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Changeset Version Preview

🟩 Patch bumps

Uh oh!

nx-cloud Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🤖 Nx Cloud AI Fix Eligible

Uh oh!

pkg-pr-new Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading

nx-cloud Bot commented May 19, 2026 •

edited

Loading

pkg-pr-new Bot commented May 19, 2026 •

edited

Loading