Skip to content

fix(sdk): FileDb in-memory cache authoritative — fixes stale status after disk write failures#757

Merged
khaliqgant merged 3 commits intomainfrom
fix/filedb-cache-authoritative
Apr 20, 2026
Merged

fix(sdk): FileDb in-memory cache authoritative — fixes stale status after disk write failures#757
khaliqgant merged 3 commits intomainfrom
fix/filedb-cache-authoritative

Conversation

@khaliqgant
Copy link
Copy Markdown
Member

@khaliqgant khaliqgant commented Apr 20, 2026

Problem

`JsonFileWorkflowDb` re-reads the backing jsonl on every `getRun` / `getStepsByRunId`. When the filesystem denies writes — as it does in a cloud Daytona sandbox where the relayfile workspace ACL marks `/project/.agent-relay/` read-only — every state mutation is silently dropped and subsequent reads return the pre-failure row.

Observed failure: workflows pass every step, post to channels correctly, write every deliverable file, and still report `status: 'running'` to the caller. Concrete trace from a live cloud run:

```
[workflow] completed
Workflow "hi-interactive-workflow" — COMPLETED
5 passed, 0 failed, 0 skipped
[workflow 01:48] Shutting down broker...
Run status: running ← execute() returns stale row
FAILED ← caller's status check fails
Bootstrap fatal error: ... bun exit 1 ← non-zero exit kills the run


The warn that fires in this scenario also confirms it:

[workflow] warning: failed to write run state to /project/.agent-relay/workflow-runs.jsonl
— --resume will not be available for this run. Use --start-from instead.
Error: EACCES: permission denied, open


## Fix

**The in-memory cache is now the single source of truth.** Every mutation updates the cache synchronously. The jsonl file becomes a best-effort persistence layer — used once at construction to replay state for `--resume`, and otherwise write-only. Runtime correctness no longer depends on disk writes succeeding.

Three pieces:

1. **Cache-first reads**: `getRun` / `getStepsByRunId` return from the in-memory map. `snapshot()` is only called once, at construction, to replay whatever exists on disk.

2. **Structured failure signal**: new `JsonFileWorkflowDbOptions.onWriteFailure` listener fires on every failed append. Callers (CLI, dashboard, cloud bootstrap) can aggregate / surface write failures beyond the once-per-process `console.warn`.

3. **Opt-in `$HOME` fallback**: `JsonFileWorkflowDbOptions.homeFallback = true` routes to `$HOME/.agent-relay/workflow-runs-<base>.jsonl` when the preferred path is unwritable. Default is `false` — strict path semantics, matching the pre-cache behavior (the existing `'should warn once when append fails'` test passes unchanged). \`--resume\` in cloud is tracked as follow-up work.

Plain-string construction is preserved: \`new JsonFileWorkflowDb(path)\` still works identically. Callers opt in to the new options via \`new JsonFileWorkflowDb({ filePath, onWriteFailure, homeFallback })\`.

## Test plan

- [x] New `packages/sdk/src/__tests__/file-db.test.ts` — 8 tests covering round-trip, **stale-disk cache consistency (the regression guard)**, strict-default memory-only, opt-in `$HOME` fallback, failure listener semantics, replay-on-construction for `--resume`, step cache, and `hasStepOutputs` path resolution.
- [x] Existing `resume-fallback.test.ts` (including `'should warn once when append fails'`) still passes — optimistic `writable=true` when the unwritable path is the only candidate preserves lazy-failure semantics that test relies on.
- [x] Existing `builder-resume-persistence.test.ts` passes.
- [x] Ran the full SDK vitest suite; the two failures that remain (`models.test.ts` package-entry, `workflow-runner.test.ts > should record review completion in trajectory…`) also fail on unmodified `main` — pre-existing flakes, not caused by this PR.
- [ ] After merge + release: rerun `hi-interactive.ts` in cloud; runner should return `status: 'completed'` and the bootstrap should not report FAILED.

## Follow-up (not in this PR)

The cloud orchestrator agent still has no `fs:write` scope on `/project/.agent-relay/`, so `--resume` can't round-trip even with this fix — the jsonl isn't persisted across processes. Three options: (1) grant the scope in the workspace ACL, (2) have the cloud bootstrap construct `JsonFileWorkflowDb` with `homeFallback: true`, (3) accept `--resume` as local-only. Captured in the trail trajectory; will revisit after this lands.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/agentworkforce/relay/pull/757" target="_blank">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
    <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
  </picture>
</a>
<!-- devin-review-badge-end -->

… errors

JsonFileWorkflowDb was re-reading the jsonl file on every getRun /
getStepsByRunId call. When the backing file was unwritable — e.g. the
relayfile workspace ACL denies writes to /project/.agent-relay/ in a
cloud sandbox, so the dir chmod's to 0444 and appendFileSync throws
EACCES — every state mutation was silently dropped. The process
continued, steps ran correctly, but the stored WorkflowRunRow never
left its initial 'running' status.

Downstream symptom: WorkflowRunner.execute() ended with
updateRunStatus(runId, 'completed') → cache-less updateRun dropped it
→ finally block's getRun re-read the stale jsonl → execute returned
{ status: 'running' } → user's workflow.ts treated != 'completed' as
failure → process.exit(1) → cloud bootstrap reported the whole run
as FAILED despite 5/5 steps passing.

Fix: the in-memory cache is now the single source of truth. Every
mutation updates the cache synchronously. The jsonl file is a
best-effort persistence layer used to replay state at construction
(for --resume) and is otherwise write-only. If the disk write fails,
we surface it via console.warn once (matching pre-existing behavior)
and via a new onWriteFailure callback (fires every time), but the
runtime state stays coherent.

Three pieces in this change:

1. Cache-first reads: getRun / getStepsByRunId return from the
   in-memory map. snapshot() is only called once, at construction,
   to replay whatever exists on disk.

2. Structured failure signal: new JsonFileWorkflowDbOptions.onWriteFailure
   listener, invoked for every failed append. Callers (CLI, dashboard,
   bootstrap) can surface persistent write failures beyond the
   once-per-process console.warn.

3. Opt-in $HOME fallback: JsonFileWorkflowDbOptions.homeFallback makes
   unwritable paths route to $HOME/.agent-relay/workflow-runs-<base>.jsonl.
   Default is false (strict path semantics, matches pre-cache behavior);
   true is useful for cloud sandboxes where the workflow cwd is
   workspace-ACL-restricted but $HOME is always writable. Deferred
   wiring into the CLI — --resume in cloud is tracked separately.

Plain-string construction is preserved: `new JsonFileWorkflowDb(path)`
still works identically. Callers opt in to the new options via
`new JsonFileWorkflowDb({ filePath, onWriteFailure, homeFallback })`.

Adds file-db.test.ts with eight tests covering: happy-path round-trip,
stale-disk cache consistency (the regression guard), strict-default
memory-only mode, opt-in $HOME fallback, failure-listener semantics,
replay-on-construction for --resume, step cache across runs, and
hasStepOutputs path resolution. Pre-existing resume-fallback warn-once
test still passes — optimistic "writable=true on last candidate" path
preserves the lazy-failure semantics that test guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23699e5135

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +151 to +154
if (!isLastCandidate) {
accessSync(path.dirname(candidate), fsConstants.W_OK);
}
return { resolvedPath: candidate, writable: true };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check file-level write access before accepting primary path

When homeFallback is enabled, resolveStoragePath only probes path.dirname(candidate) and immediately selects the primary path. If that directory is writable but the jsonl file itself is read-only (or blocked by file-level ACLs), appendFileSync will fail on every write and the DB never switches to the $HOME fallback, so persistence/--resume remain broken even though fallback was explicitly requested. Consider validating writability of the candidate file (when it exists) or retrying with the fallback after the first append failure.

Useful? React with 👍 / 👎.

devin-ai-integration[bot]

This comment was marked as resolved.

PR review feedback from @chatgpt-codex-connector: when homeFallback is
on, resolveStoragePath only checked `path.dirname(candidate)`. A
writable directory does NOT guarantee a writable file — the exact
cloud scenario (relayfile-mount chmod's synced jsonl files to 0o444
while leaving .agent-relay/ at 0o755) slipped through: the primary
path was accepted, every append failed lazily, and fallback never
fired. Persistence stayed broken despite the caller explicitly
opting in to fallback.

Extend the non-last-candidate probe: if the jsonl file already
exists, accessSync it for W_OK too. Only when both dir-write and
(conditional) file-write pass do we accept the candidate. Otherwise
the exception propagates into the catch and the loop tries the
next candidate.

Behavior matrix:
  - Dir writable, file missing          → dir probe passes; accept.
  - Dir writable, file writable         → both probes pass; accept.
  - Dir writable, file readonly         → file probe throws; try next.  ← the review scenario
  - Dir readonly (file existence n/a)   → dir probe throws; try next.
  - Last candidate (homeFallback off)   → no probes; optimistic.

Adds a regression test that sets up the exact writable-dir +
readonly-file combination and asserts fallback still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khaliqgant
Copy link
Copy Markdown
Member Author

Addressed Codex's P2 feedback on resolveStoragePath:

The non-last-candidate probe now also checks file-level writability when the jsonl already exists. The review scenario — writable directory but read-only file (exact shape of a relayfile-mount-synced file with workspace ACL enforcing 0o444 on the file while leaving the parent .agent-relay/ at 0o755) — now correctly throws out of the probe and falls through to the next candidate.

Behavior matrix:

dir file (if present) outcome
writable missing accept
writable writable accept
writable read-only try next candidate ← review fix
read-only n/a try next candidate
last candidate n/a optimistic accept

Regression test added: constructs the writable-dir + read-only-file combination and asserts the storage path resolves to the $HOME fallback, not the primary path. 18/18 tests in the file-db + resume suites pass.

devin-ai-integration[bot]

This comment was marked as resolved.

Two review fixes on PR #757:

1. Stale doc: the class-level JSDoc described homeFallback as defaulting
   to true, but the constructor was already defaulting to false (the
   option's own JSDoc had the correct story). Corrected the class doc.

2. Aliasing hazard (Devin P2): InMemoryWorkflowDb shallow-copies on
   insert (`this.runs.set(run.id, { ...run })`). JsonFileWorkflowDb
   was storing the caller's reference directly. runner.ts keeps each
   inserted row in its own stepStates map and later mutates
   state.row.status directly — through shared-reference aliasing,
   that mutation was landing in the cache and bypassing
   updateRun/updateStep's append + timestamp path, undermining the
   exact invariant this cache was built to enforce.

Shallow-copy on insert for both runs and steps brings
JsonFileWorkflowDb in line with the reference implementation.
Regression test added that mutates the caller's object post-insert
and asserts the cache is unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@khaliqgant
Copy link
Copy Markdown
Member Author

Addressed Devin's two findings:

  1. Stale class-level JSDoc — corrected to 'opt-in, default false' (matches the option's own JSDoc and the constructor default).

  2. Aliasing hazard — legitimate. InMemoryWorkflowDb shallow-copies on insert; JsonFileWorkflowDb was storing the caller's reference directly. The runner keeps each inserted row in its own stepStates map and later mutates state.row.status directly before calling updateStep/updateRun — through shared-reference aliasing, that mutation was landing in the cache and bypassing the append + timestamp path, undermining the exact invariant this cache was built to enforce.

Both insertRun and insertStep now shallow-copy on insert, matching the reference implementation. Added a regression test that inserts, mutates the caller's object post-insert, and asserts the cache is unaffected.

19/19 pass across file-db + resume test files.

@khaliqgant khaliqgant merged commit 9f8014d into main Apr 20, 2026
20 checks passed
@khaliqgant khaliqgant deleted the fix/filedb-cache-authoritative branch April 20, 2026 11:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant