fix(sdk): FileDb in-memory cache authoritative — fixes stale status after disk write failures#757
Conversation
… errors
JsonFileWorkflowDb was re-reading the jsonl file on every getRun /
getStepsByRunId call. When the backing file was unwritable — e.g. the
relayfile workspace ACL denies writes to /project/.agent-relay/ in a
cloud sandbox, so the dir chmod's to 0444 and appendFileSync throws
EACCES — every state mutation was silently dropped. The process
continued, steps ran correctly, but the stored WorkflowRunRow never
left its initial 'running' status.
Downstream symptom: WorkflowRunner.execute() ended with
updateRunStatus(runId, 'completed') → cache-less updateRun dropped it
→ finally block's getRun re-read the stale jsonl → execute returned
{ status: 'running' } → user's workflow.ts treated != 'completed' as
failure → process.exit(1) → cloud bootstrap reported the whole run
as FAILED despite 5/5 steps passing.
Fix: the in-memory cache is now the single source of truth. Every
mutation updates the cache synchronously. The jsonl file is a
best-effort persistence layer used to replay state at construction
(for --resume) and is otherwise write-only. If the disk write fails,
we surface it via console.warn once (matching pre-existing behavior)
and via a new onWriteFailure callback (fires every time), but the
runtime state stays coherent.
Three pieces in this change:
1. Cache-first reads: getRun / getStepsByRunId return from the
in-memory map. snapshot() is only called once, at construction,
to replay whatever exists on disk.
2. Structured failure signal: new JsonFileWorkflowDbOptions.onWriteFailure
listener, invoked for every failed append. Callers (CLI, dashboard,
bootstrap) can surface persistent write failures beyond the
once-per-process console.warn.
3. Opt-in $HOME fallback: JsonFileWorkflowDbOptions.homeFallback makes
unwritable paths route to $HOME/.agent-relay/workflow-runs-<base>.jsonl.
Default is false (strict path semantics, matches pre-cache behavior);
true is useful for cloud sandboxes where the workflow cwd is
workspace-ACL-restricted but $HOME is always writable. Deferred
wiring into the CLI — --resume in cloud is tracked separately.
Plain-string construction is preserved: `new JsonFileWorkflowDb(path)`
still works identically. Callers opt in to the new options via
`new JsonFileWorkflowDb({ filePath, onWriteFailure, homeFallback })`.
Adds file-db.test.ts with eight tests covering: happy-path round-trip,
stale-disk cache consistency (the regression guard), strict-default
memory-only mode, opt-in $HOME fallback, failure-listener semantics,
replay-on-construction for --resume, step cache across runs, and
hasStepOutputs path resolution. Pre-existing resume-fallback warn-once
test still passes — optimistic "writable=true on last candidate" path
preserves the lazy-failure semantics that test guards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23699e5135
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!isLastCandidate) { | ||
| accessSync(path.dirname(candidate), fsConstants.W_OK); | ||
| } | ||
| return { resolvedPath: candidate, writable: true }; |
There was a problem hiding this comment.
Check file-level write access before accepting primary path
When homeFallback is enabled, resolveStoragePath only probes path.dirname(candidate) and immediately selects the primary path. If that directory is writable but the jsonl file itself is read-only (or blocked by file-level ACLs), appendFileSync will fail on every write and the DB never switches to the $HOME fallback, so persistence/--resume remain broken even though fallback was explicitly requested. Consider validating writability of the candidate file (when it exists) or retrying with the fallback after the first append failure.
Useful? React with 👍 / 👎.
PR review feedback from @chatgpt-codex-connector: when homeFallback is on, resolveStoragePath only checked `path.dirname(candidate)`. A writable directory does NOT guarantee a writable file — the exact cloud scenario (relayfile-mount chmod's synced jsonl files to 0o444 while leaving .agent-relay/ at 0o755) slipped through: the primary path was accepted, every append failed lazily, and fallback never fired. Persistence stayed broken despite the caller explicitly opting in to fallback. Extend the non-last-candidate probe: if the jsonl file already exists, accessSync it for W_OK too. Only when both dir-write and (conditional) file-write pass do we accept the candidate. Otherwise the exception propagates into the catch and the loop tries the next candidate. Behavior matrix: - Dir writable, file missing → dir probe passes; accept. - Dir writable, file writable → both probes pass; accept. - Dir writable, file readonly → file probe throws; try next. ← the review scenario - Dir readonly (file existence n/a) → dir probe throws; try next. - Last candidate (homeFallback off) → no probes; optimistic. Adds a regression test that sets up the exact writable-dir + readonly-file combination and asserts fallback still fires. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed Codex's P2 feedback on The non-last-candidate probe now also checks file-level writability when the jsonl already exists. The review scenario — writable directory but read-only file (exact shape of a relayfile-mount-synced file with workspace ACL enforcing Behavior matrix:
Regression test added: constructs the writable-dir + read-only-file combination and asserts the storage path resolves to the |
Two review fixes on PR #757: 1. Stale doc: the class-level JSDoc described homeFallback as defaulting to true, but the constructor was already defaulting to false (the option's own JSDoc had the correct story). Corrected the class doc. 2. Aliasing hazard (Devin P2): InMemoryWorkflowDb shallow-copies on insert (`this.runs.set(run.id, { ...run })`). JsonFileWorkflowDb was storing the caller's reference directly. runner.ts keeps each inserted row in its own stepStates map and later mutates state.row.status directly — through shared-reference aliasing, that mutation was landing in the cache and bypassing updateRun/updateStep's append + timestamp path, undermining the exact invariant this cache was built to enforce. Shallow-copy on insert for both runs and steps brings JsonFileWorkflowDb in line with the reference implementation. Regression test added that mutates the caller's object post-insert and asserts the cache is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed Devin's two findings:
Both 19/19 pass across file-db + resume test files. |
Problem
`JsonFileWorkflowDb` re-reads the backing jsonl on every `getRun` / `getStepsByRunId`. When the filesystem denies writes — as it does in a cloud Daytona sandbox where the relayfile workspace ACL marks `/project/.agent-relay/` read-only — every state mutation is silently dropped and subsequent reads return the pre-failure row.
Observed failure: workflows pass every step, post to channels correctly, write every deliverable file, and still report `status: 'running'` to the caller. Concrete trace from a live cloud run:
```
[workflow] completed
Workflow "hi-interactive-workflow" — COMPLETED
5 passed, 0 failed, 0 skipped
[workflow 01:48] Shutting down broker...
Run status: running ← execute() returns stale row
FAILED ← caller's status check fails
Bootstrap fatal error: ... bun exit 1 ← non-zero exit kills the run
[workflow] warning: failed to write run state to /project/.agent-relay/workflow-runs.jsonl
— --resume will not be available for this run. Use --start-from instead.
Error: EACCES: permission denied, open