fix(writeback): mount-daemon dead-letter + writeback status/retry CLI#84
fix(writeback): mount-daemon dead-letter + writeback status/retry CLI#84khaliqgant merged 5 commits intomainfrom
Conversation
Implements relayfile-side phases (1, 4) of the cross-repo writeback reliability spec at AgentWorkforce/cloud#448. Pattern: relay-80-100. The workflow does not commit until: - new Go test exercises a 4xx response from a httptest server and asserts a dead-letter file gets written - go build ./... clean - go test ./... green (existing suite + the two new tests) - claude reviewer signs off on each phase diff Team split per writing-agent-relay-workflows skill rule §5: lead / impl / tester → codex reviewer → claude The cloud-side phases (plus / 2 / 3) live in a sibling workflow on the cloud repo. Validates: - agent-relay run --dry-run: PASS (0 errors, 0 warnings, 30 steps, 26 waves, peak concurrency 4) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex-as-lead and codex-as-impl on the same channel both interpreted
"owns the channel" / "edit cmd/relayfile-cli/main.go" literally and
fought over file ownership. The retry budget exhausted at attempt 3:
08:23 impl-daemon-fix-r3 → "taking ownership... will only edit main.go"
09:15 lead-coordinate → "Stop. I have already taken ownership"
09:29 impl-daemon-fix-r3 → OWNER_DECISION: INCOMPLETE_RETRY
With only two phases × two implementers there's no real coordinator
to add. The impl agents read the spec directly from {{steps.read-spec.output}};
reviewer-claude is the inter-phase quality gate; collect-evidence is
the deterministic merge gate. No lead step needed.
Removed:
- lead-codex agent
- lead-coordinate step
- lead-signoff step
Validates:
- agent-relay run --dry-run: PASS (0 errors, 0 warnings, 28 steps,
25 waves, peak concurrency 4). impl-daemon-fix starts in wave 5
directly from read-writeback-region.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements relayfile-side phases (1, 4) of the writeback-reliability spec (AgentWorkforce/cloud#448). Authored as the fix output of workflow #83 (workflows/057-writeback-reliability-mount-and-cli.ts), inspected and pruned to scope. Phase 1 — mount daemon stops silently dropping writes: - Every non-2xx PUT response logged with status + truncated body - New FailedWritebacks counter on the daemon state, persisted in state.json (additive, backwards compatible — older mounts deserialize cleanly with default 0) - After retries exhausted, daemon writes ~/relayfile-mount/.relay/dead-letter/<opId>.json with op metadata ({ opId, path, attempts, lastStatus, lastBody, ts }) - writeback_daemon_test.go: drives an httptest 4xx through the daemon, asserts the dead-letter file appears AND failedWritebacks > 0 Phase 4 — CLI surfaces: - `relayfile writeback status [WORKSPACE] [--json]`: prints pending / failed / dead-lettered counts and most-recent error per provider; --json mode for machine consumption; exit code non-zero iff there are failures (so CI can gate) - `relayfile writeback retry --opId OP [WORKSPACE]`: re-enqueues a dead-lettered op and removes the dead-letter file on success - Both gracefully handle missing dead-letter dir / state.json - writeback_status_test.go: builds a fixture mount layout, runs the CLI as a subprocess, asserts output internal/mountsync/syncer.go: 17 lines threading the failed-writeback event through the existing reconcile pipeline (no behaviour change to the read side). Cloud-side phases (plus / 2 / 3) ship in AgentWorkforce/cloud#452. Verification: - go build ./... clean - go test -count=1 -run TestWritebackDeadLetter ./cmd/relayfile-cli/ → ok 1.797s - go test -count=1 -run TestWritebackStatus ./cmd/relayfile-cli/ → ok (covered by full suite) - go test -count=1 ./... → all 8 packages green (cmd/relayfile, cmd/relayfile-cli, cmd/relayfile-mount, internal/httpapi, internal/mountfuse, internal/mountsync, internal/relayfile, internal/schema) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds local writeback-failure capture and retry tooling: an HTTP transport that samples failed writeback requests and writes per-op dead-letter JSON, a persisted ChangesWriteback Failure Tracking & CLI Management
Sequence DiagramsequenceDiagram
participant Client as Local Client
participant Mount as Mount Loop
participant Transport as writebackFailureTransport
participant Server as Remote Server
participant State as .relay/state.json
participant CLI as writeback CLI
Client->>Mount: HandleLocalChange (PUT /fs/file or POST /fs/bulk)
Mount->>Transport: Perform HTTP request
Transport->>Server: Send request
Server-->>Transport: HTTP non-2xx response (e.g., 400)
alt Retries remain
Transport->>Transport: Increment attempt counter
Transport-->>Mount: Return error (retryable)
else Retries exhausted
Transport->>State: Increment FailedWritebacks
Transport->>State: Write dead-letter JSON (opId, path, attempts, lastStatus, lastBody, ts)
Transport-->>Mount: Return error (dead-lettered)
end
Mount->>State: Save sync snapshot (includes FailedWritebacks)
CLI->>State: writeback status
State-->>CLI: Return failed count + dead-letter list
CLI->>State: writeback retry (opId)
State-->>CLI: Return dead-letter record
CLI->>Mount: Re-enqueue local change via HandleLocalChange
Mount->>Transport: Retry HTTP request
Server-->>Transport: HTTP 200 response
Transport->>State: On success, dead-letter file removed
CLI-->>CLI: Return success
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 38138f1ab7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if report.Failed > 0 || len(report.DeadLettered) > 0 { | ||
| return errWritebackFailuresPresent |
There was a problem hiding this comment.
Avoid failing status for recovered writebacks
When a writeback gets a transient 429/5xx that later succeeds on the built-in HTTP retry path, RoundTrip still increments failedWritebacks for the earlier non-2xx attempt, but success only clears the in-memory attempt key and never decrements or resets that persisted counter. Because writeback status returns errWritebackFailuresPresent whenever report.Failed > 0, a fully recovered mount with no pending work and no dead-letter files will keep exiting non-zero forever after any transient retry, making the new status command unusable as a current-health check.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/relayfile-cli/main.go`:
- Around line 1616-1637: resolveWorkspaceLikeStatus currently returns an error
when loadCredentials() fails, preventing local-only inspection; change it to not
fail fast: if loadCredentials() returns an error, continue with zero/nil creds
(do not return) so resolveToken("", creds) and resolveWorkspaceIDWithToken can
still run with an empty token, then proceed to workspaceRecordByID(workspaceID)
and populate record.ID/Name as before. Keep the existing calls to resolveToken,
resolveWorkspaceIDWithToken, and workspaceRecordByID but remove the early return
on loadCredentials() failure so local workspaces.json + mirror can be inspected
without valid credentials.
- Around line 1441-1443: The exit condition should be driven by
current/unresolved failures, not the lifetime counter; remove any use of the
cumulative FailedWritebacks counter and ensure the non-zero exit is based only
on actionable fields (e.g., report.Failed and len(report.DeadLettered) or any
other current/unresolved failure fields), so update the check around
report.Failed/report.DeadLettered in main.go to not consult or return based on
FailedWritebacks and instead rely solely on the report's live failure state.
- Around line 1740-1788: The retry logic in retryDeadLetterWriteback assumes the
remote mount root is "/" which breaks mounts created with a non-root remote
path; update retryDeadLetterWriteback to read the original mount root from the
workspace record (e.g., record.RemoteRoot or record.RemotePath, defaulting to
"/") and pass that into mountsync.NewSyncer via SyncerOptions.RemoteRoot instead
of "/", and adjust retryRelativePath (or create a variant) to strip that remote
root prefix from dead-letter remote paths (using deadLetterRetryPaths output)
before resolving them to local paths so /github/file.md maps to
<localDir>/file.md for a mount rooted at /github. Ensure references:
retryDeadLetterWriteback, deadLetterRetryPaths, retryRelativePath, and
mountsync.NewSyncer / SyncerOptions.RemoteRoot are updated accordingly.
In `@workflows/057-writeback-reliability-mount-and-cli.ts`:
- Around line 291-320: The workflow steps run-daemon-test and
run-daemon-test-final use the -run flag value "TestWritebackDeadLetter" which
doesn't match the actual test name TestWritebackDaemonDeadLettersHTTP400, so the
new test never executes; update the command strings in steps run-daemon-test and
run-daemon-test-final to use a regex that matches the new test (e.g.,
TestWritebackDaemonDeadLetters or TestWritebackDaemonDeadLettersHTTP400) or
broaden to TestWriteback* so the dead-letter test is actually run, and keep
captureOutput/failOnError/verification behavior the same.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 37f71ded-f556-4043-a1e0-1436d5c918ea
📒 Files selected for processing (5)
cmd/relayfile-cli/main.gocmd/relayfile-cli/writeback_daemon_test.gocmd/relayfile-cli/writeback_status_test.gointernal/mountsync/syncer.goworkflows/057-writeback-reliability-mount-and-cli.ts
Addresses two coverage gaps in the writeback-reliability spec acceptance criteria flagged in self-review of PR #84: P4.3 — "Both subcommands handle a missing dead-letter dir / state.json gracefully (don't panic)." The original TestWritebackStatusNoFailures created state.json with failedWritebacks:0 and no dead-letter dir, so it covers the missing-dead-letter-dir half. Adds TestWritebackStatusMissingStateJSON which deliberately omits state.json itself and asserts the CLI prints "no failures" and exits 0. P4.2 — `writeback retry --opId OP` had implementation but no test. Adds two sad-path tests: - TestWritebackRetryUnknownOpIDFailsCleanly — opId references no dead-letter file. Asserts a clear "unknown dead-letter op" error rather than a panic. - TestWritebackRetryRejectsMalformedRecord — dead-letter file exists but contents don't parse as JSON. Asserts "invalid dead-letter record" error AND that the malformed file is preserved (the user can inspect it; we never silently delete bad data). The retry happy-path is intentionally not E2E-tested here. It would require a full HTTP mock of the cloud writeback API plus a local file matching the dead-letter record's path — substantially overlapping with TestWritebackDaemonDeadLettersHTTP400 from writeback_daemon_test.go which already exercises the failure-injection side of the same code path. Documented inline. Verification: - go test -count=1 -run TestWriteback ./cmd/relayfile-cli/ → 6/6 pass (1 daemon + 5 status/retry) - go test -count=1 ./... → 8/8 packages green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
cmd/relayfile-cli/writeback_status_test.go (2)
141-174: ⚡ Quick winAdd missing-state coverage for
writeback status --json.This test currently validates only human output when
state.jsonis absent. Adding a JSON-mode assertion would close the contract gap for the same graceful path.Suggested extension
func TestWritebackStatusMissingStateJSON(t *testing.T) { @@ if got := strings.ToLower(human.String()); !strings.Contains(got, "no failures") { t.Fatalf("expected no-failures marker when state is absent, got: %q", human.String()) } + + var jsonOut bytes.Buffer + if err := run([]string{"writeback", "status", "demo", "--json"}, strings.NewReader(""), &jsonOut, io.Discard); err != nil { + t.Fatalf("expected no error in --json mode when state.json is missing, got %v", err) + } + var report struct { + Failed int `json:"failed"` + DeadLettered []struct{} `json:"deadLettered"` + } + if err := json.Unmarshal(jsonOut.Bytes(), &report); err != nil { + t.Fatalf("parse --json output failed: %v\npayload:\n%s", err, jsonOut.String()) + } + if report.Failed != 0 || len(report.DeadLettered) != 0 { + t.Fatalf("expected failed=0 and no dead-letter entries, got failed=%d deadLettered=%d", report.Failed, len(report.DeadLettered)) + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/relayfile-cli/writeback_status_test.go` around lines 141 - 174, Add a JSON-mode assertion to TestWritebackStatusMissingStateJSON: call run([]string{"writeback","status","demo","--json"}, ...) capturing output (use a bytes.Buffer), then unmarshal the output into a map[string]interface{} and assert that the "failures" key exists and is an empty slice (or equivalent empty value). Modify the test body around the existing human-mode check to perform this additional run/unmarshal/assert using the same workspace ID "ws_demo" and helpers (run, testJWTWithWorkspace) so the graceful missing-state path is covered for JSON output as well.
60-74: ⚡ Quick winIsolate JSON stdout from stderr in the
--jsonassertion path.Line 61 passes the same buffer for stdout and stderr. If any stderr text is emitted in this error-return path, JSON parsing at Line 73 will become flaky.
Suggested change
var jsonOut bytes.Buffer -err = run([]string{"writeback", "status", "demo", "--json"}, strings.NewReader(""), &jsonOut, &jsonOut) +var jsonErr bytes.Buffer +err = run([]string{"writeback", "status", "demo", "--json"}, strings.NewReader(""), &jsonOut, &jsonErr) if !errors.Is(err, errWritebackFailuresPresent) { t.Fatalf("expected errWritebackFailuresPresent in --json mode, got %v", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/relayfile-cli/writeback_status_test.go` around lines 60 - 74, The test uses the same bytes.Buffer (jsonOut) for both stdout and stderr when calling run in the "--json" error path, which can cause stderr content to contaminate JSON parsing; change the call to run in writeback_status_test.go to provide a dedicated stderr buffer (e.g., bytes.Buffer errOut) or io.Discard for stderr while keeping jsonOut for stdout, then continue to assert errors.Is(err, errWritebackFailuresPresent) and json.Unmarshal(jsonOut.Bytes(), &report) so only stdout JSON is parsed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/relayfile-cli/writeback_status_test.go`:
- Around line 267-268: The test currently only fails when os.Stat returns
os.IsNotExist, allowing other stat errors to slip by; change the check that
calls os.Stat(filepath.Join(dlDir, "op_garbled.json")) to fail when statErr !=
nil (i.e., treat any non-nil error as a test failure) so the malformed-file
retention assertion triggers on any unexpected stat error rather than only on
missing files.
---
Nitpick comments:
In `@cmd/relayfile-cli/writeback_status_test.go`:
- Around line 141-174: Add a JSON-mode assertion to
TestWritebackStatusMissingStateJSON: call
run([]string{"writeback","status","demo","--json"}, ...) capturing output (use a
bytes.Buffer), then unmarshal the output into a map[string]interface{} and
assert that the "failures" key exists and is an empty slice (or equivalent empty
value). Modify the test body around the existing human-mode check to perform
this additional run/unmarshal/assert using the same workspace ID "ws_demo" and
helpers (run, testJWTWithWorkspace) so the graceful missing-state path is
covered for JSON output as well.
- Around line 60-74: The test uses the same bytes.Buffer (jsonOut) for both
stdout and stderr when calling run in the "--json" error path, which can cause
stderr content to contaminate JSON parsing; change the call to run in
writeback_status_test.go to provide a dedicated stderr buffer (e.g.,
bytes.Buffer errOut) or io.Discard for stderr while keeping jsonOut for stdout,
then continue to assert errors.Is(err, errWritebackFailuresPresent) and
json.Unmarshal(jsonOut.Bytes(), &report) so only stdout JSON is parsed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 4d6ba0b0-96d7-4319-85e5-ac5d6c9ef7d4
📒 Files selected for processing (1)
cmd/relayfile-cli/writeback_status_test.go
Five actionable review items from chatgpt-codex-connector and coderabbitai. All fixed: A. (Codex P2 + CodeRabbit Major, main.go:1442) `failedWritebacks` is a lifetime counter that only ever increments in this PR. Driving the exit code from `Failed > 0` would keep `writeback status` exiting non-zero forever after any single transient 429/5xx, even after retries succeed and the dead-letter queue empties. Now drives the exit gate from `len(DeadLettered) > 0` only — actionable failures, not historical bookkeeping. Lifetime counter still surfaces in the report for observability. Adds TestWritebackStatusLifetimeCounterDoesNotFailExitCode pinning the contract: `failedWritebacks: 7, deadLettered: []` exits 0. B. (CodeRabbit Major, main.go:1637) `resolveWorkspaceLikeStatus` called `loadCredentials()` before consulting workspaces.json, blocking offline / expired-creds usage that the feature is meant to support. Now tries the local registry first when a workspace name/id is given; only falls back to the credentials path when nothing matches locally (or no value was given and we need the JWT's `wks` claim to identify the default). The new lifetime-counter test (above) deliberately omits saveCredentials() to also pin this behaviour. C. (CodeRabbit Major / heavy lift, main.go:1788) `writeback retry` hardcoded `RemoteRoot: "/"`. For a mount created with `--remote-path /github`, a dead-lettered `/github/file.md` would be looked up at `<localDir>/github/file.md` instead of `<localDir>/file.md`, so replay would fail even though the mirrored file exists. Added `readMountRemoteRoot(localDir)` which reads the live `<localDir>/.relay/state.json` (defaults to `/` when missing or unparseable). `retryRelativePath` now strips the mount root prefix before joining to localDir. D. (CodeRabbit, workflows/057-...ts phase-4 commands) Phase-4 test runner regex `TestWritebackStatus` matched the *Status tests but skipped the *Retry tests added in this PR. Widened to `TestWriteback(Status|Retry)` so retry coverage actually runs in the workflow's gate. E. (CodeRabbit Minor, writeback_status_test.go:268) The malformed-record retention assertion used `os.IsNotExist` only — other stat errors silently passed. Now uses a strict nil-check on stat so any unexpected access error fails the test. Verification: - go build ./... clean - go test -count=1 ./... → all 8 packages green, 7 writeback tests (1 daemon + 6 status/retry, all PASS) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Relayfile-side fix for the productized cloud-mount writeback path. Implements phases 1 + 4 of the writeback-reliability spec at AgentWorkforce/cloud#448. This is the inspected + scope-pruned implementation produced by workflow #83.
Phase 1 — mount daemon stops silently dropping writes.
FailedWritebackscounter persisted instate.json(additive, backwards compatible)~/relayfile-mount/.relay/dead-letter/<opId>.jsonwith op metadatawriteback_daemon_test.godrives anhttptest4xx through the daemon and asserts the dead-letter file appears ANDfailedWritebacks > 0Phase 4 — CLI surfaces.
relayfile writeback status [WORKSPACE] [--json]: pending / failed / dead-lettered counts + most-recent error per provider;--jsonmode for machine parsing; non-zero exit iff failuresrelayfile writeback retry --opId OP [WORKSPACE]: re-enqueues, removes dead-letter file on successwriteback_status_test.gobuilds a fixture mount and asserts CLI outputinternal/mountsync/syncer.goadds 17 lines threading the failed-writeback event through the existing reconcile pipeline (no behaviour change to the read side).Cloud-side phases (plus / 2 / 3) ship in AgentWorkforce/cloud#452.
Verification
All 8 packages green, including the existing test suites — no regressions from the new daemon hooks or
state.jsonschema addition.Test plan
writeback_daemon_test.gotest passeswriteback_status_test.gotest passesgo build ./...cleango test ./...green across all packagesrelayfile writeback statusreports the right counts during a forced 4xx reproCloses part of AgentWorkforce/cloud#448.
🤖 Generated with Claude Code