fix(mountsync): short-circuit reconcile when cloud has no new events #92
khaliqgant merged 1 commit into main
The polling reconcile loop times out and stalls on workspaces with non-trivial content (low hundreds of files). Concrete repro: a workspace with 132 Notion pages logs the following over and over:

    mount sync cycle failed: context deadline exceeded
    mount stalled: no successful reconcile for 10m

The default per-cycle deadline is 15s (RELAYFILE_MOUNT_TIMEOUT); the periodic full tree pull (every 20 incremental cycles) and the bootstrap full pull both perform sequential ReadFile calls per entry, which exceeds the deadline once the workspace crosses ~100 files.

Two fixes, both leveraging the existing events feed:

1. Skip-if-no-events short-circuit. When EventsCursor is set and the events feed reports no new events since that cursor (limit=1 probe), pullRemote returns immediately. Most reconcile cycles on a quiet workspace are no-ops; this turns them into a single cheap ListEvents call instead of a full-tree fetch. The trust-but-verify periodic full pull cadence is still respected, but only non-empty cycles count toward its threshold (counting empty cycles would race the periodic pull against the very condition the short-circuit is designed to avoid).

2. Restart fast-path. When EventsCursor is empty but the state file already records tracked files AND a prior LastEventAt (meaning a previous daemon successfully observed events from this workspace), seed the cursor from the events tip and skip the bootstrap full pull entirely. This unblocks the production case where a daemon's state.json persists 289 tracked files + lastEventAt but null eventsCursor (because the previous run's full pull deadlined out before resolveLatestEventCursor ran). Without this, every restart re-tries the same doomed full pull and the daemon stays stalled forever. The LastEventAt gate is load-bearing: callers and tests that hand-seed a state file without ever observing live events still take the full-pull bootstrap path (necessary for e.g. TestPullDeletesLocalDeniedFile, where the full pull is what detects and tears down the now-denied file).

Verification: rebuilt the binary against rw_517d60b6 (acme-demo, 289 tracked files). Pre-fix, every cycle logged "context deadline exceeded". Post-fix, the first reconcile after restart logs "restart fast-path: seeded events cursor evt_1 from 289 tracked files; skipping bootstrap full pull" and "mount sync cycle completed".

Tests:
- TestPullShortCircuitsWhenNoNewEvents: quiet cycles issue exactly one ListEvents probe and zero ListTree/ReadFile calls.
- TestPullRestartFastPathSkipsFullPull: restart with persisted tracked.Files + LastEventAt seeds cursor instead of running full pull.
- TestPullPeriodicFullCycle: updated to append events between cycles (so the cadence counter still advances under the new short-circuit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f57917f4f9
    cursor, err := s.resolveLatestEventCursor(ctx)
    if err == nil {
        s.state.EventsCursor = cursor
Avoid advancing cursor past unseen downtime events
When a restarted daemon has tracked files and LastEventAt but no EventsCursor, resolving the cursor to the current tip and returning skips every event that occurred after the persisted mirror was last updated. In that scenario (for example, a remote file is updated or deleted while the daemon is down), the next reconcile probes from this newly seeded tip, sees no events, and short-circuits; because quiet cycles no longer advance the full-pull counter, the local mirror can remain stale indefinitely until some later non-empty cycle happens to trigger verification.
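The interaction this comment points at can be illustrated with a toy model of the cadence rule: if the full-pull counter only advances on cycles that saw events, a workspace that stays quiet after the cursor is seeded never reaches the threshold. The function below is a simplified sketch for reasoning about that behavior; `countFullPulls` and its parameters are illustrative and are not names from syncer.go.

```go
package main

import "fmt"

// countFullPulls models the cadence rule from the PR description: the
// full-pull counter advances only on cycles where the events probe found
// something. hasEvents and fullPullEvery are illustrative parameters.
func countFullPulls(cycles, fullPullEvery int, hasEvents func(cycle int) bool) int {
	counter, fullPulls := 0, 0
	for i := 0; i < cycles; i++ {
		if !hasEvents(i) {
			continue // quiet cycle: short-circuit, counter untouched
		}
		counter++
		if counter >= fullPullEvery {
			fullPulls++ // trust-but-verify full pull fires
			counter = 0
		}
	}
	return fullPulls
}

func main() {
	always := func(int) bool { return true }
	never := func(int) bool { return false }

	fmt.Println(countFullPulls(100, 20, always)) // 5
	fmt.Println(countFullPulls(100, 20, never))  // 0
}
```

In this model, events missed while the daemon was down would only be corrected once enough later non-empty cycles accumulate, which matches the staleness window the comment describes.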
Walkthrough
This PR optimizes the mount sync reconciliation logic by adding two short-circuit paths that reduce unnecessary work across restarts and quiet cycles. When an events cursor is set, a lightweight probe skips the pull if the feed reports no new events. When restarting with tracked files but no cursor, it seeds the cursor from the events tip rather than re-running the bootstrap full pull. Tests validate both new paths and ensure the periodic full-pull cadence remains intact.
The bug
The polling reconcile loop times out and stalls on workspaces with non-trivial content. Concrete repro: a workspace with 132 Notion pages produces "mount sync cycle failed: context deadline exceeded" in ~/relayfile-mount/.relay/mount.log repeatedly, and relayfile status reports "stall: no successful reconcile for 10m" indefinitely. Restarting the daemon doesn't help: the next reconcile cycle hits the same deadline. Local file edits never propagate to the cloud while the daemon is in this state.
The threshold is in the low hundreds of files. The default per-cycle timeout is 15s (RELAYFILE_MOUNT_TIMEOUT); the periodic full-tree pull (every defaultFullPullEvery=20 incremental cycles) and the bootstrap full pull both perform sequential ReadFile calls per entry, which exceeds the deadline once the workspace crosses ~100 files.

Root cause
Two locations in internal/mountsync/syncer.go were unconditionally falling into the slow full-tree path:
1. The periodic full-pull cadence (fullPullEvery=20, ~10 min at 30s intervals) forced a full tree pull regardless of whether the events feed had reported any activity.
2. Whenever EventsCursor was empty, including on every daemon restart against a workspace whose prior sync had timed out before the cursor was resolved, the reconcile path always ran pullRemoteFull. The state file persisted 289 tracked files + lastEventAt but null eventsCursor, so each restart re-tried the same doomed full pull.
pullRemoteFullTree (syncer.go:1515) iterates every tree entry and calls ReadFile sequentially per file. With 132+ Notion pages this exceeds the 15s deadline deterministically.

Fix
Both paths now leverage the cursor-based events feed the SDK already exposes:
1. Skip-if-no-events short-circuit (syncer.go:1414). When EventsCursor is set and ListEvents(since=cursor, limit=1) returns empty, pullRemote returns immediately. Most reconcile cycles on a quiet workspace are no-ops; this turns them into a single cheap ListEvents call instead of a full-tree fetch. The trust-but-verify periodic full-pull cadence is still respected, but the counter only advances on non-empty cycles: counting empty cycles toward the threshold would race the periodic pull against the very condition the short-circuit is designed to avoid.
2. Restart fast-path (syncer.go:1477). When EventsCursor is empty but the state file already records tracked files AND a prior LastEventAt (i.e. a previous daemon successfully observed events from this workspace), seed the cursor from the events tip and skip the bootstrap full pull entirely. The LastEventAt gate keeps the fast-path opt-in: tests and callers that hand-seed a state file without ever observing live events still take the full-pull bootstrap path (necessary for e.g. TestPullDeletesLocalDeniedFile, where the full pull is what detects and tears down the now-denied file).
Both fixes are no-ops on backends without an events feed: ListEvents returns 404 and we fall through to the existing full-pull path.

Verification
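Rebuilt the binary against rw_517d60b6 (acme-demo, 289 tracked files):
Pre-fix log: every cycle logged "context deadline exceeded".
Post-fix log: the first reconcile after restart logged "restart fast-path: seeded events cursor evt_1 from 289 tracked files; skipping bootstrap full pull" and "mount sync cycle completed".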
Test plan
- TestPullShortCircuitsWhenNoNewEvents: quiet cycles issue exactly one ListEvents probe and zero ListTree/ReadFile calls; a real event still triggers a fetch and updates the mirror.
- TestPullRestartFastPathSkipsFullPull: restart with persisted tracked.Files + LastEventAt seeds the cursor instead of running the full pull; subsequent quiet cycle short-circuits cleanly.
- TestPullPeriodicFullCycle: updated to append events between cycles (so the cadence counter still advances under the new short-circuit).
- internal/mountsync and cmd/relayfile-cli tests pass with go test ./... -count=1.
- acme-demo workspace (289 files): pre-fix permanent stall, post-fix "mount sync cycle completed".

🤖 Generated with Claude Code