Summary
On a workspace with substantial Gmail (or other large-file) traffic, the mount daemon's incremental sync makes intra-page progress but does not advance EventsCursor fast enough to keep up with cloud's emit rate, so the events backlog grows monotonically. The fix shipped in #198 preserves work via per-page persistence + the intra-page IncrementalCheckpoint from #177, but advances cursor only when the whole 500-event page completes. On this workspace a single page takes hours of cumulative apply time, while cloud emits another page's worth of events in much less.
#198 covered the moderate case (cycle deadline mid-pagination of ListEvents). This issue is the severe case: cycle deadline mid-application of a single page.
Measurements on rw_fc7b534b (relayfile v0.7.35, post #198)
Daemon cursor: evt_133470 (unchanged across many cycles).
Daemon is mid-page (evt_133471–evt_133970), applying the changed phase, currently on file ~340 of the page. So intra-page checkpoint works — but a page costs ~5 min of cycle deadline per ~100–200 large files applied, and the page has 491 of them.
Recent log shows back-to-back mount sync cycle failed: context deadline exceeded while the checkpoint inches forward. Every 20th cycle the periodic full-tree pull kicks in and succeeds, which keeps the on-disk mirror current via tree comparison (so the user-visible mirror state isn't wrong — but the event log is permanently stuck).
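The behaviour measured above (checkpoint inching forward while the cursor stays put) can be illustrated with a toy model. This is only a sketch: `Checkpoint`, `run_cycle`, the `applied` counter, and the per-cycle budget of 150 events are hypothetical stand-ins, not relayfile's actual types or numbers.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    cursor: int       # last fully applied page boundary, e.g. 133470
    page_cursor: int  # end of the page currently being applied, e.g. 133970
    applied: int = 0  # events of the current page already applied (hypothetical field)


def run_cycle(cp: Checkpoint, budget: int, page_size: int = 500) -> Checkpoint:
    """Apply up to `budget` events of the current page, then hit the cycle deadline.

    Toy simplification: budget left over after finishing a page is not
    carried into the next page.
    """
    done = min(budget, page_size - cp.applied)
    applied = cp.applied + done
    if applied == page_size:
        # Only a fully applied page moves the cursor.
        return Checkpoint(cursor=cp.page_cursor,
                          page_cursor=cp.page_cursor + page_size)
    # Deadline hit mid-page: intra-page progress is kept, cursor is unchanged.
    return Checkpoint(cp.cursor, cp.page_cursor, applied)


if __name__ == "__main__":
    cp = Checkpoint(cursor=133470, page_cursor=133970)
    for cycle in range(1, 5):
        cp = run_cycle(cp, budget=150)
        print(cycle, cp)
```

In this model the cursor is identical after cycles 1 through 3 and only moves on the cycle that finishes the page, which is the shape of what the logs show.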
Suggested fixes (any one of these gets us out; combining is even better)
Reduce default page size for apply. Lower the ListEvents limit from 500 to, e.g., 50. A page of 50 large-file events fits comfortably in one cycle's apply budget; cursor advances every ~30 s instead of every several hours. Trivial change, fully backward compatible.
Parallelize per-event apply within a page. Mirror the bootstrap fan-out from #185 (fix: speed up mount bootstrap fallback): bounded parallel ReadFile workers, default 16, cap 64. The page is naturally embarrassingly parallel, since each event's apply is independent. Expected speedup is the obvious 10–30×.
Skip-on-hash for incremental apply. When an event references a (path, revision, contentHash) that matches the local snapshot exactly, skip the ReadFile entirely: the file is already correct and the event is a metadata-only update. This mirrors what #190 (Skip-on-hash content cache + #181 conflict-scan regression fix) did for bootstrap (storedContentHashForFile). Particularly useful when the periodic full-tree pull has already refreshed a file that's now appearing in the events backlog: applying that event is pure overhead.
Advance cursor per applied event, not per applied page. Changes the apply semantics from "atomic page" to "stream". Riskier (re-apply on crash now means partial-page replay), but completely eliminates the "cursor lag grows" failure mode. Probably the cleanest long-term but needs careful thought around partial-failure recovery.
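To make the last option concrete, here is a minimal sketch of "stream" semantics: persist the cursor after every applied event, so a deadline or crash costs at most one event of replay. All names (`save_cursor`, `load_cursor`, `apply_stream`, the JSON state file) are hypothetical and not relayfile's API; the sketch assumes `apply_event` is idempotent, since an event applied just before a crash would be replayed.

```python
import json
import os
import tempfile
from typing import Callable, Iterable


def save_cursor(path: str, cursor: str) -> None:
    """Persist the cursor atomically: write a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"cursor": cursor}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_cursor(path: str, default: str) -> str:
    """Return the persisted cursor, or `default` if nothing was saved yet."""
    try:
        with open(path) as f:
            return json.load(f)["cursor"]
    except FileNotFoundError:
        return default


def apply_stream(events: Iterable[dict],
                 apply_event: Callable[[dict], None],
                 state_path: str,
                 deadline_reached: Callable[[], bool]) -> int:
    """Apply events in order and advance the cursor after each one.

    Returns the number of events applied before the deadline.
    """
    applied = 0
    for event in events:
        if deadline_reached():
            break
        apply_event(event)                    # must be idempotent (replayed after a crash)
        save_cursor(state_path, event["id"])  # cursor now points at this event
        applied += 1
    return applied
```

The trade-off is one small durable write per event; whether that cost is acceptable relative to a large-file ReadFile is a judgment call for the real implementation.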
Why these matter beyond this workspace
Any workspace whose dominant provider emits frequent updates to large files (Gmail, Drive, Slack channel history with attachments, large GitHub repo metadata) will eventually develop a permanent events backlog. The mirror is kept consistent via the periodic full-pull defense net, but the events feed is effectively unusable for downstream consumers (digests, anything reading LastEventAt, anything resuming from cursor). On rw_fc7b534b we're at a ~17 k event lag and growing.
Doesn't this break #197 / #198?
#197 / #198 are correct and necessary for the case where ListEvents itself times out mid-pagination; the fix preserved work that would otherwise be dropped. This issue is a different failure mode: ListEvents succeeds quickly, but the per-page apply is too slow for the natural cycle deadline. The fixes are complementary.
Related
#185 (fix: speed up mount bootstrap fallback): same bottleneck class for bootstrap, fixed via skip-on-hash + bounded parallel reads. Worth pattern-matching for incremental apply.
AgentWorkforce/cloud#907 (compacted Gmail records): reduces per-event apply size for new events; doesn't help with the back-catalog of pre-compaction events still in the feed.
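As a rough illustration of the #185 pattern (skip-on-hash + bounded parallel reads) applied to one incremental page, here is a sketch. Everything in it is an assumption for illustration: `apply_page`, the `read_file` / `write_file` callables, the event dict shape, and SHA-256 as the content hash are hypothetical; only the worker default of 16 and cap of 64 come from the description above.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256 hex digest; the real contentHash scheme may differ.
    return hashlib.sha256(data).hexdigest()


def apply_page(events: List[dict],
               local_hashes: Dict[str, str],
               read_file: Callable[[str], bytes],
               write_file: Callable[[str, bytes], None],
               workers: int = 16) -> List[str]:
    """Apply one page of events; return the paths that were actually fetched."""
    workers = max(1, min(workers, 64))  # bounded fan-out: default 16, cap 64

    # Keep only the last event per path so parallel workers never race on a file.
    latest = {e["path"]: e for e in events}

    def apply_one(event: dict) -> Optional[Tuple[str, str]]:
        path = event["path"]
        if local_hashes.get(path) == event["contentHash"]:
            return None  # local copy already matches: skip the ReadFile round trip
        data = read_file(path)
        write_file(path, data)
        return path, content_hash(data)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(apply_one, latest.values()))

    fetched = []
    for result in results:  # update the hash cache on the calling thread only
        if result is not None:
            path, digest = result
            local_hashes[path] = digest
            fetched.append(path)
    return fetched
```

A second call with the same events fetches nothing, which is the case the issue describes: the periodic full-tree pull has already refreshed the file, so the event costs no ReadFile.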
Further measurements on rw_fc7b534b (relayfile v0.7.35, post #198)
Latest cloud event: ~evt_150800+ (gap ~17,330 events from the daemon cursor).
ListEvents(cursor=evt_133470, limit=500) → HTTP 200, 174 KB, 615 ms. Cloud-side cursor lookup is fast.
Event types: file.created / file.updated, provider=google-mail (mostly /google-mail/messages/*.json and /google-mail/threads/*.json).
Example file: /google-mail/threads/19bda79df6c5e5a9.json; apply requires a per-file ReadFile round trip + decode + disk write.
IncrementalCheckpoint state captured live mid-cycle: { "cursor": "evt_133470", "pageCursor": "evt_133970", "phase": "changed", "path": "/github/repos/AgentWorkforce__cloud/pulls/by-state/open/713.json" }
Also related: contentHash cache on File struct, the prereq for skip-on-hash on the incremental path.