mount: incremental apply per-page granularity too coarse — cursor falls behind on workspaces with large-file events #200

@khaliqgant

Summary

On a workspace with substantial Gmail (or other large-file) traffic, the mount daemon's incremental sync makes intra-page progress but does not advance EventsCursor fast enough to keep up with cloud's emit rate, so the events backlog grows monotonically. The fix shipped in #198 preserves work via per-page persistence plus the intra-page IncrementalCheckpoint from #177, but it advances the cursor only when the whole 500-event page completes. On this workspace a single page takes hours of cumulative apply time, while cloud emits another page's worth of events in much less time.

#198 covered the moderate case (cycle deadline mid-pagination of ListEvents). This issue is the severe case: cycle deadline mid-application of a single page.

Measurements on rw_fc7b534b (relayfile v0.7.35, post #198)

  • Daemon cursor: evt_133470 (unchanged across many cycles).
  • Cloud tip: ~evt_150800+ (gap ~17,330 events).
  • ListEvents(cursor=evt_133470, limit=500) → HTTP 200, 174 KB, 615 ms. Cloud-side cursor lookup is fast.
  • First page of 500 events composition:
    • 491 of 500 are file.created / file.updated
    • 438 of 500 are provider=google-mail (mostly /google-mail/messages/*.json and /google-mail/threads/*.json)
    • Some referenced threads were 9 MB pre-#907 (/google-mail/threads/19bda79df6c5e5a9.json) — apply requires a per-file ReadFile round trip + decode + disk write
  • IncrementalCheckpoint state captured live mid-cycle:
    {
      "cursor": "evt_133470",
      "pageCursor": "evt_133970",
      "phase": "changed",
      "path": "/github/repos/AgentWorkforce__cloud/pulls/by-state/open/713.json"
    }
    Daemon is mid-page (evt_133471–evt_133970), applying the changed phase, currently on file ~340 of the page. So intra-page checkpoint works — but a page costs ~5 min of cycle deadline per ~100–200 large files applied, and the page has 491 of them.
  • Recent log shows back-to-back "mount sync cycle failed: context deadline exceeded" errors while the checkpoint inches forward. Every 20th cycle the periodic full-tree pull kicks in and succeeds, which keeps the on-disk mirror current via tree comparison (so the user-visible mirror state isn't wrong, but the event log is permanently stuck).
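
For scale, the cursor gap above can be converted into whole pages the daemon still has to complete. This is only a back-of-the-envelope sketch using the numbers reported in this issue; pagesBehind is a hypothetical helper, not a relayfile function.

```go
package main

import "fmt"

// pagesBehind returns how many ListEvents pages of size limit are needed
// to cover a backlog of gap events (ceiling division).
func pagesBehind(gap, limit int) int {
	return (gap + limit - 1) / limit
}

func main() {
	const cursor, tip = 133470, 150800 // numeric parts of evt_133470 and ~evt_150800
	gap := tip - cursor
	// Backlog in events, then in pages at limit=500 and at limit=50.
	fmt.Println(gap, pagesBehind(gap, 500), pagesBehind(gap, 50)) // prints "17330 35 347"
}
```

At the current limit of 500 the daemon is roughly 35 full pages behind and has not completed the first of them.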

Suggested fixes (any one of these gets us out; combining is even better)

  1. Reduce default page size for apply. Lower ListEvents limit from 500 → e.g. 50. A page of 50 large-file events fits comfortably in one cycle's apply budget; cursor advances every ~30 s instead of every several hours. Trivial change, fully backward compatible.

  2. Parallelize per-event apply within a page. Mirror the bootstrap fan-out from #185 (fix: speed up mount bootstrap fallback): bounded parallel ReadFile workers, default 16, cap 64. The page is embarrassingly parallel, since each event's apply is independent. Expected speedup is roughly 10–30×.

  3. Skip-on-hash for incremental apply. When an event references a (path, revision, contentHash) that matches the local snapshot exactly, skip the ReadFile entirely: the file is already correct, so the event is effectively a metadata-only update. Mirrors what #190 (Skip-on-hash content cache + #181 conflict-scan regression fix) did for bootstrap (storedContentHashForFile). Particularly useful when the periodic full-tree pull has already refreshed a file that now appears in the events backlog: applying that event is pure overhead.

  4. Advance cursor per applied event, not per applied page. Changes the apply semantics from "atomic page" to "stream". Riskier (re-apply on crash now means partial-page replay), but completely eliminates the "cursor lag grows" failure mode. Probably the cleanest long-term but needs careful thought around partial-failure recovery.

Why these matter beyond this workspace

Any workspace whose dominant provider emits frequent updates to large files (Gmail, Drive, Slack channel history with attachments, large GitHub repo metadata) will eventually develop a permanent events backlog. The mirror is kept consistent via the periodic full-pull defense net, but the events feed is effectively unusable for downstream consumers (digests, anything reading LastEventAt, anything resuming from cursor). On rw_fc7b534b we're at a ~17 k event lag and growing.

Doesn't this break #197 / #198?

#197 / #198 are correct and necessary for the case where ListEvents itself times out mid-pagination — the fix preserved work that would otherwise be dropped. This issue is a different failure mode: ListEvents succeeds quickly, but the per-page apply is too slow for the natural cycle deadline. The fixes are complementary.
