Description
Parent
Sub-issue of #204 (Phase 4: Hardening)
Summary
Two in_progress issue beads (d93f38f3, f4ac6ea2) have been stuck for 1.5+ hours in town 8a6f9375. Their assigned agents (Toast, Maple) are idle with hooks and dispatch_attempts=2. The alarm is running (5s intervals, nextFireAt advancing). reconcileBeads Rule 3 (STALE_IN_PROGRESS_TIMEOUT_MS = 5 min) should be resetting these beads to open, but isn't acting.
Observed State
- Beads: in_progress, updated_at = 2026-03-21T05:51:10 (1.5h stale)
- Agents: idle, hooked to the beads, dispatch_attempts=2, last_activity_at = 2026-03-21T05:55:50
- Alarm: active (5s), nextFireAt advancing normally
- Recent events: last event at 05:56, no events for 1.5 hours despite alarm running
- Workers observability: 0 exceptions, 0 reconciler log messages
- No triage requests or GUPP escalations created for this problem
Diagnosis
The reconciler (reconcile() at reconciler.ts:299) appears to be silently failing or returning 0 actions every tick. If it were running correctly, Rule 3 would match both beads (stale in_progress, no working/stalled agent hooked, last_activity_at older than 90s).
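The Rule 3 conditions listed above can be sketched as a single predicate. This is not the reconciler's actual code: the type shapes, field names, and the AGENT_ACTIVITY_GRACE_MS name are assumptions made for illustration, the timestamps are assumed to be UTC, and pairing Toast with bead d93f38f3 is a guess from the order in the summary.

```typescript
// Hypothetical shapes; names follow the issue text, not the real schema.
type AgentStatus = "idle" | "working" | "stalled";

interface Bead {
  id: string;
  status: "open" | "in_progress" | "closed";
  updatedAt: string; // ISO timestamp
}

interface Agent {
  name: string;
  status: AgentStatus;
  hookedBeadId: string | null;
  lastActivityAt: string; // ISO timestamp
}

const STALE_IN_PROGRESS_TIMEOUT_MS = 5 * 60 * 1000; // 5 min, per the summary
const AGENT_ACTIVITY_GRACE_MS = 90 * 1000; // 90s, per the diagnosis (name assumed)

function isStaleInProgress(bead: Bead, agents: Agent[], nowMs: number): boolean {
  if (bead.status !== "in_progress") return false;
  if (nowMs - Date.parse(bead.updatedAt) < STALE_IN_PROGRESS_TIMEOUT_MS) return false;
  const hooked = agents.filter((a) => a.hookedBeadId === bead.id);
  // A working or stalled agent on the bead means Rule 3 should not apply.
  if (hooked.some((a) => a.status === "working" || a.status === "stalled")) return false;
  // Every hooked agent must have been quiet for longer than the grace period.
  return hooked.every((a) => nowMs - Date.parse(a.lastActivityAt) > AGENT_ACTIVITY_GRACE_MS);
}

// Observed state from this issue, evaluated roughly 1.5h after the last activity.
const now = Date.parse("2026-03-21T07:25:00Z");
const bead: Bead = { id: "d93f38f3", status: "in_progress", updatedAt: "2026-03-21T05:51:10Z" };
const toast: Agent = {
  name: "Toast",
  status: "idle",
  hookedBeadId: "d93f38f3",
  lastActivityAt: "2026-03-21T05:55:50Z",
};
const ruleThreeMatches = isStaleInProgress(bead, [toast], now);
```

Under these assumptions the predicate holds for the observed state, which is why the lack of any resulting action points at the reconciler not reaching Rule 3 at all.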
Possible causes:
1. Zod parse failure in an earlier rule: reconcileAgents() runs before reconcileBeads(). If AgentRow.array().parse() throws (e.g., a new column added to AgentMetadataRecord.pick() but not selected in the query), the entire reconcile() throws. The catch at Town.do.ts:2977 logs to console.error, which is invisible (DO alarm events aren't captured by Workers observability).
2. Phase 0 event drain blocking: events.drainEvents() at line 2927 runs before reconcile(). If it throws or hangs, Phase 1 never runs. The catch at line 2947 handles individual events but not a failure in drainEvents() itself.
3. Rule 3 fires but dispatch immediately re-sets in_progress: Rule 3 emits transition_bead(open) + clear_bead_assignee. But if Rule 1 or Rule 2 also matches in the same pass (idle+hooked agent with now-open bead), dispatch_agent immediately sets it back to in_progress. The net effect is a no-op with events logged. However, we see NO events for 1.5h, ruling this out.
Most likely: cause 1 (Zod parse failure)
The AgentRow schema picks last_event_type, last_event_at, active_tools from AgentMetadataRecord. These columns were recently added via ALTER TABLE. If a town's DO was created before the migration ran, these columns might not exist in the SQLite schema. The AgentRow fields are .nullable().optional(), so missing values parse OK, but if the columns don't exist in the table at all, the SQL query itself would throw SqlStorageError: no such column. Strictly, that is a query failure rather than a Zod parse failure, but the effect is the same as in cause 1: reconcileAgents() throws, so reconcileBeads() never runs.
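One way to confirm or rule out the missing-column theory is to compare the table's actual columns against the ones the query selects. The sketch below assumes SQLite's PRAGMA table_info is available in the DO's SQL storage. The RunQuery type is a stand-in for however the DO executes SQL, and the table name agent_metadata is hypothetical.

```typescript
// Minimal row shape returned by SQLite's PRAGMA table_info.
interface ColumnInfo {
  name: string;
}

// Stand-in for whatever runs SQL inside the DO (assumption, not the real API).
type RunQuery = (sql: string) => ColumnInfo[];

// The three columns named in this issue as recently added.
const REQUIRED_AGENT_COLUMNS = ["last_event_type", "last_event_at", "active_tools"];

function findMissingColumns(runQuery: RunQuery, table: string, required: string[]): string[] {
  // Table names cannot be bound as parameters, so only accept plain identifiers.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(table)) {
    throw new Error(`unsafe table name: ${table}`);
  }
  const present = new Set(runQuery(`PRAGMA table_info(${table})`).map((c) => c.name));
  return required.filter((col) => !present.has(col));
}

// Fake runner simulating a town whose table predates the ALTER TABLE migration.
const preMigrationTown: RunQuery = () => [{ name: "id" }, { name: "status" }];
const missing = findMissingColumns(preMigrationTown, "agent_metadata", REQUIRED_AGENT_COLUMNS);
```

A non-empty result for town 8a6f9375 would support the missing-column explanation, while an empty one would shift suspicion to the event drain.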
Recommended Fix
1. Add try/catch per reconcile sub-function: instead of one big reconcile(), catch errors in each reconcileAgents(), reconcileBeads(), etc. so a failure in one doesn't block all others.
2. Emit a health event from the reconciler: write a reconciler_tick town_event every N ticks (e.g. every 60s) with metrics. This makes reconciler health visible via the debug endpoint.
3. Investigate the actual error: add a temporary debug field to the /debug/towns/:id/status response that runs reconcile() in a try/catch and returns the error if it throws.
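The first two recommendations could look roughly like the sketch below. It is illustrative only: the Action, Phase, and TickResult types, the function names, and the event payload fields are all assumptions, not the existing reconciler API. The 12-tick interval is derived from the 5s alarm and the suggested 60s cadence.

```typescript
// Hypothetical action and phase types; the real reconciler's shapes may differ.
type Action = { type: string };
type Phase = { name: string; run: () => Action[] };

interface PhaseFailure {
  phase: string;
  message: string;
}

interface TickResult {
  actions: Action[];
  failures: PhaseFailure[];
}

// Run every phase, recording failures instead of letting one abort the rest.
function runPhasesIsolated(phases: Phase[]): TickResult {
  const result: TickResult = { actions: [], failures: [] };
  for (const phase of phases) {
    try {
      result.actions.push(...phase.run());
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      result.failures.push({ phase: phase.name, message });
    }
  }
  return result;
}

const HEALTH_EVENT_EVERY_N_TICKS = 12; // 12 ticks x 5s alarm = 60s

// Build a reconciler_tick payload on every Nth tick, or null otherwise.
function maybeHealthEvent(tickCount: number, result: TickResult) {
  if (tickCount % HEALTH_EVENT_EVERY_N_TICKS !== 0) return null;
  return {
    type: "reconciler_tick",
    actionCount: result.actions.length,
    failures: result.failures,
  };
}

// Simulated tick: reconcileAgents throws, reconcileBeads still produces actions.
const result = runPhasesIsolated([
  {
    name: "reconcileAgents",
    run: () => {
      throw new Error("no such column: last_event_type");
    },
  },
  {
    name: "reconcileBeads",
    run: () => [{ type: "transition_bead" }, { type: "clear_bead_assignee" }],
  },
]);
const healthEvent = maybeHealthEvent(12, result);
```

Returning failures as data rather than only logging them means the same TickResult could feed both the periodic health event and the temporary debug field from the third recommendation.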
Affected Code
- cloudflare-gastown/src/dos/Town.do.ts:2954-2979 (reconciler Phase 1)
- cloudflare-gastown/src/dos/town/reconciler.ts:299-306 (reconcile() top-level)
- cloudflare-gastown/src/dos/town/reconciler.ts:326-361 (reconcileAgents() query + parse)