Bug: Reconciler silently failing — stuck in_progress beads not recovered after 1.5+ hours #1361

@jrf0110

Parent

Sub-issue of #204 (Phase 4: Hardening)

Summary

Two in_progress issue beads (d93f38f3, f4ac6ea2) have been stuck for 1.5+ hours in town 8a6f9375. Their assigned agents (Toast, Maple) are idle but still hooked to the beads, with dispatch_attempts=2. The alarm is running (5s intervals, nextFireAt advancing). reconcileBeads Rule 3 (STALE_IN_PROGRESS_TIMEOUT_MS = 5 min) should be resetting these beads to open, but it is not firing.

Observed State

  • Beads: in_progress, updated_at = 2026-03-21T05:51:10 (1.5h stale)
  • Agents: idle, hooked to the beads, dispatch_attempts=2, last_activity_at = 2026-03-21T05:55:50
  • Alarm: active (5s), nextFireAt advancing normally
  • Recent events: last event at 05:56, no events for 1.5 hours despite alarm running
  • Workers observability: 0 exceptions, 0 reconciler log messages
  • No triage requests or GUPP escalations created for this problem

Diagnosis

The reconciler (reconcile() at reconciler.ts:299) appears to be silently failing or returning 0 actions every tick. If it were running correctly, Rule 3 would match both beads (stale in_progress, no working/stalled agent hooked, last_activity_at older than 90s).
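For reference, the Rule 3 condition described above can be sketched as a pure predicate. This is a minimal sketch, not the code in reconciler.ts: the `BeadLike` and `AgentLike` shapes, the `hookedBeadId` field, and the `AGENT_ACTIVITY_GRACE_MS` name are assumptions made for illustration. Only the 5 min timeout and the 90s activity threshold come from this issue.

```typescript
// Hypothetical row shapes; the real rows in reconciler.ts may differ.
type AgentStatus = "idle" | "working" | "stalled";

interface BeadLike {
  id: string;
  status: "open" | "in_progress" | "closed";
  updatedAt: number; // epoch ms
}

interface AgentLike {
  status: AgentStatus;
  hookedBeadId: string | null;
  lastActivityAt: number | null; // epoch ms
}

const STALE_IN_PROGRESS_TIMEOUT_MS = 5 * 60 * 1000; // value stated in the issue
const AGENT_ACTIVITY_GRACE_MS = 90 * 1000; // the "90s" above; the constant name is assumed

function isStaleInProgress(bead: BeadLike, agents: AgentLike[], now: number): boolean {
  if (bead.status !== "in_progress") return false;
  if (now - bead.updatedAt < STALE_IN_PROGRESS_TIMEOUT_MS) return false;

  const hooked = agents.filter((a) => a.hookedBeadId === bead.id);
  // A working or stalled agent still owns the bead, so the rule must not fire.
  if (hooked.some((a) => a.status === "working" || a.status === "stalled")) return false;
  // Every hooked agent must have been quiet for longer than the grace window.
  return hooked.every(
    (a) => a.lastActivityAt === null || now - a.lastActivityAt > AGENT_ACTIVITY_GRACE_MS,
  );
}

// The observed state from this issue, evaluated 1.5 hours after updated_at.
const observedBead: BeadLike = {
  id: "d93f38f3",
  status: "in_progress",
  updatedAt: Date.parse("2026-03-21T05:51:10Z"),
};
const observedAgent: AgentLike = {
  status: "idle",
  hookedBeadId: "d93f38f3",
  lastActivityAt: Date.parse("2026-03-21T05:55:50Z"),
};
const observedMatch = isStaleInProgress(
  observedBead,
  [observedAgent],
  Date.parse("2026-03-21T07:21:10Z"),
);
```

Under these assumptions `observedMatch` is true, which is consistent with the claim that Rule 3 should have acted if the reconciler had reached it.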

Possible causes:

  1. Zod parse failure in an earlier rule: reconcileAgents() runs before reconcileBeads(). If AgentRow.array().parse() throws (e.g., a new column added to AgentMetadataRecord.pick() but not selected in the query), the entire reconcile() throws. The catch at Town.do.ts:2977 logs to console.error, which is invisible because DO alarm events aren't captured by Workers observability.

  2. Phase 0 event drain blocking: events.drainEvents() at line 2927 runs before reconcile(). If it throws or hangs, Phase 1 never runs. The catch at line 2947 handles individual events but not a failure in drainEvents() itself.

  3. Rule 3 fires but dispatch immediately re-sets in_progress: Rule 3 emits transition_bead(open) + clear_bead_assignee. But if Rule 1 or Rule 2 also matches in the same pass (idle+hooked agent with now-open bead), dispatch_agent immediately sets it back to in_progress. The net effect is a no-op with events logged. However, we see NO events for 1.5h, ruling this out.

Most likely: cause 1 (Zod parse failure)

The AgentRow schema picks last_event_type, last_event_at, active_tools from AgentMetadataRecord. These columns were recently added via ALTER TABLE. If a town's DO was created before the migration ran, these columns might not exist in the SQLite schema. The AgentRow fields are .nullable().optional() so missing values parse OK, but if the columns don't exist in the table at all, the SQL query itself would throw SqlStorageError: no such column.
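One way to confirm or rule out this cause is to compare the columns the table actually has against the ones the query selects. The sketch below is only the comparison step and assumes the existing column names have already been read (for example with SQLite's `PRAGMA table_info`, if the DO's SQL storage allows it). The table name `agent_metadata` and the `TEXT` column type are placeholders, not taken from the codebase.

```typescript
// Columns named in this issue as recently added via ALTER TABLE.
const REQUIRED_AGENT_COLUMNS = ["last_event_type", "last_event_at", "active_tools"];

// Returns the required columns that are absent from the table (case-insensitive).
function findMissingColumns(existing: readonly string[], required: readonly string[]): string[] {
  const present = new Set(existing.map((c) => c.toLowerCase()));
  return required.filter((c) => !present.has(c.toLowerCase()));
}

// Builds one ALTER TABLE statement per missing column.
// The TEXT type is an assumption; use whatever the real migration declares.
function buildAddColumnStatements(table: string, missing: readonly string[]): string[] {
  return missing.map((c) => `ALTER TABLE ${table} ADD COLUMN ${c} TEXT`);
}

// Example: a town whose table predates the migration (column list is hypothetical).
const oldTownColumns = ["bead_id", "status", "dispatch_attempts", "last_activity_at"];
const missingColumns = findMissingColumns(oldTownColumns, REQUIRED_AGENT_COLUMNS);
const repairStatements = buildAddColumnStatements("agent_metadata", missingColumns);
```

If `missingColumns` is non-empty for town 8a6f9375, the SELECT in reconcileAgents() would fail before Zod ever sees a row, which matches the diagnosis above.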

Recommended Fix

  1. Add try/catch per reconcile sub-function — instead of one big reconcile(), catch errors in each reconcileAgents(), reconcileBeads(), etc. so a failure in one doesn't block all others.

  2. Emit a health event from the reconciler — Write a reconciler_tick town_event every N ticks (e.g. every 60s) with metrics. This makes reconciler health visible via the debug endpoint.

  3. Investigate the actual error — Add a temporary debug field to the /debug/towns/:id/status response that runs reconcile() in a try/catch and returns the error if it throws.
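Fixes 1 and 2 above could look roughly like the following. This is a sketch under stated assumptions: `Phase`, `TickReport`, `runPhasesIsolated`, and `createHealthReporter` are hypothetical names, and the real reconcile sub-functions take arguments that are omitted here. With a 5s alarm, reporting every 12 ticks gives roughly one reconciler_tick event per 60s.

```typescript
type Action = { type: string; [key: string]: unknown };
type Phase = { name: string; run: () => Action[] };

interface TickReport {
  actions: Action[];
  errors: { phase: string; message: string }[];
}

// Runs each phase in its own try/catch so one failure does not block the rest.
function runPhasesIsolated(phases: Phase[]): TickReport {
  const report: TickReport = { actions: [], errors: [] };
  for (const phase of phases) {
    try {
      report.actions.push(...phase.run());
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      report.errors.push({ phase: phase.name, message });
    }
  }
  return report;
}

interface HealthEvent {
  type: "reconciler_tick";
  ticks: number;
  actions: number;
  errors: number;
  lastError: string | null;
}

// Accumulates per-tick metrics and returns a health event on every Nth tick.
// Counters live in memory, so they reset if the DO is evicted.
function createHealthReporter(everyNTicks: number) {
  let ticks = 0;
  let actions = 0;
  let errors = 0;
  let lastError: string | null = null;
  return {
    record(report: TickReport): HealthEvent | null {
      ticks += 1;
      actions += report.actions.length;
      errors += report.errors.length;
      const last = report.errors[report.errors.length - 1];
      if (last) lastError = `${last.phase}: ${last.message}`;
      if (ticks < everyNTicks) return null;
      const event: HealthEvent = { type: "reconciler_tick", ticks, actions, errors, lastError };
      ticks = 0;
      actions = 0;
      errors = 0;
      lastError = null;
      return event;
    },
  };
}
```

The caller would write any non-null return value as a town_event. Because the error message travels inside that event, a failure like the suspected SQL error would show up through the debug endpoint even though console.error output is not captured.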

Affected Code

  • cloudflare-gastown/src/dos/Town.do.ts:2954-2979: reconciler Phase 1
  • cloudflare-gastown/src/dos/town/reconciler.ts:299-306: reconcile() top-level
  • cloudflare-gastown/src/dos/town/reconciler.ts:326-361: reconcileAgents() query + parse

Labels: bug, kilo-auto-fix, kilo-triaged