Bug: Reconciler silently failing — stuck in_progress beads not recovered after 1.5+ hours

## Parent

Sub-issue of #204 (Phase 4: Hardening)

## Summary

Two `in_progress` issue beads (`d93f38f3`, `f4ac6ea2`) have been stuck for 1.5+ hours in town `8a6f9375`. Their assigned agents (Toast, Maple) are `idle`, still hooked to the beads, with `dispatch_attempts=2`. The alarm is running (5s intervals, `nextFireAt` advancing). `reconcileBeads` Rule 3 (`STALE_IN_PROGRESS_TIMEOUT_MS = 5 min`) should be resetting these beads to `open`, but it is not doing so.

## Observed State

- Beads: `in_progress`, `updated_at = 2026-03-21T05:51:10` (1.5h stale)
- Agents: `idle`, hooked to the beads, `dispatch_attempts=2`, `last_activity_at = 2026-03-21T05:55:50`
- Alarm: `active (5s)`, `nextFireAt` advancing normally
- Recent events: last event at 05:56, no events for 1.5 hours despite alarm running
- Workers observability: 0 exceptions, 0 reconciler log messages
- No triage requests or GUPP escalations created for this problem

## Diagnosis

The reconciler (`reconcile()` at `reconciler.ts:299`) appears to be silently failing or returning 0 actions every tick. If it were running correctly, Rule 3 would match both beads (stale `in_progress`, no `working/stalled` agent hooked, `last_activity_at` older than 90s).
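To make the expectation concrete, here is a minimal sketch of the Rule 3 predicate as described above, evaluated against the observed state. It is not the code in `reconciler.ts`: the type names, `AGENT_ACTIVITY_GRACE_MS`, the assumption that the 5 minute timeout is measured against `updated_at`, the UTC interpretation of the timestamps, and the "now" value (roughly 1.5h after the last event) are all assumptions made for illustration.

```typescript
// Sketch only: mirrors the prose description of Rule 3, not the real implementation.
const STALE_IN_PROGRESS_TIMEOUT_MS = 5 * 60_000;
const AGENT_ACTIVITY_GRACE_MS = 90_000; // hypothetical name for the 90s threshold

type AgentStatus = "idle" | "working" | "stalled";

interface BeadSnapshot {
  status: string;
  updatedAt: string; // ISO timestamp
}

interface AgentSnapshot {
  status: AgentStatus;
  lastActivityAt: string; // ISO timestamp
}

function rule3Matches(
  bead: BeadSnapshot,
  hookedAgent: AgentSnapshot | null,
  nowMs: number,
): boolean {
  // Only in_progress beads are candidates.
  if (bead.status !== "in_progress") return false;
  // The bead must have been untouched for longer than the stale timeout.
  if (nowMs - Date.parse(bead.updatedAt) <= STALE_IN_PROGRESS_TIMEOUT_MS) return false;
  // No hooked agent at all: nothing is working on the bead.
  if (hookedAgent === null) return true;
  // A working or stalled agent still owns the bead.
  if (hookedAgent.status === "working" || hookedAgent.status === "stalled") return false;
  // An idle agent only counts as gone once its last activity is old enough.
  return nowMs - Date.parse(hookedAgent.lastActivityAt) > AGENT_ACTIVITY_GRACE_MS;
}

// Observed state from this issue, timestamps assumed to be UTC.
const observedBead: BeadSnapshot = { status: "in_progress", updatedAt: "2026-03-21T05:51:10Z" };
const observedAgent: AgentSnapshot = { status: "idle", lastActivityAt: "2026-03-21T05:55:50Z" };
const assumedNowMs = Date.parse("2026-03-21T07:25:00Z"); // assumed observation time

const shouldReset = rule3Matches(observedBead, observedAgent, assumedNowMs);
```

Under these assumptions `shouldReset` is true for both beads, which is why a healthy reconciler pass would be expected to emit actions.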

### Possible causes:

1. **Zod parse failure in an earlier rule** — `reconcileAgents()` runs before `reconcileBeads()`. If `AgentRow.array().parse()` throws (e.g., a new column added to `AgentMetadataRecord.pick()` but not selected in the query), the entire `reconcile()` throws. The catch at `Town.do.ts:2977` logs to `console.error` which is invisible (DO alarm events aren't captured by Workers observability).

2. **Phase 0 event drain blocking** — `events.drainEvents()` at line 2927 runs before `reconcile()`. If it throws or hangs, Phase 1 never runs. The catch at line 2947 handles individual events but not a failure in `drainEvents()` itself.

3. **Rule 3 fires but dispatch immediately re-sets in_progress** — Rule 3 emits `transition_bead(open)` + `clear_bead_assignee`. But if Rule 1 or Rule 2 also matches in the same pass (idle+hooked agent with now-open bead), `dispatch_agent` immediately sets it back to `in_progress`. The net effect is a no-op with events logged. However, we see NO events for 1.5h, ruling this out.

### Most likely: cause 1 (Zod parse failure)

The `AgentRow` schema picks `last_event_type`, `last_event_at`, and `active_tools` from `AgentMetadataRecord`. These columns were recently added via `ALTER TABLE`. If a town's DO was created before the migration ran, they might not exist in its SQLite schema. The `AgentRow` fields are `.nullable().optional()`, so missing values parse fine. But if the columns are absent from the table altogether, the SQL query itself would throw `SqlStorageError: no such column` before Zod parsing is reached, and the whole `reconcile()` call would fail with it.
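One way to confirm or rule out this hypothesis is to compare the columns the table actually has against the ones the query selects. The sketch below shows only the comparison step. `findMissingColumns` and `REQUIRED_AGENT_COLUMNS` are hypothetical names, and the "legacy" column list is invented for the example. In the DO, the existing column names would presumably come from something like a `PRAGMA table_info(...)` query on the agent metadata table, which is an assumption about what the storage layer allows.

```typescript
// Hypothetical helper: returns the required columns that the table does not have.
function findMissingColumns(
  existing: Iterable<string>,
  required: readonly string[],
): string[] {
  const present = new Set(existing);
  return required.filter((column) => !present.has(column));
}

// The three columns this issue says were recently added via ALTER TABLE.
const REQUIRED_AGENT_COLUMNS = ["last_event_type", "last_event_at", "active_tools"] as const;

// Invented example of a table created before the migration ran.
const legacyColumns = ["id", "status", "dispatch_attempts", "last_activity_at"];
const missingOnLegacy = findMissingColumns(legacyColumns, REQUIRED_AGENT_COLUMNS);

// Invented example of a fully migrated table.
const migratedColumns = [...legacyColumns, ...REQUIRED_AGENT_COLUMNS];
const missingOnMigrated = findMissingColumns(migratedColumns, REQUIRED_AGENT_COLUMNS);
```

A non-empty result for town `8a6f9375` would support the missing-column theory. An empty result would point back at the other causes.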

## Recommended Fix

1. **Add try/catch per reconcile sub-function** — instead of one big `reconcile()`, catch errors in each `reconcileAgents()`, `reconcileBeads()`, etc. so a failure in one doesn't block all others.

2. **Emit a health event from the reconciler** — Write a `reconciler_tick` town_event every N ticks (e.g. every 60s) with metrics. This makes reconciler health visible via the debug endpoint.

3. **Investigate the actual error** — Add a temporary debug field to the `/debug/towns/:id/status` response that runs `reconcile()` in a try/catch and returns the error if it throws.
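Fixes 1 and 2 can be sketched together: run each sub-function in its own try/catch, collect errors instead of rethrowing, and summarize the tick in a payload that could be written as a `reconciler_tick` event or returned from the debug endpoint. This is a sketch under assumptions, not the existing code: `runIsolated`, `buildHealthPayload`, the `Action` shape, and the payload fields are hypothetical, and the real sub-functions take arguments that are omitted here.

```typescript
// Hypothetical shapes for illustration only.
type Action = { type: string };
type StepError = { step: string; message: string };
type TickResult = { actions: Action[]; errors: StepError[] };

// Runs every step even if an earlier one throws, so a failure in one
// sub-function cannot suppress the actions of the others.
function runIsolated(steps: Record<string, () => Action[]>): TickResult {
  const result: TickResult = { actions: [], errors: [] };
  for (const [name, step] of Object.entries(steps)) {
    try {
      result.actions.push(...step());
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      result.errors.push({ step: name, message });
    }
  }
  return result;
}

// Summarizes a tick so reconciler health is visible without log access.
function buildHealthPayload(result: TickResult, tickAtMs: number) {
  return {
    type: "reconciler_tick",
    at: new Date(tickAtMs).toISOString(),
    actionCount: result.actions.length,
    failedSteps: result.errors.map((e) => e.step),
    errors: result.errors,
  };
}

// Simulated tick: the first step fails the way cause 1 predicts,
// the second still produces its Rule 3 actions.
const tick = runIsolated({
  reconcileAgents: () => {
    throw new Error("no such column: last_event_type");
  },
  reconcileBeads: () => [{ type: "transition_bead" }, { type: "clear_bead_assignee" }],
});

const health = buildHealthPayload(tick, Date.parse("2026-03-21T07:25:00Z"));
```

The design choice is to return errors as data rather than only logging them: the same `errors` array can feed the health event from fix 2 and the temporary debug field from fix 3, so the failure no longer depends on `console.error` being captured.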

## Affected Code

- `cloudflare-gastown/src/dos/Town.do.ts:2954-2979` — reconciler Phase 1
- `cloudflare-gastown/src/dos/town/reconciler.ts:299-306` — `reconcile()` top-level
- `cloudflare-gastown/src/dos/town/reconciler.ts:326-361` — `reconcileAgents()` query + parse
