Bug: Polecat heartbeats don't restore working status after dispatch timeout race (#1358)

## Parent

Sub-issue of #204 (Phase 4: Hardening)

## Summary

Polecats that survive a container restart get stuck in a 5-minute reset cycle. The bead cycles `in_progress → open → in_progress` every 5 minutes, restarting the agent's work from checkpoint each time.

## Root Cause

A timing race between `startAgentInContainer`'s 60-second `AbortSignal.timeout` and the container's actual agent startup time:

1. `dispatchAgent` sets the agent to `working` and calls `startAgentInContainer`
2. The container takes >60s to respond (cold start: git clone + worktree). The timeout fires → returns `false`
3. `dispatchAgent`'s `!started` path sets the agent back to **`idle`** (`scheduling.ts:166`)
4. The container DID start the agent — the timeout was for the HTTP response, not the process
5. The agent starts working and sends heartbeats via `touchAgentHeartbeat`. **But `touchAgent` only updates `last_activity_at`; it does NOT restore `status` to `working`.** The agent remains `idle` in `agent_metadata`.
6. Five minutes later, `reconcileBeads` Rule 3 (`reconciler.ts:580-633`) fires: the bead is `in_progress` and stale (`STALE_IN_PROGRESS_TIMEOUT_MS = 5 min`), so the rule looks for a `working`/`stalled` agent hooked to it (lines 610-613). The agent IS hooked, but its status is `idle` → no match → bead reset to `open`, assignee cleared
7. Normal scheduling re-hooks and re-dispatches → cycle repeats
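
The sequence above can be reproduced with a small in-memory model. This is only a sketch of the described behavior, not the actual code: `Agent`, `Bead`, `onDispatchTimeout`, `onHeartbeat`, `rule3`, and `simulateRace` are hypothetical names, and measuring staleness from an `updatedAt` field is an assumption.

```typescript
type AgentStatus = "idle" | "working" | "stalled";

interface Agent {
  status: AgentStatus;
  lastActivityAt: number;
  hookedBeadId: string | null;
}

interface Bead {
  id: string;
  status: "open" | "in_progress";
  // Assumption: staleness is measured from the bead's last update.
  updatedAt: number;
}

const STALE_IN_PROGRESS_TIMEOUT_MS = 5 * 60 * 1000;

// Step 3: the `!started` path demotes the agent after the HTTP timeout.
function onDispatchTimeout(agent: Agent): void {
  agent.status = "idle";
}

// Step 5: the current heartbeat path only bumps the timestamp.
function onHeartbeat(agent: Agent, now: number): void {
  agent.lastActivityAt = now;
}

// Step 6: reset a stale in_progress bead that has no working/stalled agent.
function rule3(bead: Bead, agent: Agent, now: number): boolean {
  const stale =
    bead.status === "in_progress" &&
    now - bead.updatedAt >= STALE_IN_PROGRESS_TIMEOUT_MS;
  const hasActiveAgent =
    agent.hookedBeadId === bead.id &&
    (agent.status === "working" || agent.status === "stalled");
  if (!stale || hasActiveAgent) return false;
  bead.status = "open";
  agent.hookedBeadId = null; // stand-in for "assignee cleared"
  return true;
}

function simulateRace(): { beadReset: boolean; agent: Agent; bead: Bead } {
  const bead: Bead = { id: "bead-1", status: "in_progress", updatedAt: 0 };
  const agent: Agent = { status: "working", lastActivityAt: 0, hookedBeadId: "bead-1" };

  onDispatchTimeout(agent); // the 60s timeout fires, agent becomes idle
  for (let t = 90_000; t < STALE_IN_PROGRESS_TIMEOUT_MS; t += 30_000) {
    onHeartbeat(agent, t); // the agent is alive and heartbeating
  }
  const beadReset = rule3(bead, agent, STALE_IN_PROGRESS_TIMEOUT_MS);
  return { beadReset, agent, bead };
}
```

In this model the bead is reset even though the agent's last heartbeat is only 30 seconds old, because Rule 3 looks at `status` alone.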

The "already running" detection in `startAgentInContainer` doesn't help because the retry paths (`schedulePendingWork` / `reconcileBeads` Rule 2) only dispatch agents whose hooked bead is `open`. The bead is `in_progress` when the race happens.

## Proposed Fix

**Option A (simplest):** In `touchAgentHeartbeat` / `touchAgent`, if the agent's status is `idle` but it's receiving heartbeats, restore status to `working`. A heartbeat is proof the agent is alive in the container:

```sql
-- In touchAgent:
UPDATE agent_metadata
SET last_activity_at = ?,
    status = CASE WHEN status = 'idle' THEN 'working' ELSE status END
WHERE bead_id = ?
```
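
The same rule can be written as a pure TypeScript function, which makes the intended transitions easy to unit-test. This is a sketch only: `AgentMetadata` and `applyHeartbeat` are hypothetical names, not the real `touchAgent` signature.

```typescript
type AgentStatus = "idle" | "working" | "stalled";

interface AgentMetadata {
  status: AgentStatus;
  lastActivityAt: number;
}

// Mirrors the CASE expression: only `idle` is promoted to `working`;
// every other status is left unchanged.
function applyHeartbeat(row: AgentMetadata, now: number): AgentMetadata {
  return {
    lastActivityAt: now,
    status: row.status === "idle" ? "working" : row.status,
  };
}
```

Note that this promotes any `idle` agent that sends a heartbeat. Whether an agent that is legitimately idle can still emit heartbeats is not covered here and would need checking before adopting this option.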

**Option B:** In `reconcileBeads` Rule 3, also check `last_activity_at` freshness. If the agent has a recent heartbeat (within 90s), skip the rule — the agent is alive regardless of its `status` field.
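
A minimal sketch of the Option B check, assuming Rule 3 has already decided the bead is stale and has looked up the hooked agent. `HookedAgent`, `HEARTBEAT_FRESH_MS`, `hasLiveAgent`, and `shouldResetStaleBead` are hypothetical names introduced for illustration.

```typescript
type AgentStatus = "idle" | "working" | "stalled";

interface HookedAgent {
  status: AgentStatus;
  lastActivityAt: number;
}

// Freshness window taken from the 90s figure in Option B.
const HEARTBEAT_FRESH_MS = 90_000;

// An agent counts as live if its status says so OR it has heartbeated recently.
function hasLiveAgent(agent: HookedAgent | null, now: number): boolean {
  if (agent === null) return false;
  if (agent.status === "working" || agent.status === "stalled") return true;
  return now - agent.lastActivityAt <= HEARTBEAT_FRESH_MS;
}

// Rule 3 would reset the bead only when it is stale and no live agent is hooked.
function shouldResetStaleBead(
  beadIsStale: boolean,
  agent: HookedAgent | null,
  now: number,
): boolean {
  return beadIsStale && !hasLiveAgent(agent, now);
}
```

With this check, an `idle` agent whose last heartbeat is 30 seconds old blocks the reset, while one that has been silent for two minutes does not.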

**Option C:** Increase the `startAgentInContainer` timeout beyond typical cold-start time (120s+), or have `dispatchAgent` leave the agent `working` on timeout instead of setting it to `idle`. If the agent truly didn't start, the `reconcileAgents` heartbeat check catches it after 90s.
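
One way to sketch the second half of Option C is to distinguish a timeout from an explicit failure instead of collapsing both into `false`. Everything here is an assumption for illustration: `StartOutcome`, `startWithTimeout`, and `statusAfterDispatch` are hypothetical, and the injected `start` callback stands in for the real container request, which is assumed to honor the abort signal.

```typescript
type StartOutcome = "started" | "rejected" | "timed_out";

// Runs the injected start call under a timeout and reports which of the
// three outcomes occurred. `start` is expected to reject once `signal` aborts.
async function startWithTimeout(
  start: (signal: AbortSignal) => Promise<boolean>,
  timeoutMs: number,
): Promise<StartOutcome> {
  const signal = AbortSignal.timeout(timeoutMs);
  try {
    const ok = await start(signal);
    return ok ? "started" : "rejected";
  } catch (err) {
    // An aborted signal means the response was late, not that the start failed.
    if (signal.aborted) return "timed_out";
    throw err;
  }
}

// Only an explicit rejection demotes the agent. On timeout the agent stays
// `working`, and a later heartbeat check decides whether it is really alive.
function statusAfterDispatch(outcome: StartOutcome): "working" | "idle" {
  return outcome === "rejected" ? "idle" : "working";
}
```

The trade-off is that an agent that never started stays `working` until the heartbeat check notices, so this relies on that check running reliably.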

## Reproduction

1. Deploy to production (`pnpm deploy:prod`)
2. Container eviction + restart occurs
3. Create a convoy with 2+ beads
4. Observe beads cycling `in_progress → open → in_progress` every 5 minutes
5. Agent status messages show active work (tool calls, file reads) but `agent_metadata.status` stays `idle`

## Affected Code

- `cloudflare-gastown/src/dos/town/scheduling.ts:155-178` — `dispatchAgent` `!started` path
- `cloudflare-gastown/src/dos/town/agents.ts:540-567` — `touchAgent` (updates `last_activity_at` only)
- `cloudflare-gastown/src/dos/town/reconciler.ts:580-633` — `reconcileBeads` Rule 3
- `cloudflare-gastown/src/dos/town/reconciler.ts:55-56` — `STALE_IN_PROGRESS_TIMEOUT_MS = 5 min`
