Description
Parent
Sub-issue of #204 (Phase 4: Hardening)
Summary
Polecats that survive a container restart get stuck in a 5-minute reset cycle. The bead cycles in_progress → open → in_progress every 5 minutes, restarting the agent's work from checkpoint each time.
Root Cause
A timing race between `startAgentInContainer`'s 60-second `AbortSignal.timeout` and the container's actual agent startup time:
1. `dispatchAgent` sets the agent to `working` and calls `startAgentInContainer`.
2. The container takes >60s to respond (cold start: git clone + worktree). The timeout fires and `startAgentInContainer` returns `false`.
3. `dispatchAgent`'s `!started` path sets the agent back to `idle` (scheduling.ts:166). But the container DID start the agent: the timeout bounded the HTTP response, not the process.
4. The agent starts working and sends heartbeats via `touchAgentHeartbeat`, which updates `last_activity_at`. But `touchAgent` only updates `last_activity_at`; it does NOT restore `status` to `working`. The agent remains `idle` in `agent_metadata`.
5. Five minutes later, `reconcileBeads` Rule 3 (reconciler.ts:580-633) fires: the bead is `in_progress` and stale (`STALE_IN_PROGRESS_TIMEOUT_MS` = 5 min), so the rule checks for a `working`/`stalled` agent hooked to it (lines 610-613). The agent IS hooked, but its status is `idle`, so there is no match: the bead is reset to `open` and its assignee cleared.
6. Normal scheduling re-hooks and re-dispatches, and the cycle repeats.
The "already running" detection in `startAgentInContainer` doesn't help, because the retry paths (`schedulePendingWork` / `reconcileBeads` Rule 2) only dispatch agents whose hooked bead is `open`. The bead is `in_progress` when the race happens.
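The sequence above can be condensed into a small TypeScript simulation. The function bodies here are hypothetical stand-ins; only the names, the 60s timeout, and the `idle` reset behavior come from the issue:

```typescript
// Minimal sketch of the suspected race (simulation only, not the real code).

type AgentStatus = "idle" | "working" | "stalled";

interface Agent {
  status: AgentStatus;
}

const START_TIMEOUT_MS = 60_000; // AbortSignal.timeout in startAgentInContainer

// Simulated startAgentInContainer: the HTTP response races the 60s timeout.
// On a cold start (git clone + worktree) the response loses the race, so
// this returns false even though the container-side agent process has
// actually been launched.
function startAgentInContainer(coldStartMs: number): boolean {
  return coldStartMs <= START_TIMEOUT_MS;
}

// Simulated dispatchAgent: the !started path (scheduling.ts:166) assumes
// the agent never launched and resets it to idle. On a slow cold start
// this is wrong: the process is running, only the HTTP response was late.
function dispatchAgent(agent: Agent, coldStartMs: number): boolean {
  agent.status = "working";
  const started = startAgentInContainer(coldStartMs);
  if (!started) {
    agent.status = "idle"; // the root of the 5-minute reset cycle
  }
  return started;
}
```

With a 90s cold start the dispatch reports failure and leaves the agent `idle`, even though the agent is alive and about to heartbeat.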
Proposed Fix
Option A (simplest): In `touchAgentHeartbeat` / `touchAgent`, if the agent's status is `idle` but it is receiving heartbeats, restore the status to `working`. A heartbeat is proof that the agent is alive in the container:

```
// In touchAgent:
UPDATE agent_metadata
SET last_activity_at = ?,
    status = CASE WHEN status = 'idle' THEN 'working' ELSE status END
WHERE bead_id = ?
```

Option B: In `reconcileBeads` Rule 3, also check `last_activity_at` freshness. If the agent has a recent heartbeat (within 90s), skip the rule: the agent is alive regardless of its `status` field.
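A minimal TypeScript sketch of the Option B guard. The names (`AgentRow`, `shouldResetStaleBead`, `lastActivityAt`) are hypothetical; only the 90s window and the existing `working`/`stalled` check come from the issue:

```typescript
// Hypothetical predicate for reconcileBeads Rule 3 (not the actual
// reconciler.ts API): should a stale in_progress bead be reset to open?

const HEARTBEAT_FRESH_MS = 90_000; // 90s freshness window from the proposal

interface AgentRow {
  status: "idle" | "working" | "stalled";
  lastActivityAt: number; // epoch ms, updated by touchAgent heartbeats
}

function shouldResetStaleBead(
  agent: AgentRow | undefined,
  now: number,
): boolean {
  if (!agent) return true; // no agent hooked: safe to reset
  // Existing check: a working/stalled agent blocks the reset.
  if (agent.status === "working" || agent.status === "stalled") return false;
  // Option B addition: a fresh heartbeat also blocks the reset,
  // even if the status field still says idle.
  if (now - agent.lastActivityAt < HEARTBEAT_FRESH_MS) return false;
  return true;
}
```

The advantage over Option A is that it leaves `touchAgent` untouched; the cost is that the `status` column stays misleading until something else corrects it.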
Option C: Increase the `startAgentInContainer` timeout beyond the typical cold-start time (120s+), or stop setting the agent to `idle` on timeout. Leaving it `working` is safe: if the agent truly didn't start, the `reconcileAgents` heartbeat check catches it after 90s.
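Both Option C variants can be sketched together; the `statusAfterStartAttempt` helper and the three-way result are hypothetical (the real code returns a boolean, so distinguishing a timeout from a definitive failure would be a small refactor):

```typescript
// Variant 1: raise the timeout past typical cold-start time.
const COLD_START_TIMEOUT_MS = 120_000; // was 60_000 per the issue

// Variant 2: treat a timeout as "unknown", not "failed". The agent stays
// "working"; reconcileAgents' 90s heartbeat check demotes it if the
// process really never started.
function statusAfterStartAttempt(
  result: "started" | "failed" | "timeout",
): "working" | "idle" {
  switch (result) {
    case "started":
      return "working";
    case "timeout":
      // The HTTP response timed out, but the container process may be
      // alive; keep "working" and rely on the heartbeat reconciler.
      return "working";
    case "failed":
      // A definitive failure (e.g. the container returned an error) can
      // still reset to idle safely.
      return "idle";
  }
}
```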
Reproduction
- Deploy to production (`pnpm deploy:prod`)
- Container eviction + restart occurs
- Create a convoy with 2+ beads
- Observe beads cycling `in_progress → open → in_progress` every 5 minutes
- Agent status messages show active work (tool calls, file reads) but `agent_metadata.status` stays `idle`
Affected Code
- `cloudflare-gastown/src/dos/town/scheduling.ts:155-178`: `dispatchAgent` `!started` path
- `cloudflare-gastown/src/dos/town/agents.ts:540-567`: `touchAgent` (updates `last_activity_at` only)
- `cloudflare-gastown/src/dos/town/reconciler.ts:580-633`: `reconcileBeads` Rule 3
- `cloudflare-gastown/src/dos/town/reconciler.ts:55-56`: `STALE_IN_PROGRESS_TIMEOUT_MS = 5 min`