Bug: Polecat heartbeats don't restore working status after dispatch timeout race #1358

@jrf0110

Parent

Sub-issue of #204 (Phase 4: Hardening)

Summary

Polecats that survive a container restart get stuck in a 5-minute reset cycle. The bead cycles in_progress → open → in_progress every 5 minutes, restarting the agent's work from checkpoint each time.

Root Cause

A timing race between startAgentInContainer's 60-second AbortSignal.timeout and the container's actual agent startup time:

  1. dispatchAgent sets the agent to working and calls startAgentInContainer
  2. The container takes >60s to respond (cold start: git clone + worktree). The timeout fires → returns false
  3. dispatchAgent's !started path sets the agent back to idle (scheduling.ts:166)
  4. The container DID start the agent — the timeout was for the HTTP response, not the process
  5. Agent starts working, sends heartbeats via touchAgentHeartbeat → updates last_activity_at. But touchAgent only updates last_activity_at — it does NOT restore status to working. The agent remains idle in agent_metadata.
  6. 5 minutes later: reconcileBeads Rule 3 (reconciler.ts:580-633) fires: the bead is in_progress and stale (STALE_IN_PROGRESS_TIMEOUT_MS = 5 min), so the rule checks for a working/stalled agent hooked to it (line 610-613). The agent IS hooked but its status is idle, so no working/stalled agent is found → the rule matches → bead reset to open, assignee cleared
  7. Normal scheduling re-hooks and re-dispatches → cycle repeats

The "already running" detection in startAgentInContainer doesn't help because the retry paths (schedulePendingWork / reconcileBeads Rule 2) only dispatch agents whose hooked bead is open. The bead is in_progress when the race happens.
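The sequence in steps 1-6 can be replayed with a small in-memory model. This is only a sketch of the behaviour described above, not the real code: the Agent and Bead shapes, the inProgressSince field used as the staleness clock, and the 30-second heartbeat interval are assumptions made for illustration.

```typescript
// Simplified in-memory model of the race; names and shapes are illustrative.
type AgentStatus = "idle" | "working" | "stalled";

interface Agent {
  status: AgentStatus;
  lastActivityAt: number; // ms timestamp of the last heartbeat
}

interface Bead {
  status: "open" | "in_progress";
  assignee: string | null;
  inProgressSince: number; // assumed staleness clock for Rule 3
}

const STALE_IN_PROGRESS_TIMEOUT_MS = 5 * 60_000;

// Behaviour as described in step 5: the heartbeat only bumps the timestamp.
function touchAgent(agent: Agent, now: number): void {
  agent.lastActivityAt = now;
}

// Rule 3 as described in step 6: reset a stale in_progress bead unless a
// working/stalled agent is hooked to it.
function reconcileRule3(bead: Bead, hooked: Agent | null, now: number): void {
  const stale =
    bead.status === "in_progress" &&
    now - bead.inProgressSince >= STALE_IN_PROGRESS_TIMEOUT_MS;
  const active =
    hooked !== null &&
    (hooked.status === "working" || hooked.status === "stalled");
  if (stale && !active) {
    bead.status = "open";
    bead.assignee = null;
  }
}

function simulateRace(): { agent: Agent; bead: Bead } {
  const agent: Agent = { status: "idle", lastActivityAt: 0 };
  const bead: Bead = {
    status: "in_progress",
    assignee: "polecat-1",
    inProgressSince: 0,
  };

  agent.status = "working"; // step 1: dispatchAgent marks the agent working
  const started = false; // step 2: the HTTP call timed out after 60s
  if (!started) agent.status = "idle"; // step 3: !started path

  // Steps 4-5: the agent did start and heartbeats every 30s (assumed interval).
  for (let t = 70_000; t <= STALE_IN_PROGRESS_TIMEOUT_MS; t += 30_000) {
    touchAgent(agent, t);
  }

  // Step 6: Rule 3 runs at the 5-minute mark.
  reconcileRule3(bead, agent, STALE_IN_PROGRESS_TIMEOUT_MS);
  return { agent, bead };
}
```

In this model the bead ends up open with no assignee even though the last heartbeat arrived 20 seconds before the reconciler ran.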

Proposed Fix

Option A (simplest): In touchAgentHeartbeat / touchAgent, if the agent's status is idle but it's receiving heartbeats, restore status to working. A heartbeat is proof the agent is alive in the container:

-- In touchAgent: a heartbeat from an idle agent promotes it back to working
UPDATE agent_metadata
SET last_activity_at = ?,
    status = CASE WHEN status = 'idle' THEN 'working' ELSE status END
WHERE bead_id = ?
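The same CASE logic, written as a pure function for unit testing, might look like the sketch below. AgentRow and touchAgentWithRestore are hypothetical names; the real touchAgent works against agent_metadata rather than an in-memory object.

```typescript
type AgentStatus = "idle" | "working" | "stalled";

// Hypothetical in-memory stand-in for one agent_metadata row.
interface AgentRow {
  status: AgentStatus;
  lastActivityAt: number; // ms timestamp
}

// Mirrors the SQL above: always bump last_activity_at, and promote
// idle -> working because a heartbeat proves the agent is alive.
// Any other status is left unchanged.
function touchAgentWithRestore(row: AgentRow, now: number): AgentRow {
  return {
    status: row.status === "idle" ? "working" : row.status,
    lastActivityAt: now,
  };
}
```

One thing worth checking before adopting this: whether a heartbeat can still arrive after an agent has been deliberately set to idle, since that late heartbeat would flip it back to working.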

Option B: In reconcileBeads Rule 3, also check last_activity_at freshness. If the agent has a recent heartbeat (within 90s), skip the rule — the agent is alive regardless of its status field.
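A rough sketch of the Option B guard, assuming Rule 3 can see the hooked agent's last_activity_at. The helper name hasLiveAgent, the HookedAgent shape, and the HEARTBEAT_FRESH_MS constant are illustrative and not taken from reconciler.ts.

```typescript
type AgentStatus = "idle" | "working" | "stalled";

// Hypothetical view of the agent hooked to a bead.
interface HookedAgent {
  status: AgentStatus;
  lastActivityAt: number; // ms timestamp of the last heartbeat
}

// Assumed freshness window, matching the 90s mentioned in Option B.
const HEARTBEAT_FRESH_MS = 90_000;

// Returns true when Rule 3 should leave the bead alone: either the agent
// is marked working/stalled, or it has sent a heartbeat recently enough
// that its status field is probably wrong.
function hasLiveAgent(agent: HookedAgent | null, now: number): boolean {
  if (agent === null) return false;
  if (agent.status === "working" || agent.status === "stalled") return true;
  return now - agent.lastActivityAt <= HEARTBEAT_FRESH_MS;
}
```

With this guard, an idle agent whose last heartbeat was 20 seconds ago keeps its bead, while an idle agent that has been silent for several minutes still triggers the reset.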

Option C: Increase startAgentInContainer timeout beyond typical cold-start time (120s+), or make it not set the agent to idle on timeout (leave it working — if the agent truly didn't start, reconcileAgents heartbeat check catches it after 90s).
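For the second half of Option C, one possible shape is to distinguish a timeout from an explicit failure and only fall back to idle on the latter. The sketch below assumes startAgentInContainer could surface the caught error; StartOutcome, classifyStartError, and statusAfterDispatch are hypothetical names. It relies on the fact that AbortSignal.timeout aborts with an error whose name is "TimeoutError".

```typescript
type AgentStatus = "idle" | "working" | "stalled";

// Hypothetical tri-state result replacing the current boolean.
type StartOutcome = "started" | "failed" | "timeout";

// Classify an error caught around the container request. A signal created
// by AbortSignal.timeout() aborts with an error named "TimeoutError".
function classifyStartError(err: unknown): StartOutcome {
  const name =
    typeof err === "object" && err !== null && "name" in err
      ? (err as { name: unknown }).name
      : undefined;
  return name === "TimeoutError" ? "timeout" : "failed";
}

// On timeout the agent may well be running, so leave it working and let
// the heartbeat check catch a genuinely dead agent later.
function statusAfterDispatch(outcome: StartOutcome): AgentStatus {
  return outcome === "failed" ? "idle" : "working";
}
```

Used together, `statusAfterDispatch(classifyStartError(err))` keeps the agent working when the 60-second timeout fires and resets it to idle for any other error.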

Reproduction

  1. Deploy to production (pnpm deploy:prod)
  2. Container eviction + restart occurs
  3. Create a convoy with 2+ beads
  4. Observe beads cycling in_progress → open → in_progress every 5 minutes
  5. Agent status messages show active work (tool calls, file reads) but agent_metadata.status stays idle

Affected Code

  • cloudflare-gastown/src/dos/town/scheduling.ts:155-178 - dispatchAgent !started path
  • cloudflare-gastown/src/dos/town/agents.ts:540-567 - touchAgent (updates last_activity_at only)
  • cloudflare-gastown/src/dos/town/reconciler.ts:580-633 - reconcileBeads Rule 3
  • cloudflare-gastown/src/dos/town/reconciler.ts:55-56 - STALE_IN_PROGRESS_TIMEOUT_MS = 5 min

Metadata

    Labels

    bug (Something isn't working), kilo-auto-fix (Auto-generated label by Kilo), kilo-triaged (Auto-generated label by Kilo)
