Skip to content

[upstream-take #7244] Recover agents stuck in error after upstream quota windows#175

Merged
AnilChinchawaleXDC merged 2 commits into
masterfrom
chore/take-upstream-pr-7244-agent-error-recovery
Jun 1, 2026
Merged

[upstream-take #7244] Recover agents stuck in error after upstream quota windows#175
AnilChinchawaleXDC merged 2 commits into
masterfrom
chore/take-upstream-pr-7244-agent-error-recovery

Conversation

@AnilChinchawaleXDC

Copy link
Copy Markdown
Collaborator

No description provided.

joachimBrindeau and others added 2 commits May 30, 2026 19:42
When an agent's heartbeat call fails (shared-subscription quota
exhaustion is the recurring case), `finalizeAgentStatus` flips it to
`status: "error"` and freezes `lastHeartbeatAt`. With the default
`runtimeConfig.heartbeat` policy (empty / disabled), the timer-driven
`tickTimers` path never re-evaluates the agent, and the event-driven
recovery sweeps only fire on issue activity. Agents responsible for
quiet specialties (infra, security, legacy modernization) therefore
stay error-stuck after the upstream quota window has clearly reopened
for healthier peers in the same fleet — observed: 8 agents frozen 2-3
days while sibling agents heartbeated normally every ~90 minutes.

Adds a `recoverErroredAgents` sweep that walks `status: "error"` rows
whose `lastHeartbeatAt` (falling back to `updatedAt` for never-beat
agents) is older than `STICKY_ERROR_RECOVERY_MIN_AGE_MS` (2h) and flips
them to `idle` — mirroring `POST /api/agents/:id/resume` exactly, so
the next event-driven wake (assignment, mention, recovery sweep)
picks them up naturally. `lastHeartbeatAt` is intentionally NOT bumped,
preserving the original failure timestamp for audit. The UPDATE
recreates the status guard in its WHERE clause so a concurrent /resume
or fresh heartbeat can't be clobbered.

Wires the sweep into the existing periodic interval in `index.ts`
alongside `reapOrphanedRuns` etc. It runs independently so a slow
reaper can't starve quota-recovered agents.

The 2h floor sits below Anthropic's daily shared-subscription quota
reset cadence so every natural quota window auto-recovers, while
keeping real (non-quota) failures visible for two hours before silent
recovery.

Coverage: new `heartbeat-sticky-error-recovery.test.ts` covers the
recovery floor (above/below), non-error status protection, the
coalesce(updatedAt) fallback, the batch limit, activity-log emission,
and idempotency across back-to-back sweeps.
@AnilChinchawaleXDC AnilChinchawaleXDC merged commit 230156d into master Jun 1, 2026
1 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants