[upstream-take #7244] Recover agents stuck in error after upstream quota windows #175
Merged
AnilChinchawaleXDC merged 2 commits on Jun 1, 2026
When an agent's heartbeat call fails (shared-subscription quota exhaustion is the recurring case), `finalizeAgentStatus` flips it to `status: "error"` and freezes `lastHeartbeatAt`. With the default `runtimeConfig.heartbeat` policy (empty / disabled), the timer-driven `tickTimers` path never re-evaluates the agent, and the event-driven recovery sweeps only fire on issue activity. Agents responsible for quiet specialties (infra, security, legacy modernization) therefore stay error-stuck after the upstream quota window has clearly reopened for healthier peers in the same fleet. Observed: 8 agents frozen 2-3 days while sibling agents heartbeated normally every ~90 minutes.

Adds a `recoverErroredAgents` sweep that walks `status: "error"` rows whose `lastHeartbeatAt` (falling back to `updatedAt` for never-beat agents) is older than `STICKY_ERROR_RECOVERY_MIN_AGE_MS` (2h) and flips them to `idle`, mirroring `POST /api/agents/:id/resume` exactly, so the next event-driven wake (assignment, mention, recovery sweep) picks them up naturally. `lastHeartbeatAt` is intentionally NOT bumped, preserving the original failure timestamp for audit. The UPDATE recreates the status guard in its WHERE clause so a concurrent /resume or fresh heartbeat can't be clobbered.

Wires the sweep into the existing periodic interval in `index.ts` alongside `reapOrphanedRuns` etc. It runs independently so a slow reaper can't starve quota-recovered agents.

The 2h floor sits below Anthropic's daily shared-subscription quota reset cadence so every natural quota window auto-recovers, while keeping real (non-quota) failures visible for two hours before silent recovery.

Coverage: new `heartbeat-sticky-error-recovery.test.ts` covers the recovery floor (above/below), non-error status protection, the coalesce(updatedAt) fallback, the batch limit, activity-log emission, and idempotency across back-to-back sweeps.
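For readers without the diff at hand, the selection and guard logic described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the merged code: the `AgentRow` shape, the status union, the `batchLimit` default of 50, and the function signature are assumptions made for the example. The real sweep runs against database rows and also emits activity-log entries, which are omitted here.

```typescript
// Hypothetical row shape; the actual schema is not shown in this PR description.
type AgentStatus = "idle" | "running" | "error" | "paused";

interface AgentRow {
  id: string;
  status: AgentStatus;
  lastHeartbeatAt: number | null; // epoch ms; null for agents that never beat
  updatedAt: number; // epoch ms
}

// 2h floor, as stated in the description.
const STICKY_ERROR_RECOVERY_MIN_AGE_MS = 2 * 60 * 60 * 1000;

// Sketch of the sweep: flips sufficiently old "error" rows to "idle" and
// returns the ids it recovered. batchLimit = 50 is an assumed value.
function recoverErroredAgents(
  rows: AgentRow[],
  now: number,
  batchLimit: number = 50,
): string[] {
  const cutoff = now - STICKY_ERROR_RECOVERY_MIN_AGE_MS;

  // Equivalent of: status = 'error' AND coalesce(lastHeartbeatAt, updatedAt) < cutoff
  const candidates = rows
    .filter((r) => r.status === "error" && (r.lastHeartbeatAt ?? r.updatedAt) < cutoff)
    .slice(0, batchLimit);

  const recovered: string[] = [];
  for (const row of candidates) {
    // Stand-in for the status guard in the UPDATE's WHERE clause: skip rows
    // that something else (a /resume, a fresh heartbeat) has already changed.
    if (row.status !== "error") continue;
    row.status = "idle"; // lastHeartbeatAt is deliberately left untouched
    recovered.push(row.id);
  }
  return recovered;
}

// Usage: only "infra" (3h old) and "legacy" (never beat, updated 5h ago) qualify.
const HOUR_MS = 60 * 60 * 1000;
const demoNow = Date.UTC(2026, 5, 1, 12, 0, 0);
const demoAgents: AgentRow[] = [
  { id: "infra", status: "error", lastHeartbeatAt: demoNow - 3 * HOUR_MS, updatedAt: demoNow - 3 * HOUR_MS },
  { id: "security", status: "error", lastHeartbeatAt: demoNow - 1 * HOUR_MS, updatedAt: demoNow - 1 * HOUR_MS },
  { id: "legacy", status: "error", lastHeartbeatAt: null, updatedAt: demoNow - 5 * HOUR_MS },
  { id: "busy", status: "running", lastHeartbeatAt: demoNow - 9 * HOUR_MS, updatedAt: demoNow - 9 * HOUR_MS },
];
console.log(recoverErroredAgents(demoAgents, demoNow)); // [ 'infra', 'legacy' ]
```

Because recovered rows leave the `error` state, a second sweep over the same data finds nothing to do, which is the idempotency property the test file is said to cover.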
…m-pr-7244-agent-error-recovery