Skip to content

Orphaned Execution Recovery #23

@vybe

Description

@vybe

Problem

When executions fail mid-flight (Claude API disconnects, rate limits, transient errors), the system has no active recovery mechanism. The subprocess may hang indefinitely or spin on retries without the platform knowing. Current timeout defaults (900s) are a safety net, not a responsive solution.

Concrete scenario: Claude API returns "Server disconnected without sending a response" — the Claude Code process may hang or retry internally. The execution sits in running status consuming a capacity slot for up to 15 minutes before the hard timeout fires. During this time:

  • The capacity slot is wasted
  • The scheduler can't run new jobs on that agent
  • No notification is sent to the operator
  • The only remedy is the manual Stop button

What Exists Today

Mechanism What It Does Limitation
asyncio.wait_for() (900s) Kills process on hard timeout 15 min is too long for a stuck process
Slot TTL (30 min) Auto-releases orphaned Redis slots Doesn't terminate the actual process
Monitoring Service Detects stuck executions (>30 min), marks health as "degraded" Observe-only — takes no action
Process Engine Recovery On startup, resumes/fails stale process executions Only for Process Engine, not regular tasks
Stop Button SIGINT → SIGKILL via ProcessRegistry Manual — requires human

What's Missing

  1. No liveness check — nobody polls a running execution to see if it's still producing output
  2. No idle timeout — if Claude Code produces no output for 5 minutes, it's likely stuck, but nothing acts on that
  3. No startup recovery for regular tasks — Process Engine has recovery, but schedule_executions with status=running are never cleaned up after a backend crash
  4. Monitoring detects but doesn't remediate — stuck detection only sets health status, doesn't terminate

Proposed Solution

Phase 1: Startup Recovery (Original Issue Scope)

On backend/scheduler startup:

  1. Query schedule_executions WHERE status = 'running'
  2. For each: check if agent container is actually running that execution (GET /api/executions/running)
  3. If not found on agent → mark as failed with error "Orphaned execution recovered on startup"
  4. Release any associated capacity slots
  5. Log and optionally notify

Phase 2: Active Liveness Monitoring

Execution Watchdog Service — periodic background task (runs every 60s):

  1. List all running executions across all agents (from schedule_executions table + slot service)
  2. For each running execution, query the agent's process registry: GET /api/executions/running
  3. If process not found on agent → execution is orphaned:
    • Mark as failed in DB
    • Release capacity slot
    • Send operator notification
  4. If process exists but idle (no output for > configurable threshold):
    • Configurable idle_timeout_seconds (default: 300s / 5 min)
    • Agent-side: track last_output_at timestamp in ProcessRegistry
    • New agent endpoint: GET /api/executions/{id}/health → returns last_output_at, running_seconds, output_line_count
    • If idle > threshold → auto-terminate via existing SIGINT/SIGKILL flow

Phase 3: Smarter Execution Timeouts

Replace the single hard timeout with a tiered approach:

Timeout Default Purpose
idle_timeout 300s (5 min) No output from Claude Code process
execution_timeout 1800s (30 min) Total wall-clock time for the execution
api_error_timeout 60s (1 min) Consecutive API errors without progress

These should be configurable per-schedule and per-task trigger.

Phase 4: Remediation Actions

When a stuck/failed execution is detected:

  1. Terminate the process (existing flow)
  2. Release capacity slot
  3. Update execution record with diagnostic info
  4. Send notification to operator (via existing notification system)
  5. If retry is configured (see Configurable retry mechanism for task and schedule execution failures #89): schedule retry with backoff

Database Changes

Extend schedule_executions:

  • last_heartbeat_at (TEXT) — last time the execution produced output (set by agent, synced to backend)

Agent ProcessRegistry:

  • Track last_output_at per registered process
  • Expose via health endpoint

Configuration

# Backend config / per-schedule override
EXECUTION_IDLE_TIMEOUT = 300        # 5 min no output → terminate
EXECUTION_MAX_TIMEOUT = 1800        # 30 min hard cap
WATCHDOG_INTERVAL = 60              # Check every 60s
ORPHAN_RECOVERY_ON_STARTUP = True   # Phase 1

Acceptance Criteria

  • On startup, orphaned running executions are detected and failed
  • Background watchdog periodically checks all running executions against agent process registries
  • Executions with no output for idle_timeout seconds are auto-terminated
  • Capacity slots are released for all terminated/orphaned executions
  • Operator notifications sent for auto-terminated executions
  • Configurable timeouts per-schedule (idle, max, API error)
  • Integration with retry mechanism (Configurable retry mechanism for task and schedule execution failures #89) for auto-retry after failure

Related

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions