You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When executions fail mid-flight (Claude API disconnects, rate limits, transient errors), the system has no active recovery mechanism. The subprocess may hang indefinitely or spin on retries without the platform knowing. Current timeout defaults (900s) are a safety net, not a responsive solution.
Concrete scenario: Claude API returns "Server disconnected without sending a response" — the Claude Code process may hang or retry internally. The execution sits in running status consuming a capacity slot for up to 15 minutes before the hard timeout fires. During this time:
The capacity slot is wasted
The scheduler can't run new jobs on that agent
No notification is sent to the operator
The only remedy is the manual Stop button
What Exists Today
Mechanism
What It Does
Limitation
asyncio.wait_for() (900s)
Kills process on hard timeout
15 min is too long for a stuck process
Slot TTL (30 min)
Auto-releases orphaned Redis slots
Doesn't terminate the actual process
Monitoring Service
Detects stuck executions (>30 min), marks health as "degraded"
Observe-only — takes no action
Process Engine Recovery
On startup, resumes/fails stale process executions
Only for Process Engine, not regular tasks
Stop Button
SIGINT → SIGKILL via ProcessRegistry
Manual — requires human
What's Missing
No liveness check — nobody polls a running execution to see if it's still producing output
No idle timeout — if Claude Code produces no output for 5 minutes, it's likely stuck, but nothing acts on that
No startup recovery for regular tasks — Process Engine has recovery, but schedule_executions with status=running are never cleaned up after a backend crash
Monitoring detects but doesn't remediate — stuck detection only sets health status, doesn't terminate
Proposed Solution
Phase 1: Startup Recovery (Original Issue Scope)
On backend/scheduler startup:
Query schedule_executions WHERE status = 'running'
For each: check if agent container is actually running that execution (GET /api/executions/running)
If not found on agent → mark as failed with error "Orphaned execution recovered on startup"
Release any associated capacity slots
Log and optionally notify
Phase 2: Active Liveness Monitoring
Execution Watchdog Service — periodic background task (runs every 60s):
List all running executions across all agents (from schedule_executions table + slot service)
For each running execution, query the agent's process registry: GET /api/executions/running
If process not found on agent → execution is orphaned:
Mark as failed in DB
Release capacity slot
Send operator notification
If process exists but idle (no output for > configurable threshold):
last_heartbeat_at (TEXT) — last time the execution produced output (set by agent, synced to backend)
Agent ProcessRegistry:
Track last_output_at per registered process
Expose via health endpoint
Configuration
# Backend config / per-schedule overrideEXECUTION_IDLE_TIMEOUT=300# 5 min no output → terminateEXECUTION_MAX_TIMEOUT=1800# 30 min hard capWATCHDOG_INTERVAL=60# Check every 60sORPHAN_RECOVERY_ON_STARTUP=True# Phase 1
Acceptance Criteria
On startup, orphaned running executions are detected and failed
Background watchdog periodically checks all running executions against agent process registries
Executions with no output for idle_timeout seconds are auto-terminated
Capacity slots are released for all terminated/orphaned executions
Operator notifications sent for auto-terminated executions
Configurable timeouts per-schedule (idle, max, API error)
Problem
When executions fail mid-flight (Claude API disconnects, rate limits, transient errors), the system has no active recovery mechanism. The subprocess may hang indefinitely or spin on retries without the platform knowing. Current timeout defaults (900s) are a safety net, not a responsive solution.
Concrete scenario: Claude API returns
"Server disconnected without sending a response"— the Claude Code process may hang or retry internally. The execution sits inrunningstatus consuming a capacity slot for up to 15 minutes before the hard timeout fires. During this time:What Exists Today
asyncio.wait_for()(900s)What's Missing
schedule_executionswith status=runningare never cleaned up after a backend crashProposed Solution
Phase 1: Startup Recovery (Original Issue Scope)
On backend/scheduler startup:
schedule_executionsWHEREstatus = 'running'/api/executions/running)failedwith error"Orphaned execution recovered on startup"Phase 2: Active Liveness Monitoring
Execution Watchdog Service — periodic background task (runs every 60s):
schedule_executionstable + slot service)GET /api/executions/runningfailedin DBidle_timeout_seconds(default: 300s / 5 min)last_output_attimestamp in ProcessRegistryGET /api/executions/{id}/health→ returnslast_output_at,running_seconds,output_line_countPhase 3: Smarter Execution Timeouts
Replace the single hard timeout with a tiered approach:
idle_timeoutexecution_timeoutapi_error_timeoutThese should be configurable per-schedule and per-task trigger.
Phase 4: Remediation Actions
When a stuck/failed execution is detected:
Database Changes
Extend
schedule_executions:last_heartbeat_at(TEXT) — last time the execution produced output (set by agent, synced to backend)Agent ProcessRegistry:
last_output_atper registered processConfiguration
Acceptance Criteria
runningexecutions are detected and failedidle_timeoutseconds are auto-terminatedRelated
monitoring_service.py) — stuck detectionslot_service.py) — TTL-based cleanupprocess_registry.py) — subprocess lifecycle