The monitoring service detects stuck executions (running > 30 minutes) and flags agent health as "degraded", but takes no remediation action. Operators must manually stop stuck executions via the UI Stop button. This defeats the purpose of autonomous operation.
Context
The monitoring service already does the hard work — it queries each agent's /api/executions/running endpoint and compares against expected state. But the result is only used for health status reporting, not recovery.
What Exists Today
| Mechanism | What It Does | Limitation |
| --- | --- | --- |
| Monitoring service | Detects stuck executions (>30 min), sets health to "degraded" | Observe-only; takes no action |
| Cleanup service | Marks running > 120 min as failed in DB | Doesn't terminate the actual process on agent |
| Stop button | SIGINT → SIGKILL via ProcessRegistry | Manual; requires human |
The Gap
Between detection (30 min) and passive cleanup (120 min), there's a 90-minute window where the system knows an execution is stuck but does nothing about it. During this time:
The capacity slot is wasted
The scheduler can't run new jobs on that agent
No notification is sent
The agent process may be consuming resources doing nothing useful
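To make the gap concrete, here is a minimal sketch that classifies a running execution by age using the two thresholds quoted above. The function and constant names are hypothetical and are not taken from the existing services.

```python
from datetime import datetime, timedelta, timezone

# Thresholds come from the issue text; the names are illustrative only.
STUCK_THRESHOLD = timedelta(minutes=30)     # monitoring flags health as "degraded"
CLEANUP_THRESHOLD = timedelta(minutes=120)  # cleanup marks the DB row as failed


def classify_execution(started_at: datetime, now: datetime) -> str:
    """Return the phase a running execution is in, based only on its age."""
    age = now - started_at
    if age <= STUCK_THRESHOLD:
        return "healthy"
    if age <= CLEANUP_THRESHOLD:
        # Detected as stuck, but nothing acts on it yet: this is the gap.
        return "stuck_unhandled"
    return "cleanup_eligible"


def gap_minutes() -> int:
    """Length of the window between detection and passive cleanup."""
    return int((CLEANUP_THRESHOLD - STUCK_THRESHOLD).total_seconds() // 60)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(classify_execution(now - timedelta(minutes=45), now))
    print(gap_minutes())
```

Any execution in the `stuck_unhandled` phase is one the system already knows about but does not act on.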
Proposed Solution
Extend the monitoring service (or cleanup service) to act on stuck execution detection:
Reconcile DB vs agent process registry (every 5 min, piggyback on cleanup cycle):
For each running execution in DB, check agent's process registry via GET /api/executions/{id}/status
If NOT found on agent → execution finished but DB wasn't updated → mark failed with error: "Execution completed on agent but status not reported — recovered by watchdog"
Auto-terminate idle executions (configurable, default disabled):
[...] POST /api/executions/{id}/stop on agent
[...] failed with error: "Execution auto-terminated after {N} minutes by watchdog"
[...] execution_timeout_override field
Notify on watchdog recovery:
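As a rough illustration of the reconciliation and auto-terminate steps, the decision logic could look something like the sketch below. It is not the actual cleanup_service.py code. The `RunningExecution` and `Action` types and the `reconcile` function are hypothetical, and the HTTP call to the agent is left out because the response format of the status endpoint is not specified here; the caller passes in whether the execution was found on the agent. The sketch also assumes auto-termination triggers on age exceeding a configured limit, which is one reading of "idle".

```python
from dataclasses import dataclass
from typing import Optional

# Error text quoted from the proposal.
RECOVERED_ERROR = (
    "Execution completed on agent but status not reported — recovered by watchdog"
)


@dataclass
class RunningExecution:
    """A DB row in running status (hypothetical shape)."""
    execution_id: str
    age_minutes: float


@dataclass
class Action:
    kind: str  # "none", "mark_failed", or "stop_and_mark_failed"
    error: Optional[str] = None


def reconcile(
    execution: RunningExecution,
    found_on_agent: bool,
    auto_terminate_after: Optional[float] = None,  # None = disabled (the default)
) -> Action:
    """Decide what the watchdog should do for one running execution."""
    if not found_on_agent:
        # Agent no longer tracks it: the DB status was never updated.
        return Action("mark_failed", RECOVERED_ERROR)
    if auto_terminate_after is not None and execution.age_minutes > auto_terminate_after:
        # Assumption: {N} in the error message is the configured limit.
        error = (
            f"Execution auto-terminated after {auto_terminate_after:g} "
            "minutes by watchdog"
        )
        return Action("stop_and_mark_failed", error)
    return Action("none")
```

Keeping the decision separate from the HTTP and DB calls means it can be unit-tested without a live agent; the cleanup cycle would then carry out the returned action.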
Files to Change
src/backend/services/cleanup_service.py: Add reconciliation logic in cleanup cycle
src/backend/services/monitoring_service.py: Optional: move stuck detection threshold to config
src/backend/db/schedules.py: Add method to get running executions with agent info
Acceptance Criteria
[...] running status against agent process registries
[...] failed with descriptive error
Related
src/backend/services/monitoring_service.py: Existing stuck detection (lines 232-252)
src/backend/services/cleanup_service.py: Existing time-based cleanup
docker/base-image/agent_server/services/process_registry.py: Agent-side process tracking