Skip to content

Active watchdog: remediate stuck executions detected by monitoring #129

@vybe

Description

@vybe

Summary

The monitoring service detects stuck executions (running > 30 minutes) and flags agent health as "degraded", but takes no remediation action. Operators must manually stop stuck executions via the UI Stop button. This defeats the purpose of autonomous operation.

Context

The monitoring service already does the hard work — it queries each agent's /api/executions/running endpoint and compares against expected state. But the result is only used for health status reporting, not recovery.

What Exists Today

Mechanism What It Does Limitation
Monitoring service Detects stuck executions (>30 min), sets health to "degraded" Observe-only — takes no action
Cleanup service Marks running > 120 min as failed in DB Doesn't terminate the actual process on agent
Stop button SIGINT → SIGKILL via ProcessRegistry Manual — requires human

The Gap

Between detection (30 min) and passive cleanup (120 min), there's a 90-minute window where the system knows an execution is stuck but does nothing about it. During this time:

  • The capacity slot is wasted
  • The scheduler can't run new jobs on that agent
  • No notification is sent
  • The agent process may be consuming resources doing nothing useful

Proposed Solution

Extend the monitoring service (or cleanup service) to act on stuck execution detection:

  1. Reconcile DB vs agent process registry (every 5 min, piggyback on cleanup cycle):

    • For each running execution in DB, check agent's process registry via GET /api/executions/{id}/status
    • If NOT found on agent → execution finished but DB wasn't updated → mark failed with error: "Execution completed on agent but status not reported — recovered by watchdog"
    • Release capacity slot
  2. Auto-terminate idle executions (configurable, default disabled):

    • If execution has been running > configurable threshold (e.g., 60 min) AND is confirmed running on agent
    • Terminate via existing POST /api/executions/{id}/stop on agent
    • Mark as failed with error: "Execution auto-terminated after {N} minutes by watchdog"
    • This is opt-in per schedule via an execution_timeout_override field
  3. Notify on watchdog recovery:

    • When watchdog recovers an orphaned execution, emit a WebSocket event so the UI updates
    • Log structured event for Vector pipeline

Files to Change

  • src/backend/services/cleanup_service.py — Add reconciliation logic in cleanup cycle
  • src/backend/services/monitoring_service.py — Optional: move stuck detection threshold to config
  • src/backend/db/schedules.py — Add method to get running executions with agent info

Acceptance Criteria

  • Cleanup cycle reconciles DB running status against agent process registries
  • Executions not found on agent are marked failed with descriptive error
  • Capacity slots released for reconciled executions
  • WebSocket event emitted on watchdog recovery (so UI updates)
  • Structured log entry for each recovered execution
  • Optional per-schedule execution timeout override (default: use global 120 min)
  • No regressions to existing monitoring or cleanup behavior

Related

  • Supersedes Phases 2-3 of Orphaned Execution Recovery #23 (Orphaned Execution Recovery)
  • src/backend/services/monitoring_service.py — Existing stuck detection (lines 232-252)
  • src/backend/services/cleanup_service.py — Existing time-based cleanup
  • docker/base-image/agent_server/services/process_registry.py — Agent-side process tracking

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions