Active watchdog: remediate stuck executions detected by monitoring #129

## Summary

The monitoring service detects stuck executions (running for more than 30 minutes) and flags agent health as "degraded", but takes no remediation action. Operators must stop stuck executions manually via the UI Stop button. Requiring a human for every stuck execution defeats the purpose of autonomous operation.

## Context

The monitoring service already does the hard work — it queries each agent's `/api/executions/running` endpoint and compares against expected state. But the result is only used for health status reporting, not recovery.

### What Exists Today

| Mechanism | What It Does | Limitation |
|-----------|-------------|------------|
| Monitoring service | Detects stuck executions (>30 min), sets health to "degraded" | **Observe-only — takes no action** |
| Cleanup service | Marks executions that have been `running` for > 120 min as `failed` in the DB | Doesn't terminate the actual process on the agent |
| Stop button | SIGINT → SIGKILL via ProcessRegistry | Manual — requires human |

### The Gap

Between detection (30 min) and passive cleanup (120 min), there's a 90-minute window where the system knows an execution is stuck but does nothing about it. During this time:
- The capacity slot is wasted
- The scheduler can't run new jobs on that agent
- No notification is sent
- The agent process may be consuming resources doing nothing useful
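
As a rough illustration of the window described above, the sketch below classifies a running execution by age using the two thresholds from the table. The constant names, the `phase` helper, and the phase labels are hypothetical and exist only for this example.

```python
from datetime import timedelta

# Thresholds taken from the "What Exists Today" table.
STUCK_DETECTION = timedelta(minutes=30)   # monitoring flags health as "degraded"
PASSIVE_CLEANUP = timedelta(minutes=120)  # cleanup marks the DB row as failed


def phase(age: timedelta) -> str:
    """Classify a running execution by age (phase names are illustrative)."""
    if age <= STUCK_DETECTION:
        return "healthy"
    if age <= PASSIVE_CLEANUP:
        return "detected, no action"  # the gap this issue targets
    return "marked failed in DB"


gap = PASSIVE_CLEANUP - STUCK_DETECTION
print(gap)  # 1:30:00
```

An execution that is 45 minutes old lands in the "detected, no action" phase: monitoring has already flagged it, but nothing will touch it for another 75 minutes.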

## Proposed Solution

Extend the monitoring service (or cleanup service) to **act on stuck execution detection**:

1. **Reconcile DB vs agent process registry** (every 5 min, piggyback on cleanup cycle):
   - For each `running` execution in DB, check agent's process registry via `GET /api/executions/{id}/status`
   - If NOT found on agent → execution finished but DB wasn't updated → mark `failed` with error: `"Execution completed on agent but status not reported — recovered by watchdog"`
   - Release the capacity slot held by the recovered execution

2. **Auto-terminate long-running executions** (configurable, disabled by default):
   - Applies when an execution has been running longer than a configurable threshold (e.g., 60 min) AND is confirmed running on the agent
   - Terminate via the existing `POST /api/executions/{id}/stop` endpoint on the agent
   - Mark as `failed` with error: `"Execution auto-terminated after {N} minutes by watchdog"`
   - This is opt-in per schedule via an `execution_timeout_override` field; schedules that do not set it keep the global 120-minute behavior

3. **Notify on watchdog recovery**:
   - When watchdog recovers an orphaned execution, emit a WebSocket event so the UI updates
   - Log structured event for Vector pipeline
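
The decision logic in steps 1 and 2 could be sketched as a pure function, which keeps it easy to unit test apart from the HTTP and DB calls. This is a minimal sketch under stated assumptions, not the actual implementation: `Action`, `RunningExecution`, `decide`, and `termination_error` are hypothetical names, and treating an unreachable agent as "take no action" is an assumption this issue does not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"                    # leave the execution alone
    MARK_ORPHANED = "mark_orphaned"  # step 1: mark failed, release the slot
    TERMINATE = "terminate"          # step 2: stop on agent, then mark failed


@dataclass(frozen=True)
class RunningExecution:
    # Hypothetical shape of a `running` DB row joined with its schedule.
    execution_id: str
    started_at: datetime
    # Per-schedule opt-in; None means auto-termination is disabled.
    execution_timeout_override: Optional[timedelta] = None


def decide(execution: RunningExecution,
           found_on_agent: Optional[bool],
           now: datetime) -> Action:
    """Pick a watchdog action for one execution.

    found_on_agent is True/False from the agent's status endpoint, or
    None when the agent could not be reached (assumption: do nothing,
    so a network blip is not mistaken for a finished process).
    """
    if found_on_agent is None:
        return Action.NONE
    if not found_on_agent:
        return Action.MARK_ORPHANED
    timeout = execution.execution_timeout_override
    if timeout is not None and now - execution.started_at > timeout:
        return Action.TERMINATE
    return Action.NONE


def termination_error(execution: RunningExecution, now: datetime) -> str:
    """Build the step 2 error message with the elapsed whole minutes."""
    minutes = int((now - execution.started_at).total_seconds() // 60)
    return f"Execution auto-terminated after {minutes} minutes by watchdog"
```

The cleanup cycle would call `decide` once per `running` row and then perform the side effects (stop request, DB update, slot release, notification) based on the returned `Action`.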

## Files to Change

- `src/backend/services/cleanup_service.py` — Add reconciliation logic in cleanup cycle
- `src/backend/services/monitoring_service.py` — Optional: move stuck detection threshold to config
- `src/backend/db/schedules.py` — Add method to get running executions with agent info

## Acceptance Criteria

- [ ] Cleanup cycle reconciles DB `running` status against agent process registries
- [ ] Executions not found on agent are marked `failed` with descriptive error
- [ ] Capacity slots released for reconciled executions
- [ ] WebSocket event emitted on watchdog recovery (so UI updates)
- [ ] Structured log entry for each recovered execution
- [ ] Optional per-schedule execution timeout override (default: use global 120 min)
- [ ] No regressions to existing monitoring or cleanup behavior
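
The WebSocket and structured-log criteria above could share a single payload, so the UI and the log pipeline see the same fields. The sketch below is only one possible shape: the function name, the event name, and every field name are hypothetical, since this issue does not define the event schema.

```python
import json
from datetime import datetime, timezone


def build_recovery_event(execution_id: str, agent_id: str,
                         reason: str, now: datetime) -> dict:
    """Build one payload for both the WebSocket event and the log line.

    All keys are illustrative; the real schema is not specified here.
    """
    return {
        "event": "watchdog_recovery",
        "execution_id": execution_id,
        "agent_id": agent_id,
        "status": "failed",
        "reason": reason,
        "recovered_at": now.isoformat(),
    }


event = build_recovery_event(
    "exec-123", "agent-7", "not_found_on_agent",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
)
# One JSON object per line is a common format for log shippers.
line = json.dumps(event, sort_keys=True)
print(line)
```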

## Related

- Supersedes Phases 2-3 of #23 (Orphaned Execution Recovery)
- `src/backend/services/monitoring_service.py` — Existing stuck detection (lines 232-252)
- `src/backend/services/cleanup_service.py` — Existing time-based cleanup
- `docker/base-image/agent_server/services/process_registry.py` — Agent-side process tracking
