Orphaned Execution Recovery #23

## Problem

When executions fail mid-flight (Claude API disconnects, rate limits, transient errors), the system has no active recovery mechanism. The subprocess may hang indefinitely or spin on retries without the platform knowing. The current hard timeout (900s by default) is a safety net, not a responsive solution.

**Concrete scenario:** the Claude API returns `"Server disconnected without sending a response"`, and the Claude Code process hangs or retries internally. The execution sits in `running` status, consuming a capacity slot for up to 15 minutes until the hard timeout fires. During this time:
- The capacity slot is wasted
- The scheduler can't run new jobs on that agent
- No notification is sent to the operator
- The only remedy is the manual Stop button

### What Exists Today

| Mechanism | What It Does | Limitation |
|-----------|-------------|------------|
| `asyncio.wait_for()` (900s) | Kills process on hard timeout | 15 min is too long for a stuck process |
| Slot TTL (30 min) | Auto-releases orphaned Redis slots | Doesn't terminate the actual process |
| Monitoring Service | Detects stuck executions (>30 min), marks health as "degraded" | **Observe-only — takes no action** |
| Process Engine Recovery | On startup, resumes/fails stale process executions | Only for Process Engine, not regular tasks |
| Stop Button | SIGINT → SIGKILL via ProcessRegistry | Manual — requires human |

### What's Missing

1. **No liveness check:** nothing polls a running execution to see whether it is still producing output
2. **No idle timeout:** if Claude Code produces no output for 5 minutes it is likely stuck, but nothing acts on that
3. **No startup recovery for regular tasks:** Process Engine has recovery, but rows in `schedule_executions` with `status = 'running'` are never cleaned up after a backend crash
4. **Monitoring detects but doesn't remediate:** stuck detection only sets the health status and never terminates the process

## Proposed Solution

### Phase 1: Startup Recovery (Original Issue Scope)

On backend/scheduler startup:
1. Query `schedule_executions` WHERE `status = 'running'`
2. For each row, check whether the agent container is actually running that execution (`GET /api/executions/running`)
3. If the execution is not found on the agent → mark it as `failed` with error `"Orphaned execution recovered on startup"`
4. Release any associated capacity slots
5. Log the recovery and optionally notify the operator
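
The startup steps above could be sketched as a pure reconciliation function. This is a minimal illustration, not the actual implementation: the `Execution` record, `recover_orphans`, and the `running_on_agent` mapping are hypothetical names, and the DB query and the HTTP call to `GET /api/executions/running` are assumed to happen in the caller.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

ORPHAN_ERROR = "Orphaned execution recovered on startup"


@dataclass
class Execution:
    # Hypothetical shape of a `schedule_executions` row.
    id: str
    agent: str
    status: str
    error: Optional[str] = None


def recover_orphans(
    running_rows: List[Execution],
    running_on_agent: Dict[str, Set[str]],
) -> List[Execution]:
    """Mark every DB-'running' execution that its agent no longer reports.

    running_rows: rows with status == 'running' (step 1).
    running_on_agent: agent name -> execution ids the agent reported (step 2).
    Assumption: an agent missing from the mapping runs nothing, so its
    executions are treated as orphaned.
    """
    recovered = []
    for row in running_rows:
        if row.id in running_on_agent.get(row.agent, set()):
            continue  # still alive on the agent, leave it alone
        row.status = "failed"
        row.error = ORPHAN_ERROR
        recovered.append(row)
    return recovered
```

The caller would then release the capacity slot and log or notify for each returned row (steps 4 and 5). Treating an unreachable agent as "nothing running" is a design choice; a more conservative variant could skip those rows and let the watchdog handle them.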

### Phase 2: Active Liveness Monitoring

**Execution Watchdog Service** — periodic background task (runs every 60s):

1. **List all running executions** across all agents (from `schedule_executions` table + slot service)
2. **For each running execution**, query the agent's process registry: `GET /api/executions/running`
3. **If process not found on agent** → execution is orphaned:
   - Mark as `failed` in DB
   - Release capacity slot
   - Send operator notification
4. **If process exists but idle** (no output for longer than a configurable threshold) → auto-terminate via the existing SIGINT/SIGKILL flow. This requires:
   - A configurable `idle_timeout_seconds` (default: 300s / 5 min)
   - Agent-side tracking of a `last_output_at` timestamp in ProcessRegistry
   - A new agent endpoint: `GET /api/executions/{id}/health` → returns `last_output_at`, `running_seconds`, `output_line_count`
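
The per-execution decision in steps 3 and 4 might look like the sketch below. The `Verdict` enum and `classify` function are hypothetical; the sketch assumes the health payload carries `last_output_at` as epoch seconds and that the caller passes `None` when the agent does not know the execution.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    HEALTHY = "healthy"
    IDLE = "idle"          # process exists but has been silent too long
    ORPHANED = "orphaned"  # agent has no such process


def classify(
    health: Optional[dict],
    now: float,
    idle_timeout_seconds: float = 300.0,
) -> Verdict:
    """Decide what the watchdog should do with one running execution.

    health: payload of the proposed health endpoint, or None if the
        agent's process registry does not contain the execution.
    now: current time in the same unit as health["last_output_at"].
    """
    if health is None:
        return Verdict.ORPHANED
    if now - health["last_output_at"] > idle_timeout_seconds:
        return Verdict.IDLE
    return Verdict.HEALTHY
```

An `ORPHANED` verdict would lead to the mark-failed, release-slot, notify path; an `IDLE` verdict would trigger termination first.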

### Phase 3: Smarter Execution Timeouts

Replace the single hard timeout with a tiered approach:

| Timeout | Default | Purpose |
|---------|---------|---------|
| `idle_timeout` | 300s (5 min) | No output from Claude Code process |
| `execution_timeout` | 1800s (30 min) | Total wall-clock time for the execution |
| `api_error_timeout` | 60s (1 min) | Consecutive API errors without progress |

These should be configurable per schedule and per task trigger.
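
One way the tiered timeouts and their overrides could fit together is sketched below. The `Timeouts` dataclass, `resolve`, and `first_expired` are hypothetical names. The precedence (trigger overrides win over schedule overrides, which win over defaults) and the order in which tiers are checked are assumptions, not something this issue specifies.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Timeouts:
    # Defaults taken from the table above, in seconds.
    idle_timeout: float = 300.0
    execution_timeout: float = 1800.0
    api_error_timeout: float = 60.0


def resolve(
    defaults: Timeouts,
    schedule_overrides: Optional[dict] = None,
    trigger_overrides: Optional[dict] = None,
) -> Timeouts:
    """Layer overrides on top of the defaults; later layers win."""
    merged = defaults
    for overrides in (schedule_overrides, trigger_overrides):
        if overrides:
            merged = replace(merged, **overrides)
    return merged


def first_expired(
    t: Timeouts,
    running_seconds: float,
    idle_seconds: float,
    api_error_seconds: float,
) -> Optional[str]:
    """Return the name of the first exceeded tier, or None if all are fine."""
    if running_seconds > t.execution_timeout:
        return "execution_timeout"
    if api_error_seconds > t.api_error_timeout:
        return "api_error_timeout"
    if idle_seconds > t.idle_timeout:
        return "idle_timeout"
    return None
```

For example, `resolve(Timeouts(), {"idle_timeout": 120})` would tighten only the idle tier for one schedule while keeping the other defaults.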

### Phase 4: Remediation Actions

When a stuck/failed execution is detected:
1. Terminate the process (existing flow)
2. Release capacity slot
3. Update execution record with diagnostic info
4. Send notification to operator (via existing notification system)
5. **If retry is configured** (see #89): schedule retry with backoff
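
The ordering of these actions could be sketched as a small runner that keeps going when one step fails, so that a termination error never prevents the slot from being released. The `remediate` function and the step callables are hypothetical; the real termination, slot, record, and notification calls are assumed to be passed in by the caller.

```python
from typing import Callable, List, Optional, Tuple

# A step is a name plus a callable taking (execution_id, reason).
Step = Tuple[str, Callable[[str, str], None]]


def remediate(
    execution_id: str,
    reason: str,
    steps: List[Step],
    schedule_retry: Optional[Callable[[str], None]] = None,
) -> Tuple[List[str], List[str]]:
    """Run remediation steps in order and report which ones succeeded.

    A failing step is recorded and skipped rather than aborting the run.
    schedule_retry is only called when retry is configured (see #89).
    """
    completed: List[str] = []
    failed: List[str] = []
    for name, action in steps:
        try:
            action(execution_id, reason)
            completed.append(name)
        except Exception:
            failed.append(name)
    if schedule_retry is not None:
        schedule_retry(execution_id)
        completed.append("retry")
    return completed, failed
```

Whether to continue after a failed step is a design choice; the sketch favors always attempting the slot release and the notification.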

## Database Changes

**Extend `schedule_executions`:**
- `last_heartbeat_at` (TEXT) — last time the execution produced output (set by agent, synced to backend)

**Agent ProcessRegistry:**
- Track `last_output_at` per registered process
- Expose via health endpoint
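
A minimal sketch of the agent-side tracking follows, assuming the registry is told about every output line. `OutputTracker` and its methods are hypothetical and stand in for whatever ProcessRegistry actually exposes; the returned fields mirror the health endpoint proposed in Phase 2.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProcessEntry:
    started_at: float
    last_output_at: float
    output_line_count: int = 0


class OutputTracker:
    """Tracks per-execution output activity for a health endpoint."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, ProcessEntry] = {}

    def register(self, execution_id: str) -> None:
        now = self._clock()
        self._entries[execution_id] = ProcessEntry(now, now)

    def record_output(self, execution_id: str) -> None:
        # Call once per output line read from the subprocess.
        entry = self._entries[execution_id]
        entry.last_output_at = self._clock()
        entry.output_line_count += 1

    def health(self, execution_id: str) -> Optional[dict]:
        entry = self._entries.get(execution_id)
        if entry is None:
            return None  # unknown execution, e.g. already finished
        return {
            "last_output_at": entry.last_output_at,
            "running_seconds": self._clock() - entry.started_at,
            "output_line_count": entry.output_line_count,
        }
```

The backend could copy `last_output_at` into `last_heartbeat_at` on each watchdog pass; how and when that sync happens is left open here.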

## Configuration

```python
# Backend config / per-schedule override
EXECUTION_IDLE_TIMEOUT = 300        # 5 min no output → terminate
EXECUTION_MAX_TIMEOUT = 1800        # 30 min hard cap
EXECUTION_API_ERROR_TIMEOUT = 60    # 1 min of consecutive API errors (Phase 3; setting name is an assumption)
WATCHDOG_INTERVAL = 60              # Check every 60s
ORPHAN_RECOVERY_ON_STARTUP = True   # Phase 1
```

## Acceptance Criteria

- [ ] On startup, orphaned `running` executions are detected and failed
- [ ] Background watchdog periodically checks all running executions against agent process registries
- [ ] Executions with no output for `idle_timeout` seconds are auto-terminated
- [ ] Capacity slots are released for all terminated/orphaned executions
- [ ] Operator notifications are sent for auto-terminated executions
- [ ] Timeouts (idle, max, API error) are configurable per schedule
- [ ] Integration with retry mechanism (#89) for auto-retry after failure

## Related

- #89 — Configurable retry mechanism for task and schedule failures
- Monitoring Service (`monitoring_service.py`) — stuck detection
- Slot Service (`slot_service.py`) — TTL-based cleanup
- Process Registry (`process_registry.py`) — subprocess lifecycle
