bug: Cleanup service does not fail execution records when reclaiming stale slots (#219)

## Summary

The cleanup service (added per #94) successfully reclaims stale Redis slots and cleans leaked capacity, but it does **not** mark the corresponding `schedule_executions` database records as failed. This leaves execution records permanently stuck in "running" (or "skipped") state, requiring manual cleanup. The slot cleanup and execution record cleanup are disconnected.

## Component

Backend / Cleanup Service

## Priority

P2 — Executions still run, but stuck records accumulate and pollute the UI. Manual cleanup required.

## Error

No error logged. The gap is silent — cleanup logs show slot reclamation but execution records are untouched:

```
[Slots] Cleaned up 1 stale slots for agent 'oracle-5-geopolitics'
[Cleanup] Cycle complete: {'stale_executions': 0, 'no_session_executions': 0, 'orphaned_skipped': 0, 'stale_activities': 0, 'stale_slots': 1, 'total': 1}
```

Note `stale_executions: 0` and `stale_slots: 1` — the slot was cleaned but the execution record was not.
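
Because the gap is silent, a small check on the cycle summary line can help surface it. The sketch below assumes the summary keeps the exact format shown in the log excerpt above; the function name `slot_execution_gap` is hypothetical and not part of the codebase. A positive result is only a hint, since a slot can legitimately be reclaimed after its execution has already finished.

```python
import ast
import re

# Matches the summary line emitted at the end of a cleanup cycle
# (format copied from the log excerpt in this issue).
CYCLE_RE = re.compile(r"\[Cleanup\] Cycle complete: (\{.*\})")


def slot_execution_gap(log_line: str) -> bool:
    """Return True when slots were reclaimed but no execution was failed."""
    match = CYCLE_RE.search(log_line)
    if match is None:
        return False
    # The summary is printed as a Python dict literal, so literal_eval can parse it.
    counts = ast.literal_eval(match.group(1))
    return counts.get("stale_slots", 0) > 0 and counts.get("stale_executions", 0) == 0


line = (
    "[Cleanup] Cycle complete: {'stale_executions': 0, 'no_session_executions': 0, "
    "'orphaned_skipped': 0, 'stale_activities': 0, 'stale_slots': 1, 'total': 1}"
)
gap_detected = slot_execution_gap(line)
```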

## Location

- **File**: `src/backend/services/cleanup_service.py` — cleanup cycle logic
- **File**: `src/backend/services/slot_service.py` — stale slot cleanup

## Root Cause

Two gaps in the cleanup service:

### Gap 1: Stale slot cleanup doesn't fail execution records

When `SlotService` cleans up a stale slot, it removes the Redis ZSET entry but does not update the `schedule_executions` table. The execution record stays "running" with `claude_session_id="dispatched"` forever. The cleanup service therefore counts the slot under `stale_slots` while `stale_executions` stays at 0.

**Observed**: `oracle-5-geopolitics` acquired slot at 15:30 (execution `hDTmGixuXipGoEY_XkS_9A`, TTL=3900s). Slot reclaimed by cleanup at 15:51. Execution record stayed "running" — scheduler polled it 234 times before manual cleanup.

### Gap 2: "Skipped" executions are non-terminal

When the scheduler skips an execution (agent busy), it sets `status="skipped"` with no `claude_session_id`. This is treated as a non-terminal state — the record sits in the DB indefinitely. The cleanup service doesn't consider "skipped" as a state that needs cleanup.

**Observed**: `oracle-6-science` Heartbeat at 14:40 was skipped (agent busy). Record stayed as `status="skipped"` with no session.

## Reproduction Steps

1. Run multiple agents with scheduled Heartbeat tasks.
2. Wait for either a silent launch failure (the execution is dispatched but the agent container never completes, which is common under load) or a skip (the agent is busy when the schedule fires).
3. Observe the cleanup service reclaiming the stale slot without failing the execution record.
4. Query the non-terminal records: `SELECT agent_name, status, claude_session_id FROM schedule_executions WHERE status NOT IN ('completed', 'failed', 'cancelled', 'success')`
5. Confirm that stuck records accumulate over time.

## Suggested Fix

### Fix Gap 1: Correlate slot cleanup with execution records

When the cleanup service reclaims a stale slot, look up the execution ID from the slot metadata and fail the corresponding execution record:

```python
# In cleanup cycle, after stale slot cleanup:
for slot in stale_slots_cleaned:
    execution_id = slot.execution_id  # from ZSET member
    db.update_execution_status(
        execution_id=execution_id,
        status="failed",
        error="Stale execution — slot TTL expired, cleaned by cleanup service"
    )
```
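
One thing to watch with this fix is a race: an execution may finish between the slot expiring and the cleanup cycle running, and an unconditional update would overwrite its final status. A minimal sketch of a guarded update follows, using an in-memory SQLite table purely for illustration. The table name and `status` column come from this issue; the `id` and `error` column names, the function name `fail_executions_for_slots`, and the use of SQLite are assumptions.

```python
import sqlite3


def fail_executions_for_slots(conn, execution_ids):
    """Mark still-running executions as failed; return how many rows changed."""
    failed = 0
    for execution_id in execution_ids:
        cur = conn.execute(
            # The status guard leaves already-finished executions untouched.
            "UPDATE schedule_executions SET status = 'failed', error = ? "
            "WHERE id = ? AND status = 'running'",
            ("Stale execution: slot TTL expired, cleaned by cleanup service", execution_id),
        )
        failed += cur.rowcount
    conn.commit()
    return failed


# Illustrative schema only; the real table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schedule_executions (id TEXT PRIMARY KEY, status TEXT, error TEXT)")
conn.executemany(
    "INSERT INTO schedule_executions (id, status) VALUES (?, ?)",
    [("exec-a", "running"), ("exec-b", "completed")],
)

# exec-a is stuck, exec-b finished in time, exec-missing has no record at all.
failed_count = fail_executions_for_slots(conn, ["exec-a", "exec-b", "exec-missing"])
```

Returning the number of rows changed would also let the cycle summary report a non-zero `stale_executions` count whenever a reclaimed slot actually had a stuck record behind it.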

### Fix Gap 2: Treat "skipped" as terminal, or clean it up

- **Option A**: Make "skipped" a terminal state (add it to the terminal state checks).
- **Option B**: Have the cleanup service transition old "skipped" records to "cancelled" after a threshold (e.g., 10 minutes).

A sketch of Option B:

```python
from datetime import timedelta

# In cleanup cycle:
stale_skipped = db.get_executions_by_status(
    status="skipped",
    older_than=timedelta(minutes=10)
)
for execution in stale_skipped:  # renamed from `exec`, which shadows the built-in
    db.update_execution_status(execution.id, status="cancelled", error="Skipped: agent was busy")
```
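
Option A could be sketched as a single shared set of terminal statuses that every terminal state check consults. The four existing statuses are taken from the reproduction query in this issue; the names `TERMINAL_STATUSES` and `is_terminal` are hypothetical, and the real code may spread these checks over several places.

```python
# Terminal statuses from the reproduction query, plus "skipped" (Option A).
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled", "success", "skipped"})


def is_terminal(status: str) -> bool:
    """Return True when an execution in this status needs no further polling or cleanup."""
    return status in TERMINAL_STATUSES


# A skipped execution no longer counts as stuck, while a running one still does.
skipped_is_terminal = is_terminal("skipped")
running_is_terminal = is_terminal("running")
```

With this approach, the `NOT IN (...)` list in the reproduction query would also need "skipped" added, otherwise skipped records would still show up as stuck.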

## Environment

- Trinity version: `4e7d161`
- Observed on: eu2 instance, 2026-03-28
- 3 stuck executions (1 skipped, 2 running/dispatched), 11 orphaned activities

## Related

- #94 — Original cleanup service request (this bug is a gap in the implementation)
- #90 — Root cause of silent launch failures (scheduler/backend DB split)
