Summary
The cleanup service (added per #94) successfully reclaims stale Redis slots and cleans leaked capacity, but it does not mark the corresponding schedule_executions database records as failed. This leaves execution records permanently stuck in "running" (or "skipped") state, requiring manual cleanup. The slot cleanup and execution record cleanup are disconnected.
Component
Backend / Cleanup Service
Priority
P2 — Executions still run, but stuck records accumulate and pollute the UI. Manual cleanup required.
Error
No error logged. The gap is silent — cleanup logs show slot reclamation but execution records are untouched:
[Slots] Cleaned up 1 stale slots for agent 'oracle-5-geopolitics'
[Cleanup] Cycle complete: {'stale_executions': 0, 'no_session_executions': 0, 'orphaned_skipped': 0, 'stale_activities': 0, 'stale_slots': 1, 'total': 1}
Note stale_executions: 0 and stale_slots: 1 — the slot was cleaned but the execution record was not.
Location
- File: src/backend/services/cleanup_service.py (cleanup cycle logic)
- File: src/backend/services/slot_service.py (stale slot cleanup)
Root Cause
Two gaps in the cleanup service:
Gap 1: Stale slot cleanup doesn't fail execution records
When SlotService cleans up a stale slot, it removes the Redis ZSET entry but does not update the schedule_executions table. The execution record stays "running" with session_id="dispatched" forever. The cleanup service reports stale_slots cleaned but stale_executions: 0.
Observed: oracle-5-geopolitics acquired slot at 15:30 (execution hDTmGixuXipGoEY_XkS_9A, TTL=3900s). Slot reclaimed by cleanup at 15:51. Execution record stayed "running" — scheduler polled it 234 times before manual cleanup.
Gap 2: "Skipped" executions are non-terminal
When the scheduler skips an execution (agent busy), it sets status="skipped" with no claude_session_id. This is treated as a non-terminal state — the record sits in the DB indefinitely. The cleanup service doesn't consider "skipped" as a state that needs cleanup.
Observed: oracle-6-science Heartbeat at 14:40 was skipped (agent busy). Record stayed as status="skipped" with no session.
Reproduction Steps
- Have multiple agents with scheduled Heartbeat tasks
- Wait for a silent launch failure (dispatched but agent container doesn't complete — common under load)
- Or wait for a skip (agent busy when schedule fires)
- Observe the cleanup service reclaiming the stale slot but NOT failing the execution record
- Query: SELECT agent_name, status, claude_session_id FROM schedule_executions WHERE status NOT IN ('completed', 'failed', 'cancelled', 'success');
- Stuck records accumulate over time
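To make the diagnostic query easy to try in isolation, here is a minimal sketch that runs it against an in-memory SQLite database. The table layout (an id column plus the three columns the query selects), the sample agent names, and the use of SQLite are all assumptions for illustration; the real schedule_executions schema and database engine may differ.

```python
import sqlite3

# Hypothetical minimal schema: only an id plus the columns the query selects.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule_executions ("
    "id TEXT PRIMARY KEY, agent_name TEXT, status TEXT, claude_session_id TEXT)"
)
conn.executemany(
    "INSERT INTO schedule_executions VALUES (?, ?, ?, ?)",
    [
        ("e1", "agent-a", "running", "dispatched"),  # slot reclaimed, record never failed
        ("e2", "agent-b", "skipped", None),          # skipped, never finalized
        ("e3", "agent-c", "success", "sess-123"),    # finished normally
    ],
)

# Statuses the query treats as finished.
TERMINAL_STATUSES = ("completed", "failed", "cancelled", "success")


def find_stuck_executions(conn):
    """Return every execution whose status is not in TERMINAL_STATUSES."""
    placeholders = ", ".join("?" for _ in TERMINAL_STATUSES)
    rows = conn.execute(
        "SELECT agent_name, status, claude_session_id FROM schedule_executions "
        f"WHERE status NOT IN ({placeholders}) ORDER BY agent_name",
        TERMINAL_STATUSES,
    )
    return rows.fetchall()


stuck = find_stuck_executions(conn)
print(stuck)
```

With this sample data, only the "running" and "skipped" rows come back, which matches the two kinds of stuck record described under Root Cause.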
Suggested Fix
Fix Gap 1: Correlate slot cleanup with execution records
When the cleanup service reclaims a stale slot, look up the execution ID from the slot metadata and fail the corresponding execution record:
# In cleanup cycle, after stale slot cleanup:
for slot in stale_slots_cleaned:
    execution_id = slot.execution_id  # from ZSET member
    db.update_execution_status(
        execution_id=execution_id,
        status="failed",
        error="Stale execution: slot TTL expired, cleaned by cleanup service",
    )
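One detail worth guarding against: a slot can go stale after its execution has already finished, so the update should probably touch only records that are still "running". The sketch below illustrates that guard with a conditional UPDATE. It assumes a SQLite-style table with hypothetical id, status, and error columns, and the helper name fail_executions_for_reclaimed_slots is invented for this example; it is not the existing cleanup service API.

```python
import sqlite3

STALE_SLOT_ERROR = "Stale execution: slot TTL expired, cleaned by cleanup service"


def fail_executions_for_reclaimed_slots(conn, execution_ids):
    """Fail executions whose slots were reclaimed; return the number of rows changed."""
    changed = 0
    for execution_id in execution_ids:
        # The status guard keeps already-finished executions untouched.
        cur = conn.execute(
            "UPDATE schedule_executions SET status = 'failed', error = ? "
            "WHERE id = ? AND status = 'running'",
            (STALE_SLOT_ERROR, execution_id),
        )
        changed += cur.rowcount
    conn.commit()
    return changed


# Demo with a hypothetical table: one running record, one finished record.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule_executions (id TEXT PRIMARY KEY, status TEXT, error TEXT)"
)
conn.executemany(
    "INSERT INTO schedule_executions (id, status) VALUES (?, ?)",
    [("exec-1", "running"), ("exec-2", "success")],
)

changed = fail_executions_for_reclaimed_slots(conn, ["exec-1", "exec-2"])
statuses = dict(conn.execute("SELECT id, status FROM schedule_executions"))
print(changed, statuses)
```

The returned count could feed the stale_executions counter in the cycle summary, so a reclaimed slot and its failed record show up together in the logs.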
Fix Gap 2: Treat "skipped" as terminal, or clean it up
Option A: Make "skipped" a terminal state (add to terminal state checks).
Option B: Have the cleanup service transition old "skipped" records to "cancelled" after a threshold (e.g., 10 minutes).
from datetime import timedelta

# In cleanup cycle:
stale_skipped = db.get_executions_by_status(
    status="skipped",
    older_than=timedelta(minutes=10),
)
for execution in stale_skipped:  # renamed to avoid shadowing the built-in exec()
    db.update_execution_status(execution.id, status="cancelled", error="Skipped: agent was busy")
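Option B can also be expressed as a single conditional UPDATE rather than a fetch-then-update loop, which avoids racing with a record that changes state between the two steps. This is a sketch under assumptions: the started_at column (stored here as epoch seconds), the SQLite backend, and the function name cancel_stale_skipped are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def cancel_stale_skipped(conn, now, threshold=timedelta(minutes=10)):
    """Cancel 'skipped' executions older than the threshold; return rows changed."""
    cutoff = (now - threshold).timestamp()
    cur = conn.execute(
        "UPDATE schedule_executions SET status = 'cancelled', error = ? "
        "WHERE status = 'skipped' AND started_at < ?",
        ("Skipped: agent was busy", cutoff),
    )
    conn.commit()
    return cur.rowcount


# Demo with a hypothetical table: one skip older than the threshold, one newer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule_executions ("
    "id TEXT PRIMARY KEY, status TEXT, started_at REAL, error TEXT)"
)
now = datetime(2026, 3, 28, 15, 0, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO schedule_executions VALUES (?, ?, ?, ?)",
    [
        ("old-skip", "skipped", (now - timedelta(minutes=20)).timestamp(), None),
        ("new-skip", "skipped", (now - timedelta(minutes=5)).timestamp(), None),
    ],
)

cancelled = cancel_stale_skipped(conn, now)
remaining = dict(conn.execute("SELECT id, status FROM schedule_executions"))
print(cancelled, remaining)
```

Passing now as a parameter keeps the threshold logic testable without depending on the wall clock.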
Environment
- Trinity version: 4e7d161
- Observed on: eu2 instance, 2026-03-28
- 3 stuck executions (1 skipped, 2 running/dispatched), 11 orphaned activities
Related