bug: Cleanup service does not fail execution records when reclaiming stale slots #219

@vybe

Summary

The cleanup service (added per #94) successfully reclaims stale Redis slots and cleans leaked capacity, but it does not mark the corresponding schedule_executions database records as failed. This leaves execution records permanently stuck in "running" (or "skipped") state, requiring manual cleanup. The slot cleanup and execution record cleanup are disconnected.

Component

Backend / Cleanup Service

Priority

P2 — Executions still run, but stuck records accumulate and pollute the UI. Manual cleanup required.

Error

No error logged. The gap is silent — cleanup logs show slot reclamation but execution records are untouched:

[Slots] Cleaned up 1 stale slots for agent 'oracle-5-geopolitics'
[Cleanup] Cycle complete: {'stale_executions': 0, 'no_session_executions': 0, 'orphaned_skipped': 0, 'stale_activities': 0, 'stale_slots': 1, 'total': 1}

Note stale_executions: 0 and stale_slots: 1 — the slot was cleaned but the execution record was not.
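To find other affected cycles in existing logs, the summary dict can be parsed and checked for this mismatch. A minimal sketch, assuming every summary line has the format shown above; the helper names `parse_cycle_summary` and `is_suspect_cycle` are hypothetical and not part of the codebase, and the comparison is only a heuristic:

```python
import ast

def parse_cycle_summary(line: str) -> dict:
    """Extract the summary dict from a '[Cleanup] Cycle complete: {...}' line."""
    _, _, payload = line.partition("Cycle complete: ")
    if not payload:
        raise ValueError("not a cycle summary line")
    # The payload is a Python dict repr, so literal_eval parses it safely.
    return ast.literal_eval(payload.strip())

def is_suspect_cycle(summary: dict) -> bool:
    """Heuristic: more slots were reclaimed than execution records were failed."""
    return summary.get("stale_slots", 0) > summary.get("stale_executions", 0)

line = (
    "[Cleanup] Cycle complete: {'stale_executions': 0, 'no_session_executions': 0, "
    "'orphaned_skipped': 0, 'stale_activities': 0, 'stale_slots': 1, 'total': 1}"
)
summary = parse_cycle_summary(line)
suspect = is_suspect_cycle(summary)
```

For the log line quoted in this issue, `suspect` evaluates to true, which matches the observed gap.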

Location

  • File: src/backend/services/cleanup_service.py — cleanup cycle logic
  • File: src/backend/services/slot_service.py — stale slot cleanup

Root Cause

Two gaps in the cleanup service:

Gap 1: Stale slot cleanup doesn't fail execution records

When SlotService cleans up a stale slot, it removes the Redis ZSET entry but does not update the schedule_executions table. The execution record stays "running" with session_id="dispatched" forever. The cleanup service reports stale_slots cleaned but stale_executions: 0.

Observed: oracle-5-geopolitics acquired slot at 15:30 (execution hDTmGixuXipGoEY_XkS_9A, TTL=3900s). Slot reclaimed by cleanup at 15:51. Execution record stayed "running" — scheduler polled it 234 times before manual cleanup.

Gap 2: "Skipped" executions are non-terminal

When the scheduler skips an execution (agent busy), it sets status="skipped" with no claude_session_id. This is treated as a non-terminal state — the record sits in the DB indefinitely. The cleanup service doesn't consider "skipped" as a state that needs cleanup.

Observed: oracle-6-science Heartbeat at 14:40 was skipped (agent busy). Record stayed as status="skipped" with no session.

Reproduction Steps

  1. Have multiple agents with scheduled Heartbeat tasks
  2. Wait for a silent launch failure (dispatched but agent container doesn't complete — common under load)
  3. Or wait for a skip (agent busy when schedule fires)
  4. Observe the cleanup service reclaiming the stale slot but NOT failing the execution record
  5. Query: SELECT agent_name, status, claude_session_id FROM schedule_executions WHERE status NOT IN ('completed', 'failed', 'cancelled', 'success')
  6. Stuck records accumulate over time

Suggested Fix

Fix Gap 1: Correlate slot cleanup with execution records

When the cleanup service reclaims a stale slot, look up the execution ID from the slot metadata and fail the corresponding execution record:

# In cleanup cycle, after stale slot cleanup:
for slot in stale_slots_cleaned:
    execution_id = slot.execution_id  # from ZSET member
    db.update_execution_status(
        execution_id=execution_id,
        status="failed",
        error="Stale execution — slot TTL expired, cleaned by cleanup service"
    )
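
One thing to watch with the update above: an execution might reach another state between slot expiry and the cleanup cycle, so it may be safer to fail the record only if it is still "running". A rough sketch of that guard using sqlite3 and an in-memory table; the column names (`id`, `status`, `error`), the helper `fail_if_still_running`, and the error text are assumptions for illustration, not the actual schema or API:

```python
import sqlite3

# Hypothetical error text; the real message is up to the implementation.
STALE_ERROR = "Stale execution: slot reclaimed by cleanup service"

def fail_if_still_running(conn: sqlite3.Connection, execution_id: str) -> bool:
    """Mark one execution as failed, but only if it is still 'running'.

    Returns True when a row was updated. The status guard in the WHERE
    clause keeps records in any other state from being overwritten.
    """
    cur = conn.execute(
        "UPDATE schedule_executions SET status = 'failed', error = ? "
        "WHERE id = ? AND status = 'running'",
        (STALE_ERROR, execution_id),
    )
    conn.commit()
    return cur.rowcount == 1

# Demo against an in-memory table (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE schedule_executions (id TEXT PRIMARY KEY, status TEXT, error TEXT)"
)
conn.executemany(
    "INSERT INTO schedule_executions (id, status) VALUES (?, ?)",
    [("exec-a", "running"), ("exec-b", "completed")],
)
updated_a = fail_if_still_running(conn, "exec-a")  # still running, gets failed
updated_b = fail_if_still_running(conn, "exec-b")  # already completed, untouched
```

The return value could also feed the `stale_executions` counter in the cycle summary, so the log reflects how many records were actually failed.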

Fix Gap 2: Treat "skipped" as terminal, or clean it up

Option A: Make "skipped" a terminal state (add to terminal state checks).
Option B: Have the cleanup service transition old "skipped" records to "cancelled" after a threshold (e.g., 10 minutes).

# In cleanup cycle:
from datetime import timedelta

stale_skipped = db.get_executions_by_status(
    status="skipped",
    older_than=timedelta(minutes=10),
)
for execution in stale_skipped:  # renamed from `exec` to avoid shadowing the built-in
    db.update_execution_status(execution.id, status="cancelled", error="Skipped — agent was busy")
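
For comparison, Option A could be as small as one shared set of terminal states. A sketch under assumptions: the four existing terminal statuses are taken from the query in the reproduction steps, and `TERMINAL_STATUSES` and `is_terminal` are hypothetical names, since the issue does not show how the current terminal-state checks are written:

```python
# Statuses excluded by the reproduction query, plus "skipped" (Option A).
TERMINAL_STATUSES = frozenset(
    {"completed", "failed", "cancelled", "success", "skipped"}
)

def is_terminal(status: str) -> bool:
    """True when an execution in this status needs no further processing."""
    return status in TERMINAL_STATUSES

# Example: only the "running" record is still considered open.
statuses = ["running", "skipped", "completed"]
open_statuses = [s for s in statuses if not is_terminal(s)]
```

With Option A the skipped records stay in the table as "skipped" and nothing is rewritten; Option B instead moves them to "cancelled", which keeps "skipped" as a short-lived state.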

Environment

  • Trinity version: 4e7d161
  • Observed on: eu2 instance, 2026-03-28
  • 3 stuck executions (1 skipped, 2 running/dispatched), 11 orphaned activities
