fix(cleanup): use per-agent slot TTL instead of fixed 20-min default (#226) #290

Closed
webmixgamer wants to merge 1 commit into main from feature/226-slot-ttl-per-agent

Conversation

webmixgamer (Contributor) commented Apr 10, 2026

Summary

  • Periodic slot cleanup now queries per-agent execution_timeout_seconds from DB instead of using a fixed 20-min TTL, preventing premature slot reclamation for agents with custom timeouts
  • Watchdog reconciliation returns confirmed-running execution IDs; slot cleanup skips DB failure marking for those IDs, preventing false "failed" status on actively running tasks
  • Added get_all_execution_timeouts() bulk DB query for efficient per-agent timeout lookup
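
The behaviour described in the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Slot` dataclass, the `slot_ttl` helper, the `TTL_BUFFER_SECONDS` name, and the function's arguments and return shape are all hypothetical. Only the 1200 s default and the 5-minute buffer come from the commit message. The sketch also assumes that a stale slot is still reclaimed when the watchdog confirms the execution is running, and that only the DB failure marking is skipped.

```python
from dataclasses import dataclass

DEFAULT_SLOT_TTL_SECONDS = 1200  # fixed 20-min fallback named in the commit message
TTL_BUFFER_SECONDS = 300         # 5-min buffer from the commit message (constant name is hypothetical)


@dataclass
class Slot:
    """Hypothetical in-memory view of one occupied slot."""
    agent_name: str
    execution_id: str
    age_seconds: float


def slot_ttl(agent_name, agent_timeouts):
    """Per-agent TTL: execution timeout plus buffer, else the fixed default."""
    timeout = agent_timeouts.get(agent_name)
    if timeout is None:
        return DEFAULT_SLOT_TTL_SECONDS
    return timeout + TTL_BUFFER_SECONDS


def cleanup_stale_slots(slots, agent_timeouts, confirmed_running):
    """Return (reclaimed, marked_failed) lists of execution IDs."""
    reclaimed, marked_failed = [], []
    for slot in slots:
        if slot.age_seconds <= slot_ttl(slot.agent_name, agent_timeouts):
            continue  # still within this agent's TTL
        reclaimed.append(slot.execution_id)
        # Skip failure marking when the watchdog confirmed the task is running.
        if slot.execution_id not in confirmed_running:
            marked_failed.append(slot.execution_id)
    return reclaimed, marked_failed
```

Keeping the reclaimed and marked-failed lists separate makes the second fix visible: a slot can be past its TTL without its execution being reported as failed.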

Changes

  • src/backend/services/slot_service.py: cleanup_stale_slots() accepts an agent_timeouts map
  • src/backend/services/cleanup_service.py: wires the watchdog's confirmed-running set into slot cleanup
  • src/backend/db/agent_settings/resources.py: bulk timeout query
  • src/backend/database.py: delegates the new method
  • docs/memory/feature-flows/cleanup-service.md: updated flow docs
  • docs/memory/feature-flows/parallel-capacity.md: updated slot cleanup reference
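
For readers unfamiliar with the bulk-lookup pattern, a single query that builds an agent-to-timeout map might look like the sketch below. The table name `agent_settings`, the `agent_name` column, the use of `sqlite3`, and the choice to omit agents with no custom timeout are all assumptions made for illustration. Only the method name and the `execution_timeout_seconds` field appear in this PR.

```python
import sqlite3


def get_all_execution_timeouts(conn):
    """Return {agent_name: execution_timeout_seconds} using one query.

    Agents without a custom timeout (NULL) are left out, so callers can
    fall back to their default TTL with a plain dict lookup.
    """
    rows = conn.execute(
        "SELECT agent_name, execution_timeout_seconds "
        "FROM agent_settings "  # hypothetical table name
        "WHERE execution_timeout_seconds IS NOT NULL"
    ).fetchall()
    return {name: timeout for name, timeout in rows}


def demo():
    """Build a throwaway in-memory table and run the bulk lookup."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agent_settings ("
        "agent_name TEXT PRIMARY KEY, "
        "execution_timeout_seconds INTEGER)"
    )
    conn.executemany(
        "INSERT INTO agent_settings VALUES (?, ?)",
        [("researcher", 1800), ("writer", None)],
    )
    timeouts = get_all_execution_timeouts(conn)
    conn.close()
    return timeouts
```

One query per cleanup cycle avoids a per-slot lookup, which is presumably the efficiency the summary refers to.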

Test Plan

  • All modified files pass syntax check
  • get_all_execution_timeouts() returns correct data from live DB
  • Manual cleanup trigger returns correct report (no errors, same structure)
  • Existing test_cleanup_service.py API contract unchanged
  • Verify agent with custom timeout (30 min) no longer gets false "failed" status — injected 25-min-old slot survived cleanup (per-agent TTL = 35 min), while 40-min-old slot was correctly reclaimed
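
The numbers in the last test-plan item can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the strict `>` comparison is an assumption (the scenario does not exercise the exact boundary). The 1200 s default, the 30-minute timeout, and the 5-minute buffer are taken from this PR.

```python
OLD_FIXED_TTL_SECONDS = 1200  # previous fixed 20-min default
BUFFER_SECONDS = 5 * 60       # 5-min buffer from the commit message


def minutes(n):
    """Convert minutes to seconds."""
    return n * 60


def is_stale(age_seconds, ttl_seconds):
    """A slot is treated as stale once its age exceeds the TTL."""
    return age_seconds > ttl_seconds


custom_timeout = minutes(30)
per_agent_ttl = custom_timeout + BUFFER_SECONDS  # 2100 s, i.e. 35 min

results = {
    "25-min slot, old fixed TTL": is_stale(minutes(25), OLD_FIXED_TTL_SECONDS),
    "25-min slot, per-agent TTL": is_stale(minutes(25), per_agent_ttl),
    "40-min slot, per-agent TTL": is_stale(minutes(40), per_agent_ttl),
}
```

Under the old fixed TTL the 25-minute-old slot would have been reclaimed. With the 35-minute per-agent TTL it survives, while the 40-minute-old slot is still reclaimed.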

Closes #226

🤖 Generated with Claude Code

…226)

The periodic slot cleanup used a fixed DEFAULT_SLOT_TTL_SECONDS (1200s)
for all agents, causing premature slot reclamation and false "failed"
status for agents with custom execution timeouts (e.g., 30+ minutes).

Two fixes:
- Query per-agent execution_timeout_seconds from DB and pass to
  cleanup_stale_slots() so each agent's slot TTL = timeout + 5min buffer
- Watchdog now returns confirmed-running execution IDs; slot cleanup
  skips DB failure marking for those IDs to prevent false failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
webmixgamer requested a review from vybe on April 10, 2026 15:35
vybe (Contributor) commented Apr 13, 2026

Superseded by #323 — rebased fresh on current main after #95 and #285 landed.

vybe closed this on Apr 13, 2026
vybe deleted the feature/226-slot-ttl-per-agent branch on April 13, 2026 19:12

Development

Successfully merging this pull request may close these issues.

bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout
