
fix(cleanup): use per-agent slot TTL instead of fixed 20-min default (#226) #323

Merged
vybe merged 3 commits into main from feature/226-per-agent-slot-ttl
Apr 14, 2026


@vybe commented Apr 13, 2026

Summary

  • Slot cleanup no longer applies a fixed 20-min TTL regardless of agent timeout; it now uses a per-agent TTL (timeout + 5 min buffer)
  • Watchdog returns confirmed running execution IDs to avoid double-failing legitimately running tasks
  • Agents with long timeouts (60-120 min) no longer have slots prematurely reclaimed
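
A minimal sketch of the per-agent TTL rule described above, not the code from this PR. `DEFAULT_SLOT_TTL_SECONDS` and the `agent_timeouts` parameter are named in the PR; `SLOT_TTL_BUFFER_SECONDS`, `slot_ttl_seconds`, `find_stale_slots`, the slot tuple layout, and the fallback to the default for agents without a configured timeout are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_SLOT_TTL_SECONDS = 20 * 60  # the fixed 20-min default named in the PR
SLOT_TTL_BUFFER_SECONDS = 5 * 60    # the "5 min buffer"; constant name is hypothetical


def slot_ttl_seconds(agent_id, agent_timeouts):
    """Return timeout + buffer for the agent, or the default if unknown (assumed fallback)."""
    timeout = agent_timeouts.get(agent_id)
    if timeout is None:
        return DEFAULT_SLOT_TTL_SECONDS
    return timeout + SLOT_TTL_BUFFER_SECONDS


def find_stale_slots(slots, agent_timeouts, now):
    """Return execution IDs whose slot age exceeds that agent's TTL.

    `slots` is an iterable of (agent_id, execution_id, started_at) tuples,
    a hypothetical shape chosen for this sketch.
    """
    stale = []
    for agent_id, execution_id, started_at in slots:
        age_seconds = (now - started_at).total_seconds()
        if age_seconds > slot_ttl_seconds(agent_id, agent_timeouts):
            stale.append(execution_id)
    return stale
```

With a 120-minute timeout, a slot held for 45 minutes stays well inside its 125-minute TTL, whereas the old fixed 20-minute TTL would already have reclaimed it.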

Changes

  • db/agent_settings/resources.py: Add get_all_execution_timeouts() bulk query
  • database.py: Add delegation method for get_all_execution_timeouts()
  • slot_service.py: Accept agent_timeouts param in cleanup_stale_slots()
  • cleanup_service.py: Pass per-agent timeouts; track confirmed running IDs from watchdog
  • tests/test_watchdog_unit.py: Update tests for 3-tuple return value
  • docs/memory/feature-flows/cleanup-service.md: Document the changes for #226 (bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout)
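
For readers unfamiliar with the bulk-query idea, here is a rough sketch of what fetching every agent's timeout in one round trip could look like. The PR does not show the real schema or implementation: the `agent_settings` table, its `agent_name` and `execution_timeout_seconds` columns, and the `get_all_execution_timeouts_sketch` name are all hypothetical.

```python
import sqlite3


def get_all_execution_timeouts_sketch(conn):
    """Return {agent_name: timeout_seconds} using a single query.

    Table and column names are assumptions for illustration only.
    """
    rows = conn.execute(
        "SELECT agent_name, execution_timeout_seconds "
        "FROM agent_settings "
        "WHERE execution_timeout_seconds IS NOT NULL"
    ).fetchall()
    return {name: timeout for name, timeout in rows}
```

One query per sweep keeps the cleanup loop from issuing a separate lookup for each agent that holds a slot.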

Test Plan

  • test_cleanup_service.py — 7 tests pass
  • test_capacity.py — 24 tests pass
  • test_agent_timeout.py — 21 tests pass
  • test_watchdog_unit.py — 18 tests pass (updated for new return signature)
  • test_watchdog.py — 4 tests pass

Closes #226

🤖 Generated with Claude Code

…226)

The periodic slot cleanup sweep was using a fixed DEFAULT_SLOT_TTL_SECONDS
(20 minutes) for all agents, regardless of their configured execution timeout.
This caused premature slot reclamation for agents with longer timeouts (e.g.,
60-120 minutes), leading to false "stale" failures while executions were
still legitimately running.

Changes:
- Add `get_all_execution_timeouts()` bulk query to fetch all agents' timeouts
- Pass per-agent timeouts to `cleanup_stale_slots()`
- Slot TTL now computed as `timeout_seconds + 5 min buffer` per agent
- Watchdog returns `confirmed_running_ids` set to avoid double-failing
  executions verified as still running within their timeout
- Update unit tests to expect 3-tuple return from `_reconcile_orphaned_executions`
- Update cleanup-service feature flow documentation
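
The interaction between the watchdog and the slot sweep can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the PR only says that `_reconcile_orphaned_executions` now returns a 3-tuple that includes `confirmed_running_ids`. The other two tuple elements, the `is_still_running` callback, and both function names here are hypothetical.

```python
def reconcile_sketch(active_executions, is_still_running):
    """Hypothetical stand-in for the watchdog reconciliation step.

    Returns (checked_count, failed_ids, confirmed_running_ids). Only the
    third element is described in the PR; the first two are assumptions.
    """
    failed_ids = []
    confirmed_running_ids = set()
    for execution_id in active_executions:
        if is_still_running(execution_id):
            confirmed_running_ids.add(execution_id)
        else:
            failed_ids.append(execution_id)
    return len(active_executions), failed_ids, confirmed_running_ids


def slots_to_reclaim(stale_candidates, confirmed_running_ids):
    """Drop any stale candidate the watchdog has just verified as running."""
    return [e for e in stale_candidates if e not in confirmed_running_ids]
```

Filtering on the confirmed set means an execution the watchdog has just verified is not marked failed a second time by the sweep.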

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
vybe and others added 2 commits April 14, 2026 07:50
Add Auth Failure Fast-Fail section to task-execution-service.md documenting:
- Agent-side pattern matcher and real-time stderr scanning
- Process kill on auth failure detection
- HTTP 503 response for auth failures
- Backend error code classification (TaskExecutionErrorCode.AUTH)
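
The fast-fail flow listed above might look roughly like this sketch. The patterns, function names, and the 200/500 mapping for non-auth outcomes are assumptions; the PR only states that stderr is scanned in real time, the process is killed on detection, and auth failures map to HTTP 503.

```python
import re
import subprocess
import sys

# Hypothetical patterns; the real matcher's patterns are not shown in this PR.
AUTH_FAILURE_PATTERNS = [
    re.compile(r"invalid api key", re.IGNORECASE),
    re.compile(r"authentication failed", re.IGNORECASE),
]


def is_auth_failure(line):
    """Return True if a stderr line matches any auth-failure pattern."""
    return any(p.search(line) for p in AUTH_FAILURE_PATTERNS)


def run_with_auth_fast_fail(cmd):
    """Run cmd, scan stderr line by line, and kill it on the first auth failure.

    Returns (http_status, detail). 503 for auth failures follows the PR;
    200 and 500 for the other outcomes are assumptions.
    """
    with subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True
    ) as proc:
        for line in proc.stderr:
            if is_auth_failure(line):
                proc.kill()  # stop immediately instead of waiting for a timeout
                return 503, line.strip()
    # Leaving the `with` block waits for the process, so returncode is set here.
    return (200 if proc.returncode == 0 else 500), ""
```

Killing the process as soon as the pattern appears avoids holding a slot until the full execution timeout expires on a task that cannot succeed.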

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add test_cleanup_service.py (7 tests) to Operations & Observability
- Add test_watchdog.py (4 tests) for integration tests
- Add test_watchdog_unit.py (18 tests) for unit tests
- Update test counts: +29 tests, now ~2,172 total
- Add 2026-04-14 entry documenting #226 test updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@vybe merged commit 17cfbab into main Apr 14, 2026
Linked issue: bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout (#226)
