fix(cleanup): use per-agent slot TTL instead of fixed 20-min default (#226) #323
Merged
…226)

The periodic slot cleanup sweep was using a fixed `DEFAULT_SLOT_TTL_SECONDS` (20 minutes) for all agents, regardless of their configured execution timeout. This caused premature slot reclamation for agents with longer timeouts (e.g., 60-120 minutes), leading to false "stale" failures while executions were still legitimately running.

Changes:
- Add `get_all_execution_timeouts()` bulk query to fetch all agents' timeouts
- Pass per-agent timeouts to `cleanup_stale_slots()`
- Slot TTL now computed as `timeout_seconds + 5 min buffer` per agent
- Watchdog returns `confirmed_running_ids` set to avoid double-failing executions verified as still running within their timeout
- Update unit tests to expect 3-tuple return from `_reconcile_orphaned_executions`
- Update cleanup-service feature flow documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
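The TTL rule and the `confirmed_running_ids` skip described in this commit can be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper names `slot_ttl_seconds` and `find_stale_slots`, the `SLOT_TTL_BUFFER_SECONDS` constant, the tuple layout of a slot, and the fallback to the 20-minute default for agents without a configured timeout are all assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple

DEFAULT_SLOT_TTL_SECONDS = 20 * 60  # the fixed 20-minute default named in the commit
SLOT_TTL_BUFFER_SECONDS = 5 * 60    # the "5 min buffer"; this constant name is hypothetical

# Hypothetical slot shape: (agent_id, execution_id, started_at_epoch_seconds)
Slot = Tuple[str, str, float]


def slot_ttl_seconds(agent_id: str, agent_timeouts: Dict[str, int]) -> int:
    """TTL for one agent's slots: its execution timeout plus the buffer.

    Falling back to the default for unknown agents is an assumption.
    """
    timeout = agent_timeouts.get(agent_id)
    if timeout is None:
        return DEFAULT_SLOT_TTL_SECONDS
    return timeout + SLOT_TTL_BUFFER_SECONDS


def find_stale_slots(
    slots: Iterable[Slot],
    agent_timeouts: Dict[str, int],
    confirmed_running_ids: Set[str],
    now: float,
) -> List[str]:
    """Return execution IDs whose slots have outlived their per-agent TTL."""
    stale: List[str] = []
    for agent_id, execution_id, started_at in slots:
        # Executions the watchdog verified as still running are never reclaimed here.
        if execution_id in confirmed_running_ids:
            continue
        if now - started_at > slot_ttl_seconds(agent_id, agent_timeouts):
            stale.append(execution_id)
    return stale
```

With a 60-minute agent timeout, a slot that has been held for 25 minutes is kept under this rule, whereas the fixed 20-minute default would have reclaimed it.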
Add Auth Failure Fast-Fail section to task-execution-service.md documenting:
- Agent-side pattern matcher and real-time stderr scanning
- Process kill on auth failure detection
- HTTP 503 response for auth failures
- Backend error code classification (`TaskExecutionErrorCode.AUTH`)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
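The fast-fail flow listed above (scan stderr as it arrives, kill the process on the first auth failure, answer with HTTP 503) could look roughly like this. It is a sketch under stated assumptions, not the agent's actual implementation: the regex patterns, the function names, and the 200/500 mapping for non-auth outcomes are hypothetical.

```python
import re
import subprocess
import sys
from typing import List, Optional, Tuple

# Hypothetical patterns; the real matcher's pattern list is not shown in this PR.
AUTH_FAILURE_PATTERNS = [
    re.compile(r"\b401\b.*unauthorized", re.IGNORECASE),
    re.compile(r"invalid api key", re.IGNORECASE),
    re.compile(r"authentication failed", re.IGNORECASE),
]


def match_auth_failure(line: str) -> Optional[str]:
    """Return the stripped line if it looks like an auth failure, else None."""
    for pattern in AUTH_FAILURE_PATTERNS:
        if pattern.search(line):
            return line.strip()
    return None


def run_with_auth_fast_fail(cmd: List[str]) -> Tuple[int, Optional[str]]:
    """Run cmd, scanning stderr line by line and killing it on an auth failure.

    Returns (http_status, matched_line). 503 is the status named in the commit;
    the 200/500 mapping for other outcomes is an assumption.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,  # discard stdout so the child cannot block on a full pipe
        stderr=subprocess.PIPE,
        text=True,
    )
    assert proc.stderr is not None
    matched: Optional[str] = None
    for line in proc.stderr:  # yields lines as the child writes them
        matched = match_auth_failure(line)
        if matched is not None:
            proc.kill()  # stop immediately instead of waiting for the timeout
            break
    proc.wait()
    proc.stderr.close()
    if matched is not None:
        return 503, matched
    return (200 if proc.returncode == 0 else 500), None
```

Killing the process on the first match means a bad credential surfaces within moments rather than after the agent's full execution timeout.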
- Add `test_cleanup_service.py` (7 tests) to Operations & Observability
- Add `test_watchdog.py` (4 tests) for integration tests
- Add `test_watchdog_unit.py` (18 tests) for unit tests
- Update test counts: +29 tests, now ~2,172 total
- Add 2026-04-14 entry documenting #226 test updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Changes
- `db/agent_settings/resources.py`: Add `get_all_execution_timeouts()` bulk query
- `database.py`: Add delegation method
- `slot_service.py`: Accept `agent_timeouts` param in `cleanup_stale_slots()`
- `cleanup_service.py`: Pass per-agent timeouts; track confirmed running IDs from watchdog
- `tests/test_watchdog_unit.py`: Update tests for 3-tuple return value
- `docs/memory/feature-flows/cleanup-service.md`: Document #226 changes (bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout)

Test Plan
- `test_cleanup_service.py`: 7 tests pass
- `test_capacity.py`: 24 tests pass
- `test_agent_timeout.py`: 21 tests pass
- `test_watchdog_unit.py`: 18 tests pass (updated for new return signature)
- `test_watchdog.py`: 4 tests pass

Closes #226
🤖 Generated with Claude Code