fix(cleanup): use per-agent slot TTL instead of fixed 20-min default (#226) by vybe · Pull Request #323 · Abilityai/trinity

vybe · 2026-04-13T19:12:23Z

Summary

Slot cleanup previously used a fixed 20-min TTL regardless of agent timeout; it now uses a per-agent TTL (timeout + 5 min buffer)
Watchdog returns confirmed running execution IDs to avoid double-failing legitimately running tasks
Agents with long timeouts (60-120 min) no longer have slots prematurely reclaimed
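
The TTL rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `slot_ttl_seconds` and `SLOT_TTL_BUFFER_SECONDS` are hypothetical names, and falling back to the 20-minute default for agents without a configured timeout is an assumption.

```python
from typing import Dict

DEFAULT_SLOT_TTL_SECONDS = 20 * 60  # the previous fixed TTL (20 minutes)
SLOT_TTL_BUFFER_SECONDS = 5 * 60    # hypothetical name for the 5-minute buffer


def slot_ttl_seconds(agent_name: str, agent_timeouts: Dict[str, int]) -> int:
    """Return the slot TTL for one agent: its timeout plus the buffer."""
    timeout = agent_timeouts.get(agent_name)
    if timeout is None:
        # Assumption: agents with no configured timeout keep the old default.
        return DEFAULT_SLOT_TTL_SECONDS
    return timeout + SLOT_TTL_BUFFER_SECONDS
```

Under these assumptions, an agent with a 60-minute timeout gets a 65-minute slot TTL, so a sweep at the 20-minute mark no longer reclaims its slot.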

Changes

db/agent_settings/resources.py: Add get_all_execution_timeouts() bulk query
database.py: Add delegation method
slot_service.py: Accept agent_timeouts param in cleanup_stale_slots()
cleanup_service.py: Pass per-agent timeouts; track confirmed running IDs from watchdog
tests/test_watchdog_unit.py: Update tests for 3-tuple return value
docs/memory/feature-flows/cleanup-service.md: Document #226 changes
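
To make the data flow concrete, here is a rough in-memory sketch of a cleanup sweep that accepts per-agent timeouts. The real `cleanup_stale_slots()` signature and its storage backend are not shown in this PR description, so `Slot`, `cleanup_stale_slots_sketch`, and every parameter name below are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Slot:
    agent_name: str
    execution_id: str
    acquired_at: float  # epoch seconds when the slot was taken


def cleanup_stale_slots_sketch(
    slots: List[Slot],
    now: float,
    agent_timeouts: Optional[Dict[str, int]] = None,
    default_ttl: int = 20 * 60,
    buffer_seconds: int = 5 * 60,
) -> Tuple[List[Slot], List[Slot]]:
    """Split slots into (kept, reclaimed) using a per-agent TTL."""
    timeouts = agent_timeouts or {}
    kept: List[Slot] = []
    reclaimed: List[Slot] = []
    for slot in slots:
        timeout = timeouts.get(slot.agent_name)
        # Per-agent TTL when a timeout is known, otherwise the old default.
        ttl = default_ttl if timeout is None else timeout + buffer_seconds
        if now - slot.acquired_at > ttl:
            reclaimed.append(slot)
        else:
            kept.append(slot)
    return kept, reclaimed
```

Passing the timeouts in as one dictionary mirrors the bulk-query approach listed above: the sweep looks up each agent's TTL locally instead of querying per slot.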

Test Plan

test_cleanup_service.py — 7 tests pass
test_capacity.py — 24 tests pass
test_agent_timeout.py — 21 tests pass
test_watchdog_unit.py — 18 tests pass (updated for new return signature)
test_watchdog.py — 4 tests pass

Closes #226

🤖 Generated with Claude Code

…226)

The periodic slot cleanup sweep was using a fixed DEFAULT_SLOT_TTL_SECONDS (20 minutes) for all agents, regardless of their configured execution timeout. This caused premature slot reclamation for agents with longer timeouts (e.g., 60-120 minutes), leading to false "stale" failures while executions were still legitimately running.

Changes:
- Add `get_all_execution_timeouts()` bulk query to fetch all agents' timeouts
- Pass per-agent timeouts to `cleanup_stale_slots()`
- Slot TTL now computed as `timeout_seconds + 5 min buffer` per agent
- Watchdog returns `confirmed_running_ids` set to avoid double-failing executions verified as still running within their timeout
- Update unit tests to expect 3-tuple return from `_reconcile_orphaned_executions`
- Update cleanup-service feature flow documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
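
The double-fail guard can be pictured with a small sketch. The contents of the 3-tuple returned by `_reconcile_orphaned_executions` are not spelled out here, so this models only the `confirmed_running_ids` set; `executions_to_fail` is a hypothetical helper, not a function from the repository.

```python
from typing import Iterable, List, Set


def executions_to_fail(
    stale_candidates: Iterable[str],
    confirmed_running_ids: Set[str],
) -> List[str]:
    """Drop candidates the watchdog already verified as still running."""
    return [
        execution_id
        for execution_id in stale_candidates
        if execution_id not in confirmed_running_ids
    ]
```

For example, if the sweep flags `exec-1` and `exec-2` as stale but the watchdog confirmed `exec-2` as running within its timeout, only `exec-1` would be marked failed.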

Add Auth Failure Fast-Fail section to task-execution-service.md documenting:
- Agent-side pattern matcher and real-time stderr scanning
- Process kill on auth failure detection
- HTTP 503 response for auth failures
- Backend error code classification (TaskExecutionErrorCode.AUTH)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
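
As a rough illustration of the agent-side pattern matcher, the sketch below scans stderr lines and stops at the first auth failure. The patterns and the name `first_auth_failure` are invented for this example; the actual patterns, the process kill, and the HTTP 503 mapping live in the documented service and are not reproduced here.

```python
import re
from typing import Iterable, Optional

# Hypothetical patterns; the real matcher's list is not shown in this PR.
AUTH_FAILURE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"invalid api key",
        r"authentication failed",
        r"\b401\b.*unauthorized",
    )
]


def first_auth_failure(stderr_lines: Iterable[str]) -> Optional[str]:
    """Return the first stderr line that looks like an auth failure, or None."""
    for line in stderr_lines:
        if any(p.search(line) for p in AUTH_FAILURE_PATTERNS):
            # Stop at the first match so the caller can fail fast.
            return line.rstrip("\n")
    return None
```

Because the function consumes an iterable lazily, a caller could feed it a live stderr stream and react as soon as a line matches rather than waiting for the process to exit.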

- Add test_cleanup_service.py (7 tests) to Operations & Observability
- Add test_watchdog.py (4 tests) for integration tests
- Add test_watchdog_unit.py (18 tests) for unit tests
- Update test counts: +29 tests, now ~2,172 total
- Add 2026-04-14 entry documenting #226 test updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

vybe mentioned this pull request Apr 13, 2026

fix(cleanup): use per-agent slot TTL instead of fixed 20-min default (#226) #290

Closed

5 tasks

vybe and others added 2 commits April 14, 2026 07:50

vybe merged commit 17cfbab into main Apr 14, 2026

vybe merged 3 commits into main from feature/226-per-agent-slot-ttl
