feat: Active watchdog — remediate stuck executions (#129)#166
Conversation
…ring (Abilityai#129) The cleanup service now actively reconciles DB execution state against agent process registries every 5 minutes. Orphaned executions (not found on agent) are recovered, timed-out executions are auto-terminated, and capacity slots/queue state are properly released. Includes race-condition guard, per-execution error isolation, and WebSocket event broadcasting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s, cleaner imports - Reuse single httpx.AsyncClient per reconciliation cycle instead of per-call - Parallelize agent HTTP queries with asyncio.gather (O(1) vs O(agents)) - Use parse_iso_timestamp() from utils/helpers instead of manual datetime parsing - Move utc_now_iso import to module level (consistent with codebase) - Remove unused timedelta import - Fix tuple return type hint to tuple[int, int] - Remove unused message column from DB query - Deduplicate sys.path manipulation in unit tests (module-level once) - Update test mocks for new shared-client method signatures Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Queue force_release safety: Only release the agent's queue running slot if the recovered execution is the one currently holding it. Prevents corrupting queue state for a different active execution. 2. Manual execution timeout: Use agent's configured execution_timeout_seconds (from agent_ownership) as fallback instead of hardcoded 900s. Agents with custom timeouts (up to 7200s) won't have manual executions prematurely terminated. 3. Remove unused datetime import (dead code after simplification). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… IDs, debug log - Replace parse_iso_timestamp(utc_now_iso()) with utc_now() to avoid unnecessary string serialization round-trip - Filter empty/None execution_id values from agent response set - Add debug log when WebSocket manager is not set during recovery broadcast Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Abilityai#129) Adds watchdog reconciliation documentation: decision matrix, recovery helper, parallel agent fan-out, WebSocket events, error handling table, updated API response examples with new fields, and file summary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…patch grace (Abilityai#129) Resolves 8 findings from 4-pass review (Claude structured, Claude adversarial, Codex structured, Codex adversarial): Critical fixes: - Atomic queue release via Lua script (force_release_if_matches) prevents TOCTOU race where a new execution could start between check and release - _terminate_on_agent returns bool; caller skips DB/resource cleanup if terminate failed, deferring to 120-min stale cleanup safety net - 60-second dispatch grace period prevents false orphan recovery of executions still registering on the agent Informational fixes: - recovery_attempts counter only increments on actual recovery actions - Test mocks updated for force_release_if_matches and terminate return value - DB unit tests match production 3-way COALESCE with agent_ownership table - httpx.TimeoutException catches all timeout types at DEBUG level - asyncio.Lock prevents concurrent cleanup cycles Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Integration tests (test_watchdog.py) and comprehensive unit tests (test_watchdog_unit.py) for the active watchdog feature, plus shared test utilities (conftest, api_client, assertions, cleanup). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Abilityai#129) - CLEANUP-001 in requirements.md now documents watchdog capabilities: orphan recovery, auto-terminate, atomic queue release, dispatch grace, systemic failure detection, WebSocket broadcast - Feature-flows.md index updated with Abilityai#129 entry and refreshed descriptions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lows index Both conflicts were in documentation files where our branch (Abilityai#129 watchdog) and upstream (Abilityai#74 auto-assign subscriptions, Abilityai#128 startup recovery, Abilityai#19 MCP execution tools, SLACK-002 channel adapter, Abilityai#100 docs cleanup) added entries to the same location. Resolution: keep both sides in chronological order. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion encryption fix Additive conflict in changelog.md and feature-flows.md: our Abilityai#129 watchdog entry and upstream's Abilityai#148 subscription encryption fix both added to the same location. Resolution: keep both entries in chronological order. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vybe
left a comment
There was a problem hiding this comment.
This PR requires the following changes before merge:
- Update
docs/memory/architecture.mdlines 184 and 444 to mention active watchdog reconciliation (per Trinity methodology: feature change = changelog + architecture) - Remove duplicate test files — keep either
tests/orsrc/backend/tests/, not both - Fix default test password in
conftest.py("trinity"should be"password"to match dev environment)
Code and feature design look solid — the watchdog reconciliation logic, race-condition guards, Lua-script atomic release, and decision matrix are well-engineered. Just need the doc/test cleanup items above.
vybe
left a comment
There was a problem hiding this comment.
This PR still requires the same three changes from the previous review (March 25) before merge:
- Update
docs/memory/architecture.mdto mention active watchdog reconciliation (lines ~184 and ~444 — cleanup service description and monitoring API table) - Remove duplicate test files — keep either
tests/orsrc/backend/tests/, not both - Fix default test password in
conftest.py("trinity"→"password"to match dev environment)
Core feature implementation remains solid — watchdog reconciliation logic, race-condition guards, Lua-script atomic release, and decision matrix are all well-engineered. Just need the doc/test cleanup items above.
Inline Notes
|
…, conftest password - Update docs/memory/architecture.md to mention active watchdog reconciliation at lines 184 (service list) and 444 (background services table) - Remove duplicate test directory src/backend/tests/ (keep tests/ as canonical) - Fix conftest.py docstring: default password "trinity" → "password" to match code Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
timeout_secondsare auto-terminated on the agentChanges
src/backend/db/schedules.py— Addedget_running_executions_with_agent_info()(SQL JOIN with schedule + agent_ownership for timeout resolution) andmark_execution_failed_by_watchdog()(conditional UPDATE with race guard)src/backend/database.py— Exposed new DB methods via facadesrc/backend/services/cleanup_service.py— Watchdog reconciliation as first cleanup operation, shared_recover_execution()helper, parallel agent HTTP viaasyncio.gather, per-execution error isolation, systemic failure detection, WebSocket broadcastingsrc/backend/main.py— Wired WebSocket manager to cleanup servicedocs/memory/changelog.md— Added changelog entryKey Design Decisions
GET /api/executions/runningper agent, parallel viaasyncio.gatherWHERE status='running'prevents overwriting normal completionsforce_releaseif recovered execution holds the queue slotCOALESCE(schedule.timeout, agent.timeout, 900)respects per-agent configTest Plan
pytest tests/test_watchdog.py -vpytest tests/test_watchdog_unit.py -vpytest tests/test_cleanup_service.py -vCloses #129
Generated with Claude Code