feat: Active watchdog — remediate stuck executions (#129) by AndriiPasternak31 · Pull Request #166 · Abilityai/trinity

AndriiPasternak31 · 2026-03-25T03:12:28Z

Summary

Cleanup service now actively reconciles DB execution state against agent process registries every 5 minutes
Orphaned executions (DB says running, agent has no record) are recovered with descriptive error messages
Executions exceeding their schedule's timeout_seconds are auto-terminated on the agent
Capacity slots and execution queue state properly released on recovery
WebSocket events broadcast to notify frontend of watchdog actions

Changes

src/backend/db/schedules.py — Added get_running_executions_with_agent_info() (SQL JOIN with schedule + agent_ownership for timeout resolution) and mark_execution_failed_by_watchdog() (conditional UPDATE with race guard)
src/backend/database.py — Exposed new DB methods via facade
src/backend/services/cleanup_service.py — Watchdog reconciliation as first cleanup operation, shared _recover_execution() helper, parallel agent HTTP via asyncio.gather, per-execution error isolation, systemic failure detection, WebSocket broadcasting
src/backend/main.py — Wired WebSocket manager to cleanup service
docs/memory/changelog.md — Added changelog entry
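
The parallel fan-out with per-agent error isolation described for cleanup_service.py could look roughly like the sketch below. It is not the PR's code: `fetch_running_ids` is a hypothetical stand-in for one `GET /api/executions/running` request (the real service uses httpx), and the agent names are invented for illustration.

```python
import asyncio


async def fetch_running_ids(agent_name: str) -> set[str]:
    """Hypothetical stand-in for one GET /api/executions/running call."""
    await asyncio.sleep(0)  # yield control, as a real HTTP request would
    if agent_name == "agent-down":
        raise ConnectionError(f"{agent_name} unreachable")
    return {f"{agent_name}-exec-1", ""}  # include an empty ID to show filtering


async def query_agents(
    agent_names: list[str],
) -> tuple[dict[str, set[str]], list[str]]:
    """Query every agent concurrently; collect failures instead of raising."""
    results = await asyncio.gather(
        *(fetch_running_ids(name) for name in agent_names),
        return_exceptions=True,  # one failing agent must not abort the others
    )
    running: dict[str, set[str]] = {}
    unreachable: list[str] = []
    for name, result in zip(agent_names, results):
        if isinstance(result, BaseException):
            unreachable.append(name)
        else:
            # Drop empty/None execution IDs before comparing against the DB.
            running[name] = {eid for eid in result if eid}
    return running, unreachable


running, unreachable = asyncio.run(
    query_agents(["agent-a", "agent-down", "agent-b"])
)
```

With `return_exceptions=True`, a failure on one agent is returned as a value in the results list, so the remaining agents are still reconciled and the caller can count failures to spot a systemic outage.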

Key Design Decisions

Watchdog runs BEFORE stale cleanup — releases resources before passive DB-only cleanup
Batch HTTP — one GET /api/executions/running per agent, parallel via asyncio.gather
Race-condition guard — WHERE status='running' prevents overwriting normal completions
Queue safety — only force_release if recovered execution holds the queue slot
Timeout resolution — COALESCE(schedule.timeout, agent.timeout, 900) respects per-agent config
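
To make the timeout resolution and the race-condition guard concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, column names other than `timeout_seconds` and `execution_timeout_seconds`, and the `mark_failed` helper are assumptions for illustration; the PR's actual schema and `mark_execution_failed_by_watchdog()` may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema; only the two timeout column names come from the PR text.
conn.executescript("""
CREATE TABLE agent_ownership (agent_name TEXT PRIMARY KEY,
                              execution_timeout_seconds INTEGER);
CREATE TABLE schedules (id TEXT PRIMARY KEY, timeout_seconds INTEGER);
CREATE TABLE executions (id TEXT PRIMARY KEY, agent_name TEXT,
                         schedule_id TEXT, status TEXT, error TEXT);
INSERT INTO agent_ownership VALUES ('a1', 7200), ('a2', NULL);
INSERT INTO schedules VALUES ('s1', 60);
INSERT INTO executions VALUES
  ('e1', 'a1', 's1', 'running', NULL),   -- scheduled: schedule timeout wins
  ('e2', 'a1', NULL, 'running', NULL),   -- manual: falls back to agent timeout
  ('e3', 'a2', NULL, 'running', NULL),   -- nothing configured: default 900
  ('e4', 'a1', 's1', 'success', NULL);   -- already finished normally
""")

# Three-way COALESCE: schedule timeout, then agent timeout, then 900 seconds.
rows = conn.execute("""
SELECT e.id,
       COALESCE(s.timeout_seconds, o.execution_timeout_seconds, 900)
FROM executions e
LEFT JOIN schedules s ON s.id = e.schedule_id
LEFT JOIN agent_ownership o ON o.agent_name = e.agent_name
WHERE e.status = 'running'
ORDER BY e.id
""").fetchall()
timeouts = dict(rows)


def mark_failed(execution_id: str, error: str) -> bool:
    """Conditional UPDATE: only touches rows that are still 'running'."""
    cur = conn.execute(
        "UPDATE executions SET status = 'failed', error = ? "
        "WHERE id = ? AND status = 'running'",
        (error, execution_id),
    )
    conn.commit()
    return cur.rowcount == 1  # False means another writer finished it first
```

Because the status check lives inside the UPDATE itself, an execution that completed normally between the watchdog's read and its write is left untouched, and the caller learns this from the returned flag.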

Test Plan

New integration tests pass: pytest tests/test_watchdog.py -v
New unit tests pass: pytest tests/test_watchdog_unit.py -v
Existing cleanup tests unaffected: pytest tests/test_cleanup_service.py -v
Manual: Create schedule with short timeout, stop agent, verify watchdog recovers execution

Closes #129

Generated with Claude Code

…ring (Abilityai#129) The cleanup service now actively reconciles DB execution state against agent process registries every 5 minutes. Orphaned executions (not found on agent) are recovered, timed-out executions are auto-terminated, and capacity slots/queue state are properly released. Includes race-condition guard, per-execution error isolation, and WebSocket event broadcasting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…s, cleaner imports

- Reuse single httpx.AsyncClient per reconciliation cycle instead of per-call
- Parallelize agent HTTP queries with asyncio.gather (O(1) vs O(agents))
- Use parse_iso_timestamp() from utils/helpers instead of manual datetime parsing
- Move utc_now_iso import to module level (consistent with codebase)
- Remove unused timedelta import
- Fix tuple return type hint to tuple[int, int]
- Remove unused message column from DB query
- Deduplicate sys.path manipulation in unit tests (module-level once)
- Update test mocks for new shared-client method signatures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Queue force_release safety: Only release the agent's queue running slot if the recovered execution is the one currently holding it. Prevents corrupting queue state for a different active execution.
2. Manual execution timeout: Use agent's configured execution_timeout_seconds (from agent_ownership) as fallback instead of hardcoded 900s. Agents with custom timeouts (up to 7200s) won't have manual executions prematurely terminated.
3. Remove unused datetime import (dead code after simplification).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… IDs, debug log

- Replace parse_iso_timestamp(utc_now_iso()) with utc_now() to avoid unnecessary string serialization round-trip
- Filter empty/None execution_id values from agent response set
- Add debug log when WebSocket manager is not set during recovery broadcast

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Abilityai#129) Adds watchdog reconciliation documentation: decision matrix, recovery helper, parallel agent fan-out, WebSocket events, error handling table, updated API response examples with new fields, and file summary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…patch grace (Abilityai#129)

Resolves 8 findings from 4-pass review (Claude structured, Claude adversarial, Codex structured, Codex adversarial):

Critical fixes:
- Atomic queue release via Lua script (force_release_if_matches) prevents TOCTOU race where a new execution could start between check and release
- _terminate_on_agent returns bool; caller skips DB/resource cleanup if terminate failed, deferring to 120-min stale cleanup safety net
- 60-second dispatch grace period prevents false orphan recovery of executions still registering on the agent

Informational fixes:
- recovery_attempts counter only increments on actual recovery actions
- Test mocks updated for force_release_if_matches and terminate return value
- DB unit tests match production 3-way COALESCE with agent_ownership table
- httpx.TimeoutException catches all timeout types at DEBUG level
- asyncio.Lock prevents concurrent cleanup cycles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
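
One plausible reading of how the dispatch grace period, the timeout check, and the terminate-failure rule combine is sketched below. The function names, the action labels, and the exact ordering of checks are assumptions; the PR's documented decision matrix is not reproduced on this page and may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

DISPATCH_GRACE = timedelta(seconds=60)  # grace period named in the commit message


def classify(started_at: datetime, timeout_seconds: int,
             on_agent: bool, now: datetime) -> str:
    """Return a hypothetical action label for one DB row marked 'running'."""
    age = now - started_at
    if on_agent:
        # Agent still knows the execution: only act once it exceeds its timeout.
        return "terminate" if age > timedelta(seconds=timeout_seconds) else "healthy"
    if age < DISPATCH_GRACE:
        # Possibly still registering on the agent; do not treat as orphaned yet.
        return "grace"
    return "orphan"


def remediate(action: str, terminate: Callable[[], bool]) -> bool:
    """Return True if DB and resource cleanup should proceed."""
    if action == "terminate" and not terminate():
        # Terminate failed: leave the row alone and defer to stale cleanup.
        return False
    return action in ("terminate", "orphan")


now = datetime(2026, 3, 25, 12, 0, tzinfo=timezone.utc)
examples = {
    "just dispatched": classify(now - timedelta(seconds=30), 900, False, now),
    "missing on agent": classify(now - timedelta(minutes=5), 900, False, now),
    "running normally": classify(now - timedelta(minutes=5), 900, True, now),
    "past timeout": classify(now - timedelta(minutes=20), 900, True, now),
}
```

Keeping the classification a pure function of timestamps and agent state makes each cell of the matrix easy to unit test without a live agent.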

Integration tests (test_watchdog.py) and comprehensive unit tests (test_watchdog_unit.py) for the active watchdog feature, plus shared test utilities (conftest, api_client, assertions, cleanup). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Abilityai#129)

- CLEANUP-001 in requirements.md now documents watchdog capabilities: orphan recovery, auto-terminate, atomic queue release, dispatch grace, systemic failure detection, WebSocket broadcast
- Feature-flows.md index updated with Abilityai#129 entry and refreshed descriptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lows index Both conflicts were in documentation files where our branch (Abilityai#129 watchdog) and upstream (Abilityai#74 auto-assign subscriptions, Abilityai#128 startup recovery, Abilityai#19 MCP execution tools, SLACK-002 channel adapter, Abilityai#100 docs cleanup) added entries to the same location. Resolution: keep both sides in chronological order. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ion encryption fix Additive conflict in changelog.md and feature-flows.md: our Abilityai#129 watchdog entry and upstream's Abilityai#148 subscription encryption fix both added to the same location. Resolution: keep both entries in chronological order. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vybe

This PR requires the following changes before merge:

Update docs/memory/architecture.md lines 184 and 444 to mention active watchdog reconciliation (per Trinity methodology: feature change = changelog + architecture)
Remove duplicate test files — keep either tests/ or src/backend/tests/, not both
Fix default test password in conftest.py ("trinity" should be "password" to match dev environment)

Code and feature design look solid — the watchdog reconciliation logic, race-condition guards, Lua-script atomic release, and decision matrix are well-engineered. Just need the doc/test cleanup items above.

vybe

This PR still requires the same three changes from the previous review (March 25) before merge:

Update docs/memory/architecture.md to mention active watchdog reconciliation (lines ~184 and ~444 — cleanup service description and monitoring API table)
Remove duplicate test files — keep either tests/ or src/backend/tests/, not both
Fix default test password in conftest.py ("trinity" → "password" to match dev environment)

Core feature implementation remains solid — watchdog reconciliation logic, race-condition guards, Lua-script atomic release, and decision matrix are all well-engineered. Just need the doc/test cleanup items above.

vybe · 2026-03-26T11:20:58Z

Inline Notes

src/backend/tests/conftest.py — Line 450: Default TRINITY_TEST_PASSWORD is "trinity" but the dev environment uses "password". Tests will fail auth out of the box.

tests/ vs src/backend/tests/ — Both locations contain identical test_watchdog.py and test_watchdog_unit.py. Pick one location and remove the other. The new src/backend/tests/ infrastructure (conftest.py, utils/ — ~800 lines) is also substantial and may conflict with existing test conventions.

docs/memory/architecture.md — Not in the diff at all. Needs watchdog mentions added.

…, conftest password

- Update docs/memory/architecture.md to mention active watchdog reconciliation at lines 184 (service list) and 444 (background services table)
- Remove duplicate test directory src/backend/tests/ (keep tests/ as canonical)
- Fix conftest.py docstring: default password "trinity" → "password" to match code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AndriiPasternak31 and others added 10 commits March 25, 2026 02:43

vybe requested changes Mar 25, 2026

vybe requested changes Mar 26, 2026

AndriiPasternak31 requested a review from vybe March 27, 2026 00:22

merge: Resolve conflicts with upstream/main — feature-flows table ord…

337684c

…ering Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vybe merged commit b529e49 into Abilityai:main Apr 3, 2026

AndriiPasternak31 deleted the AndriiPasternak31/active-watchdog-129 branch April 25, 2026 17:59

vybe merged 12 commits into Abilityai:main from AndriiPasternak31:AndriiPasternak31/active-watchdog-129
