feat: Active watchdog — remediate stuck executions (#129)#166

Merged
vybe merged 12 commits into Abilityai:main from AndriiPasternak31:AndriiPasternak31/active-watchdog-129
Apr 3, 2026

Conversation

@AndriiPasternak31
Contributor

Summary

  • Cleanup service now actively reconciles DB execution state against agent process registries every 5 minutes
  • Orphaned executions (DB says running, agent has no record) are recovered with descriptive error messages
  • Executions exceeding their schedule's timeout_seconds are auto-terminated on the agent
  • Capacity slots and execution queue state properly released on recovery
  • WebSocket events broadcast to notify frontend of watchdog actions
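
The reconciliation described above can be pictured as a per-execution classification. The following is a minimal sketch under stated assumptions, not the code from cleanup_service.py: the RunningExecution type, the classify() function, and the three labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunningExecution:
    """One row the DB still reports as running (hypothetical shape)."""
    execution_id: str
    started_at: datetime
    timeout_seconds: int


def classify(execution: RunningExecution, agent_ids: set[str], now: datetime) -> str:
    """Return 'orphaned', 'timed_out', or 'healthy' for one DB row."""
    # DB says running, but the agent's process registry has no record of it.
    if execution.execution_id not in agent_ids:
        return "orphaned"
    # The agent still runs it, but it has exceeded its timeout.
    elapsed = (now - execution.started_at).total_seconds()
    if elapsed > execution.timeout_seconds:
        return "timed_out"
    return "healthy"
```

In this sketch an orphaned execution only needs DB and resource cleanup, while a timed-out one would first have to be terminated on the agent.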

Changes

  • src/backend/db/schedules.py — Added get_running_executions_with_agent_info() (SQL JOIN with schedule + agent_ownership for timeout resolution) and mark_execution_failed_by_watchdog() (conditional UPDATE with race guard)
  • src/backend/database.py — Exposed new DB methods via facade
  • src/backend/services/cleanup_service.py — Watchdog reconciliation as first cleanup operation, shared _recover_execution() helper, parallel agent HTTP via asyncio.gather, per-execution error isolation, systemic failure detection, WebSocket broadcasting
  • src/backend/main.py — Wired WebSocket manager to cleanup service
  • docs/memory/changelog.md — Added changelog entry

Key Design Decisions

  1. Watchdog runs BEFORE stale cleanup: releases resources before passive DB-only cleanup
  2. Batch HTTP: one GET /api/executions/running per agent, parallel via asyncio.gather
  3. Race-condition guard: WHERE status='running' prevents overwriting normal completions
  4. Queue safety: only force_release if recovered execution holds the queue slot
  5. Timeout resolution: COALESCE(schedule.timeout, agent.timeout, 900) respects per-agent config
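
Decisions 3 and 5 can be illustrated with an in-memory SQLite database. This is a sketch only: the table layout, the sample rows, and the function names effective_timeouts() and mark_failed_by_watchdog() are assumptions made for the example and do not reproduce the schema or queries in src/backend/db/schedules.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema for illustration only.
conn.executescript("""
CREATE TABLE agent_ownership (agent_name TEXT PRIMARY KEY, execution_timeout_seconds INTEGER);
CREATE TABLE schedules (id TEXT PRIMARY KEY, timeout_seconds INTEGER);
CREATE TABLE executions (id TEXT PRIMARY KEY, agent_name TEXT, schedule_id TEXT,
                         status TEXT, error TEXT);
INSERT INTO agent_ownership VALUES ('agent-a', 7200), ('agent-b', NULL);
INSERT INTO schedules VALUES ('sched-1', 120);
INSERT INTO executions VALUES
  ('e1', 'agent-a', 'sched-1', 'running', NULL),
  ('e2', 'agent-a', NULL,      'running', NULL),
  ('e3', 'agent-b', NULL,      'running', NULL),
  ('e4', 'agent-b', NULL,      'success', NULL);
""")


def effective_timeouts(conn: sqlite3.Connection) -> dict[str, int]:
    """Resolve the timeout per running execution: schedule, then agent, then 900."""
    rows = conn.execute("""
        SELECT e.id, COALESCE(s.timeout_seconds, a.execution_timeout_seconds, 900)
        FROM executions e
        LEFT JOIN schedules s ON s.id = e.schedule_id
        LEFT JOIN agent_ownership a ON a.agent_name = e.agent_name
        WHERE e.status = 'running'
    """).fetchall()
    return dict(rows)


def mark_failed_by_watchdog(conn: sqlite3.Connection, execution_id: str, error: str) -> bool:
    """Conditional UPDATE: only touches the row if it is still 'running'."""
    cur = conn.execute(
        "UPDATE executions SET status = 'failed', error = ? "
        "WHERE id = ? AND status = 'running'",
        (error, execution_id),
    )
    conn.commit()
    # rowcount is 0 when the execution already completed normally.
    return cur.rowcount == 1
```

The boolean return lets a caller skip resource release when the row had already left the running state, which is the point of the guard.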

Test Plan

  • New integration tests pass: pytest tests/test_watchdog.py -v
  • New unit tests pass: pytest tests/test_watchdog_unit.py -v
  • Existing cleanup tests unaffected: pytest tests/test_cleanup_service.py -v
  • Manual: Create schedule with short timeout, stop agent, verify watchdog recovers execution

Closes #129

Generated with Claude Code

AndriiPasternak31 and others added 10 commits March 25, 2026 02:43
…ring (Abilityai#129)

The cleanup service now actively reconciles DB execution state against
agent process registries every 5 minutes. Orphaned executions (not found
on agent) are recovered, timed-out executions are auto-terminated, and
capacity slots/queue state are properly released. Includes race-condition
guard, per-execution error isolation, and WebSocket event broadcasting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s, cleaner imports

- Reuse single httpx.AsyncClient per reconciliation cycle instead of per-call
- Parallelize agent HTTP queries with asyncio.gather (roughly one round trip of wall-clock latency instead of one per agent)
- Use parse_iso_timestamp() from utils/helpers instead of manual datetime parsing
- Move utc_now_iso import to module level (consistent with codebase)
- Remove unused timedelta import
- Fix tuple return type hint to tuple[int, int]
- Remove unused message column from DB query
- Deduplicate sys.path manipulation in unit tests (module-level once)
- Update test mocks for new shared-client method signatures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
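
The parallel fan-out mentioned in this commit might look roughly like the sketch below. It is illustrative rather than the PR's code: fetch_running_ids() is a hypothetical stub standing in for the real HTTP call, and the use of return_exceptions=True to isolate per-agent failures is an assumption about one reasonable way to do it.

```python
import asyncio


async def fetch_running_ids(agent: str) -> set[str]:
    """Stand-in for one GET /api/executions/running call (hypothetical stub)."""
    await asyncio.sleep(0.01)  # simulate network latency
    if agent == "agent-down":
        raise ConnectionError(f"{agent} unreachable")
    return {f"{agent}-exec-1"}


async def query_all_agents(agents: list[str]) -> tuple[dict[str, set[str]], list[str]]:
    """Query every agent concurrently; one failing agent does not abort the rest."""
    results = await asyncio.gather(
        *(fetch_running_ids(agent) for agent in agents),
        return_exceptions=True,  # exceptions are returned as values, not raised
    )
    running: dict[str, set[str]] = {}
    unreachable: list[str] = []
    for agent, result in zip(agents, results):
        if isinstance(result, BaseException):
            unreachable.append(agent)
        else:
            running[agent] = result
    return running, unreachable
```

Keeping the unreachable agents in a separate list means their executions can be left alone for the cycle rather than being mistaken for orphans.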
1. Queue force_release safety: Only release the agent's queue running
   slot if the recovered execution is the one currently holding it.
   Prevents corrupting queue state for a different active execution.

2. Manual execution timeout: Use agent's configured execution_timeout_seconds
   (from agent_ownership) as fallback instead of hardcoded 900s. Agents
   with custom timeouts (up to 7200s) won't have manual executions
   prematurely terminated.

3. Remove unused datetime import (dead code after simplification).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
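
The queue-safety rule in item 1 amounts to a compare-then-release. A toy in-memory sketch, assuming a hypothetical AgentQueue class and release_if_holder() method that are not taken from the codebase:

```python
from typing import Optional


class AgentQueue:
    """Toy model of a per-agent queue with a single running slot (hypothetical)."""

    def __init__(self) -> None:
        self.running: Optional[str] = None

    def release_if_holder(self, execution_id: str) -> bool:
        """Free the running slot only if this execution currently holds it."""
        if self.running != execution_id:
            # A different execution owns the slot; leave it untouched.
            return False
        self.running = None
        return True
```

In a shared store the comparison and the release would need to happen atomically; this single-process sketch sidesteps that concern.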
… IDs, debug log

- Replace parse_iso_timestamp(utc_now_iso()) with utc_now() to avoid
  unnecessary string serialization round-trip
- Filter empty/None execution_id values from agent response set
- Add debug log when WebSocket manager is not set during recovery broadcast

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Abilityai#129)

Adds watchdog reconciliation documentation: decision matrix, recovery
helper, parallel agent fan-out, WebSocket events, error handling table,
updated API response examples with new fields, and file summary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…patch grace (Abilityai#129)

Resolves 8 findings from 4-pass review (Claude structured, Claude adversarial,
Codex structured, Codex adversarial):

Critical fixes:
- Atomic queue release via Lua script (force_release_if_matches) prevents TOCTOU
  race where a new execution could start between check and release
- _terminate_on_agent returns bool; caller skips DB/resource cleanup if terminate
  failed, deferring to 120-min stale cleanup safety net
- 60-second dispatch grace period prevents false orphan recovery of executions
  still registering on the agent

Informational fixes:
- recovery_attempts counter only increments on actual recovery actions
- Test mocks updated for force_release_if_matches and terminate return value
- DB unit tests match production 3-way COALESCE with agent_ownership table
- httpx.TimeoutException catches all timeout types at DEBUG level
- asyncio.Lock prevents concurrent cleanup cycles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
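
Two of the critical fixes above, the dispatch grace period and skipping cleanup when termination fails, can be sketched as follows. The function names, the callable parameters, and the strict "older than 60 seconds" comparison are assumptions for illustration; the actual boundary handling in the PR is not shown in this conversation.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

DISPATCH_GRACE = timedelta(seconds=60)


def is_orphan_candidate(started_at: datetime, on_agent: bool, now: datetime) -> bool:
    """Treat an execution as orphaned only if the agent lacks it AND grace has passed."""
    if on_agent:
        return False
    # Executions younger than the grace period may still be registering on the agent.
    return now - started_at > DISPATCH_GRACE


def recover_timed_out(
    execution_id: str,
    terminate: Callable[[str], bool],
    mark_failed: Callable[[str], None],
) -> bool:
    """Only clean up DB state if the agent confirmed termination."""
    if not terminate(execution_id):
        # Leave the row alone; a later stale-cleanup pass acts as the safety net.
        return False
    mark_failed(execution_id)
    return True
```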
Integration tests (test_watchdog.py) and comprehensive unit tests
(test_watchdog_unit.py) for the active watchdog feature, plus shared
test utilities (conftest, api_client, assertions, cleanup).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Abilityai#129)

- CLEANUP-001 in requirements.md now documents watchdog capabilities:
  orphan recovery, auto-terminate, atomic queue release, dispatch grace,
  systemic failure detection, WebSocket broadcast
- Feature-flows.md index updated with Abilityai#129 entry and refreshed descriptions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lows index

Both conflicts were in documentation files where our branch (Abilityai#129 watchdog)
and upstream (Abilityai#74 auto-assign subscriptions, Abilityai#128 startup recovery, Abilityai#19 MCP
execution tools, SLACK-002 channel adapter, Abilityai#100 docs cleanup) added entries
to the same location. Resolution: keep both sides in chronological order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ion encryption fix

Additive conflict in changelog.md and feature-flows.md: our Abilityai#129 watchdog
entry and upstream's Abilityai#148 subscription encryption fix both added to the
same location. Resolution: keep both entries in chronological order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

@vybe vybe left a comment

This PR requires the following changes before merge:

  • Update docs/memory/architecture.md lines 184 and 444 to mention active watchdog reconciliation (per Trinity methodology: feature change = changelog + architecture)
  • Remove duplicate test files — keep either tests/ or src/backend/tests/, not both
  • Fix default test password in conftest.py ("trinity" should be "password" to match dev environment)

Code and feature design look solid — the watchdog reconciliation logic, race-condition guards, Lua-script atomic release, and decision matrix are well-engineered. Just need the doc/test cleanup items above.

Contributor

@vybe vybe left a comment

This PR still requires the same three changes from the previous review (March 25) before merge:

  • Update docs/memory/architecture.md to mention active watchdog reconciliation (lines ~184 and ~444 — cleanup service description and monitoring API table)
  • Remove duplicate test files — keep either tests/ or src/backend/tests/, not both
  • Fix default test password in conftest.py ("trinity" → "password" to match dev environment)

Core feature implementation remains solid — watchdog reconciliation logic, race-condition guards, Lua-script atomic release, and decision matrix are all well-engineered. Just need the doc/test cleanup items above.

@vybe
Contributor

vybe commented Mar 26, 2026

Inline Notes

src/backend/tests/conftest.py — Line 450: Default TRINITY_TEST_PASSWORD is "trinity" but the dev environment uses "password". Tests will fail auth out of the box.

tests/ vs src/backend/tests/ — Both locations contain identical test_watchdog.py and test_watchdog_unit.py. Pick one location and remove the other. The new src/backend/tests/ infrastructure (conftest.py, utils/ — ~800 lines) is also substantial and may conflict with existing test conventions.

docs/memory/architecture.md — Not in the diff at all. Needs watchdog mentions added.

…, conftest password

- Update docs/memory/architecture.md to mention active watchdog reconciliation
  at lines 184 (service list) and 444 (background services table)
- Remove duplicate test directory src/backend/tests/ (keep tests/ as canonical)
- Fix conftest.py docstring: default password "trinity" → "password" to match code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AndriiPasternak31 AndriiPasternak31 requested a review from vybe March 27, 2026 00:22
…ering

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vybe vybe merged commit b529e49 into Abilityai:main Apr 3, 2026
@AndriiPasternak31 AndriiPasternak31 deleted the AndriiPasternak31/active-watchdog-129 branch April 25, 2026 17:59
Development

Successfully merging this pull request may close these issues.

Active watchdog: remediate stuck executions detected by monitoring

2 participants