fix(cleanup): prevent phantom stale-slot failures via JIT re-verify (#378)#403
Merged
vybe merged 3 commits intoApr 19, 2026
Merged
Conversation
Adds TestProcessStaleSlotReclaims (8 scenarios) to test_watchdog_unit.py and tests/unit/test_schedule_status_observability.py (4 scenarios). Tests reproduce the race where cleanup service's Phase 3 marks executions FAILED with "Stale execution — slot TTL expired" while the task is still running on the agent, and the subsequent SUCCESS write overwrites FAILED. These tests fail against the current code and pass after the fix + log that follow in subsequent commits. Refs Abilityai#378 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#378) Phase 3 (slot cleanup) previously marked reclaimed executions FAILED using only Phase 0's `confirmed_running_ids` snapshot. Between Phase 0's batch query and Phase 3's decision, an agent could drop a just-completed execution from its registry while the corresponding SUCCESS response was still in flight to the backend — Phase 3 would then write FAILED and be overwritten by the late SUCCESS, producing a phantom failure flash. This extracts the Phase 3 loop into `_process_stale_slot_reclaims` (mirrors the testability pattern of `_reconcile_orphaned_executions`) and adds a just-in-time re-verify call to each agent immediately before writing FAILED. Per-agent fan-out via `asyncio.gather` keeps worst-case Phase 3 wall-time at O(5s) regardless of fleet size (same pattern as Phase 0 at cleanup_service.py:258-263). On agent unreachable during re-verify: skip this cycle. No cross-cycle deferral state — `slot_service.cleanup_stale_slots` removes reclaimed IDs from Redis permanently (zremrangebyscore), so a deferred ID would never reappear to be re-checked. Transient unreachability is caught by Phase 0's orphan recovery on later cycles; truly stuck agents by Phase 1 (120-min stale cleanup). Refs Abilityai#378 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…docs
Emit a WARNING log in `update_execution_status` when a SUCCESS write
overwrites a row whose error matches the Phase-3 phantom-stale marker
("Stale execution — slot TTL expired"). Gives production signal if the
primary fix misses a corner case, without changing update semantics —
the agent's authoritative SUCCESS still wins.
The pattern match is narrow on purpose: other legitimate cleanup paths
(Phase 0 auto-terminate, Phase 1 stale cleanup, startup recovery) also
write FAILED via unguarded `update_execution_status`. An unscoped log
would misattribute those transitions to Abilityai#378.
Grep production logs with:
docker logs trinity-backend | grep "residual race condition (Abilityai#378)"
Updates `docs/memory/feature-flows/cleanup-service.md` with a new
"Phase 3 Slot Reclaim Re-verification (Issue Abilityai#378)" section and adds
an index row to `docs/memory/feature-flows.md`.
Refs Abilityai#378
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
46f9ee1 to
5a00650
Compare
5 tasks
vybe
added a commit
that referenced
this pull request
Apr 19, 2026
… design Catalogs ~55 invariants across 14 subsystems (executions, slots, backlog, activities, agent lifecycle, subprocess hygiene, channels, etc.) with tier (always vs. eventually-consistent), severity, and queryable violation signals. Proposes a continuous canary harness to verify them on staging. Grounded in recent fixes: #378/#403 (phantom stale slots), #407 (subprocess reaping), #129 (orphaned executions), #219/#226 (slot TTL mismatch). Tracked in #411.
obasilakis
added a commit
to obasilakis/trinity
that referenced
this pull request
May 4, 2026
Implements the canary harness from Abilityai#411 that catches the bug class behind PRs Abilityai#378, Abilityai#403, Abilityai#129, Abilityai#226 — race conditions and cross-component state drift between Redis × SQLite × agent registries that unit tests miss. Architecture (per design discussion in PR comments): - Deterministic library (src/backend/canary/) — pure-function checks shared between the watcher and the on-demand admin endpoint. No LLM reasoning anywhere — same snapshot in, same violations out. - Watcher service (services/canary_service.py) — 5-min APScheduler-style loop modeled on cleanup_service.py. Disabled by default; enable on staging/dev with CANARY_ENABLED=1. - Fleet manifest (config/canary-fleet.yaml) — synthetic load via the existing /api/systems/deploy mechanism. Without traffic the harness reports trivially-green cycles. Phase 1 invariants: - S-01 (slot–row bijection) — Redis ZRANGE vs SQL running rows; drain sentinels filtered. Critical severity. - E-02 (no phantom reversal) — terminal executions stay terminal. Phase 1 uses Redis-backed state comparison instead of Vector log diff for simplicity. Critical severity. - L-03 (delete cascades) — orphan-row scan across cross-cutting tables (agent_sharing, agent_schedules, schedule_executions [non-terminal], agent_skills, agent_tags, agent_shared_files, agent_public_links, pending operator_queue / access_requests, agent-scoped mcp_api_keys, active chat_sessions) plus orphan agent:slots:* Redis key scan. Critical for active-orchestration orphans, major otherwise. Persistence + API: - canary_violations table (migration 34, schema in db/schema.py). - GET /api/canary/violations + /violations/stats + /violations/{id} — admin-only read paths over the persisted history. - POST /api/canary/run-cycle — admin-only on-demand trigger; delegates to the same CanaryService.run_cycle() the background loop calls. Notifications: - One green→red transition emits one notification via db.create_notification(notification_type='alert', category='canary'). - Severity → priority: critical → urgent, major → high, minor → normal. - Continuing-red cycles do not re-notify; the row in canary_violations is the trend signal, the bell is the "now" signal. Tests: - 27 unit tests in tests/test_canary_invariants.py covering CanaryOperations CRUD/filters/stats, snapshot collector behaviour, each invariant's holds-and-fires cases, the L-03 critical/major severity split, and the runner registry. Smoke-test recipe (Option 1 — orphan SQL row, no code changes): # 1. Set CANARY_ENABLED=1 and restart backend (creates canary_violations # table via migration; starts the watcher). # 2. Deploy the fleet: curl -X POST -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d "$(jq -Rs '{manifest: .}' < config/canary-fleet.yaml)" \ http://localhost:8000/api/systems/deploy # 3. Baseline cycle: curl -X POST -H "Authorization: Bearer $TOKEN" \ http://localhost:8000/api/canary/run-cycle | jq '.violations, .transitions' # 4. Inject one orphan row: sqlite3 ~/trinity-data/trinity.db \ "INSERT INTO agent_sharing (agent_name, shared_with_email, shared_by_id, created_at) \ VALUES ('ghost-canary-test', 'smoketest@example.com', '<admin-id>', datetime('now'));" # 5. Trigger again — expect L-03 in violations + transitions: curl -X POST -H "Authorization: Bearer $TOKEN" \ http://localhost:8000/api/canary/run-cycle | jq '.violations[0], .transitions' # 6. Verify in UI: bell shows new 'canary' notification (severity major). # 7. Cleanup: sqlite3 ~/trinity-data/trinity.db \ "DELETE FROM agent_sharing WHERE agent_name='ghost-canary-test';" Closes part of Abilityai#411 — Phase 1 acceptance criteria (S-01, E-02, L-03 running on staging continuously, table + read endpoint shipped). Phase 2 adds S-02, S-03, E-01, E-05, E-06, B-01, B-02, G-01, R-01 incrementally without changes to the watcher or API surface.
8 tasks
vybe
pushed a commit
that referenced
this pull request
May 10, 2026
* feat(#411): canary invariant harness Phase 1 (S-01, E-02, L-03) Implements the canary harness from #411 that catches the bug class behind PRs #378, #403, #129, #226 — race conditions and cross-component state drift between Redis × SQLite × agent registries that unit tests miss. Architecture (per design discussion in PR comments): - Deterministic library (src/backend/canary/) — pure-function checks shared between the watcher and the on-demand admin endpoint. No LLM reasoning anywhere — same snapshot in, same violations out. - Watcher service (services/canary_service.py) — 5-min APScheduler-style loop modeled on cleanup_service.py. Disabled by default; enable on staging/dev with CANARY_ENABLED=1. - Fleet manifest (config/canary-fleet.yaml) — synthetic load via the existing /api/systems/deploy mechanism. Without traffic the harness reports trivially-green cycles. Phase 1 invariants: - S-01 (slot–row bijection) — Redis ZRANGE vs SQL running rows; drain sentinels filtered. Critical severity. - E-02 (no phantom reversal) — terminal executions stay terminal. Phase 1 uses Redis-backed state comparison instead of Vector log diff for simplicity. Critical severity. - L-03 (delete cascades) — orphan-row scan across cross-cutting tables (agent_sharing, agent_schedules, schedule_executions [non-terminal], agent_skills, agent_tags, agent_shared_files, agent_public_links, pending operator_queue / access_requests, agent-scoped mcp_api_keys, active chat_sessions) plus orphan agent:slots:* Redis key scan. Critical for active-orchestration orphans, major otherwise. Persistence + API: - canary_violations table (migration 34, schema in db/schema.py). - GET /api/canary/violations + /violations/stats + /violations/{id} — admin-only read paths over the persisted history. - POST /api/canary/run-cycle — admin-only on-demand trigger; delegates to the same CanaryService.run_cycle() the background loop calls. Notifications: - One green→red transition emits one notification via db.create_notification(notification_type='alert', category='canary'). - Severity → priority: critical → urgent, major → high, minor → normal. - Continuing-red cycles do not re-notify; the row in canary_violations is the trend signal, the bell is the "now" signal. Tests: - 27 unit tests in tests/test_canary_invariants.py covering CanaryOperations CRUD/filters/stats, snapshot collector behaviour, each invariant's holds-and-fires cases, the L-03 critical/major severity split, and the runner registry. Smoke-test recipe (Option 1 — orphan SQL row, no code changes): # 1. Set CANARY_ENABLED=1 and restart backend (creates canary_violations # table via migration; starts the watcher). # 2. Deploy the fleet: curl -X POST -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d "$(jq -Rs '{manifest: .}' < config/canary-fleet.yaml)" \ http://localhost:8000/api/systems/deploy # 3. Baseline cycle: curl -X POST -H "Authorization: Bearer $TOKEN" \ http://localhost:8000/api/canary/run-cycle | jq '.violations, .transitions' # 4. Inject one orphan row: sqlite3 ~/trinity-data/trinity.db \ "INSERT INTO agent_sharing (agent_name, shared_with_email, shared_by_id, created_at) \ VALUES ('ghost-canary-test', 'smoketest@example.com', '<admin-id>', datetime('now'));" # 5. Trigger again — expect L-03 in violations + transitions: curl -X POST -H "Authorization: Bearer $TOKEN" \ http://localhost:8000/api/canary/run-cycle | jq '.violations[0], .transitions' # 6. Verify in UI: bell shows new 'canary' notification (severity major). # 7. Cleanup: sqlite3 ~/trinity-data/trinity.db \ "DELETE FROM agent_sharing WHERE agent_name='ghost-canary-test';" Closes part of #411 — Phase 1 acceptance criteria (S-01, E-02, L-03 running on staging continuously, table + read endpoint shipped). Phase 2 adds S-02, S-03, E-01, E-05, E-06, B-01, B-02, G-01, R-01 incrementally without changes to the watcher or API surface. * fix(#411): canary harness demo-driven fixes Six fixes from the live end-to-end smoke test on staging-equivalent local stack — all surfaced by walking the L-03 + S-01 demo recipes: 1. **`CANARY_ENABLED` env passthrough** (`docker-compose.yml`). The flag was set in `.env` but never reached the backend container, so the watcher silently no-op'd on staging. Added explicit `${CANARY_ENABLED:-0}` mapping. 2. **`canary-fleet.yaml` CPU fix** (`config/canary-fleet.yaml`). `cpu: "0.5"` failed `int()` in Trinity's resource parser. Bumped `burst` to `cpu: "1"` matching the platform's whole-number convention. 3. **Notifications API admin bypass** (`routers/notifications.py`). `agent_notifications.agent_name` for canary entries is the synthetic `canary-harness` (no Docker container). The existing per-agent access filter excluded it from `GET /api/notifications` so the bell stayed empty. Admins now skip the per-agent filter entirely; non-admins keep the existing semantics. 4. **Drop redundant timestamp from notification body** (`services/canary_service.py:_render_message`). The notification panel renders `created_at` as "just now / 4m ago" and the precise ISO `snapshot_time` is in `metadata.snapshot_time`. Embedding the full ISO microsecond suffix in the message text was pure clutter. 5. **Fix transition detection** (`services/canary_service.py`). Old rule "previous violation predates this snapshot → transition" fired on *every* continuing-red cycle, since prior violations are always older than the current snapshot. Replaced with a Redis-backed previous-cycle cursor (`canary:last_cycle_at`): transition only fires when the latest stored violation predates the *previous cycle's* snapshot, i.e. the previous cycle was green for that invariant. Handles first-ever / continuing-red / red→green→red correctly; verified live against L-03 (one notification across two consecutive red cycles). 6. **Run-cycle endpoint reflects what the service emitted** (`routers/canary.py`). Replaced the router's duplicate transition-detection algorithm with `CycleResult. transition_invariant_ids` returned from `CanaryService.run_cycle()`. The endpoint and the bell can no longer disagree about which invariants fired. Plus a `snapshot.py` module-docstring expansion explaining why state collection is its own module separate from the invariant checks (consistency-per-cycle, no duplicated query code, test-friendliness, explicit non-atomicity across Redis × SQLite). * fix(#411): canary terminal status set must match TaskExecutionStatus enum Surfaced by the live E-02 demo: cycle 1 was supposed to seed `canary:e02:terminal_seen` from a freshly-completed fleet task but the Redis hash stayed empty, so the simulated reversal in cycle 2 went undetected. Root cause: `TERMINAL_EXECUTION_STATUSES` in `canary/snapshot.py` listed `("completed", "failed", "cancelled", "timeout")` but Trinity's actual terminal set (per `models.TaskExecutionStatus` and the CAS state machine in PR #524) is `("success", "failed", "cancelled", "skipped")`. Two of the four values were wrong: - `completed` is never written — `success` is the canonical name. - `timeout` is not a status; timed-out executions land in `failed`. - `skipped` (capacity-overflow path, SCHED-COND-001) was missing. Effects of the bug: 1. **E-02 silently broken on production data.** Fleet executions finishing with `status='success'` were not added to `terminal_exec_ids`, so the next cycle's reversal-detection comparison set was always empty. A real terminal→running flip would have gone undetected indefinitely. 2. **L-03 orphan-scan filter wrong.** The `schedule_executions` filter used the same hardcoded list, so legitimate-terminal `success`/`skipped` rows were treated as non-terminal and scanned for orphans on every cycle. Harmless when no orphans exist, but would have generated false-positive L-03 hits in ordinary recovery scenarios. Fix: - Replace the constant with `("success", "failed", "cancelled", "skipped")` matching `TaskExecutionStatus`. - Build the L-03 orphan-scan SQL filter from the constant (`f"status NOT IN ({_TERMINAL_SQL_LIST})"`) so the two stay in lockstep. - Drop the duplicate constant declaration further down the file. - Update e02 docstring + test fixtures (`'completed'` → `'success'`). Verified locally: - `pytest tests/test_canary_invariants.py` — 27/27 passing. - End-to-end E-02 demo: cycle 1 seeds `canary:e02:terminal_seen` (HEXISTS=1), cycle 2 fires after `UPDATE schedule_executions SET status='running'`, cycle 3 stays silent (continuing-red), fix-step reverts the row → cycle 4 green. Exactly 1 notification. This was a real defect waiting to ship — caught only because the demo went all the way through to a real fleet task. Argues for an integration-test layer in a follow-up PR (per `test_systems.py`- style harness discussed in the original PR thread). * fix(#411): use iso_cutoff() for E-02 terminal-row window (invariant #16) _collect_terminal_executions filtered with `datetime('now', '-N minutes')` against `completed_at` (written via `utc_now_iso()` → ISO-Z). The two formats diverge at the date/time separator (T vs space), so lexicographic compare let every same-day terminal row through regardless of age — the exact failure invariant #16 / issue #476 was written about. Consequence: on busy installs the canary:e02:terminal_seen Redis hash grew unbounded until PREV_TERMINAL_MAX nuked the prev set, leaving E-02 with nothing to diff against on the following cycle. - Extend iso_cutoff() with a keyword-only `minutes=` parameter for sub-hour windows. Existing positional callers unaffected. - Switch snapshot._collect_terminal_executions to iso_cutoff(minutes=...). - Pin the new signature with TestMinutesKwarg in test_iso_cutoff.py (incl. a TypeError check on `iso_cutoff(2, 30)` so a future refactor can't quietly turn that into "2h30m"). * perf(#411): drop N+1 in run-cycle, thread row ids through CycleResult The router's re-fetch loop ran list_canary_violations once per violation (not per invariant), repeatedly querying the same (invariant_id, snapshot_time) window and matching rows back to violations by signal_query — both wasteful and silently wrong when two violations share a signal_query. The service already had the row id from insert_canary_violation and threw it away. Thread it through CycleResult.persisted_violation_ids (index-aligned with .violations, None on insert failure) so the router maps directly to row ids without touching the DB. * test(#411): cover CanaryService.run_cycle orchestration The 27-test suite hit DB CRUD and the pure invariants but never exercised the orchestrator — which is where the demo-driven bugs lived (e7c11b2 transition spam, ef40cf9 terminal-status typo). Both shipped past the unit suite and were caught only by hand-driven demo runs. Adds a TestCanaryService class with four end-to-end tests over run_cycle() using the existing FakeRedis + temp SQLite, plus a canary_service fixture that stubs `database` / `db_models` and records create_notification calls so transitions can be asserted directly: - test_first_cycle_violation_fires_one_notification — baseline - test_continuing_red_does_not_re_fire — regression for e7c11b2 - test_red_green_red_fires_twice — covers the re-arm branch - test_terminal_status_set_seeds_e02_side_table — regression for ef40cf9 (would fail against the pre-fix tuple) Also adds string get/set ops to FakeRedis so the previous-cycle cursor (canary:last_cycle_at) round-trips. 31/31 tests pass. * feat(#411): stub canary alert sink — drop dashboard notifications Per product call: green→red transitions no longer write agent_notifications rows or broadcast over WebSocket. Slack lands in a follow-up PR and plugs into the same _emit_transition seam. - _emit_transition now calls a no-op _notify_transition_stub (logs only) - _render_message / _broadcast / _severity_rank / ws-manager hooks kept in place so the next PR is a single-file change - Tests assert on transition_invariant_ids / cumulative_transitions and pin notification_calls == [] to catch regressions - architecture.md notes Phase 1 has no alert sink wired * docs(#411): drop "bell" shorthand from canary comments The dashboard surface is the Operating Room notifications panel, not a literal navbar bell — comments and docstrings now say "alert sink" or "panel" so a future reader doesn't go hunting for a bell icon. * fix(#411): point canary fleet at test-echo template (local:default doesn't exist) The manifest referenced `local:default`, which has no template directory under config/agent-templates/. Agent creation silently skips template-volume mounting on missing paths (services/agent_service/crud.py:341), so the fleet members were spinning up from the bare base image with no CLAUDE.md — verified with `docker exec agent-canary-fleet-burst cat /home/developer/CLAUDE.md` returning "No such file". Switched to `local:test-echo`: minimal stock template, no MCP/credentials, deterministic short replies — exactly what S-01/E-02 slot-churn invariants need. Schedule messages simplified accordingly. Verified end-to-end on local: deleted broken containers, redeployed via POST /api/systems/deploy, agents came up with templates, autonomy enabled, schedules fired, real-flow ZREM regression in slot_service.py produced S-01 critical violation captured by both the 5-min watcher (transitions=1) and on-demand POST /api/canary/run-cycle. Recovery path returned to green. * docs(#411): sync requirements.md §31.1 with stubbed alert sink The Notifications bullet still described the rejected dashboard path (db.create_notification → agent_notifications) after 6e1bb94 stubbed the alert sink. Replaced with an Alert sink bullet noting the no-op _notify_transition_stub and the preserved _emit_transition seam for the follow-up Slack PR. Architecture.md and requirements.md now agree. * feat(#411): wire canary alert sink to Slack via incoming webhook - _emit_transition POSTs Block Kit payload via slack_service.post_webhook - payload: severity emoji header + friendly invariant name, summary, per-invariant forensic detail, runbook hint, context line with snapshot_time + violation row #N + "last red Xm ago" badge - CANARY_SLACK_WEBHOOK_URL env var (unset = silent sink) - previous_violation_at now populated through CycleResult (was always None) - removes dead WS-broadcast scaffolding from the dropped Phase 1 path * revert(#411): drop admin notifications bypass Justification was canary harness notifications under a synthetic agent name, but the alert sink was rerouted to Slack (21b833d) and never writes to the notifications table. Reverting restores ownership-scoped filtering for admins — no live caller depends on the bypass. * fix(#411): surface snapshot sources_unavailable in run-cycle response Thread snapshot.sources_unavailable through CycleResult so the on-demand run-cycle endpoint reflects degraded Redis/SQLite reads instead of always returning []. The field's whole purpose is to expose partial-state cycles to the operator running the smoke test. * docs(#411): sweep stale canary agent/skill references Pre-pivot vocabulary referenced a canary agent template invoking mcp__trinity__send_notification on a per-agent loop. The harness has been a backend service with a Slack webhook sink for several commits; update the docstrings to match. * docs(#411): sync migrations.py header to all 57 entries Header docstring stopped at #34 while MIGRATIONS tuple has 57 entries. Pre-existing drift; the canary migration added to it. * fix(#411): preserve real terminal status in E-02 reversal alerts The E-02 phantom-reversal side-table stored the literal string "terminal" against every id, so once the Slack sink landed its forensic line printed "terminal → running" — useless to on-call. The snapshot now carries terminal_exec_statuses: dict[eid, status] and E-02 maintains a parallel canary:e02:terminal_status hash keyed alongside the canary:e02:terminal_seen ZSET. The real prior status (success / failed / cancelled / skipped) is read back when a reversal fires and rendered into both observed_state and the signal_query line. Aged-out ids are HDEL'd in the same sweep that ZREMRANGEBYSCORE drops them so the two structures stay in sync. Bundled side-table refactor: HASH + hard 5000-row DEL-on-overflow becomes ZSET trimmed by ZREMRANGEBYSCORE on PREV_TERMINAL_RETENTION_SECONDS (1h). The hard-reset trim left a one-cycle E-02 blind spot every time the cap crossed on busy installs. Tests: extends FakeRedis with zrangebyscore / hmget / hdel; asserts the existing reversal test now reports previous_status="success", adds a new test that round-trips all four terminal statuses through seed→reverse, and a regression test for the age-based trim invariant. * refactor(#411): drop redundant valid_ids filter in run-cycle endpoint After the 422 raise on invalid_ids, requested_ids is provably all-valid, so re-filtering against INVARIANTS was dead code. Pass requested_ids directly to run_cycle(). * docs(#411): correct operator in _is_green_to_red comment Comment said `>=` but the code uses `<`. * docs(#411): correct phase label in canary alert sink comment CANARY-001 alert sink ships in Phase 1, not Phase 2. * fix(#411): grace window in S-01 to suppress slot/row race false positives The bijection check fired on a transient mismatch caught by a snapshot landing in the ~30ms gap between the SQL `status='running'` commit and the Redis ZADD on start (or the ~5ms gap between SQL terminal flip and ZREM on stop). Real `release()` skips and phantom rows persist across multiple cycles, so 3s of grace silences the races without losing coverage of the bug class S-01 was designed to catch. - snapshot exposes `running_started_at` per running id and `slot_scores` per slot member (`time.time()` at ZADD), so the check can age-filter. - S-01 drops `in_sql_only` ids whose `started_at` is within 3s and `in_redis_only` ids whose ZSET score is within 3s. Existing strict behaviour applies for older / unparseable timestamps. - FakeRedis test stub gains `withscores` for ZRANGE. - Three new tests: fresh-sql suppressed, fresh-redis suppressed, durable mismatch still fires. 51/51 canary tests pass. * fix(#411): address vybe PR review — NameError in run-cycle + split alerts (1) NameError on POST /api/canary/run-cycle: handler referenced `valid_ids` which never existed (commit 02cfc97 only fixed the call site, not the two trailing usages). Endpoint crashed on every request. Replace with `requested_ids` (provably valid post-422 check). (2) services/canary_service.py was 692 lines and growing. Extract the Slack alert surface (constants + `_emit_transition` + `_build_slack_payload` + `_render_forensic` + `_format_row_refs` + `_format_last_red` + `_render_message`) into services/canary_alerts.py as `CanaryAlerts`. canary_service.py drops to 347 lines, alerts module is 370 — both well under the 1000-line target. No behaviour change. Tests pivoted to `CanaryAlerts._foo` references; 51/51 pass. * docs(#411): update test docstrings/comments after CanaryAlerts split Stale `_emit_transition` / bare `_build_slack_payload` references in test docstrings and the slack_capture fixture's hint pointed at the old method names. Renaming to `CanaryAlerts.emit_transition` etc. so the hints match the actual code path. No test logic changes. * fix(#411): address /review informational findings (I1/I2/I3) I1 — Webhook URL leak in error logs (services/slack_service.py:309-311): httpx exceptions embed the request URL in their __str__, and for canary alerts that URL IS the credential. The catchall returned `str(e)` which was then logged by `CanaryAlerts.emit_transition`. Sanitize by returning only the exception class name + a generic message; no URL ever lands in Vector logs. I2 — `checks_skipped` always empty (routers/canary.py): `run_invariants` populates `cycle.violations[inv_id]` for every selected id (with `[]` when clean or when a check raised), so the field would always be empty. The actionable signal lives in `sources_unavailable` already. Drop the field from `RunCycleResponse` and simplify `checks_run` to the request's filter list directly. I3 — Severity rank dict duplicated: promote `_severity_rank` → `severity_rank` in canary_alerts and import it into the router instead of redeclaring the literal. 51/51 canary tests pass; live-stack endpoint verified (POST /run-cycle, partial filter, 422 path). * fix(#411): address /review findings (I1/I2/I3 round 2) - I1: architecture.md run-cycle row referenced a non-existent `checks_skipped` field; replaced with `sources_unavailable` and documented the new 409 response. - I2: lock-contention path returned an empty 200 indistinguishable from a real green cycle. CycleResult now carries `skipped: bool` and the router maps it to HTTP 409 `cycle in progress`. - I3: TERMINAL_EXECUTION_STATUSES is now derived from TaskExecutionStatus.{SUCCESS,FAILED,CANCELLED,SKIPPED}.value instead of a hand-maintained string tuple. Tests stub `models` so unit imports don't pull in pydantic/db_models. 51/51 canary unit tests + 13/13 iso_cutoff tests green. * fix(#411): use tz-aware fromisoformat in S-01 grace check `datetime.fromisoformat(ts.rstrip("Z")).timestamp()` stripped the Z suffix, leaving a naive datetime; `.timestamp()` then interprets naive datetimes as local time, so on any host with TZ != UTC the grace-window math was off by the host's UTC offset — producing false-positive S-01 alerts. Python 3.11's `fromisoformat` parses Z natively into `tzinfo=UTC`, so dropping the rstrip restores correctness on any host. Surfaced by /review I1. * docs(security): CSO audit 2026-05-09 — 0 findings (#411 diff) Diff-scoped audit of the canary harness (Phase 1 of #411). Zero findings at the 8/10 daily-mode confidence gate — admin-only API surface, outbound-only Slack with explicit URL-leak mitigation, parameterized SQL with constants-only f-string composition, env-gated activation. Five candidates considered and dropped below the gate (documented in the report).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #378.
Scheduled task executions briefly showed Failed with "Stale execution — slot TTL expired" and then flipped to Success ~14s later after the real agent response landed. Reported on the Paradigm Live instance (2026-04-17).
Root cause: race between cleanup service Phase 0 (batch watchdog query) and Phase 3 (slot cleanup). An agent could drop a just-completed execution from its registry between these two phases, so
confirmed_running_idsmissed it — Phase 3 then wrote FAILED seconds before the agent's in-flight SUCCESS response arrived.Fix: Extract Phase 3 into
_process_stale_slot_reclaims, add a just-in-time per-agent re-verify call (parallelized viaasyncio.gather, mirroring Phase 0) immediately before writing FAILED. On agent unreachable: skip this cycle — Phase 1 (120-min stale cleanup) is the backstop. No cross-cycle deferral state:slot_service.cleanup_stale_slotsremoves reclaimed IDs from Redis permanently (zremrangebyscore), so a deferred ID cannot reappear to be re-checked.Also adds a narrowly-scoped observability WARNING log in
update_execution_statusfor residual races that slip past the primary fix. Pattern-matched to the stale-slot error marker so other legitimate FAILED→SUCCESS transitions (Phase 0 auto-terminate, Phase 1 stale cleanup, startup recovery) don't misfire it. Semantics unchanged — the agent's authoritative SUCCESS still wins.Changes
src/backend/services/cleanup_service.py_process_stale_slot_reclaimswith parallel JIT re-verifysrc/backend/db/schedules.pytests/test_watchdog_unit.pyTestProcessStaleSlotReclaims(8 scenarios)tests/unit/test_schedule_status_observability.pydocs/memory/feature-flows/cleanup-service.mddocs/memory/feature-flows.mdThree logical commits: failing tests → fix → observability log + docs.
Test plan
TestProcessStaleSlotReclaimsunit tests passTestResidualRaceObservabilityLogunit tests passtest_watchdog_unit.pyregression (8 new + 27 existing)test_backlog.pyregression (exercises the sameupdate_execution_statuspath)Manual integration test (from plan):
execution_timeout_seconds = 60(slot TTL = 360s).CLEANUP_INTERVAL_SECONDSto 30 and restart backend so cycles fire during the task.[Cleanup] Skipping <id> ... — re-verification shows still running (#378)and no FAILED row written for this execution.Follow-up
slot_service.cleanup_stale_slotsremoves stale slots from Redis before Phase 3 decides whether to fail. When Phase 3 skips (re-verify says still running, agent unreachable, orconfirmed_running_ids), the slot stays removed while the task remains tracked as running — briefly breachesmax_parallel_tasks. Pre-existing; not fixed here. I'll file a separate issue.Review artifacts
/plan-eng-review(4 sections, 7 decisions) + Codex outside voice (3 critical corrections applied)./Users/andrii/.claude/plans/system-instruction-you-are-working-cheeky-porcupine.md🤖 Generated with Claude Code