Skip to content

fix(cleanup): prevent phantom stale-slot failures via JIT re-verify (#378)#403

Merged
vybe merged 3 commits into
Abilityai:mainfrom
AndriiPasternak31:AndriiPasternak31/issue-378
Apr 19, 2026
Merged

fix(cleanup): prevent phantom stale-slot failures via JIT re-verify (#378)#403
vybe merged 3 commits into
Abilityai:mainfrom
AndriiPasternak31:AndriiPasternak31/issue-378

Conversation

@AndriiPasternak31
Copy link
Copy Markdown
Contributor

Summary

Closes #378.

Scheduled task executions briefly showed Failed with "Stale execution — slot TTL expired" and then flipped to Success ~14s later after the real agent response landed. Reported on the Paradigm Live instance (2026-04-17).

Root cause: race between cleanup service Phase 0 (batch watchdog query) and Phase 3 (slot cleanup). An agent could drop a just-completed execution from its registry between these two phases, so confirmed_running_ids missed it — Phase 3 then wrote FAILED seconds before the agent's in-flight SUCCESS response arrived.

Fix: Extract Phase 3 into _process_stale_slot_reclaims, add a just-in-time per-agent re-verify call (parallelized via asyncio.gather, mirroring Phase 0) immediately before writing FAILED. On agent unreachable: skip this cycle — Phase 1 (120-min stale cleanup) is the backstop. No cross-cycle deferral state: slot_service.cleanup_stale_slots removes reclaimed IDs from Redis permanently (zremrangebyscore), so a deferred ID cannot reappear to be re-checked.

Also adds a narrowly-scoped observability WARNING log in update_execution_status for residual races that slip past the primary fix. Pattern-matched to the stale-slot error marker so other legitimate FAILED→SUCCESS transitions (Phase 0 auto-terminate, Phase 1 stale cleanup, startup recovery) don't misfire it. Semantics unchanged — the agent's authoritative SUCCESS still wins.

Changes

File Change
src/backend/services/cleanup_service.py Extracted _process_stale_slot_reclaims with parallel JIT re-verify
src/backend/db/schedules.py Scoped residual-race observability log
tests/test_watchdog_unit.py New TestProcessStaleSlotReclaims (8 scenarios)
tests/unit/test_schedule_status_observability.py New (4 scenarios)
docs/memory/feature-flows/cleanup-service.md Phase 3 Re-verification section + file summary
docs/memory/feature-flows.md Index row for #378

Three logical commits: failing tests → fix → observability log + docs.

Test plan

  • 8/8 new TestProcessStaleSlotReclaims unit tests pass
  • 4/4 new TestResidualRaceObservabilityLog unit tests pass
  • 35/35 test_watchdog_unit.py regression (8 new + 27 existing)
  • 29/29 test_backlog.py regression (exercises the same update_execution_status path)
  • Ruff clean on changed lines (2 pre-existing errors unrelated)
  • Mypy: zero new errors (193 baseline = 193 after)

Manual integration test (from plan):

  1. Create agent with execution_timeout_seconds = 60 (slot TTL = 360s).
  2. Trigger a task that sleeps ~55s on the agent.
  3. Temporarily drop CLEANUP_INTERVAL_SECONDS to 30 and restart backend so cycles fire during the task.
  4. Briefly block agent egress to delay SUCCESS response past a cleanup cycle.
  5. Expect backend logs show [Cleanup] Skipping <id> ... — re-verification shows still running (#378) and no FAILED row written for this execution.

Follow-up

  • Out-of-scope capacity-invariant gap (Codex Point 5): slot_service.cleanup_stale_slots removes stale slots from Redis before Phase 3 decides whether to fail. When Phase 3 skips (re-verify says still running, agent unreachable, or confirmed_running_ids), the slot stays removed while the task remains tracked as running — briefly breaches max_parallel_tasks. Pre-existing; not fixed here. I'll file a separate issue.

Review artifacts

  • Plan: reviewed via /plan-eng-review (4 sections, 7 decisions) + Codex outside voice (3 critical corrections applied).
  • Plan file: /Users/andrii/.claude/plans/system-instruction-you-are-working-cheeky-porcupine.md

🤖 Generated with Claude Code

Andrii Pasternak and others added 3 commits April 19, 2026 10:04
Adds TestProcessStaleSlotReclaims (8 scenarios) to test_watchdog_unit.py
and tests/unit/test_schedule_status_observability.py (4 scenarios).
Tests reproduce the race where cleanup service's Phase 3 marks executions
FAILED with "Stale execution — slot TTL expired" while the task is still
running on the agent, and the subsequent SUCCESS write overwrites FAILED.

These tests fail against the current code and pass after the fix + log
that follow in subsequent commits.

Refs Abilityai#378

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#378)

Phase 3 (slot cleanup) previously marked reclaimed executions FAILED
using only Phase 0's `confirmed_running_ids` snapshot. Between Phase 0's
batch query and Phase 3's decision, an agent could drop a just-completed
execution from its registry while the corresponding SUCCESS response was
still in flight to the backend — Phase 3 would then write FAILED and be
overwritten by the late SUCCESS, producing a phantom failure flash.

This extracts the Phase 3 loop into `_process_stale_slot_reclaims`
(mirrors the testability pattern of `_reconcile_orphaned_executions`)
and adds a just-in-time re-verify call to each agent immediately before
writing FAILED. Per-agent fan-out via `asyncio.gather` keeps worst-case
Phase 3 wall-time at O(5s) regardless of fleet size (same pattern as
Phase 0 at cleanup_service.py:258-263).

On agent unreachable during re-verify: skip this cycle. No cross-cycle
deferral state — `slot_service.cleanup_stale_slots` removes reclaimed
IDs from Redis permanently (zremrangebyscore), so a deferred ID would
never reappear to be re-checked. Transient unreachability is caught by
Phase 0's orphan recovery on later cycles; truly stuck agents by Phase 1
(120-min stale cleanup).

Refs Abilityai#378

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…docs

Emit a WARNING log in `update_execution_status` when a SUCCESS write
overwrites a row whose error matches the Phase-3 phantom-stale marker
("Stale execution — slot TTL expired"). Gives production signal if the
primary fix misses a corner case, without changing update semantics —
the agent's authoritative SUCCESS still wins.

The pattern match is narrow on purpose: other legitimate cleanup paths
(Phase 0 auto-terminate, Phase 1 stale cleanup, startup recovery) also
write FAILED via unguarded `update_execution_status`. An unscoped log
would misattribute those transitions to Abilityai#378.

Grep production logs with:
  docker logs trinity-backend | grep "residual race condition (Abilityai#378)"

Updates `docs/memory/feature-flows/cleanup-service.md` with a new
"Phase 3 Slot Reclaim Re-verification (Issue Abilityai#378)" section and adds
an index row to `docs/memory/feature-flows.md`.

Refs Abilityai#378

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vybe vybe force-pushed the AndriiPasternak31/issue-378 branch from 46f9ee1 to 5a00650 Compare April 19, 2026 09:04
Copy link
Copy Markdown
Contributor

@vybe vybe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved via /validate-pr — docs, tests, security all clean. Rebased to resolve feature-flows.md conflict with #402.

@vybe vybe merged commit fcbf55c into Abilityai:main Apr 19, 2026
vybe added a commit that referenced this pull request Apr 19, 2026
… design

Catalogs ~55 invariants across 14 subsystems (executions, slots, backlog,
activities, agent lifecycle, subprocess hygiene, channels, etc.) with tier
(always vs. eventually-consistent), severity, and queryable violation
signals. Proposes a continuous canary harness to verify them on staging.

Grounded in recent fixes: #378/#403 (phantom stale slots), #407 (subprocess
reaping), #129 (orphaned executions), #219/#226 (slot TTL mismatch). Tracked
in #411.
@AndriiPasternak31 AndriiPasternak31 deleted the AndriiPasternak31/issue-378 branch April 25, 2026 17:59
obasilakis added a commit to obasilakis/trinity that referenced this pull request May 4, 2026
Implements the canary harness from Abilityai#411 that catches the bug class behind
PRs Abilityai#378, Abilityai#403, Abilityai#129, Abilityai#226 — race conditions and cross-component state
drift between Redis × SQLite × agent registries that unit tests miss.

Architecture (per design discussion in PR comments):
- Deterministic library (src/backend/canary/) — pure-function checks
  shared between the watcher and the on-demand admin endpoint. No LLM
  reasoning anywhere — same snapshot in, same violations out.
- Watcher service (services/canary_service.py) — 5-min APScheduler-style
  loop modeled on cleanup_service.py. Disabled by default; enable on
  staging/dev with CANARY_ENABLED=1.
- Fleet manifest (config/canary-fleet.yaml) — synthetic load via the
  existing /api/systems/deploy mechanism. Without traffic the harness
  reports trivially-green cycles.

Phase 1 invariants:
- S-01 (slot–row bijection) — Redis ZRANGE vs SQL running rows; drain
  sentinels filtered. Critical severity.
- E-02 (no phantom reversal) — terminal executions stay terminal.
  Phase 1 uses Redis-backed state comparison instead of Vector log diff
  for simplicity. Critical severity.
- L-03 (delete cascades) — orphan-row scan across cross-cutting tables
  (agent_sharing, agent_schedules, schedule_executions [non-terminal],
  agent_skills, agent_tags, agent_shared_files, agent_public_links,
  pending operator_queue / access_requests, agent-scoped mcp_api_keys,
  active chat_sessions) plus orphan agent:slots:* Redis key scan.
  Critical for active-orchestration orphans, major otherwise.

Persistence + API:
- canary_violations table (migration 34, schema in db/schema.py).
- GET /api/canary/violations + /violations/stats + /violations/{id} —
  admin-only read paths over the persisted history.
- POST /api/canary/run-cycle — admin-only on-demand trigger; delegates
  to the same CanaryService.run_cycle() the background loop calls.

Notifications:
- One green→red transition emits one notification via
  db.create_notification(notification_type='alert', category='canary').
- Severity → priority: critical → urgent, major → high, minor → normal.
- Continuing-red cycles do not re-notify; the row in canary_violations
  is the trend signal, the bell is the "now" signal.

Tests:
- 27 unit tests in tests/test_canary_invariants.py covering
  CanaryOperations CRUD/filters/stats, snapshot collector behaviour,
  each invariant's holds-and-fires cases, the L-03 critical/major
  severity split, and the runner registry.

Smoke-test recipe (Option 1 — orphan SQL row, no code changes):

  # 1. Set CANARY_ENABLED=1 and restart backend (creates canary_violations
  #    table via migration; starts the watcher).
  # 2. Deploy the fleet:
  curl -X POST -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(jq -Rs '{manifest: .}' < config/canary-fleet.yaml)" \
    http://localhost:8000/api/systems/deploy
  # 3. Baseline cycle:
  curl -X POST -H "Authorization: Bearer $TOKEN" \
    http://localhost:8000/api/canary/run-cycle | jq '.violations, .transitions'
  # 4. Inject one orphan row:
  sqlite3 ~/trinity-data/trinity.db \
    "INSERT INTO agent_sharing (agent_name, shared_with_email, shared_by_id, created_at) \
     VALUES ('ghost-canary-test', 'smoketest@example.com', '<admin-id>', datetime('now'));"
  # 5. Trigger again — expect L-03 in violations + transitions:
  curl -X POST -H "Authorization: Bearer $TOKEN" \
    http://localhost:8000/api/canary/run-cycle | jq '.violations[0], .transitions'
  # 6. Verify in UI: bell shows new 'canary' notification (severity major).
  # 7. Cleanup:
  sqlite3 ~/trinity-data/trinity.db \
    "DELETE FROM agent_sharing WHERE agent_name='ghost-canary-test';"

Closes part of Abilityai#411 — Phase 1 acceptance criteria (S-01, E-02, L-03
running on staging continuously, table + read endpoint shipped). Phase 2
adds S-02, S-03, E-01, E-05, E-06, B-01, B-02, G-01, R-01 incrementally
without changes to the watcher or API surface.
vybe pushed a commit that referenced this pull request May 10, 2026
* feat(#411): canary invariant harness Phase 1 (S-01, E-02, L-03)

Implements the canary harness from #411 that catches the bug class behind
PRs #378, #403, #129, #226 — race conditions and cross-component state
drift between Redis × SQLite × agent registries that unit tests miss.

Architecture (per design discussion in PR comments):
- Deterministic library (src/backend/canary/) — pure-function checks
  shared between the watcher and the on-demand admin endpoint. No LLM
  reasoning anywhere — same snapshot in, same violations out.
- Watcher service (services/canary_service.py) — 5-min APScheduler-style
  loop modeled on cleanup_service.py. Disabled by default; enable on
  staging/dev with CANARY_ENABLED=1.
- Fleet manifest (config/canary-fleet.yaml) — synthetic load via the
  existing /api/systems/deploy mechanism. Without traffic the harness
  reports trivially-green cycles.

Phase 1 invariants:
- S-01 (slot–row bijection) — Redis ZRANGE vs SQL running rows; drain
  sentinels filtered. Critical severity.
- E-02 (no phantom reversal) — terminal executions stay terminal.
  Phase 1 uses Redis-backed state comparison instead of Vector log diff
  for simplicity. Critical severity.
- L-03 (delete cascades) — orphan-row scan across cross-cutting tables
  (agent_sharing, agent_schedules, schedule_executions [non-terminal],
  agent_skills, agent_tags, agent_shared_files, agent_public_links,
  pending operator_queue / access_requests, agent-scoped mcp_api_keys,
  active chat_sessions) plus orphan agent:slots:* Redis key scan.
  Critical for active-orchestration orphans, major otherwise.

Persistence + API:
- canary_violations table (migration 34, schema in db/schema.py).
- GET /api/canary/violations + /violations/stats + /violations/{id} —
  admin-only read paths over the persisted history.
- POST /api/canary/run-cycle — admin-only on-demand trigger; delegates
  to the same CanaryService.run_cycle() the background loop calls.

Notifications:
- One green→red transition emits one notification via
  db.create_notification(notification_type='alert', category='canary').
- Severity → priority: critical → urgent, major → high, minor → normal.
- Continuing-red cycles do not re-notify; the row in canary_violations
  is the trend signal, the bell is the "now" signal.

Tests:
- 27 unit tests in tests/test_canary_invariants.py covering
  CanaryOperations CRUD/filters/stats, snapshot collector behaviour,
  each invariant's holds-and-fires cases, the L-03 critical/major
  severity split, and the runner registry.

Smoke-test recipe (Option 1 — orphan SQL row, no code changes):

  # 1. Set CANARY_ENABLED=1 and restart backend (creates canary_violations
  #    table via migration; starts the watcher).
  # 2. Deploy the fleet:
  curl -X POST -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "$(jq -Rs '{manifest: .}' < config/canary-fleet.yaml)" \
    http://localhost:8000/api/systems/deploy
  # 3. Baseline cycle:
  curl -X POST -H "Authorization: Bearer $TOKEN" \
    http://localhost:8000/api/canary/run-cycle | jq '.violations, .transitions'
  # 4. Inject one orphan row:
  sqlite3 ~/trinity-data/trinity.db \
    "INSERT INTO agent_sharing (agent_name, shared_with_email, shared_by_id, created_at) \
     VALUES ('ghost-canary-test', 'smoketest@example.com', '<admin-id>', datetime('now'));"
  # 5. Trigger again — expect L-03 in violations + transitions:
  curl -X POST -H "Authorization: Bearer $TOKEN" \
    http://localhost:8000/api/canary/run-cycle | jq '.violations[0], .transitions'
  # 6. Verify in UI: bell shows new 'canary' notification (severity major).
  # 7. Cleanup:
  sqlite3 ~/trinity-data/trinity.db \
    "DELETE FROM agent_sharing WHERE agent_name='ghost-canary-test';"

Closes part of #411 — Phase 1 acceptance criteria (S-01, E-02, L-03
running on staging continuously, table + read endpoint shipped). Phase 2
adds S-02, S-03, E-01, E-05, E-06, B-01, B-02, G-01, R-01 incrementally
without changes to the watcher or API surface.

* fix(#411): canary harness demo-driven fixes

Six fixes from the live end-to-end smoke test on staging-equivalent
local stack — all surfaced by walking the L-03 + S-01 demo recipes:

1. **`CANARY_ENABLED` env passthrough** (`docker-compose.yml`).
   The flag was set in `.env` but never reached the backend container,
   so the watcher silently no-op'd on staging. Added explicit
   `${CANARY_ENABLED:-0}` mapping.

2. **`canary-fleet.yaml` CPU fix** (`config/canary-fleet.yaml`).
   `cpu: "0.5"` failed `int()` in Trinity's resource parser. Bumped
   `burst` to `cpu: "1"` matching the platform's whole-number
   convention.

3. **Notifications API admin bypass** (`routers/notifications.py`).
   `agent_notifications.agent_name` for canary entries is the
   synthetic `canary-harness` (no Docker container). The existing
   per-agent access filter excluded it from `GET /api/notifications`
   so the bell stayed empty. Admins now skip the per-agent filter
   entirely; non-admins keep the existing semantics.

4. **Drop redundant timestamp from notification body**
   (`services/canary_service.py:_render_message`). The notification
   panel renders `created_at` as "just now / 4m ago" and the
   precise ISO `snapshot_time` is in `metadata.snapshot_time`.
   Embedding the full ISO microsecond suffix in the message text was
   pure clutter.

5. **Fix transition detection** (`services/canary_service.py`).
   Old rule "previous violation predates this snapshot → transition"
   fired on *every* continuing-red cycle, since prior violations
   are always older than the current snapshot. Replaced with a
   Redis-backed previous-cycle cursor (`canary:last_cycle_at`):
   transition only fires when the latest stored violation predates
   the *previous cycle's* snapshot, i.e. the previous cycle was
   green for that invariant. Handles first-ever / continuing-red /
   red→green→red correctly; verified live against L-03 (one
   notification across two consecutive red cycles).

6. **Run-cycle endpoint reflects what the service emitted**
   (`routers/canary.py`). Replaced the router's duplicate
   transition-detection algorithm with `CycleResult.
   transition_invariant_ids` returned from
   `CanaryService.run_cycle()`. The endpoint and the bell can no
   longer disagree about which invariants fired.

Plus a `snapshot.py` module-docstring expansion explaining why state
collection is its own module separate from the invariant checks
(consistency-per-cycle, no duplicated query code, test-friendliness,
explicit non-atomicity across Redis × SQLite).

* fix(#411): canary terminal status set must match TaskExecutionStatus enum

Surfaced by the live E-02 demo: cycle 1 was supposed to seed
`canary:e02:terminal_seen` from a freshly-completed fleet task but
the Redis hash stayed empty, so the simulated reversal in cycle 2
went undetected.

Root cause: `TERMINAL_EXECUTION_STATUSES` in `canary/snapshot.py`
listed `("completed", "failed", "cancelled", "timeout")` but
Trinity's actual terminal set (per `models.TaskExecutionStatus` and
the CAS state machine in PR #524) is
`("success", "failed", "cancelled", "skipped")`. Two of the four
values were wrong:

  - `completed` is never written — `success` is the canonical name.
  - `timeout` is not a status; timed-out executions land in `failed`.
  - `skipped` (capacity-overflow path, SCHED-COND-001) was missing.

Effects of the bug:

  1. **E-02 silently broken on production data.** Fleet executions
     finishing with `status='success'` were not added to
     `terminal_exec_ids`, so the next cycle's reversal-detection
     comparison set was always empty. A real terminal→running flip
     would have gone undetected indefinitely.
  2. **L-03 orphan-scan filter wrong.** The `schedule_executions`
     filter used the same hardcoded list, so legitimate-terminal
     `success`/`skipped` rows were treated as non-terminal and
     scanned for orphans on every cycle. Harmless when no orphans
     exist, but would have generated false-positive L-03 hits in
     ordinary recovery scenarios.

Fix:

  - Replace the constant with `("success", "failed", "cancelled",
    "skipped")` matching `TaskExecutionStatus`.
  - Build the L-03 orphan-scan SQL filter from the constant
    (`f"status NOT IN ({_TERMINAL_SQL_LIST})"`) so the two stay
    in lockstep.
  - Drop the duplicate constant declaration further down the file.
  - Update e02 docstring + test fixtures (`'completed'` → `'success'`).

Verified locally:

  - `pytest tests/test_canary_invariants.py` — 27/27 passing.
  - End-to-end E-02 demo: cycle 1 seeds `canary:e02:terminal_seen`
    (HEXISTS=1), cycle 2 fires after `UPDATE schedule_executions
    SET status='running'`, cycle 3 stays silent (continuing-red),
    fix-step reverts the row → cycle 4 green. Exactly 1 notification.

This was a real defect waiting to ship — caught only because the
demo went all the way through to a real fleet task. Argues for an
integration-test layer in a follow-up PR (per `test_systems.py`-
style harness discussed in the original PR thread).

* fix(#411): use iso_cutoff() for E-02 terminal-row window (invariant #16)

_collect_terminal_executions filtered with `datetime('now', '-N minutes')`
against `completed_at` (written via `utc_now_iso()` → ISO-Z). The two
formats diverge at the date/time separator (T vs space), so lexicographic
compare let every same-day terminal row through regardless of age — the
exact failure invariant #16 / issue #476 was written about.

Consequence: on busy installs the canary:e02:terminal_seen Redis hash
grew unbounded until PREV_TERMINAL_MAX nuked the prev set, leaving E-02
with nothing to diff against on the following cycle.

- Extend iso_cutoff() with a keyword-only `minutes=` parameter for
  sub-hour windows. Existing positional callers unaffected.
- Switch snapshot._collect_terminal_executions to iso_cutoff(minutes=...).
- Pin the new signature with TestMinutesKwarg in test_iso_cutoff.py
  (incl. a TypeError check on `iso_cutoff(2, 30)` so a future refactor
  can't quietly turn that into "2h30m").

* perf(#411): drop N+1 in run-cycle, thread row ids through CycleResult

The router's re-fetch loop ran list_canary_violations once per
violation (not per invariant), repeatedly querying the same
(invariant_id, snapshot_time) window and matching rows back to
violations by signal_query — both wasteful and silently wrong when
two violations share a signal_query.

The service already had the row id from insert_canary_violation
and threw it away. Thread it through CycleResult.persisted_violation_ids
(index-aligned with .violations, None on insert failure) so the
router maps directly to row ids without touching the DB.

* test(#411): cover CanaryService.run_cycle orchestration

The 27-test suite hit DB CRUD and the pure invariants but never
exercised the orchestrator — which is where the demo-driven bugs
lived (e7c11b2 transition spam, ef40cf9 terminal-status typo).
Both shipped past the unit suite and were caught only by hand-driven
demo runs.

Adds a TestCanaryService class with four end-to-end tests over
run_cycle() using the existing FakeRedis + temp SQLite, plus a
canary_service fixture that stubs `database` / `db_models` and
records create_notification calls so transitions can be asserted
directly:

  - test_first_cycle_violation_fires_one_notification — baseline
  - test_continuing_red_does_not_re_fire — regression for e7c11b2
  - test_red_green_red_fires_twice — covers the re-arm branch
  - test_terminal_status_set_seeds_e02_side_table — regression for
    ef40cf9 (would fail against the pre-fix tuple)

Also adds string get/set ops to FakeRedis so the previous-cycle
cursor (canary:last_cycle_at) round-trips. 31/31 tests pass.

* feat(#411): stub canary alert sink — drop dashboard notifications

Per product call: green→red transitions no longer write
agent_notifications rows or broadcast over WebSocket. Slack lands in
a follow-up PR and plugs into the same _emit_transition seam.

- _emit_transition now calls a no-op _notify_transition_stub (logs only)
- _render_message / _broadcast / _severity_rank / ws-manager hooks kept
  in place so the next PR is a single-file change
- Tests assert on transition_invariant_ids / cumulative_transitions and
  pin notification_calls == [] to catch regressions
- architecture.md notes Phase 1 has no alert sink wired

* docs(#411): drop "bell" shorthand from canary comments

The dashboard surface is the Operating Room notifications panel, not
a literal navbar bell — comments and docstrings now say "alert sink"
or "panel" so a future reader doesn't go hunting for a bell icon.

* fix(#411): point canary fleet at test-echo template (local:default doesn't exist)

The manifest referenced `local:default`, which has no template directory under
config/agent-templates/. Agent creation silently skips template-volume mounting
on missing paths (services/agent_service/crud.py:341), so the fleet members
were spinning up from the bare base image with no CLAUDE.md — verified with
`docker exec agent-canary-fleet-burst cat /home/developer/CLAUDE.md` returning
"No such file".

Switched to `local:test-echo`: minimal stock template, no MCP/credentials,
deterministic short replies — exactly what S-01/E-02 slot-churn invariants
need. Schedule messages simplified accordingly.

Verified end-to-end on local: deleted broken containers, redeployed via
POST /api/systems/deploy, agents came up with templates, autonomy enabled,
schedules fired, real-flow ZREM regression in slot_service.py produced
S-01 critical violation captured by both the 5-min watcher (transitions=1)
and on-demand POST /api/canary/run-cycle. Recovery path returned to green.

* docs(#411): sync requirements.md §31.1 with stubbed alert sink

The Notifications bullet still described the rejected dashboard path
(db.create_notification → agent_notifications) after 6e1bb94 stubbed
the alert sink. Replaced with an Alert sink bullet noting the no-op
_notify_transition_stub and the preserved _emit_transition seam for
the follow-up Slack PR. Architecture.md and requirements.md now agree.

* feat(#411): wire canary alert sink to Slack via incoming webhook

- _emit_transition POSTs Block Kit payload via slack_service.post_webhook
- payload: severity emoji header + friendly invariant name, summary,
  per-invariant forensic detail, runbook hint, context line with
  snapshot_time + violation row #N + "last red Xm ago" badge
- CANARY_SLACK_WEBHOOK_URL env var (unset = silent sink)
- previous_violation_at now populated through CycleResult (was always None)
- removes dead WS-broadcast scaffolding from the dropped Phase 1 path

* revert(#411): drop admin notifications bypass

Justification was canary harness notifications under a synthetic agent
name, but the alert sink was rerouted to Slack (21b833d) and never
writes to the notifications table. Reverting restores ownership-scoped
filtering for admins — no live caller depends on the bypass.

* fix(#411): surface snapshot sources_unavailable in run-cycle response

Thread snapshot.sources_unavailable through CycleResult so the on-demand
run-cycle endpoint reflects degraded Redis/SQLite reads instead of always
returning []. The field's whole purpose is to expose partial-state cycles
to the operator running the smoke test.

* docs(#411): sweep stale canary agent/skill references

Pre-pivot vocabulary referenced a canary agent template invoking
mcp__trinity__send_notification on a per-agent loop. The harness has
been a backend service with a Slack webhook sink for several commits;
update the docstrings to match.

* docs(#411): sync migrations.py header to all 57 entries

Header docstring stopped at #34 while MIGRATIONS tuple has 57 entries.
Pre-existing drift; the canary migration added to it.

* fix(#411): preserve real terminal status in E-02 reversal alerts

The E-02 phantom-reversal side-table stored the literal string
"terminal" against every id, so once the Slack sink landed its
forensic line printed "terminal → running" — useless to on-call.
The snapshot now carries terminal_exec_statuses: dict[eid, status]
and E-02 maintains a parallel canary:e02:terminal_status hash keyed
alongside the canary:e02:terminal_seen ZSET. The real prior status
(success / failed / cancelled / skipped) is read back when a
reversal fires and rendered into both observed_state and the
signal_query line. Aged-out ids are HDEL'd in the same sweep that
ZREMRANGEBYSCORE drops them so the two structures stay in sync.

Bundled side-table refactor: HASH + hard 5000-row DEL-on-overflow
becomes ZSET trimmed by ZREMRANGEBYSCORE on
PREV_TERMINAL_RETENTION_SECONDS (1h). The hard-reset trim left a
one-cycle E-02 blind spot every time the cap crossed on busy installs.

Tests: extends FakeRedis with zrangebyscore / hmget / hdel; asserts
the existing reversal test now reports previous_status="success",
adds a new test that round-trips all four terminal statuses through
seed→reverse, and a regression test for the age-based trim invariant.

* refactor(#411): drop redundant valid_ids filter in run-cycle endpoint

After the 422 raise on invalid_ids, requested_ids is provably all-valid,
so re-filtering against INVARIANTS was dead code. Pass requested_ids
directly to run_cycle().

* docs(#411): correct operator in _is_green_to_red comment

Comment said `>=` but the code uses `<`.

* docs(#411): correct phase label in canary alert sink comment

CANARY-001 alert sink ships in Phase 1, not Phase 2.

* fix(#411): grace window in S-01 to suppress slot/row race false positives

The bijection check fired on a transient mismatch caught by a snapshot
landing in the ~30ms gap between the SQL `status='running'` commit and
the Redis ZADD on start (or the ~5ms gap between SQL terminal flip and
ZREM on stop). Real `release()` skips and phantom rows persist across
multiple cycles, so 3s of grace silences the races without losing
coverage of the bug class S-01 was designed to catch.

- snapshot exposes `running_started_at` per running id and `slot_scores`
  per slot member (`time.time()` at ZADD), so the check can age-filter.
- S-01 drops `in_sql_only` ids whose `started_at` is within 3s and
  `in_redis_only` ids whose ZSET score is within 3s. Existing strict
  behaviour applies for older / unparseable timestamps.
- FakeRedis test stub gains `withscores` for ZRANGE.
- Three new tests: fresh-sql suppressed, fresh-redis suppressed,
  durable mismatch still fires.

51/51 canary tests pass.

* fix(#411): address vybe PR review — NameError in run-cycle + split alerts

(1) NameError on POST /api/canary/run-cycle: handler referenced `valid_ids`
which never existed (commit 02cfc97 only fixed the call site, not the
two trailing usages). Endpoint crashed on every request. Replace with
`requested_ids` (provably valid post-422 check).

(2) services/canary_service.py was 692 lines and growing. Extract the
Slack alert surface (constants + `_emit_transition` + `_build_slack_payload`
+ `_render_forensic` + `_format_row_refs` + `_format_last_red` +
`_render_message`) into services/canary_alerts.py as `CanaryAlerts`.
canary_service.py drops to 347 lines, alerts module is 370 — both well
under the 1000-line target. No behaviour change.

Tests pivoted to `CanaryAlerts._foo` references; 51/51 pass.

* docs(#411): update test docstrings/comments after CanaryAlerts split

Stale `_emit_transition` / bare `_build_slack_payload` references in
test docstrings and the slack_capture fixture's hint pointed at the
old method names. Renaming to `CanaryAlerts.emit_transition` etc. so
the hints match the actual code path. No test logic changes.

* fix(#411): address /review informational findings (I1/I2/I3)

I1 — Webhook URL leak in error logs (services/slack_service.py:309-311):
httpx exceptions embed the request URL in their __str__, and for canary
alerts that URL IS the credential. The catchall returned `str(e)` which
was then logged by `CanaryAlerts.emit_transition`. Sanitize by returning
only the exception class name + a generic message; no URL ever lands in
Vector logs.

I2 — `checks_skipped` always empty (routers/canary.py): `run_invariants`
populates `cycle.violations[inv_id]` for every selected id (with `[]`
when clean or when a check raised), so the field would always be empty.
The actionable signal lives in `sources_unavailable` already. Drop the
field from `RunCycleResponse` and simplify `checks_run` to the request's
filter list directly.

I3 — Severity rank dict duplicated: promote `_severity_rank` →
`severity_rank` in canary_alerts and import it into the router instead
of redeclaring the literal.

51/51 canary tests pass; live-stack endpoint verified
(POST /run-cycle, partial filter, 422 path).

* fix(#411): address /review findings (I1/I2/I3 round 2)

- I1: architecture.md run-cycle row referenced a non-existent
  `checks_skipped` field; replaced with `sources_unavailable` and
  documented the new 409 response.
- I2: lock-contention path returned an empty 200 indistinguishable
  from a real green cycle. CycleResult now carries `skipped: bool`
  and the router maps it to HTTP 409 `cycle in progress`.
- I3: TERMINAL_EXECUTION_STATUSES is now derived from
  TaskExecutionStatus.{SUCCESS,FAILED,CANCELLED,SKIPPED}.value
  instead of a hand-maintained string tuple. Tests stub `models`
  so unit imports don't pull in pydantic/db_models.

51/51 canary unit tests + 13/13 iso_cutoff tests green.

* fix(#411): use tz-aware fromisoformat in S-01 grace check

`datetime.fromisoformat(ts.rstrip("Z")).timestamp()` stripped the Z
suffix, leaving a naive datetime; `.timestamp()` then interprets naive
datetimes as local time, so on any host with TZ != UTC the grace-window
math was off by the host's UTC offset — producing false-positive S-01
alerts. Python 3.11's `fromisoformat` parses Z natively into
`tzinfo=UTC`, so dropping the rstrip restores correctness on any host.

Surfaced by /review I1.

* docs(security): CSO audit 2026-05-09 — 0 findings (#411 diff)

Diff-scoped audit of the canary harness (Phase 1 of #411). Zero
findings at the 8/10 daily-mode confidence gate — admin-only API
surface, outbound-only Slack with explicit URL-leak mitigation,
parameterized SQL with constants-only f-string composition, env-gated
activation. Five candidates considered and dropped below the gate
(documented in the report).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Execution shows Failed then Success after cleanup (race condition)

2 participants