feat(#411): canary invariant harness Phase 1 (S-01, E-02, L-03)#653
Conversation
Implements the canary harness from Abilityai#411 that catches the bug class behind PRs Abilityai#378, Abilityai#403, Abilityai#129, Abilityai#226 — race conditions and cross-component state drift between Redis × SQLite × agent registries that unit tests miss. Architecture (per design discussion in PR comments): - Deterministic library (src/backend/canary/) — pure-function checks shared between the watcher and the on-demand admin endpoint. No LLM reasoning anywhere — same snapshot in, same violations out. - Watcher service (services/canary_service.py) — 5-min APScheduler-style loop modeled on cleanup_service.py. Disabled by default; enable on staging/dev with CANARY_ENABLED=1. - Fleet manifest (config/canary-fleet.yaml) — synthetic load via the existing /api/systems/deploy mechanism. Without traffic the harness reports trivially-green cycles. Phase 1 invariants: - S-01 (slot–row bijection) — Redis ZRANGE vs SQL running rows; drain sentinels filtered. Critical severity. - E-02 (no phantom reversal) — terminal executions stay terminal. Phase 1 uses Redis-backed state comparison instead of Vector log diff for simplicity. Critical severity. - L-03 (delete cascades) — orphan-row scan across cross-cutting tables (agent_sharing, agent_schedules, schedule_executions [non-terminal], agent_skills, agent_tags, agent_shared_files, agent_public_links, pending operator_queue / access_requests, agent-scoped mcp_api_keys, active chat_sessions) plus orphan agent:slots:* Redis key scan. Critical for active-orchestration orphans, major otherwise. Persistence + API: - canary_violations table (migration 34, schema in db/schema.py). - GET /api/canary/violations + /violations/stats + /violations/{id} — admin-only read paths over the persisted history. - POST /api/canary/run-cycle — admin-only on-demand trigger; delegates to the same CanaryService.run_cycle() the background loop calls. Notifications: - One green→red transition emits one notification via db.create_notification(notification_type='alert', category='canary'). - Severity → priority: critical → urgent, major → high, minor → normal. - Continuing-red cycles do not re-notify; the row in canary_violations is the trend signal, the bell is the "now" signal. Tests: - 27 unit tests in tests/test_canary_invariants.py covering CanaryOperations CRUD/filters/stats, snapshot collector behaviour, each invariant's holds-and-fires cases, the L-03 critical/major severity split, and the runner registry. Smoke-test recipe (Option 1 — orphan SQL row, no code changes): # 1. Set CANARY_ENABLED=1 and restart backend (creates canary_violations # table via migration; starts the watcher). # 2. Deploy the fleet: curl -X POST -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d "$(jq -Rs '{manifest: .}' < config/canary-fleet.yaml)" \ http://localhost:8000/api/systems/deploy # 3. Baseline cycle: curl -X POST -H "Authorization: Bearer $TOKEN" \ http://localhost:8000/api/canary/run-cycle | jq '.violations, .transitions' # 4. Inject one orphan row: sqlite3 ~/trinity-data/trinity.db \ "INSERT INTO agent_sharing (agent_name, shared_with_email, shared_by_id, created_at) \ VALUES ('ghost-canary-test', 'smoketest@example.com', '<admin-id>', datetime('now'));" # 5. Trigger again — expect L-03 in violations + transitions: curl -X POST -H "Authorization: Bearer $TOKEN" \ http://localhost:8000/api/canary/run-cycle | jq '.violations[0], .transitions' # 6. Verify in UI: bell shows new 'canary' notification (severity major). # 7. Cleanup: sqlite3 ~/trinity-data/trinity.db \ "DELETE FROM agent_sharing WHERE agent_name='ghost-canary-test';" Closes part of Abilityai#411 — Phase 1 acceptance criteria (S-01, E-02, L-03 running on staging continuously, table + read endpoint shipped). Phase 2 adds S-02, S-03, E-01, E-05, E-06, B-01, B-02, G-01, R-01 incrementally without changes to the watcher or API surface.
Six fixes from the live end-to-end smoke test on staging-equivalent
local stack — all surfaced by walking the L-03 + S-01 demo recipes:
1. **`CANARY_ENABLED` env passthrough** (`docker-compose.yml`).
The flag was set in `.env` but never reached the backend container,
so the watcher silently no-op'd on staging. Added explicit
`${CANARY_ENABLED:-0}` mapping.
2. **`canary-fleet.yaml` CPU fix** (`config/canary-fleet.yaml`).
`cpu: "0.5"` failed `int()` in Trinity's resource parser. Bumped
`burst` to `cpu: "1"` matching the platform's whole-number
convention.
3. **Notifications API admin bypass** (`routers/notifications.py`).
`agent_notifications.agent_name` for canary entries is the
synthetic `canary-harness` (no Docker container). The existing
per-agent access filter excluded it from `GET /api/notifications`
so the bell stayed empty. Admins now skip the per-agent filter
entirely; non-admins keep the existing semantics.
4. **Drop redundant timestamp from notification body**
(`services/canary_service.py:_render_message`). The notification
panel renders `created_at` as "just now / 4m ago" and the
precise ISO `snapshot_time` is in `metadata.snapshot_time`.
Embedding the full ISO microsecond suffix in the message text was
pure clutter.
5. **Fix transition detection** (`services/canary_service.py`).
Old rule "previous violation predates this snapshot → transition"
fired on *every* continuing-red cycle, since prior violations
are always older than the current snapshot. Replaced with a
Redis-backed previous-cycle cursor (`canary:last_cycle_at`):
transition only fires when the latest stored violation predates
the *previous cycle's* snapshot, i.e. the previous cycle was
green for that invariant. Handles first-ever / continuing-red /
red→green→red correctly; verified live against L-03 (one
notification across two consecutive red cycles).
6. **Run-cycle endpoint reflects what the service emitted**
(`routers/canary.py`). Replaced the router's duplicate
transition-detection algorithm with `CycleResult.
transition_invariant_ids` returned from
`CanaryService.run_cycle()`. The endpoint and the bell can no
longer disagree about which invariants fired.
Plus a `snapshot.py` module-docstring expansion explaining why state
collection is its own module separate from the invariant checks
(consistency-per-cycle, no duplicated query code, test-friendliness,
explicit non-atomicity across Redis × SQLite).
…onStatus enum
Surfaced by the live E-02 demo: cycle 1 was supposed to seed
`canary:e02:terminal_seen` from a freshly-completed fleet task but
the Redis hash stayed empty, so the simulated reversal in cycle 2
went undetected.
Root cause: `TERMINAL_EXECUTION_STATUSES` in `canary/snapshot.py`
listed `("completed", "failed", "cancelled", "timeout")` but
Trinity's actual terminal set (per `models.TaskExecutionStatus` and
the CAS state machine in PR Abilityai#524) is
`("success", "failed", "cancelled", "skipped")`. Two of the four
values were wrong:
- `completed` is never written — `success` is the canonical name.
- `timeout` is not a status; timed-out executions land in `failed`.
- `skipped` (capacity-overflow path, SCHED-COND-001) was missing.
Effects of the bug:
1. **E-02 silently broken on production data.** Fleet executions
finishing with `status='success'` were not added to
`terminal_exec_ids`, so the next cycle's reversal-detection
comparison set was always empty. A real terminal→running flip
would have gone undetected indefinitely.
2. **L-03 orphan-scan filter wrong.** The `schedule_executions`
filter used the same hardcoded list, so legitimate-terminal
`success`/`skipped` rows were treated as non-terminal and
scanned for orphans on every cycle. Harmless when no orphans
exist, but would have generated false-positive L-03 hits in
ordinary recovery scenarios.
Fix:
- Replace the constant with `("success", "failed", "cancelled",
"skipped")` matching `TaskExecutionStatus`.
- Build the L-03 orphan-scan SQL filter from the constant
(`f"status NOT IN ({_TERMINAL_SQL_LIST})"`) so the two stay
in lockstep.
- Drop the duplicate constant declaration further down the file.
- Update e02 docstring + test fixtures (`'completed'` → `'success'`).
Verified locally:
- `pytest tests/test_canary_invariants.py` — 27/27 passing.
- End-to-end E-02 demo: cycle 1 seeds `canary:e02:terminal_seen`
(HEXISTS=1), cycle 2 fires after `UPDATE schedule_executions
SET status='running'`, cycle 3 stays silent (continuing-red),
fix-step reverts the row → cycle 4 green. Exactly 1 notification.
This was a real defect waiting to ship — caught only because the
demo went all the way through to a real fleet task. Argues for an
integration-test layer in a follow-up PR (per `test_systems.py`-
style harness discussed in the original PR thread).
…ration # Conflicts: # docs/memory/architecture.md
…variant #16)
_collect_terminal_executions filtered with `datetime('now', '-N minutes')`
against `completed_at` (written via `utc_now_iso()` → ISO-Z). The two
formats diverge at the date/time separator (T vs space), so lexicographic
compare let every same-day terminal row through regardless of age — the
exact failure invariant #16 / issue Abilityai#476 was written about.
Consequence: on busy installs the canary:e02:terminal_seen Redis hash
grew unbounded until PREV_TERMINAL_MAX nuked the prev set, leaving E-02
with nothing to diff against on the following cycle.
- Extend iso_cutoff() with a keyword-only `minutes=` parameter for
sub-hour windows. Existing positional callers unaffected.
- Switch snapshot._collect_terminal_executions to iso_cutoff(minutes=...).
- Pin the new signature with TestMinutesKwarg in test_iso_cutoff.py
(incl. a TypeError check on `iso_cutoff(2, 30)` so a future refactor
can't quietly turn that into "2h30m").
…cleResult The router's re-fetch loop ran list_canary_violations once per violation (not per invariant), repeatedly querying the same (invariant_id, snapshot_time) window and matching rows back to violations by signal_query — both wasteful and silently wrong when two violations share a signal_query. The service already had the row id from insert_canary_violation and threw it away. Thread it through CycleResult.persisted_violation_ids (index-aligned with .violations, None on insert failure) so the router maps directly to row ids without touching the DB.
The 27-test suite hit DB CRUD and the pure invariants but never exercised the orchestrator — which is where the demo-driven bugs lived (e7c11b2 transition spam, ef40cf9 terminal-status typo). Both shipped past the unit suite and were caught only by hand-driven demo runs. Adds a TestCanaryService class with four end-to-end tests over run_cycle() using the existing FakeRedis + temp SQLite, plus a canary_service fixture that stubs `database` / `db_models` and records create_notification calls so transitions can be asserted directly: - test_first_cycle_violation_fires_one_notification — baseline - test_continuing_red_does_not_re_fire — regression for e7c11b2 - test_red_green_red_fires_twice — covers the re-arm branch - test_terminal_status_set_seeds_e02_side_table — regression for ef40cf9 (would fail against the pre-fix tuple) Also adds string get/set ops to FakeRedis so the previous-cycle cursor (canary:last_cycle_at) round-trips. 31/31 tests pass.
…ration # Conflicts: # src/backend/db/migrations.py
…tions Per product call: green→red transitions no longer write agent_notifications rows or broadcast over WebSocket. Slack lands in a follow-up PR and plugs into the same _emit_transition seam. - _emit_transition now calls a no-op _notify_transition_stub (logs only) - _render_message / _broadcast / _severity_rank / ws-manager hooks kept in place so the next PR is a single-file change - Tests assert on transition_invariant_ids / cumulative_transitions and pin notification_calls == [] to catch regressions - architecture.md notes Phase 1 has no alert sink wired
The dashboard surface is the Operating Room notifications panel, not a literal navbar bell — comments and docstrings now say "alert sink" or "panel" so a future reader doesn't go hunting for a bell icon.
…efault doesn't exist) The manifest referenced `local:default`, which has no template directory under config/agent-templates/. Agent creation silently skips template-volume mounting on missing paths (services/agent_service/crud.py:341), so the fleet members were spinning up from the bare base image with no CLAUDE.md — verified with `docker exec agent-canary-fleet-burst cat /home/developer/CLAUDE.md` returning "No such file". Switched to `local:test-echo`: minimal stock template, no MCP/credentials, deterministic short replies — exactly what S-01/E-02 slot-churn invariants need. Schedule messages simplified accordingly. Verified end-to-end on local: deleted broken containers, redeployed via POST /api/systems/deploy, agents came up with templates, autonomy enabled, schedules fired, real-flow ZREM regression in slot_service.py produced S-01 critical violation captured by both the 5-min watcher (transitions=1) and on-demand POST /api/canary/run-cycle. Recovery path returned to green.
The Notifications bullet still described the rejected dashboard path (db.create_notification → agent_notifications) after 6e1bb94 stubbed the alert sink. Replaced with an Alert sink bullet noting the no-op _notify_transition_stub and the preserved _emit_transition seam for the follow-up Slack PR. Architecture.md and requirements.md now agree.
…hook - _emit_transition POSTs Block Kit payload via slack_service.post_webhook - payload: severity emoji header + friendly invariant name, summary, per-invariant forensic detail, runbook hint, context line with snapshot_time + violation row #N + "last red Xm ago" badge - CANARY_SLACK_WEBHOOK_URL env var (unset = silent sink) - previous_violation_at now populated through CycleResult (was always None) - removes dead WS-broadcast scaffolding from the dropped Phase 1 path
Justification was canary harness notifications under a synthetic agent name, but the alert sink was rerouted to Slack (21b833d) and never writes to the notifications table. Reverting restores ownership-scoped filtering for admins — no live caller depends on the bypass.
… response Thread snapshot.sources_unavailable through CycleResult so the on-demand run-cycle endpoint reflects degraded Redis/SQLite reads instead of always returning []. The field's whole purpose is to expose partial-state cycles to the operator running the smoke test.
Pre-pivot vocabulary referenced a canary agent template invoking mcp__trinity__send_notification on a per-agent loop. The harness has been a backend service with a Slack webhook sink for several commits; update the docstrings to match.
Header docstring stopped at Abilityai#34 while MIGRATIONS tuple has 57 entries. Pre-existing drift; the canary migration added to it.
…erts The E-02 phantom-reversal side-table stored the literal string "terminal" against every id, so once the Slack sink landed its forensic line printed "terminal → running" — useless to on-call. The snapshot now carries terminal_exec_statuses: dict[eid, status] and E-02 maintains a parallel canary:e02:terminal_status hash keyed alongside the canary:e02:terminal_seen ZSET. The real prior status (success / failed / cancelled / skipped) is read back when a reversal fires and rendered into both observed_state and the signal_query line. Aged-out ids are HDEL'd in the same sweep that ZREMRANGEBYSCORE drops them so the two structures stay in sync. Bundled side-table refactor: HASH + hard 5000-row DEL-on-overflow becomes ZSET trimmed by ZREMRANGEBYSCORE on PREV_TERMINAL_RETENTION_SECONDS (1h). The hard-reset trim left a one-cycle E-02 blind spot every time the cap crossed on busy installs. Tests: extends FakeRedis with zrangebyscore / hmget / hdel; asserts the existing reversal test now reports previous_status="success", adds a new test that round-trips all four terminal statuses through seed→reverse, and a regression test for the age-based trim invariant.
… endpoint After the 422 raise on invalid_ids, requested_ids is provably all-valid, so re-filtering against INVARIANTS was dead code. Pass requested_ids directly to run_cycle().
Comment said `>=` but the code uses `<`.
CANARY-001 alert sink ships in Phase 1, not Phase 2.
vybe
left a comment
There was a problem hiding this comment.
Review: Request Changes
Two items before this can merge.
1. NameError in POST /api/canary/run-cycle — hard blocker
In src/backend/routers/canary.py, the run_canary_cycle() handler references valid_ids which is never defined. It should be requested_ids (defined ~50 lines earlier). Every call to the endpoint will throw NameError: name 'valid_ids' is not defined. The background loop is unaffected since it calls run_cycle() directly, but the on-demand endpoint is completely broken.
# BUG — valid_ids is undefined; crashes on every request
checks_skipped = [i for i in valid_ids if i not in cycle.violations]
checks_run=[i for i in valid_ids if i in cycle.violations],
# Fix — replace with requested_ids
checks_skipped = [i for i in requested_ids if i not in cycle.violations]
checks_run=[i for i in requested_ids if i in cycle.violations],2. canary_service.py is 692 lines — please split before merge
The service file is already 692 lines and the diff shows canary_service.py at 692 lines added. Target is under 1000 lines per file, but this one is already getting close and Phase 2 invariants will make it grow. Recommend splitting before merge while the boundaries are obvious:
canary_service.py— just theCanaryServiceclass and the 5-min loop (~200 lines)canary/alert_sink.py(or similar) — Slack Block Kit formatting + webhook POST logic (currently embedded in the service)canary/transition.py(or similar) — green→red transition detection logic
The snapshot collector and invariant library are already well-separated in src/backend/canary/. The service file should be equally lean.
Everything else here is well-designed — pure-function invariants, graceful degradation on source unavailability, age-based E-02 trimming, one-alert-per-transition suppression, and a solid 1400-line test suite. Happy to approve once these two are addressed.
…lse positives The bijection check fired on a transient mismatch caught by a snapshot landing in the ~30ms gap between the SQL `status='running'` commit and the Redis ZADD on start (or the ~5ms gap between SQL terminal flip and ZREM on stop). Real `release()` skips and phantom rows persist across multiple cycles, so 3s of grace silences the races without losing coverage of the bug class S-01 was designed to catch. - snapshot exposes `running_started_at` per running id and `slot_scores` per slot member (`time.time()` at ZADD), so the check can age-filter. - S-01 drops `in_sql_only` ids whose `started_at` is within 3s and `in_redis_only` ids whose ZSET score is within 3s. Existing strict behaviour applies for older / unparseable timestamps. - FakeRedis test stub gains `withscores` for ZRANGE. - Three new tests: fresh-sql suppressed, fresh-redis suppressed, durable mismatch still fires. 51/51 canary tests pass.
…ration # Conflicts: # src/backend/db/schema.py
… split alerts (1) NameError on POST /api/canary/run-cycle: handler referenced `valid_ids` which never existed (commit 02cfc97 only fixed the call site, not the two trailing usages). Endpoint crashed on every request. Replace with `requested_ids` (provably valid post-422 check). (2) services/canary_service.py was 692 lines and growing. Extract the Slack alert surface (constants + `_emit_transition` + `_build_slack_payload` + `_render_forensic` + `_format_row_refs` + `_format_last_red` + `_render_message`) into services/canary_alerts.py as `CanaryAlerts`. canary_service.py drops to 347 lines, alerts module is 370 — both well under the 1000-line target. No behaviour change. Tests pivoted to `CanaryAlerts._foo` references; 51/51 pass.
…ts split Stale `_emit_transition` / bare `_build_slack_payload` references in test docstrings and the slack_capture fixture's hint pointed at the old method names. Renaming to `CanaryAlerts.emit_transition` etc. so the hints match the actual code path. No test logic changes.
I1 — Webhook URL leak in error logs (services/slack_service.py:309-311): httpx exceptions embed the request URL in their __str__, and for canary alerts that URL IS the credential. The catchall returned `str(e)` which was then logged by `CanaryAlerts.emit_transition`. Sanitize by returning only the exception class name + a generic message; no URL ever lands in Vector logs. I2 — `checks_skipped` always empty (routers/canary.py): `run_invariants` populates `cycle.violations[inv_id]` for every selected id (with `[]` when clean or when a check raised), so the field would always be empty. The actionable signal lives in `sources_unavailable` already. Drop the field from `RunCycleResponse` and simplify `checks_run` to the request's filter list directly. I3 — Severity rank dict duplicated: promote `_severity_rank` → `severity_rank` in canary_alerts and import it into the router instead of redeclaring the literal. 51/51 canary tests pass; live-stack endpoint verified (POST /run-cycle, partial filter, 422 path).
- I1: architecture.md run-cycle row referenced a non-existent
`checks_skipped` field; replaced with `sources_unavailable` and
documented the new 409 response.
- I2: lock-contention path returned an empty 200 indistinguishable
from a real green cycle. CycleResult now carries `skipped: bool`
and the router maps it to HTTP 409 `cycle in progress`.
- I3: TERMINAL_EXECUTION_STATUSES is now derived from
TaskExecutionStatus.{SUCCESS,FAILED,CANCELLED,SKIPPED}.value
instead of a hand-maintained string tuple. Tests stub `models`
so unit imports don't pull in pydantic/db_models.
51/51 canary unit tests + 13/13 iso_cutoff tests green.
`datetime.fromisoformat(ts.rstrip("Z")).timestamp()` stripped the Z
suffix, leaving a naive datetime; `.timestamp()` then interprets naive
datetimes as local time, so on any host with TZ != UTC the grace-window
math was off by the host's UTC offset — producing false-positive S-01
alerts. Python 3.11's `fromisoformat` parses Z natively into
`tzinfo=UTC`, so dropping the rstrip restores correctness on any host.
Surfaced by /review I1.
Diff-scoped audit of the canary harness (Phase 1 of Abilityai#411). Zero findings at the 8/10 daily-mode confidence gate — admin-only API surface, outbound-only Slack with explicit URL-leak mitigation, parameterized SQL with constants-only f-string composition, env-gated activation. Five candidates considered and dropped below the gate (documented in the report).
vybe
left a comment
There was a problem hiding this comment.
Both items addressed — NameError fixed and alert sink split into . CSO audit clean (0 findings). LGTM.
vybe
left a comment
There was a problem hiding this comment.
Both items addressed — NameError fixed and alert sink split into canary_alerts.py. CSO audit clean (0 findings). LGTM.
Adds three moderate-complexity invariants on top of Phase 2. Each
brings exactly one new piece of plumbing — first time the canary
takes a hard dep on a non-trivial source beyond SQLite + Redis basic
ops:
- S-03 — Slot TTL ≥ execution timeout. For every member of
`agent:slots:A`, the companion `agent:slot:A:{eid}` HASH must have
`TTL ≥ execution_timeout_seconds + 300s` (SLOT_TTL_BUFFER). Three
failure kinds surfaced explicitly: `missing` (-2; the #226 class),
`no_expiry` (-1), `below_floor` (positive TTL under floor). Critical.
Tier A. Per-slot `redis.ttl()` lookup, bounded by ZCARD per agent
(≤ max_parallel_tasks).
- B-02 — No queued without slots-full. If any agent has queued > 0,
then either `slot_count == max_parallel` (legit backpressure) OR a
drain tick fired in the last 60s (drain will pick it up). Critical.
Tier B. Requires `CapacityManager.run_maintenance()` to write a
unix-timestamp heartbeat to `canary:drain_tick_at` at the END of
each successful sweep — mid-sweep crash leaves cursor stale and
lets B-02 catch the breakage. One-line write in capacity_manager.py,
rest is canary-local.
- R-01 — No zombie Claude processes. For every running
`trinity.platform=agent` container,
`ps -eo stat,comm | grep '^Z.*claude' | wc -l` must be 0. Critical.
Tier A. Guards PR #407. New source type — docker exec via the
existing docker_service.docker_client. Per-container failures recorded
in `sources_unavailable` so a single unhealthy container doesn't
kill the cycle. Regex anchored at `^Z` rather than the catalog's
` Z` (leading-space) — procps-ng on the agent base image emits
STAT left-aligned without padding; verified live by spawning an
actual zombie via `os.fork()`+`prctl(PR_SET_NAME, "claude")`.
Snapshot extended:
- `AgentSnapshot.slot_ttls: Dict[str, int]` — per-slot metadata TTL,
drain sentinels skipped at collection time.
- `Snapshot.drain_tick_at: Optional[float]` — read from
`canary:drain_tick_at`, sentinel-`None` on cold cluster.
- `Snapshot.zombie_counts: Dict[str, int]` — per-agent zombie process
count via container.exec_run; missing entry = exec failed for that
container (recorded in `sources_unavailable`).
Tests: 67 → 84 passing. Added `fake_docker` fixture so the synthetic
container list is controllable; FakeRedis got a `ttl()` method with
the standard -2/-1/positive sentinel semantics.
Verification against local stack — for each invariant, provoke, run
`POST /api/canary/run-cycle`, observe red, revert, observe green:
- S-03: ZADD a slot + EXPIRE its metadata HASH to 30s while the floor
is 360s → critical violation `kind: below_floor`. Also covered the
`missing` kind by deleting the HASH entirely. Revert by EXPIRE 500.
- B-02: inserted 1 queued row, set `canary:drain_tick_at` to 600s
ago → critical violation, `free_slots: 2, drain_tick_age_seconds:
600`. Revert by writing a fresh timestamp.
- R-01: spawned a real zombie inside agent-cornelius-m via Python
fork + prctl PR_SET_NAME → critical violation
`zombie_count: 1`. Reaped by killing the parent → green.
All post-fix cycles returned `violations: 0, transitions: 0`. Final
all-10-invariants cycle on clean platform: 106ms cycle duration,
all green, `sources_unavailable: []`.
Refs: #411, #653 (Phase 1), #884 (this PR — Phase 2 also)
… B-02, R-01) (#884) * feat(#882): canary invariant harness Phase 2 (S-02, E-01, E-05, B-01) Adds four single-source SQL/Redis invariants to the canary harness (#411). All four follow the Phase 1 (#653) pattern — no new source types, no new infrastructure, registered into the same `INVARIANTS` dict the run-cycle endpoint and background loop already drive. - S-02 — No overbooking. `ZCARD(agent:slots:A)` (drain sentinels filtered) > `max_parallel_tasks`. Critical. Tier A. Catches `acquire_slot` bypass — distinct from S-01 because the violation can be self-consistent (Redis and SQL agree on N+1 vs cap of N). - E-01 — Terminal-state closure. No `status='running'` row older than `execution_timeout_seconds + 300s` (matches `SLOT_TTL_BUFFER` so the check fires *after* cleanup has had its window). Critical. Tier B. - E-05 — Dispatched rows have session. No running row older than 60s with `claude_session_id IS NULL`. Major. Tier B. Guards #106. - B-01 — Queue-status coherence. `db.get_queued_count` (the accessor BacklogService calls) agrees with the snapshot's independently- collected `len(queued_exec_ids)`. Critical. Tier A. Trivially-green today after the #428 consolidation; regression guard against a future cache layer or status-filter drift on the production accessor. Snapshot extended with per-execution `claude_session_id` (E-05) and per-agent `queued_count_via_service` (B-01). The session-id collector PRAGMA-introspects the column so the minimal unit-test DDLs don't have to mirror every production column. The service-count collector lazy-imports `database.db`, returning `None` on import failure so unit tests (which stub `db.connection` but not the full facade) skip B-01 silently rather than firing a false positive. Unit tests: 67 passing (was 51). Each new invariant has positive, negative, and edge-case tests. Verification against local stack — for each invariant, provoke, run `POST /api/canary/run-cycle`, observe red, revert, observe green. All four reproduce as designed: - S-02 — ZADD'd 3 fake slot ids when max_parallel=2 → critical violation with `overbooked_by: 1`. - E-01 — inserted `status='running'` row with `started_at` 2h ago against a 60s-timeout agent → critical violation, `age_seconds: 7436 > timeout+buffer=360s`. - E-05 — inserted `status='running'` row 3 min old with `claude_session_id` NULL → major violation, age=188s. - B-01 — temporarily patched `db.get_queued_count` to return `count - 1` → critical violation, "db.get_queued_count = 0 != |queued ids in snapshot| = 1". Each post-fix cycle returned `violations: 0, transitions: 0`. Refs: #411, #653, docs/testing/orchestration-invariant-catalog.md * feat(#882): canary harness Phase 3 (S-03, B-02, R-01) Adds three moderate-complexity invariants on top of Phase 2. Each brings exactly one new piece of plumbing — first time the canary takes a hard dep on a non-trivial source beyond SQLite + Redis basic ops: - S-03 — Slot TTL ≥ execution timeout. For every member of `agent:slots:A`, the companion `agent:slot:A:{eid}` HASH must have `TTL ≥ execution_timeout_seconds + 300s` (SLOT_TTL_BUFFER). Three failure kinds surfaced explicitly: `missing` (-2; the #226 class), `no_expiry` (-1), `below_floor` (positive TTL under floor). Critical. Tier A. Per-slot `redis.ttl()` lookup, bounded by ZCARD per agent (≤ max_parallel_tasks). - B-02 — No queued without slots-full. If any agent has queued > 0, then either `slot_count == max_parallel` (legit backpressure) OR a drain tick fired in the last 60s (drain will pick it up). Critical. Tier B. Requires `CapacityManager.run_maintenance()` to write a unix-timestamp heartbeat to `canary:drain_tick_at` at the END of each successful sweep — mid-sweep crash leaves cursor stale and lets B-02 catch the breakage. One-line write in capacity_manager.py, rest is canary-local. - R-01 — No zombie Claude processes. For every running `trinity.platform=agent` container, `ps -eo stat,comm | grep '^Z.*claude' | wc -l` must be 0. Critical. Tier A. Guards PR #407. New source type — docker exec via the existing docker_service.docker_client. Per-container failures recorded in `sources_unavailable` so a single unhealthy container doesn't kill the cycle. Regex anchored at `^Z` rather than the catalog's ` Z` (leading-space) — procps-ng on the agent base image emits STAT left-aligned without padding; verified live by spawning an actual zombie via `os.fork()`+`prctl(PR_SET_NAME, "claude")`. Snapshot extended: - `AgentSnapshot.slot_ttls: Dict[str, int]` — per-slot metadata TTL, drain sentinels skipped at collection time. - `Snapshot.drain_tick_at: Optional[float]` — read from `canary:drain_tick_at`, sentinel-`None` on cold cluster. - `Snapshot.zombie_counts: Dict[str, int]` — per-agent zombie process count via container.exec_run; missing entry = exec failed for that container (recorded in `sources_unavailable`). Tests: 67 → 84 passing. Added `fake_docker` fixture so the synthetic container list is controllable; FakeRedis got a `ttl()` method with the standard -2/-1/positive sentinel semantics. Verification against local stack — for each invariant, provoke, run `POST /api/canary/run-cycle`, observe red, revert, observe green: - S-03: ZADD a slot + EXPIRE its metadata HASH to 30s while the floor is 360s → critical violation `kind: below_floor`. Also covered the `missing` kind by deleting the HASH entirely. Revert by EXPIRE 500. - B-02: inserted 1 queued row, set `canary:drain_tick_at` to 600s ago → critical violation, `free_slots: 2, drain_tick_age_seconds: 600`. Revert by writing a fresh timestamp. - R-01: spawned a real zombie inside agent-cornelius-m via Python fork + prctl PR_SET_NAME → critical violation `zombie_count: 1`. Reaped by killing the parent → green. All post-fix cycles returned `violations: 0, transitions: 0`. Final all-10-invariants cycle on clean platform: 106ms cycle duration, all green, `sources_unavailable: []`. Refs: #411, #653 (Phase 1), #884 (this PR — Phase 2 also) * fix(canary): /review fixes — alert quality + B-02 boot-window false-positive Addresses three findings from the pre-landing /review pass: I1 — Alert quality for Phase 2 + 3 invariants. canary_alerts.py only had S-01/E-02/L-03 entries in `_INVARIANT_NAMES`, `_INVARIANT_RUNBOOKS`, `_render_message`, and `_render_forensic`. New ids fell through to the "S-02 fired N violation(s)" generic fallback with the id doubled in the header. Added 7 entries each: - Friendly name and one-line runbook hint per invariant - Per-id `_render_message` (e.g. "3 zombie claude process(es) across 1 agent(s): cornelius-m" for R-01) - Per-id `_render_forensic` rendering of the relevant observed_state fields, truncated to 5 violations with a "+N more" footer Verified by hand-building an R-01 ViolationReport and inspecting the Block Kit payload — header, body, forensic, runbook, and context all render the new shape. I2 — B-02 boot-window false-positive. Background canary loop is fine (30s startup vs 15s maintenance loop), but the on-demand `POST /api/canary/run-cycle` endpoint can hit in the first 15s when no heartbeat exists. With pre-existing queued rows and free slots, B-02 would fire with `drain_tick_age_seconds: null`. `CapacityManager.__init__` now seeds the heartbeat with a fresh timestamp on construction. The maintenance loop overwrites on every successful tick; init only needs a non-stale floor. Verified live: deleted the heartbeat key, bounced the backend, key is present immediately — no waiting for the maintenance tick. I3 — Stale docstring in `_collect_zombie_counts`. The docstring still described the catalog's ` Z.*claude` (leading-space) reasoning while the actual `cmd` line uses `^Z.*claude`. Updated to describe the anchor-at-line-start version and reference the live-zombie verification. Bonus tidy: moved `import time` from inside `run_maintenance` to the top-of-file imports. Tests: 84/84 still green. Full all-10-invariant cycle clean. Refs: /review pass on #884 * feat(canary-fleet): replace long with sleep-echo slow agent `canary-fleet-long` was a duplicate of burst — same template, same task duration, same model — only the cron differed (*/5 vs */2). It added no coverage burst didn't already provide for S-01, E-02, S-03, R-01. Replace it with a `slow` agent backed by a new `sleep-echo` local template that sleeps 75s per task. This gives Phase 2 invariants something to inspect: - E-05 (dispatched rows have session): needs >60s running rows; was trivially-green with 4s test-echo tasks - S-03 below_floor: needs a slot to exist at canary snapshot time Also locks burst's live config into the yaml so a redeploy doesn't revert prior live SQL fixes: - cron: * * * * * -> */2 * * * * (cheapest cadence that phase-slides against the 5-min canary cycle) - description / comments updated to reflect actual coverage scope Manifest deploy can't express `model`, `max_parallel_tasks`, or `execution_timeout_seconds` (system_service.create_schedules drops them, SystemAgentConfig has no slot for capacity). Documented the four required post-deploy API calls in the yaml header so the next operator doesn't trip on it. * chore(lint): regenerate sys.modules baseline to absorb dev state Pre-existing CI failure inherited from dev — `dev` has been failing this lint since #871 (commit 98574f3) merged on 2026-05-17. That PR added 6 sys.modules violations in tests/unit/test_slot_per_slot_ttl.py without regenerating the baseline; a separate cleanup retired 3 violations in tests/unit/test_cleanup_unreachable_orphan.py. Regenerated via `python tests/lint_sys_modules.py --regenerate-baseline` — the path the lint script itself directs you to when violations move below baseline AND new files exceed it. Net: 235 violations in 67 files (unchanged total). No code-quality regression in this PR's actual diff — none of the canary tests use bare sys.modules manipulation (they use monkeypatch.setitem throughout).
Summary
Implements Phase 1 of #411 — a continuous orchestration-invariant harness that catches the bug class behind PRs #378, #403, #129, #226: race conditions and cross-component state drift between Redis × SQLite × agent registries that unit tests miss.
Three components, each justified separately in the PR planning thread:
src/backend/canary/) — pure-function checks shared between the watcher and the on-demand admin endpoint. No LLM reasoning anywhere; same snapshot in, same violations out.services/canary_service.py) — 5-min loop modeled oncleanup_service.py. Disabled by default; enable on staging/dev withCANARY_ENABLED=1.config/canary-fleet.yaml) — synthetic load via the existing/api/systems/deploymechanism. Without traffic the harness reports trivially-green cycles.Phase 1 invariants
ZRANGE agent:slots:{name}(drain sentinels filtered) == SQL running rows for that agentcanary:e02:terminal_seen) instead of Vector log diff for simplicity.agent_sharing,agent_schedules, non-terminalschedule_executions,agent_skills,agent_tags,agent_shared_files,agent_public_links, pendingoperator_queue/access_requests, agent-scopedmcp_api_keys, activechat_sessions, plus orphanagent:slots:*Redis keysSeverity: critical for active-orchestration orphans (running executions, Redis slot keys); major for dangling permissions/sharing/tags.
Terminal-execution set used by E-02 + L-03 is sourced from
TaskExecutionStatus(success,failed,cancelled,skipped) — the same set PR #524's CAS state machine treats as write-once.Why a separate
snapshot.pyfrom the invariant checksStated in the module docstring; summarised here:
agent_ownershipset, execution rows). A sharedSnapshotinstance means per-check timing drift can't introduce spurious mismatches.(snapshot) → list[ViolationReport]; they never re-implement SELECTs or ZRANGEs.Snapshotdataclasses straight in (tests/test_canary_invariants.py) — no live Redis or SQLite required to exercise checking logic.The snapshot is not atomic across Redis and SQLite — those don't share transactions. The harness deliberately accepts sub-second inconsistencies because real bugs persist across a 5-minute cycle by definition; transient races self-resolve and aren't what the canary exists to catch.
Transition-detection rule
A green→red transition fires exactly one Slack alert.
Implementation uses a Redis cursor
canary:last_cycle_atso each cycle compares this snapshot's violation timestamps against the previous cycle's snapshot, not the latest persisted violation. The router endpoint surfaces exactly the transitions the service emitted (no recomputation), so the API response and the alert sink can never disagree.Alert sink (Slack incoming webhook)
The harness fires green→red alerts via a Slack incoming webhook — admin-side configuration, no Settings UI. The canary is staging/dev-only and the operator already has shell access.
CANARY_SLACK_WEBHOOK_URLenv var. Unset = silent sink (cycles still run, violations still persist, debug-line records the detected transition).🚨 Canary: <invariant_id>), summary (severity, violation count, snapshot time), forensic detail per violation, runbook hint, "last red Xm ago" context badge when reopening after a green window.slack_service.post_webhook; failures are logged and swallowed so a hung webhook can't break the cycle. The violation row is still persisted regardless of webhook outcome.notifications, and doesn't change any per-agent access filter. The earlier admin-bypass approach was reverted inde6786f3once Slack became the sink.API + persistence
canary_violationstable (migration Smart Model Routing #34) — append-mostly, JSONobserved_statecarries invariant-specific payload.GET /api/canary/violations+/violations/stats+/violations/{id}— admin-only read paths.POST /api/canary/run-cycle— admin-only on-demand trigger that delegates to the sameCanaryService.run_cycle()invoked by the background loop. Useful for smoke-tests, post-fix verification, and integration tests that need deterministic timing. Response surfacessources_unavailableso a partial snapshot (e.g. Redis down) is visible to the caller instead of silently degrading checks (a3b77e34).Tests
27 unit tests in
tests/test_canary_invariants.pycoveringCanaryOperationsCRUD/filters/stats, snapshot collector behaviour, each invariant's holds-and-fires cases, the L-03 critical/major severity split, and the runner registry. All green locally with a fake Redis + temp SQLite.Smoke-test recipes
These are the manual end-to-end demos the harness was built to support. Run after
CANARY_ENABLED=1and (optionally)CANARY_SLACK_WEBHOOK_URL=<your-test-webhook>is set; fleet deployed viacurl -X POST .../api/systems/deploywithconfig/canary-fleet.yaml.Option 1 — L-03 via SQL poke (no code changes)
Surgical, deterministic, no rebuild. Triggers
L-03by inserting one orphan row directly:Option 2 — L-03 via real-flow code regression
Most authentic for L-03 — exercises the actual delete cascade.
src/backend/db/agents.py:128and comment out the cascade DELETE:# cursor.execute("DELETE FROM agent_sharing WHERE agent_name = ?", (agent_name,))--reloadpicks it up.POST /api/agents→POST /api/agents/{name}/share→DELETE /api/agents/{name}. Theagent_sharingrow is left behind.POST /api/canary/run-cycle→ L-03 fires.Option 3 — S-01 via real-flow code regression
Triggers
S-01by breaking slot release inservices/slot_service.py.src/backend/services/slot_service.py:168and replace theZREMwith a no-op:--reloadpicks it up.canary-fleet-burst's 1-minute schedule). When the task completes,release_slot()is called but the ZREM is no-op'd — Redis ZSET keeps the execution_id while SQL marks the rowsuccess.POST /api/canary/run-cycle→ S-01 fires (severity=critical,in_redis_onlycontains the leaked execution_id).Option 4 — E-02 via DB simulation
E-02 is the only invariant where direct DB manipulation is the realistic path: PR #524 added CAS guards specifically blocking terminal→non-terminal flips at the code level, so a "comment out the guard + trigger a reversal" recipe doesn't work cleanly. The DB poke simulates exactly what the bug class does — a silent status reversal — which is what E-02 exists to catch.
success).canary:e02:terminal_seenwith the completed execution_id.severity=critical,prev=terminal curr=running).Cycle hygiene to expect (verified live for all four recipes)
CANARY_SLACK_WEBHOOK_URLis unset).violations: []. Alert count holds.Test plan
pytest tests/test_canary_invariants.py— 27/27)python3 -m py_compile)CANARY_ENABLED=0(default — no canary activity)CANARY_ENABLED=1, watcher runs on the 5-min loop, fleet deployable via/api/systems/deployCANARY_ENABLED=1andCANARY_SLACK_WEBHOOK_URLunset — silent sink, transitions still detected, debug log records themNotes from the live demo
Real bugs surfaced during the end-to-end walkthrough and are fixed in this PR:
CANARY_ENABLEDwas set in.envbut never propagated to the backend container — added explicit env passthrough indocker-compose.yml.('completed', 'failed', 'cancelled', 'timeout')— wrong on three of four values vs Trinity's actualTaskExecutionStatusenum (success,failed,cancelled,skipped). Would have silently broken E-02 on production data and generated false-positive L-03 hits. Now sourced from the canonical enum and reused for both invariants.de6786f3once the alert sink moved to Slack incoming webhook in21b833d4— no notifications-table writes, no access-filter changes anywhere.prev_statusfrom thecanary:e02:terminal_seenRedis hash without preserving the actual terminal status; alerts now show the real prior status (7cca4b8c).These argue for adding an integration-test layer in a follow-up — pytest-driven
test_systems.py-style tests against a live backend would catch this class of issue automatically. Discussed in the PR thread.Phase 2 (deferred)
S-02, S-03, E-01, E-05, E-06, B-01, B-02, G-01, R-01 (per
docs/testing/orchestration-invariant-catalog.md). Each lands as a new file undersrc/backend/canary/invariants/plus a registry entry. The watcher service and API surface stay unchanged.Closes the infrastructure portion of #411's acceptance criteria.