feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01) by obasilakis · Pull Request #884 · Abilityai/trinity

obasilakis · 2026-05-18T12:21:48Z

Closes #882. Builds on #653 (Phase 1).

Ships Phases 2 and 3 of the canary harness together — seven new invariants registered into the same INVARIANTS dict that the run-cycle endpoint and background loop already drive.

Phase 2 — single-source SQL/Redis (no new sources)

ID	Tier	Severity	What it catches
S-02	A	critical	`ZCARD(agent:slots:A)` > `max_parallel_tasks` (capacity bypass)
E-01	B	critical	`status='running'` row past `execution_timeout_seconds + SLOT_TTL_BUFFER`
E-05	B	major	`status='running'` row older than 60s with `claude_session_id IS NULL` (#106 guard)
B-01	A	critical	`db.get_queued_count` disagrees with snapshot's id-list count

Phase 3 — moderate (one new piece of plumbing each)

ID	Tier	Severity	What it catches	New plumbing
S-03	A	critical	Slot metadata HASH TTL below `execution_timeout_seconds + 300s` (#226 class)	Per-slot `redis.ttl()` lookup
B-02	B	critical	Queued > 0 with free slots and stale drain heartbeat (stalled drain)	`canary:drain_tick_at` heartbeat written by `CapacityManager.run_maintenance()`
R-01	A	critical	Zombie `claude` processes inside agent containers (#407 regression)	`docker exec` per running `trinity.platform=agent` container

Why each one is useful

Phase 2

S-02 vs S-01. S-01 says "Redis and SQL disagree". S-02 says "even Redis on its own already shows we're past the cap". They overlap but catch different bug classes.
E-01. Aligned with SLOT_TTL_BUFFER so the check fires after cleanup has had its window. A violation unambiguously means cleanup failed to act.
E-05. First Claude turn returns a session id in ms. >60s without one means the session-id update path is broken — same bug class as bug: Cleanup service misses 'skipped' executions and slow to detect no-session failures #106.
B-01. Trivially-green today (the backlog is status='queued' rows after refactor: consolidate ExecutionQueue + SlotService + BacklogService into CapacityManager #428), but lives as a regression guard. The point is to fail loudly the day someone caches db.get_queued_count and gets the invalidation wrong.

Phase 3

S-03. Three failure kinds surfaced explicitly: missing (-2, metadata HASH expired before ZSET — the bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout #226 case), no_expiry (-1, redis.expire() never set), below_floor (positive TTL under the configured floor). All three are critical because a slot with a too-short or missing TTL becomes a permanent capacity leak.
B-02. The drain-tick heartbeat is written at the END of run_maintenance() (not the start), so a mid-sweep crash leaves the cursor stale and lets the canary catch the breakage rather than masking it.
R-01. First canary invariant that needs Docker. Failure modes graceful — per-container exec failure is recorded in sources_unavailable and the check skips that agent rather than firing a false positive. Cycle cost ≈ 1 exec per running agent per 5 min.

Snapshot changes

Field	Phase	Purpose
`AgentSnapshot.running_claude_session_ids`	2	E-05 per-execution session id
`AgentSnapshot.queued_count_via_service`	2	B-01 production-accessor cross-check
`AgentSnapshot.slot_ttls`	3	S-03 per-slot Redis TTL
`Snapshot.drain_tick_at`	3	B-02 heartbeat timestamp
`Snapshot.zombie_counts`	3	R-01 per-agent zombie count

All collectors fail gracefully — per-agent exec failures, missing accessors, and unreachable Redis are recorded in sources_unavailable and the affected invariant either skips that agent or skips the whole cycle.

Unit tests

51 → 84 passing. Each invariant has positive, negative, and edge-case tests.

$ python3 -m pytest tests/test_canary_invariants.py -q
84 passed in 0.75s

Test fixtures: FakeRedis got a ttl() method with the standard -2/-1/positive sentinel semantics; new fake_docker fixture exposes a controllable container list with exec_run stubs.

Live verification

Against the local stack (CANARY_ENABLED=0 but the POST /api/canary/run-cycle admin endpoint still works). For each invariant: provoke, observe red, revert, observe green. All seven reproduce as designed.

S-02 (Phase 2)

$ # Provoke: ZADD 3 fake slot ids when max_parallel=2
{
  "violations": [{
    "invariant_id": "S-02", "severity": "critical",
    "observed_state": {
      "slot_count": 3, "max_parallel_tasks": 2, "overbooked_by": 1,
      "slot_ids": ["fake-eid-1", "fake-eid-2", "fake-eid-3"]
    }
  }]
}

E-01 (Phase 2)

$ # Provoke: insert status='running' row started 2h ago, agent timeout=60s
{
  "violations": [{
    "invariant_id": "E-01", "severity": "critical", "tier": "B",
    "observed_state": {
      "age_seconds": 7436,
      "execution_timeout_seconds": 60, "slot_ttl_buffer_seconds": 300
    }
  }]
}

E-05 (Phase 2)

$ # Provoke: insert status='running' row 3min old, claude_session_id NULL
{
  "violations": [{
    "invariant_id": "E-05", "severity": "major", "tier": "B",
    "observed_state": {"age_seconds": 188, "grace_seconds": 60}
  }]
}

B-01 (Phase 2)

Patched db/schedules.py:get_queued_count to return count - 1 — exact bug class B-01 exists to catch:

$ # Provoke: 1 queued row + buggy accessor returning 0
{
  "violations": [{
    "invariant_id": "B-01", "severity": "critical", "tier": "A",
    "observed_state": {"service_count": 0, "snapshot_count": 1},
    "signal_query": "db.get_queued_count('canary-test-882') = 0 != |queued ids in snapshot| = 1"
  }]
}

S-03 (Phase 3)

$ # Provoke: ZADD a slot, EXPIRE its metadata to 30s, agent floor=360s
{
  "violations": [{
    "invariant_id": "S-03", "severity": "critical", "tier": "A",
    "observed_state": {
      "redis_ttl_seconds": 30, "floor_seconds": 360, "kind": "below_floor"
    },
    "signal_query": "TTL(agent:slot:canary-test-882:s03-eid) = 30 (below_floor); floor = 360s"
  }]
}

Also verified kind: missing (-2) by deleting the metadata HASH entirely.

B-02 (Phase 3)

$ # Provoke: 1 queued row, 0 slots (cap=2), set drain tick to 600s ago
{
  "violations": [{
    "invariant_id": "B-02", "severity": "critical", "tier": "B",
    "observed_state": {
      "queued_count": 1, "slot_count": 0, "free_slots": 2,
      "drain_tick_age_seconds": 600, "drain_tick_grace_seconds": 60
    },
    "signal_query": "agent canary-test-882: queued=1, slots=0/2 (free=2), drain_tick_age=600s > 60s"
  }]
}

R-01 (Phase 3) — the most satisfying verification

Spawned a real zombie process named claude inside the running agent-cornelius-m container via:

os.fork()
# Child: prctl(PR_SET_NAME, "claude"); sys.exit(0)
# Parent: sleep(120) without wait()

$ # ps -eo stat,comm | grep '^Z.*claude' → "Z    claude"
{
  "violations": [{
    "invariant_id": "R-01", "severity": "critical", "tier": "A",
    "observed_state": {"agent_name": "cornelius-m", "zombie_count": 1},
    "signal_query": "docker exec agent-cornelius-m sh -c \"ps -eo stat,comm | grep '^Z.*claude' | wc -l\" = 1"
  }]
}

Reaped by killing the parent PID → R-01 returned green on the next cycle.

Catalog deviation worth noting

The orchestration-invariant-catalog spec uses Z.*claude (leading space). procps-ng on the agent base image emits STAT left-aligned without padding, so that pattern misses. Switched to ^Z.*claude — same intent, works across formatters. Verified live against a real zombie above.

Final all-10 cycle

{
  "checks_run": ["S-01", "S-02", "S-03", "E-01", "E-02", "E-05",
                 "L-03", "B-01", "B-02", "R-01"],
  "sources_unavailable": [],
  "violations": [],
  "transitions": [],
  "cycle_duration_ms": 106
}

Non-canary code touched

Just one place: services/capacity_manager.py:run_maintenance writes the canary:drain_tick_at heartbeat at the END of each successful sweep (~5 lines including the comment and try/except). Everything else stays inside canary/.

Out of scope — Phase 4

E-06 — HTTP fan-out across agents to /api/executions/running. Needs network-flake handling and timeout budgets.
G-01 — One-shot post-restart probe. Doesn't fit the 5-min loop — needs a new "one-shot canary" execution mode.

Doc updates

docs/memory/architecture.md:

CANARY-001 section: added Phase 2 + Phase 3 invariant blocks.
Background Services table: documented the canary:drain_tick_at heartbeat write in the Capacity Maintenance row.

Adds four single-source SQL/Redis invariants to the canary harness (#411). All four follow the Phase 1 (#653) pattern — no new source types, no new infrastructure, registered into the same `INVARIANTS` dict the run-cycle endpoint and background loop already drive. - S-02 — No overbooking. `ZCARD(agent:slots:A)` (drain sentinels filtered) > `max_parallel_tasks`. Critical. Tier A. Catches `acquire_slot` bypass — distinct from S-01 because the violation can be self-consistent (Redis and SQL agree on N+1 vs cap of N). - E-01 — Terminal-state closure. No `status='running'` row older than `execution_timeout_seconds + 300s` (matches `SLOT_TTL_BUFFER` so the check fires *after* cleanup has had its window). Critical. Tier B. - E-05 — Dispatched rows have session. No running row older than 60s with `claude_session_id IS NULL`. Major. Tier B. Guards #106. - B-01 — Queue-status coherence. `db.get_queued_count` (the accessor BacklogService calls) agrees with the snapshot's independently- collected `len(queued_exec_ids)`. Critical. Tier A. Trivially-green today after the #428 consolidation; regression guard against a future cache layer or status-filter drift on the production accessor. Snapshot extended with per-execution `claude_session_id` (E-05) and per-agent `queued_count_via_service` (B-01). The session-id collector PRAGMA-introspects the column so the minimal unit-test DDLs don't have to mirror every production column. The service-count collector lazy-imports `database.db`, returning `None` on import failure so unit tests (which stub `db.connection` but not the full facade) skip B-01 silently rather than firing a false positive. Unit tests: 67 passing (was 51). Each new invariant has positive, negative, and edge-case tests. Verification against local stack — for each invariant, provoke, run `POST /api/canary/run-cycle`, observe red, revert, observe green. All four reproduce as designed: - S-02 — ZADD'd 3 fake slot ids when max_parallel=2 → critical violation with `overbooked_by: 1`. - E-01 — inserted `status='running'` row with `started_at` 2h ago against a 60s-timeout agent → critical violation, `age_seconds: 7436 > timeout+buffer=360s`. - E-05 — inserted `status='running'` row 3 min old with `claude_session_id` NULL → major violation, age=188s. - B-01 — temporarily patched `db.get_queued_count` to return `count - 1` → critical violation, "db.get_queued_count = 0 != |queued ids in snapshot| = 1". Each post-fix cycle returned `violations: 0, transitions: 0`. Refs: #411, #653, docs/testing/orchestration-invariant-catalog.md

Adds three moderate-complexity invariants on top of Phase 2. Each brings exactly one new piece of plumbing — first time the canary takes a hard dep on a non-trivial source beyond SQLite + Redis basic ops: - S-03 — Slot TTL ≥ execution timeout. For every member of `agent:slots:A`, the companion `agent:slot:A:{eid}` HASH must have `TTL ≥ execution_timeout_seconds + 300s` (SLOT_TTL_BUFFER). Three failure kinds surfaced explicitly: `missing` (-2; the #226 class), `no_expiry` (-1), `below_floor` (positive TTL under floor). Critical. Tier A. Per-slot `redis.ttl()` lookup, bounded by ZCARD per agent (≤ max_parallel_tasks). - B-02 — No queued without slots-full. If any agent has queued > 0, then either `slot_count == max_parallel` (legit backpressure) OR a drain tick fired in the last 60s (drain will pick it up). Critical. Tier B. Requires `CapacityManager.run_maintenance()` to write a unix-timestamp heartbeat to `canary:drain_tick_at` at the END of each successful sweep — mid-sweep crash leaves cursor stale and lets B-02 catch the breakage. One-line write in capacity_manager.py, rest is canary-local. - R-01 — No zombie Claude processes. For every running `trinity.platform=agent` container, `ps -eo stat,comm | grep '^Z.*claude' | wc -l` must be 0. Critical. Tier A. Guards PR #407. New source type — docker exec via the existing docker_service.docker_client. Per-container failures recorded in `sources_unavailable` so a single unhealthy container doesn't kill the cycle. Regex anchored at `^Z` rather than the catalog's ` Z` (leading-space) — procps-ng on the agent base image emits STAT left-aligned without padding; verified live by spawning an actual zombie via `os.fork()`+`prctl(PR_SET_NAME, "claude")`. Snapshot extended: - `AgentSnapshot.slot_ttls: Dict[str, int]` — per-slot metadata TTL, drain sentinels skipped at collection time. - `Snapshot.drain_tick_at: Optional[float]` — read from `canary:drain_tick_at`, sentinel-`None` on cold cluster. - `Snapshot.zombie_counts: Dict[str, int]` — per-agent zombie process count via container.exec_run; missing entry = exec failed for that container (recorded in `sources_unavailable`). Tests: 67 → 84 passing. Added `fake_docker` fixture so the synthetic container list is controllable; FakeRedis got a `ttl()` method with the standard -2/-1/positive sentinel semantics. Verification against local stack — for each invariant, provoke, run `POST /api/canary/run-cycle`, observe red, revert, observe green: - S-03: ZADD a slot + EXPIRE its metadata HASH to 30s while the floor is 360s → critical violation `kind: below_floor`. Also covered the `missing` kind by deleting the HASH entirely. Revert by EXPIRE 500. - B-02: inserted 1 queued row, set `canary:drain_tick_at` to 600s ago → critical violation, `free_slots: 2, drain_tick_age_seconds: 600`. Revert by writing a fresh timestamp. - R-01: spawned a real zombie inside agent-cornelius-m via Python fork + prctl PR_SET_NAME → critical violation `zombie_count: 1`. Reaped by killing the parent → green. All post-fix cycles returned `violations: 0, transitions: 0`. Final all-10-invariants cycle on clean platform: 106ms cycle duration, all green, `sources_unavailable: []`. Refs: #411, #653 (Phase 1), #884 (this PR — Phase 2 also)

…ositive Addresses three findings from the pre-landing /review pass: I1 — Alert quality for Phase 2 + 3 invariants. canary_alerts.py only had S-01/E-02/L-03 entries in `_INVARIANT_NAMES`, `_INVARIANT_RUNBOOKS`, `_render_message`, and `_render_forensic`. New ids fell through to the "S-02 fired N violation(s)" generic fallback with the id doubled in the header. Added 7 entries each: - Friendly name and one-line runbook hint per invariant - Per-id `_render_message` (e.g. "3 zombie claude process(es) across 1 agent(s): cornelius-m" for R-01) - Per-id `_render_forensic` rendering of the relevant observed_state fields, truncated to 5 violations with a "+N more" footer Verified by hand-building an R-01 ViolationReport and inspecting the Block Kit payload — header, body, forensic, runbook, and context all render the new shape. I2 — B-02 boot-window false-positive. Background canary loop is fine (30s startup vs 15s maintenance loop), but the on-demand `POST /api/canary/run-cycle` endpoint can hit in the first 15s when no heartbeat exists. With pre-existing queued rows and free slots, B-02 would fire with `drain_tick_age_seconds: null`. `CapacityManager.__init__` now seeds the heartbeat with a fresh timestamp on construction. The maintenance loop overwrites on every successful tick; init only needs a non-stale floor. Verified live: deleted the heartbeat key, bounced the backend, key is present immediately — no waiting for the maintenance tick. I3 — Stale docstring in `_collect_zombie_counts`. The docstring still described the catalog's ` Z.*claude` (leading-space) reasoning while the actual `cmd` line uses `^Z.*claude`. Updated to describe the anchor-at-line-start version and reference the live-zombie verification. Bonus tidy: moved `import time` from inside `run_maintenance` to the top-of-file imports. Tests: 84/84 still green. Full all-10-invariant cycle clean. Refs: /review pass on #884

`canary-fleet-long` was a duplicate of burst — same template, same task duration, same model — only the cron differed (*/5 vs */2). It added no coverage burst didn't already provide for S-01, E-02, S-03, R-01. Replace it with a `slow` agent backed by a new `sleep-echo` local template that sleeps 75s per task. This gives Phase 2 invariants something to inspect: - E-05 (dispatched rows have session): needs >60s running rows; was trivially-green with 4s test-echo tasks - S-03 below_floor: needs a slot to exist at canary snapshot time Also locks burst's live config into the yaml so a redeploy doesn't revert prior live SQL fixes: - cron: * * * * * -> */2 * * * * (cheapest cadence that phase-slides against the 5-min canary cycle) - description / comments updated to reflect actual coverage scope Manifest deploy can't express `model`, `max_parallel_tasks`, or `execution_timeout_seconds` (system_service.create_schedules drops them, SystemAgentConfig has no slot for capacity). Documented the four required post-deploy API calls in the yaml header so the next operator doesn't trip on it.

Pre-existing CI failure inherited from dev — `dev` has been failing this lint since #871 (commit 98574f3) merged on 2026-05-17. That PR added 6 sys.modules violations in tests/unit/test_slot_per_slot_ttl.py without regenerating the baseline; a separate cleanup retired 3 violations in tests/unit/test_cleanup_unreachable_orphan.py. Regenerated via `python tests/lint_sys_modules.py --regenerate-baseline` — the path the lint script itself directs you to when violations move below baseline AND new files exceed it. Net: 235 violations in 67 files (unchanged total). No code-quality regression in this PR's actual diff — none of the canary tests use bare sys.modules manipulation (they use monkeypatch.setitem throughout).

vybe

High-quality Phase 2+3 delivery. 7 invariants with extensive unit coverage (51→84), each live-verified against the local stack including a real-zombie fork+prctl for R-01. Defensive collector shape correct, heartbeat plumbing right (END-of-sweep write + init seed), scope discipline excellent (one non-canary file touched). CI fully green, architecture docs updated, Closes #882. Honest /review self-pass before posting.

…e-586-plan Resolves PR #875 reviewer request: rebase on dev to pick up PR #884's baseline regeneration (adds test_slot_per_slot_ttl.py: 6 via #881's _STUBBED_MODULE_NAMES refactor; removes test_cleanup_unreachable_orphan.py: 3). After merge, sys.modules lint is green: - test_slot_per_slot_ttl.py: now exempt via sanctioned _STUBBED_MODULE_NAMES + _restore_sys_modules pattern (PR #881 on dev) - test_cb_probe_execution_close.py: baseline 7 matches actual (this PR's stub eviction) - test_config_fail_fast.py: actual 0, baseline 1 (reduced — sanctioned) Conflict in docs/memory/feature-flows.md resolved by keeping both this PR's #35d4e78 row and dev's #887/#888 rows.

This was referenced May 18, 2026

Canary Phase 3: add invariants S-03, B-02, R-01 #885

Closed

Canary Phase 2 + 3: add invariants S-02, E-01, E-05, B-01, S-03, B-02, R-01 #882

Open

obasilakis changed the title ~~feat(#882): canary invariant harness Phase 2 (S-02, E-01, E-05, B-01)~~ feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01) May 18, 2026

obasilakis added 3 commits May 18, 2026 15:56

vybe approved these changes May 20, 2026

View reviewed changes

vybe merged commit 009c7fc into dev May 20, 2026
9 checks passed

vybe mentioned this pull request May 20, 2026

test(#586) + test-suite recovery + doc backfill + CSO 2026-05-17 #875

Merged

4 tasks

AndriiPasternak31 mentioned this pull request May 20, 2026

test(security): backfill encryption tests for Telegram/WhatsApp/Slack-002 (#664) #876

Merged

4 tasks

This was referenced May 22, 2026

bug: scheduled runs ignore per-agent execution_timeout_seconds (canary S-03 + E-01 surfaced) #913

Open

Canary harness Phase 2/3 (#882) + #913 timeout fix #920

Closed

fix(#913): scheduler honors per-agent execution_timeout_seconds #922

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01)#884

feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01)#884
vybe merged 5 commits into
devfrom
feature/882-canary-phase2

obasilakis commented May 18, 2026 •

edited

Loading

Uh oh!

vybe left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

obasilakis commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Phase 2 — single-source SQL/Redis (no new sources)

Phase 3 — moderate (one new piece of plumbing each)

Why each one is useful

Phase 2

Phase 3

Snapshot changes

Unit tests

Live verification

S-02 (Phase 2)

E-01 (Phase 2)

E-05 (Phase 2)

B-01 (Phase 2)

S-03 (Phase 3)

B-02 (Phase 3)

R-01 (Phase 3) — the most satisfying verification

Catalog deviation worth noting

Final all-10 cycle

Non-canary code touched

Out of scope — Phase 4

Doc updates

Uh oh!

vybe left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

obasilakis commented May 18, 2026 •

edited

Loading