feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01)#884
Merged
Conversation
Adds four single-source SQL/Redis invariants to the canary harness (#411). All four follow the Phase 1 (#653) pattern — no new source types, no new infrastructure, registered into the same `INVARIANTS` dict the run-cycle endpoint and background loop already drive. - S-02 — No overbooking. `ZCARD(agent:slots:A)` (drain sentinels filtered) > `max_parallel_tasks`. Critical. Tier A. Catches `acquire_slot` bypass — distinct from S-01 because the violation can be self-consistent (Redis and SQL agree on N+1 vs cap of N). - E-01 — Terminal-state closure. No `status='running'` row older than `execution_timeout_seconds + 300s` (matches `SLOT_TTL_BUFFER` so the check fires *after* cleanup has had its window). Critical. Tier B. - E-05 — Dispatched rows have session. No running row older than 60s with `claude_session_id IS NULL`. Major. Tier B. Guards #106. - B-01 — Queue-status coherence. `db.get_queued_count` (the accessor BacklogService calls) agrees with the snapshot's independently- collected `len(queued_exec_ids)`. Critical. Tier A. Trivially-green today after the #428 consolidation; regression guard against a future cache layer or status-filter drift on the production accessor. Snapshot extended with per-execution `claude_session_id` (E-05) and per-agent `queued_count_via_service` (B-01). The session-id collector PRAGMA-introspects the column so the minimal unit-test DDLs don't have to mirror every production column. The service-count collector lazy-imports `database.db`, returning `None` on import failure so unit tests (which stub `db.connection` but not the full facade) skip B-01 silently rather than firing a false positive. Unit tests: 67 passing (was 51). Each new invariant has positive, negative, and edge-case tests. Verification against local stack — for each invariant, provoke, run `POST /api/canary/run-cycle`, observe red, revert, observe green. All four reproduce as designed: - S-02 — ZADD'd 3 fake slot ids when max_parallel=2 → critical violation with `overbooked_by: 1`. - E-01 — inserted `status='running'` row with `started_at` 2h ago against a 60s-timeout agent → critical violation, `age_seconds: 7436 > timeout+buffer=360s`. - E-05 — inserted `status='running'` row 3 min old with `claude_session_id` NULL → major violation, age=188s. - B-01 — temporarily patched `db.get_queued_count` to return `count - 1` → critical violation, "db.get_queued_count = 0 != |queued ids in snapshot| = 1". Each post-fix cycle returned `violations: 0, transitions: 0`. Refs: #411, #653, docs/testing/orchestration-invariant-catalog.md
This was referenced May 18, 2026
Adds three moderate-complexity invariants on top of Phase 2. Each
brings exactly one new piece of plumbing — first time the canary
takes a hard dep on a non-trivial source beyond SQLite + Redis basic
ops:
- S-03 — Slot TTL ≥ execution timeout. For every member of
`agent:slots:A`, the companion `agent:slot:A:{eid}` HASH must have
`TTL ≥ execution_timeout_seconds + 300s` (SLOT_TTL_BUFFER). Three
failure kinds surfaced explicitly: `missing` (-2; the #226 class),
`no_expiry` (-1), `below_floor` (positive TTL under floor). Critical.
Tier A. Per-slot `redis.ttl()` lookup, bounded by ZCARD per agent
(≤ max_parallel_tasks).
- B-02 — No queued without slots-full. If any agent has queued > 0,
then either `slot_count == max_parallel` (legit backpressure) OR a
drain tick fired in the last 60s (drain will pick it up). Critical.
Tier B. Requires `CapacityManager.run_maintenance()` to write a
unix-timestamp heartbeat to `canary:drain_tick_at` at the END of
each successful sweep — mid-sweep crash leaves cursor stale and
lets B-02 catch the breakage. One-line write in capacity_manager.py,
rest is canary-local.
- R-01 — No zombie Claude processes. For every running
`trinity.platform=agent` container,
`ps -eo stat,comm | grep '^Z.*claude' | wc -l` must be 0. Critical.
Tier A. Guards PR #407. New source type — docker exec via the
existing docker_service.docker_client. Per-container failures recorded
in `sources_unavailable` so a single unhealthy container doesn't
kill the cycle. Regex anchored at `^Z` rather than the catalog's
` Z` (leading-space) — procps-ng on the agent base image emits
STAT left-aligned without padding; verified live by spawning an
actual zombie via `os.fork()`+`prctl(PR_SET_NAME, "claude")`.
Snapshot extended:
- `AgentSnapshot.slot_ttls: Dict[str, int]` — per-slot metadata TTL,
drain sentinels skipped at collection time.
- `Snapshot.drain_tick_at: Optional[float]` — read from
`canary:drain_tick_at`, sentinel-`None` on cold cluster.
- `Snapshot.zombie_counts: Dict[str, int]` — per-agent zombie process
count via container.exec_run; missing entry = exec failed for that
container (recorded in `sources_unavailable`).
Tests: 67 → 84 passing. Added `fake_docker` fixture so the synthetic
container list is controllable; FakeRedis got a `ttl()` method with
the standard -2/-1/positive sentinel semantics.
Verification against local stack — for each invariant, provoke, run
`POST /api/canary/run-cycle`, observe red, revert, observe green:
- S-03: ZADD a slot + EXPIRE its metadata HASH to 30s while the floor
is 360s → critical violation `kind: below_floor`. Also covered the
`missing` kind by deleting the HASH entirely. Revert by EXPIRE 500.
- B-02: inserted 1 queued row, set `canary:drain_tick_at` to 600s
ago → critical violation, `free_slots: 2, drain_tick_age_seconds:
600`. Revert by writing a fresh timestamp.
- R-01: spawned a real zombie inside agent-cornelius-m via Python
fork + prctl PR_SET_NAME → critical violation
`zombie_count: 1`. Reaped by killing the parent → green.
All post-fix cycles returned `violations: 0, transitions: 0`. Final
all-10-invariants cycle on clean platform: 106ms cycle duration,
all green, `sources_unavailable: []`.
Refs: #411, #653 (Phase 1), #884 (this PR — Phase 2 also)
…ositive
Addresses three findings from the pre-landing /review pass:
I1 — Alert quality for Phase 2 + 3 invariants. canary_alerts.py only
had S-01/E-02/L-03 entries in `_INVARIANT_NAMES`, `_INVARIANT_RUNBOOKS`,
`_render_message`, and `_render_forensic`. New ids fell through to the
"S-02 fired N violation(s)" generic fallback with the id doubled in
the header. Added 7 entries each:
- Friendly name and one-line runbook hint per invariant
- Per-id `_render_message` (e.g. "3 zombie claude process(es) across
1 agent(s): cornelius-m" for R-01)
- Per-id `_render_forensic` rendering of the relevant observed_state
fields, truncated to 5 violations with a "+N more" footer
Verified by hand-building an R-01 ViolationReport and inspecting the
Block Kit payload — header, body, forensic, runbook, and context all
render the new shape.
I2 — B-02 boot-window false-positive. Background canary loop is fine
(30s startup vs 15s maintenance loop), but the on-demand
`POST /api/canary/run-cycle` endpoint can hit in the first 15s when
no heartbeat exists. With pre-existing queued rows and free slots,
B-02 would fire with `drain_tick_age_seconds: null`.
`CapacityManager.__init__` now seeds the heartbeat with a fresh
timestamp on construction. The maintenance loop overwrites on every
successful tick; init only needs a non-stale floor. Verified live:
deleted the heartbeat key, bounced the backend, key is present
immediately — no waiting for the maintenance tick.
I3 — Stale docstring in `_collect_zombie_counts`. The docstring still
described the catalog's ` Z.*claude` (leading-space) reasoning while
the actual `cmd` line uses `^Z.*claude`. Updated to describe the
anchor-at-line-start version and reference the live-zombie
verification.
Bonus tidy: moved `import time` from inside `run_maintenance` to the
top-of-file imports.
Tests: 84/84 still green. Full all-10-invariant cycle clean.
Refs: /review pass on #884
`canary-fleet-long` was a duplicate of burst — same template, same task duration, same model — only the cron differed (*/5 vs */2). It added no coverage burst didn't already provide for S-01, E-02, S-03, R-01. Replace it with a `slow` agent backed by a new `sleep-echo` local template that sleeps 75s per task. This gives Phase 2 invariants something to inspect: - E-05 (dispatched rows have session): needs >60s running rows; was trivially-green with 4s test-echo tasks - S-03 below_floor: needs a slot to exist at canary snapshot time Also locks burst's live config into the yaml so a redeploy doesn't revert prior live SQL fixes: - cron: * * * * * -> */2 * * * * (cheapest cadence that phase-slides against the 5-min canary cycle) - description / comments updated to reflect actual coverage scope Manifest deploy can't express `model`, `max_parallel_tasks`, or `execution_timeout_seconds` (system_service.create_schedules drops them, SystemAgentConfig has no slot for capacity). Documented the four required post-deploy API calls in the yaml header so the next operator doesn't trip on it.
Pre-existing CI failure inherited from dev — `dev` has been failing this lint since #871 (commit 98574f3) merged on 2026-05-17. That PR added 6 sys.modules violations in tests/unit/test_slot_per_slot_ttl.py without regenerating the baseline; a separate cleanup retired 3 violations in tests/unit/test_cleanup_unreachable_orphan.py. Regenerated via `python tests/lint_sys_modules.py --regenerate-baseline` — the path the lint script itself directs you to when violations move below baseline AND new files exceed it. Net: 235 violations in 67 files (unchanged total). No code-quality regression in this PR's actual diff — none of the canary tests use bare sys.modules manipulation (they use monkeypatch.setitem throughout).
vybe
approved these changes
May 20, 2026
Contributor
vybe
left a comment
There was a problem hiding this comment.
High-quality Phase 2+3 delivery. 7 invariants with extensive unit coverage (51→84), each live-verified against the local stack including a real-zombie fork+prctl for R-01. Defensive collector shape correct, heartbeat plumbing right (END-of-sweep write + init seed), scope discipline excellent (one non-canary file touched). CI fully green, architecture docs updated, Closes #882. Honest /review self-pass before posting.
4 tasks
4 tasks
AndriiPasternak31
added a commit
that referenced
this pull request
May 20, 2026
…e-586-plan Resolves PR #875 reviewer request: rebase on dev to pick up PR #884's baseline regeneration (adds test_slot_per_slot_ttl.py: 6 via #881's _STUBBED_MODULE_NAMES refactor; removes test_cleanup_unreachable_orphan.py: 3). After merge, sys.modules lint is green: - test_slot_per_slot_ttl.py: now exempt via sanctioned _STUBBED_MODULE_NAMES + _restore_sys_modules pattern (PR #881 on dev) - test_cb_probe_execution_close.py: baseline 7 matches actual (this PR's stub eviction) - test_config_fail_fast.py: actual 0, baseline 1 (reduced — sanctioned) Conflict in docs/memory/feature-flows.md resolved by keeping both this PR's #35d4e78 row and dev's #887/#888 rows.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #882. Builds on #653 (Phase 1).
Ships Phases 2 and 3 of the canary harness together — seven new invariants registered into the same
INVARIANTSdict that the run-cycle endpoint and background loop already drive.Phase 2 — single-source SQL/Redis (no new sources)
ZCARD(agent:slots:A)>max_parallel_tasks(capacity bypass)status='running'row pastexecution_timeout_seconds + SLOT_TTL_BUFFERstatus='running'row older than 60s withclaude_session_id IS NULL(#106 guard)db.get_queued_countdisagrees with snapshot's id-list countPhase 3 — moderate (one new piece of plumbing each)
execution_timeout_seconds + 300s(#226 class)redis.ttl()lookupcanary:drain_tick_atheartbeat written byCapacityManager.run_maintenance()claudeprocesses inside agent containers (#407 regression)docker execper runningtrinity.platform=agentcontainerWhy each one is useful
Phase 2
SLOT_TTL_BUFFERso the check fires after cleanup has had its window. A violation unambiguously means cleanup failed to act.status='queued'rows after refactor: consolidate ExecutionQueue + SlotService + BacklogService into CapacityManager #428), but lives as a regression guard. The point is to fail loudly the day someone cachesdb.get_queued_countand gets the invalidation wrong.Phase 3
missing(-2, metadata HASH expired before ZSET — the bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout #226 case),no_expiry(-1,redis.expire()never set),below_floor(positive TTL under the configured floor). All three are critical because a slot with a too-short or missing TTL becomes a permanent capacity leak.run_maintenance()(not the start), so a mid-sweep crash leaves the cursor stale and lets the canary catch the breakage rather than masking it.sources_unavailableand the check skips that agent rather than firing a false positive. Cycle cost ≈ 1 exec per running agent per 5 min.Snapshot changes
AgentSnapshot.running_claude_session_idsAgentSnapshot.queued_count_via_serviceAgentSnapshot.slot_ttlsSnapshot.drain_tick_atSnapshot.zombie_countsAll collectors fail gracefully — per-agent exec failures, missing accessors, and unreachable Redis are recorded in
sources_unavailableand the affected invariant either skips that agent or skips the whole cycle.Unit tests
51 → 84 passing. Each invariant has positive, negative, and edge-case tests.
Test fixtures:
FakeRedisgot attl()method with the standard-2/-1/positive sentinel semantics; newfake_dockerfixture exposes a controllable container list withexec_runstubs.Live verification
Against the local stack (
CANARY_ENABLED=0but thePOST /api/canary/run-cycleadmin endpoint still works). For each invariant: provoke, observe red, revert, observe green. All seven reproduce as designed.S-02 (Phase 2)
E-01 (Phase 2)
E-05 (Phase 2)
B-01 (Phase 2)
Patched
db/schedules.py:get_queued_countto returncount - 1— exact bug class B-01 exists to catch:S-03 (Phase 3)
Also verified
kind: missing(-2) by deleting the metadata HASH entirely.B-02 (Phase 3)
R-01 (Phase 3) — the most satisfying verification
Spawned a real zombie process named
claudeinside the runningagent-cornelius-mcontainer via:Reaped by killing the parent PID → R-01 returned green on the next cycle.
Catalog deviation worth noting
The orchestration-invariant-catalog spec uses
Z.*claude(leading space). procps-ng on the agent base image emits STAT left-aligned without padding, so that pattern misses. Switched to^Z.*claude— same intent, works across formatters. Verified live against a real zombie above.Final all-10 cycle
{ "checks_run": ["S-01", "S-02", "S-03", "E-01", "E-02", "E-05", "L-03", "B-01", "B-02", "R-01"], "sources_unavailable": [], "violations": [], "transitions": [], "cycle_duration_ms": 106 }Non-canary code touched
Just one place:
services/capacity_manager.py:run_maintenancewrites thecanary:drain_tick_atheartbeat at the END of each successful sweep (~5 lines including the comment and try/except). Everything else stays insidecanary/.Out of scope — Phase 4
/api/executions/running. Needs network-flake handling and timeout budgets.Doc updates
docs/memory/architecture.md:canary:drain_tick_atheartbeat write in the Capacity Maintenance row.