Skip to content

feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01)#884

Merged
vybe merged 5 commits into
devfrom
feature/882-canary-phase2
May 20, 2026
Merged

feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01)#884
vybe merged 5 commits into
devfrom
feature/882-canary-phase2

Conversation

@obasilakis
Copy link
Copy Markdown
Contributor

@obasilakis obasilakis commented May 18, 2026

Closes #882. Builds on #653 (Phase 1).

Ships Phases 2 and 3 of the canary harness together — seven new invariants registered into the same INVARIANTS dict that the run-cycle endpoint and background loop already drive.

Phase 2 — single-source SQL/Redis (no new sources)

ID Tier Severity What it catches
S-02 A critical ZCARD(agent:slots:A) > max_parallel_tasks (capacity bypass)
E-01 B critical status='running' row past execution_timeout_seconds + SLOT_TTL_BUFFER
E-05 B major status='running' row older than 60s with claude_session_id IS NULL (#106 guard)
B-01 A critical db.get_queued_count disagrees with snapshot's id-list count

Phase 3 — moderate (one new piece of plumbing each)

ID Tier Severity What it catches New plumbing
S-03 A critical Slot metadata HASH TTL below execution_timeout_seconds + 300s (#226 class) Per-slot redis.ttl() lookup
B-02 B critical Queued > 0 with free slots and stale drain heartbeat (stalled drain) canary:drain_tick_at heartbeat written by CapacityManager.run_maintenance()
R-01 A critical Zombie claude processes inside agent containers (#407 regression) docker exec per running trinity.platform=agent container

Why each one is useful

Phase 2

Phase 3

  • S-03. Three failure kinds surfaced explicitly: missing (-2, metadata HASH expired before ZSET — the bug: Stale slot cleanup uses fixed 20-min TTL regardless of agent timeout #226 case), no_expiry (-1, redis.expire() never set), below_floor (positive TTL under the configured floor). All three are critical because a slot with a too-short or missing TTL becomes a permanent capacity leak.
  • B-02. The drain-tick heartbeat is written at the END of run_maintenance() (not the start), so a mid-sweep crash leaves the cursor stale and lets the canary catch the breakage rather than masking it.
  • R-01. First canary invariant that needs Docker. Failure modes graceful — per-container exec failure is recorded in sources_unavailable and the check skips that agent rather than firing a false positive. Cycle cost ≈ 1 exec per running agent per 5 min.

Snapshot changes

Field Phase Purpose
AgentSnapshot.running_claude_session_ids 2 E-05 per-execution session id
AgentSnapshot.queued_count_via_service 2 B-01 production-accessor cross-check
AgentSnapshot.slot_ttls 3 S-03 per-slot Redis TTL
Snapshot.drain_tick_at 3 B-02 heartbeat timestamp
Snapshot.zombie_counts 3 R-01 per-agent zombie count

All collectors fail gracefully — per-agent exec failures, missing accessors, and unreachable Redis are recorded in sources_unavailable and the affected invariant either skips that agent or skips the whole cycle.

Unit tests

51 → 84 passing. Each invariant has positive, negative, and edge-case tests.

$ python3 -m pytest tests/test_canary_invariants.py -q
84 passed in 0.75s

Test fixtures: FakeRedis got a ttl() method with the standard -2/-1/positive sentinel semantics; new fake_docker fixture exposes a controllable container list with exec_run stubs.

Live verification

Against the local stack (CANARY_ENABLED=0 but the POST /api/canary/run-cycle admin endpoint still works). For each invariant: provoke, observe red, revert, observe green. All seven reproduce as designed.

S-02 (Phase 2)

$ # Provoke: ZADD 3 fake slot ids when max_parallel=2
{
  "violations": [{
    "invariant_id": "S-02", "severity": "critical",
    "observed_state": {
      "slot_count": 3, "max_parallel_tasks": 2, "overbooked_by": 1,
      "slot_ids": ["fake-eid-1", "fake-eid-2", "fake-eid-3"]
    }
  }]
}

E-01 (Phase 2)

$ # Provoke: insert status='running' row started 2h ago, agent timeout=60s
{
  "violations": [{
    "invariant_id": "E-01", "severity": "critical", "tier": "B",
    "observed_state": {
      "age_seconds": 7436,
      "execution_timeout_seconds": 60, "slot_ttl_buffer_seconds": 300
    }
  }]
}

E-05 (Phase 2)

$ # Provoke: insert status='running' row 3min old, claude_session_id NULL
{
  "violations": [{
    "invariant_id": "E-05", "severity": "major", "tier": "B",
    "observed_state": {"age_seconds": 188, "grace_seconds": 60}
  }]
}

B-01 (Phase 2)

Patched db/schedules.py:get_queued_count to return count - 1 — exact bug class B-01 exists to catch:

$ # Provoke: 1 queued row + buggy accessor returning 0
{
  "violations": [{
    "invariant_id": "B-01", "severity": "critical", "tier": "A",
    "observed_state": {"service_count": 0, "snapshot_count": 1},
    "signal_query": "db.get_queued_count('canary-test-882') = 0 != |queued ids in snapshot| = 1"
  }]
}

S-03 (Phase 3)

$ # Provoke: ZADD a slot, EXPIRE its metadata to 30s, agent floor=360s
{
  "violations": [{
    "invariant_id": "S-03", "severity": "critical", "tier": "A",
    "observed_state": {
      "redis_ttl_seconds": 30, "floor_seconds": 360, "kind": "below_floor"
    },
    "signal_query": "TTL(agent:slot:canary-test-882:s03-eid) = 30 (below_floor); floor = 360s"
  }]
}

Also verified kind: missing (-2) by deleting the metadata HASH entirely.

B-02 (Phase 3)

$ # Provoke: 1 queued row, 0 slots (cap=2), set drain tick to 600s ago
{
  "violations": [{
    "invariant_id": "B-02", "severity": "critical", "tier": "B",
    "observed_state": {
      "queued_count": 1, "slot_count": 0, "free_slots": 2,
      "drain_tick_age_seconds": 600, "drain_tick_grace_seconds": 60
    },
    "signal_query": "agent canary-test-882: queued=1, slots=0/2 (free=2), drain_tick_age=600s > 60s"
  }]
}

R-01 (Phase 3) — the most satisfying verification

Spawned a real zombie process named claude inside the running agent-cornelius-m container via:

os.fork()
# Child: prctl(PR_SET_NAME, "claude"); sys.exit(0)
# Parent: sleep(120) without wait()
$ # ps -eo stat,comm | grep '^Z.*claude' → "Z    claude"
{
  "violations": [{
    "invariant_id": "R-01", "severity": "critical", "tier": "A",
    "observed_state": {"agent_name": "cornelius-m", "zombie_count": 1},
    "signal_query": "docker exec agent-cornelius-m sh -c \"ps -eo stat,comm | grep '^Z.*claude' | wc -l\" = 1"
  }]
}

Reaped by killing the parent PID → R-01 returned green on the next cycle.

Catalog deviation worth noting

The orchestration-invariant-catalog spec uses Z.*claude (leading space). procps-ng on the agent base image emits STAT left-aligned without padding, so that pattern misses. Switched to ^Z.*claude — same intent, works across formatters. Verified live against a real zombie above.

Final all-10 cycle

{
  "checks_run": ["S-01", "S-02", "S-03", "E-01", "E-02", "E-05",
                 "L-03", "B-01", "B-02", "R-01"],
  "sources_unavailable": [],
  "violations": [],
  "transitions": [],
  "cycle_duration_ms": 106
}

Non-canary code touched

Just one place: services/capacity_manager.py:run_maintenance writes the canary:drain_tick_at heartbeat at the END of each successful sweep (~5 lines including the comment and try/except). Everything else stays inside canary/.

Out of scope — Phase 4

  • E-06 — HTTP fan-out across agents to /api/executions/running. Needs network-flake handling and timeout budgets.
  • G-01 — One-shot post-restart probe. Doesn't fit the 5-min loop — needs a new "one-shot canary" execution mode.

Doc updates

docs/memory/architecture.md:

  • CANARY-001 section: added Phase 2 + Phase 3 invariant blocks.
  • Background Services table: documented the canary:drain_tick_at heartbeat write in the Capacity Maintenance row.

Adds four single-source SQL/Redis invariants to the canary harness
(#411). All four follow the Phase 1 (#653) pattern — no new source
types, no new infrastructure, registered into the same `INVARIANTS`
dict the run-cycle endpoint and background loop already drive.

- S-02 — No overbooking. `ZCARD(agent:slots:A)` (drain sentinels
  filtered) > `max_parallel_tasks`. Critical. Tier A. Catches
  `acquire_slot` bypass — distinct from S-01 because the violation can
  be self-consistent (Redis and SQL agree on N+1 vs cap of N).
- E-01 — Terminal-state closure. No `status='running'` row older than
  `execution_timeout_seconds + 300s` (matches `SLOT_TTL_BUFFER` so the
  check fires *after* cleanup has had its window). Critical. Tier B.
- E-05 — Dispatched rows have session. No running row older than 60s
  with `claude_session_id IS NULL`. Major. Tier B. Guards #106.
- B-01 — Queue-status coherence. `db.get_queued_count` (the accessor
  BacklogService calls) agrees with the snapshot's independently-
  collected `len(queued_exec_ids)`. Critical. Tier A. Trivially-green
  today after the #428 consolidation; regression guard against a
  future cache layer or status-filter drift on the production accessor.

Snapshot extended with per-execution `claude_session_id` (E-05) and
per-agent `queued_count_via_service` (B-01). The session-id collector
PRAGMA-introspects the column so the minimal unit-test DDLs don't
have to mirror every production column. The service-count collector
lazy-imports `database.db`, returning `None` on import failure so unit
tests (which stub `db.connection` but not the full facade) skip B-01
silently rather than firing a false positive.

Unit tests: 67 passing (was 51). Each new invariant has positive,
negative, and edge-case tests.

Verification against local stack — for each invariant, provoke, run
`POST /api/canary/run-cycle`, observe red, revert, observe green.
All four reproduce as designed:

- S-02 — ZADD'd 3 fake slot ids when max_parallel=2 → critical
  violation with `overbooked_by: 1`.
- E-01 — inserted `status='running'` row with `started_at` 2h ago
  against a 60s-timeout agent → critical violation,
  `age_seconds: 7436 > timeout+buffer=360s`.
- E-05 — inserted `status='running'` row 3 min old with
  `claude_session_id` NULL → major violation, age=188s.
- B-01 — temporarily patched `db.get_queued_count` to return
  `count - 1` → critical violation,
  "db.get_queued_count = 0 != |queued ids in snapshot| = 1".

Each post-fix cycle returned `violations: 0, transitions: 0`.

Refs: #411, #653, docs/testing/orchestration-invariant-catalog.md
Adds three moderate-complexity invariants on top of Phase 2. Each
brings exactly one new piece of plumbing — first time the canary
takes a hard dep on a non-trivial source beyond SQLite + Redis basic
ops:

- S-03 — Slot TTL ≥ execution timeout. For every member of
  `agent:slots:A`, the companion `agent:slot:A:{eid}` HASH must have
  `TTL ≥ execution_timeout_seconds + 300s` (SLOT_TTL_BUFFER). Three
  failure kinds surfaced explicitly: `missing` (-2; the #226 class),
  `no_expiry` (-1), `below_floor` (positive TTL under floor). Critical.
  Tier A. Per-slot `redis.ttl()` lookup, bounded by ZCARD per agent
  (≤ max_parallel_tasks).
- B-02 — No queued without slots-full. If any agent has queued > 0,
  then either `slot_count == max_parallel` (legit backpressure) OR a
  drain tick fired in the last 60s (drain will pick it up). Critical.
  Tier B. Requires `CapacityManager.run_maintenance()` to write a
  unix-timestamp heartbeat to `canary:drain_tick_at` at the END of
  each successful sweep — mid-sweep crash leaves cursor stale and
  lets B-02 catch the breakage. One-line write in capacity_manager.py,
  rest is canary-local.
- R-01 — No zombie Claude processes. For every running
  `trinity.platform=agent` container,
  `ps -eo stat,comm | grep '^Z.*claude' | wc -l` must be 0. Critical.
  Tier A. Guards PR #407. New source type — docker exec via the
  existing docker_service.docker_client. Per-container failures recorded
  in `sources_unavailable` so a single unhealthy container doesn't
  kill the cycle. Regex anchored at `^Z` rather than the catalog's
  ` Z` (leading-space) — procps-ng on the agent base image emits
  STAT left-aligned without padding; verified live by spawning an
  actual zombie via `os.fork()`+`prctl(PR_SET_NAME, "claude")`.

Snapshot extended:
- `AgentSnapshot.slot_ttls: Dict[str, int]` — per-slot metadata TTL,
  drain sentinels skipped at collection time.
- `Snapshot.drain_tick_at: Optional[float]` — read from
  `canary:drain_tick_at`, sentinel-`None` on cold cluster.
- `Snapshot.zombie_counts: Dict[str, int]` — per-agent zombie process
  count via container.exec_run; missing entry = exec failed for that
  container (recorded in `sources_unavailable`).

Tests: 67 → 84 passing. Added `fake_docker` fixture so the synthetic
container list is controllable; FakeRedis got a `ttl()` method with
the standard -2/-1/positive sentinel semantics.

Verification against local stack — for each invariant, provoke, run
`POST /api/canary/run-cycle`, observe red, revert, observe green:

- S-03: ZADD a slot + EXPIRE its metadata HASH to 30s while the floor
  is 360s → critical violation `kind: below_floor`. Also covered the
  `missing` kind by deleting the HASH entirely. Revert by EXPIRE 500.
- B-02: inserted 1 queued row, set `canary:drain_tick_at` to 600s
  ago → critical violation, `free_slots: 2, drain_tick_age_seconds:
  600`. Revert by writing a fresh timestamp.
- R-01: spawned a real zombie inside agent-cornelius-m via Python
  fork + prctl PR_SET_NAME → critical violation
  `zombie_count: 1`. Reaped by killing the parent → green.

All post-fix cycles returned `violations: 0, transitions: 0`. Final
all-10-invariants cycle on clean platform: 106ms cycle duration,
all green, `sources_unavailable: []`.

Refs: #411, #653 (Phase 1), #884 (this PR — Phase 2 also)
@obasilakis obasilakis changed the title feat(#882): canary invariant harness Phase 2 (S-02, E-01, E-05, B-01) feat(#882): canary harness Phase 2 + 3 (S-02, E-01, E-05, B-01, S-03, B-02, R-01) May 18, 2026
…ositive

Addresses three findings from the pre-landing /review pass:

I1 — Alert quality for Phase 2 + 3 invariants. canary_alerts.py only
had S-01/E-02/L-03 entries in `_INVARIANT_NAMES`, `_INVARIANT_RUNBOOKS`,
`_render_message`, and `_render_forensic`. New ids fell through to the
"S-02 fired N violation(s)" generic fallback with the id doubled in
the header. Added 7 entries each:
  - Friendly name and one-line runbook hint per invariant
  - Per-id `_render_message` (e.g. "3 zombie claude process(es) across
    1 agent(s): cornelius-m" for R-01)
  - Per-id `_render_forensic` rendering of the relevant observed_state
    fields, truncated to 5 violations with a "+N more" footer

Verified by hand-building an R-01 ViolationReport and inspecting the
Block Kit payload — header, body, forensic, runbook, and context all
render the new shape.

I2 — B-02 boot-window false-positive. Background canary loop is fine
(30s startup vs 15s maintenance loop), but the on-demand
`POST /api/canary/run-cycle` endpoint can hit in the first 15s when
no heartbeat exists. With pre-existing queued rows and free slots,
B-02 would fire with `drain_tick_age_seconds: null`.

`CapacityManager.__init__` now seeds the heartbeat with a fresh
timestamp on construction. The maintenance loop overwrites on every
successful tick; init only needs a non-stale floor. Verified live:
deleted the heartbeat key, bounced the backend, key is present
immediately — no waiting for the maintenance tick.

I3 — Stale docstring in `_collect_zombie_counts`. The docstring still
described the catalog's ` Z.*claude` (leading-space) reasoning while
the actual `cmd` line uses `^Z.*claude`. Updated to describe the
anchor-at-line-start version and reference the live-zombie
verification.

Bonus tidy: moved `import time` from inside `run_maintenance` to the
top-of-file imports.

Tests: 84/84 still green. Full all-10-invariant cycle clean.

Refs: /review pass on #884
`canary-fleet-long` was a duplicate of burst — same template, same task
duration, same model — only the cron differed (*/5 vs */2). It added no
coverage burst didn't already provide for S-01, E-02, S-03, R-01.

Replace it with a `slow` agent backed by a new `sleep-echo` local
template that sleeps 75s per task. This gives Phase 2 invariants
something to inspect:

- E-05 (dispatched rows have session): needs >60s running rows; was
  trivially-green with 4s test-echo tasks
- S-03 below_floor: needs a slot to exist at canary snapshot time

Also locks burst's live config into the yaml so a redeploy doesn't
revert prior live SQL fixes:
- cron: * * * * * -> */2 * * * *  (cheapest cadence that phase-slides
  against the 5-min canary cycle)
- description / comments updated to reflect actual coverage scope

Manifest deploy can't express `model`, `max_parallel_tasks`, or
`execution_timeout_seconds` (system_service.create_schedules drops them,
SystemAgentConfig has no slot for capacity). Documented the four
required post-deploy API calls in the yaml header so the next operator
doesn't trip on it.
Pre-existing CI failure inherited from dev — `dev` has been failing
this lint since #871 (commit 98574f3) merged on 2026-05-17. That PR
added 6 sys.modules violations in tests/unit/test_slot_per_slot_ttl.py
without regenerating the baseline; a separate cleanup retired 3
violations in tests/unit/test_cleanup_unreachable_orphan.py.

Regenerated via `python tests/lint_sys_modules.py --regenerate-baseline`
— the path the lint script itself directs you to when violations
move below baseline AND new files exceed it. Net: 235 violations in
67 files (unchanged total).

No code-quality regression in this PR's actual diff — none of the
canary tests use bare sys.modules manipulation (they use
monkeypatch.setitem throughout).
Copy link
Copy Markdown
Contributor

@vybe vybe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-quality Phase 2+3 delivery. 7 invariants with extensive unit coverage (51→84), each live-verified against the local stack including a real-zombie fork+prctl for R-01. Defensive collector shape correct, heartbeat plumbing right (END-of-sweep write + init seed), scope discipline excellent (one non-canary file touched). CI fully green, architecture docs updated, Closes #882. Honest /review self-pass before posting.

@vybe vybe merged commit 009c7fc into dev May 20, 2026
9 checks passed
AndriiPasternak31 added a commit that referenced this pull request May 20, 2026
…e-586-plan

Resolves PR #875 reviewer request: rebase on dev to pick up PR #884's
baseline regeneration (adds test_slot_per_slot_ttl.py: 6 via #881's
_STUBBED_MODULE_NAMES refactor; removes test_cleanup_unreachable_orphan.py: 3).

After merge, sys.modules lint is green:
- test_slot_per_slot_ttl.py: now exempt via sanctioned _STUBBED_MODULE_NAMES + _restore_sys_modules pattern (PR #881 on dev)
- test_cb_probe_execution_close.py: baseline 7 matches actual (this PR's stub eviction)
- test_config_fail_fast.py: actual 0, baseline 1 (reduced — sanctioned)

Conflict in docs/memory/feature-flows.md resolved by keeping both
this PR's #35d4e78 row and dev's #887/#888 rows.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants