* docs(architecture): add target architecture document and wire into dev workflow
Introduces docs/planning/TARGET_ARCHITECTURE.md — the optimal steady-state
design Trinity should converge toward (PostgreSQL, actor model coordination,
async-first agent communication, fleet observability, GuardAgent security).
Updates CLAUDE.md, DEVELOPMENT_WORKFLOW.md, and the groom/roadmap/sprint
playbooks to distinguish current architecture (what is built today) from
target architecture (where decisions should point), and to use target
architecture alignment as a ranking signal during backlog grooming and
issue prioritization.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(security): encrypt SLACK-001 bot tokens at rest (#453) (#667)
* fix(security): Encrypt Slack bot tokens at rest (#453)
The SLACK-001 public-link Slack integration (`db/slack.py`) was the last
holdout still storing bot tokens as plaintext in SQLite, violating
Architectural Invariant #12. Telegram (`telegram_bindings.bot_token_encrypted`),
WhatsApp (`whatsapp_bindings.auth_token_encrypted`), and SLACK-002
(`slack_workspaces.bot_token`) all already encrypt via `services.credential_encryption`
(AES-256-GCM, JSON envelope). This brings SLACK-001 in line with that pattern.
Additionally, `slack_workspaces.bot_token` was rolled out with lazy
encryption (encrypt-on-write + plaintext fallback at `slack_channels.py:47-49`)
which left two sources of plaintext on disk:
- Rows written before the encryption rollout
- Rows copied from `slack_link_connections` by `_migrate_slack_channel_agents`
A one-shot migration walks both tables on startup and re-encrypts any
plaintext `xoxb-*` rows. Idempotent at row level (skip JSON envelopes)
and at migration level (schema_migrations runner).
src/backend/db/slack.py
- Add `_get_encryption_service` / `_encrypt_token` / `_decrypt_token`
(exact copy of the pattern in `db/slack_channels.py`,
`db/telegram_channels.py`, `db/whatsapp_channels.py`)
- Encrypt at `create_slack_connection` (write site)
- Decrypt at `_row_to_connection` (read site, with `xoxb-*` plaintext
fallback so runtime works pre-migration on legacy rows)
- Caller-facing API surface unchanged — `slack_bot_token` field still
contains plaintext in the returned dict
src/backend/db/migrations.py
- New `_migrate_slack_bot_token_encryption` registered in MIGRATIONS list
- Walks BOTH `slack_link_connections.slack_bot_token` AND
`slack_workspaces.bot_token`, encrypts plaintext rows in place
- Hard-fail on missing CREDENTIAL_ENCRYPTION_KEY (matches the implicit
pattern of every other consumer of CredentialEncryptionService)
- Skips already-encrypted rows (signature: starts with `{` not `xoxb-`)
- Defensive: skips silently if a table doesn't exist
docs/memory/architecture.md
- Reword Invariant #12 to acknowledge channel/subscription tokens as a
documented exception (persisted but mandatorily encrypted), with the
full list of tables under that rule
- Add `slack_link_connections` DDL block; update `slack_workspaces`
block to clarify the column-name vs content-type distinction
tests/unit/test_slack_token_encryption.py (NEW, 14 tests)
- TestRoundTrip: write encrypts, read decrypts, raw DB value is JSON envelope
- TestPlaintextFallback: legacy `xoxb-*` row returns token + warning logged;
corrupt envelope returns None + error logged
- TestEncryptionHelpers: encrypt+decrypt isolation; encrypt raises ValueError
on missing key; decrypt returns None on missing key
- TestMigration: encrypts plaintext in both tables, skips encrypted, idempotent
on second run, hard-fails without key, no-op on missing/empty tables
Live verification on running backend:
- Migration ran on startup: 1 row in each table re-encrypted
(the real `ability.ai` workspace data)
- On-disk now `{"version": 1, "algorithm": "AES-256-GCM", ...}` envelopes
- `SlackOperations.get_slack_connection` returns the original `xoxb-...`
plaintext via decrypt — caller-facing API surface unchanged
Out of scope (filed separately):
- Encryption tests for slack_channels.py + telegram_channels.py +
whatsapp_channels.py (all shipped without dedicated test coverage):
tracked in #664
- Renaming `bot_token` → `bot_token_encrypted` columns: cosmetic, would
require a real schema migration; current naming works behind the
service-layer encapsulation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(req): annotate SLACK-001 bot token as encrypted at rest (#453)
SLACK-002 (line 809) noted "bot_token encrypted"; SLACK-001 (line 781)
didn't, even after #453 brought slack_link_connections.slack_bot_token
under the same AES-256-GCM regime. Mirror the annotation for parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Pavlo <pash@pashs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(announcements): add Cornelius voice mode video announcement
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(validation): add agent validation spec for issue #668
56 checks across 10 categories covering static file checks, YAML/JSON
schema validation, security scanning, and AI-evaluable logical checks
(skill coherence, CLAUDE.md quality, cross-file consistency). Serves as
the canonical check list for the compatibility validation API and MCP tool.
Closes-adjacent: #668
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(monitoring): degrade fleet-status to 'unknown' instead of 500 on NULL status row (#669) (#676)
`GET /api/monitoring/status` (the endpoint MCP `get_fleet_health` calls) was
returning 500 whenever any `agent_health_checks` row had `status = NULL`.
Root cause: the build loop did
AgentHealthSummary(status=check.get("status", "unknown"), ...)
`dict.get(key, default)` only returns the default when the key is *missing*.
A row with the key present but value `None` returned `None`, which Pydantic
v2 rejected because `AgentHealthSummary.status: str` is required.
The sibling per-agent endpoint dodged this by triggering a fresh health check
when no aggregate row existed, which is why `get_agent_health` worked while
`get_fleet_health` 500'd in the production fleet that triggered #669.
Fix:
- Extract `_build_agent_summary(name, check)` and `_coerce_status(raw)` so
NULL/missing/non-str status degrades to `"unknown"` consistently. Same
for NULL `error_message` (would explode `.split("; ")`).
- Wrap the aggregator in try/except returning a structured "unknown" payload
rather than 500 (issue ask #1). Future schema drift surfaces as data, not
as an outage.
- Reconcile `docs/user-docs/operations/monitoring.md` — the listed
`/api/monitoring/fleet-health` path doesn't exist; correct to
`/api/monitoring/status`.
Tests: 7 new unit tests in `tests/unit/test_fleet_status_resilience.py`
covering NULL status, missing status key, missing check row, NULL
error_message, non-string status, and sort-key tolerance.
Does NOT fix the underlying scheduler stoppage (8 of 9 agents going months
without a health-check refresh) — that root cause needs production logs and
is split out as a separate ticket.
Closes #669 (symptom)
Refs #675 (scheduler stoppage follow-up)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(agent-runtime): surface stdout parse failures + orphan identity for #640 debugging (#662)
After diagnosing #640 against a running agent with an npx-launched stdio MCP
(@upstash/context7-mcp@latest), the issue body's "MCP child inherits fd > 2"
theory does not hold: Node.js spawn already isolates the MCP child to fd 0/1/2
on socketpairs. The remaining wire-corruption mechanism (some descendant
outside claude's pgid acquiring agent-server's pipe write-end via setsid+dup)
needs a live production trace to root-cause — single-session repro budget here
isn't enough to manifest the failure (issue says ≥100 turns / 18min).
What this commit does ship: make the next production failure usable by
surfacing the diagnostic data the existing #618/#639 mitigations already had
at hand but discarded.
read_stdout (chat + headless paths)
- JSONDecodeError no longer silently `pass`-swallowed.
- Track count + capture first sanitised, length-capped (300 char) sample.
- Surface in completion log line as parse_failures=N + WARNING with sample
text when N > 0. Lets operators distinguish wire corruption (#640) from
reader-leak-past-claude-exit (#520/#618) in production logs.
_classify_empty_result
- New parse_failure_count / parse_failure_sample kwargs (defaults preserve
legacy callers — chat path doesn't wire it; backward compat covered by
test_default_parse_failure_args_preserve_legacy_callers).
- Detail string now includes parse_failures + raw_messages type histogram
(top 6 types) + first malformed line. The histogram tells operators
whether the reader caught most of the stream or stalled near the start.
_kill_orphan_pipe_writers (#618)
- Was logging orphan count only.
- Now captures cmdline / ppid / pgid per orphan BEFORE SIGKILL (after
the kill /proc/{pid} is gone) and emits one INFO line per pid, capped
at 10 lines + count-only summary, so log volume stays bounded under a
runaway MCP fan-out.
- First pass at identifying which package consistently leaks. Issue
body's "npm setsid" hypothesis is testable now without re-instrumenting.
5 new tests (33 total, all green):
- test_parse_failure_count_surfaces_in_detail
- test_parse_failure_sample_appended_when_present
- test_parse_failure_sample_omitted_when_count_is_zero
- test_raw_messages_type_summary_in_detail
- test_default_parse_failure_args_preserve_legacy_callers
Does not close #640. The wire-interleaving root cause remains open — but
the diagnostic surface is now sufficient to identify it from a single
production failure rather than needing to re-instrument and wait for the
next occurrence.
Issue: #640
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(executions): block SUCCESS over CANCELLED so user cancel survives late agent reply (#671) (#681)
When an operator cancels a running execution mid-flight, two writers race
on the same `schedule_executions` row:
Writer A — terminate handler (`routers/chat.py:~1841`)
writes status = CANCELLED.
Writer B — TaskExecutionService success branch
(`services/task_execution_service.py:498`)
writes status = SUCCESS once the agent's HTTP reply lands. Claude Code
typically catches the cancel signal, emits a graceful final message,
and exits 0 — so the agent reports "completed successfully" and B
lands well after A.
Pre-fix CAS (RELIABILITY-005, db/schedules.py): SUCCESS writes were
unconditional ("agent's own completion result always wins"). When A landed
first and B landed second, B silently clobbered the CANCELLED status with
SUCCESS. Effects in production:
- schedule's `next_run_at` advanced as if the run had succeeded,
suppressing recovery on the next cron tick (silent skip)
- cost telemetry counted the partial run as billable success
- on-call had no signal that deliverables were incomplete
- for agents with side effects (Slack post, sheet rows, CRM), wrongly-
green status hid incomplete work from operators
Reporter saw two consecutive incidents on bdr-agent / `Daily Lead Outreach`
ending with status=success despite the operator cancelling and no Slack
ping / sheet rows / final deliverable being produced.
Fix: narrow the CAS carve-out so SUCCESS writes are blocked when the row
is already CANCELLED, but still win over RUNNING / QUEUED / PENDING_RETRY /
SKIPPED and over a phantom-stale FAILED (preserves the #378 invariant —
real completions still beat misfired Phase-3 cleanup).
- SUCCESS over RUNNING — wins (happy path)
- SUCCESS over phantom FAILED — wins (#378 invariant preserved)
- SUCCESS over CANCELLED — blocked (#671)
- FAILED/CANCELLED over any terminal — blocked (RELIABILITY-005, unchanged)
Tests: 5 unit tests in tests/unit/test_cancelled_not_overwritten.py
covering each transition above. The exact prod race repro
(`test_success_blocked_when_row_already_cancelled`) fails pre-fix, passes
post-fix. Live-verified against running stack:
cancel write ok: True
after cancel: ('cancelled',)
late-success write ok: False
after late-success: ('cancelled', None, None) # response/cost NOT recorded
Defense-in-depth (plumb cancel signal into the agent task-runner so its
reply carries `status=cancelled` and its log says "cancelled by user"
instead of "completed successfully") tracked separately as a follow-up.
Closes #671 (minimum CAS guard)
Refs #679 (defense-in-depth follow-up)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(slack): multi-connection Socket Mode with envelope-ID dedup ring (#244) (#684)
Implements Slack's documented multi-connection Socket Mode pattern in
adapters/transports/slack_socket.py. Slack's edge fans out events across
active connections, so when one half-closes, siblings keep absorbing
traffic — eliminating the 430 ms reconnect gap absorbed by the watchdog.
Empirical basis: 7-day production ops report on ability-services showed
68 disconnect/reconnect cycles, 100% recovered by the watchdog (#278), all
ending in identical "Cannot write to closing transport" — Slack's edge
half-closing without a WebSocket close frame. Slack's own docs and support
team recommend running multiple concurrent connections (up to 10 per app)
as the architectural answer to this exact pattern.
Changes
- N concurrent SocketModeClient instances (default 2; range 1-10) via
SLACK_SOCKET_CONNECTION_COUNT env var, clamped fail-safe on parse error
- _ClientCtx dataclass per client (own watchdog task, own backoff counter)
- Envelope-ID dedup ring (OrderedDict cap 1024 + asyncio.Lock) defends
against possible cross-connection duplicate delivery; INFO log on hit so
we measure whether Slack ever actually dual-delivers
- Per-client log prefix [c=N] so ops can attribute disconnects per client
- is_connected returns "any client healthy" (permissive) + new
connected_count property exposes degraded mode
- stop() iterates all clients/watchdogs (cleanup correctness)
- Parallel start via asyncio.gather keeps boot at ~10s ceiling
- Env-var WARN no longer echoes raw value (prevents accidental token leak
if operator pastes app token into wrong env var)
Tests
- 56/56 passing (28 existing watchdog + 28 new multi-connection)
- New test_slack_multi_connection.py covers env-var bounds,
is_connected/connected_count semantics, dedup ring (skip + concurrent
+ FIFO eviction), per-client backoff isolation, partial startup,
N=1 backward compat, stop() cleanup
Deferred
- #683 — wrap connect_to_new_endpoint() in asyncio.wait_for to prevent
watchdog stall (pre-existing watchdog hole, blast-radius reduced by
this PR's per-client isolation)
Verified
- Single-worker uvicorn confirmed (docker-compose.yml line 82, no
--workers flag; line 230 comment confirms intent), so per-process
dedup ring is correct
- Architectural Invariants preserved (Channel Adapter ABC unchanged;
no new endpoints; no new persistent storage; no Invariant #12 regression)
- /review found 0 critical, 2 informational; C1 applied
- /cso --diff: 0 critical / 0 high / 0 medium / 0 low
Fixes #244
Co-authored-by: Pavlo <pash@pashs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(user-docs): add deployment guides + screenshots
- Add 6 missing deploying/ spokes: local-development, single-server,
public-access, upgrading, backup-and-restore, monitoring
- Add 9 UI screenshots and wire into 10 existing feature docs
- Update deploying-trinity.md hub with spoke navigation table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(agent-client): circuit breaker cooldown clock no longer resets on failures while open (#687) (#688)
record_failure() was resetting last_failure_time on every call, including
when the circuit was already open. Continuous probe failures (cleanup
re-verify, scheduler dispatches) kept the cooldown timer near zero,
making the half-open transition unreachable and leaving the circuit
permanently open until backend restart.
Fix: only update last_failure_time when state != "open", so the 30s
cooldown starts from when the circuit first opens and is not disturbed
by subsequent failures.
Fixes #687
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(config): remove legacy unauth REDIS_URL fallbacks (#645) (#697)
`config.REDIS_URL` is the canonical Redis URL gate (#589 / PR #643) —
it raises at import if creds are missing. Three services bypassed
that gate by reading the env var directly with an unauthenticated
localhost fallback, both at the factory site and in the constructor
default:
redis_url = os.getenv("REDIS_URL", "redis://redis:6379")
def __init__(self, redis_url: str = "redis://redis:6379"):
In docker-compose these branches don't fire — REDIS_URL is always
populated. But test/CI paths and ad-hoc debug shells silently fall
back to unauth localhost, then hit NOAUTH at runtime instead of the
clear startup-time error #589 establishes.
Changes:
- slot_service / ssh_service: factories drop the os.getenv fallback;
constructors take Optional[str] = None and lazy-import
config.REDIS_URL when called with no arg.
- capacity_manager: same pattern, lazy-import in __init__.
- tests/unit/conftest.py: setdefault REDIS_URL with creds before any
backend import — unit tests don't share the parent conftest
(norecursedirs = ..) so the env wasn't being primed.
- tests/unit/test_redis_url_no_fallback.py: lint-style regression
test that greps src/backend/services/ for `os.getenv("REDIS_URL"`
and `"redis://redis:6379"` and fails if either resurfaces.
Verified:
- 30 unit tests pass in trinity-backend container
(test_redis_url_no_fallback + test_capacity_manager + test_config_fail_fast)
- 14 ssh_service tests + 9 redis-url-related tests pass on host venv
- Live backend bootstraps cleanly: SlotService / SshService /
CapacityManager all resolve REDIS_URL via config
- Lint test catches regression: stashing the slot_service fix flips
both assertions to FAIL with offending line:number
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(security): close TOCTOU race in webhook rate limiter (#644) (#696)
* fix(security): close TOCTOU race in webhook rate limiter (#644)
Pre-fix path issued a separate GET then INCR. N concurrent callers
could all observe count < WEBHOOK_RATE_LIMIT before any of them
incremented, slipping past the 429 and pushing the actual call rate
to limit + N.
Switched to INCR-then-compare (Redis INCR is atomic): increment
unconditionally, then 429 the caller whose post-increment count crosses
the threshold. Trade-off: blocked requests still tick the counter,
slightly extending cool-down for an over-limit token. Acceptable for
a rate-limiter — we only stop accepting work, we don't unwind.
Tests:
- tests/unit/test_webhook_rate_limit_toctou.py — pins INCR-first
semantics. The structural assertion (r.get() not called) reliably
catches a partial revert that re-adds the GET; the wide-window race
belongs in integration tests against real Redis.
- tests/integration/test_webhook_rate_limit.py — adds concurrent
burst test alongside the existing #589 sequential coverage.
Verified live in trinity-backend with real Redis:
- pre-fix: 15/20 succeeded under 20-thread burst (limit 10) — race
reproduced.
- post-fix: exactly 10/20 succeeded — limit holds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(webhooks): unblock trigger endpoint — schedule model + audit signature (#647 follow-up)
While verifying #644 against a live stack, found two additional facade
gaps that #648 (the WEBHOOK-001 delegation fix) didn't catch — both
crash trigger_webhook before the rate-limiter even runs to completion:
1. `Schedule` pydantic model never carried `webhook_enabled` /
`webhook_token` fields. The DB columns exist, but the row mapper
discarded them, so `if not schedule.webhook_enabled:` raised
AttributeError on every trigger call.
2. `webhooks.py:trigger_webhook` called `platform_audit_service.log()`
with `actor_type="system"`. The service derives actor_type
internally from actor_user / actor_agent_name / mcp_scope and has no
such kwarg; every accepted webhook 500'd in the audit step.
Both are tiny:
- Add the two fields to `Schedule` (db_models.py).
- Pull them through `_row_to_schedule` (db/schedules.py).
- Drop the bogus actor_type kwarg, pass actor_ip instead — webhook
callers are unauthenticated so caller IP is the only attributable
signal.
With these, the integration test in tests/integration/test_webhook_rate_limit.py
now exercises the full HTTP path end-to-end. Live verification against
the running backend: 15-way concurrent burst → 10 × 202 + 5 × 429,
limit holds.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(api-keys): clipboard fallback + error feedback (#677) (#695)
Both copy actions on /api-keys (Copy Config, Copy key icon) called
navigator.clipboard.writeText() with no fallback and swallowed every
rejection in console.error, so users in modal-focus or non-secure
contexts saw nothing on the clipboard and no error.
- Add `utils/clipboard.js` with a textarea + execCommand fallback,
returning a boolean for caller-driven UX.
- Wire ApiKeys.vue's copyApiKey / copyMcpConfig through the helper.
Visual "Copied!" / green-check state only fires on confirmed
success; failure shows an alert telling the user to copy manually.
- Add e2e spec exercising both buttons with a granted
clipboard-read permission.
Verified in a real browser (admin login, /api-keys, create key,
click both buttons): clipboard contained the expected MCP JSON and
raw `trinity_mcp_*` key respectively; no console errors.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(monitoring): correct get_accessible_agents call signature (#682) (#694)
`GET /api/monitoring/status` returned 500 TypeError for non-admin users
because routers/monitoring.py called the helper with the pre-refactor
two-arg signature `(email, agent_names)`. Every other call site was
updated to `(current_user)` — only this one was missed.
Adds a unit regression test that pins the canonical helper signature
and spy-verifies the router calls it with exactly one positional
User arg.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(harness): repro scaffolding + negative results for #640 — 6 hypotheses falsified (#693)
* test(harness): repro scaffolding for #640 — controlled stdio MCP leaks
Adds tests/harness/640/ — three files implementing a deterministic repro
harness for the open root cause behind the reader-thread / wire-corruption
failure family (#640, manifesting in #678, #630, #618, #548, #586).
Background: PR #662's author tried 1.5h with `@upstash/context7-mcp` —
issue body says ≥100 turns / 18min are needed to manifest. Hunting for a
naturally-leaky package is unreliable. Instead this harness builds a
controlled experiment: a minimal stdio MCP server with switchable leak
variants, each testing one hypothesis from the issue body and #662's
empirical notes.
Files:
- noisy_mcp_server.py — stdlib-only stdio MCP server with --leak knob:
none : control / baseline
stderr-flood : MCP child stderr noise
setsid-child : grandchild that escapes pgid (#618 family)
and retains protocol-pipe write end
proc-fd-write : raw writes to /proc/self/fd/1 interleaved
with MCP frames
delayed-stdout : partial-line writes that race the reader at
line boundary
npm-wrapper : real-world npx-style boilerplate emitted to
stdout BEFORE protocol handshake — most likely
production culprit
- run_repro.py — driver that hits an agent's chat API for N turns and
measures null-cost / null-response rate (the observable symptom from
#678). Exits 1 if rate exceeds --null-cost-fail-rate (default 5%) so
the harness can also serve as a CI regression gate once a fix lands.
- README.md — runbook: agent setup, .mcp.json wiring, expected output,
caveats. Documents that parse_failures-counter assertions wait on
PR #662 merging.
Smoke-tested:
- Simulator's CLI parses --help.
- Sample initialize + tools/list session round-trips clean JSON
responses with --leak=none, sidecar log captures protocol activity.
- npm-wrapper variant emits the boilerplate BEFORE the JSON-RPC reply
as designed.
This commit is scaffolding only — actual variant characterization
against a running stack is a follow-up. Each variant takes ~10-15 min
of Sonnet wall-time at 50 turns, so running the full 6-variant matrix
is a budgeted exercise rather than something to do in one CI pass.
Refs #640
Refs #662 (parse_failures instrumentation prerequisite)
Refs #678 (production manifestation that motivated this work)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(harness): document negative results for #640 — 6 hypotheses falsified
Adds the results section to tests/harness/640/README.md after running 2
variants empirically (npm-wrapper, setsid-child) plus 4 hypotheses ruled
out via static inspection of claude-cli's cli.js and Linux kernel
behaviour. Also fixes a driver bug where cost was read from
top-level `cost_usd` instead of `metadata.cost_usd` (real cost lives
under metadata; the chat endpoint nests observability fields there).
The harness as designed cannot reproduce #640 because every leak path
it can exercise is on the MCP protocol pipe (claude-side), not on the
agent-server claude-stdout pipe (which is where the wire corruption
in #640 actually manifests). Claude+SDK isolate MCP child stdio
correctly; Linux refuses /proc/*/fd/* bypass with ENXIO; Claude has
no stray stdout writes in stream-json hot path.
Negative results preserved so future #640 hunts don't re-walk the
same paths. Real next step is landing PR #662 and getting prod-data-
driven evidence on which package actually leaks.
Refs #640
Refs #662 (diagnostic instrumentation prerequisite)
Refs #678 (production manifestation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(voice): agent workspace page with canvas panel and Gemini panel tools (#699) (#703)
* feat(voice): agent workspace page with canvas panel and Gemini panel tools (#699)
Adds a full-page voice workspace at /agents/:name/workspace with a split
layout: orb + controls on the left, an agent-controlled canvas panel on
the right. Introduces four in-process panel tools (show_markdown,
update_panel, append_to_panel, clear_panel) that let Gemini write
structured content to the canvas during a voice conversation without
delegating to the agent container. Panel state is polled at 300ms via
a new GET /voice/{session_id}/panel endpoint. Workspace mode is gated
on a new voice_available feature flag (GEMINI_API_KEY + VOICE_ENABLED)
and surfaced via a BETA-badged button in AgentHeader.
Security: panel content is DOMPurify-sanitised before v-html rendering;
panel endpoint has session ownership checks; append_to_panel caps
accumulated content at 512 KB to bound per-session memory.
Tests: 7 new panel-tool unit tests (test_voice_tools.py); 5 new panel
endpoint ownership tests (test_voice_auth.py).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(voice): add VOICE-008 to requirements + voice API table in architecture
Adds VOICE-008 (Voice Workspace / #699) requirements entry and Phase 4
roadmap entry. Adds voice API endpoint table to architecture.md (was
entirely absent) and updates feature-flags description to mention
voice_available.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(voice): cross-worker session 403 + audit kwargs TypeError (#704 #705) (#706)
#704: Voice sessions written to Redis (dual-write with in-memory) so a
WebSocket worker can auth-check a session created on a different Uvicorn
process. VoiceService.create/get/remove_session are now async; Redis key
TTL = VOICE_MAX_DURATION + 60 (360s); Redis failure at /voice/start raises
loudly rather than silently producing an intermittently-failing session ID.
#705: on_tool_call callback passed legacy actor_type=/actor_id=/actor_email=
kwargs that don't exist on platform_audit_service.log(), causing a TypeError
that silently swallowed every voice tool-call audit record. Fixed to use
actor_user=types.SimpleNamespace(id=..., email=...) matching the _resolve_actor
contract; wrapped in asyncio.create_task so the audit write doesn't block.
Tests: +7 Redis fallback tests (TestRedisSessionFallback in test_voice_tools.py),
+1 audit attribution source-inspection test (TestVoiceAuditAttribution in
test_voice_auth.py). Catalog updated: 45 voice unit tests total.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(voice): workspace panel flicker + Chart.js rendering in update_panel (#707) (#709)
- updated_at change-detection gate in fetchPanel() prevents 3x/sec Vue
re-renders and stops empty state from overwriting content when session ends
- in-flight guard (panelFetching flag) prevents overlapping 300ms requests
- panel content preserved on session end; reset on new session start
- replace v-html+sanitizedHtml computed with ref+renderHtmlPanel():
DOMPurify.sanitize(html, {ADD_TAGS:['script']}) + _execScripts() re-clones
script nodes as live DOM so Chart.js new Chart() calls execute correctly
- Chart.js 4.4.0 pre-loaded via injectChartJs() on mount (CDN, id-guarded)
- WORKSPACE_PANEL_INSTRUCTIONS updated: document Chart.js pre-loaded rule
Fixes #707
Co-authored-by: Claude <noreply@anthropic.com>
* docs(architecture): update stale api-keys refs after #302 settings refactor
ApiKeys.vue deleted; /api-keys now redirects to /settings?tab=mcp-keys.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(settings): tabbed layout with role-gated MCP Keys absorption (#302) (#700)
Splits the 2,600-line Settings page into 5 logical tabs (General, Access,
Integrations, MCP Keys, Agents) with URL ?tab= deep-linking and ROLE-001
role-gated visibility. Absorbs the standalone /api-keys page into the new
MCP Keys tab; preserves bookmarks via a permanent SPA redirect.
Built test-first via Canon TDD against a 12-behavior list documented at
docs/planning/302-settings-test-list.md. 14 Playwright tests pass; 13
match the test-list items 1:1 plus 1 regression test pinning the
non-admin admin-403-bounce fix surfaced during /review.
Behavior
- /settings ?tab=<id> deep links to any tab. Unknown ?tab= falls back to
the user's default tab. Browser back/forward navigates tab history.
- Tab visibility gates by role: admin sees all 5 tabs; non-admin sees
only MCP Keys (matches today's /api-keys page being non-admin).
- Default tab: General for admin, MCP Keys for non-admin.
- /api-keys redirects to /settings?tab=mcp-keys (static literal target,
no open-redirect surface).
- NavBar "Keys" link removed; Settings link now visible to all auth
users (was admin-only).
Implementation
- New components/settings/McpKeysTab.vue extracted from views/ApiKeys.vue
(deleted). Same auth posture preserved verbatim.
- New composables/useRole.js mirrors backend ROLE-001 hierarchy
(user < operator < creator < admin) for client-side UI gating.
- New authStore.role getter sourced from /api/users/me; new
fetchUserProfile() action populates role on login + session restore.
- 13 existing Settings sections wrapped with v-if matching their tab.
- watch(isAdmin, ..., { immediate: true }) guards admin-only data
fetches so non-admin users don't trigger 403 → router.push('/')
bounce — this was the bug surfaced by /review and fixed pre-merge.
Security
- Backend require_admin/require_role in routers/settings.py UNCHANGED.
UI hiding is convenience, not the security boundary. A non-admin who
edits localStorage.user.role = 'admin' sees all 5 tabs but every admin
endpoint still returns 403 — UI bypass yields zero capability gain.
- /api-keys redirect target is a hardcoded literal — no user input.
- /review: 0 critical (after C1 fix), 5 informational.
- /cso --diff: 0 critical / 0 high / 0 medium / 0 low.
Acceptance criteria status
- [x] Tabbed nav with ?tab= URL query param
- [x] 12+ sections organized into 5 logical tabs
- [x] MCP API Keys absorbed into Settings
- [~] "Each tab a separate Vue component" — only McpKeysTab.vue
extracted; the other 4 tabs remain inline v-if sections in
Settings.vue. Follow-up issue worth filing for full extraction
(state, methods, computed props need to move per-tab).
- [x] NavBar simplified (Keys link removed)
- [x] Non-admin users still access MCP key management
- [x] No functionality lost — covered by behavior 11 regression test
Test plan
- 14/14 settings-tabs.spec.js pass (13 list-driven + 1 regression)
- Existing smoke.spec.js updated (no longer asserts Keys link)
- 3 unrelated session-tab.spec.js failures pre-exist on dev (unaffected)
Out of scope (intentionally deferred)
- Full tab-as-component extraction (4 remaining)
- v-show vs v-if for modal-state preservation across tab switches
- Vitest unit-test infra (frontend has Playwright e2e only)
Fixes #302
Co-authored-by: Pavlo <pash@pashs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(voice): test regression + Chart.js bundling + model default (#723) (#726)
Fixes three independent regressions in the voice workspace:
1. **test_voice_auth.py collection failure** — `_stub_docker_service()` was
missing `docker_client` and four other attrs that `services/__init__.py`
imports; adding a template_service stub also prevents a cascade from
migration code that imports `services.credential_encryption`.
Set `SECRET_KEY` env var early so the real config and `test_voice_tools.py`
stub both sign/verify JWTs with the same key (avoids 4001 on ownership tests).
2. **test_voice_tools.py import error** — `services.gemini_voice` stub left in
`sys.modules` by `test_voice_auth.py` blocked the real import of
`GeminiVoiceService`. One-line eviction before the import fixes it.
3. **Chart.js CDN → bundle** — replace the dynamic CDN script injection with a
proper `import Chart from 'chart.js/auto'` + `window.Chart = Chart`
in `AgentWorkspace.vue`. The CDN approach was unreliable under load;
the bundled path is deterministic. `fetchPanel` null-guard updated so
panel content is never overwritten by an empty response after session end.
Also exposes `VOICE_MODEL` env var in docker-compose and updates the default
model identifier to `models/gemini-3.1-flash-live-preview`.
All 45 tests in test_voice_auth.py + test_voice_tools.py pass.
Closes #723
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(schedules): restore auth parity on GET webhook status endpoint (#724) (#727)
* docs(validate-pr): add infrastructure change check to align with DEVELOPMENT_WORKFLOW.md
Adds Step 4.7 to flag docker-compose/Dockerfile changes without justification,
matching the Red Flags checklist in docs/DEVELOPMENT_WORKFLOW.md.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(schedules): restore auth parity on GET webhook status endpoint (#724)
GET /{name}/schedules/{id}/webhook used AuthorizedAgent which calls
db.get_agent_owner() and 404s for agents not in the ownership table.
POST and DELETE already use name:str + can_user_access_agent; align GET
to match so all three webhook endpoints behave consistently.
Fixes #724
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tests): fix test import drift — 0 collection errors, all 3 target files pass (#725) (#729)
* fix(tests): patch sys.modules stubs in monitoring router and skill service user agent tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tests): add get_agent_default_resources stub to readiness probe test (#725)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(tests): recover real fastapi in monitoring router loader (#725)
test_inject_assigned_credentials.py permanently overwrites sys.modules['fastapi']
with a Mock at collection time via sys.modules.update(). When collected after that
file (alphabetically 'i' < 'm'), _load_monitoring_router() exec'd monitoring.py with
a mocked APIRouter, causing @router.get() to return a Mock instead of the original
async function. asyncio.run() then raised TypeError on the Mock.
Fix: briefly evict the polluted fastapi entry, re-import from disk to get the real
module, restore the Mock, then include the real fastapi in the patch.dict context
used during exec_module.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(agent-server): cap drain executor thread at 90 s to fix #728 CPU spin (#730)
`safe_close_pipes` deadlocks on Python's BufferedReader internal lock when
a subscription token is expired (no claude child ever spawns, reader threads
stay alive). The deadlock wedges `asyncio.run(_drain_reader_threads(...))` in
the executor thread for up to 7200 s at 87–91% CPU.
Add `_drain_bounded()` in `claude_code.py`: runs the drain inside a daemon
thread with a `threading.Event` + 90 s `done.wait()` budget. Replaces all 4
`asyncio.run(_drain_reader_threads(...))` call sites. `subprocess_pgroup.py`
is untouched to minimise regression risk.
4 unit tests in `tests/unit/test_drain_bounded.py`.
Fixes #728
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(settings): harden #302 e2e coverage + add manual test plan (#716)
Follow-up to PR #700 (issue #302). Adds 6 new Playwright tests and a
manual test runbook to lift coverage from "navigation works" to
"regression-safe + integration".
New e2e tests (in src/frontend/e2e/settings-tabs.spec.js):
F1. Every section appears under its expected tab — replaces the limited
behavior 11 (4 of 13 sections) with full 13-section regression. Uses
exact:true match to disambiguate "API Keys" from "MCP API Keys".
F2. Parametric deep links — 4 ?tab= IDs not already covered by behavior 2
(general, access, integrations, agents). Round out the 5-ID matrix.
F3. MCP Keys CRUD integration — drives create + revoke through the UI,
verifies state via API queries. Uses CLEANUP_PREFIX 'test-302-e2e-
cleanup' for the test key name. afterEach hook + DELETE response
assertion cleans up; manual recovery one-liner documented in the
file header in case the hook somehow misses (zero stray rows
observed in local runs).
F4. Admin login fetches /api/users/me — pins the new fetchUserProfile()
wiring from auth.js so a future refactor can't silently break role-
based UI gating.
F5. Re-click active tab does not push duplicate history — guards against
regressing the early-return guard in selectTab().
F6. Non-admin does not see MCP Server URL section in MCP Keys tab —
confirms the inner v-if="isAdmin" gate (independent of the tab-level
v-if) works.
Test infra changes
- Whole spec file marked test.describe.configure({ mode: 'serial' })
because the CRUD test creates real keys and parallel workers race
against the McpKeysTab list re-fetch + ensureDefaultKey side effects
in headless mode. Total runtime: ~12s serial vs ~5s parallel —
acceptable.
- New cleanupTestMcpKeys helper used by afterEach hook.
- New CLEANUP_PREFIX constant exported in the file header comment with
the docker-exec recovery one-liner.
Manual test plan (docs/testing/302-settings-tabbed-layout-manual-test-
plan.md): 7-section runbook covering everything the e2e suite covers
plus the things only humans can verify (visual UX, real non-admin
session via DevTools localStorage spoof, /api-keys bookmark redirect
demo). ~25 minutes for full pass.
Verified
- 23/23 settings-tabs.spec.js pass in 12.2s
- 0 stray test rows in mcp_api_keys after a full run
- No backend, docker, or CI changes — this is purely test-infrastructure
hardening for a frontend-only feature
Refs #302
Co-authored-by: Pavlo <pash@pashs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(slack): startup recovery supervisor for transient initial-connect failures (#708) (#719)
When ALL initial Socket Mode connect attempts fail at backend boot
(transient DNS slowness, edge throttle, etc. exceeding the 10s connect
ceiling), `start()` now spawns a recovery supervisor task that retries
in the background with the watchdog's exponential backoff (60→120→240→
300s cap) until at least one client connects, then graduates to the
per-client watchdog model.
Pre-fix behavior: silent permanent-offline state until manual restart.
The watchdog model assumed at-least-one-connection, leaving no path
out of "zero contexts" once initial gather failed.
Behavior matrix:
- All initial succeed → no supervisor, watchdogs run (no overhead change)
- Partial succeed → no supervisor, degraded mode unchanged
- All initial fail (transient) → supervisor retries, exits on recovery
- All initial fail (bad creds) → supervisor retries forever; backend
HTTP stays fully responsive; ERROR "STARTUP UNREACHABLE" log fires
after 3 consecutive failures for operator paging
- Token format invalid (no xapp- prefix) → existing early-return path
preserved, no supervisor (permanent error must not spin)
- stop() called mid-supervisor → supervisor cancelled cleanly, await
propagates CancelledError, no zombie task
main.py: stop nilling _slack_transport when initial connect fails so
the supervisor task isn't orphaned.
Test coverage: 10 new unit tests (TestStartupRecoverySupervisor) +
1 flipped existing (test_start_aborts_when_all_clients_fail →
test_start_spawns_supervisor_when_all_clients_fail). 65/65 pass.
End-to-end smoke verified against running backend with extra_hosts
DNS poisoning + bad-credentials scenarios; backend HTTP confirmed
responsive throughout.
Co-authored-by: Pavlo <pash@pashs-MacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(config+sec): align TRINITY_PASSWORD across compose files + close changeme propagation paths (#692)
- docker-compose.yml: TRINITY_PASSWORD now reads ADMIN_PASSWORD directly
- docker-compose.prod.yml: switch to fail-loud ${ADMIN_PASSWORD:?...} on both backend and mcp-server
- .env.example: collapse duplicate FRONTEND_URL; add GOOGLE_API_KEY, LOG_*, TRINITY_DATA_PATH, HOST_TEMPLATES_PATH
- src/mcp-server/src/server.ts: drop || "changeme" fallback; throw on startup when MCP_REQUIRE_API_KEY=false and no usable credential
- scripts/deploy/gcp-deploy.sh: refuse to deploy if ADMIN_PASSWORD is unset or literally "changeme"
- deploy.config.example: drop "changeme" default
- docs: update mcp-orchestration flow, single-server deploy guide, TEST_REPORT historical note, feature-flows index
Default MCP_REQUIRE_API_KEY=true mode is unaffected.
Closes #692
Co-Authored-By: AndriiPasternak31 <andriipasternak31@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(tests): green the unit-test suite (closes #660) (#714)
Resolves all 49 failures + 2 collection errors in `uv run pytest tests/unit/`.
Final state: 754 passed, 1 skipped, 0 failed across 3 consecutive runs.
Production schema/migrations and backend code are untouched.
Failure groups fixed:
- A (14): missing backend deps in tests/requirements-test.txt
- B (4): test_telegram_webhook_backfill missing platform_audit_service stub
- C (2): test_fleet_sync_audit S7 partial UNIQUE index prevents seeding duplicate-binding state
- D (28): test_file_upload cascading pollution from test_backlog tmp_db + test_slack_watchdog sys.modules stub
- E (4+1): test_git_status_dual_ahead_behind inverted args + agent_server package shadow
- F (1): test_agent_server_auto_sync same package shadow as E
Also adds slackify-markdown>=0.2.0 floor for supply-chain consistency.
Co-Authored-By: Andrii Pasternak <andriipasternak31@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(schema): backfill schema.py to match migrations.py reality (#691) (#712)
Adds 11 tables, 14 columns, and ~22 indexes that existed only in
migrations.py to schema.py, restoring it as a faithful reference for
the database (Architectural Invariant #3).
Mechanical and additive. Migrations are append-only history. Existing
databases unaffected — every new statement uses IF NOT EXISTS.
Tables added: subscription_credentials, agent_notifications,
subscription_rate_limit_events, slack_workspaces, slack_channel_agents,
slack_active_threads, telegram_bindings, telegram_chat_links,
telegram_group_configs, whatsapp_bindings, whatsapp_chat_links.
Columns added: agent_ownership (full_capabilities, max_backlog_depth,
voice_system_prompt), schedule_executions (source_user_id,
source_user_email, source_agent_name, source_mcp_key_id,
source_mcp_key_name, claude_session_id, queued_at, backlog_metadata,
fan_out_id), agent_schedules (webhook_token, webhook_enabled).
Indexes added: subscription, notification, rate-limit, multi-agent
slack, telegram, whatsapp, plus partial indexes for execution backlog
(idx_executions_queued), retry (idx_executions_pending_retry),
fan-out (idx_executions_fan_out), webhook tokens
(idx_schedules_webhook_token), and proactive sharing
(idx_agent_sharing_proactive).
Verified: name-only and strict DDL parity scripts both pass with
zero missing and zero different entries. test_migrations.py 17/17.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(metrics): add /metrics skill for engineering analytics
Ports the metrics skill from feature/704-705-voice-bugs to dev.
Provides velocity, cycle time, bug ratio, and backlog health reporting
via GitHub Issues + project board data. Includes the 2026-05-07 baseline
report generated during initial development.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(validation-spec): add F-011/F-012/F-013 file structure checks for architecture, requirements, and changelog docs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(arch+validation): add data-exchange principle and composability checks
Add governing principle #7 to TARGET_ARCHITECTURE.md: data exchange over
conversation chains as the default multi-agent composition pattern.
Add Composability category (I-001–I-005) to agent-validation-spec.md:
checks that agents declare output contracts, produce structured file-based
outputs for downstream consumers, and enforce contracts via post-check hooks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore(metrics): code-health baseline 2026-05-08 @ 1eb7b5e
* feat(code-health): add /code-health skill and weekly schedule on trinity agent
- Add code-health skill playbook (.claude/skills/code-health/)
- Run first code health baseline: top hotspot routers/chat.py (score 6105),
14 size violations, 6 stale TODOs, 0 circular imports
- Commit baseline to docs/metrics/code-health-baseline.json
- Document /code-health in DEVELOPMENT_WORKFLOW.md (checklist + commands table)
- Schedule weekly Monday 09:00 UTC autonomous run on trinity agent
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(git): use per-agent PAT fallback in github sync, helpers, and lifecycle (#735) (#739)
Three callsites were ignoring per-agent GitHub PATs and calling the
platform-only get_github_pat() directly, causing silent clobber of
per-agent PATs on container restart and preventing agents with per-agent
PATs from initializing git sync without a platform PAT also configured.
- routers/git.py initialize_github_sync: get_github_pat() → get_github_pat_for_agent(agent_name)
- services/agent_service/helpers.py check_github_pat_env_matches: platform PAT → agent-effective PAT (prevents spurious recreation)
- services/agent_service/lifecycle.py: platform PAT update → agent-effective PAT (prevents per-agent PAT clobber on restart)
19 unit tests added (static callsite checks + logic tests for fallback chain).
Fixes #735
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(schedules): prevent accidental re-enable via MCP update and expose retry config fields (#741) (#742)
Two silent bugs in MCP schedule update path:
1. Terse `enabled` description in `update_agent_schedule` led AI models to
include `enabled: true` when updating unrelated fields, re-enabling
schedules the user had intentionally disabled. Added explicit warning to
omit the field unless changing state is intended.
2. `ScheduleUpdateRequest` was missing `max_retries` and
`retry_delay_seconds`, so Pydantic's `exclude_unset=True` silently
dropped those fields before they could reach `db.update_schedule()`.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(subprocess): replace os.stat() with os.readlink() in orphan pipe scan (#728) (#747)
os.stat() on /proc/pid/fd/N follows the symlink to the pipe inode and
acquires an inode lock at the kernel level. On a D-state process this
lock may be held indefinitely, causing _kill_orphan_pipe_writers to
silently exit its 10 s daemon-thread cap without ever finding the orphan
writer — leaving the reader thread permanently leaked.
os.readlink() reads the symlink target string "pipe:[inode]" from the
proc pseudo-filesystem's own metadata WITHOUT following the symlink and
WITHOUT acquiring any inode lock. Safe on D-state processes.
Changes:
- Replace os.stat() + fdinfo flags check with os.readlink() + "pipe:[N]" match
- Add our_pid = os.getpid() self-skip (handles our_pgid=None edge case)
- Remove fdinfo write-flag check (no longer needed once self is excluded)
Regression test: TestKillOrphanPipeWriters.test_kills_orphan_even_when_stat_raises_dstate_simulation
monkeypatches os.stat to always raise OSError, verifies orphan is still
found and killed. Would have FAILED against the old implementation.
Complements the _drain_bounded 90 s cap from PR #730.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(adr): evaluate Claude Agent SDK migration (#409) (#743)
Recommends DEFER. SDK does not close #285/#407 by design and
introduces five observability regressions plus an unverified
subscription-token auth path. Proposes completing #122 (split) on
the current architecture instead, with a rescoped 6-module target
that matches today's 2137-line file.
Closes #409
* ci(unit-suite): add per-PR + nightly regression gates (#715) (#744)
* ci(unit-suite): add per-PR + nightly regression gates (#715)
Two GitHub Actions workflows that surface the dev-merge residual class
of regressions earlier than reviewer time:
- backend-unit-test.yml — per-PR gate. Runs the unit suite under three
pytest-randomly seeds against both the PR's base-branch tip and its
merge commit, then diffs the union of failing test IDs. Fails the
check on new failures or missing JUnit XML (fail-closed on infra).
- backend-unit-nightly.yml — cron 06:00 UTC sweep over open PRs
targeting dev. Three jobs: discover (RO) → test (RO, no creds, runs
untrusted PR code via pull/N/head) → comment (write, no checkout).
Posts sticky regression comments via github-script.
scripts/ci/diff-pytest-failures.py — JUnit XML diff utility. Tracks
<failure> and <error> kinds separately, fails closed on missing or
unparseable XML, ships an in-process --self-test mode (8 cases) that
the workflows run before the real diff.
pytest-randomly is installed in-workflow only — adding it to
tests/requirements-test.txt would silently randomize tests/run-core.sh
because pytest auto-discovers installed plugins.
Out of scope (per #715): fixing the 34 failures + 17 errors in the
existing baseline. Those belong to the #660 follow-up; this gate is
the entry point that surfaces them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(unit-suite): stash diff script before base-side checkout (#715)
The base-side matrix legs were failing fast because
scripts/ci/diff-pytest-failures.py only exists on the PR branch — when
we git switch --detach onto the target-branch tip for the base run, the
file disappears and the self-test step exits 2 ("No such file").
Fix: copy the script to ~/.ci-tools/ in the merge-commit checkout
before any side switch, then run the self-test against the stashed
copy. The pytest run itself doesn't need the script (it just produces
JUnit XML); only the diff aggregator job consumes it, and that job
checks out the PR branch fresh.
Caught immediately by the gate validating itself on PR #744 — exactly
the dev-merge-residual feedback loop the gate is meant to provide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(unit-suite): harden nightly comment loop against marker quoting and bad JSON (#715)
Two issues from the focused SQL/race/auth review:
1. Sticky-comment finder lacked an author filter. A user reply that
quoted the `<!-- nightly-unit-suite -->` marker would match
`existing.find(...)`, and the subsequent updateComment call would
403 (bots can only edit their own comments) — leaving the PR
without a fresh sticky comment. Now requires `c.user.type === 'Bot'`
or `github-actions[bot]` author.
2. One malformed status JSON in the matrix would JSON.parse-throw out
of the for-loop, killing comment posting for every subsequent PR
in the same nightly run. Each iteration now lives in its own
try/catch and logs a per-PR warning on failure.
Both caught before nightly ever fired in production. Per-PR gate is
unaffected (the bug was only in the nightly comment job).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(unit-suite): use trusted dev-branch diff script for nightly tamper-evidence (#715)
Defense-in-depth from the /cso --diff audit. Previously the nightly's
self-test and regression-diff steps invoked
scripts/ci/diff-pytest-failures.py from the merged-in PR workspace,
meaning a malicious PR could ship a modified diff utility that always
exits 0 and produces a false "✅ Nightly clean" sticky comment.
Fix: stash the trusted copy from origin/dev before the merge, then
invoke ~/.ci-tools/diff-pytest-failures.py for both --self-test and
the real regression diff. The stash step happens at position 2 (right
after the dev checkout, before pull/N/head fetch + merge), so the
file copied is always the unmerged dev-branch version regardless of
PR contents.
The per-PR gate is intentionally not hardened the same way — that
gate is a self-check, and a PR that modifies its own diff script
shows up conspicuously in the diff for the human reviewer. The
nightly is the cross-PR signal that warrants tamper-evidence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(schema): schema.py vs migrations.py parity gate (#713) (#745)
* ci(schema): add schema.py vs migrations.py parity gate (#713)
Adds a pytest unit test (`tests/unit/test_schema_parity.py`) plus a
path-filtered GitHub Actions workflow (`.github/workflows/schema-parity.yml`)
that fails any PR whose `init_database()` boot-path produces tables,
columns, indexes, or triggers that aren't declared in `init_schema()`
alone.
The check builds two in-memory SQLite snapshots — `init_schema()` only,
and the full `migrations -> init_schema -> migrations` lifecycle from
`database.py:139-164` — and diffs them. The full lifecycle on an empty
DB is required so short-circuit migrations like `_migrate_audit_log_table`
(`migrations.py:1356-1364`, returns early when `audit_log` exists) can't
hide schema.py omissions of indexes/triggers.
Per-column TYPE comparison via PRAGMA table_info also catches drift
where schema.py and migrations disagree on a column's type — a class of
bug that name-only diffing would miss.
Out of scope, documented in the test docstring: reverse-direction drift
(schema.py edited without a migration), constraint-level drift
(NOT NULL / DEFAULT / FK / CHECK), trigger BODY drift. Future work.
Closes the prevention loop for #691: PR #712 backfilled 11 tables, 14
columns, and ~22 indexes that had drifted out of schema.py over time.
This gate stops the same drift from re-accumulating.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(schema-parity): catalog new test + harden CI key placeholder (refs #713)
- .claude/agents/test-runner.md: add test_schema_parity.py under
System & Deployment, plus a Recent Test Additions (2026-05-09)
entry with the full design rationale (full init_database() lifecycle,
PRAGMA-based type comparison, triggers in snapshot, out-of-scope notes).
- .github/workflows/schema-parity.yml: replace the invalid-hex
CREDENTIAL_ENCRYPTION_KEY placeholder ("test-ci-only-...") with a
valid 64-char zero-hex string. The key is never accessed on an empty
in-memory DB, but if a future code path ever calls encrypt(), the
workflow now produces a real test failure rather than a confusing
"Invalid encryption key format" crypto error.
Refs #713
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(schema-parity): inline maintainer note about required-status-check timing
Resolves /validate-pr suggestion: the warning about NOT adding this job
to branch-protection required-status-checks on day one was only in the
PR body. Move it inline as a header comment in the workflow YAML so it
stays visible to anyone reading the file directly (and survives long
after the PR is merged).
The risk is a future module move bypassing the path filter and
silently skipping the gate, then bricking unrelated PRs that don't
touch DB files. Leaving it optional + self-skipping is the safe default.
Refs #713
* ci(schema-parity): widen path filter to include src/backend/database.py
The parity test mirrors init_database()'s migrations->schema->migrations
lifecycle (database.py:139-164), but the path filter only watched
src/backend/db/**. A future re-ordering of init_database() itself would
slip past the gate. Add src/backend/database.py to both pull_request
and push paths so changes to the lifecycle trigger the check.
Refs #713
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(circuit-breaker): Redis-coordinated state with backoff + dormant (#631) (#698)
Pre-fix the circuit breaker lived in a per-process dict in agent_client.py.
Two uvicorn workers each tracked their own state, each probed dead agents
every 30s, and `monitoring_service` (which uses raw httpx and never
consulted the circuit at all) wrote 4 health_check rows per cycle for
the same dead agent. With a multi-hour outage that produced 400+ failures,
hundreds of concurrent SQLite writes from two workers, and the lock
contention cascaded into WebSocket auth → UI unresponsive for everyone.
Fix shape:
1. State moves into Redis (`agent:circuit:{name}` hash) with atomic Lua
scripts for record_failure / record_success / allow_request. Single
source of truth across workers; "Circuit OPENED" log fires exactly
once per cluster transition because the script returns prior+new
state and only one worker observes the closed→open boundary.
2. Exponential backoff on the open state — 30s → 60s → 120s → 240s,
capped at 300s. After 10 consecutive open-state probes without
recovery (~40min of attempts), the circuit transitions to **dormant**
and stops probing entirely.
3. Probe-lock (`SET NX EX 10` on `:probe-lock` key) ensures only one
worker fires the half-open probe per cooldown window.
4. `monitoring_service.check_network_health` now feeds its result into
the circuit (record_success on 2xx-4xx, record_failure on connect
errors / timeouts / 5xx) — previously its failures were invisible to
the circuit. `perform_health_check` short-circuits to a synthetic
AgentHealthDetail when the circuit is dormant: no httpx calls, no DB
writes. This is the cure for the SQLite-contention root cause.
5. Operator hooks: `force_circuit_dormant(agent, reason)` and
`reset_circuit(agent)`. The autonomy toggle calls them
(disable→dormant, enable→reset) — disabling autonomy now stops
probing for that agent (AC#5). The manual
POST /api/monitoring/agents/{name}/check resets first so the probe
actually runs (the documented dormant-recovery path).
Fail-open: when Redis is unreachable the breaker degrades to "always
allow request" — matches the rate-limiter pattern in webhooks.py and
keeps a Redis blip from breaking agents that are otherwise healthy.
Tests:
- tests/integration/test_circuit_breaker.py — 18 tests against real
Redis: state machine, backoff curve, dormant entry/exit, probe-lock
cross-worker semantics, transition-logged-once, operator hooks,
scan correctness, fail-open.
- tests/unit/test_monitoring_dormant_skip.py — 5 tests verifying the
perform_health_check bail-fast path: when dormant, no DB writes; when
not dormant, the full 4-row write path runs as before.
- tests/unit/test_circuit_breaker.py — kept the connection-pool and
exception-hierarchy tests; the state-machine tests moved to
integration since they need real Redis.
Live verified end-to-end: killed agent-server inside container, drove
through closed → open → dormant via repeated perform_health_check calls,
confirmed 0.5ms bail-fast in dormant state with zero new DB rows; toggled
autonomy off→on and observed dormant→closed transitions.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(#411): canary invariant harness Phase 1 (S-01, E-02, L-03) (#653)
* feat(#411): canary invariant harness Phase 1 (S-01, E-02, L-03)
Implements the canary harness from #411 that catches the bug class behind
PRs #378, #403, #129, #226 — race conditions and cross-component state
drift between Redis × SQLite × agent registries that unit tests miss.
Architecture (per design discussion in PR comments):
- Deterministic library (src/backend/canary/) — pure-function checks
shared between the watcher and the on-demand admin endpoint. No LLM
reasoning anywhere — same snapshot in, same violations out.
- Watcher service (services/canary_service.py) — 5-min APScheduler-style
loop modeled on cleanup_service.py. Disabled by default; enable on
staging/dev with CANARY_ENABLED=1.
- Fleet manifest (config/canary-fleet.yaml) — synthetic load via the
existing /api/systems/deploy mechanism. Without traffic the harness
reports trivially-green cycles.
Phase 1 invariants:
- S-01 (slot–row bijection) — Redis ZRANGE vs SQL running rows; drain
sentinels filtered. Critical severity.
- E-02 (no phantom reversal) — terminal executions stay terminal.
Phase 1 uses Redis-backed state comparison instead of Vector log diff
for simplicity. Critical severity.
- L-03 (delete cascades) — orphan-row scan across cross-cutting tables
(agent_sharing, agent_schedules, schedule_executions [non-terminal],
agent_skills, agent_tags, agent_shared_files, agent_public_links,
pending operator_queue / access_requests, agent-scoped mcp_api_keys,
active chat_sessions) plus orphan agent:slots:* Redis key scan.
Critical for active-orchestration orphans, major otherwise.
Persistence + API:
- canary_violations table (migration 34, schema in db/schema.py).
- GET /api/canary/violations + /violations/stats + /violations/{id} —
admin-only read paths over the persisted history.
- POST /api/canary/run-cycle — admin-only on-demand trigger; delegates
to the same CanaryService.run_cycle() the background loop calls.
Notifications:
- One green→red transition emits one notification via
db.create_notification(notification_type='alert', category='canary').
- Severity → priority: critical → urgent, major → high, minor → normal.
- Continuing-red cycles do not re-notify; the row in canary_violations
is the trend signal, the bell is the "now" signal.
Tests:
- 27 unit tests in tests/test_canary_invariants.py covering
CanaryOperations CRUD/filters/stats, snapshot collector behaviour,
each invariant's holds-and-fires cases, the L-03 critical/major
severity split, and the runner registry.
Smoke-test recipe (Option 1 — orphan SQL row, no code changes):
# 1. Set CANARY_ENABLED=1 and restart backend (creates canary_violations
# table via migration; starts the watcher).
# 2. Deploy the fleet:
curl -X POST -H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d "$(jq -Rs '{manifest: .}' < config/canary-fleet.yaml)" \
http://localhost:8000/api/systems/deploy
# 3. Ba…