SamPlvs · SamPlvs · Jun 1, 2026 · Jun 1, 2026 · Jun 1, 2026
@@ -1173,3 +1173,23 @@ The `--no-headlines` flag is preserved (not removed) for backwards compatibility
 - Category-allowlist-only (skip blocklist) — rejected; defence-in-depth wants generic-category AND blocklist-clear AND no-blocklist-refusal.
 
 **Outcome:** modified `src/zo/{orchestrator,experiments,cli,draft}.py`; new `src/zo/promote.py`; docs (`the-team`, `memory-and-continuity`, `overview`, `introduction`, `installation`, `demo.html`, `COMMANDS.md`, `README.md`); `.gitignore`; new `tests/unit/test_promote.py` (15) + additions to test_orchestrator/test_experiments/test_training_metrics/test_cli. **+32 tests (780 → 812 + 7 skipped), green on Python 3.11 AND 3.12, ruff `src/` clean, validate-docs 0 failures.** PR-041 added. **Deferred (audit #13/#15/#16/#17):** semantic reindex at session-end, agent failure-reporting protocol, `end_session` DECISION_LOG/PRIORS integration, `zo retrospective` CLI. Branch `claude/self-evolution`, stacked on #95.
+
+---
+
+## Decision: 2026-06-01T11:30:00Z
+**Type:** BUGFIX
+**Title:** Add startup grace + confirmation debounce to tmux session-liveness detection
+**Decision:** Rewrite `LifecycleWrapper._wait_tmux` to (1) ignore negative liveness readings for the first `_STARTUP_GRACE_POLLS=2` polls and (2) require `_DEAD_CONFIRM_POLLS=2` consecutive negatives (re-checked every `_DEAD_RECHECK_INTERVAL=2.0`s) before concluding the session ended.
+**Rationale:** `zo continue` in tmux mode launched the lead session correctly, then logged "Lead session completed, agent window closed" ~15ms later and killed the window — the first liveness poll fired before Claude's TUI claimed the pane (and while a one-time workspace-trust dialog could be up), and a single negative reading was treated as a real `/exit`. The launch path already polls carefully for *readiness*; the completion path must be equally skeptical about "it died."
+**Alternatives considered:** (1) Pre-trust `--add-dir` directories to suppress the trust dialog — fixes one trigger, not the class (any transient still tears down). (2) Time-based grace only — breaks under mocked time in tests; chose poll-count-based grace + debounce.
+**Outcome:** 53 wrapper tests pass (old complete-on-first-negative test updated + 2 regressions added), ruff clean, validate-docs 0 failures. PR-042 in PRIORS.md. Branch `claude/tmux-liveness-grace`, PR opened.
+
+---
+
+## Decision: 2026-06-01T12:30:00Z
+**Type:** BUGFIX
+**Title:** Persist bypass-permissions disclaimer acceptance to suppress Claude's startup consent dialog (root cause of the dying-session report)
+**Decision:** Add `ensure_bypass_disclaimer_accepted()` to `permissions_overlay.py` (sets `bypassPermissionsModeAccepted: true` in `~/.claude.json`, idempotent, key-preserving, refuses to overwrite corrupt JSON) and call it from `_launch_tmux` and `_launch_headless` whenever `bypass_permissions` is True.
+**Rationale:** `--bypass-permissions` makes Claude 2.1.159 show a one-time interactive consent dialog (default "No, exit"). The tmux launcher mistakes the dialog for the ready TUI (`_wait_for_tui_ready`), pastes the lead prompt + Enter, selects the default, and Claude quits on startup — the session dies before any work. The lead Claude's session transcript (killed mid-Bash-call) and live instrumentation of the real launch path confirmed the mechanism; PR-042's grace/debounce had only delayed the symptom (15ms → 22s). Passing the flag IS the user's consent, so persisting it is faithful to intent.
+**Alternatives considered:** (1) Screen-scrape the dialog and send "2"+Enter to accept — works but is fragile to menu-layout changes across Claude versions. (2) Rely on PR-042 grace/debounce alone — rejected: it never stops Claude from exiting, only delays detection. (3) Use `--dangerously-skip-permissions` CLI flag in tmux — historically exits immediately in interactive mode.
+**Outcome:** End-to-end verified (flag removed → wrapper re-set it → Claude alive 32s, past the old 22s death). 53 wrapper + 16 overlay tests pass, ruff clean, validate-docs 0 failures. PR-043 added. Stacked on PR-042 in PR #97.
@@ -1298,3 +1298,58 @@ For PR #92 the corrective sequence was: identify the actual CI gates by reading
 ### Verified Solution
 
 Seed (`_maybe_seed_priors`), load (`_prompt_memory` injects project priors), write (`_record_learning` on loop DEAD_END/PLATEAU via `EvolutionEngine.record_failure` + `append_prior`), promote (`src/zo/promote.py` fail-closed sanitizer + `zo learnings promote`). 4 `log_error(message=)` → `description=` fixes + a forced-failure-branch regression test. New `tests/unit/test_promote.py` (15 adversarial cases) + seed/load/write/bug-fix tests. +32 tests (780 → 812, both Python 3.11 & 3.12). **Cross-reference:** PR-009 (built ≠ wired), PR-005/PR-035/PR-040 (enforcement over aspiration), PR-024/PR-030 (confidentiality — the promotion sanitizer reuses the validate-docs client blocklist).
+
+---
+
+## PR-042: tmux Session-Liveness Detection Must Have a Startup Grace + Confirmation Debounce — A Single Instant Reading Must Not Tear Down a Healthy Session
+
+**Source:** Session 036 (2026-06-01), prod project continue run
+**Root cause category:** incomplete_rule (fragile liveness detection)
+
+**Failure:** `zo continue --repo . --bypass-permissions` launched the lead session in tmux ("TUI ready after 5s", "Launched lead session in tmux pane=…") and then logged "Lead session completed, agent window closed" **~15ms later**, closing immediately with no visible Claude window. An earlier same-day run with the same binary had run for 4+ hours, so it was not a version/config issue. Root cause: `LifecycleWrapper._wait_tmux` polled liveness with **zero grace period and zero debounce** — its first poll fired milliseconds after launch (before Claude's TUI had fully claimed the pane, and while a one-time workspace-trust dialog for the newly `--add-dir`'d delivery repo could still be up), and a *single* negative reading (`not pane_exists or not claude_running`) immediately concluded the session ended and killed the window. Any transient — startup race, trust dialog, momentary `tmux display-message` hiccup, brief foreground flip — looked identical to a real `/exit`.
+
+### Rules
+
+1. **Liveness/teardown decisions driven by polling an external async process need a startup grace window — the first poll often fires before the thing being watched has finished starting.** The launch path already polled carefully for *readiness* (`_wait_for_tui_ready` requires stable content) but the *completion* path trusted an instantaneous first reading. Asymmetric care is the bug: be at least as skeptical about "it died" as about "it's ready."
+   - **Why:** `_wait_tmux`'s first iteration runs ~15ms after `_launch_tmux` returns; `pane_current_command` and pane existence are not reliably settled that early, and a trust dialog can transiently alter what tmux reports.
+   - **How to apply:** ignore negative readings for the first `_STARTUP_GRACE_POLLS` polls; never conclude "exited" inside the grace window.
+
+2. **Never tear down on a single negative sample — require consecutive confirmations (debounce).** A momentary subprocess-query failure or a one-frame foreground flip is not an exit. Demand `_DEAD_CONFIRM_POLLS` consecutive negatives, and re-check on a short interval (`_DEAD_RECHECK_INTERVAL`) so a genuine exit is still detected promptly.
+   - **How to apply:** maintain a `consecutive_dead` counter that resets on any positive reading; only conclude when it reaches the threshold. Keep responsiveness by shortening the sleep while a suspected exit is being confirmed.
+
+3. **Treat a tmux/subprocess query failure as "unknown," not "dead."** The debounce already absorbs this, but it is the explicit reasoning: `tmux` returning non-zero / timing out means "I couldn't tell," which must not be coerced into a teardown.
+
+### Verified Solution
+
+`_wait_tmux` rewritten with `_STARTUP_GRACE_POLLS=2` (ignore negatives for ~first 20s at the 10s default interval), `_DEAD_CONFIRM_POLLS=2` (consecutive negatives required), and `_DEAD_RECHECK_INTERVAL=2.0` (fast re-poll while confirming). The old test that asserted complete-on-first-negative was updated to expect confirmed-exit-on-4th-poll; added `test_tmux_wait_ignores_negative_during_startup_grace` (regression for the 15ms teardown) and `test_tmux_wait_debounces_transient_negative`. 53 wrapper tests pass, ruff clean. **Cross-reference:** PR-040/PR-041 family is about wiring/enforcement; this one is about *robustness of the monitoring loop itself* — the watcher must not be more fragile than the thing it watches.
+
+---
+
+## PR-043: --bypass-permissions Triggers Claude's Startup Consent Dialog Which the tmux Launcher Dismisses as "No, exit" — Persist the Acceptance Flag
+
+**Source:** Session 036 (2026-06-01), prod project continue run (root cause of the PR-042 symptom)
+**Root cause category:** novel_case (CLI dialog interaction) + incomplete_rule (PR-042 treated the symptom)
+
+**Failure:** `zo continue --repo . --bypass-permissions` launched the lead session and it died seconds later with no usable Claude window — repeatedly. PR-042 (startup grace + debounce in `_wait_tmux`) only *delayed* the death from ~15ms to ~22s; it never addressed why Claude was leaving. The lead Claude's own session transcript (`~/.claude/projects/-home-sam-zero-operators/<id>.jsonl`) was the key evidence: Claude read the plan, ran a Bash tool call, then the transcript **stopped mid-work** — it was *killed*, not crashed. Instrumenting the real launch path (overlay applied, real prompt, both `--add-dir`s) showed `pane_current_command='bash'` at t~0s and a captured pane containing:
+```
+WARNING: Claude Code running in Bypass Permissions mode
+  ❯ 1. No, exit
+    2. Yes, I accept
+  Enter to confirm · Esc to cancel
+```
+**Mechanism:** `apply_bypass_overlay` writes `permissions.defaultMode: "bypassPermissions"`. On startup in that mode Claude Code 2.1.159 shows a one-time interactive consent dialog whose default-highlighted option is **"1. No, exit"**. `_wait_for_tui_ready` mistakes the dialog for the ready TUI (it is ~1231 chars and stable — the exact "1231 chars" seen in the field logs), so the launcher pastes the 24KB lead prompt and presses Enter → confirms the default → **Claude quits on startup** → pane falls back to `bash` → `_wait_tmux` correctly sees "not claude" and tears down. Intermittent because Claude remembers acceptance once given (`bypassPermissionsModeAccepted` in `~/.claude.json`); the field machine had never accepted it. The misleading "Session summary: I don't see a prior agent session… fresh start" was a *separate* red herring — `_generate_session_summary` runs `claude -p --model haiku` over the buffered comms events, and with the session killed before emitting any, Haiku summarized an empty buffer.
+
+### Rules
+
+1. **When you drive an interactive CLI by pasting into its TTY, any startup dialog you didn't anticipate will silently eat your input — and the default action is often the destructive one ("exit").** A "ready" heuristic based on "screen has stabilized with N chars" cannot tell a consent dialog from an input prompt. Suppress known dialogs at the source rather than relying on the paste landing in the right place.
+   - **Why:** `_wait_for_tui_ready` keys off pane content length/stability; a modal dialog satisfies it. The pasted prompt + Enter then drives whatever modal is focused.
+   - **How to apply:** for every mode/flag ZO passes to Claude, enumerate the startup dialogs it can trigger (trust, bypass-consent) and pre-satisfy them via config (`hasTrustDialogAccepted`, `bypassPermissionsModeAccepted` in `~/.claude.json`) before launch. Passing `--bypass-permissions` IS the user's consent, so persisting it is faithful to intent, not a security bypass.
+
+2. **A fix that makes a failure slower is not a fix — confirm the mechanism, don't just stop the bleeding.** PR-042's grace+debounce looked plausible and shipped green, but the session still died (just later). The mechanism was only found by reading Claude's *own* session transcript and instrumenting the real launch path (not a synthetic one) — synthetic repros missed it because they lacked the bypass overlay, so no dialog appeared.
+   - **How to apply:** when a fix doesn't resolve a user-reported failure, treat the new symptom timing as a clue and reproduce the *exact* production path (same flags, same overlay, same prompt). Read the subprocess's own logs/transcripts. PR-042 keeps its value as defense-in-depth, but the root cause was elsewhere.
+
+3. **Read-modify-write of a shared user config must preserve all keys and refuse to clobber a corrupt file.** `ensure_bypass_disclaimer_accepted` loads `~/.claude.json`, sets only the one flag, rewrites preserving everything, is idempotent, and bails (no write) if the JSON is unreadable rather than destroying a 700-startup user profile.
+
+### Verified Solution
+
+`ensure_bypass_disclaimer_accepted(config_path=None)` in `permissions_overlay.py` sets `bypassPermissionsModeAccepted: true` in `~/.claude.json` (idempotent, preserves all keys, refuses to overwrite corrupt JSON). Called from `_launch_tmux` (logs a checkpoint when newly set) and `_launch_headless` whenever `bypass_permissions` is True. **End-to-end verified:** with the flag removed to simulate a fresh machine, a full wrapper launch re-set it and Claude stayed alive 16/16 polls (32s, past the old 22s death). +4 overlay tests (69 overlay+wrapper tests pass), ruff clean. Ships in PR #97 alongside PR-042. **Cross-reference:** PR-001 (Claude CLI interactive-mode constraints — same family: the TUI has launch-time interaction quirks the wrapper must handle), PR-042 (the grace/debounce defense-in-depth this supersedes as root cause).