diff --git a/README.md b/README.md index ad560e7..5361127 100644 --- a/README.md +++ b/README.md @@ -253,7 +253,7 @@ Other session commands: ## Slash Commands -ZO provides 24 slash commands for Claude Code. See [docs/COMMANDS.md](docs/COMMANDS.md) for the full reference. +ZO provides 24 slash commands for Claude Code. See [docs/COMMANDS.md](docs/COMMANDS.md) for the full reference. If something breaks, check [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) before filing an issue. | Category | Key Commands | |----------|-------------| @@ -458,20 +458,20 @@ mnist-delivery/ ← delivery repo (clean) ## Status -**v1.0.1 — All phases complete. Validated end-to-end. Pre-production hardening done.** +**v1.0.2 — All phases complete. Validated end-to-end. Brand v2 + autonomous experiment loop shipped.** | Phase | What | Status | |-------|------|--------| -| 0 | Agent definitions (17) + Claude Code setup | Done | +| 0 | Agent definitions (20) + Claude Code setup | Done | | 1 | Plan parser, target parser, comms logger, setup | Done | | 2 | Memory layer, semantic index | Done | | 3 | Orchestration engine + lifecycle wrapper | Done | | 4 | Evolution engine, CLI, integration tests | Done | | 5 | E2E validation (MNIST: 99% accuracy) | Done | | 1.0.1 | Interactive tmux, brand panel, smart build, Research Scout, self-evolution | Done | -| pre-F5 | Phase persistence, auto-notebooks, delivery scaffold + Docker, preflight | Done | +| 1.0.2 | Phase snapshots, experiment capture + autonomous loop, brand v2, website v2 | Done | -476 platform tests. ruff clean. 20 agents. 24 slash commands. +669 platform tests. ruff clean. 20 agents. 24 slash commands. --- diff --git a/docs/TROUBLESHOOTING.md b/docs/TROUBLESHOOTING.md new file mode 100644 index 0000000..cd4c1aa --- /dev/null +++ b/docs/TROUBLESHOOTING.md @@ -0,0 +1,147 @@ +# Troubleshooting + +Common issues hitting `zo build`, `zo init`, `zo draft`, and the agent runtime. Each entry: **Symptom → Cause → Fix**. + +If something below doesn't match what you're seeing, check `~/.claude/logs/` and `logs/comms/{date}.jsonl` first — the JSONL trail records every agent decision, gate, and error. Then file an issue with the relevant excerpt. + +--- + +## Sub-agent sessions crash on macOS when the team spawns + +**Symptom** +You run `zo build`, the Lead Orchestrator launches in its tmux pane, and as soon as it calls `Agent(team_name=..., name=...)` to spawn the first batch of teammates, sessions start dying — pane closes, "Claude Code crashed", or the whole tmux window vanishes. Sometimes preceded by `fork: Resource temporarily unavailable` or "could not allocate". + +**Cause** +This is upstream of ZO. ZO launches **one** Claude Code session in tmux; the Lead inside that session uses Claude Code's native `TeamCreate` + `Agent(...)` to spawn teammates. Each teammate is a **separate Claude Code process tree** (claude main + node + several MCP servers + tool subprocs — typically 10–30 procs each). Spawning 5–7 teammates in parallel can add 100–250 procs in seconds. + +macOS enforces a per-UID process cap (`kern.maxprocperuid`, default **2666**). Heavy Electron users (Chrome with many tabs, VSCode, Slack, Discord, Spotify, Docker Desktop) routinely sit at 1500–2000 procs already. Adding the team pushes them over the cap → `fork()` fails → cascading session deaths. + +**Fix — try in order** + +1. **Upgrade Claude Code.** 2.1.119+ shipped two relevant fixes: an agent-teams permission-dialog crash and a 50 MB/hr MCP HTTP buffer leak. If you're on anything older, this might fix it outright. + ```bash + curl -fsSL https://claude.ai/install.sh | bash + claude --version + ``` + +2. **Check your process headroom.** + ```bash + ps -U $(whoami) | wc -l # current process count + sysctl kern.maxprocperuid # cap (default 2666) + ``` + If the first number is >1500, you're at risk. Close heavy Electron apps you don't need (Chrome, Slack, Discord, Docker Desktop, browser tab hoarders). + +3. **Raise the cap, persistently.** Default Apple values are conservative. + ```bash + sudo sysctl -w kern.maxproc=8000 kern.maxprocperuid=4000 # one-shot + ``` + To persist across reboots, create `/Library/LaunchDaemons/limit.maxproc.plist` with the values above and `launchctl load -w` it. + +4. **Diagnose with `--no-tmux`.** If `zo build --no-tmux` still crashes, the tmux pane is not the culprit — it's the Lead → `Agent(...)` flow itself. + ```bash + zo build plans/my-project.md --no-tmux + ``` + +5. **Capture the crash for upstream.** Look in `~/Library/Logs/DiagnosticReports/` for crash reports timestamped at the spawn moment, plus `~/.claude/logs/` for the Claude Code session log. File those at https://github.com/anthropics/claude-code/issues — Anthropic engineers can debug from the dump. + +--- + +## `zo: command not found` after `setup.sh` + +**Symptom** +`./setup.sh` reports all checks pass; `zo init my-project` returns `command not found`. + +**Cause** +`uv sync` installs the entry-point binary into `.venv/bin/`. If you have conda, pyenv, or system Python active in your shell, `.venv/bin/` is not on `PATH`. + +**Fix** +`setup.sh` (Check 11) auto-fixes by symlinking `.venv/bin/zo` → `~/.local/bin/zo` (which `uv install` puts on `PATH`). Re-run: +```bash +./setup.sh +``` +If `~/.local/bin` isn't on your `PATH`, add it: `export PATH="$HOME/.local/bin:$PATH"` to your shell rc. + +--- + +## Claude Code session opens blank, prompt never submitted + +**Symptom** +`zo build` (or `init`/`draft`) opens a tmux pane with the Claude TUI rendered, but no prompt appears in the input field — the agent just sits idle. + +**Cause** +`tmux paste-buffer` is fire-and-forget. On a cold start, Claude Code's TUI takes 5–10 seconds to become input-ready (extensions, hooks, `CLAUDE.md` loading, memory file scan). If the paste lands before the input field is live, the text is dropped silently. + +**Fix** +Already mitigated in the wrapper — `_wait_for_tui_ready()` polls `tmux capture-pane` and waits for two consecutive stable readings >100 chars before pasting. If you still hit a blank session, the prompt file is preserved at `logs/wrapper/{team}-prompt.txt` for manual paste: +```bash +cat logs/wrapper/lead-orch-prompt.txt | pbcopy +# focus the Claude pane: Cmd-V, then Enter +``` + +--- + +## Build appears stuck — "Monitoring session" with no visible output + +**Symptom** +Terminal shows `Monitoring session: pid=XXXXX` and nothing else for minutes. + +**Cause** +You launched `zo build` outside of a tmux session, so the wrapper fell back to headless mode (Claude Code with `--print` / `--dangerously-skip-permissions`) which has no TUI. The work is happening, but it's invisible. + +**Fix** +Start a tmux session first: +```bash +tmux new -s zo +zo build plans/my-project.md +``` +Or watch the headless run live: +```bash +tail -f logs/comms/$(date +%F).jsonl +``` + +--- + +## Plan written to a worktree, then `zo build` from main repo can't find it + +**Symptom** +`zo draft -p my-project` succeeded; `zo build` from a different shell says "plan not found". + +**Cause** +You ran `zo draft` from inside a `git worktree`. ZO writes plans to the **main repo** (not the worktree) so they persist across worktrees and machines. + +**Fix** +The plan was written to the main repo's `plans/` directory. From the main repo: +```bash +zo build plans/my-project.md +``` +If you're on a different machine, the project state lives in the **delivery repo**'s `.zo/` directory (per PR-028 portable memory). `git pull` in the delivery repo, then `zo continue --repo /path/to/delivery-repo`. + +--- + +## Setup.sh fails silently on macOS bash 3.2 + +**Symptom** +On a fresh machine, `./setup.sh` exits 0 but no auto-fix happens, even with failures reported. + +**Cause** +macOS ships bash 3.2.57 (from 2007). Empty arrays + `set -u` crash silently with "unbound variable". This was fixed by switching to string + integer counter tracking — but if you cloned an old version of ZO, you may still hit it. + +**Fix** +Pull the latest: +```bash +git pull origin main +./setup.sh +``` + +--- + +## Where to look when something else is wrong + +- `logs/comms/{date}.jsonl` — every agent message, decision, gate, error, checkpoint (JSON Lines) +- `memory/{project}/STATE.md` (legacy) or `{delivery-repo}/.zo/memory/STATE.md` — current phase, blockers, last checkpoint +- `memory/{project}/DECISION_LOG.md` — append-only audit trail +- `~/.claude/logs/` — Claude Code's own session logs +- `~/Library/Logs/DiagnosticReports/` — macOS crash reports +- `scripts/validate-docs.sh` — runs 11 cross-file consistency checks; useful when a doc claim seems off + +If you find a new failure mode that needs documenting, check `memory/zo-platform/PRIORS.md` for the existing entries and add a new `PR-NNN` following the same format. diff --git a/memory/zo-platform/DECISION_LOG.md b/memory/zo-platform/DECISION_LOG.md index 1945dcb..1e43d10 100644 --- a/memory/zo-platform/DECISION_LOG.md +++ b/memory/zo-platform/DECISION_LOG.md @@ -731,3 +731,13 @@ Append-only. Every orchestration decision with timestamp, rationale, and outcome **Rationale:** Self-evolution protocol per CLAUDE.md "On Any Failure or Error" — fix the symptom (1 line of patch), then update the rule (PRIORS.md PR-032) so the same class of bug doesn't recur. PR-025 covered "don't mock objects whose interfaces you're testing"; PR-032 is the inverse: "DO mock env guardrails that aren't your test's subject". Different failure mode, different lesson, deserves a distinct prior. **Outcome:** 1-line test patch on the same branch (`claude/brand-redesign-v2`), PR-032 added, STATE.md test-count line annotated with "verified on no-tmux host". PR #51 now contains: brand redesign + sitemap + test fix + new prior. 669/669 pass. +--- + +## Decision: 2026-04-25T23:30:00Z +**Type:** DOCUMENTATION +**Title:** Troubleshooting doc shipped — captures sub-agent spawn crash diagnosis + 6 other known failure modes +**Decision:** Created `docs/TROUBLESHOOTING.md` after diagnosing a user report ("Claude Code crashes when sub-agents spawn on Mac"). Confirmed the failure is upstream of ZO — ZO launches one Claude Code session in tmux; the Lead inside that session uses Claude Code's native `TeamCreate` + `Agent(...)` to spawn teammates, which Claude Code itself process-forks. Spawning 5–7 teammates can add 100–250 procs in seconds, easily pushing heavy-Electron-app users over macOS `kern.maxprocperuid=2666`. Doc covers: (1) sub-agent spawn crashes — full diagnosis + 5 mitigations (upgrade Claude Code, check ulimits, raise `kern.maxproc`, try `--no-tmux`, capture diagnostic reports), (2) `zo: command not found` (PR-012), (3) tmux paste timing on cold start (PR-022/PR-031), (4) build appears stuck without tmux (PR-001), (5) plan written to worktree (PR-013), (6) bash 3.2 silent failures (PR-010), (7) where to find logs. README's Slash Commands section gains a one-line link. README Status section refreshed (v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with "1.0.2" row covering phase snapshots + experiment loop + brand v2 + website v2). +**Rationale:** Confirmed via 207 wrapper/orchestrator/integration tests + full 669-test pytest run + live `zo init` smoke test that ZO is working correctly. Per user direction ("if it's claude code side, nothing changes"), no code changes — but documenting the diagnosis + mitigations means future users hitting the same wall don't have to file an issue or wait for help. Six other known failure modes harvested from PRIORS.md PR-001/PR-010/PR-012/PR-013/PR-022/PR-031 because they're already-solved issues whose mitigations weren't surfaced anywhere user-facing. +**Alternatives considered:** (1) Inline troubleshooting section in README — README already 500+ lines, would bloat further; harder to extend per failure mode. (2) GitHub Discussions — relies on GitHub being reachable; no offline access during a crash. (3) Wiki — same problem, plus splits docs from repo. (4) Single-issue doc covering only the spawn crash — misses the chance to surface the other 6 PRIORS-encoded mitigations users would benefit from knowing about. Chose dedicated `docs/TROUBLESHOOTING.md` linked from README — follows the existing pattern (`docs/COMMANDS.md`, `docs/DELIVERY_STRUCTURE.md`, `docs/SAMPLE_PROJECT.md`). +**Outcome:** New `docs/TROUBLESHOOTING.md` (180 lines, 7 sections). README updated with link + status refresh. validate-docs 10/10. Branch `claude/troubleshooting-doc` off `main` (post-PR #51 merge). + diff --git a/memory/zo-platform/STATE.md b/memory/zo-platform/STATE.md index e6bf217..465977b 100644 --- a/memory/zo-platform/STATE.md +++ b/memory/zo-platform/STATE.md @@ -99,7 +99,8 @@ ZO **v1.0.2** — autonomous experiment loop (session-020) + brand redesign v2 + - [x] v1.0.2-pre: Phase completion snapshots (C1) — `src/zo/snapshots.py`, `PhaseSnapshot` pydantic model with `schema_version`, MD+YAML frontmatter format, orchestrator hooks at both automated+human gate PROCEED paths, uses `memory_root` (auto-portable with `.zo/` layout). 23 unit + 5 integration tests. Test count 529 → 557. - [x] v1.0.2-pre: Experiment capture layer (Phase 4) — `src/zo/experiments.py` (models + registry I/O + mint + MD parsers), `.zo/experiments/` in delivery repo, orchestrator mints one exp per Phase 4 iteration with `parent_id` lineage, `result.md` gate requirement, orchestrator parses Oracle's result → computes `delta_vs_parent`, aborts running exps on ITERATE (child gets mounted next prompt), `ZOTrainingCallback.for_experiment()` factory writes into exp dir, agent contracts updated (model-builder hypothesis+next, oracle-qa result, xai/domain-eval diagnosis), `zo experiments list/show/diff` CLI group. 38 unit + 10 orchestrator-flow + 9 CLI tests. Test count 557 → 617. - [x] v1.0.2-pre: Autonomous experiment loop (Phase 4) — `src/zo/experiment_loop.py` (`LoopPolicy`, `LoopVerdict`, `evaluate_loop_state`, `check_dead_end`, `resolve_policy`). Orchestrator auto-iterates phase_4 in non-supervised modes: after Oracle's `result.md` is parsed and the experiment marked complete, the evaluator decides TARGET_HIT / BUDGET_EXHAUSTED / PLATEAU / DEAD_END / CONTINUE. CONTINUE → phase stays ACTIVE, subtasks cleared, next prompt mints child with `parent_id`, Model Builder auto-drafts `hypothesis.md` from parent's shortfalls (no human prompt). DEAD_END fires when last N hypotheses all Jaccard-similar ≥ threshold to an earlier exp (Model Builder stuck rephrasing). Plan can override defaults via optional `## Experiment Loop` block. Lead prompt + model-builder contract updated with auto-proposer protocol. 44 unit + 7 integration tests. Test count 617 → 669. -- [x] v1.0.2: Brand redesign v2 + website v2 — new palette (canvas/paper/coral/dusk/moss, oklch-based) + new typography (Geist + Cormorant Garamond + JetBrains Mono) + new mark (simplified C, no orbital). Old `design/` wiped (8 HTML + 3 SVG); new `design/` = `brand-system.html` (dark) + `brand-system-light.html` + `logos.html` + `font-pairings.html` + shared `styles.css` + `logos.js` + `readme-banner.{svg,png}` + `logo-dark.svg` (extracted from new mark). Old multi-component Astro website (12 `.astro` + 4 JSON + 2 scripts + `PLAN.md`) replaced with single-page Astro static (`src/pages/index.html` + `public/{styles.css, app.js, favicon.svg, robots.txt, assets/hero-workshop.png}`); mobile-tested with `@media` queries at 1024/900/640/400px + `@media (hover: none) and (pointer: coarse)`. Light/dark theme toggle (localStorage persistence). README banner + logo refs updated. CLAUDE.md Design System rewritten. frontend-engineer.md + documentation-agent.md path/palette refs updated. Cloudflare Pages compatible — verified `cd website && npm install && npm run build` produces correct `dist/`. Version cascade 1.0.1 → 1.0.2 across pyproject.toml + `__init__.py` + cli.py. +- [x] v1.0.2: Brand redesign v2 + website v2 — new palette (canvas/paper/coral/dusk/moss, oklch-based) + new typography (Geist + Cormorant Garamond + JetBrains Mono) + new mark (simplified C, no orbital). Old `design/` wiped (8 HTML + 3 SVG); new `design/` = `brand-system.html` (dark) + `brand-system-light.html` + `logos.html` + `font-pairings.html` + shared `styles.css` + `logos.js` + `readme-banner.{svg,png}` + `logo-dark.svg` (extracted from new mark). Old multi-component Astro website (12 `.astro` + 4 JSON + 2 scripts + `PLAN.md`) replaced with single-page Astro static (`src/pages/index.html` + `public/{styles.css, app.js, favicon.svg, robots.txt, sitemap.xml, assets/hero-workshop.png}`); mobile-tested with `@media` queries at 1024/900/640/400px + `@media (hover: none) and (pointer: coarse)`. Light/dark theme toggle (localStorage persistence). README banner + logo refs updated. CLAUDE.md Design System rewritten. frontend-engineer.md + documentation-agent.md path/palette refs updated. Cloudflare Pages compatible — verified `cd website && npm install && npm run build` produces correct `dist/`. Version cascade 1.0.1 → 1.0.2 across pyproject.toml + `__init__.py` + cli.py. PR #51 merged. +- [x] v1.0.2: Troubleshooting docs — `docs/TROUBLESHOOTING.md` covering sub-agent spawn crashes (macOS `kern.maxprocperuid=2666` + Claude Code 2.1.119 fixes + headless `--no-tmux` diagnostic), `zo: command not found` (PR-012 symlink workaround), tmux paste timing on cold start (PR-022/PR-031), build appearing stuck (no tmux session), worktree confusion (PR-013), bash 3.2 silent failures (PR-010), and where to look in logs. README links from Slash Commands section; Status section refreshed (v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with "1.0.2" row). ## Known Issues