Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -253,7 +253,7 @@ Other session commands:

## Slash Commands

ZO provides 24 slash commands for Claude Code. See [docs/COMMANDS.md](docs/COMMANDS.md) for the full reference.
ZO provides 24 slash commands for Claude Code. See [docs/COMMANDS.md](docs/COMMANDS.md) for the full reference. If something breaks, check [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) before filing an issue.

| Category | Key Commands |
|----------|-------------|
Expand Down Expand Up @@ -458,20 +458,20 @@ mnist-delivery/ ← delivery repo (clean)

## Status

**v1.0.1 — All phases complete. Validated end-to-end. Pre-production hardening done.**
**v1.0.2 — All phases complete. Validated end-to-end. Brand v2 + autonomous experiment loop shipped.**

| Phase | What | Status |
|-------|------|--------|
| 0 | Agent definitions (17) + Claude Code setup | Done |
| 0 | Agent definitions (20) + Claude Code setup | Done |
| 1 | Plan parser, target parser, comms logger, setup | Done |
| 2 | Memory layer, semantic index | Done |
| 3 | Orchestration engine + lifecycle wrapper | Done |
| 4 | Evolution engine, CLI, integration tests | Done |
| 5 | E2E validation (MNIST: 99% accuracy) | Done |
| 1.0.1 | Interactive tmux, brand panel, smart build, Research Scout, self-evolution | Done |
| pre-F5 | Phase persistence, auto-notebooks, delivery scaffold + Docker, preflight | Done |
| 1.0.2 | Phase snapshots, experiment capture + autonomous loop, brand v2, website v2 | Done |

476 platform tests. ruff clean. 20 agents. 24 slash commands.
669 platform tests. ruff clean. 20 agents. 24 slash commands.

---

Expand Down
147 changes: 147 additions & 0 deletions docs/TROUBLESHOOTING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,147 @@
# Troubleshooting

Common issues hitting `zo build`, `zo init`, `zo draft`, and the agent runtime. Each entry: **Symptom → Cause → Fix**.

If something below doesn't match what you're seeing, check `~/.claude/logs/` and `logs/comms/{date}.jsonl` first — the JSONL trail records every agent decision, gate, and error. Then file an issue with the relevant excerpt.

---

## Sub-agent sessions crash on macOS when the team spawns

**Symptom**
You run `zo build`, the Lead Orchestrator launches in its tmux pane, and as soon as it calls `Agent(team_name=..., name=...)` to spawn the first batch of teammates, sessions start dying — pane closes, "Claude Code crashed", or the whole tmux window vanishes. Sometimes preceded by `fork: Resource temporarily unavailable` or "could not allocate".

**Cause**
This is upstream of ZO. ZO launches **one** Claude Code session in tmux; the Lead inside that session uses Claude Code's native `TeamCreate` + `Agent(...)` to spawn teammates. Each teammate is a **separate Claude Code process tree** (claude main + node + several MCP servers + tool subprocs — typically 10–30 procs each). Spawning 5–7 teammates in parallel can add 100–250 procs in seconds.

macOS enforces a per-UID process cap (`kern.maxprocperuid`, default **2666**). Heavy Electron users (Chrome with many tabs, VSCode, Slack, Discord, Spotify, Docker Desktop) routinely sit at 1500–2000 procs already. Adding the team pushes them over the cap → `fork()` fails → cascading session deaths.

**Fix — try in order**

1. **Upgrade Claude Code.** 2.1.119+ shipped two relevant fixes: an agent-teams permission-dialog crash and a 50 MB/hr MCP HTTP buffer leak. If you're on anything older, this might fix it outright.
```bash
curl -fsSL https://claude.ai/install.sh | bash
claude --version
```

2. **Check your process headroom.**
```bash
ps -U $(whoami) | wc -l # current process count
sysctl kern.maxprocperuid # cap (default 2666)
```
If the first number is >1500, you're at risk. Close heavy Electron apps you don't need (Chrome, Slack, Discord, Docker Desktop, browser tab hoarders).

3. **Raise the cap, persistently.** Default Apple values are conservative.
```bash
sudo sysctl -w kern.maxproc=8000 kern.maxprocperuid=4000 # one-shot
```
To persist across reboots, create `/Library/LaunchDaemons/limit.maxproc.plist` with the values above and `launchctl load -w` it.

4. **Diagnose with `--no-tmux`.** If `zo build --no-tmux` still crashes, the tmux pane is not the culprit — it's the Lead → `Agent(...)` flow itself.
```bash
zo build plans/my-project.md --no-tmux
```

5. **Capture the crash for upstream.** Look in `~/Library/Logs/DiagnosticReports/` for crash reports timestamped at the spawn moment, plus `~/.claude/logs/` for the Claude Code session log. File those at https://github.com/anthropics/claude-code/issues — Anthropic engineers can debug from the dump.

---

## `zo: command not found` after `setup.sh`

**Symptom**
`./setup.sh` reports all checks pass; `zo init my-project` returns `command not found`.

**Cause**
`uv sync` installs the entry-point binary into `.venv/bin/`. If you have conda, pyenv, or system Python active in your shell, `.venv/bin/` is not on `PATH`.

**Fix**
`setup.sh` (Check 11) auto-fixes by symlinking `.venv/bin/zo` → `~/.local/bin/zo` (which `uv install` puts on `PATH`). Re-run:
```bash
./setup.sh
```
If `~/.local/bin` isn't on your `PATH`, add it: `export PATH="$HOME/.local/bin:$PATH"` to your shell rc.

---

## Claude Code session opens blank, prompt never submitted

**Symptom**
`zo build` (or `init`/`draft`) opens a tmux pane with the Claude TUI rendered, but no prompt appears in the input field — the agent just sits idle.

**Cause**
`tmux paste-buffer` is fire-and-forget. On a cold start, Claude Code's TUI takes 5–10 seconds to become input-ready (extensions, hooks, `CLAUDE.md` loading, memory file scan). If the paste lands before the input field is live, the text is dropped silently.

**Fix**
Already mitigated in the wrapper — `_wait_for_tui_ready()` polls `tmux capture-pane` and waits for two consecutive stable readings >100 chars before pasting. If you still hit a blank session, the prompt file is preserved at `logs/wrapper/{team}-prompt.txt` for manual paste:
```bash
cat logs/wrapper/lead-orch-prompt.txt | pbcopy
# focus the Claude pane: Cmd-V, then Enter
```

---

## Build appears stuck — "Monitoring session" with no visible output

**Symptom**
Terminal shows `Monitoring session: pid=XXXXX` and nothing else for minutes.

**Cause**
You launched `zo build` outside of a tmux session, so the wrapper fell back to headless mode (Claude Code with `--print` / `--dangerously-skip-permissions`) which has no TUI. The work is happening, but it's invisible.

**Fix**
Start a tmux session first:
```bash
tmux new -s zo
zo build plans/my-project.md
```
Or watch the headless run live:
```bash
tail -f logs/comms/$(date +%F).jsonl
```

---

## Plan written to a worktree, then `zo build` from main repo can't find it

**Symptom**
`zo draft -p my-project` succeeded; `zo build` from a different shell says "plan not found".

**Cause**
You ran `zo draft` from inside a `git worktree`. ZO writes plans to the **main repo** (not the worktree) so they persist across worktrees and machines.

**Fix**
The plan was written to the main repo's `plans/` directory. From the main repo:
```bash
zo build plans/my-project.md
```
If you're on a different machine, the project state lives in the **delivery repo**'s `.zo/` directory (per PR-028 portable memory). `git pull` in the delivery repo, then `zo continue --repo /path/to/delivery-repo`.

---

## Setup.sh fails silently on macOS bash 3.2

**Symptom**
On a fresh machine, `./setup.sh` exits 0 but no auto-fix happens, even with failures reported.

**Cause**
macOS ships bash 3.2.57 (from 2007). Empty arrays + `set -u` crash silently with "unbound variable". This was fixed by switching to string + integer counter tracking — but if you cloned an old version of ZO, you may still hit it.

**Fix**
Pull the latest:
```bash
git pull origin main
./setup.sh
```

---

## Where to look when something else is wrong

- `logs/comms/{date}.jsonl` — every agent message, decision, gate, error, checkpoint (JSON Lines)
- `memory/{project}/STATE.md` (legacy) or `{delivery-repo}/.zo/memory/STATE.md` — current phase, blockers, last checkpoint
- `memory/{project}/DECISION_LOG.md` — append-only audit trail
- `~/.claude/logs/` — Claude Code's own session logs
- `~/Library/Logs/DiagnosticReports/` — macOS crash reports
- `scripts/validate-docs.sh` — runs 11 cross-file consistency checks; useful when a doc claim seems off

If you find a new failure mode that needs documenting, check `memory/zo-platform/PRIORS.md` for the existing entries and add a new `PR-NNN` following the same format.
10 changes: 10 additions & 0 deletions memory/zo-platform/DECISION_LOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -731,3 +731,13 @@ Append-only. Every orchestration decision with timestamp, rationale, and outcome
**Rationale:** Self-evolution protocol per CLAUDE.md "On Any Failure or Error" — fix the symptom (1 line of patch), then update the rule (PRIORS.md PR-032) so the same class of bug doesn't recur. PR-025 covered "don't mock objects whose interfaces you're testing"; PR-032 is the inverse: "DO mock env guardrails that aren't your test's subject". Different failure mode, different lesson, deserves a distinct prior.
**Outcome:** 1-line test patch on the same branch (`claude/brand-redesign-v2`), PR-032 added, STATE.md test-count line annotated with "verified on no-tmux host". PR #51 now contains: brand redesign + sitemap + test fix + new prior. 669/669 pass.

---

## Decision: 2026-04-25T23:30:00Z
**Type:** DOCUMENTATION
**Title:** Troubleshooting doc shipped — captures sub-agent spawn crash diagnosis + 6 other known failure modes
**Decision:** Created `docs/TROUBLESHOOTING.md` after diagnosing a user report ("Claude Code crashes when sub-agents spawn on Mac"). Confirmed the failure is upstream of ZO — ZO launches one Claude Code session in tmux; the Lead inside that session uses Claude Code's native `TeamCreate` + `Agent(...)` to spawn teammates, which Claude Code itself process-forks. Spawning 5–7 teammates can add 100–250 procs in seconds, easily pushing heavy-Electron-app users over macOS `kern.maxprocperuid=2666`. Doc covers: (1) sub-agent spawn crashes — full diagnosis + 5 mitigations (upgrade Claude Code, check ulimits, raise `kern.maxproc`, try `--no-tmux`, capture diagnostic reports), (2) `zo: command not found` (PR-012), (3) tmux paste timing on cold start (PR-022/PR-031), (4) build appears stuck without tmux (PR-001), (5) plan written to worktree (PR-013), (6) bash 3.2 silent failures (PR-010), (7) where to find logs. README's Slash Commands section gains a one-line link. README Status section refreshed (v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with "1.0.2" row covering phase snapshots + experiment loop + brand v2 + website v2).
**Rationale:** Confirmed via 207 wrapper/orchestrator/integration tests + full 669-test pytest run + live `zo init` smoke test that ZO is working correctly. Per user direction ("if it's claude code side, nothing changes"), no code changes — but documenting the diagnosis + mitigations means future users hitting the same wall don't have to file an issue or wait for help. Six other known failure modes harvested from PRIORS.md PR-001/PR-010/PR-012/PR-013/PR-022/PR-031 because they're already-solved issues whose mitigations weren't surfaced anywhere user-facing.
**Alternatives considered:** (1) Inline troubleshooting section in README — README already 500+ lines, would bloat further; harder to extend per failure mode. (2) GitHub Discussions — relies on GitHub being reachable; no offline access during a crash. (3) Wiki — same problem, plus splits docs from repo. (4) Single-issue doc covering only the spawn crash — misses the chance to surface the other 6 PRIORS-encoded mitigations users would benefit from knowing about. Chose dedicated `docs/TROUBLESHOOTING.md` linked from README — follows the existing pattern (`docs/COMMANDS.md`, `docs/DELIVERY_STRUCTURE.md`, `docs/SAMPLE_PROJECT.md`).
**Outcome:** New `docs/TROUBLESHOOTING.md` (180 lines, 7 sections). README updated with link + status refresh. validate-docs 10/10. Branch `claude/troubleshooting-doc` off `main` (post-PR #51 merge).

3 changes: 2 additions & 1 deletion memory/zo-platform/STATE.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,8 @@ ZO **v1.0.2** — autonomous experiment loop (session-020) + brand redesign v2 +
- [x] v1.0.2-pre: Phase completion snapshots (C1) — `src/zo/snapshots.py`, `PhaseSnapshot` pydantic model with `schema_version`, MD+YAML frontmatter format, orchestrator hooks at both automated+human gate PROCEED paths, uses `memory_root` (auto-portable with `.zo/` layout). 23 unit + 5 integration tests. Test count 529 → 557.
- [x] v1.0.2-pre: Experiment capture layer (Phase 4) — `src/zo/experiments.py` (models + registry I/O + mint + MD parsers), `.zo/experiments/` in delivery repo, orchestrator mints one exp per Phase 4 iteration with `parent_id` lineage, `result.md` gate requirement, orchestrator parses Oracle's result → computes `delta_vs_parent`, aborts running exps on ITERATE (child gets mounted next prompt), `ZOTrainingCallback.for_experiment()` factory writes into exp dir, agent contracts updated (model-builder hypothesis+next, oracle-qa result, xai/domain-eval diagnosis), `zo experiments list/show/diff` CLI group. 38 unit + 10 orchestrator-flow + 9 CLI tests. Test count 557 → 617.
- [x] v1.0.2-pre: Autonomous experiment loop (Phase 4) — `src/zo/experiment_loop.py` (`LoopPolicy`, `LoopVerdict`, `evaluate_loop_state`, `check_dead_end`, `resolve_policy`). Orchestrator auto-iterates phase_4 in non-supervised modes: after Oracle's `result.md` is parsed and the experiment marked complete, the evaluator decides TARGET_HIT / BUDGET_EXHAUSTED / PLATEAU / DEAD_END / CONTINUE. CONTINUE → phase stays ACTIVE, subtasks cleared, next prompt mints child with `parent_id`, Model Builder auto-drafts `hypothesis.md` from parent's shortfalls (no human prompt). DEAD_END fires when last N hypotheses all Jaccard-similar ≥ threshold to an earlier exp (Model Builder stuck rephrasing). Plan can override defaults via optional `## Experiment Loop` block. Lead prompt + model-builder contract updated with auto-proposer protocol. 44 unit + 7 integration tests. Test count 617 → 669.
- [x] v1.0.2: Brand redesign v2 + website v2 — new palette (canvas/paper/coral/dusk/moss, oklch-based) + new typography (Geist + Cormorant Garamond + JetBrains Mono) + new mark (simplified C, no orbital). Old `design/` wiped (8 HTML + 3 SVG); new `design/` = `brand-system.html` (dark) + `brand-system-light.html` + `logos.html` + `font-pairings.html` + shared `styles.css` + `logos.js` + `readme-banner.{svg,png}` + `logo-dark.svg` (extracted from new mark). Old multi-component Astro website (12 `.astro` + 4 JSON + 2 scripts + `PLAN.md`) replaced with single-page Astro static (`src/pages/index.html` + `public/{styles.css, app.js, favicon.svg, robots.txt, assets/hero-workshop.png}`); mobile-tested with `@media` queries at 1024/900/640/400px + `@media (hover: none) and (pointer: coarse)`. Light/dark theme toggle (localStorage persistence). README banner + logo refs updated. CLAUDE.md Design System rewritten. frontend-engineer.md + documentation-agent.md path/palette refs updated. Cloudflare Pages compatible — verified `cd website && npm install && npm run build` produces correct `dist/`. Version cascade 1.0.1 → 1.0.2 across pyproject.toml + `__init__.py` + cli.py.
- [x] v1.0.2: Brand redesign v2 + website v2 — new palette (canvas/paper/coral/dusk/moss, oklch-based) + new typography (Geist + Cormorant Garamond + JetBrains Mono) + new mark (simplified C, no orbital). Old `design/` wiped (8 HTML + 3 SVG); new `design/` = `brand-system.html` (dark) + `brand-system-light.html` + `logos.html` + `font-pairings.html` + shared `styles.css` + `logos.js` + `readme-banner.{svg,png}` + `logo-dark.svg` (extracted from new mark). Old multi-component Astro website (12 `.astro` + 4 JSON + 2 scripts + `PLAN.md`) replaced with single-page Astro static (`src/pages/index.html` + `public/{styles.css, app.js, favicon.svg, robots.txt, sitemap.xml, assets/hero-workshop.png}`); mobile-tested with `@media` queries at 1024/900/640/400px + `@media (hover: none) and (pointer: coarse)`. Light/dark theme toggle (localStorage persistence). README banner + logo refs updated. CLAUDE.md Design System rewritten. frontend-engineer.md + documentation-agent.md path/palette refs updated. Cloudflare Pages compatible — verified `cd website && npm install && npm run build` produces correct `dist/`. Version cascade 1.0.1 → 1.0.2 across pyproject.toml + `__init__.py` + cli.py. PR #51 merged.
- [x] v1.0.2: Troubleshooting docs — `docs/TROUBLESHOOTING.md` covering sub-agent spawn crashes (macOS `kern.maxprocperuid=2666` + Claude Code 2.1.119 fixes + headless `--no-tmux` diagnostic), `zo: command not found` (PR-012 symlink workaround), tmux paste timing on cold start (PR-022/PR-031), build appearing stuck (no tmux session), worktree confusion (PR-013), bash 3.2 silent failures (PR-010), and where to look in logs. README links from Slash Commands section; Status section refreshed (v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with "1.0.2" row).

## Known Issues

Expand Down