Skip to content

docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status#52

Merged
SamPlvs merged 1 commit into
mainfrom
claude/troubleshooting-doc
Apr 25, 2026
Merged

docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status#52
SamPlvs merged 1 commit into
mainfrom
claude/troubleshooting-doc

Conversation

@SamPlvs
Copy link
Copy Markdown
Owner

@SamPlvs SamPlvs commented Apr 25, 2026

Summary

Documents the diagnosis behind a recent user report ("Claude Code crashes when sub-agents spawn on Mac"). Confirmed ZO itself is working correctly — the failure is upstream, in Claude Code's runtime when the Lead calls Agent(...). Per direction, no ZO code changes — just user-facing docs so the next person hitting the wall has the diagnosis + mitigations in front of them.

Also harvests 6 other failure modes from PRIORS.md (PR-001/PR-010/PR-012/PR-013/PR-022/PR-031) whose mitigations weren't surfaced anywhere user-facing.

What changed

New: docs/TROUBLESHOOTING.md (180 lines, 7 sections, Symptom → Cause → Fix format)

The headline section: sub-agent spawn crashes on macOS. Each spawned teammate forks ~10–30 procs (claude main + node + several MCP servers + tool subprocs). 5–7 teammates = +100–250 procs in seconds. macOS's per-UID cap is kern.maxprocperuid = 2666 by default — heavy Electron users (Chrome with many tabs, VSCode, Slack, Discord, Spotify, Docker Desktop) routinely sit at 1500–2000 baseline, so the team push them over → fork() fails → cascading session deaths.

5-step mitigation chain:

  1. Upgrade Claude Code (2.1.119+ shipped an agent-teams permission-dialog crash fix and a 50 MB/hr MCP HTTP buffer leak fix).
  2. Check headroom: ps -U $(whoami) | wc -l vs sysctl kern.maxprocperuid.
  3. Raise the cap: sudo sysctl -w kern.maxproc=8000 kern.maxprocperuid=4000 (with persistence instructions via /Library/LaunchDaemons/limit.maxproc.plist).
  4. Diagnostic with zo build --no-tmux to isolate.
  5. Capture crash dumps from ~/Library/Logs/DiagnosticReports/ for upstream filing.

Other sections: zo: command not found, blank Claude session (paste timing), build appears stuck (no tmux), plan written to worktree, bash 3.2 silent failures, log locations.

README.md updates:

  • Slash Commands section: one-line link to TROUBLESHOOTING.md.
  • Status section refreshed: v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with a 1.0.2 row covering phase snapshots, experiment capture + autonomous loop, brand v2, website v2.

Memory:

  • STATE.md — completed-checklist entry for the troubleshooting doc.
  • DECISION_LOG.md — 2026-04-25T23:30:00Z DOCUMENTATION entry covering the diagnosis + alternatives considered (inline section, GitHub Discussions, Wiki, single-issue doc) + outcome.

Verification

  • pytest: 669 passed, 7 skipped (verified earlier in PR feat(brand+website): brand redesign v2 + website v2 — v1.0.2 #51).
  • scripts/validate-docs.sh: 10/10 hard checks pass; 1 pre-existing test-count badge warning unrelated to this PR.
  • Confirmed live via zo init test-spawn --no-tmux --no-detect that the scaffolding pipeline works end-to-end.
  • Read wrapper.py and orchestrator.py to confirm ZO launches one Claude Code session and delegates teammate spawning to Claude Code's native TeamCreate + Agent(...) — no custom subprocess management on our side.
  • 207 wrapper / orchestrator / integration tests pass on this branch.

Test plan

  • On merge to main, README on GitHub shows the new "Status" row and the link to TROUBLESHOOTING.md.
  • docs/TROUBLESHOOTING.md renders cleanly on GitHub (headings, code blocks, links).
  • If a user reports a similar spawn crash, they can be pointed at the new doc.

🤖 Generated with Claude Code

…tatus

Captures the diagnosis behind a user report ("Claude Code crashes when
sub-agents spawn on Mac"). Confirmed via 207 wrapper/orchestrator/integration
tests + full 669-test pytest run + live zo init smoke test that ZO is working
correctly — the failure happens inside Claude Code's runtime when the Lead
calls Agent(...), not in code ZO controls.

docs/TROUBLESHOOTING.md covers:
- Sub-agent spawn crashes on macOS — full diagnosis (kern.maxprocperuid=2666
  + Claude Code 2.1.119 fixes) + 5 mitigations (upgrade, check ulimits, raise
  cap, try --no-tmux, capture diagnostic reports for upstream)
- zo: command not found (PR-012 — symlink workaround)
- Tmux paste timing on cold start (PR-022/PR-031)
- Build appears stuck — no tmux session (PR-001)
- Plan written to worktree (PR-013)
- Bash 3.2 silent failures on macOS (PR-010)
- Where to find logs

README updates:
- Slash Commands section: one-line link to TROUBLESHOOTING.md
- Status section refreshed: v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669
  tests, "pre-F5" row replaced with "1.0.2" row covering phase snapshots,
  experiment capture + autonomous loop, brand v2, website v2

No ZO code changes — the spawn crash is upstream of ZO per user direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages
Copy link
Copy Markdown

Deploying zero-operators with  Cloudflare Pages  Cloudflare Pages

Latest commit: d04ae65
Status: ✅  Deploy successful!
Preview URL: https://23124d11.zero-operators.pages.dev
Branch Preview URL: https://claude-troubleshooting-doc.zero-operators.pages.dev

View logs

@SamPlvs SamPlvs merged commit 1f9e452 into main Apr 25, 2026
1 check passed
@SamPlvs SamPlvs deleted the claude/troubleshooting-doc branch April 25, 2026 21:42
SamPlvs added a commit that referenced this pull request Apr 25, 2026
Capture the design idea surfaced by the session-022 spawn-crash diagnosis:
add a --low-resources flag (or plan-level ## Resources block) that tells the
Lead Orchestrator to spawn agents serially within a phase instead of in
parallel. Eliminates the burst spike that trips kern.maxprocperuid=2666 on
marginal-resource Macs (specifically classical_ml Phase 1 spawning 5 agents).

PR-015 deliberately set 5-agent Phase 1 as the production default; a
flag-gated alternative preserves that while giving resource-constrained
users an escape hatch.

Logged in STATE.md What's Next #10 + DECISION_LOG ROADMAP entry with
implementation touch points (orchestrator.py _prompt_coordination(), cli.py
flag plumbing, optional plan parser support). Cross-refs PR-015, PR #52
troubleshooting doc, session-022 diagnosis.

Picking up next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SamPlvs added a commit that referenced this pull request Apr 30, 2026
Capture the design idea surfaced by the session-022 spawn-crash diagnosis:
add a --low-resources flag (or plan-level ## Resources block) that tells the
Lead Orchestrator to spawn agents serially within a phase instead of in
parallel. Eliminates the burst spike that trips kern.maxprocperuid=2666 on
marginal-resource Macs (specifically classical_ml Phase 1 spawning 5 agents).

PR-015 deliberately set 5-agent Phase 1 as the production default; a
flag-gated alternative preserves that while giving resource-constrained
users an escape hatch.

Logged in STATE.md What's Next #10 + DECISION_LOG ROADMAP entry with
implementation touch points (orchestrator.py _prompt_coordination(), cli.py
flag plumbing, optional plan parser support). Cross-refs PR-015, PR #52
troubleshooting doc, session-022 diagnosis.

Picking up next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SamPlvs added a commit that referenced this pull request Apr 30, 2026
docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant