docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status by SamPlvs · Pull Request #52 · SamPlvs/zero-operators

SamPlvs · 2026-04-25T21:41:17Z

Summary

Documents the diagnosis behind a recent user report ("Claude Code crashes when sub-agents spawn on Mac"). Confirmed ZO itself is working correctly — the failure is upstream, in Claude Code's runtime when the Lead calls Agent(...). Per direction, no ZO code changes — just user-facing docs so the next person hitting the wall has the diagnosis + mitigations in front of them.

Also harvests 6 other failure modes from PRIORS.md (PR-001/PR-010/PR-012/PR-013/PR-022/PR-031) whose mitigations weren't surfaced anywhere user-facing.

What changed

New: docs/TROUBLESHOOTING.md (180 lines, 7 sections, Symptom → Cause → Fix format)

The headline section: sub-agent spawn crashes on macOS. Each spawned teammate forks ~10–30 procs (claude main + node + several MCP servers + tool subprocs). 5–7 teammates = +100–250 procs in seconds. macOS's per-UID cap is kern.maxprocperuid = 2666 by default — heavy Electron users (Chrome with many tabs, VSCode, Slack, Discord, Spotify, Docker Desktop) routinely sit at 1500–2000 baseline, so the team's burst can push them over the cap → fork() fails → cascading session deaths.
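To see where a given machine stands relative to the cap described above, the check can be scripted. The following is a minimal sketch, not part of ZO: the function names and the `BURST_UPPER` constant are assumptions made for illustration, and `measure()` only works on macOS because it reads `kern.maxprocperuid`.

```python
import getpass
import subprocess
from typing import Tuple

# Assumption: upper end of the "+100 to 250 procs" burst estimate quoted above.
BURST_UPPER = 250


def count_processes(ps_output: str) -> int:
    """Count process rows in `ps -U <user>` output, ignoring the header row."""
    rows = [line for line in ps_output.splitlines() if line.strip()]
    return max(len(rows) - 1, 0)


def parse_sysctl(output: str) -> int:
    """Parse `sysctl -n name` ("2666") or `sysctl name` ("name: 2666") output."""
    return int(output.strip().split(":")[-1])


def burst_fits(current: int, cap: int, burst: int = BURST_UPPER) -> bool:
    """True if `current` processes plus a spawn burst stay within the per-UID cap."""
    return current + burst <= cap


def measure() -> Tuple[int, int]:
    """macOS only: return (current process count for this user, per-UID cap)."""
    user = getpass.getuser()
    ps_out = subprocess.run(
        ["ps", "-U", user], capture_output=True, text=True, check=True
    ).stdout
    cap_out = subprocess.run(
        ["sysctl", "-n", "kern.maxprocperuid"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_processes(ps_out), parse_sysctl(cap_out)
```

Calling `current, cap = measure()` followed by `burst_fits(current, cap)` gives a rough yes/no answer before starting a build. The burst size is an estimate, so a narrow pass should still be treated as marginal.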

5-step mitigation chain:

Upgrade Claude Code (2.1.119+ shipped an agent-teams permission-dialog crash fix and a 50 MB/hr MCP HTTP buffer leak fix).
Check headroom: ps -U $(whoami) | wc -l vs sysctl kern.maxprocperuid.
Raise the cap: sudo sysctl -w kern.maxproc=8000 kern.maxprocperuid=4000 (with persistence instructions via /Library/LaunchDaemons/limit.maxproc.plist).
Run zo build --no-tmux as a diagnostic to isolate the failure.
Capture crash dumps from ~/Library/Logs/DiagnosticReports/ for upstream filing.
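For the last step, gathering the relevant crash dumps can be sketched as below. This is an illustrative helper, not something ZO ships: the function name, the 24-hour window, and the `claude`/`node` filename hints are assumptions, chosen because those are the process names mentioned in the diagnosis.

```python
import time
from pathlib import Path
from typing import List, Optional, Sequence

# Directory named in the mitigation chain above.
REPORT_DIR = Path.home() / "Library" / "Logs" / "DiagnosticReports"


def recent_reports(
    directory,
    max_age_hours: float = 24.0,                     # assumption: one-day window
    name_hints: Sequence[str] = ("claude", "node"),  # assumption: likely crashers
    now: Optional[float] = None,
) -> List[Path]:
    """Return report files modified within the window whose names contain
    one of the hints, newest first. A missing directory yields an empty list."""
    directory = Path(directory)
    if not directory.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    hits = []
    for path in directory.iterdir():
        if not path.is_file():
            continue
        name = path.name.lower()
        if not any(hint in name for hint in name_hints):
            continue
        mtime = path.stat().st_mtime
        if mtime >= cutoff:
            hits.append((mtime, path))
    # Sort on the timestamp only, newest first.
    return [p for _, p in sorted(hits, key=lambda item: item[0], reverse=True)]
```

`recent_reports(REPORT_DIR)` returns the candidate files to attach to an upstream report. Widening `name_hints` may be necessary if the crashing process is reported under a different name.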

Other sections: zo: command not found, blank Claude session (paste timing), build appears stuck (no tmux), plan written to worktree, bash 3.2 silent failures, log locations.

README.md updates:

Slash Commands section: one-line link to TROUBLESHOOTING.md.
Status section refreshed: v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with a 1.0.2 row covering phase snapshots, experiment capture + autonomous loop, brand v2, website v2.

Memory:

STATE.md — completed-checklist entry for the troubleshooting doc.
DECISION_LOG.md — 2026-04-25T23:30:00Z DOCUMENTATION entry covering the diagnosis + alternatives considered (inline section, GitHub Discussions, Wiki, single-issue doc) + outcome.

Verification

pytest: 669 passed, 7 skipped (verified earlier in PR feat(brand+website): brand redesign v2 + website v2 — v1.0.2 #51).
scripts/validate-docs.sh: 10/10 hard checks pass; 1 pre-existing test-count badge warning unrelated to this PR.
Confirmed live via zo init test-spawn --no-tmux --no-detect that the scaffolding pipeline works end-to-end.
Read wrapper.py and orchestrator.py to confirm ZO launches one Claude Code session and delegates teammate spawning to Claude Code's native TeamCreate + Agent(...) — no custom subprocess management on our side.
207 wrapper / orchestrator / integration tests pass on this branch.

Test plan

On merge to main, README on GitHub shows the new "Status" row and the link to TROUBLESHOOTING.md.
docs/TROUBLESHOOTING.md renders cleanly on GitHub (headings, code blocks, links).
If a user reports a similar spawn crash, they can be pointed at the new doc.

🤖 Generated with Claude Code

…tatus

Captures the diagnosis behind a user report ("Claude Code crashes when sub-agents spawn on Mac"). Confirmed via 207 wrapper/orchestrator/integration tests + full 669-test pytest run + live zo init smoke test that ZO is working correctly — the failure happens inside Claude Code's runtime when the Lead calls Agent(...), not in code ZO controls.

docs/TROUBLESHOOTING.md covers:
- Sub-agent spawn crashes on macOS — full diagnosis (kern.maxprocperuid=2666 + Claude Code 2.1.119 fixes) + 5 mitigations (upgrade, check ulimits, raise cap, try --no-tmux, capture diagnostic reports for upstream)
- zo: command not found (PR-012 — symlink workaround)
- Tmux paste timing on cold start (PR-022/PR-031)
- Build appears stuck — no tmux session (PR-001)
- Plan written to worktree (PR-013)
- Bash 3.2 silent failures on macOS (PR-010)
- Where to find logs

README updates:
- Slash Commands section: one-line link to TROUBLESHOOTING.md
- Status section refreshed: v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests, "pre-F5" row replaced with "1.0.2" row covering phase snapshots, experiment capture + autonomous loop, brand v2, website v2

No ZO code changes — the spawn crash is upstream of ZO per user direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-04-25T21:41:40Z

Deploying zero-operators with Cloudflare Pages

Latest commit:	`d04ae65`
Status:	✅ Deploy successful!
Preview URL:	https://23124d11.zero-operators.pages.dev
Branch Preview URL:	https://claude-troubleshooting-doc.zero-operators.pages.dev

Capture the design idea surfaced by the session-022 spawn-crash diagnosis: add a --low-resources flag (or plan-level ## Resources block) that tells the Lead Orchestrator to spawn agents serially within a phase instead of in parallel. Eliminates the burst spike that trips kern.maxprocperuid=2666 on marginal-resource Macs (specifically classical_ml Phase 1 spawning 5 agents).

PR-015 deliberately set 5-agent Phase 1 as the production default; a flag-gated alternative preserves that while giving resource-constrained users an escape hatch.

Logged in STATE.md What's Next #10 + DECISION_LOG ROADMAP entry with implementation touch points (orchestrator.py _prompt_coordination(), cli.py flag plumbing, optional plan parser support). Cross-refs PR-015, PR #52 troubleshooting doc, session-022 diagnosis. Picking up next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
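The difference between the current parallel default and the proposed serial mode can be illustrated with a small scheduling sketch. Everything here is hypothetical: `spawn_batches`, `peak_concurrency`, and the placeholder agent names are invented for illustration, and ZO itself steers the Lead Orchestrator through its coordination prompt rather than scheduling subprocesses directly.

```python
from typing import List, Sequence


def spawn_batches(agents: Sequence[str], low_resources: bool = False) -> List[List[str]]:
    """Group a phase's agents into spawn batches.

    Default: one batch holding every agent (all spawned together).
    low_resources: one agent per batch (each spawned after the previous finishes).
    """
    if low_resources:
        return [[agent] for agent in agents]
    return [list(agents)] if agents else []


def peak_concurrency(batches: List[List[str]]) -> int:
    """Largest number of agents alive at once under this batching."""
    return max((len(batch) for batch in batches), default=0)


# Placeholder names standing in for a five-agent phase.
phase_one = ["agent-1", "agent-2", "agent-3", "agent-4", "agent-5"]
parallel = spawn_batches(phase_one)
serial = spawn_batches(phase_one, low_resources=True)
print(peak_concurrency(parallel), peak_concurrency(serial))  # prints "5 1"
```

Serial batching trades wall-clock time for a lower peak: the same five agents run, but only one agent's process tree exists at any moment, which is the burst reduction the flag is meant to provide.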

docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status

SamPlvs merged commit 1f9e452 into main Apr 25, 2026
1 check passed

SamPlvs deleted the claude/troubleshooting-doc branch April 25, 2026 21:42

SamPlvs added a commit that referenced this pull request Apr 30, 2026

Merge pull request #52 from SamPlvs/claude/troubleshooting-doc

77c7724

docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status

SamPlvs merged 1 commit into main from claude/troubleshooting-doc