docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status #52
Merged
Captures the diagnosis behind a user report ("Claude Code crashes when
sub-agents spawn on Mac"). Confirmed via 207 wrapper/orchestrator/integration
tests + full 669-test pytest run + live zo init smoke test that ZO is working
correctly — the failure happens inside Claude Code's runtime when the Lead
calls Agent(...), not in code ZO controls.
docs/TROUBLESHOOTING.md covers:
- Sub-agent spawn crashes on macOS — full diagnosis (kern.maxprocperuid=2666
+ Claude Code 2.1.119 fixes) + 5 mitigations (upgrade, check ulimits, raise
cap, try --no-tmux, capture diagnostic reports for upstream)
- zo: command not found (PR-012 — symlink workaround)
- Tmux paste timing on cold start (PR-022/PR-031)
- Build appears stuck — no tmux session (PR-001)
- Plan written to worktree (PR-013)
- Bash 3.2 silent failures on macOS (PR-010)
- Where to find logs
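The "check ulimits" mitigation above boils down to comparing the current per-user process count with the cap and asking whether a spawn burst still fits. A minimal sketch of that check, assuming macOS's `ps` and `sysctl`; the function names and the per-teammate process estimate are illustrative assumptions, not part of ZO:

```python
from __future__ import annotations

import subprocess


def parse_count(ps_output: str) -> int:
    """Count processes in `ps` output: non-empty lines minus the header row."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    return max(len(lines) - 1, 0)


def headroom(current: int, cap: int, teammates: int, procs_each: int) -> int:
    """Processes left under the cap after a spawn burst (negative means over)."""
    return cap - (current + teammates * procs_each)


def read_live_values() -> tuple[int, int]:
    """Query the running system. macOS only; nothing here runs at import time."""
    user = subprocess.run(
        ["whoami"], capture_output=True, text=True, check=True
    ).stdout.strip()
    ps_out = subprocess.run(
        ["ps", "-U", user], capture_output=True, text=True, check=True
    ).stdout
    cap_out = subprocess.run(
        ["sysctl", "-n", "kern.maxprocperuid"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_count(ps_out), int(cap_out.strip())
```

On a Mac, `current, cap = read_live_values()` followed by `headroom(current, cap, teammates=7, procs_each=30)` gives a rough answer. With a hypothetical baseline of 2500 processes and the default cap of 2666, that call returns -44, meaning the burst would not fit.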
README updates:
- Slash Commands section: one-line link to TROUBLESHOOTING.md
- Status section refreshed: v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669
tests, "pre-F5" row replaced with "1.0.2" row covering phase snapshots,
experiment capture + autonomous loop, brand v2, website v2
No ZO code changes, per user direction: the spawn crash is upstream of ZO.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deploying zero-operators with Cloudflare Pages
| Latest commit: | d04ae65 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://23124d11.zero-operators.pages.dev |
| Branch Preview URL: | https://claude-troubleshooting-doc.zero-operators.pages.dev |
SamPlvs added a commit that referenced this pull request on Apr 25, 2026
Capture the design idea surfaced by the session-022 spawn-crash diagnosis: add a --low-resources flag (or plan-level ## Resources block) that tells the Lead Orchestrator to spawn agents serially within a phase instead of in parallel. Eliminates the burst spike that trips kern.maxprocperuid=2666 on marginal-resource Macs (specifically classical_ml Phase 1 spawning 5 agents). PR-015 deliberately set 5-agent Phase 1 as the production default; a flag-gated alternative preserves that while giving resource-constrained users an escape hatch. Logged in STATE.md What's Next #10 + DECISION_LOG ROADMAP entry with implementation touch points (orchestrator.py _prompt_coordination(), cli.py flag plumbing, optional plan parser support). Cross-refs PR-015, PR #52 troubleshooting doc, session-022 diagnosis. Picking up next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
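The scheduling difference this commit message proposes can be sketched as follows. This is only an illustration under stated assumptions: `--low-resources` is the proposed flag, not one that exists yet; `spawn_batches`, `build_parser`, and the agent names are hypothetical; and `argparse` stands in for whatever `cli.py` actually uses. In ZO the Lead would presumably be told how to spawn through its coordination prompt rather than through a batch list like this.

```python
import argparse
from typing import List, Sequence


def spawn_batches(agents: Sequence[str], low_resources: bool) -> List[List[str]]:
    """Group a phase's agents into spawn batches.

    Default: one batch holding every agent (parallel burst).
    Low-resources: one agent per batch (serial, no burst spike).
    """
    if low_resources:
        return [[agent] for agent in agents]
    return [list(agents)]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical flag plumbing for the proposed --low-resources option."""
    parser = argparse.ArgumentParser(prog="zo-sketch")
    parser.add_argument(
        "--low-resources",
        action="store_true",
        help="spawn agents one at a time within a phase",
    )
    return parser


if __name__ == "__main__":
    # Five placeholder names standing in for a 5-agent Phase 1.
    phase_one = [f"agent-{i}" for i in range(1, 6)]
    args = build_parser().parse_args(["--low-resources"])
    for batch in spawn_batches(phase_one, args.low_resources):
        print(batch)
```

Keeping the parallel batch as the default mirrors the intent stated above: the 5-agent Phase 1 stays the production behaviour, and the serial path is opt-in.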
SamPlvs added a commit that referenced this pull request on Apr 30, 2026
Capture the design idea surfaced by the session-022 spawn-crash diagnosis: add a --low-resources flag (or plan-level ## Resources block) that tells the Lead Orchestrator to spawn agents serially within a phase instead of in parallel. Eliminates the burst spike that trips kern.maxprocperuid=2666 on marginal-resource Macs (specifically classical_ml Phase 1 spawning 5 agents). PR-015 deliberately set 5-agent Phase 1 as the production default; a flag-gated alternative preserves that while giving resource-constrained users an escape hatch. Logged in STATE.md What's Next #10 + DECISION_LOG ROADMAP entry with implementation touch points (orchestrator.py _prompt_coordination(), cli.py flag plumbing, optional plan parser support). Cross-refs PR-015, PR #52 troubleshooting doc, session-022 diagnosis. Picking up next session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SamPlvs added a commit that referenced this pull request on Apr 30, 2026
docs(troubleshooting): add docs/TROUBLESHOOTING.md + refresh README status
Summary
Documents the diagnosis behind a recent user report ("Claude Code crashes when sub-agents spawn on Mac"). Confirmed that ZO itself is working correctly: the failure is upstream, in Claude Code's runtime, when the Lead calls `Agent(...)`. Per direction, there are no ZO code changes, just user-facing docs so the next person who hits this wall has the diagnosis and mitigations in front of them.
Also harvests 6 other failure modes from `PRIORS.md` (PR-001/PR-010/PR-012/PR-013/PR-022/PR-031) whose mitigations weren't surfaced anywhere user-facing.
What changed
New: `docs/TROUBLESHOOTING.md` (180 lines, 7 sections, Symptom → Cause → Fix format)
The headline section: sub-agent spawn crashes on macOS. Each spawned teammate forks ~10–30 procs (claude main + node + several MCP servers + tool subprocs), so 5–7 teammates add roughly 100–250 procs within seconds. macOS's per-UID cap is `kern.maxprocperuid = 2666` by default. Heavy Electron users (Chrome with many tabs, VSCode, Slack, Discord, Spotify, Docker Desktop) routinely sit at a 1500–2000 baseline, so the team pushes them over the cap → `fork()` fails → cascading session deaths.
5-step mitigation chain:
1. Upgrade Claude Code (the diagnosis cites fixes in 2.1.119).
2. Check current usage against the cap: `ps -U $(whoami) | wc -l` vs `sysctl kern.maxprocperuid`.
3. Raise the cap: `sudo sysctl -w kern.maxproc=8000 kern.maxprocperuid=4000` (with persistence instructions via `/Library/LaunchDaemons/limit.maxproc.plist`).
4. Try `zo build --no-tmux` to isolate.
5. Capture `~/Library/Logs/DiagnosticReports/` for upstream filing.
Other sections: `zo: command not found`, blank Claude session (paste timing), build appears stuck (no tmux), plan written to worktree, bash 3.2 silent failures, log locations.
`README.md` updates: a one-line link to TROUBLESHOOTING.md in the Slash Commands section, and a refreshed Status section (v1.0.1 → v1.0.2, 17 → 20 agents, 476 → 669 tests).
Memory:
- `STATE.md`: completed-checklist entry for the troubleshooting doc.
- `DECISION_LOG.md`: 2026-04-25T23:30:00Z DOCUMENTATION entry covering the diagnosis + alternatives considered (inline section, GitHub Discussions, Wiki, single-issue doc) + outcome.
Verification
- `pytest`: 669 passed, 7 skipped (verified earlier in PR #51, "feat(brand+website): brand redesign v2 + website v2 - v1.0.2").
- `scripts/validate-docs.sh`: 10/10 hard checks pass; 1 pre-existing test-count badge warning unrelated to this PR.
- Live smoke test via `zo init test-spawn --no-tmux --no-detect` confirmed that the scaffolding pipeline works end-to-end.
- Reviewed `wrapper.py` and `orchestrator.py` to confirm ZO launches one Claude Code session and delegates teammate spawning to Claude Code's native `TeamCreate` + `Agent(...)`, with no custom subprocess management on our side.
Test plan
- On `main`, the README on GitHub shows the new "Status" row and the link to TROUBLESHOOTING.md.
- `docs/TROUBLESHOOTING.md` renders cleanly on GitHub (headings, code blocks, links).
🤖 Generated with Claude Code
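For the last mitigation step (capturing `~/Library/Logs/DiagnosticReports/` for upstream filing), a small helper can list the newest reports so the relevant ones are easy to attach. This is a sketch, not something shipped in ZO: `recent_reports` and `default_report_dir` are hypothetical names, and filtering by a process-name fragment assumes the report filenames contain the crashing process's name.

```python
from pathlib import Path
from typing import List, Optional


def default_report_dir() -> Path:
    """The per-user diagnostic report directory named in the PR description."""
    return Path.home() / "Library" / "Logs" / "DiagnosticReports"


def recent_reports(
    directory: Path,
    name_fragment: Optional[str] = None,
    limit: int = 10,
) -> List[Path]:
    """Return up to `limit` files from `directory`, newest first.

    If `name_fragment` is given, keep only files whose name contains it
    (case-insensitive). A missing directory yields an empty list.
    """
    if not directory.is_dir():
        return []
    files = [p for p in directory.iterdir() if p.is_file()]
    if name_fragment:
        fragment = name_fragment.lower()
        files = [p for p in files if fragment in p.name.lower()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

Calling `recent_reports(default_report_dir(), name_fragment="node", limit=5)` right after a crash would list the five most recent matching reports; the `"node"` fragment is only an example value.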