[judy] fix(forge): make restart/update wait for port free before re-bind#42
Merged
Conversation
Root cause of intermittent 'forge update leaves service dead': forge update → cmd_restart → cmd_stop (launchctl bootout) → cmd_start (launchctl bootstrap). cmd_stop returned as soon as bootout returned, but the Python server, after catching SIGTERM, still owned :7878 for a few hundred ms while draining in-flight workers (the 'bye 🦞' log line). cmd_start then bootstrapped a new agent that tried to bind :7878, got EADDRINUSE, and exited. plist KeepAlive on Crashed only retries on signal-death, not on a clean non-zero exit, so the agent stayed dead until 'forge restart' was run by hand. Fix: - cmd_stop blocks up to 5s for the port to actually free; if not, it hard-kills the orphan pid and waits another 2s. - cmd_start refuses to bootstrap while :PORT is still busy (same wait + kill fallback). - cmd_start blocks up to 8s for the new process to bind :PORT before printing status, so 'forge update' no longer shows the misleading 'pid=N / Port: not listening' snapshot of a half-booted agent. The fixed 'sleep 0.5' / 'sleep 1' in restart/start are replaced by explicit port-state polling at 100ms granularity — typical restart is now <300ms but recovers cleanly when drain takes longer.
SymbolStar
added a commit
that referenced
this pull request
May 28, 2026
Phase 1 of the 'bot review → eventually auto-approve' plan from
today's chat. This PR only adds the comment-only stage so we get a
feel for false-positive rate before granting any 'gh pr review
--approve' power.
Posts (or updates) ONE sticky review comment per PR containing:
- diff stats + head sha
- red-line checks:
* A-7.5: scream if 'forbidden' is added to forge_xiaof.py /
web/xiaof.js (contract allows only unauth | rate_limited |
upstream_failed | internal)
* xiaof SSE event-order: warn when stream_to_sse_frames is touched
* forge restart race regression: warn on reintroduced 'sleep 0.5/1'
in bin/forge, hard-fail on removed wait_port_free/listening (the
fix from PR #42 that just prevented prod from re-dying on update)
- 'needs human review' list for paths never eligible for future
auto-approve: server.py, migrations/, bin/forge, .github/workflows/,
*.plist, forge_xiaof.py
Uses the default GITHUB_TOKEN with pull-requests:write only.
Concurrency-cancels stale runs per PR so synchronizes don't pile up.
Never sets exit code on findings — the comment is the deliverable.
Co-authored-by: Judy <judy@symbolstar.dev>
This was referenced May 28, 2026
This was referenced Jun 4, 2026
This was referenced Jun 11, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
OpenForge service died after
forge updatetoday and we had to manuallyforge restartto bring it back. Reproduced the timing:forge update→cmd_restart→cmd_stop(launchctl bootout) →cmd_start(launchctl bootstrap).cmd_stopreturned as soon asbootoutreturned, but the Python server — after catching SIGTERM — still owned :7878 for a few hundred ms while draining workers (the📛 received signal 15; draining…/bye 🦞log lines).cmd_startthen bootstrapped a new agent that tried to bind :7878, got EADDRINUSE, and exited.KeepAliveis configured forCrashed: trueonly — a clean non-zero exit doesn't qualify, so launchd left the agent dead.Status printed
pid=N / Port: 7878 not listeningand looked half-OK, which is exactly the false-positive that made me move on too fast.Fix
cmd_stopnow blocks up to 5s for the port to actually free; if not, it hard-kills the orphan pid and waits another 2s.cmd_startrefuses to bootstrap while :PORT is still busy (same wait + kill fallback).cmd_startblocks up to 8s for the new process to actually bind :PORT before printing status, soforge updateno longer shows the misleadingpid=N / Port: not listeningsnapshot of a half-booted agent.The hopeful fixed
sleep 0.5/sleep 1in restart/start are replaced by explicit port-state polling at 100ms granularity. Typical restart is now <300ms but recovers cleanly when SIGTERM drain takes longer.Test
Locally:
forge restart(× 5) all came backPort: 7878 listening+HTTP 200immediately, no manual fallback needed.forge updateon a no-op pull also clean.Not touching the plist —
KeepAlive { Crashed: true, SuccessfulExit: false }stays as-is. The race lived in the bash wrapper, not in launchd's policy.