[judy] fix(forge): make restart/update wait for port free before re-bind by SymbolStar · Pull Request #42 · SymbolStar/OpenForge

SymbolStar · 2026-05-28T06:07:43Z

Why

OpenForge service died after forge update today and we had to manually forge restart to bring it back. Reproduced the timing:

forge update → cmd_restart → cmd_stop (launchctl bootout) → cmd_start (launchctl bootstrap).
cmd_stop returned as soon as bootout returned, but the Python server — after catching SIGTERM — still owned :7878 for a few hundred ms while draining workers (the 📛 received signal 15; draining… / bye 🦞 log lines).
cmd_start then bootstrapped a new agent that tried to bind :7878, got EADDRINUSE, and exited.
plist KeepAlive is configured for Crashed: true only — a clean non-zero exit doesn't qualify, so launchd left the agent dead.

Status printed pid=N / Port: 7878 not listening and looked half-OK, which is exactly the false-positive that made me move on too fast.

Fix

cmd_stop now blocks up to 5s for the port to actually free; if not, it hard-kills the orphan pid and waits another 2s.
cmd_start refuses to bootstrap while :PORT is still busy (same wait + kill fallback).
cmd_start blocks up to 8s for the new process to actually bind :PORT before printing status, so forge update no longer shows the misleading pid=N / Port: not listening snapshot of a half-booted agent.

The hopeful fixed sleep 0.5 / sleep 1 in restart/start are replaced by explicit port-state polling at 100ms granularity. Typical restart is now <300ms but recovers cleanly when SIGTERM drain takes longer.

Test

Locally: forge restart (× 5) all came back Port: 7878 listening + HTTP 200 immediately, no manual fallback needed. forge update on a no-op pull also clean.

Not touching the plist — KeepAlive { Crashed: true, SuccessfulExit: false } stays as-is. The race lived in the bash wrapper, not in launchd's policy.

Root cause of intermittent 'forge update leaves service dead': forge update → cmd_restart → cmd_stop (launchctl bootout) → cmd_start (launchctl bootstrap). cmd_stop returned as soon as bootout returned, but the Python server, after catching SIGTERM, still owned :7878 for a few hundred ms while draining in-flight workers (the 'bye 🦞' log line). cmd_start then bootstrapped a new agent that tried to bind :7878, got EADDRINUSE, and exited. plist KeepAlive on Crashed only retries on signal-death, not on a clean non-zero exit, so the agent stayed dead until 'forge restart' was run by hand. Fix: - cmd_stop blocks up to 5s for the port to actually free; if not, it hard-kills the orphan pid and waits another 2s. - cmd_start refuses to bootstrap while :PORT is still busy (same wait + kill fallback). - cmd_start blocks up to 8s for the new process to bind :PORT before printing status, so 'forge update' no longer shows the misleading 'pid=N / Port: not listening' snapshot of a half-booted agent. The fixed 'sleep 0.5' / 'sleep 1' in restart/start are replaced by explicit port-state polling at 100ms granularity — typical restart is now <300ms but recovers cleanly when drain takes longer.

Phase 1 of the 'bot review → eventually auto-approve' plan from today's chat. This PR only adds the comment-only stage so we get a feel for false-positive rate before granting any 'gh pr review --approve' power. Posts (or updates) ONE sticky review comment per PR containing: - diff stats + head sha - red-line checks: * A-7.5: scream if 'forbidden' is added to forge_xiaof.py / web/xiaof.js (contract allows only unauth | rate_limited | upstream_failed | internal) * xiaof SSE event-order: warn when stream_to_sse_frames is touched * forge restart race regression: warn on reintroduced 'sleep 0.5/1' in bin/forge, hard-fail on removed wait_port_free/listening (the fix from PR #42 that just prevented prod from re-dying on update) - 'needs human review' list for paths never eligible for future auto-approve: server.py, migrations/, bin/forge, .github/workflows/, *.plist, forge_xiaof.py Uses the default GITHUB_TOKEN with pull-requests:write only. Concurrency-cancels stale runs per PR so synchronizes don't pile up. Never sets exit code on findings — the comment is the deliverable. Co-authored-by: Judy <judy@symbolstar.dev>

SymbolStar mentioned this pull request May 28, 2026

[judy] ci: add bot-review workflow (comment-only, phase 1) #44

Merged

SymbolStar merged commit df16f26 into main May 28, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[judy] fix(forge): make restart/update wait for port free before re-bind#42

[judy] fix(forge): make restart/update wait for port free before re-bind#42
SymbolStar merged 1 commit into
mainfrom
judy/forge-restart-race

SymbolStar commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SymbolStar commented May 28, 2026

Why

Fix

Test

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant