Skip to content

[judy] fix(forge): make restart/update wait for port free before re-bind#42

Merged
SymbolStar merged 1 commit into
mainfrom
judy/forge-restart-race
May 28, 2026
Merged

[judy] fix(forge): make restart/update wait for port free before re-bind#42
SymbolStar merged 1 commit into
mainfrom
judy/forge-restart-race

Conversation

@SymbolStar

Copy link
Copy Markdown
Owner

Why

OpenForge service died after forge update today and we had to manually forge restart to bring it back. Reproduced the timing:

  1. forge updatecmd_restartcmd_stop (launchctl bootout) → cmd_start (launchctl bootstrap).
  2. cmd_stop returned as soon as bootout returned, but the Python server — after catching SIGTERM — still owned :7878 for a few hundred ms while draining workers (the 📛 received signal 15; draining… / bye 🦞 log lines).
  3. cmd_start then bootstrapped a new agent that tried to bind :7878, got EADDRINUSE, and exited.
  4. plist KeepAlive is configured for Crashed: true only — a clean non-zero exit doesn't qualify, so launchd left the agent dead.

Status printed pid=N / Port: 7878 not listening and looked half-OK, which is exactly the false-positive that made me move on too fast.

Fix

  • cmd_stop now blocks up to 5s for the port to actually free; if not, it hard-kills the orphan pid and waits another 2s.
  • cmd_start refuses to bootstrap while :PORT is still busy (same wait + kill fallback).
  • cmd_start blocks up to 8s for the new process to actually bind :PORT before printing status, so forge update no longer shows the misleading pid=N / Port: not listening snapshot of a half-booted agent.

The hopeful fixed sleep 0.5 / sleep 1 in restart/start are replaced by explicit port-state polling at 100ms granularity. Typical restart is now <300ms but recovers cleanly when SIGTERM drain takes longer.

Test

Locally: forge restart (× 5) all came back Port: 7878 listening + HTTP 200 immediately, no manual fallback needed. forge update on a no-op pull also clean.

Not touching the plist — KeepAlive { Crashed: true, SuccessfulExit: false } stays as-is. The race lived in the bash wrapper, not in launchd's policy.

Root cause of intermittent 'forge update leaves service dead':

  forge update → cmd_restart → cmd_stop (launchctl bootout) → cmd_start
  (launchctl bootstrap). cmd_stop returned as soon as bootout returned,
  but the Python server, after catching SIGTERM, still owned :7878 for
  a few hundred ms while draining in-flight workers (the 'bye 🦞' log
  line). cmd_start then bootstrapped a new agent that tried to bind
  :7878, got EADDRINUSE, and exited. plist KeepAlive on Crashed only
  retries on signal-death, not on a clean non-zero exit, so the agent
  stayed dead until 'forge restart' was run by hand.

Fix:

- cmd_stop blocks up to 5s for the port to actually free; if not, it
  hard-kills the orphan pid and waits another 2s.
- cmd_start refuses to bootstrap while :PORT is still busy (same wait
  + kill fallback).
- cmd_start blocks up to 8s for the new process to bind :PORT before
  printing status, so 'forge update' no longer shows the misleading
  'pid=N / Port: not listening' snapshot of a half-booted agent.

The fixed 'sleep 0.5' / 'sleep 1' in restart/start are replaced by
explicit port-state polling at 100ms granularity — typical restart
is now <300ms but recovers cleanly when drain takes longer.
SymbolStar added a commit that referenced this pull request May 28, 2026
Phase 1 of the 'bot review → eventually auto-approve' plan from
today's chat. This PR only adds the comment-only stage so we get a
feel for false-positive rate before granting any 'gh pr review
--approve' power.

Posts (or updates) ONE sticky review comment per PR containing:

- diff stats + head sha
- red-line checks:
  * A-7.5: scream if 'forbidden' is added to forge_xiaof.py /
    web/xiaof.js (contract allows only unauth | rate_limited |
    upstream_failed | internal)
  * xiaof SSE event-order: warn when stream_to_sse_frames is touched
  * forge restart race regression: warn on reintroduced 'sleep 0.5/1'
    in bin/forge, hard-fail on removed wait_port_free/listening (the
    fix from PR #42 that just prevented prod from re-dying on update)
- 'needs human review' list for paths never eligible for future
  auto-approve: server.py, migrations/, bin/forge, .github/workflows/,
  *.plist, forge_xiaof.py

Uses the default GITHUB_TOKEN with pull-requests:write only.
Concurrency-cancels stale runs per PR so synchronizes don't pile up.
Never sets exit code on findings — the comment is the deliverable.

Co-authored-by: Judy <judy@symbolstar.dev>
@SymbolStar SymbolStar merged commit df16f26 into main May 28, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant