apps/cli: surface child-process crashes instead of silent 2-min timeout#3159
apps/cli: surface child-process crashes instead of silent 2-min timeout#3159youknowriad merged 2 commits intotrunkfrom
Conversation
When the WordPress server child process crashes on startup (e.g. due to a module import mismatch), `waitForReadyMessage` would block for the full 2-minute inactivity timeout and then throw a generic "Timeout waiting for ready message" error, burying the real cause. Listen for the child's exit event and for the pre-listener race, and reject with the child's stderr log path plus its tail so users see the actual error immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📊 Performance Test ResultsComparing 804f42f vs trunk app-size
site-editor
site-startup
Results are median values from multiple test runs. Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff) |
|
I'm curious what you think about this @fredrikekelund Do you think it's useful? I think the issue is probably more prominent in our dev environments but I wonder if it can happen on prod too. |
fredrikekelund
left a comment
There was a problem hiding this comment.
I haven't tested, but the fundamental logic LGTM 👍
The only part that sticks out is the logs parsing, where I shared a suggestion for how we could make it more robust. Up to you if you'd rather do that here or in a future PR, @youknowriad
| exitHandler = ( event ) => { | ||
| if ( event.process.pm_id === pmId && event.event === 'exit' ) { | ||
| reject( buildChildExitedError( processName ) ); | ||
| } | ||
| }; |
| void ( async () => { | ||
| const running = await isProcessRunning( processName ); | ||
| if ( ! running ) { | ||
| reject( buildChildExitedError( processName ) ); | ||
| } | ||
| } )(); |
There was a problem hiding this comment.
| void ( async () => { | |
| const running = await isProcessRunning( processName ); | |
| if ( ! running ) { | |
| reject( buildChildExitedError( processName ) ); | |
| } | |
| } )(); | |
| isProcessRunning( processName ) | |
| .then( ( running ) => { | |
| if ( ! running ) { | |
| reject( buildChildExitedError( processName ) ); | |
| } | |
| } ) | |
| .catch( reject ); |
Part syntax nitpicking, part being more explicit about what happens if we catch an exception here.
| function readChildStderrTail( processName: string ): string { | ||
| const errorLogPath = path.join( PROCESS_MANAGER_LOGS_DIR, `${ processName }-error.log` ); | ||
| try { | ||
| const { size } = fs.statSync( errorLogPath ); | ||
| if ( size === 0 ) { | ||
| return ''; | ||
| } | ||
| const readBytes = Math.min( size, CHILD_STDERR_TAIL_BYTES ); | ||
| const buffer = Buffer.alloc( readBytes ); | ||
| const fd = fs.openSync( errorLogPath, 'r' ); | ||
| try { | ||
| fs.readSync( fd, buffer, 0, readBytes, size - readBytes ); | ||
| } finally { | ||
| fs.closeSync( fd ); | ||
| } | ||
| return buffer.toString( 'utf8' ).trimEnd(); | ||
| } catch { | ||
| return ''; | ||
| } | ||
| } |
There was a problem hiding this comment.
This can be helpful, or it can be misleading (because the logs include contents from prior invocations)…
The best thing would probably be for the process manager daemon to forward the stderr output for the current invocation. That's not trivial, but we could probably achieve it by having the process manager daemon keep a rolling buffer of each child's stderr stream and then send the buffer's contents along with the exit event in ProcessManagerDaemon::handleProcessExit.
Not something you have to do in this PR, but an agent can probably do an OK job of it.
…vents Addresses review feedback on #3159: - Daemon keeps a bounded in-memory rolling buffer of each child's stderr (capped at 100 lines / 16 KB) so we know what the *current* invocation wrote, not whatever happens to be sitting in the rotated log file. - `exit` events on the daemon event bus now carry `stderrTail` with that buffer's contents. - Manager subscribes to `process-event`/`process-message` *before* starting the child process, buffering events until it knows the pmId. This removes the previous race guard and the file-tail reading entirely. Tests cover: - stderr tail flowing through the exit event payload, - exit events that fire before startProcess resolves. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the review! Addressed both suggestions in 804f42f:
Two new tests: stderr tail flowing end-to-end, and exit events firing while |
Related issues
How AI was used in this PR
Claude investigated a reproducible site-start failure by inspecting Studio logs, the daemon's pm2 log directory, and the child-process IPC flow. It identified the root cause (stale worktree
node_modules+wordpress-server-child.mjsbundled against an incompatible@php-wasm/cli-util) and drafted this fix to make the failure actionable instead of silent. Reviewer focus: the newprocess-eventsubscription and the pre-listener race guard inwaitForReadyMessage.Proposed Changes
waitForReadyMessagenow listens forprocess-eventon the daemon bus and rejects with a helpful error when the child exits before sendingready.SyntaxError) instead of a generic timeout.isProcessRunning(processName). If the child has already exited betweenstartProcessresolving and our listeners attaching, we surface the error right away rather than waiting for the timeout.Testing Instructions
Automated:
npm test -- apps/cli/lib/tests/wordpress-server-manager.test.ts— 8 tests, all green.Manual repro (simulates a crashing child):
npm run cli:build```
printf 'throw new SyntaxError("forced crash for testing");\n%s' \
"$(cat apps/cli/dist/cli/wordpress-server-child.mjs)" \
node apps/cli/dist/cli/main.mjs site start <SITE_ID>Before: hangs ~2 minutes, then
Failed to start WordPress server: Timeout waiting for ready message…After: fails immediately with
WordPress server child process exited before becoming ready. See …/studio-site-<id>-error.logplus theSyntaxError: forced crash for testingtail.Restore with
npm run cli:build.Pre-merge Checklist