apps/cli: surface child-process crashes instead of silent 2-min timeout #3159

Merged
youknowriad merged 2 commits into trunk from claude/silly-blackwell-a94699
Apr 22, 2026

Conversation

@youknowriad
Contributor

Related issues

  • Related to debugging local site start failures that surface as "Failed to start WordPress server: Timeout waiting for ready message from WordPress server child" after a 2-minute silent hang.

How AI was used in this PR

Claude investigated a reproducible site-start failure by inspecting Studio logs, the daemon's pm2 log directory, and the child-process IPC flow. It identified the root cause (stale worktree node_modules + wordpress-server-child.mjs bundled against an incompatible @php-wasm/cli-util) and drafted this fix to make the failure actionable instead of silent. Reviewer focus: the new process-event subscription and the pre-listener race guard in waitForReadyMessage.

Proposed Changes

  • waitForReadyMessage now listens for process-event on the daemon bus and rejects with a helpful error when the child exits before sending ready.
  • The error message includes the path to the child's stderr log plus the tail (up to 4 KB) of its contents — so users see the real stack trace (e.g. a SyntaxError) instead of a generic timeout.
  • Added a race guard: immediately after attaching listeners we check isProcessRunning(processName). If the child has already exited between startProcess resolving and our listeners attaching, we surface the error right away rather than waiting for the timeout.
  • Unit tests cover both the exit-event path and the already-exited race path.
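The exit-versus-ready race described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeBus`, the payload shapes, and the `topic: 'ready'` field are assumptions standing in for the real daemon bus, and only the two error messages are taken from this description.

```typescript
// Hypothetical payload shapes; the real daemon bus types are not shown in this PR.
type ProcessEvent = { process: { pm_id: number }; event: string };
type ProcessMessage = { process: { pm_id: number }; topic: string };
type Handler = (payload: any) => void;

// Minimal stand-in for the daemon event bus (assumption, for illustration only).
class FakeBus {
  private handlers: Record<string, Handler[]> = {};
  on(name: string, handler: Handler): void {
    (this.handlers[name] = this.handlers[name] || []).push(handler);
  }
  off(name: string, handler: Handler): void {
    this.handlers[name] = (this.handlers[name] || []).filter((h) => h !== handler);
  }
  emit(name: string, payload: unknown): void {
    (this.handlers[name] || []).slice().forEach((h) => h(payload));
  }
  count(name: string): number {
    return (this.handlers[name] || []).length;
  }
}

// Settles on whichever comes first: a ready message, an exit event, or the timeout.
function waitForReady(bus: FakeBus, pmId: number, timeoutMs: number): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const cleanup = () => {
      clearTimeout(timer);
      bus.off('process-message', onMessage);
      bus.off('process-event', onEvent);
    };
    const onMessage = (msg: ProcessMessage) => {
      if (msg.process.pm_id === pmId && msg.topic === 'ready') {
        cleanup();
        resolve();
      }
    };
    const onEvent = (evt: ProcessEvent) => {
      // Only an exit of *our* child should end the wait early.
      if (evt.process.pm_id === pmId && evt.event === 'exit') {
        cleanup();
        reject(new Error('WordPress server child process exited before becoming ready.'));
      }
    };
    const timer = setTimeout(() => {
      cleanup();
      reject(new Error('Timeout waiting for ready message from WordPress server child'));
    }, timeoutMs);
    bus.on('process-message', onMessage);
    bus.on('process-event', onEvent);
  });
}
```

Removing both listeners and the timer in one `cleanup` keeps a late ready message or a second exit event from touching a promise that has already settled.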

Testing Instructions

Automated:

  • npm test -- apps/cli/lib/tests/wordpress-server-manager.test.ts — 8 tests, all green.

Manual repro (simulates a crashing child):

  1. npm run cli:build
  2. Inject a fatal error at the top of the built child:
    ```
    printf 'throw new SyntaxError("forced crash for testing");\n%s' \
      "$(cat apps/cli/dist/cli/wordpress-server-child.mjs)" \
      > apps/cli/dist/cli/wordpress-server-child.mjs.tmp \
      && mv apps/cli/dist/cli/wordpress-server-child.mjs.tmp \
      apps/cli/dist/cli/wordpress-server-child.mjs
    ```

  3. node apps/cli/dist/cli/main.mjs site start <SITE_ID>

Before: hangs ~2 minutes, then Failed to start WordPress server: Timeout waiting for ready message…
After: fails immediately with WordPress server child process exited before becoming ready. See …/studio-site-<id>-error.log plus the SyntaxError: forced crash for testing tail.

Restore with npm run cli:build.

Pre-merge Checklist

  • Have you checked for TypeScript, React or other console errors?

When the WordPress server child process crashes on startup (e.g. due to a
module import mismatch), `waitForReadyMessage` would block for the full
2-minute inactivity timeout and then throw a generic "Timeout waiting for
ready message" error, burying the real cause.

Listen for the child's exit event and for the pre-listener race, and reject
with the child's stderr log path plus its tail so users see the actual
error immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wpmobilebot
Collaborator

wpmobilebot commented Apr 20, 2026

📊 Performance Test Results

Comparing 804f42f vs trunk

app-size

| Metric | trunk | 804f42f | Diff | Change |
| --- | --- | --- | --- | --- |
| App Size (Mac) | 1491.74 MB | 1441.33 MB | -50.41 MB | 🟢 -3.4% |

site-editor

| Metric | trunk | 804f42f | Diff | Change |
| --- | --- | --- | --- | --- |
| load | 1629 ms | 1587 ms | -42 ms | ⚪ 0.0% |

site-startup

| Metric | trunk | 804f42f | Diff | Change |
| --- | --- | --- | --- | --- |
| siteCreation | 8106 ms | 8140 ms | +34 ms | ⚪ 0.0% |
| siteStartup | 4962 ms | 4956 ms | -6 ms | ⚪ 0.0% |

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

@youknowriad
Contributor Author

I'm curious what you think about this, @fredrikekelund. Do you think it's useful? I think the issue is probably more prominent in our dev environments, but I wonder if it can happen on prod too.

Contributor

@fredrikekelund fredrikekelund left a comment


I haven't tested, but the fundamental logic LGTM 👍

The only part that sticks out is the logs parsing, where I shared a suggestion for how we could make it more robust. Up to you if you'd rather do that here or in a future PR, @youknowriad

Comment on lines +187 to +191
exitHandler = ( event ) => {
    if ( event.process.pm_id === pmId && event.event === 'exit' ) {
        reject( buildChildExitedError( processName ) );
    }
};

Nice 👍

Comment on lines +203 to +208
void ( async () => {
    const running = await isProcessRunning( processName );
    if ( ! running ) {
        reject( buildChildExitedError( processName ) );
    }
} )();

Suggested change
- void ( async () => {
-     const running = await isProcessRunning( processName );
-     if ( ! running ) {
-         reject( buildChildExitedError( processName ) );
-     }
- } )();
+ isProcessRunning( processName )
+     .then( ( running ) => {
+         if ( ! running ) {
+             reject( buildChildExitedError( processName ) );
+         }
+     } )
+     .catch( reject );

Part syntax nitpicking, part being more explicit about what happens if we catch an exception here.
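To make the second half of that point concrete: in the `void ( async () => { … } )()` form, a rejection from `isProcessRunning` becomes an unhandled rejection and the outer promise keeps waiting for its timeout, whereas `.catch( reject )` forwards the failure to the caller. A minimal sketch, where `Probe` and `guardAgainstEarlyExit` are hypothetical names and the probe is injected so the behavior can be exercised without a daemon:

```typescript
// Hypothetical probe type standing in for isProcessRunning.
type Probe = (processName: string) => Promise<boolean>;

// Calls `reject` when the child is already gone *or* when the probe itself fails.
// The promise is returned only so the sketch can be awaited in a test.
function guardAgainstEarlyExit(
  probe: Probe,
  processName: string,
  reject: (reason: Error) => void
): Promise<void> {
  return probe(processName)
    .then((running) => {
      if (!running) {
        reject(new Error(`${processName} exited before becoming ready`));
      }
    })
    // A probe failure is surfaced instead of being silently dropped.
    .catch(reject);
}
```

With this shape, a probe that throws rejects the wait with the probe's own error, which is arguably more useful than a later generic timeout.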

Comment on lines +139 to +158
function readChildStderrTail( processName: string ): string {
    const errorLogPath = path.join( PROCESS_MANAGER_LOGS_DIR, `${ processName }-error.log` );
    try {
        const { size } = fs.statSync( errorLogPath );
        if ( size === 0 ) {
            return '';
        }
        const readBytes = Math.min( size, CHILD_STDERR_TAIL_BYTES );
        const buffer = Buffer.alloc( readBytes );
        const fd = fs.openSync( errorLogPath, 'r' );
        try {
            fs.readSync( fd, buffer, 0, readBytes, size - readBytes );
        } finally {
            fs.closeSync( fd );
        }
        return buffer.toString( 'utf8' ).trimEnd();
    } catch {
        return '';
    }
}

This can be helpful, or it can be misleading (because the logs include contents from prior invocations)…

The best thing would probably be for the process manager daemon to forward the stderr output for the current invocation. That's not trivial, but we could probably achieve it by having the process manager daemon keep a rolling buffer of each child's stderr stream and then send the buffer's contents along with the exit event in ProcessManagerDaemon::handleProcessExit.

Not something you have to do in this PR, but an agent can probably do an OK job of it.
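For illustration, the rolling-buffer idea could look roughly like the sketch below. `StderrTail` is a hypothetical class, not the daemon's implementation; the caps are constructor parameters, and the size cap counts characters as a simplification (a byte-accurate version would measure encoded length instead).

```typescript
// Keeps only the most recent stderr lines of one child, bounded by line count and size.
class StderrTail {
  private lines: string[] = [];
  private size = 0; // characters held in `lines`, counting one newline per line
  private pending = ''; // trailing text that has not been terminated by '\n' yet
  private maxLines: number;
  private maxChars: number;

  constructor(maxLines = 100, maxChars = 16 * 1024) {
    this.maxLines = maxLines;
    this.maxChars = maxChars;
  }

  // Feed a chunk from the child's stderr stream; chunks may split lines arbitrarily.
  push(chunk: string): void {
    const parts = (this.pending + chunk).split('\n');
    this.pending = parts.pop() ?? '';
    if (this.pending.length > this.maxChars) {
      this.pending = this.pending.slice(-this.maxChars);
    }
    for (const line of parts) {
      this.lines.push(line);
      this.size += line.length + 1;
    }
    // Drop the oldest lines until both caps hold again. A single line longer
    // than maxChars is dropped entirely in this sketch.
    while (
      this.lines.length > 0 &&
      (this.lines.length > this.maxLines || this.size > this.maxChars)
    ) {
      const dropped = this.lines.shift() as string;
      this.size -= dropped.length + 1;
    }
  }

  // Current tail, including any unterminated last line.
  read(): string {
    const all = this.pending ? this.lines.concat(this.pending) : this.lines;
    return all.join('\n');
  }
}
```

A daemon would hold one such buffer per child, reset it on each start, and attach `read()` to the exit event so the payload reflects only the current invocation.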

…vents

Addresses review feedback on #3159:

- Daemon keeps a bounded in-memory rolling buffer of each child's stderr
  (capped at 100 lines / 16 KB) so we know what the *current* invocation
  wrote, not whatever happens to be sitting in the rotated log file.
- `exit` events on the daemon event bus now carry `stderrTail` with that
  buffer's contents.
- Manager subscribes to `process-event`/`process-message` *before*
  starting the child process, buffering events until it knows the pmId.
  This removes the previous race guard and the file-tail reading
  entirely.

Tests cover:
- stderr tail flowing through the exit event payload,
- exit events that fire before startProcess resolves.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@youknowriad
Contributor Author

Thanks for the review! Addressed both suggestions in 804f42f:

  • The daemon now keeps a bounded rolling buffer (100 lines / 16 KB) of each child's stderr and attaches it as stderrTail on the exit event payload — no more leaning on rotated log files that include prior invocations.
  • The race guard / IIFE is gone entirely. subscribeForReadyOrExit attaches listeners to the daemon bus before startProcess runs and buffers events by process name until waitFor(pmId) is called, so the pre-listener exit case is handled by the same code path as a mid-wait exit. Made the implementation simpler overall (-10 net LOC).

Two new tests: stderr tail flowing end-to-end, and exit events firing while startProcess is still in flight.
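A rough sketch of the subscribe-before-start pattern described above, under stated assumptions: `DaemonBus` and `ChildEvent` are hypothetical stand-ins, the two real channels (`process-event` and `process-message`) are collapsed into one event type for brevity, and the timeout is omitted. Only the names `subscribeForReadyOrExit`, `waitFor`, and `stderrTail` come from this thread; their real signatures are not shown here.

```typescript
// Hypothetical, simplified event shape (one channel instead of two).
type ChildEvent = {
  process: { name: string; pm_id: number };
  event: 'ready' | 'exit';
  stderrTail?: string;
};
type Listener = (evt: ChildEvent) => void;

// Minimal stand-in for the daemon bus (assumption, for illustration only).
class DaemonBus {
  listeners: Listener[] = [];
  on(listener: Listener): void {
    this.listeners.push(listener);
  }
  off(listener: Listener): void {
    this.listeners = this.listeners.filter((l) => l !== listener);
  }
  emit(evt: ChildEvent): void {
    this.listeners.slice().forEach((l) => l(evt));
  }
}

// Call this *before* starting the child so no event can slip past.
function subscribeForReadyOrExit(bus: DaemonBus, processName: string) {
  const buffered: ChildEvent[] = [];
  // Until waitFor() is called, matching events are only buffered.
  let deliver: Listener = (evt) => {
    buffered.push(evt);
  };
  const listener: Listener = (evt) => {
    if (evt.process.name === processName) deliver(evt);
  };
  bus.on(listener);

  return {
    // Call this once startProcess has resolved and the pmId is known.
    waitFor(pmId: number): Promise<void> {
      return new Promise<void>((resolve, reject) => {
        const handle: Listener = (evt) => {
          if (evt.process.pm_id !== pmId) return;
          bus.off(listener);
          if (evt.event === 'ready') {
            resolve();
          } else {
            reject(new Error(`Child exited before becoming ready.\n${evt.stderrTail ?? ''}`));
          }
        };
        deliver = handle; // live events now go straight to the handler
        buffered.splice(0).forEach(handle); // replay anything that arrived early
      });
    },
  };
}
```

Because early and late exits pass through the same `handle` function, there is no separate race-guard branch to keep in sync with the main path.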

@youknowriad youknowriad merged commit d55cb81 into trunk Apr 22, 2026
10 checks passed
@youknowriad youknowriad deleted the claude/silly-blackwell-a94699 branch April 22, 2026 16:18


3 participants