fix(selfhost): fast-fail codex on a stdin-read hang before the full timeout #3443
codex exec occasionally hangs having printed only its "Reading prompt from stdin..." startup banner, with no further stdout/stderr bytes ever arriving, until the full CODEX_AI_TIMEOUT_MS (up to 600s at max effort) elapses and it is SIGKILLed. Waiting out the full timeout to detect a dead subprocess stalls the codex -> claude-code fallback chain for up to 10 minutes per attempt. Add a separate, much shorter "first output" deadline to defaultSpawn: if neither stdout nor stderr has produced a single byte within firstOutputTimeoutMs, kill the process early and resolve with a distinguishable stalledNoOutput flag. createCodexAi surfaces this as codex_stalled_no_output, kept separate from codex_timeout so the two failure modes are independently observable in logs/Sentry. The full timeoutMs remains the unchanged outer safety net for output that starts flowing but stalls later. The option is generic on SpawnFn but only codex wires it up, via the new CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS env var (default 30s, independent of CODEX_AI_EFFORT since a slower completion does not imply a slower first byte). Claude Code's spawn path is unaffected.
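The deadline resolution described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/selfhost/ai.ts`: the function names `sketchResolveFirstOutputTimeoutMs` and `sketchEffectiveFirstOutputTimeoutMs` are hypothetical, the `[1_000, 120_000]` bounds and the `timeoutMs - 1` cap are taken from the PR summary, and falling back to the default on a missing or non-numeric value is an assumption.

```typescript
// Hypothetical sketch; only the env var name, the 30s default, the
// [1_000, 120_000] bounds and the `timeoutMs - 1` cap come from the PR text.
const DEFAULT_FIRST_OUTPUT_TIMEOUT_MS = 30_000;
const MIN_FIRST_OUTPUT_TIMEOUT_MS = 1_000;
const MAX_FIRST_OUTPUT_TIMEOUT_MS = 120_000;

type Env = Record<string, string | undefined>;

// Reads CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS and clamps it into the allowed range.
// Assumption: unset, non-numeric, or non-positive values fall back to the default.
function sketchResolveFirstOutputTimeoutMs(env: Env): number {
  const raw = Number(env.CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS);
  if (!Number.isFinite(raw) || raw <= 0) {
    return DEFAULT_FIRST_OUTPUT_TIMEOUT_MS;
  }
  return Math.min(
    MAX_FIRST_OUTPUT_TIMEOUT_MS,
    Math.max(MIN_FIRST_OUTPUT_TIMEOUT_MS, Math.trunc(raw)),
  );
}

// Keeps the fast-fail deadline strictly below the outer timeout, so a low
// CODEX_AI_TIMEOUT_MS cannot invert the order of the two timers.
function sketchEffectiveFirstOutputTimeoutMs(env: Env, timeoutMs: number): number {
  return Math.min(sketchResolveFirstOutputTimeoutMs(env), timeoutMs - 1);
}
```

Note that the effort level never enters the calculation, which matches the stated reasoning that a slower completion does not imply a slower first byte.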
Superagent didn't find any vulnerabilities or security issues in this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff, main vs. #3443:

| Metric | main | #3443 | +/- |
|---|---|---|---|
| Coverage | 92.99% | 93.00% | |
| Files | 293 | 293 | |
| Lines | 30960 | 30975 | +15 |
| Branches | 11290 | 11297 | +7 |
| Hits | 28790 | 28807 | +17 |
| Misses | 1514 | 1512 | -2 |
| Partials | 656 | 656 | |
Warning 🟨 ⏸️ Gittensory review result: manual review recommended
Review updated: 2026-07-05 07:39:18 UTC
⏸️ Suggested Action: Manual Review
Review summary: Nits, 6 non-blocking
… stderr

The fast-fail timer was cleared by data on either stream, but codex's own "Reading prompt from stdin..." startup banner is unconditional stderr output on every invocation -- it would satisfy the deadline immediately and never catch the exact hang (stderr banner, then silence forever) this fix was written for. Real JSONL progress from `codex --json` always lands on stdout, so only stdout now counts as liveness.

Also regenerates the stale selfhost env-reference doc for the new CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS var.
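A rough sketch of the two-deadline race with stdout-only liveness, as this commit describes it. This is an illustration under assumptions, not the actual `defaultSpawn`: the names `sketchSpawn`, `SketchResult`, and `sketchClassify` are hypothetical, and stdin is ignored here for brevity even though the real codex invocation reads its prompt from stdin.

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape; `stalledNoOutput` mirrors the flag named in the PR.
interface SketchResult {
  code: number | null;
  stdout: string;
  stderr: string;
  timedOut: boolean;
  stalledNoOutput: boolean;
}

function sketchSpawn(
  command: string,
  args: string[],
  opts: { timeoutMs: number; firstOutputTimeoutMs?: number },
): Promise<SketchResult> {
  return new Promise((resolve) => {
    // Assumption: stdin is ignored in this sketch.
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });
    let stdout = "";
    let stderr = "";
    let timedOut = false;
    let stalledNoOutput = false;
    let settled = false;

    // Outer safety net: unchanged, always armed.
    const fullTimer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, opts.timeoutMs);

    // Fast-fail deadline: armed only when the caller opts in.
    let firstOutputTimer: ReturnType<typeof setTimeout> | undefined;
    if (opts.firstOutputTimeoutMs !== undefined) {
      firstOutputTimer = setTimeout(() => {
        stalledNoOutput = true;
        child.kill("SIGKILL");
      }, opts.firstOutputTimeoutMs);
    }
    const clearFirstOutputTimer = (): void => {
      if (firstOutputTimer !== undefined) {
        clearTimeout(firstOutputTimer);
        firstOutputTimer = undefined;
      }
    };

    const finish = (code: number | null): void => {
      if (settled) return;
      settled = true;
      clearTimeout(fullTimer);
      clearFirstOutputTimer();
      resolve({ code, stdout, stderr, timedOut, stalledNoOutput });
    };

    // Only stdout counts as liveness; the stderr banner must not clear the timer.
    child.stdout.on("data", (chunk: Buffer) => {
      clearFirstOutputTimer();
      stdout += chunk.toString();
    });
    child.stderr.on("data", (chunk: Buffer) => {
      stderr += chunk.toString();
    });
    child.on("error", () => finish(null));
    child.on("close", (code) => finish(code));
  });
}

// Keeps the two failure modes separately reportable.
function sketchClassify(r: SketchResult): "codex_stalled_no_output" | "codex_timeout" | undefined {
  if (r.stalledNoOutput) return "codex_stalled_no_output";
  if (r.timedOut) return "codex_timeout";
  return undefined;
}
```

Because the fast-fail timer is cleared on the first stdout byte and never re-armed, a process that starts emitting output and then completes slowly is governed only by the full timeout in this sketch.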
Deploying with
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | gittensory-ui | d32e831 | Commit Preview URL Branch Preview URL | Jul 05 2026, 07:35 AM |
Summary
- `codex_timeout: Reading prompt from stdin...`. The `codex exec` subprocess hangs having printed only its own startup banner to stderr and never emitting a single byte of JSONL, before being SIGKILLed by the full configured timeout (`CODEX_AI_TIMEOUT_MS`, up to 600,000ms / 10 minutes at max effort). This deployment's `AI_PROVIDER` was also just changed to `codex,claude-code` (codex primary, claude-code automatic failover, single-reviewer mode, not dual review) as a separate config change, but a hang that takes the full 10 minutes to detect stalls that fallback for just as long per attempt, which isn't the "seamless" failover intended.
- `defaultSpawn` (`src/selfhost/ai.ts`) previously raced only against the one full `timeoutMs`. Added a second, much shorter, independent deadline (`resolveCodexFirstOutputTimeoutMs`, default 30s, env-configurable via `CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS`, clamped `[1_000, 120_000]` ms): if neither stdout nor stderr has produced a single byte by that point, the process is killed immediately with a distinct `codex_stalled_no_output` error, never reusing `codex_timeout`, so the two failure modes stay separately countable in Sentry/logs. If output does start flowing, this timer is cleared on the first byte and never fires again; only the original full `timeoutMs` governs from then on, so a call that's genuinely just slow to complete (not hung) is completely unaffected.
- `SpawnFn` option (`firstOutputTimeoutMs`), but it's wired up only from `createCodexAi`; `createClaudeCodeAi`'s spawn call is untouched (confirmed no comparable prod-observed hang for Claude Code), so its path is byte-identical to before. A safety clamp (`firstOutputTimeoutMs = min(resolveCodexFirstOutputTimeoutMs(env), timeoutMs - 1)`) guarantees a misconfigured low `CODEX_AI_TIMEOUT_MS` can never make the fast-fail deadline reach or exceed the outer safety net.

Scope
- `type(scope): short summary` Conventional Commit format, for example `fix(api): restore profile access checks`.
- `CONTRIBUTING.md` and does not reintroduce GitHub Pages, VitePress, `site/`, or `CNAME`.

Validation
- `git diff --check`
- `npm run typecheck` (clean)
- `npx vitest run test/unit/selfhost-ai.test.ts`: 120/120 passing
- `npm run test:workers` / `npm run build:mcp` / `npm run test:mcp-pack` / `npm run ui:openapi:check` / `npm run ui:build`: not run individually this PR; no worker/MCP/OpenAPI/UI surface touched.
- `codex_stalled_no_output` message and correct `firstOutputTimeoutMs < timeoutMs` clamping; three real-subprocess scenarios (fake CLI scripts on `PATH`, not mocked timers, mirroring this file's existing `defaultSpawn`-driving pattern): a process producing zero output is killed at the fast deadline, not the full one; a process that outputs quickly and completes normally is byte-identical to today; a process that outputs within the fast window but completes slowly afterward is governed only by the full timeout, never prematurely killed. Also covers the `child.on("error")` handler's new conditional-timer-clearing branch (both present/absent cases).

Safety
- `UI Evidence` section below. N/A, no visible UI change.
- `CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS` is a new optional env var with a sane default, not a behavior change an operator must opt into.

Notes
`safeCodeSpan` TypeError, and Sentry release-validation strict-mode fixes.