
fix(selfhost): fast-fail codex on a stdin-read hang before the full timeout #3443

Merged
JSONbored merged 2 commits into main from fix/codex-first-output-deadline
Jul 5, 2026

Conversation

@JSONbored

Owner

Summary

  • Found via Sentry (issues GITTENSORY-K and GITTENSORY-M, ~3275 combined occurrences over 6 days; both are regressions of a previously fixed issue): codex_timeout: Reading prompt from stdin.... The codex exec subprocess hangs after printing only its own startup banner to stderr, never emits a single byte of JSONL, and is eventually SIGKILLed by the full configured timeout (CODEX_AI_TIMEOUT_MS, up to 600,000ms / 10 minutes at max effort). Separately, this deployment's AI_PROVIDER was just changed to codex,claude-code (codex primary, claude-code automatic failover; single-reviewer mode, not dual review). A hang that takes the full 10 minutes to detect stalls that failover for just as long per attempt, which isn't the "seamless" failover intended.
  • defaultSpawn (src/selfhost/ai.ts) previously raced only against the single full timeoutMs. This PR adds a second, much shorter, independent deadline (resolveCodexFirstOutputTimeoutMs, default 30s, env-configurable via CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS, clamped to [1_000, 120_000]ms): if neither stdout nor stderr has produced a single byte by that point (narrowed to stdout only in the follow-up commit, because codex's stderr banner is unconditional), the process is killed immediately with a distinct codex_stalled_no_output error. It never reuses codex_timeout, so the two failure modes stay separately countable in Sentry/logs. If output does start flowing, this timer is cleared on the first byte and cannot fire afterward; only the original full timeoutMs governs from then on, so a call that is genuinely just slow to complete (not hung) is unaffected.
  • The mechanism is a generic, optional SpawnFn option (firstOutputTimeoutMs), but it's wired up only from createCodexAi. createClaudeCodeAi's spawn call is untouched (confirmed no comparable prod-observed hang for Claude Code), so its path is byte-identical to before. A safety clamp (firstOutputTimeoutMs = min(resolveCodexFirstOutputTimeoutMs(env), timeoutMs - 1)) guarantees that a misconfigured low CODEX_AI_TIMEOUT_MS can never make the fast-fail deadline reach or exceed the outer safety net.
  • No issue filed — found via direct Sentry investigation, not a report.
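
The resolver and safety clamp described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/selfhost/ai.ts: the function and constant names are hypothetical, and falling back to the default for missing, non-numeric, or non-positive values is an assumed policy (the summary only says such inputs are tested). The default, floor, ceiling, and the `timeoutMs - 1` clamp come from the summary.

```typescript
// Hypothetical names; the values (30s default, 1s floor, 120s ceiling) are from the PR summary.
const FIRST_OUTPUT_DEFAULT_MS = 30_000;
const FIRST_OUTPUT_MIN_MS = 1_000;
const FIRST_OUTPUT_MAX_MS = 120_000;

type Env = Record<string, string | undefined>;

// Parse CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS and clamp it into [min, max].
// ASSUMPTION: missing, non-numeric, or non-positive values fall back to the default.
function resolveFirstOutputTimeoutMsSketch(env: Env): number {
  const parsed = Number(env.CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS);
  if (!Number.isFinite(parsed) || parsed <= 0) return FIRST_OUTPUT_DEFAULT_MS;
  return Math.min(FIRST_OUTPUT_MAX_MS, Math.max(FIRST_OUTPUT_MIN_MS, Math.floor(parsed)));
}

// Safety clamp from the summary: the fast-fail deadline stays strictly below
// the full timeout, even when CODEX_AI_TIMEOUT_MS is configured very low.
function effectiveFirstOutputTimeoutMsSketch(env: Env, timeoutMs: number): number {
  return Math.min(resolveFirstOutputTimeoutMsSketch(env), timeoutMs - 1);
}
```

With a full timeout of 10 seconds and no override, the effective first-output deadline under this sketch would be 9,999ms rather than 30 seconds, so the outer safety net is never reached first.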

Scope

  • The PR title follows type(scope): short summary Conventional Commit format, for example fix(api): restore profile access checks.
  • This PR is focused and does not mix unrelated backend, UI, MCP, docs, dependency, and deploy changes.
  • This follows CONTRIBUTING.md and does not reintroduce GitHub Pages, VitePress, site/, or CNAME.
  • I linked an issue, or this is small enough that the summary explains why an issue is not needed.

Validation

  • git diff --check
  • npm run typecheck (clean)
  • npx vitest run test/unit/selfhost-ai.test.ts — 120/120 passing
  • npm run test:workers / npm run build:mcp / npm run test:mcp-pack / npm run ui:openapi:check / npm run ui:build — not run individually this PR; no worker/MCP/OpenAPI/UI surface touched.
  • New or changed behavior has unit/integration tests for new branches, fallback paths, and sanitizer boundaries — resolver unit tests (default/effort-independence/clamp bounds/NaN/zero); a stub-spawn test asserting the distinct codex_stalled_no_output message and correct firstOutputTimeoutMs < timeoutMs clamping; three real-subprocess scenarios (fake CLI scripts on PATH, not mocked timers, mirroring this file's existing defaultSpawn-driving pattern): a process producing zero output is killed at the fast deadline not the full one; a process that outputs quickly and completes normally is byte-identical to today; a process that outputs within the fast window but completes slowly afterward is governed only by the full timeout, never prematurely killed. Also covers the child.on("error") handler's new conditional-timer-clearing branch (both present/absent cases).
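
The real-subprocess scenarios listed above rely on fake CLI scripts placed on PATH. A helper for that setup might look like the sketch below. Everything here is hypothetical: the helper names, the shell bodies, and the JSONL payloads are placeholders, and the actual tests in test/unit/selfhost-ai.test.ts may be structured differently. It assumes a POSIX shell.

```typescript
import { chmodSync, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { delimiter, join } from "node:path";

// Hypothetical helper: writes an executable shell script called `name` into a
// fresh temp directory and returns that directory (POSIX only).
function makeFakeCli(name: string, shellBody: string): string {
  const dir = mkdtempSync(join(tmpdir(), "fake-cli-"));
  const file = join(dir, name);
  writeFileSync(file, `#!/bin/sh\n${shellBody}\n`);
  chmodSync(file, 0o755);
  return dir;
}

// Returns a copy of `base` whose PATH resolves the fake script before any real binary.
function envWithFakeCli(
  dir: string,
  base: Record<string, string | undefined>,
): Record<string, string | undefined> {
  return { ...base, PATH: `${dir}${delimiter}${base.PATH ?? ""}` };
}

// Placeholder shell bodies for the three scenarios described in the validation notes.
const fakeCliScenarios = {
  hangsSilently: "sleep 600",
  fastAndDone: `echo '{"type":"done"}'`,
  fastFirstByteSlowFinish: `echo '{"type":"started"}'; sleep 5; echo '{"type":"done"}'`,
};
```

A test could then call `makeFakeCli("codex", fakeCliScenarios.hangsSilently)`, pass `envWithFakeCli(dir, process.env)` to the spawn under test, and assert that the call ends near the fast deadline rather than the full one.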

Safety

  • No secrets, wallet details, hotkeys, coldkeys, user PATs, private keys, raw trust scores, private rankings, or private maintainer evidence are exposed.
  • Public GitHub text stays sanitized, low-noise, and does not imply compensation guarantees or optimization tactics.
  • Auth, cookie, CORS, GitHub App, Cloudflare, or session changes include negative-path tests. — N/A.
  • API/OpenAPI/MCP behavior is updated and tested where needed. — N/A, no API/OpenAPI/MCP surface changed.
  • UI changes use live API data or real empty/error/loading states, not production mock/demo fallbacks. — N/A, no UI change.
  • Visible UI changes include a UI Evidence section below. — N/A, no visible UI change.
  • Public docs/changelogs are updated where needed. — N/A, internal reliability fix; CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS is a new optional env var with a sane default, not a behavior change an operator must opt into.

Notes

  • One of five fixes from a live stack-health pass (Sentry + Loki audit on the self-hosted deployment) — this is the highest-impact one (dominant error volume). See the sibling PRs for the RAG chunk-cap indexing priority, PR-publish silent-drop retry, REES safeCodeSpan TypeError, and Sentry release-validation strict-mode fixes.

codex exec occasionally hangs having printed only its "Reading prompt
from stdin..." startup banner, with no further stdout/stderr bytes
ever arriving, until the full CODEX_AI_TIMEOUT_MS (up to 600s at max
effort) elapses and it is SIGKILLed. Waiting out the full timeout to
detect a dead subprocess stalls the codex -> claude-code fallback
chain for up to 10 minutes per attempt.

Add a separate, much shorter "first output" deadline to defaultSpawn:
if neither stdout nor stderr has produced a single byte within
firstOutputTimeoutMs, kill the process early and resolve with a
distinguishable stalledNoOutput flag. createCodexAi surfaces this as
codex_stalled_no_output, kept separate from codex_timeout so the two
failure modes are independently observable in logs/Sentry. The full
timeoutMs remains the unchanged outer safety net for output that
starts flowing but stalls later.

The option is generic on SpawnFn but only codex wires it up, via the
new CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS env var (default 30s, independent
of CODEX_AI_EFFORT since a slower completion does not imply a slower
first byte). Claude Code's spawn path is unaffected.
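
As a rough illustration of the two-deadline race this commit message describes, here is a minimal sketch built on Node's `child_process.spawn`. It is not the project's defaultSpawn: the function name and result shape are hypothetical, stdin handling is omitted, and it counts bytes on either stream as liveness, as this commit message states (the later commit in this PR narrows that to stdout).

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape for this sketch.
interface SpawnResultSketch {
  stdout: string;
  stderr: string;
  code: number | null;
  timedOut: boolean; // killed by the full timeout
  stalledNoOutput: boolean; // killed by the first-output deadline
}

function spawnWithFirstOutputDeadline(
  command: string,
  args: string[],
  opts: { timeoutMs: number; firstOutputTimeoutMs?: number },
): Promise<SpawnResultSketch> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });
    let stdout = "";
    let stderr = "";
    let timedOut = false;
    let stalledNoOutput = false;

    // Outer safety net: unchanged full timeout.
    const fullTimer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, opts.timeoutMs);

    // Optional fast-fail deadline, armed only when the caller opts in.
    let firstOutputTimer: ReturnType<typeof setTimeout> | undefined;
    if (opts.firstOutputTimeoutMs !== undefined) {
      firstOutputTimer = setTimeout(() => {
        stalledNoOutput = true;
        child.kill("SIGKILL");
      }, opts.firstOutputTimeoutMs);
    }
    const clearFirstOutputTimer = (): void => {
      if (firstOutputTimer !== undefined) {
        clearTimeout(firstOutputTimer);
        firstOutputTimer = undefined;
      }
    };

    // The first byte on either stream disarms the fast-fail deadline for good.
    child.stdout.on("data", (chunk: Buffer) => {
      clearFirstOutputTimer();
      stdout += chunk.toString("utf8");
    });
    child.stderr.on("data", (chunk: Buffer) => {
      clearFirstOutputTimer();
      stderr += chunk.toString("utf8");
    });

    // Both exit paths clear whatever timers are still armed.
    child.on("error", (err) => {
      clearTimeout(fullTimer);
      clearFirstOutputTimer();
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(fullTimer);
      clearFirstOutputTimer();
      resolve({ stdout, stderr, code, timedOut, stalledNoOutput });
    });
  });
}
```

A caller in the style of createCodexAi could map `stalledNoOutput` to a codex_stalled_no_output error and `timedOut` to codex_timeout, which keeps the two failure modes separately countable.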
@superagent-security


Superagent didn't find any vulnerabilities or security issues in this PR.

@codecov

codecov Bot commented Jul 5, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.00%. Comparing base (89854d7) to head (d32e831).
⚠️ Report is 14 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3443   +/-   ##
=======================================
  Coverage   92.99%   93.00%           
=======================================
  Files         293      293           
  Lines       30960    30975   +15     
  Branches    11290    11297    +7     
=======================================
+ Hits        28790    28807   +17     
+ Misses       1514     1512    -2     
  Partials      656      656           
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/selfhost/ai.ts | 98.59% <100.00%> (+0.53%) ⬆️ |

@gittensory-orb gittensory-orb Bot added the gittensor:bug label (Gittensor-scored bug fix, scores a 0.5x multiplier) Jul 5, 2026
@gittensory-orb

gittensory-orb Bot commented Jul 5, 2026


Warning

🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨🟨

⏸️ Gittensory review result - manual review recommended

Review updated: 2026-07-05 07:39:18 UTC

3 files · 1 AI reviewer · no blockers · readiness 93/100 · CI green · clean

⏸️ Suggested Action - Manual Review

  • Touches a guarded path — held for manual review: This PR changes guardrail-protected path(s): src/selfhost/ai.ts (matched src/selfhost/**).

Review summary
The change adds a Codex-only first-stdout deadline, plumbs it through the self-host spawn path, and documents the new env var in the generated UI reference. The stderr-only hang scenario is covered directly, and the distinct `codex_stalled_no_output` error preserves observability separate from the existing full timeout. I do not see a reachable correctness defect in the provided diff, but the implementation is comment-heavy and would benefit from extracting the timeout constants so the policy is easier to audit.

Nits — 6 non-blocking
  • nit: src/selfhost/ai.ts:151 repeats the default, floor, and ceiling values as raw numeric literals (`30_000`, `1_000`, `120_000`); define named constants near the existing CLI timeout policy so future changes do not drift between code, comments, and tests.
  • nit: src/selfhost/ai.ts:151 and src/selfhost/ai.ts:561 carry several long incident-history comments that duplicate each other; keep the operational rationale once and make the implementation comments describe only the local invariant being enforced.
  • nit: test/unit/selfhost-ai.test.ts:1038 adds several real-subprocess tests with repeated fake-cli setup; extract a small helper for creating a temporary `codex` binary so the individual regression cases stay focused on their timing behavior.
  • src/selfhost/ai.ts:151: introduce constants such as `CODEX_FIRST_OUTPUT_TIMEOUT_DEFAULT_MS`, `CODEX_FIRST_OUTPUT_TIMEOUT_MIN_MS`, and `CODEX_FIRST_OUTPUT_TIMEOUT_MAX_MS`, then use them in `resolveCodexFirstOutputTimeoutMs` and the tests.
  • src/selfhost/ai.ts:561: shorten the timer comment to the invariant: stderr banners do not clear this deadline; the first stdout byte does; close/error clears armed timers.
  • Touches a guarded path — held for manual review — A maintainer must review and merge this change.
| Signal | Result | Evidence |
| --- | --- | --- |
| Code review | ✅ No blockers | 1 reviewer |
| Linked issue | ⚠️ Missing | No linked issue or no-issue rationale found. |
| Related work | ✅ No active overlap found | No same-issue or scoped active PR overlap found. |
| Change scope | ✅ 20/20 | Low review scope from cached public metadata (no linked issue context). |
| Validation posture | ✅ 25/25 | PR body includes validation/test evidence. |
| Contributor workload | ✅ 10/10 | Author activity: 56 registered-repo PR(s), 46 merged, 416 issue(s). |
| Contributor context | ✅ Confirmed | Gittensor contributor JSONbored; Gittensor profile; 56 PR(s), 416 issue(s). |
| Gate result | ⚠️ Not blocking | Advisory; not blocking this PR. |
Review context
  • Author: JSONbored
  • Role context: owner (maintainer lane)
  • Public audience mode: oss maintainer
  • Lane context: Repository registration is not available in the local Gittensory cache.
  • Public profile languages: not available
  • Official Gittensor activity: 56 PR(s), 416 issue(s).
  • PR-specific overlap: none found.
Contributor next steps
  • Treat this as maintainer-lane context rather than normal contributor-lane activity.
  • Explain no-issue PR.
  • No action.
  • Link the issue being solved, or explicitly explain why this is a no-issue PR.
Signal definitions
  • Related work = same linked issue, overlapping active PRs, or title/path similarity.
  • Change scope = cached public metadata such as size labels, draft state, and review-burden hints.
  • Validation posture = whether the PR provides enough public validation/test evidence for maintainer review.
  • Contributor workload = public contributor activity and cleanup pressure, not a repo-wide quality failure.
  • Contributor context = public GitHub/Gittensor identity context; non-Gittensor status is not a blocker.

🟩 Safe / merged · 🟦 Advisory · 🟨 Held for review · 🟥 Blocked / closed



Checked by Gittensory, a quiet PR intelligence layer for OSS maintainers.


@gittensory-orb gittensory-orb Bot added the manual-review label (Gittensor contributor context) Jul 5, 2026
… stderr

The fast-fail timer was cleared by data on either stream, but codex's
own "Reading prompt from stdin..." startup banner is unconditional
stderr output on every invocation -- it would satisfy the deadline
immediately and never catch the exact hang (stderr banner, then
silence forever) this fix was written for. Real JSONL progress from
`codex --json` always lands on stdout, so only stdout now counts as
liveness. Also regenerates the stale selfhost env-reference doc for
the new CODEX_AI_FIRST_OUTPUT_TIMEOUT_MS var.
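
The difference between the two liveness policies can be shown with a small pure model of the timer. The function and type names below are hypothetical and exist only to illustrate the reasoning in this commit message: with a stderr banner followed by silence, an "either stream" policy never fires, while a "stdout only" policy does.

```typescript
type StreamName = "stdout" | "stderr";

interface OutputEvent {
  atMs: number; // time since spawn at which a chunk arrived
  stream: StreamName;
}

// Returns true when the first-output deadline would fire, i.e. no chunk arrived
// on a liveness stream strictly before the deadline.
function firstOutputDeadlineFires(
  events: OutputEvent[],
  deadlineMs: number,
  livenessStreams: StreamName[],
): boolean {
  const sawLiveness = events.some(
    (e) => e.atMs < deadlineMs && livenessStreams.includes(e.stream),
  );
  return !sawLiveness;
}

// The hang this PR targets: startup banner on stderr, then nothing.
const bannerThenSilence: OutputEvent[] = [{ atMs: 5, stream: "stderr" }];

// A healthy run: banner on stderr, then JSONL on stdout (timings are made up).
const bannerThenJsonl: OutputEvent[] = [
  { atMs: 5, stream: "stderr" },
  { atMs: 1_200, stream: "stdout" },
];

// Either-stream policy misses the hang; stdout-only policy catches it.
const missedByEitherStream = !firstOutputDeadlineFires(bannerThenSilence, 30_000, ["stdout", "stderr"]);
const caughtByStdoutOnly = firstOutputDeadlineFires(bannerThenSilence, 30_000, ["stdout"]);
const healthyRunUnaffected = !firstOutputDeadlineFires(bannerThenJsonl, 30_000, ["stdout"]);
```

In this model all three constants evaluate to true, which matches the argument that only stdout should count as liveness.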
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jul 5, 2026


Deploying with Cloudflare Workers


| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! View logs | gittensory-ui | d32e831 | Commit Preview URL, Branch Preview URL | Jul 05 2026, 07:35 AM |

@JSONbored JSONbored merged commit 4de33af into main Jul 5, 2026
13 checks passed
@JSONbored JSONbored deleted the fix/codex-first-output-deadline branch July 5, 2026 07:49

Labels

gittensor:bug Gittensor-scored bug fix — scores a 0.5x multiplier. manual-review Gittensor contributor context
