fix: prevent context blowup by preserving previous_response_id on bridge recovery - IMPORTANT FIX FOR CONTEXT BLOWING UP #397
Conversation
…dge recovery

When an upstream websocket session is lost (session.closed) or a request fails mid-flight, the HTTP bridge recovery paths were returning 400 previous_response_not_found errors. This tells the Codex CLI that the server-side conversation state is gone, causing it to drop previous_response_id and resend the entire conversation history as input. The result is catastrophic for context window usage: per-turn input growth jumps from ~2-3K tokens (with previous_response_id) to ~70K tokens (full conversation replay), filling the 876K context window in just 17 turns instead of 200+.

Root cause analysis:
- _submit_http_bridge_request: when session.closed is True, the new code attempts to reconnect the upstream websocket. But the reconnected session is a fresh server-side session that doesn't recognise the old previous_response_id. When the upstream returns 'not found', the bridge marks the request as previous_response_not_found (400) instead of upstream_unavailable (502).
- _retry_http_bridge_request_on_fresh_upstream: when called with send_request=True and previous_response_id is present, immediately marks the request as previous_response_not_found without even trying.
- _retry_http_bridge_precreated_request: same pattern.

Fix:
- When session.closed and previous_response_id is present, raise 502 immediately (matching pre-v1.12.0 behaviour) so the client retries with previous_response_id intact on a new session.
- In the error handler, always raise 502 instead of 400 previous_response_not_found so the client treats the failure as transient and retries properly.
- In the retry helpers, return False without marking previous_response_not_found so callers can raise a retriable 502.

Before: 4 user messages -> 853K tokens in 17 turns -> compaction
After: 4 user messages -> ~80K tokens in 17 turns -> no compaction
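For a sense of scale, a rough back-of-the-envelope model of the numbers above (illustrative only; it uses a flat per-turn average, not a measurement from this code):

```python
# Rough model of context-window consumption using the figures quoted above.
# A flat per-turn average is a simplification: the broken path actually grows
# each turn because the full history is replayed, which is why ~17 turns were
# observed rather than the ~12 this naive division predicts.
CONTEXT_WINDOW = 876_000      # effective gpt-5.4 window, in tokens
HEALTHY_PER_TURN = 2_300      # incremental input with previous_response_id
BROKEN_PER_TURN = 70_000      # full-history replay after 400 previous_response_not_found

def turns_until_full(tokens_per_turn: int) -> int:
    """Turns that fit in the window at a flat per-turn input cost."""
    return CONTEXT_WINDOW // tokens_per_turn

print(turns_until_full(HEALTHY_PER_TURN))  # 380 -> the "200+ turns" healthy regime
print(turns_until_full(BROKEN_PER_TURN))   # 12  -> window exhausted almost immediately
```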
Proves the bug exists on broken code and is fixed:
BROKEN (pre-fix):
- test_mid_request_failure_with_previous_response_id_raises_502: FAIL
(returns 400 previous_response_not_found instead of 502)
- test_retry_with_previous_response_id_returns_false_without_marking_error: FAIL
(sets error_code_override to previous_response_not_found)
FIXED:
- All 7 tests PASS
- 44 existing bridge tests still PASS (zero regressions)
Test coverage:
- Closed session + previous_response_id -> 502 (not 400)
- Closed session without previous_response_id -> recovery attempted
- Mid-request websocket failure + previous_response_id -> 502 (not 400)
- Retry helper preserves previous_response_id (no error override)
- Context growth modelling: healthy (2.3K/turn) vs broken (70K/turn)
- Error code semantics: 502 vs 400 determines CLI behavior
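For illustration, a pytest-style sketch of the mid-request scenario. The helper below is a hypothetical stand-in for the bridge's error mapping, not the actual code exercised by test_bridge_context_blowup.py:

```python
def classify_mid_request_failure(previous_response_id: str | None) -> tuple[int, str]:
    """Hypothetical stand-in for the bridge's error mapping when the upstream
    websocket dies mid-request; the real logic lives in the proxy service."""
    # Post-fix contract: a transient upstream failure is always reported as a
    # retriable 502, never as a state-destroying 400 previous_response_not_found.
    return 502, "upstream_unavailable"


def test_mid_request_failure_with_previous_response_id_raises_502():
    status, error_code = classify_mid_request_failure(previous_response_id="resp_abc123")
    assert status == 502
    assert error_code != "previous_response_not_found"
```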
- P1: Allow closed-session reconnect (send_request=False) before raising 502, so transient between-turn disconnects can still recover
- P2: In _retry_http_bridge_request_on_fresh_upstream, reconnect but skip the send when previous_response_id is present (effective_send flag)
- Update integration tests: 400 previous_response_not_found -> 502 upstream_unavailable/stream_incomplete/upstream_request_timeout
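A rough sketch of the P2 behaviour, assuming a retry helper shaped roughly like the one named above; the session/request objects, reconnect call, and payload field are placeholders rather than the real service code:

```python
async def retry_on_fresh_upstream(session, request, *, send_request: bool) -> bool:
    """Illustrative only: reconnect the upstream websocket, but suppress the
    re-send when the conversation carries a previous_response_id, since a fresh
    server-side session would not recognise it."""
    effective_send = send_request and request.previous_response_id is None

    fresh = await session.reconnect()  # placeholder for the real reconnect helper
    if fresh is None:
        return False  # reconnect failed; caller raises a retriable 502

    if not effective_send:
        # Do not mark previous_response_not_found. Returning False lets the
        # caller raise 502 so the client retries with its state intact.
        return False

    await fresh.send_text(request.payload)  # send_text matches the production path
    return True
```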
Thank you for this important fix @balakumardev! 🎉 The context blowup issue is a critical regression; great catch with the session log analysis showing 70K vs 2.3K tokens/turn. I've rebased your changes on main and applied Codex adversarial review fixes (preserving the reconnect recovery path for between-turn disconnects). The reviewed version is in #401. Closing this in favor of #401 which includes your original fix + review improvements.
Force-pushed from f562ab9 to 9a9004a.
…vious_response_id

- P1: Reconnect-only retry returned True, leaving the request stranded; now returns False so the caller raises a retriable 502 immediately
- P3: Fix test mock send -> send_text to match the production code path
…signal

Send failure on the owner-side bridge now raises bridge_owner_unreachable instead of generic upstream_unavailable, so non-owner replicas trigger a local rebind via _http_bridge_should_attempt_local_previous_response_recovery.
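The distinct error code acts as a routing signal between replicas. A hypothetical sketch of that decision; the function name and parameters are placeholders, not the real _http_bridge_should_attempt_local_previous_response_recovery signature:

```python
def should_attempt_local_rebind(error_code: str, is_owner_replica: bool) -> bool:
    """Placeholder decision: only an explicit bridge_owner_unreachable tells a
    non-owner replica that it may rebind the conversation locally."""
    if is_owner_replica:
        # The owner already holds the bridge; there is nothing to rebind.
        return False
    # A generic upstream_unavailable stays a plain retriable error and does
    # not transfer bridge ownership.
    return error_code == "bridge_owner_unreachable"
```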
Thank you for catching this critical context blowup regression @balakumardev! 🎉 The session log analysis (2.3K vs 70K tokens/turn) was incredibly helpful in diagnosing the root cause. The Codex adversarial review identified some edge cases around multi-replica recovery and reconnect paths, which were resolved together. @all-contributors please add @balakumardev for code and test
I've put up a pull request to add @balakumardev! 🎉
Brings upstream Soju06/codex-lb v1.15.0 into fork:
- feat(proxy): GPT-5.5 + GPT-5.5 Pro model support (Soju06#477)
- fix(proxy): prevent admission semaphore leak + raise concurrency (Soju06#466)
- fix(proxy): inject session-level previous_response_id (Soju06#456)
- fix(proxy): harden continuity recovery + WS replay + bridge lifecycle (Soju06#415)
- fix: replace reject-fast admission with wait-then-reject (Soju06#413)
- fix(auth): accept API keys on /api/codex/usage (Soju06#417)
- fix(auth): allow explicit unauthenticated proxy client CIDRs (Soju06#399)
- fix(proxy): websocket connect-phase failover (Soju06#396)
- fix(proxy): preserve previous_response_id on bridge recovery (Soju06#397)
- fix(proxy): budget-safe routing + image-gen compat (Soju06#421)
- feat(dashboard): show account plan in request logs table (Soju06#425)
- chore: archive 36 OpenSpec changes + type cleanup (Soju06#400)

Conflict resolutions per deploy plan:
- app/modules/proxy/service.py: accept upstream websocket connect helpers verbatim; defer fork endpoint concurrency limits from d66db88 pending follow-up rebase on wait-then-reject admission base
- app/modules/proxy/api.py: drop fork _reject_or_acquire_http_proxy_endpoint_concurrency path in codex_usage handler; accept upstream api_key dep
- app/core/metrics/prometheus.py: keep both fork + upstream metric counters
- tests: accept upstream new tests; keep fork assertions where additive
- frontend recent-requests-table.tsx: preserve fork column gating, insert upstream Plan column
- openspec specs: preserve fork database-migrations requirements; merge upstream database-backends additions
- uv.lock: regenerated via uv lock (v1.13.1 -> v1.15.0)

Fork features temporarily off in this merge commit (follow-up PR required):
- proxy endpoint concurrency limits (d66db88)
Problem
When an upstream websocket session is lost (session.closed) or a request fails mid-flight, the HTTP bridge recovery paths introduced in v1.12.0 return 400 previous_response_not_found errors. This tells the Codex CLI that the server-side conversation state is gone, causing it to drop previous_response_id and resend the entire conversation history as input.
The result is catastrophic for context window usage: per-turn input growth jumps from ~2-3K tokens (with previous_response_id) to ~70K tokens (full conversation replay), filling the 876K context window in just 17 turns instead of 200+.
Root Cause
Three recovery paths in the bridge return 400 previous_response_not_found instead of 502 upstream_unavailable:
1. Mid-request websocket failure - when the upstream websocket dies during send() and previous_response_id is present, the error handler returns 400 previous_response_not_found instead of 502 upstream_unavailable.
2. _retry_http_bridge_request_on_fresh_upstream - when called with send_request=True and previous_response_id is present, it immediately marks the request as previous_response_not_found without attempting recovery.
3. _retry_http_bridge_precreated_request - same pattern.
When the CLI receives 400 previous_response_not_found, it interprets this as "the server doesn't have this conversation anymore" and resends the full conversation history without previous_response_id, inflating per-turn context by ~20x.
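One way to picture that client-side decision (an illustrative model of the behaviour described above, not the actual Codex CLI source):

```python
def next_request_input(status: int, error_code: str, history: list[str], latest: str) -> list[str]:
    """Illustrative model of what the client sends on its next attempt."""
    if status == 400 and error_code == "previous_response_not_found":
        # "Server state is gone": previous_response_id is dropped and the whole
        # conversation is replayed (~70K tokens per turn in the logs above).
        return history + [latest]
    # Retriable failure (e.g. 502 upstream_unavailable): previous_response_id is
    # kept and only the incremental input (~2-3K tokens) is sent.
    return [latest]
```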
Fix

- When session.closed and previous_response_id is present, raise 502 immediately (matching pre-v1.12.0 behaviour) so the client retries with previous_response_id intact on a new session.
- In the mid-request error handler, raise 502 instead of 400 previous_response_not_found so the client treats the failure as transient and retries properly.
- In the retry helpers, return False without marking previous_response_not_found so callers can raise a retriable 502.
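A minimal sketch of the first bullet, assuming a submit path shaped roughly like the commit message describes; the exception class, session object, and request fields are placeholders, not the real service code (the reviewed follow-up in #401 additionally allows a send-less reconnect before the 502):

```python
class UpstreamUnavailableError(Exception):
    """Placeholder for whatever the bridge raises to produce a 502 response."""
    status_code = 502
    error_code = "upstream_unavailable"


async def submit_bridge_request(session, request) -> None:
    """Illustrative only: the guard that keeps previous_response_id usable."""
    if session.closed and request.previous_response_id is not None:
        # A reconnected session is a fresh server-side session that cannot
        # resolve the old previous_response_id, so fail fast with a retriable
        # 502 and let the client retry on a new session with its state intact.
        raise UpstreamUnavailableError()
    if session.closed:
        # No conversation state at risk: attempt the normal reconnect recovery.
        session = await session.reconnect()  # placeholder reconnect helper
    await session.send_text(request.payload)
```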
Test Results
7 regression tests added in test_bridge_context_blowup.py, run against both broken and fixed code: on the pre-fix code, test_mid_request_failure_with_previous_response_id_raises_502 and test_retry_with_previous_response_id_returns_false_without_marking_error fail (400 previous_response_not_found returned / error_code_override set); on the fixed code, all 7 pass.
All 44 existing test_proxy_http_bridge.py tests pass with zero regressions.
Live Verification
Tested on a single-instance deployment with gpt-5.4 (876K effective context window): before the fix, 4 user messages grew to 853K tokens of input across 17 turns and triggered compaction; after the fix, the same 4 messages stayed at ~80K tokens across 17 turns with no compaction.