fix: prevent context blowup by preserving previous_response_id on bridge recovery - IMPORTANT FIX FOR CONTEXT BLOWING UP #397
Conversation
…dge recovery

When an upstream websocket session is lost (session.closed) or a request fails mid-flight, the HTTP bridge recovery paths were returning 400 previous_response_not_found errors. This tells the Codex CLI that the server-side conversation state is gone, causing it to drop previous_response_id and resend the entire conversation history as input. The result is catastrophic for context window usage: per-turn input growth jumps from ~2-3K tokens (with previous_response_id) to ~70K tokens (full conversation replay), filling the 876K context window in just 17 turns instead of 200+.

Root cause analysis:
- _submit_http_bridge_request: when session.closed is True, the new code attempts to reconnect the upstream websocket. But the reconnected session is a fresh server-side session that doesn't recognise the old previous_response_id. When the upstream returns 'not found', the bridge marks the request as previous_response_not_found (400) instead of upstream_unavailable (502).
- _retry_http_bridge_request_on_fresh_upstream: when called with send_request=True and previous_response_id is present, immediately marks the request as previous_response_not_found without even trying.
- _retry_http_bridge_precreated_request: same pattern.

Fix:
- When session.closed and previous_response_id is present, raise 502 immediately (matching pre-v1.12.0 behaviour) so the client retries with previous_response_id intact on a new session.
- In the error handler, always raise 502 instead of 400 previous_response_not_found so the client treats the failure as transient and retries properly.
- In the retry helpers, return False without marking previous_response_not_found so callers can raise a retriable 502.

Before: 4 user messages -> 853K tokens in 17 turns -> compaction
After: 4 user messages -> ~80K tokens in 17 turns -> no compaction
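For a sense of scale, a rough back-of-the-envelope model of the numbers above (illustrative only; it uses a flat per-turn average, not a measurement from this code):

```python
# Rough model of context-window consumption using the figures quoted above.
# A flat per-turn average is a simplification: the broken path actually grows
# each turn because the full history is replayed, which is why ~17 turns were
# observed rather than the ~12 this naive division predicts.
CONTEXT_WINDOW = 876_000      # effective gpt-5.4 window, in tokens
HEALTHY_PER_TURN = 2_300      # incremental input with previous_response_id
BROKEN_PER_TURN = 70_000      # full-history replay after 400 previous_response_not_found

def turns_until_full(tokens_per_turn: int) -> int:
    """Turns that fit in the window at a flat per-turn input cost."""
    return CONTEXT_WINDOW // tokens_per_turn

print(turns_until_full(HEALTHY_PER_TURN))  # 380 -> the "200+ turns" healthy regime
print(turns_until_full(BROKEN_PER_TURN))   # 12  -> window exhausted almost immediately
```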
Proves the bug exists on broken code and is fixed:
BROKEN (pre-fix):
- test_mid_request_failure_with_previous_response_id_raises_502: FAIL
(returns 400 previous_response_not_found instead of 502)
- test_retry_with_previous_response_id_returns_false_without_marking_error: FAIL
(sets error_code_override to previous_response_not_found)
FIXED:
- All 7 tests PASS
- 44 existing bridge tests still PASS (zero regressions)
Test coverage:
- Closed session + previous_response_id -> 502 (not 400)
- Closed session without previous_response_id -> recovery attempted
- Mid-request websocket failure + previous_response_id -> 502 (not 400)
- Retry helper preserves previous_response_id (no error override)
- Context growth modelling: healthy (2.3K/turn) vs broken (70K/turn)
- Error code semantics: 502 vs 400 determines CLI behavior
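For illustration, a pytest-style sketch of the mid-request scenario. The helper below is a hypothetical stand-in for the bridge's error mapping, not the actual code exercised by test_bridge_context_blowup.py:

```python
def classify_mid_request_failure(previous_response_id: str | None) -> tuple[int, str]:
    """Hypothetical stand-in for the bridge's error mapping when the upstream
    websocket dies mid-request; the real logic lives in the proxy service."""
    # Post-fix contract: a transient upstream failure is always reported as a
    # retriable 502, never as a state-destroying 400 previous_response_not_found.
    return 502, "upstream_unavailable"


def test_mid_request_failure_with_previous_response_id_raises_502():
    status, error_code = classify_mid_request_failure(previous_response_id="resp_abc123")
    assert status == 502
    assert error_code != "previous_response_not_found"
```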
- P1: Allow closed-session reconnect (send_request=False) before raising 502, so transient between-turn disconnects can still recover
- P2: In _retry_http_bridge_request_on_fresh_upstream, reconnect but skip the send when previous_response_id is present (effective_send flag)
- Update integration tests: 400 previous_response_not_found -> 502 upstream_unavailable/stream_incomplete/upstream_request_timeout
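A rough sketch of the P2 behaviour, assuming a retry helper shaped roughly like the one named above; the session/request objects, reconnect call, and payload field are placeholders rather than the real service code:

```python
async def retry_on_fresh_upstream(session, request, *, send_request: bool) -> bool:
    """Illustrative only: reconnect the upstream websocket, but suppress the
    re-send when the conversation carries a previous_response_id, since a fresh
    server-side session would not recognise it."""
    effective_send = send_request and request.previous_response_id is None

    fresh = await session.reconnect()  # placeholder for the real reconnect helper
    if fresh is None:
        return False  # reconnect failed; caller raises a retriable 502

    if not effective_send:
        # Do not mark previous_response_not_found. Returning False lets the
        # caller raise 502 so the client retries with its state intact.
        return False

    await fresh.send_text(request.payload)  # send_text matches the production path
    return True
```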
Thank you for this important fix @balakumardev! 🎉 The context blowup issue is a critical regression; great catch with the session log analysis showing 70K vs 2.3K tokens/turn. I've rebased your changes on main and applied Codex adversarial review fixes (preserving the reconnect recovery path for between-turn disconnects). The reviewed version is in #401. Closing this in favor of #401 which includes your original fix + review improvements.
Force-pushed from f562ab9 to 9a9004a.
…vious_response_id

- P1: Reconnect-only retry returned True, leaving the request stranded; now returns False so the caller raises a retriable 502 immediately
- P3: Fix test mock send -> send_text to match the production code path
…signal

Send failure on the owner-side bridge now raises bridge_owner_unreachable instead of generic upstream_unavailable, so non-owner replicas trigger a local rebind via _http_bridge_should_attempt_local_previous_response_recovery.
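The distinct error code acts as a routing signal between replicas. A hypothetical sketch of that decision; the function name and parameters are placeholders, not the real _http_bridge_should_attempt_local_previous_response_recovery signature:

```python
def should_attempt_local_rebind(error_code: str, is_owner_replica: bool) -> bool:
    """Placeholder decision: only an explicit bridge_owner_unreachable tells a
    non-owner replica that it may rebind the conversation locally."""
    if is_owner_replica:
        # The owner already holds the bridge; there is nothing to rebind.
        return False
    # A generic upstream_unavailable stays a plain retriable error and does
    # not transfer bridge ownership.
    return error_code == "bridge_owner_unreachable"
```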
Thank you for catching this critical context blowup regression @balakumardev! 🎉 The session log analysis (2.3K vs 70K tokens/turn) was incredibly helpful in diagnosing the root cause. The Codex adversarial review identified some edge cases around multi-replica recovery and reconnect paths, which were resolved together. @all-contributors please add @balakumardev for code and test
I've put up a pull request to add @balakumardev! 🎉
Brings upstream Soju06/codex-lb v1.15.0 into fork:
- feat(proxy): GPT-5.5 + GPT-5.5 Pro model support (Soju06#477)
- fix(proxy): prevent admission semaphore leak + raise concurrency (Soju06#466)
- fix(proxy): inject session-level previous_response_id (Soju06#456)
- fix(proxy): harden continuity recovery + WS replay + bridge lifecycle (Soju06#415)
- fix: replace reject-fast admission with wait-then-reject (Soju06#413)
- fix(auth): accept API keys on /api/codex/usage (Soju06#417)
- fix(auth): allow explicit unauthenticated proxy client CIDRs (Soju06#399)
- fix(proxy): websocket connect-phase failover (Soju06#396)
- fix(proxy): preserve previous_response_id on bridge recovery (Soju06#397)
- fix(proxy): budget-safe routing + image-gen compat (Soju06#421)
- feat(dashboard): show account plan in request logs table (Soju06#425)
- chore: archive 36 OpenSpec changes + type cleanup (Soju06#400)

Conflict resolutions per deploy plan:
- app/modules/proxy/service.py: accept upstream websocket connect helpers verbatim; defer fork endpoint concurrency limits from d66db88 pending follow-up rebase on wait-then-reject admission base
- app/modules/proxy/api.py: drop fork _reject_or_acquire_http_proxy_endpoint_concurrency path in codex_usage handler; accept upstream api_key dep
- app/core/metrics/prometheus.py: keep both fork + upstream metric counters
- tests: accept upstream new tests; keep fork assertions where additive
- frontend recent-requests-table.tsx: preserve fork column gating, insert upstream Plan column
- openspec specs: preserve fork database-migrations requirements; merge upstream database-backends additions
- uv.lock: regenerated via uv lock (v1.13.1 -> v1.15.0)

Fork features temporarily off in this merge commit (follow-up PR required):
- proxy endpoint concurrency limits (d66db88)
Problem
When an upstream websocket session is lost (session.closed) or a request fails mid-flight, the HTTP bridge recovery paths introduced in v1.12.0 return 400 previous_response_not_found errors. This tells the Codex CLI that the server-side conversation state is gone, causing it to drop previous_response_id and resend the entire conversation history as input.
The result is catastrophic for context window usage: per-turn input growth jumps from ~2-3K tokens (with previous_response_id) to ~70K tokens (full conversation replay), filling the 876K context window in just 17 turns instead of 200+.
Root Cause
Three recovery paths in the bridge return 400 previous_response_not_found instead of 502 upstream_unavailable:
1. Mid-request websocket failure - when the upstream websocket dies during send() and previous_response_id is present, the error handler returns 400 previous_response_not_found instead of 502 upstream_unavailable.
2. _retry_http_bridge_request_on_fresh_upstream - when called with send_request=True and previous_response_id is present, it immediately marks the request as previous_response_not_found without attempting recovery.
3. _retry_http_bridge_precreated_request - same pattern.
When the CLI receives 400 previous_response_not_found, it interprets this as "the server doesn't have this conversation anymore" and resends the full conversation history without previous_response_id, inflating per-turn context by ~20x.
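One way to picture that client-side decision (an illustrative model of the behaviour described above, not the actual Codex CLI source):

```python
def next_request_input(status: int, error_code: str, history: list[str], latest: str) -> list[str]:
    """Illustrative model of what the client sends on its next attempt."""
    if status == 400 and error_code == "previous_response_not_found":
        # "Server state is gone": previous_response_id is dropped and the whole
        # conversation is replayed (~70K tokens per turn in the logs above).
        return history + [latest]
    # Retriable failure (e.g. 502 upstream_unavailable): previous_response_id is
    # kept and only the incremental input (~2-3K tokens) is sent.
    return [latest]
```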
Fix

- When session.closed and previous_response_id is present, raise 502 immediately (matching pre-v1.12.0 behaviour) so the client retries with previous_response_id intact on a new session.
- In the mid-request error handler, raise 502 instead of 400 previous_response_not_found so the client treats the failure as transient and retries properly.
- In the retry helpers, return False without marking previous_response_not_found so callers can raise a retriable 502.
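A minimal sketch of the first bullet, assuming a submit path shaped roughly like the commit message describes; the exception class, session object, and request fields are placeholders, not the real service code (the reviewed follow-up in #401 additionally allows a send-less reconnect before the 502):

```python
class UpstreamUnavailableError(Exception):
    """Placeholder for whatever the bridge raises to produce a 502 response."""
    status_code = 502
    error_code = "upstream_unavailable"


async def submit_bridge_request(session, request) -> None:
    """Illustrative only: the guard that keeps previous_response_id usable."""
    if session.closed and request.previous_response_id is not None:
        # A reconnected session is a fresh server-side session that cannot
        # resolve the old previous_response_id, so fail fast with a retriable
        # 502 and let the client retry on a new session with its state intact.
        raise UpstreamUnavailableError()
    if session.closed:
        # No conversation state at risk: attempt the normal reconnect recovery.
        session = await session.reconnect()  # placeholder reconnect helper
    await session.send_text(request.payload)
```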
Test Results
7 regression tests added in test_bridge_context_blowup.py, run against both broken and fixed code: on the pre-fix code, test_mid_request_failure_with_previous_response_id_raises_502 and test_retry_with_previous_response_id_returns_false_without_marking_error fail (400 previous_response_not_found returned / error_code_override set); on the fixed code, all 7 pass.
All 44 existing test_proxy_http_bridge.py tests pass with zero regressions.
Live Verification
Tested on a single-instance deployment with gpt-5.4 (876K effective context window): before the fix, 4 user messages grew to 853K tokens of input across 17 turns and triggered compaction; after the fix, the same 4 messages stayed at ~80K tokens across 17 turns with no compaction.