fix(agent): stop a 6% gateway blip from killing the whole session (3.23.1) #70
Merged
…23.1)

Audited 2026-05-28 from telemetry: 28/468 paid-model calls (~6%) intermittently return a PaymentRejected from the Solana gateway; identical prompts succeed 5 s apart. Three client-side defects amplified that blip into 'totally unusable, restart doesn't help':

1. error-classifier: 'payment_rejected' was non-transient with maxRetries=0. A single blip surfaced as a hard error. Fixed: mark transient with maxRetries=3. Each retry re-signs with a fresh nonce (llm.ts), so it's not a replay; deterministic failures (clock skew, wrong chain) still exhaust the budget quickly and fall through.

2. loop.ts: 'payment_rejected' was treated identically to 'payment' (insufficient funds) and added to paymentFailedModels for the whole session. One blip permanently demoted the user to free models. Fixed: split the two. 'payment' stays session-permanent (wallet won't refill mid-session). 'payment_rejected' only falls back FOR THIS TURN; next turn resets to baseModel and tries the paid model again.

3. start.ts: disconnectMcpServers() was fire-and-forget and there was no explicit process.exit(). Lingering keep-alive sockets (panel HTTP server, gateway clients, MCP children, FRANKLIN_EXTRACT_ON_EXIT) pinned the event loop. User saw 'Goodbye.' but `ps` still showed the process; a follow-up `franklin` raced with the zombie. Fixed: bounded MCP shutdown race (2 s cap) followed by explicit process.exit() in both Ink and basic UIs.

#1 + #2 + #3 together turn a 6% transient into 'session ruined and restart doesn't help'. Removing #1 and #2 caps the blast radius at one turn even when the gateway hiccups.

Gateway-side root cause (Solana nonce-cache race + missing RETRYABLE entries) is tracked separately in BlockRunAI/blockrun-sol.
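As an illustration of fix 1, here is a minimal sketch of a classifier-driven retry loop. `POLICIES`, `CategorizedError`, and `callWithRetry` are hypothetical names, not taken from the repository; the real `error-classifier` and `llm.ts` may be shaped differently, and the real loop is presumably asynchronous with a backoff between attempts. Only the two policy values (`maxRetries: 0` for `payment`, `maxRetries: 3` for `payment_rejected`) come from the description above.

```typescript
// Hypothetical shapes for illustration only.
type ErrorCategory = "payment" | "payment_rejected";

interface ErrorPolicy {
  isTransient: boolean;
  maxRetries: number;
}

// After this change payment_rejected is retryable; payment (insufficient funds) is not.
const POLICIES: Record<ErrorCategory, ErrorPolicy> = {
  payment: { isTransient: false, maxRetries: 0 },
  payment_rejected: { isTransient: true, maxRetries: 3 },
};

class CategorizedError extends Error {
  readonly category: ErrorCategory;
  constructor(category: ErrorCategory) {
    super(category);
    this.category = category;
  }
}

// Calls `attempt` until it succeeds or the category's retry budget is spent.
// Each attempt receives a new nonce, standing in for the re-sign step.
function callWithRetry<T>(attempt: (nonce: number) => T): { value: T; attempts: number } {
  let retries = 0;
  for (;;) {
    try {
      return { value: attempt(retries), attempts: retries + 1 };
    } catch (err) {
      if (!(err instanceof CategorizedError)) throw err;
      const policy = POLICIES[err.category];
      // Non-transient errors and exhausted budgets fall through to the caller.
      if (!policy.isTransient || retries >= policy.maxRetries) throw err;
      retries += 1; // real code would wait (backoff) here before retrying
    }
  }
}
```

With these policies a `payment_rejected` failure gets one initial attempt plus up to three retries, while a `payment` failure is rethrown after the first attempt.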
…e model

The payment_rejected per-turn fallback switched to a free model without resetting recoveryAttempts. By that point the transient path above has exhausted this turn's maxRetries:3 budget, so the free fallback model inherited recoveryAttempts==3 and got zero retries. A single transient blip on the fallback model then failed the whole turn, the exact outcome this PR set out to prevent. Reset the counter on switch, mirroring the rate_limit fallback's 'new model gets its own retry budget' behavior.
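The per-turn fallback and the counter reset can be sketched together as follows. This is an assumption-laden illustration, not the code in `loop.ts`: `FallbackPolicy`, `TurnState`, `startTurn`, `onPaymentError`, and `freeModel` are hypothetical names; only `paymentFailedModels`, `baseModel`, and `recoveryAttempts` appear in the commit messages.

```typescript
// Hypothetical per-turn state; the real loop tracks more than this.
interface TurnState {
  model: string;
  recoveryAttempts: number;
}

class FallbackPolicy {
  // Models that hit 'payment' (insufficient funds) stay here for the whole session.
  readonly paymentFailedModels = new Set<string>();
  readonly baseModel: string;
  readonly freeModel: string;

  constructor(baseModel: string, freeModel: string) {
    this.baseModel = baseModel;
    this.freeModel = freeModel;
  }

  // Start of every turn: return to baseModel unless it is session-blacklisted.
  startTurn(): TurnState {
    const blacklisted = this.paymentFailedModels.has(this.baseModel);
    return { model: blacklisted ? this.freeModel : this.baseModel, recoveryAttempts: 0 };
  }

  // Called once the retry budget on the current model is spent.
  onPaymentError(turn: TurnState, category: "payment" | "payment_rejected"): TurnState {
    if (category === "payment") {
      this.paymentFailedModels.add(turn.model); // session-permanent
    }
    // payment_rejected is deliberately NOT recorded: it only affects this turn.
    // Either way the fallback model starts with its own retry budget.
    return { model: this.freeModel, recoveryAttempts: 0 };
  }
}
```

Returning a fresh `TurnState` instead of mutating only `model` makes it hard to forget the reset, which is the mistake the second commit corrects.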
TL;DR
Audited 2026-05-28 from telemetry: ~6% of paid-model calls (28/468) intermittently return a `PaymentRejected` from the Solana gateway; identical prompts succeed 5 seconds apart. Three client-side defects amplified that blip into "totally unusable, restart doesn't help":

| # | File | Defect | Fix |
|---|---|---|---|
| 1 | `error-classifier.ts` | `payment_rejected` was `isTransient: false, maxRetries: 0`, so a single blip surfaced as a hard error | `isTransient: true, maxRetries: 3` |
| 2 | `loop.ts` | `payment_rejected` was treated identically to `payment`: added to `paymentFailedModels` for the whole session, permanently demoting the user to free models on one blip | `payment` stays session-permanent (wallet won't refill mid-session); `payment_rejected` only falls back for this turn, next turn resets to `baseModel` |
| 3 | `start.ts` | `disconnectMcpServers()` was fire-and-forget + no explicit `process.exit()`: keep-alive sockets + MCP children pinned the event loop, user saw "Goodbye." but `ps` still showed the process | `process.exit(process.exitCode ?? 0)` in both Ink and basic UIs |

Combined, #1 + #2 + #3 turn a transient 6% gateway hiccup into "session ruined and restart doesn't help." Removing #1 and #2 caps the blast radius at one turn even when the gateway hiccups; #3 makes restart actually work.
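For defect 3, a bounded shutdown followed by an explicit exit might look like the sketch below. `withCap` and `shutdown` are hypothetical helpers, and the `disconnect` and `exit` callbacks are injected so the sketch runs without the CLI; in `start.ts` they would presumably be `disconnectMcpServers` and `(code) => process.exit(code)`. The 2 s default is the cap named in the commit message.

```typescript
// Resolves when `work` settles or after `capMs`, whichever comes first.
// A rejection from `work` is swallowed: shutdown should proceed regardless.
function withCap(work: Promise<unknown>, capMs: number): Promise<"done" | "timed_out"> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("timed_out"), capMs);
    const finish = () => {
      clearTimeout(timer); // do not leave the timer pinning the event loop
      resolve("done");
    };
    work.then(finish, finish);
  });
}

// Waits for the disconnect at most `capMs`, then exits unconditionally.
async function shutdown(
  disconnect: () => Promise<void>,
  exit: (code: number) => void,
  exitCode: number | undefined,
  capMs = 2000,
): Promise<void> {
  await withCap(disconnect(), capMs);
  exit(exitCode ?? 0); // preserve an exit code set earlier, default to 0
}
```

The explicit `exit` call is what ends the process even if sockets or child processes are still open; the cap only bounds how long a well-behaved disconnect is given to finish first.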
Trade-offs (consciously accepted)
- `maxRetries: 0 → 3`: users with genuinely broken wallets (wrong chain, expired keys) wait ~7 s (1 s + 2 s + 4 s backoff) before seeing the error, vs instant before. The same `suggestion` text still surfaces. Net win because real misconfigurations are ~100× rarer than burst blips.
- `payment_rejected` per-turn fallback: if the gateway has a prolonged (minutes-level) outage rather than a 5 s blip, every turn now burns 3 sign retries before falling back, instead of being session-blacklisted once. Tracked as a follow-up: add a circuit breaker ("3 `payment_rejected` in 60 s → escalate to session-level"). For now, prefer the simpler change because 99% of cases are sub-5-s blips.
- `process.exit()`: any background async write (telemetry, learning extractor) that hasn't flushed gets cut. `flushStats()` is sync and runs first, so the user's session data is safe. Worst case loses a few KB of telemetry. Much better than the alternative (zombie process).

Addressing the gateway-side root cause (Solana nonce-cache race + missing `RETRYABLE_ERRORS` entries like `transaction_simulation_failed`) is the real fix and is tracked in BlockRunAI/blockrun-sol. This PR is the client-side blast-radius cap.

Test plan

- `npm run build`: passes
- `test/local.mjs` asserts new classifier shape (`isTransient: true, maxRetries: 3`); passes
- `franklin --chain solana`, set `/model anthropic/claude-sonnet-4.6`, send 20 quick prompts. Confirm at least one `PaymentRejected` event triggers per-turn fallback (not session-permanent). After fallback, send another prompt → should attempt `claude-sonnet-4.6` again, not stay on the free model.
- `/exit` → `ps -ef | grep franklin | grep -v grep` should show no surviving process after ~2 s.
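The circuit breaker listed under Trade-offs is a follow-up and not part of this PR. Purely as an illustration of the "3 `payment_rejected` in 60 s → escalate to session-level" rule, a sliding-window counter could look like this; `RejectionBreaker` and its `record` method are hypothetical names, and the clock is passed in as a parameter so the behaviour is easy to test.

```typescript
// Hypothetical sliding-window breaker; not code from the repository.
class RejectionBreaker {
  private readonly timestamps: number[] = [];
  readonly threshold: number;
  readonly windowMs: number;

  constructor(threshold = 3, windowMs = 60_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one payment_rejected event at `nowMs`.
  // Returns true when the session should escalate to a session-level fallback.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    // Drop events that have aged out of the window.
    for (;;) {
      const oldest = this.timestamps[0];
      if (oldest === undefined || nowMs - oldest < this.windowMs) break;
      this.timestamps.shift();
    }
    return this.timestamps.length >= this.threshold;
  }
}
```

A caller would invoke `record(Date.now())` each time a turn falls back because of `payment_rejected`, and add the model to the session-level set only when it returns `true`.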