
fix(agent): stop a 6% gateway blip from killing the whole session (3.23.1)#70

Merged
1bcMax merged 2 commits into main from fix/payment-reject-blacklist-and-exit-hang
May 31, 2026

Conversation

@KillerQueen-Z
Collaborator

TL;DR

Audited 2026-05-28 from telemetry: ~6% of paid-model calls (28/468) return a PaymentRejected from the Solana gateway intermittently — identical prompts succeed 5 seconds apart. Three client-side defects amplified that blip into "totally unusable, restart doesn't help":

| # | File | Defect | Fix |
|---|------|--------|-----|
| 1 | error-classifier.ts | payment_rejected was isTransient: false, maxRetries: 0, so a single blip surfaced as a hard error | isTransient: true, maxRetries: 3 |
| 2 | loop.ts | payment_rejected was treated identically to payment: added to paymentFailedModels for the whole session, permanently demoting the user to free models on one blip | Split: payment stays session-permanent (wallet won't refill mid-session); payment_rejected only falls back for this turn, next turn resets to baseModel |
| 3 | start.ts | disconnectMcpServers() was fire-and-forget with no explicit process.exit(); keep-alive sockets and MCP children pinned the event loop, so the user saw "Goodbye." but ps still showed the process | Bounded 2 s MCP shutdown race plus explicit process.exit(process.exitCode ?? 0) in both Ink and basic UIs |

Combined, #1 + #2 + #3 turn a transient 6% gateway hiccup into "session ruined and restart doesn't help." Removing #1 and #2 caps the blast radius at one turn even when the gateway hiccups; #3 makes restart actually work.
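
The split described in row 2 can be illustrated with a minimal sketch. Only `paymentFailedModels` and `baseModel` are names taken from this PR; the `ModelSelector` class, its methods, and the model strings are hypothetical and are not the actual loop.ts code.

```typescript
// Hypothetical sketch of session-level vs per-turn fallback.
type PaymentErrorCategory = "payment" | "payment_rejected";

class ModelSelector {
  // Models demoted for the rest of the session (insufficient funds).
  private readonly paymentFailedModels = new Set<string>();
  // Fallback that applies to the current turn only.
  private turnOverride: string | null = null;
  private readonly baseModel: string;
  private readonly freeModel: string;

  constructor(baseModel: string, freeModel: string) {
    this.baseModel = baseModel;
    this.freeModel = freeModel;
  }

  // Called at the start of every turn: drops any per-turn fallback.
  beginTurn(): void {
    this.turnOverride = null;
  }

  currentModel(): string {
    if (this.turnOverride !== null) return this.turnOverride;
    return this.paymentFailedModels.has(this.baseModel)
      ? this.freeModel
      : this.baseModel;
  }

  onPaymentError(category: PaymentErrorCategory): void {
    if (category === "payment") {
      // The wallet will not refill mid-session: demote for the whole session.
      this.paymentFailedModels.add(this.baseModel);
    } else {
      // Gateway blip: fall back for this turn only.
      this.turnOverride = this.freeModel;
    }
  }
}
```

With this shape, a `payment_rejected` followed by `beginTurn()` returns the user to the paid model, while a `payment` error keeps the free model for every later turn.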

Trade-offs (consciously accepted)

  • maxRetries: 0 → 3 — users with genuinely broken wallets (wrong chain, expired keys) wait ~7 s (1 s + 2 s + 4 s backoff) before seeing the error, vs instant before. Same suggestion text still surfaces. Net win because real misconfigurations are ~100× rarer than burst blips.
  • payment_rejected per-turn fallback — if the gateway has a prolonged (minutes-level) outage rather than a 5 s blip, every turn now burns 3 sign retries before falling back, instead of being session-blacklisted once. Tracked as a follow-up: add a circuit breaker ("3 payment_rejected in 60 s → escalate to session-level"). For now, prefer the simpler change because 99% of cases are sub-5-s blips.
  • process.exit() — any background async write (telemetry, learning extractor) that hasn't flushed gets cut. flushStats() is sync and runs first, so the user's session data is safe. Worst case loses a few KB of telemetry. Much better than the alternative (zombie process).
  • Gateway-side root cause (Solana nonce-cache race + missing RETRYABLE_ERRORS entries like transaction_simulation_failed) is the real fix and is tracked in BlockRunAI/blockrun-sol. This PR is the client-side blast-radius cap.
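
The circuit breaker named as a follow-up in the second trade-off is not part of this PR. One possible shape, assuming a simple sliding window, is sketched below; the `RejectionBreaker` class and its method names are hypothetical, and only the thresholds (3 rejections in 60 s) come from the text above.

```typescript
// Hypothetical sliding-window breaker: trips when `threshold` rejections
// fall inside the last `windowMs` milliseconds.
class RejectionBreaker {
  private timestamps: number[] = [];
  private readonly threshold: number;
  private readonly windowMs: number;

  constructor(threshold: number = 3, windowMs: number = 60000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
  }

  // Records one payment_rejected at time `nowMs` and returns true when the
  // caller should escalate to a session-level fallback.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    // Keep only rejections that are still inside the window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < this.windowMs);
    return this.timestamps.length >= this.threshold;
  }
}
```

Passing the clock in as an argument keeps the sketch deterministic; a caller would typically pass `Date.now()`.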

Test plan

  • npm run build — passes
  • Updated test in test/local.mjs asserts new classifier shape (isTransient: true, maxRetries: 3); passes
  • Manual: franklin --chain solana, set /model anthropic/claude-sonnet-4.6, send 20 quick prompts. Confirm at least one PaymentRejected event triggers per-turn fallback (not session-permanent). After fallback, send another prompt → should attempt claude-sonnet-4.6 again, not stay on the free model.
  • Manual: /exit, then ps -ef | grep franklin | grep -v grep should show no surviving process after ~2 s.

KillerQueen-Z and others added 2 commits May 30, 2026 17:41
…23.1)

Audited 2026-05-28 from telemetry: 28/468 paid-model calls (~6%) return
a PaymentRejected from the Solana gateway intermittently — identical
prompts succeed 5 s apart. Three client-side defects amplified that
blip into 'totally unusable, restart doesn't help':

1. error-classifier: 'payment_rejected' was non-transient with
   maxRetries=0. A single blip surfaced as a hard error. Fixed: mark
   transient with maxRetries=3. Each retry re-signs with a fresh nonce
   (llm.ts), so it's not a replay; deterministic failures (clock skew,
   wrong chain) still exhaust the budget quickly and fall through.

2. loop.ts: 'payment_rejected' was treated identically to 'payment'
   (insufficient funds) — added to paymentFailedModels for the whole
   session. One blip permanently demoted the user to free models.
   Fixed: split the two. 'payment' stays session-permanent (wallet
   won't refill mid-session). 'payment_rejected' only falls back FOR
   THIS TURN; next turn resets to baseModel and tries the paid model
   again.

3. start.ts: disconnectMcpServers() was fire-and-forget and there was
   no explicit process.exit(). Lingering keep-alive sockets (panel HTTP
   server, gateway clients, MCP children, FRANKLIN_EXTRACT_ON_EXIT)
   pinned the event loop. User saw 'Goodbye.' but `ps` still showed
   the process; a follow-up `franklin` raced with the zombie. Fixed:
   bounded MCP shutdown race (2 s cap) followed by explicit
   process.exit() in both Ink and basic UIs.
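
As an illustration of item 3, a deadline-bounded shutdown can be sketched as a race between the disconnect call and a timer. The helper name `shutdownWithDeadline` and the outcome labels are hypothetical; the actual start.ts code may be structured differently.

```typescript
// Hypothetical sketch: wait for disconnect, but never longer than deadlineMs.
type ShutdownOutcome = "clean" | "failed" | "timed_out";

function shutdownWithDeadline(
  disconnect: () => Promise<void>,
  deadlineMs: number,
): Promise<ShutdownOutcome> {
  return new Promise<ShutdownOutcome>((resolve) => {
    // Whichever side settles first wins; later resolve() calls are no-ops.
    const timer = setTimeout(() => resolve("timed_out"), deadlineMs);
    Promise.resolve()
      .then(disconnect)
      .then(
        () => {
          clearTimeout(timer);
          resolve("clean");
        },
        () => {
          // A failed disconnect must not block the exit path.
          clearTimeout(timer);
          resolve("failed");
        },
      );
  });
}

// Usage sketch (not executed here), assuming the 2 s cap from this PR:
//   await shutdownWithDeadline(disconnectMcpServers, 2000);
//   process.exit(process.exitCode ?? 0);
```

The explicit `process.exit()` afterwards is what guarantees termination; the race only bounds how long the process waits for a tidy MCP shutdown first.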

#1 + #2 + #3 together turn a 6% transient into 'session ruined and
restart doesn't help'. Removing #1 and #2 caps the blast radius at one
turn even when the gateway hiccups. Gateway-side root cause (Solana
nonce-cache race + missing RETRYABLE entries) is tracked separately
in BlockRunAI/blockrun-sol.
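
The cost of the new retry budget from item 1 can be worked out with a small sketch. It assumes the doubling backoff quoted in the PR description (1 s, 2 s, 4 s); the `ErrorPolicy` interface and both helper functions are hypothetical, and only the `isTransient` and `maxRetries` values come from this PR.

```typescript
// Hypothetical classifier entry shape; field values are from the PR.
interface ErrorPolicy {
  isTransient: boolean;
  maxRetries: number;
}

const paymentRejectedBefore: ErrorPolicy = { isTransient: false, maxRetries: 0 };
const paymentRejectedAfter: ErrorPolicy = { isTransient: true, maxRetries: 3 };

// Delay before retry number `attempt` (1-based), doubling each time.
function backoffMs(attempt: number, baseMs: number = 1000): number {
  return baseMs * Math.pow(2, attempt - 1);
}

// Total time spent waiting if every retry fails.
function worstCaseWaitMs(policy: ErrorPolicy): number {
  if (!policy.isTransient) return 0;
  let total = 0;
  for (let attempt = 1; attempt <= policy.maxRetries; attempt++) {
    total += backoffMs(attempt);
  }
  return total;
}
```

Under these assumptions the old policy surfaces the error immediately, while the new one waits 1000 + 2000 + 4000 = 7000 ms before giving up, which matches the ~7 s figure in the trade-offs.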
…e model

The payment_rejected per-turn fallback switched to a free model without
resetting recoveryAttempts. By that point the transient path above has
exhausted this turn's maxRetries:3 budget, so the free fallback model
inherited recoveryAttempts==3 and got zero retries — a single transient
blip on the fallback model failed the whole turn, the exact outcome this
PR set out to prevent. Reset the counter on switch, mirroring the
rate_limit fallback's 'new model gets its own retry budget' behavior.
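
A minimal sketch of the counter bug and its fix follows. Only `recoveryAttempts` and the budget of 3 come from the commit message; the `TurnState` shape and the function names are hypothetical.

```typescript
// Hypothetical per-turn state; only recoveryAttempts is named in the PR.
interface TurnState {
  model: string;
  recoveryAttempts: number;
}

const MAX_RETRIES = 3;

function retriesLeft(state: TurnState): number {
  return Math.max(0, MAX_RETRIES - state.recoveryAttempts);
}

// Before the fix: the fallback model inherits the exhausted counter.
function switchModelBefore(state: TurnState, fallback: string): TurnState {
  return { model: fallback, recoveryAttempts: state.recoveryAttempts };
}

// After the fix: the new model gets its own retry budget.
function switchModelAfter(_state: TurnState, fallback: string): TurnState {
  return { model: fallback, recoveryAttempts: 0 };
}
```

Starting from a state with `recoveryAttempts` equal to 3, the first variant leaves the fallback model with no retries, while the second restores the full budget.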
1bcMax merged commit 441f1c7 into main on May 31, 2026