
fix(retry): treat transaction_simulation_failed as permanent — don't retry slow-model settlement failures #6

@KillerQueen-Z

Problem

When the BlockRun Solana gateway returns transaction_simulation_failed for a slow image model (notably openai/gpt-image-2 on complex prompts), the SDK retries the entire request 2–3 times. Each retry re-signs the payment, regenerates the image (90+ s on slow models), and fails again for the same root cause, so the user waits roughly 5 minutes before finally seeing the failure.

User-reported example (LiteLLM Proxy → blockrun-litellm sidecar → blockrun-llm → sol.blockrun.ai):

status: 402 Payment Required
wall-clock: 5m 46.8s
details: 'transaction_simulation_failed'

The 5+ minutes is the SDK's automatic retry loop, not a single attempt.

Root cause analysis

For slow image models on Solana, transaction_simulation_failed is deterministic and permanent within the lifetime of one signed authorization: it indicates that the signed payment authorization expired (a Solana blockhash stays valid for roughly 60–90 s) before the gateway could settle. Retrying re-signs the payment but also regenerates the same slow image, so every attempt hits the same wall.
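The timing argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the constants are the approximate figures quoted in this issue (a 60–90 s blockhash window and 90+ s generation time), and the helper names are hypothetical, not part of the SDK.

```python
# Approximate figures quoted in this issue; not measured values.
BLOCKHASH_VALIDITY_S = 90   # upper end of the ~60-90 s window
GENERATION_S = 90           # lower bound of "90+ s on slow models"


def settles_in_time(generation_s: float, validity_s: float) -> bool:
    """True only if generation finishes before the signed authorization expires."""
    return generation_s < validity_s


def total_wait_s(attempts: int, per_attempt_s: float) -> float:
    """Wall-clock time when every attempt runs to completion and then fails."""
    return attempts * per_attempt_s


# Even against the most generous window, a 90 s generation does not settle in time.
print(settles_in_time(GENERATION_S, BLOCKHASH_VALIDITY_S))

# One initial attempt plus 3 retries, each at least 90 s.
print(total_wait_s(attempts=4, per_attempt_s=GENERATION_S))
```

With one initial attempt plus three retries at 90 s each, the total is 360 s, which is in line with the 5m 46.8s wall-clock time reported above.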

The BlockRun gateway already classifies this error as a client-side payment issue (see blockrun-sol src/lib/x402-solana.ts:370-378 PERMANENT_ERRORS and chat/completions/route.ts:655-657 isClientPaymentIssue). The SDK retry logic, however, treats it as transient.

Proposed fix

Add transaction_simulation_failed (and a couple of close cousins) to the SDK's permanent-error patterns, so the first failure surfaces immediately without retries.

Wherever the SDK classifies retryable vs permanent payment errors (e.g. inside SolanaLLMClient.image / payment helper), include:

PERMANENT_PAYMENT_PATTERNS = (
    "insufficient",
    "invalid signature",
    "invalid_payload",
    "expired",
    "authorization is used",
    "transaction_simulation_failed",   # ← add
    "blockhash not found",             # ← add (same class: re-signing is futile unless the root cause is fixed)
    "block height exceeded",           # ← add
)

(Adjust to whatever the actual classification mechanism is — patterns above are the BlockRun-sol gateway's convention.)
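As a minimal sketch of how such a pattern list might be consulted, assuming the SDK classifies errors by substring-matching the gateway's detail string (that mechanism is an assumption, and is_permanent_payment_error is a hypothetical helper name):

```python
PERMANENT_PAYMENT_PATTERNS = (
    "insufficient",
    "invalid signature",
    "invalid_payload",
    "expired",
    "authorization is used",
    "transaction_simulation_failed",
    "blockhash not found",
    "block height exceeded",
)


def is_permanent_payment_error(detail: str) -> bool:
    """Return True if the error detail matches a known permanent pattern.

    Matching is a case-insensitive substring check, so the pattern can
    appear anywhere inside a longer server message.
    """
    lowered = detail.lower()
    return any(pattern in lowered for pattern in PERMANENT_PAYMENT_PATTERNS)


# A permanent failure: surface it immediately instead of retrying.
print(is_permanent_payment_error(
    "Payment rejected by server. detail=transaction_simulation_failed"
))

# A transient failure: still eligible for the existing retry path.
print(is_permanent_payment_error("facilitator returned 503"))
```

Substring matching is deliberately coarse; if the gateway exposes structured error codes, matching on those would be less fragile.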

User-visible impact

  • Today: the user waits 5+ min, sees Payment rejected by server. detail={...transaction_simulation_failed...}, and has no idea whether retrying will help or whether the request will loop forever.
  • After fix: the user waits 30–60s (one attempt), sees the same error without any retries, and can (a) switch chain to Base, (b) switch model to openai/gpt-image-1, or (c) wait for the upstream gateway fix.

This removes the worst part of the bad UX (the waiting) without claiming to fix the underlying gateway-side issue, which is tracked separately.

Acceptance criteria

  • transaction_simulation_failed returned by the BlockRun gateway raises PaymentError after a single attempt, not after multiple retries
  • Wall-clock time for a deterministic Solana settlement failure drops from ~5 min to ~30–60s
  • Existing retry behavior preserved for genuinely transient errors (network/RPC timeouts, facilitator 5xx, etc.)
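The criteria above could be exercised with a retry loop along these lines. This is a sketch under stated assumptions, not the SDK's actual code: PaymentError here is a stand-in class, call_with_retries and is_permanent are hypothetical names, and the pattern tuple is trimmed to the three new entries.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PaymentError(Exception):
    """Stand-in for the SDK's PaymentError; the real class may differ."""


# Trimmed to the three patterns this issue proposes adding.
PERMANENT_PAYMENT_PATTERNS = (
    "transaction_simulation_failed",
    "blockhash not found",
    "block height exceeded",
)


def is_permanent(detail: str) -> bool:
    lowered = detail.lower()
    return any(pattern in lowered for pattern in PERMANENT_PAYMENT_PATTERNS)


def call_with_retries(
    attempt: Callable[[], T], max_attempts: int = 3, backoff_s: float = 0.0
) -> T:
    """Run attempt(); retry transient PaymentErrors, re-raise permanent ones at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for n in range(1, max_attempts + 1):
        try:
            return attempt()
        except PaymentError as exc:
            # Permanent errors and the final attempt both propagate unchanged.
            if is_permanent(str(exc)) or n == max_attempts:
                raise
            time.sleep(backoff_s * n)  # simple linear backoff between retries
    raise RuntimeError("unreachable")  # loop always returns or raises


attempts = {"count": 0}


def always_simulation_failed() -> str:
    attempts["count"] += 1
    raise PaymentError("detail=transaction_simulation_failed")


try:
    call_with_retries(always_simulation_failed)
except PaymentError:
    print("attempts made:", attempts["count"])
```

A test for the third criterion would pass in a callable that raises a transient error (for example a facilitator 5xx message) a couple of times and then succeeds, and assert that the loop still retries it.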

Related

  • Upstream gateway fix: BlockRunAI/blockrun-sol — pre-settle for slow image models on Solana. This SDK change is a complementary UX fix; it does not depend on the gateway change.
