Conversation
`Agent.call`'s async-then-sync fallback was retrying any non-401/403 exception from the async path. That includes the case where the call reached the reasoner, the reasoner ran, and the reasoner explicitly returned a failed status. At that point, retrying via the sync path just runs the same reasoner again with the same input, burns the same budget, and produces the same deterministic failure.

Observed in production: github-buddy → pr-af.review fails with `pr_af.orchestrator.BudgetExhaustedError`; the SDK silently retries via sync 1 ms later, pr-af runs intake + anatomy from scratch, hits the budget again, and the work is billed twice. The same pattern would apply to any deterministic 5xx surface (validation errors, malformed input, exhausted quotas): one logical failure, two charges.

## Fix

Add a new `ExecutionFailedError(AgentFieldClientError)` exception to distinguish "the work ran and failed" from "the call never reached the reasoner":

- `async_execution_manager.wait_for_result` now raises `ExecutionFailedError` instead of plain `AgentFieldClientError` when the polled execution status is `FAILED`. (Backward-compatible: `ExecutionFailedError` inherits from `AgentFieldClientError`, so callers catching the parent still see it.)
- `Agent.call`'s exception handler skips the sync fallback when the async exception is `ExecutionFailedError` or `ExecutionTimeoutError`. Both mean the work has already used its budget on the agent side. Plain `AgentFieldClientError` (transport / submission / network) continues to fall back via sync, preserving the recovery path that `fallback_to_sync` was designed for.

The fix is asymmetric on purpose: retry remains on for transient transport failures (the legitimate use case), but is now off for post-execution failures (the wasteful case). No new config knob: this is the right default, and an opt-in env override would invite future regressions.

## Test plan

`tests/test_agent_call.py` adds three pinning tests:

- `test_call_skips_sync_fallback_on_execution_failed_error`
- `test_call_skips_sync_fallback_on_execution_timeout_error`
- `test_call_still_falls_back_on_transport_errors` (regression guard for the recovery path)

All 65 tests pass across the affected files (`test_agent_call.py`, `test_agent_core.py`, `test_async_execution_manager_*`, `test_client_*`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Performance
✓ No regressions detected
Contributor
📊 Coverage gate
Thresholds from
✅ Gate passed
No surface regressed past the allowed threshold and the aggregate stayed above the floor.
Contributor
📐 Patch coverage gate
Threshold: 80% on lines this PR touches vs
✅ Patch gate passed
Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
## Summary

`Agent.call`'s async-then-sync fallback was retrying any non-401/403 exception from the async path. That includes the case where the call reached the reasoner, the work ran, and the reasoner explicitly returned a failed status. At that point, retrying via the sync path just runs the same reasoner again with the same input, burns the same budget, and produces the same deterministic failure.

Observed in production: github-buddy → `pr-af.review` fails with `pr_af.orchestrator.BudgetExhaustedError`; the SDK silently retries via sync 1 ms later (look for "Falling back to sync execution for target" in the calling agent's logs); pr-af runs intake + anatomy from scratch, hits the budget again, and the work is billed twice. The same pattern applies to any deterministic 5xx surface (validation errors, malformed input, exhausted quotas): one logical failure, two charges.

## Fix

A new `ExecutionFailedError(AgentFieldClientError)` exception distinguishes "the work ran and failed" from "the call never reached the reasoner":

- `async_execution_manager.wait_for_result` now raises `ExecutionFailedError` instead of plain `AgentFieldClientError` when the polled execution status is `FAILED`. Backward-compatible: `ExecutionFailedError` inherits from `AgentFieldClientError`, so callers catching the parent still see it; new callers can catch the subclass directly to distinguish post-execution errors from transport errors.
- `Agent.call`'s exception handler skips the sync fallback when the async exception is `ExecutionFailedError` or `ExecutionTimeoutError`. Both mean the work has already used (or exceeded) its budget on the agent side. Plain `AgentFieldClientError` (transport / submission / network) continues to fall back via sync, preserving the recovery path `fallback_to_sync` was designed for.

The behaviour is asymmetric on purpose: retry remains on for transient transport failures (the legitimate use case), but is now off for post-execution failures (the wasteful case). No new config knob: this is the right default, and an opt-in env override would invite future regressions.
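The asymmetric handling described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual code: `call_with_fallback`, `submit_async`, and `submit_sync` are hypothetical names, the exception classes are local stand-ins, the parent class of `ExecutionTimeoutError` is an assumption, and the existing 401/403 handling is left out for brevity.

```python
import asyncio


class AgentFieldClientError(Exception):
    """Stand-in for the SDK's base client error (transport, submission, network)."""


class ExecutionFailedError(AgentFieldClientError):
    """Stand-in: the reasoner ran and the polled status was FAILED."""


class ExecutionTimeoutError(AgentFieldClientError):
    """Stand-in; inheriting from AgentFieldClientError is an assumption."""


# Errors that mean the work already ran (or timed out) on the agent side.
NO_FALLBACK = (ExecutionFailedError, ExecutionTimeoutError)


async def call_with_fallback(submit_async, submit_sync, fallback_to_sync=True):
    """Try the async path; fall back to sync only for transport-level errors."""
    try:
        return await submit_async()
    except NO_FALLBACK:
        # Post-execution failure: a retry would rerun the same work and
        # spend the same budget again, so propagate the error unchanged.
        raise
    except AgentFieldClientError:
        # The call may never have reached the reasoner; retrying is safe.
        if not fallback_to_sync:
            raise
        return await submit_sync()
```

The `except NO_FALLBACK` clause has to come before the `except AgentFieldClientError` clause: because the subclasses are also instances of the parent, the broader handler would otherwise catch them first and trigger the retry.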
## Test plan

`tests/test_agent_call.py` adds three pinning tests:

- `test_call_skips_sync_fallback_on_execution_failed_error`: verifies `ExecutionFailedError` does not trigger the sync fallback.
- `test_call_skips_sync_fallback_on_execution_timeout_error`: same for `ExecutionTimeoutError`.
- `test_call_still_falls_back_on_transport_errors`: regression guard; plain `AgentFieldClientError` (e.g. "connection reset by peer") must still trigger sync fallback.

All 65 tests pass across the affected files (`test_agent_call.py`, `test_agent_core.py`, `test_async_execution_manager_comprehensive.py`, `test_async_execution_manager_final90.py`, `test_client_laser_push.py`, `test_client_bigfiles_coverage.py`).

🤖 Generated with Claude Code
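For readers without the test file to hand, the shape of such pinning tests might look like the sketch below. It is not the contents of `tests/test_agent_call.py`: the `call` function is a hypothetical stand-in for the fallback logic, the exception classes are local stand-ins, and the test names are shortened. It uses only `unittest.mock.AsyncMock` from the standard library.

```python
import asyncio
from unittest.mock import AsyncMock


class AgentFieldClientError(Exception):
    """Stand-in for the SDK's base client error."""


class ExecutionFailedError(AgentFieldClientError):
    """Stand-in: the work ran and reported FAILED."""


async def call(async_path, sync_path):
    """Hypothetical stand-in for the fallback decision under test."""
    try:
        return await async_path()
    except ExecutionFailedError:
        raise  # post-execution failure: never retry via sync
    except AgentFieldClientError:
        return await sync_path()  # transport failure: sync fallback


def test_skips_sync_fallback_on_execution_failed_error():
    async_path = AsyncMock(side_effect=ExecutionFailedError("budget exhausted"))
    sync_path = AsyncMock(return_value="sync-result")
    try:
        asyncio.run(call(async_path, sync_path))
    except ExecutionFailedError:
        pass
    else:
        raise AssertionError("expected ExecutionFailedError to propagate")
    # The pin: the sync path must not have been touched at all.
    sync_path.assert_not_awaited()


def test_still_falls_back_on_transport_errors():
    async_path = AsyncMock(side_effect=AgentFieldClientError("connection reset by peer"))
    sync_path = AsyncMock(return_value="sync-result")
    assert asyncio.run(call(async_path, sync_path)) == "sync-result"
    sync_path.assert_awaited_once()
```

Asserting on the sync mock's await count, not only on the raised exception, is what pins the behaviour: a regression that retried and then re-raised the same error would still fail the first test.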