Skip to content

fix(sdk): don't fall back to sync when the reasoner already ran#530

Merged
AbirAbbas merged 1 commit intomainfrom
fix/sdk-no-sync-fallback-on-execution-failure
May 5, 2026
Merged

fix(sdk): don't fall back to sync when the reasoner already ran#530
AbirAbbas merged 1 commit intomainfrom
fix/sdk-no-sync-fallback-on-execution-failure

Conversation

@AbirAbbas
Copy link
Copy Markdown
Contributor

Summary

Agent.call's async-then-sync-fallback was retrying any non-401/403 exception from the async path. That includes the case where the call reached the reasoner, the work ran, and the reasoner explicitly returned a failed status — at which point retrying via the sync path just runs the same reasoner again with the same input, burns the same budget, and produces the same deterministic failure.

Observed in production: github-buddy → pr-af.review fails with pr_af.orchestrator.BudgetExhaustedError; the SDK silently retries via sync 1 ms later (look for "Falling back to sync execution for target" in the calling agent's logs); pr-af runs intake + anatomy from scratch, hits the budget again, double-billed. The same pattern applies to any deterministic 5xx surface (validation errors, malformed input, exhausted quotas) — one logical failure, two charges.

Fix

A new ExecutionFailedError(AgentFieldClientError) exception distinguishes "the work ran and failed" from "the call never reached the reasoner":

  • async_execution_manager.wait_for_result now raises ExecutionFailedError instead of plain AgentFieldClientError when the polled execution status is FAILED. Backward-compatible: ExecutionFailedError inherits from AgentFieldClientError, so callers catching the parent still see it; new callers can catch the subclass directly to distinguish post-execution errors from transport errors.
  • Agent.call's exception handler skips the sync fallback when the async exception is ExecutionFailedError or ExecutionTimeoutError — both mean the work has already used (or exceeded) its budget on the agent side. Plain AgentFieldClientError (transport / submission / network) continues to fall back via sync, preserving the recovery path fallback_to_sync was designed for.

The behaviour is asymmetric on purpose: retry remains on for transient transport failures (the legitimate use case), but is now off for post-execution failures (the wasteful case). No new config knob — this is the right default, and an opt-in env override would invite future regressions.

Test plan

tests/test_agent_call.py adds three pinning tests:

  • test_call_skips_sync_fallback_on_execution_failed_error — verifies ExecutionFailedError does not trigger the sync fallback.
  • test_call_skips_sync_fallback_on_execution_timeout_error — same for ExecutionTimeoutError.
  • test_call_still_falls_back_on_transport_errors — regression guard: plain AgentFieldClientError (e.g. "connection reset by peer") must still trigger sync fallback.

All 65 tests pass across the affected files (test_agent_call.py, test_agent_core.py, test_async_execution_manager_comprehensive.py, test_async_execution_manager_final90.py, test_client_laser_push.py, test_client_bigfiles_coverage.py).

🤖 Generated with Claude Code

`Agent.call`'s async-then-sync-fallback was retrying any non-401/403
exception from the async path. That includes the case where the call
reached the reasoner, the reasoner ran, and the reasoner explicitly
returned a failed status — at which point retrying via the sync path
just runs the same reasoner again with the same input, burns the same
budget, and produces the same deterministic failure.

Observed in production: github-buddy → pr-af.review fails with
`pr_af.orchestrator.BudgetExhaustedError`; the SDK silently retries
via sync 1 ms later, pr-af runs intake + anatomy from scratch, hits
budget again, double-billed. Same pattern would apply to any
deterministic 5xx surface (validation errors, malformed input, exhausted
quotas) — one logical failure, two charges.

## Fix

Add a new `ExecutionFailedError(AgentFieldClientError)` exception to
distinguish "the work ran and failed" from "the call never reached the
reasoner":

- `async_execution_manager.wait_for_result` now raises
  `ExecutionFailedError` instead of plain `AgentFieldClientError` when
  the polled execution status is `FAILED`. (Backward-compatible:
  `ExecutionFailedError` inherits from `AgentFieldClientError`, so
  callers catching the parent still see it.)
- `Agent.call`'s exception handler skips the sync fallback when the
  async exception is `ExecutionFailedError` or `ExecutionTimeoutError`
  — both mean the work has already used its budget on the agent side.
  Plain `AgentFieldClientError` (transport / submission / network)
  continues to fall back via sync, preserving the recovery path that
  fallback_to_sync was designed for.

The fix is asymmetric on purpose: retry remains on for transient
transport failures (the legitimate use case), but is now off for
post-execution failures (the wasteful case). No new config knob —
this is the right default and an opt-in env override would invite
future regressions.

## Test plan

`tests/test_agent_call.py` adds three pinning tests:
- `test_call_skips_sync_fallback_on_execution_failed_error`
- `test_call_skips_sync_fallback_on_execution_timeout_error`
- `test_call_still_falls_back_on_transport_errors` (regression guard
  for the recovery path)

All 65 tests pass across the affected files (`test_agent_call.py`,
`test_agent_core.py`, `test_async_execution_manager_*`,
`test_client_*`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas requested a review from a team as a code owner May 5, 2026 18:54
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

Performance

SDK Memory Δ Latency Δ Tests Status
Python 7.9 KB -13% 0.23 µs -34%

✓ No regressions detected

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

Surface Current Baseline Δ
control-plane 87.40% 87.30% ↑ +0.10 pp 🟡
sdk-go 91.80% 90.70% ↑ +1.10 pp 🟢
sdk-python 93.66% 93.63% ↑ +0.03 pp 🟢
sdk-typescript 92.63% 92.56% ↑ +0.07 pp 🟢
web-ui 89.69% 90.01% ↓ -0.32 pp 🟡
aggregate 88.85% 89.01% ↓ -0.16 pp 🟡

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 5, 2026

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

Surface Touched lines Patch coverage Status
control-plane 0 ➖ no changes
sdk-go 0 ➖ no changes
sdk-python 0 ➖ no changes
sdk-typescript 0 ➖ no changes
web-ui 0 ➖ no changes

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

@AbirAbbas AbirAbbas merged commit cecb776 into main May 5, 2026
32 checks passed
@AbirAbbas AbirAbbas deleted the fix/sdk-no-sync-fallback-on-execution-failure branch May 5, 2026 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant