fix(sdk): don't fall back to sync when the reasoner already ran by AbirAbbas · Pull Request #530 · Agent-Field/agentfield

AbirAbbas · 2026-05-05T18:54:45Z

Summary

Agent.call's async-then-sync-fallback was retrying any non-401/403 exception from the async path. That includes the case where the call reached the reasoner, the work ran, and the reasoner explicitly returned a failed status — at which point retrying via the sync path just runs the same reasoner again with the same input, burns the same budget, and produces the same deterministic failure.

Observed in production: github-buddy → pr-af.review fails with pr_af.orchestrator.BudgetExhaustedError; the SDK silently retries via sync 1 ms later (look for "Falling back to sync execution for target" in the calling agent's logs); pr-af runs intake + anatomy from scratch, hits the budget again, double-billed. The same pattern applies to any deterministic 5xx surface (validation errors, malformed input, exhausted quotas) — one logical failure, two charges.

Fix

A new ExecutionFailedError(AgentFieldClientError) exception distinguishes "the work ran and failed" from "the call never reached the reasoner":

async_execution_manager.wait_for_result now raises ExecutionFailedError instead of plain AgentFieldClientError when the polled execution status is FAILED. Backward-compatible: ExecutionFailedError inherits from AgentFieldClientError, so callers catching the parent still see it; new callers can catch the subclass directly to distinguish post-execution errors from transport errors.
Agent.call's exception handler skips the sync fallback when the async exception is ExecutionFailedError or ExecutionTimeoutError — both mean the work has already used (or exceeded) its budget on the agent side. Plain AgentFieldClientError (transport / submission / network) continues to fall back via sync, preserving the recovery path fallback_to_sync was designed for.

The behaviour is asymmetric on purpose: retry remains on for transient transport failures (the legitimate use case), but is now off for post-execution failures (the wasteful case). No new config knob — this is the right default, and an opt-in env override would invite future regressions.

Test plan

tests/test_agent_call.py adds three pinning tests:

test_call_skips_sync_fallback_on_execution_failed_error — verifies ExecutionFailedError does not trigger the sync fallback.
test_call_skips_sync_fallback_on_execution_timeout_error — same for ExecutionTimeoutError.
test_call_still_falls_back_on_transport_errors — regression guard: plain AgentFieldClientError (e.g. "connection reset by peer") must still trigger sync fallback.

All 65 tests pass across the affected files (test_agent_call.py, test_agent_core.py, test_async_execution_manager_comprehensive.py, test_async_execution_manager_final90.py, test_client_laser_push.py, test_client_bigfiles_coverage.py).

🤖 Generated with Claude Code

`Agent.call`'s async-then-sync-fallback was retrying any non-401/403 exception from the async path. That includes the case where the call reached the reasoner, the reasoner ran, and the reasoner explicitly returned a failed status — at which point retrying via the sync path just runs the same reasoner again with the same input, burns the same budget, and produces the same deterministic failure. Observed in production: github-buddy → pr-af.review fails with `pr_af.orchestrator.BudgetExhaustedError`; the SDK silently retries via sync 1 ms later, pr-af runs intake + anatomy from scratch, hits budget again, double-billed. Same pattern would apply to any deterministic 5xx surface (validation errors, malformed input, exhausted quotas) — one logical failure, two charges. ## Fix Add a new `ExecutionFailedError(AgentFieldClientError)` exception to distinguish "the work ran and failed" from "the call never reached the reasoner": - `async_execution_manager.wait_for_result` now raises `ExecutionFailedError` instead of plain `AgentFieldClientError` when the polled execution status is `FAILED`. (Backward-compatible: `ExecutionFailedError` inherits from `AgentFieldClientError`, so callers catching the parent still see it.) - `Agent.call`'s exception handler skips the sync fallback when the async exception is `ExecutionFailedError` or `ExecutionTimeoutError` — both mean the work has already used its budget on the agent side. Plain `AgentFieldClientError` (transport / submission / network) continues to fall back via sync, preserving the recovery path that fallback_to_sync was designed for. The fix is asymmetric on purpose: retry remains on for transient transport failures (the legitimate use case), but is now off for post-execution failures (the wasteful case). No new config knob — this is the right default and an opt-in env override would invite future regressions. ## Test plan `tests/test_agent_call.py` adds three pinning tests: - `test_call_skips_sync_fallback_on_execution_failed_error` - `test_call_skips_sync_fallback_on_execution_timeout_error` - `test_call_still_falls_back_on_transport_errors` (regression guard for the recovery path) All 65 tests pass across the affected files (`test_agent_call.py`, `test_agent_core.py`, `test_async_execution_manager_*`, `test_client_*`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-05T18:56:34Z

Performance

SDK	Memory	Δ	Latency	Δ	Tests	Status
Python	7.9 KB	-13%	0.23 µs	-34%	✓	✓

✓ No regressions detected

github-actions · 2026-05-05T18:59:23Z

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

Surface	Current	Baseline	Δ
`control-plane`	87.40%	87.30%	↑ +0.10 pp	🟡
`sdk-go`	91.80%	90.70%	↑ +1.10 pp	🟢
`sdk-python`	93.66%	93.63%	↑ +0.03 pp	🟢
`sdk-typescript`	92.63%	92.56%	↑ +0.07 pp	🟢
`web-ui`	89.69%	90.01%	↓ -0.32 pp	🟡
aggregate	88.85%	89.01%	↓ -0.16 pp	🟡

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

github-actions · 2026-05-05T18:59:24Z

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

Surface	Patch coverage	Status
`control-plane`	—	➖ no changes
`sdk-go`	—	➖ no changes
`sdk-python`	—	➖ no changes
`sdk-typescript`	—	➖ no changes
`web-ui`	—	➖ no changes

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

AbirAbbas requested a review from a team as a code owner May 5, 2026 18:54

AbirAbbas merged commit cecb776 into main May 5, 2026
32 checks passed

AbirAbbas deleted the fix/sdk-no-sync-fallback-on-execution-failure branch May 5, 2026 19:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sdk): don't fall back to sync when the reasoner already ran#530

fix(sdk): don't fall back to sync when the reasoner already ran#530
AbirAbbas merged 1 commit intomainfrom
fix/sdk-no-sync-fallback-on-execution-failure

AbirAbbas commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

github-actions Bot commented May 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AbirAbbas commented May 5, 2026

Summary

Fix

Test plan

Uh oh!

github-actions Bot commented May 5, 2026

Performance

Uh oh!

github-actions Bot commented May 5, 2026

📊 Coverage gate

✅ Gate passed

Uh oh!

github-actions Bot commented May 5, 2026

📐 Patch coverage gate

✅ Patch gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant