fix(sdk): per-call ai() timeout + OpenRouter image retry/strip fallback by santoshkumarradha · Pull Request #600 · Agent-Field/agentfield

santoshkumarradha · 2026-05-28T22:10:32Z

Summary

Two SDK gaps surfaced by the reel-af example project under a real URL → vertical-reel workload. Both forced consumer-side workarounds that this PR removes.

Filed alongside seven other reel-af pain points in AGENTFIELD_SDK_ISSUES.md. Of the seven, four were already fixed upstream (PR #579 + the video-download auth fix). This PR covers the remaining three.

#	Issue	Status
1	`generate_audio` hardcoded `stream=true` rejects `format='wav'`	Already fixed (auto pcm16 → wav wrap)
2	`generate_audio` cannot accept a system message	Already fixed (`system` kwarg)
3	`generate_image_openrouter` 404s deterministically with `image_config={"aspect_ratio":"9:16"}`	Fixed here
4	`ImageOutput.save()` chokes on `data:` URLs from Gemini	Already fixed (`get_bytes()` handles `data:`)
5	Intermittent ~1-3% routing 404 on `generate_image_openrouter` under concurrency	Fixed here
6	`generate_video` downloads `unsigned_urls` without auth → 401	Already fixed (per-host header logic)
7	`app.ai()` has no per-call timeout override	Fixed here

Changes

`vision.generate_image_openrouter` — retry on routing 404 (#3, #5)

OpenRouter's litellm.NotFoundError: \"No endpoints found that support the requested output modalities\" surfaces in two flavours:

Deterministic when image_config (e.g. aspect_ratio=9:16) hits a model whose upstream replicas don't expose that param.
Intermittent ~1-3% under load when routing momentarily lands on a replica without image modality.

Single retry policy that handles both:

Wraps the litellm.acompletion(...) call in a 3-attempt backoff loop (sleeps 1s, 2s between attempts).
If image_config was non-empty and all 3 retries failed, makes one final attempt with image_config stripped, logs a warning via log_warn, and uses the result.
Other exceptions still propagate immediately (no retry on generic errors).
The existing asyncio.wait_for(..., timeout=...) envelope still wraps every individual attempt — env var AGENTFIELD_LLM_CALL_TIMEOUT semantics unchanged.
Worst-case wait: 3s without image_config, 7s with.

`agent_ai.ai()` — per-call timeout override (#7)

ai() now accepts timeout: Optional[float] = None. When set, it overrides async_config.llm_call_timeout for that single call only, propagating to:

litellm_params[\"timeout\"] — so httpx aborts at the socket level (preserves the comment block at line 527-533 explaining the connection-pool deadlock this prevents).
The asyncio.wait_for(..., timeout=effective_timeout * 2) safety net.

Wired through all three call paths inside ai():

Direct (line 534-537 region)
Tool-loop _make_call() closure (line 585-606 region)
Non-tool _make_litellm_call() closure (line 665-682 region)

A single effective_timeout is computed once at function scope and captured by both nested closures. No env-var defaults touched. No sibling methods (ai_with_audio, ai_with_vision, etc.) modified.

Why this matters: the previous model forced every call in a mixed fast/slow pipeline to use the slowest expected timeout. reel-af's distiller and scene_breaker want ~60s; compose_script on a long arXiv preprint needs 10+ minutes. Today the agent-wide knob makes the fast calls wait too long on the rare timeout path.

Test plan

tests/test_vision.py — 4 new cases:
- test_generate_image_openrouter_retries_on_no_endpoints_then_succeeds — first call raises, second succeeds → acompletion called twice
- test_generate_image_openrouter_strips_image_config_after_retries — all 3 retries fail with image_config={\"aspect_ratio\":\"9:16\"}; 4th attempt has image_config stripped and succeeds
- test_generate_image_openrouter_does_not_retry_on_other_errors — generic RuntimeError propagates after a single call
- test_generate_image_openrouter_gives_up_after_all_retries_no_image_config — 3 attempts then re-raise (no useless 4th call when there's nothing to strip)
- asyncio.sleep patched out via monkeypatch — suite remains instant
tests/test_agent_ai.py — 3 new cases sharing a _setup_timeout_test helper:
- test_ai_per_call_timeout_overrides_agent_default — ai(\"hello\", timeout=30.0) → captured acompletion(timeout=30.0)
- test_ai_falls_back_to_agent_default_when_no_per_call_timeout → captured timeout=120.0
- test_ai_per_call_timeout_applies_to_wait_for_safety_net — timeout=10.0 → captured wait_for(timeout=20.0)
112 related tests pass: full test_agent_ai*, test_vision, test_media_providers, test_openrouter_audio, test_image_config, test_agent_ai_deadlock_recovery suites
ruff check clean on all touched files
ruff format --check clean on all touched files
CI (Python 3.10 / 3.11 / 3.12 matrix) — kicks off on PR open

Notes for reviewers

No public API breaks. timeout= is opt-in; image_config retry/strip is transparent and only triggers on the specific "No endpoints found" error shape.
Both fixes are scoped to single functions. No new module-level abstractions or refactoring.
After this lands, the reel-af sdk_patches.py monkey-patch can shrink (it carries the video-download auth fix which is also now upstream) and the AGENTFIELD_ASYNC_LLM_CALL_TIMEOUT=600 env bump can be replaced with per-call timeout= on the slow calls only.

Two SDK gaps surfaced by the reel-af example project under a real URL-to-vertical-reel workload. Both forced consumer-side workarounds that this PR removes. vision.generate_image_openrouter — retry on routing 404 OpenRouter's "No endpoints found that support the requested output modalities" surfaces in two flavours: 1. Deterministic 404 when image_config (e.g. aspect_ratio=9:16) hits a model whose upstream replicas don't expose that param. 2. Intermittent 1-3% 404 under load when routing momentarily lands on a replica without image modality. Now wraps the litellm.acompletion call in a 3-attempt backoff (1s, 2s) and, if image_config was set and all retries failed, makes one final attempt with image_config stripped, logging a warning. Other exceptions still propagate immediately. The existing asyncio.wait_for timeout still wraps every attempt. Worst-case wait: 3s without image_config, 7s with. agent_ai.ai() — per-call timeout override ai() now accepts timeout: Optional[float] = None. When set, it overrides async_config.llm_call_timeout for that single call, propagating to both litellm_params["timeout"] (httpx socket-level) and the asyncio.wait_for safety net (2x). Previously the only knob was the agent-wide config, which forced every call in a mixed fast/slow pipeline to use the slowest expected timeout. Wired through all three call paths: direct, tool-loop _make_call, and non-tool _make_litellm_call. Tests - 4 new tests in test_vision.py covering retry-then-success, strip- image_config-after-retries, no-retry-on-other-errors, and give-up after retries. Sleeps patched out. - 3 new tests in test_agent_ai.py covering per-call override, fallback to agent default, and 2x safety-net propagation. - 112 related tests pass (full agent_ai + vision + media_providers + openrouter_audio + image_config + deadlock_recovery suites). - ruff check + format check clean.

github-actions · 2026-05-28T22:12:13Z

Performance

SDK	Memory	Δ	Latency	Δ	Tests	Status
Python	9.4 KB	+4%	0.31 µs	-11%	✓	✓

✓ No regressions detected

github-actions · 2026-05-28T22:14:44Z

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

Surface	Current	Baseline	Δ
`control-plane`	87.50%	87.30%	↑ +0.20 pp	🟡
`sdk-go`	91.90%	90.70%	↑ +1.20 pp	🟢
`sdk-python`	93.73%	93.63%	↑ +0.10 pp	🟢
`sdk-typescript`	92.80%	92.56%	↑ +0.24 pp	🟢
`web-ui`	89.93%	90.01%	↓ -0.08 pp	🟡
aggregate	89.03%	89.01%	↑ +0.02 pp	🟡

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

github-actions · 2026-05-28T22:14:46Z

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

Surface	Patch coverage	Status
`control-plane`	—	➖ no changes
`sdk-go`	—	➖ no changes
`sdk-python`	—	➖ no changes
`sdk-typescript`	—	➖ no changes
`web-ui`	—	➖ no changes

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

…image_config Addresses two informational findings from PR review. - agent_ai.ai(): reject timeout <= 0 with ValueError instead of letting asyncio.wait_for raise a confusing immediate TimeoutError. Defensive guard against user error; cheap one-line check. - vision.generate_image_openrouter: add comment explaining that the falsy check for image_config in the strip-and-retry branch is intentional. image_config={} (user explicitly opting into provider defaults per the existing docstring) produces an identical wire call whether stripped or not, so skipping the useless extra attempt is the right behavior. Comment prevents a future "fix" that would just burn cycles. - One new test: test_ai_rejects_non_positive_timeout covers both timeout=0 and timeout=-1.0.

santoshkumarradha requested review from a team and AbirAbbas as code owners May 28, 2026 22:10

santoshkumarradha enabled auto-merge May 28, 2026 22:13

santoshkumarradha merged commit 2c5acb8 into main May 28, 2026
28 checks passed

santoshkumarradha deleted the fix/sdk-reel-af-issues branch May 28, 2026 22:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sdk): per-call ai() timeout + OpenRouter image retry/strip fallback#600

fix(sdk): per-call ai() timeout + OpenRouter image retry/strip fallback#600
santoshkumarradha merged 2 commits into
mainfrom
fix/sdk-reel-af-issues

santoshkumarradha commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 28, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

santoshkumarradha commented May 28, 2026

Summary

Changes

vision.generate_image_openrouter — retry on routing 404 (#3, #5)

agent_ai.ai() — per-call timeout override (#7)

Test plan

Notes for reviewers

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Coverage gate

✅ Gate passed

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📐 Patch coverage gate

✅ Patch gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`vision.generate_image_openrouter` — retry on routing 404 (#3, #5)

`agent_ai.ai()` — per-call timeout override (#7)

github-actions Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading