Skip to content

fix(sdk): per-call ai() timeout + OpenRouter image retry/strip fallback#600

Merged
santoshkumarradha merged 2 commits into
mainfrom
fix/sdk-reel-af-issues
May 28, 2026
Merged

fix(sdk): per-call ai() timeout + OpenRouter image retry/strip fallback#600
santoshkumarradha merged 2 commits into
mainfrom
fix/sdk-reel-af-issues

Conversation

@santoshkumarradha
Copy link
Copy Markdown
Member

Summary

Two SDK gaps surfaced by the reel-af example project under a real URL → vertical-reel workload. Both forced consumer-side workarounds that this PR removes.

Filed alongside seven other reel-af pain points in AGENTFIELD_SDK_ISSUES.md. Of the seven, four were already fixed upstream (PR #579 + the video-download auth fix). This PR covers the remaining three.

# Issue Status
1 generate_audio hardcoded stream=true rejects format='wav' Already fixed (auto pcm16 → wav wrap)
2 generate_audio cannot accept a system message Already fixed (system kwarg)
3 generate_image_openrouter 404s deterministically with image_config={"aspect_ratio":"9:16"} Fixed here
4 ImageOutput.save() chokes on data: URLs from Gemini Already fixed (get_bytes() handles data:)
5 Intermittent ~1-3% routing 404 on generate_image_openrouter under concurrency Fixed here
6 generate_video downloads unsigned_urls without auth → 401 Already fixed (per-host header logic)
7 app.ai() has no per-call timeout override Fixed here

Changes

vision.generate_image_openrouter — retry on routing 404 (#3, #5)

OpenRouter's litellm.NotFoundError: \"No endpoints found that support the requested output modalities\" surfaces in two flavours:

  1. Deterministic when image_config (e.g. aspect_ratio=9:16) hits a model whose upstream replicas don't expose that param.
  2. Intermittent ~1-3% under load when routing momentarily lands on a replica without image modality.

Single retry policy that handles both:

  • Wraps the litellm.acompletion(...) call in a 3-attempt backoff loop (sleeps 1s, 2s between attempts).
  • If image_config was non-empty and all 3 retries failed, makes one final attempt with image_config stripped, logs a warning via log_warn, and uses the result.
  • Other exceptions still propagate immediately (no retry on generic errors).
  • The existing asyncio.wait_for(..., timeout=...) envelope still wraps every individual attempt — env var AGENTFIELD_LLM_CALL_TIMEOUT semantics unchanged.
  • Worst-case wait: 3s without image_config, 7s with.

agent_ai.ai() — per-call timeout override (#7)

ai() now accepts timeout: Optional[float] = None. When set, it overrides async_config.llm_call_timeout for that single call only, propagating to:

  • litellm_params[\"timeout\"] — so httpx aborts at the socket level (preserves the comment block at line 527-533 explaining the connection-pool deadlock this prevents).
  • The asyncio.wait_for(..., timeout=effective_timeout * 2) safety net.

Wired through all three call paths inside ai():

  • Direct (line 534-537 region)
  • Tool-loop _make_call() closure (line 585-606 region)
  • Non-tool _make_litellm_call() closure (line 665-682 region)

A single effective_timeout is computed once at function scope and captured by both nested closures. No env-var defaults touched. No sibling methods (ai_with_audio, ai_with_vision, etc.) modified.

Why this matters: the previous model forced every call in a mixed fast/slow pipeline to use the slowest expected timeout. reel-af's distiller and scene_breaker want ~60s; compose_script on a long arXiv preprint needs 10+ minutes. Today the agent-wide knob makes the fast calls wait too long on the rare timeout path.

Test plan

  • tests/test_vision.py — 4 new cases:
    • test_generate_image_openrouter_retries_on_no_endpoints_then_succeeds — first call raises, second succeeds → acompletion called twice
    • test_generate_image_openrouter_strips_image_config_after_retries — all 3 retries fail with image_config={\"aspect_ratio\":\"9:16\"}; 4th attempt has image_config stripped and succeeds
    • test_generate_image_openrouter_does_not_retry_on_other_errors — generic RuntimeError propagates after a single call
    • test_generate_image_openrouter_gives_up_after_all_retries_no_image_config — 3 attempts then re-raise (no useless 4th call when there's nothing to strip)
    • asyncio.sleep patched out via monkeypatch — suite remains instant
  • tests/test_agent_ai.py — 3 new cases sharing a _setup_timeout_test helper:
    • test_ai_per_call_timeout_overrides_agent_defaultai(\"hello\", timeout=30.0) → captured acompletion(timeout=30.0)
    • test_ai_falls_back_to_agent_default_when_no_per_call_timeout → captured timeout=120.0
    • test_ai_per_call_timeout_applies_to_wait_for_safety_nettimeout=10.0 → captured wait_for(timeout=20.0)
  • 112 related tests pass: full test_agent_ai*, test_vision, test_media_providers, test_openrouter_audio, test_image_config, test_agent_ai_deadlock_recovery suites
  • ruff check clean on all touched files
  • ruff format --check clean on all touched files
  • CI (Python 3.10 / 3.11 / 3.12 matrix) — kicks off on PR open

Notes for reviewers

  • No public API breaks. timeout= is opt-in; image_config retry/strip is transparent and only triggers on the specific "No endpoints found" error shape.
  • Both fixes are scoped to single functions. No new module-level abstractions or refactoring.
  • After this lands, the reel-af sdk_patches.py monkey-patch can shrink (it carries the video-download auth fix which is also now upstream) and the AGENTFIELD_ASYNC_LLM_CALL_TIMEOUT=600 env bump can be replaced with per-call timeout= on the slow calls only.

Two SDK gaps surfaced by the reel-af example project under a real
URL-to-vertical-reel workload. Both forced consumer-side workarounds
that this PR removes.

vision.generate_image_openrouter — retry on routing 404
  OpenRouter's "No endpoints found that support the requested output
  modalities" surfaces in two flavours:
    1. Deterministic 404 when image_config (e.g. aspect_ratio=9:16) hits
       a model whose upstream replicas don't expose that param.
    2. Intermittent 1-3% 404 under load when routing momentarily lands
       on a replica without image modality.
  Now wraps the litellm.acompletion call in a 3-attempt backoff
  (1s, 2s) and, if image_config was set and all retries failed, makes
  one final attempt with image_config stripped, logging a warning. Other
  exceptions still propagate immediately. The existing asyncio.wait_for
  timeout still wraps every attempt. Worst-case wait: 3s without
  image_config, 7s with.

agent_ai.ai() — per-call timeout override
  ai() now accepts timeout: Optional[float] = None. When set, it
  overrides async_config.llm_call_timeout for that single call,
  propagating to both litellm_params["timeout"] (httpx socket-level)
  and the asyncio.wait_for safety net (2x). Previously the only knob
  was the agent-wide config, which forced every call in a mixed
  fast/slow pipeline to use the slowest expected timeout. Wired
  through all three call paths: direct, tool-loop _make_call, and
  non-tool _make_litellm_call.

Tests
  - 4 new tests in test_vision.py covering retry-then-success, strip-
    image_config-after-retries, no-retry-on-other-errors, and give-up
    after retries. Sleeps patched out.
  - 3 new tests in test_agent_ai.py covering per-call override,
    fallback to agent default, and 2x safety-net propagation.
  - 112 related tests pass (full agent_ai + vision + media_providers +
    openrouter_audio + image_config + deadlock_recovery suites).
  - ruff check + format check clean.
@santoshkumarradha santoshkumarradha requested review from a team and AbirAbbas as code owners May 28, 2026 22:10
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

Performance

SDK Memory Δ Latency Δ Tests Status
Python 9.4 KB +4% 0.31 µs -11%

✓ No regressions detected

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

Surface Current Baseline Δ
control-plane 87.50% 87.30% ↑ +0.20 pp 🟡
sdk-go 91.90% 90.70% ↑ +1.20 pp 🟢
sdk-python 93.73% 93.63% ↑ +0.10 pp 🟢
sdk-typescript 92.80% 92.56% ↑ +0.24 pp 🟢
web-ui 89.93% 90.01% ↓ -0.08 pp 🟡
aggregate 89.03% 89.01% ↑ +0.02 pp 🟡

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

Surface Touched lines Patch coverage Status
control-plane 0 ➖ no changes
sdk-go 0 ➖ no changes
sdk-python 0 ➖ no changes
sdk-typescript 0 ➖ no changes
web-ui 0 ➖ no changes

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

…image_config

Addresses two informational findings from PR review.

- agent_ai.ai(): reject timeout <= 0 with ValueError instead of letting
  asyncio.wait_for raise a confusing immediate TimeoutError. Defensive
  guard against user error; cheap one-line check.
- vision.generate_image_openrouter: add comment explaining that the
  falsy check for image_config in the strip-and-retry branch is
  intentional. image_config={} (user explicitly opting into provider
  defaults per the existing docstring) produces an identical wire call
  whether stripped or not, so skipping the useless extra attempt is
  the right behavior. Comment prevents a future "fix" that would just
  burn cycles.
- One new test: test_ai_rejects_non_positive_timeout covers both
  timeout=0 and timeout=-1.0.
@santoshkumarradha santoshkumarradha merged commit 2c5acb8 into main May 28, 2026
28 checks passed
@santoshkumarradha santoshkumarradha deleted the fix/sdk-reel-af-issues branch May 28, 2026 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant