fix(sdk): per-call ai() timeout + OpenRouter image retry/strip fallback #600
Merged
Two SDK gaps surfaced by the reel-af example project under a real
URL-to-vertical-reel workload. Both forced consumer-side workarounds
that this PR removes.
vision.generate_image_openrouter — retry on routing 404
OpenRouter's "No endpoints found that support the requested output
modalities" surfaces in two flavours:
1. Deterministic 404 when image_config (e.g. aspect_ratio=9:16) hits
a model whose upstream replicas don't expose that param.
2. Intermittent 1-3% 404 under load when routing momentarily lands
on a replica without image modality.
Now wraps the litellm.acompletion call in a 3-attempt backoff
(1s, 2s) and, if image_config was set and all retries failed, makes
one final attempt with image_config stripped, logging a warning. Other
exceptions still propagate immediately. The existing asyncio.wait_for
timeout still wraps every attempt. Worst-case wait: 3s without
image_config, 7s with.
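The retry policy described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the SDK's code: RoutingNotFound, generate_with_retry, and the injectable call and sleep parameters are hypothetical names, and the real function wraps litellm.acompletion and logs through the SDK's own helper. Any pause before the final stripped attempt is not modeled here.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Optional, Tuple

logger = logging.getLogger(__name__)

# Substring of the OpenRouter error text quoted in the description above.
ROUTING_ERROR_MARKER = "No endpoints found"


class RoutingNotFound(Exception):
    """Hypothetical stand-in for the provider's 404 error type."""


async def generate_with_retry(
    call: Callable[[Optional[dict]], Awaitable[Any]],
    image_config: Optional[dict] = None,
    timeout: float = 120.0,
    delays: Tuple[float, ...] = (1.0, 2.0),
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Retry routing 404s, then try once more without image_config."""
    last_error: Optional[Exception] = None
    for attempt in range(len(delays) + 1):  # 3 attempts for 2 delays
        try:
            # The timeout envelope wraps every individual attempt.
            return await asyncio.wait_for(call(image_config), timeout=timeout)
        except RoutingNotFound as exc:
            if ROUTING_ERROR_MARKER not in str(exc):
                raise  # a different 404: do not retry
            last_error = exc
            if attempt < len(delays):
                await sleep(delays[attempt])
    if image_config:
        # Falsy check is deliberate: {} would produce an identical call.
        logger.warning("routing 404 persisted; retrying without image_config")
        return await asyncio.wait_for(call(None), timeout=timeout)
    assert last_error is not None
    raise last_error
```

Only RoutingNotFound is caught, so any other exception leaves the loop on the first attempt. Passing sleep as a parameter is one way to keep tests instant; the PR's tests patch asyncio.sleep instead.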
agent_ai.ai() — per-call timeout override
ai() now accepts timeout: Optional[float] = None. When set, it
overrides async_config.llm_call_timeout for that single call,
propagating to both litellm_params["timeout"] (httpx socket-level)
and the asyncio.wait_for safety net (2x). Previously the only knob
was the agent-wide config, which forced every call in a mixed
fast/slow pipeline to use the slowest expected timeout. Wired
through all three call paths: direct, tool-loop _make_call, and
non-tool _make_litellm_call.
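As a rough illustration of the override, the sketch below resolves one effective timeout and applies it in both places named above. It is a simplified assumption of the shape, not the SDK's code: ai_call, the completion parameter, and AGENT_DEFAULT_TIMEOUT (standing in for async_config.llm_call_timeout) are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

# Assumed stand-in for async_config.llm_call_timeout.
AGENT_DEFAULT_TIMEOUT = 120.0


async def ai_call(
    completion: Callable[..., Awaitable[Any]],
    prompt: str,
    timeout: Optional[float] = None,
) -> Any:
    """Run one completion with a per-call timeout override."""
    # Computed once so every call path sees the same value.
    effective_timeout = timeout if timeout is not None else AGENT_DEFAULT_TIMEOUT

    # Socket-level timeout handed to the HTTP client via the call params.
    params: Dict[str, Any] = {"prompt": prompt, "timeout": effective_timeout}

    # Safety net at 2x, in case the socket-level timeout never fires.
    return await asyncio.wait_for(
        completion(**params), timeout=effective_timeout * 2
    )
```

With this shape a fast step can call ai_call(completion, "hello", timeout=60.0) while a slow step passes a larger value, and callers that omit timeout keep the agent-wide default.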
Tests
- 4 new tests in test_vision.py covering retry-then-success, strip-
image_config-after-retries, no-retry-on-other-errors, and give-up
after retries. Sleeps patched out.
- 3 new tests in test_agent_ai.py covering per-call override,
fallback to agent default, and 2x safety-net propagation.
- 112 related tests pass (full agent_ai + vision + media_providers +
openrouter_audio + image_config + deadlock_recovery suites).
- ruff check + format check clean.
Contributor
Performance
✓ No regressions detected
Contributor
📊 Coverage gate
Thresholds from
✅ Gate passed
No surface regressed past the allowed threshold and the aggregate stayed above the floor.
Contributor
📐 Patch coverage gate
Threshold: 80% on lines this PR touches vs
✅ Patch gate passed
Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
…image_config
Addresses two informational findings from PR review.
- agent_ai.ai(): reject timeout <= 0 with ValueError instead of letting
asyncio.wait_for raise a confusing immediate TimeoutError. Defensive
guard against user error; cheap one-line check.
- vision.generate_image_openrouter: add comment explaining that the
falsy check for image_config in the strip-and-retry branch is
intentional. image_config={} (user explicitly opting into provider
defaults per the existing docstring) produces an identical wire call
whether stripped or not, so skipping the useless extra attempt is
the right behavior. Comment prevents a future "fix" that would just
burn cycles.
- One new test: test_ai_rejects_non_positive_timeout covers both
timeout=0 and timeout=-1.0.
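The guard from the first bullet could look something like this. It is a sketch under assumptions: check_timeout is a hypothetical helper name and the error message is invented for illustration; the source only states that timeout <= 0 raises ValueError.

```python
from typing import Optional


def check_timeout(timeout: Optional[float]) -> Optional[float]:
    """Reject non-positive per-call timeouts before any call is made."""
    # None means "use the agent-wide default", so it passes through.
    if timeout is not None and timeout <= 0:
        # Message text is illustrative, not the SDK's wording.
        raise ValueError(f"timeout must be positive, got {timeout!r}")
    return timeout
```

Failing early with ValueError points at the caller's argument, whereas letting a zero or negative value reach asyncio.wait_for would surface as a timeout that looks like a slow provider.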
Summary
Two SDK gaps surfaced by the reel-af example project under a real URL → vertical-reel workload. Both forced consumer-side workarounds that this PR removes.
Filed alongside seven other reel-af pain points in AGENTFIELD_SDK_ISSUES.md. Of the seven, four were already fixed upstream (PR #579 + the video-download auth fix). This PR covers the remaining three.
1. generate_audio hardcoded stream=true rejects format='wav' (fixed upstream)
2. generate_audio cannot accept a system message (fixed upstream: system kwarg)
3. generate_image_openrouter 404s deterministically with image_config={"aspect_ratio":"9:16"} (this PR)
4. ImageOutput.save() chokes on data: URLs from Gemini (fixed upstream: get_bytes() handles data:)
5. generate_image_openrouter under concurrency (this PR)
6. generate_video downloads unsigned_urls without auth → 401 (fixed upstream)
7. app.ai() has no per-call timeout override (this PR)
Changes
vision.generate_image_openrouter: retry on routing 404 (#3, #5)
OpenRouter's litellm.NotFoundError: "No endpoints found that support the requested output modalities" surfaces in two flavours:
1. Deterministic 404 when image_config (e.g. aspect_ratio=9:16) hits a model whose upstream replicas don't expose that param.
2. Intermittent 1-3% 404 under load when routing momentarily lands on a replica without image modality.
Single retry policy that handles both:
- Wraps the litellm.acompletion(...) call in a 3-attempt backoff loop (sleeps 1s, 2s between attempts).
- If image_config was non-empty and all 3 retries failed, makes one final attempt with image_config stripped, logs a warning via log_warn, and uses the result.
- Other exceptions still propagate immediately.
- The existing asyncio.wait_for(..., timeout=...) envelope still wraps every individual attempt; env var AGENTFIELD_LLM_CALL_TIMEOUT semantics unchanged.
- Worst-case wait: 3s without image_config, 7s with.
agent_ai.ai(): per-call timeout override (#7)
ai() now accepts timeout: Optional[float] = None. When set, it overrides async_config.llm_call_timeout for that single call only, propagating to:
- litellm_params["timeout"], so httpx aborts at the socket level (preserves the comment block at line 527-533 explaining the connection-pool deadlock this prevents).
- The asyncio.wait_for(..., timeout=effective_timeout * 2) safety net.
Wired through all three call paths inside ai():
- Direct call
- Tool-loop _make_call() closure (line 585-606 region)
- Non-tool _make_litellm_call() closure (line 665-682 region)
A single effective_timeout is computed once at function scope and captured by both nested closures. No env-var defaults touched. No sibling methods (ai_with_audio, ai_with_vision, etc.) modified.
Why this matters: the previous model forced every call in a mixed fast/slow pipeline to use the slowest expected timeout. reel-af's distiller and scene_breaker want ~60s; compose_script on a long arXiv preprint needs 10+ minutes. Today the agent-wide knob makes the fast calls wait too long on the rare timeout path.
Test plan
- tests/test_vision.py: 4 new cases:
  - test_generate_image_openrouter_retries_on_no_endpoints_then_succeeds: first call raises, second succeeds → acompletion called twice
  - test_generate_image_openrouter_strips_image_config_after_retries: all 3 retries fail with image_config={"aspect_ratio":"9:16"}; 4th attempt has image_config stripped and succeeds
  - test_generate_image_openrouter_does_not_retry_on_other_errors: generic RuntimeError propagates after a single call
  - test_generate_image_openrouter_gives_up_after_all_retries_no_image_config: 3 attempts then re-raise (no useless 4th call when there's nothing to strip)
  - asyncio.sleep patched out via monkeypatch; suite remains instant
- tests/test_agent_ai.py: 3 new cases sharing a _setup_timeout_test helper:
  - test_ai_per_call_timeout_overrides_agent_default: ai("hello", timeout=30.0) → captured acompletion(timeout=30.0)
  - test_ai_falls_back_to_agent_default_when_no_per_call_timeout → captured timeout=120.0
  - test_ai_per_call_timeout_applies_to_wait_for_safety_net: timeout=10.0 → captured wait_for(timeout=20.0)
- 112 related tests pass across the test_agent_ai*, test_vision, test_media_providers, test_openrouter_audio, test_image_config, test_agent_ai_deadlock_recovery suites
- ruff check clean on all touched files
- ruff format --check clean on all touched files
Notes for reviewers
- timeout= is opt-in; image_config retry/strip is transparent and only triggers on the specific "No endpoints found" error shape.
- The sdk_patches.py monkey-patch can shrink (it carries the video-download auth fix which is also now upstream) and the AGENTFIELD_ASYNC_LLM_CALL_TIMEOUT=600 env bump can be replaced with per-call timeout= on the slow calls only.