feat(sdk): native OpenRouter audio/video/image routing across Python, TS, Go #579
Merged
… TypeScript, Go
Adds first-class support for OpenRouter's full media surface in all three SDKs
without changing the public API. The provider now fetches model metadata once
(cached) and routes audio to either `POST /audio/speech` (TTS-only models like
hexgrad/kokoro-82m) or `POST /chat/completions` with audio modality (gpt-audio
family). Image generation drops `"text"` from `modalities` so image-only models
like x-ai/grok-imagine-image-quality stop 404-ing. Video properly reads the
current `unsigned_urls` array shape and downloads with Bearer auth (the
"unsigned" URLs are served from openrouter.ai itself).
DX is unchanged — same `generate_audio/video/image` signatures, same defaults.
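As a rough illustration of the audio routing described above, the following minimal sketch picks an endpoint from a model's output modalities. The function name and the exact rule (`"audio"` present means chat completions, anything else means the speech endpoint) are assumptions made for this example, not the SDK's actual code.

```python
from typing import Optional, Sequence

SPEECH_ENDPOINT = "/audio/speech"
CHAT_ENDPOINT = "/chat/completions"


def pick_audio_endpoint(output_modalities: Optional[Sequence[str]]) -> str:
    """Choose an endpoint for an audio request from cached model metadata.

    Assumption: chat-style audio models list "audio" among their output
    modalities; every other case, including missing metadata, is sent to
    the dedicated speech endpoint.
    """
    if output_modalities and "audio" in output_modalities:
        return CHAT_ENDPOINT
    return SPEECH_ENDPOINT


if __name__ == "__main__":
    for modalities in (["speech"], ["text", "audio"], None):
        print(modalities, "->", pick_audio_endpoint(modalities))
```

Keeping the decision in a pure function like this makes it testable without any network access.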
Why
- Kokoro and other TTS-only models live on `/audio/speech`; the old code only
knew chat-completions audio modality so they 404'd.
- `x-ai/grok-imagine-image-quality` is image-only output and rejects
`modalities=["image","text"]`; we now send `["image"]` which works for both
image-only and dual-output models (verified vs. gemini-2.5-flash-image).
- Video download was using the wrong field (`unsigned_url` singular) and
fetched without auth, so veo-3.1-lite returned 401.
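The video fix can be sketched as two small helpers: one reads the plural `unsigned_urls` array from a job-status payload, the other decides whether a download needs the Bearer header. Both helper names are hypothetical, and treating subdomains of openrouter.ai as authenticated hosts is an assumption of this sketch.

```python
from typing import Dict, Mapping, Optional
from urllib.parse import urlparse


def first_video_url(job: Mapping[str, object]) -> Optional[str]:
    """Return the first entry of the plural `unsigned_urls` array, if any."""
    urls = job.get("unsigned_urls")
    if isinstance(urls, list) and urls:
        return str(urls[0])
    return None


def download_headers(url: str, api_key: str) -> Dict[str, str]:
    """Send Bearer auth only to openrouter.ai; other hosts get no auth header."""
    host = (urlparse(url).hostname or "").lower()
    # Assumption: subdomains of openrouter.ai are treated like the apex host.
    if host == "openrouter.ai" or host.endswith(".openrouter.ai"):
        return {"Authorization": f"Bearer {api_key}"}
    return {}
```

Checking the host before attaching the header avoids leaking the API key to third-party CDNs.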
What
- Python (`sdk/python/agentfield/media_providers.py`,
`multimodal_response.py`, `vision.py`):
+ per-instance metadata cache + `_fetch_model_meta`
+ new `_openrouter_audio_speech` path with client-side WAV wrapping
+ `image_urls` reference-image support, `speed`, `extra` passthrough
+ ImageOutput now handles `data:` URLs
- TypeScript (`sdk/typescript/src/ai/{MediaProvider,OpenRouterMediaProvider}.ts`):
+ same routing + WAV wrapping
+ `seedModelMeta` test helper
+ fixed camelCase→snake_case for `frame_images` / `input_references`
+ new `VideoRequest.imageUrl`, `ImageRequest.imageUrls`, `extra` passthrough,
expanded `ImageConfig` (strength/style/rgb_colors/...)
- Go (`sdk/go/ai/{media_provider,openrouter_media}.go`):
+ same metadata-driven routing
+ `SeedModelMeta` test helper
+ reads `unsigned_urls` plural + downloads with Bearer when host is
openrouter.ai
+ new `VideoRequest.ImageURL`, `ImageRequest.ImageURLs`, `Speed`, `Extra`
fields; full `ImageConfig` expansion
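The client-side WAV wrapping mentioned for all three SDKs can be illustrated with Python's standard `wave` module. This is a sketch rather than the SDK's helper (which builds the RIFF header itself); the function name is hypothetical and the 24 kHz mono defaults are assumptions taken from the smoke-test output.

```python
import io
import wave


def wrap_pcm16_as_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Prefix raw 16-bit little-endian PCM with a RIFF/WAVE header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave does not close a caller-supplied file object, so the buffer is still readable.
    return buf.getvalue()


if __name__ == "__main__":
    silence = b"\x00\x00" * 2400  # 0.1 s of silence at 24 kHz mono
    data = wrap_pcm16_as_wav(silence)
    print(data[:4], data[8:12], len(data))
```

Requesting `pcm` upstream and wrapping locally keeps the result playable even when the upstream endpoint does not emit a WAV container.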
Tested
- Smoke-tested end-to-end against openrouter/hexgrad/kokoro-82m (audio),
openrouter/x-ai/grok-imagine-image-quality (image), and
openrouter/google/veo-3.1-lite (video), including image-to-video with
first_frame / last_frame guidance. Outputs saved + verified as RIFF/WAVE,
JPEG, and MP4.
- Python: 107 media tests pass.
- TypeScript: 596 tests pass.
- Go: 228 tests pass.
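One way to check saved outputs as RIFF/WAVE, JPEG, or MP4 is to inspect their leading bytes. The sketch below is an assumption about how such a check could look, not the procedure used in the smoke tests, and it only recognises the common signatures of these three formats.

```python
def sniff_media_type(data: bytes) -> str:
    """Guess a container type from well-known leading byte signatures."""
    # WAV: "RIFF" <4-byte size> "WAVE"
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # JPEG: starts with the SOI marker FF D8 followed by another marker byte FF
    if data[:3] == b"\xff\xd8\xff":
        return "jpeg"
    # MP4 (ISO base media): an "ftyp" box type at offset 4 in typical files
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "mp4"
    return "unknown"
```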
Performance
⚠ Regression detected:
…essage The refactor of ImageOutput.save() to delegate to get_bytes() dropped the 'to save' suffix that test_output_objects_raise_for_missing_data asserts on. Restore the upfront check so save() raises 'No image data or URL available to save' while get_bytes() still raises 'No image data or URL available'.
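The fix described in this commit can be sketched with a stand-in class. `ImageOutputSketch`, its field names, and the use of `ValueError` are assumptions for illustration; only the two error messages and the `data:` URL handling come from the text above.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageOutputSketch:
    """Hypothetical stand-in for the SDK's ImageOutput (field names assumed)."""

    b64_json: Optional[str] = None
    url: Optional[str] = None

    def get_bytes(self) -> bytes:
        if self.b64_json:
            return base64.b64decode(self.b64_json)
        if self.url and self.url.startswith("data:"):
            # Shape: data:image/png;base64,<payload>
            header, _, payload = self.url.partition(",")
            if header.endswith(";base64"):
                return base64.b64decode(payload)
            raise ValueError("Only base64 data URLs are handled in this sketch")
        if self.url:
            raise NotImplementedError("Remote download is omitted from this sketch")
        raise ValueError("No image data or URL available")

    def save(self, path: str) -> None:
        # Upfront check preserves the save-specific message before delegating.
        if not self.b64_json and not self.url:
            raise ValueError("No image data or URL available to save")
        with open(path, "wb") as fh:
            fh.write(self.get_bytes())
```

With the check in `save()`, both methods can share the decoding logic while still raising distinct messages.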
📊 Coverage gate
Thresholds from
✅ Gate passed
No surface regressed past the allowed threshold and the aggregate stayed above the floor.
📐 Patch coverage gate
Threshold: 80% on lines this PR touches vs
✅ Patch gate passed
Every surface whose lines were touched by this PR has patch coverage at or above the threshold.
… to read message.images
- Adds Go test file (openrouter_media_routing_test.go) covering fetchModelMeta cache + error paths, /audio/speech success/error, frame_images + input_references + extra translation, and wrapPCM16AsWAV header correctness. Lifts Go patch coverage from 64% to >87% on touched lines.
- Adds TS test file (openrouter_media_routing.test.ts) covering the metadata cache (success / 500 / network exception), generateImage multi-part content + imageConfig snake_case translation, video param translation (imageUrl, frameImages, inputReferences, extra), and /audio/speech speed + extra passthrough. Lifts TS coverage to 94.5% lines / 80.1% branches on OpenRouterMediaProvider.ts.
- Fixes a real bug uncovered while writing tests: the TS image-response parser only read message.content[] (gpt-image-1 style) and dropped images that OpenRouter returns in the dedicated message.images[] array (gemini-*-image, grok-imagine, when content is null). Now parses both shapes.
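The parser fix in the last bullet can be illustrated with a small function that collects image URLs from both response shapes. The exact field layout (`type: "image_url"` parts carrying `image_url.url`) is an assumption for this sketch and should be checked against real responses; the function name is hypothetical.

```python
from typing import Any, Dict, List


def extract_image_urls(message: Dict[str, Any]) -> List[str]:
    """Collect image URLs from message.content[] and message.images[]."""
    urls: List[str] = []

    # Shape 1 (assumed): content is a list of typed parts.
    content = message.get("content")
    if isinstance(content, list):
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                url = (part.get("image_url") or {}).get("url")
                if url:
                    urls.append(url)

    # Shape 2 (assumed): dedicated images array, possibly with content set to null.
    for item in message.get("images") or []:
        if isinstance(item, dict):
            url = (item.get("image_url") or {}).get("url")
            if url:
                urls.append(url)

    return urls
```

Reading both locations means a `null` content field no longer causes images to be dropped.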
Summary
Adds first-class support for OpenRouter's full media surface in all three SDKs (Python, TypeScript, Go) without changing the public API. The provider now fetches each model's metadata once (cached per instance) and routes:
- Audio: `POST /audio/speech` (TTS-only models like `hexgrad/kokoro-82m`) or `POST /chat/completions` with audio modality (gpt-audio family).
- Image: `POST /chat/completions` with `modalities=["image"]` (works for both image-only models like `x-ai/grok-imagine-image-quality` and dual-output models like `google/gemini-2.5-flash-image`).
- Video: `POST /api/v1/videos` async lifecycle (now reads the current `unsigned_urls` array and downloads with Bearer auth; the "unsigned" URLs are served from openrouter.ai itself and require the same auth as the API).
DX is unchanged: same `generate_audio` / `generate_video` / `generate_image` signatures, same defaults. Adds optional `extra` / `image_url(s)` / `speed` / `frame_type` parameters that simply pass through.
Why
- Kokoro and other TTS-only models live on `/audio/speech`. The old code only knew the chat-completions audio modality so those models 404'd with `No endpoints found that support the requested output modalities: text, audio`.
- `x-ai/grok-imagine-image-quality` is image-only output and rejects `modalities=["image","text"]`. We now send `["image"]`, verified to work for both image-only models and dual-output models.
- Video download was using the wrong field (`unsigned_url` singular) and fetched without auth, so `google/veo-3.1-lite` returned 401 even though the job completed successfully.
- `frameImages` / `inputReferences` were being passed as camelCase to the API, which expects snake_case (`frame_images`, `input_references`, `frame_type`).
What changes
Python (`sdk/python/agentfield/`)
- `media_providers.py`:
+ Per-instance metadata cache (`_model_meta_cache`) + `_fetch_model_meta()` helper that hits `/api/v1/models/{id}/endpoints` lazily.
+ `_openrouter_audio_speech()` path. When the caller asks for `format="wav"` we request `pcm` from upstream and wrap it in a WAV header client-side so it stays playable.
+ `generate_audio` routes by `output_modalities`; default routes to `/audio/speech` (broader-compat) when metadata is unavailable.
+ `generate_video`: downloads from `openrouter.ai` URLs with auth, anonymous for CDN URLs.
+ `image_urls`, `speed`, `extra` (passthrough merged into request body).
- `multimodal_response.py`: `ImageOutput.save()` / `get_bytes()` now handle `data:image/...;base64,...` URLs.
- `vision.py`: `modalities=["image"]` and multi-part user message when `image_urls` are passed.
TypeScript (`sdk/typescript/src/ai/`)
- `OpenRouterMediaProvider.ts`:
+ `fetchModelMeta` helper.
+ `seedModelMeta(model, outputModalities, inputModalities)`: public test helper to pre-populate the cache (used by tests against mock servers).
+ `/audio/speech` code path with `wrapPcm16AsWav` helper (RIFF header generation).
- `MediaProvider.ts`: new typed `VideoFrameImage` (with `frameType: "first_frame" | "last_frame"`), `VideoInputReference`. Added `imageUrl`, `extra` to `VideoRequest`; `imageUrls`, `extra`, expanded `imageConfig` (strength, style, rgbColors, backgroundRgbColor, fontInputs) to `ImageRequest`; `speed`, `extra` to `AudioRequest`.
Go (`sdk/go/ai/`)
- `openrouter_media.go`:
+ `metaCache` + `fetchModelMeta` helper.
+ `SeedModelMeta(model, outputModalities, inputModalities)` exported test helper.
+ `generateAudioViaSpeechEndpoint` method + `wrapPCM16AsWAV` helper.
+ `videoJobStatus.UnsignedURLs []string` (plural) + `Usage.Cost` parsing.
- `media_provider.go`: added `ImageURL` to `VideoRequest`; `ImageURLs`, `Extra` to `ImageRequest`; full `ImageConfig` expansion; `Speed`, `Extra` to `AudioRequest`; new typed `FontInput`.
DX preserved
Same call shapes work; nothing changed for existing callers:
The routing is metadata-driven, so every OpenRouter model in each category works automatically. No allowlist — new TTS / video / image models added by OpenRouter work without an SDK change.
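The per-instance metadata cache behind this routing could look roughly like the sketch below. The class, its method names, and the injected `fetch` callable are all hypothetical; caching a failed lookup as "unavailable" is an assumption of this sketch, not confirmed SDK behaviour.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical fetcher type: takes a model id, returns its output modalities.
Fetcher = Callable[[str], List[str]]


class ModelMetaCache:
    """Lazily fetch each model's output modalities once per instance."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Optional[List[str]]] = {}

    def seed(self, model: str, output_modalities: List[str]) -> None:
        """Pre-populate the cache, e.g. in tests that run against mock servers."""
        self._cache[model] = list(output_modalities)

    def output_modalities(self, model: str) -> Optional[List[str]]:
        if model not in self._cache:
            try:
                self._cache[model] = list(self._fetch(model))
            except Exception:
                # Assumption: any fetch failure is remembered as "metadata unavailable".
                self._cache[model] = None
        return self._cache[model]
```

A `None` result lets the caller fall back to its default route instead of failing the request.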
Test plan
- Python (`pytest tests/test_openrouter_audio.py tests/test_openrouter_video.py tests/test_media_providers.py tests/test_media_providers_additional.py tests/test_media_integration.py tests/test_vision.py tests/test_image_config.py`)
- TypeScript (`npm test`)
- Go (`go test ./ai/...`)
- `openrouter/hexgrad/kokoro-82m` → 31s WAV (RIFF/WAVE PCM 16-bit mono 24kHz)
- `openrouter/x-ai/grok-imagine-image-quality` → 896×1280 JPEG
- `openrouter/google/veo-3.1-lite` → 4s 1280×720 MP4 (1MB)
- Image-to-video with `frame_images=[first_frame, last_frame]` from grok-imagine outputs → 4s 720×1280 MP4 (2.6MB)
Tested models (examples, not an allowlist)
- `/chat/completions` w/ `modalities=["image"]`: `x-ai/grok-imagine-image-quality`, `google/gemini-2.5-flash-image`, `openai/gpt-image-1`, anything else with `image` in `output_modalities`
- `/audio/speech`: `hexgrad/kokoro-82m`, `openai/gpt-4o-mini-tts`, anything whose `output_modalities` is `["speech"]`
- `/chat/completions` w/ `modalities=["text","audio"]` SSE: `openai/gpt-audio`, `openai/gpt-audio-mini`, `openai/gpt-4o-audio-preview`, `google/lyria-3-pro` (music)
- `/videos` async polling: `google/veo-3.1-lite`, `google/veo-3.1`, `kling-video/*`, anything with `video` in `output_modalities`
Website docs follow-up to `Agent-Field/website2.0` once this merges.