feat(sdk): native OpenRouter audio/video/image routing across Python, TS, Go #579

Merged
santoshkumarradha merged 3 commits into main from feat/openrouter-native-media on May 23, 2026

Conversation

@santoshkumarradha
Member

Summary

Adds first-class support for OpenRouter's full media surface in all three SDKs (Python, TypeScript, Go) without changing the public API. The provider now fetches each model's metadata once (cached per instance) and routes:

  • Audio → either POST /audio/speech (TTS-only models like hexgrad/kokoro-82m) or POST /chat/completions with audio modality (gpt-audio family).
  • Image → POST /chat/completions with modalities=["image"] (works for both image-only models like x-ai/grok-imagine-image-quality and dual-output models like google/gemini-2.5-flash-image).
  • Video → POST /api/v1/videos async lifecycle (now reads the current unsigned_urls array and downloads with Bearer auth; the "unsigned" URLs are served from openrouter.ai itself and require the same auth as the API).

DX is unchanged — same generate_audio / generate_video / generate_image signatures, same defaults. Adds optional extra / image_url(s) / speed / frame_type parameters that simply pass through.
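
The audio routing rule in the summary above can be sketched as a small pure function. This is a minimal illustration and not the SDK's code: `pick_audio_route` and the two route constants are hypothetical names, and the `"speech"` / `"audio"` modality values are assumed from the tested-models table in this PR.

```python
from typing import Optional, Sequence

SPEECH_ROUTE = "/audio/speech"
CHAT_ROUTE = "/chat/completions"


def pick_audio_route(output_modalities: Optional[Sequence[str]]) -> str:
    """Choose an audio endpoint from a model's output modalities.

    Hypothetical helper: TTS-only models go to /audio/speech, models that
    emit audio as a chat modality go to /chat/completions.
    """
    if not output_modalities:
        # Metadata unavailable: fall back to /audio/speech, the
        # broader-compat default described in this PR.
        return SPEECH_ROUTE
    if "audio" in output_modalities:
        # gpt-audio style models stream audio from chat completions.
        return CHAT_ROUTE
    return SPEECH_ROUTE
```

Because the decision depends only on metadata, no model names appear in the function, which is what lets newly listed models route without an SDK change.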

Why

  1. Kokoro and every TTS-only model on OpenRouter live on /audio/speech. The old code only knew the chat-completions audio modality, so those models 404'd with "No endpoints found that support the requested output modalities: text, audio".
  2. x-ai/grok-imagine-image-quality is image-only output and rejects modalities=["image","text"]. We now send ["image"], verified to work for both image-only models and dual-output models.
  3. Video download used the wrong field (unsigned_url singular) and fetched without auth, so google/veo-3.1-lite returned 401 even though the job completed successfully.
  4. TS bug: frameImages / inputReferences were being passed as camelCase to the API, which expects snake_case (frame_images, input_references, frame_type).

What changes

Python (sdk/python/agentfield/)

  • media_providers.py:
    • per-instance metadata cache (_model_meta_cache) + _fetch_model_meta() helper that hits /api/v1/models/{id}/endpoints lazily.
    • new _openrouter_audio_speech() path. When caller asks for format="wav" we request pcm from upstream and wrap in a WAV header client-side so it stays playable.
    • generate_audio routes by output_modalities; default routes to /audio/speech (broader-compat) when metadata is unavailable.
    • generate_video: downloads from openrouter.ai URLs with auth, anonymous for CDN URLs.
    • new params: image_urls, speed, extra (passthrough merged into request body).
  • multimodal_response.py: ImageOutput.save() / get_bytes() now handle data:image/...;base64,... URLs.
  • vision.py: modalities=["image"] and multi-part user message when image_urls are passed.
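
The client-side WAV wrapping mentioned for `format="wav"` can be approximated with the standard library alone. A minimal sketch, assuming 16-bit little-endian PCM input; `wrap_pcm16_as_wav` is a hypothetical name, and the 24 kHz mono defaults are an assumption taken from the Kokoro smoke-test output reported in this PR, not a documented upstream guarantee.

```python
import io
import wave


def wrap_pcm16_as_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Prepend a RIFF/WAVE header to raw 16-bit PCM so players can open it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave only closes files it opened itself, so buf is still readable here.
    return buf.getvalue()
```

The SDKs in this PR build the header by hand (`wrapPcm16AsWav`, `wrapPCM16AsWAV`); the `wave` module is swapped in here only to keep the sketch short.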

TypeScript (sdk/typescript/src/ai/)

  • OpenRouterMediaProvider.ts:
    • module-level WeakMap-backed metadata cache + fetchModelMeta helper.
    • seedModelMeta(model, outputModalities, inputModalities) — public test helper to pre-populate the cache (used by tests against mock servers).
    • new /audio/speech code path with wrapPcm16AsWav helper (RIFF header generation).
    • chat-completions SSE path now decodes audio chunks, concatenates raw bytes, re-encodes, and wraps to WAV when requested.
    • video download: detects openrouter.ai host and attaches Bearer header.
  • MediaProvider.ts: new typed VideoFrameImage (with frameType: "first_frame" | "last_frame"), VideoInputReference. Added imageUrl, extra to VideoRequest; imageUrls, extra, expanded imageConfig (strength, style, rgbColors, backgroundRgbColor, fontInputs) to ImageRequest; speed, extra to AudioRequest.
  • camelCase → snake_case translation for nested objects so OpenRouter actually receives the right field names.
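
The key translation above is easy to get wrong for nested objects, so here is a rough sketch of a recursive version. It is written in Python for consistency with the call-site example in this PR, although the actual fix lives in the TypeScript SDK; `to_snake` and `snake_case_keys` are hypothetical names.

```python
import re
from typing import Any

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake(name: str) -> str:
    """Convert one camelCase identifier to snake_case."""
    return _CAMEL_BOUNDARY.sub("_", name).lower()


def snake_case_keys(value: Any) -> Any:
    """Recursively rename dict keys; string values are left untouched."""
    if isinstance(value, dict):
        return {to_snake(k): snake_case_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [snake_case_keys(v) for v in value]
    return value
```

A blanket conversion like this would also rewrite keys inside a caller-supplied `extra` payload, so a real implementation would probably translate only the known typed fields and pass `extra` through unchanged.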

Go (sdk/go/ai/)

  • openrouter_media.go:
    • mutex-protected metaCache + fetchModelMeta helper.
    • SeedModelMeta(model, outputModalities, inputModalities) exported test helper.
    • new generateAudioViaSpeechEndpoint method + wrapPCM16AsWAV helper.
    • videoJobStatus.UnsignedURLs []string (plural) + Usage.Cost parsing.
    • video download with Bearer when host is openrouter.ai.
  • media_provider.go: added ImageURL to VideoRequest; ImageURLs, Extra to ImageRequest; full ImageConfig expansion; Speed, Extra to AudioRequest; new typed FontInput.
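
The host check behind the authenticated video download can be sketched as follows. This is an illustration in Python, not the Go code: `download_headers` is a hypothetical name, and treating subdomains of openrouter.ai as trusted is an assumption of this sketch.

```python
from typing import Dict
from urllib.parse import urlparse


def download_headers(url: str, api_key: str) -> Dict[str, str]:
    """Return auth headers only for URLs served by openrouter.ai itself."""
    host = (urlparse(url).hostname or "").lower()
    # Compare the parsed hostname rather than using a substring match, so a
    # look-alike host such as openrouter.ai.example.com never gets the key.
    if host == "openrouter.ai" or host.endswith(".openrouter.ai"):
        return {"Authorization": f"Bearer {api_key}"}
    # CDN and other third-party URLs are fetched anonymously.
    return {}
```

Keeping the key off third-party hosts matters because the download URL comes from the job status response rather than from the caller.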

DX preserved

Same call shapes work; nothing changed for existing callers:

await app.ai_generate_audio(text="...", model="openrouter/hexgrad/kokoro-82m", voice="af_bella", format="wav")
await app.ai_generate_image(prompt="...", model="openrouter/x-ai/grok-imagine-image-quality")
await app.ai_generate_video(
    prompt="...",
    model="openrouter/google/veo-3.1-lite",
    frame_images=[
        {"type":"image_url","image_url":{"url":first},"frame_type":"first_frame"},
        {"type":"image_url","image_url":{"url":last}, "frame_type":"last_frame"},
    ],
)

The routing is metadata-driven, so every OpenRouter model in each category works automatically. No allowlist — new TTS / video / image models added by OpenRouter work without an SDK change.
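
A per-instance metadata cache with a seeding hook for tests might look roughly like this. `ModelMetaCache`, `seed`, and the injected `fetch` callable are hypothetical stand-ins for `_model_meta_cache`, `seedModelMeta` / `SeedModelMeta`, and the real HTTP lookup; caching a failed lookup as `None` is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional

# Stand-in for the real HTTP lookup; returns None when metadata is unavailable.
Fetcher = Callable[[str], Optional[List[str]]]


class ModelMetaCache:
    """Fetch each model's output modalities at most once per instance."""

    def __init__(self, fetch: Fetcher) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Optional[List[str]]] = {}

    def seed(self, model: str, output_modalities: List[str]) -> None:
        """Pre-populate the cache, e.g. in tests that run against a mock server."""
        self._cache[model] = list(output_modalities)

    def output_modalities(self, model: str) -> Optional[List[str]]:
        if model not in self._cache:
            # First request for this model: fetch lazily and remember the result.
            self._cache[model] = self._fetch(model)
        return self._cache[model]
```

Seeding lets a test exercise a routing branch without the mock server having to implement the metadata endpoint as well.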

Test plan

  • Python: 107 media tests pass (pytest tests/test_openrouter_audio.py tests/test_openrouter_video.py tests/test_media_providers.py tests/test_media_providers_additional.py tests/test_media_integration.py tests/test_vision.py tests/test_image_config.py)
  • TypeScript: 596 tests pass (npm test)
  • Go: 228 tests pass (go test ./ai/...)
  • End-to-end smoke tests against real OpenRouter:
    • audio: openrouter/hexgrad/kokoro-82m → 31s WAV (RIFF/WAVE PCM 16-bit mono 24kHz)
    • image: openrouter/x-ai/grok-imagine-image-quality → 896×1280 JPEG
    • video: openrouter/google/veo-3.1-lite → 4s 1280×720 MP4 (1MB)
    • image-to-video: same model with frame_images=[first_frame, last_frame] from grok-imagine outputs → 4s 720×1280 MP4 (2.6MB)
  • CI green (the gate is what this PR has to clear)

Tested models (examples — not an allowlist)

| Modality | Endpoint | Example models that route here |
| --- | --- | --- |
| Image | /chat/completions w/ modalities=["image"] | x-ai/grok-imagine-image-quality, google/gemini-2.5-flash-image, openai/gpt-image-1, anything else with image in output_modalities |
| Audio TTS | /audio/speech | hexgrad/kokoro-82m, openai/gpt-4o-mini-tts, anything whose output_modalities is ["speech"] |
| Audio chat | /chat/completions w/ modalities=["text","audio"] SSE | openai/gpt-audio, openai/gpt-audio-mini, openai/gpt-4o-audio-preview, google/lyria-3-pro (music) |
| Video | /videos async polling | google/veo-3.1-lite, google/veo-3.1, kling-video/*, anything with video in output_modalities |

A website docs follow-up will go to Agent-Field/website2.0 once this merges.

… TypeScript, Go

Adds first-class support for OpenRouter's full media surface in all three SDKs
without changing the public API. The provider now fetches model metadata once
(cached) and routes audio to either `POST /audio/speech` (TTS-only models like
hexgrad/kokoro-82m) or `POST /chat/completions` with audio modality (gpt-audio
family). Image generation drops `"text"` from `modalities` so image-only models
like x-ai/grok-imagine-image-quality stop 404-ing. Video properly reads the
current `unsigned_urls` array shape and downloads with Bearer auth (the
"unsigned" URLs are served from openrouter.ai itself).

DX is unchanged — same `generate_audio/video/image` signatures, same defaults.

Why
- Kokoro and other TTS-only models live on `/audio/speech`; the old code only
  knew chat-completions audio modality so they 404'd.
- `x-ai/grok-imagine-image-quality` is image-only output and rejects
  `modalities=["image","text"]`; we now send `["image"]` which works for both
  image-only and dual-output models (verified vs. gemini-2.5-flash-image).
- Video download was using the wrong field (`unsigned_url` singular) and
  fetched without auth, so veo-3.1-lite returned 401.

What
- Python (`sdk/python/agentfield/media_providers.py`,
  `multimodal_response.py`, `vision.py`):
  + per-instance metadata cache + `_fetch_model_meta`
  + new `_openrouter_audio_speech` path with client-side WAV wrapping
  + `image_urls` reference-image support, `speed`, `extra` passthrough
  + ImageOutput now handles `data:` URLs
- TypeScript (`sdk/typescript/src/ai/{MediaProvider,OpenRouterMediaProvider}.ts`):
  + same routing + WAV wrapping
  + `seedModelMeta` test helper
  + fixed camelCase→snake_case for `frame_images` / `input_references`
  + new `VideoRequest.imageUrl`, `ImageRequest.imageUrls`, `extra` passthrough,
    expanded `ImageConfig` (strength/style/rgb_colors/...)
- Go (`sdk/go/ai/{media_provider,openrouter_media}.go`):
  + same metadata-driven routing
  + `SeedModelMeta` test helper
  + reads `unsigned_urls` plural + downloads with Bearer when host is
    openrouter.ai
  + new `VideoRequest.ImageURL`, `ImageRequest.ImageURLs`, `Speed`, `Extra`
    fields; full `ImageConfig` expansion

Tested
- Smoke-tested end-to-end against openrouter/hexgrad/kokoro-82m (audio),
  openrouter/x-ai/grok-imagine-image-quality (image), and
  openrouter/google/veo-3.1-lite (video), including image-to-video with
  first_frame / last_frame guidance. Outputs saved + verified as RIFF/WAVE,
  JPEG, and MP4.
- Python: 107 media tests pass.
- TypeScript: 596 tests pass.
- Go: 228 tests pass.
santoshkumarradha requested review from a team and AbirAbbas as code owners on May 23, 2026, 01:42
@github-actions
Contributor

github-actions Bot commented May 23, 2026

Performance

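| SDK | Memory | Δ | Latency | Δ | Tests | Status |
| --- | --- | --- | --- | --- | --- | --- |
| Python | 9.4 KB | +4% | 0.32 µs | -9% | | |
| Go | 165 B | -41% | 0.63 µs | -37% | | |
| TS | 405 B | +16% | 1.55 µs | -22% | | |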

Regression detected:

  • TypeScript memory: 350 B → 405 B (+16%)

…essage

The refactor of ImageOutput.save() to delegate to get_bytes() dropped the
'to save' suffix that test_output_objects_raise_for_missing_data asserts on.
Restore the upfront check so save() raises 'No image data or URL available
to save' while get_bytes() still raises 'No image data or URL available'.
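
The split between the two error messages, together with the `data:` URL handling added in this PR, can be sketched like this. `ImageOutputSketch` is a hypothetical stand-in for `ImageOutput`; the field names and the use of `ValueError` are assumptions, and only the two message strings come from the commit message above.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageOutputSketch:
    b64_json: Optional[str] = None
    url: Optional[str] = None

    def get_bytes(self) -> bytes:
        if self.b64_json:
            return base64.b64decode(self.b64_json)
        if self.url and self.url.startswith("data:"):
            # data:image/png;base64,<payload> -> split the header from the payload.
            header, _, payload = self.url.partition(",")
            if not header.endswith(";base64"):
                raise ValueError("Only base64 data URLs are handled in this sketch")
            return base64.b64decode(payload)
        if self.url:
            raise NotImplementedError("HTTP download omitted from this sketch")
        raise ValueError("No image data or URL available")

    def save(self, path: str) -> None:
        # Check up front so save() keeps its own, more specific message
        # instead of inheriting the one raised by get_bytes().
        if not self.b64_json and not self.url:
            raise ValueError("No image data or URL available to save")
        with open(path, "wb") as fh:
            fh.write(self.get_bytes())
```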
@github-actions
Contributor

github-actions Bot commented May 23, 2026

📊 Coverage gate

Thresholds from .coverage-gate.toml: per-surface ≥ 86%, aggregate ≥ 88%, max per-surface regression ≤ 1.0 pp, max aggregate regression ≤ 0.50 pp.

| Surface | Current | Baseline | Δ |
| --- | --- | --- | --- |
| control-plane | 87.50% | 87.30% | ↑ +0.20 pp 🟡 |
| sdk-go | 91.80% | 90.70% | ↑ +1.10 pp 🟢 |
| sdk-python | 93.73% | 93.63% | ↑ +0.10 pp 🟢 |
| sdk-typescript | 92.80% | 92.56% | ↑ +0.24 pp 🟢 |
| web-ui | 89.91% | 90.01% | ↓ -0.10 pp 🟡 |
| aggregate | 89.02% | 89.01% | ↑ +0.01 pp 🟡 |

✅ Gate passed

No surface regressed past the allowed threshold and the aggregate stayed above the floor.
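
For readers checking the numbers by hand, the four thresholds can be applied as below. This is only a sketch of the arithmetic: `gate_failures` is a hypothetical function, and how the real gate script evaluates `.coverage-gate.toml` is not shown in this PR.

```python
from typing import Dict, List


def gate_failures(
    current: Dict[str, float],
    baseline: Dict[str, float],
    aggregate_current: float,
    aggregate_baseline: float,
    min_surface: float = 86.0,
    min_aggregate: float = 88.0,
    max_surface_drop: float = 1.0,
    max_aggregate_drop: float = 0.50,
) -> List[str]:
    """Return one message per violated threshold; an empty list means pass."""
    failures: List[str] = []
    for surface, pct in current.items():
        if pct < min_surface:
            failures.append(f"{surface}: below per-surface floor")
        if baseline[surface] - pct > max_surface_drop:
            failures.append(f"{surface}: regressed past allowed drop")
    if aggregate_current < min_aggregate:
        failures.append("aggregate: below floor")
    if aggregate_baseline - aggregate_current > max_aggregate_drop:
        failures.append("aggregate: regressed past allowed drop")
    return failures


# Values copied from the coverage table in this comment.
current = {"control-plane": 87.50, "sdk-go": 91.80, "sdk-python": 93.73,
           "sdk-typescript": 92.80, "web-ui": 89.91}
baseline = {"control-plane": 87.30, "sdk-go": 90.70, "sdk-python": 93.63,
            "sdk-typescript": 92.56, "web-ui": 90.01}
print(gate_failures(current, baseline, 89.02, 89.01))
```

With the table's values, web-ui is the only surface that drops (0.10 pp, within the 1.0 pp allowance), and every surface stays above its floor.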

@github-actions
Contributor

github-actions Bot commented May 23, 2026

📐 Patch coverage gate

Threshold: 80% on lines this PR touches vs origin/main (from .coverage-gate.toml:thresholds.min_patch).

| Surface | Touched lines | Patch coverage | Status |
| --- | --- | --- | --- |
| control-plane | 0 | ➖ | no changes |
| sdk-go | 239 | 90.00% | |
| sdk-python | 0 | ➖ | no changes |
| sdk-typescript | 218 | 96.00% | |
| web-ui | 0 | ➖ | no changes |

✅ Patch gate passed

Every surface whose lines were touched by this PR has patch coverage at or above the threshold.

… to read message.images

- Adds Go test file (openrouter_media_routing_test.go) covering
  fetchModelMeta cache + error paths, /audio/speech success/error,
  frame_images + input_references + extra translation, and
  wrapPCM16AsWAV header correctness. Lifts Go patch coverage from
  64% to >87% on touched lines.
- Adds TS test file (openrouter_media_routing.test.ts) covering the
  metadata cache (success / 500 / network exception), generateImage
  multi-part content + imageConfig snake_case translation, video
  param translation (imageUrl, frameImages, inputReferences, extra),
  and /audio/speech speed + extra passthrough. Lifts TS coverage to
  94.5% lines / 80.1% branches on OpenRouterMediaProvider.ts.
- Fixes a real bug uncovered while writing tests: the TS image-response
  parser only read message.content[] (gpt-image-1 style) and dropped
  images that OpenRouter returns in the dedicated message.images[]
  array (gemini-*-image, grok-imagine, when content is null). Now
  parses both shapes.
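
The two response shapes described in the last bullet can be handled by one extraction pass. A minimal sketch in Python (the actual fix is in the TypeScript SDK): `extract_image_urls` is a hypothetical name, and the `{"type": "image_url", "image_url": {"url": ...}}` item layout is assumed for both `message.content[]` and `message.images[]`.

```python
from typing import Any, Dict, List


def extract_image_urls(message: Dict[str, Any]) -> List[str]:
    """Collect image URLs from content parts and from the images array."""
    urls: List[str] = []

    # Shape 1: images embedded as parts of a list-valued message.content.
    content = message.get("content")
    if isinstance(content, list):
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                url = (part.get("image_url") or {}).get("url")
                if url:
                    urls.append(url)

    # Shape 2: dedicated message.images array; content may be null here.
    for item in message.get("images") or []:
        if not isinstance(item, dict):
            continue
        url = (item.get("image_url") or {}).get("url")
        if url and url not in urls:
            urls.append(url)

    return urls
```

Reading both locations and de-duplicating means a response that populates either field, or both, yields each image once.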
santoshkumarradha merged commit 2dcc803 into main on May 23, 2026 (33 checks passed)
santoshkumarradha deleted the feat/openrouter-native-media branch on May 23, 2026, 02:12