[TypeScript SDK] Add MediaProvider interface and OpenRouter media generation #467

@santoshkumarradha

Summary

Port the MediaProvider abstraction and OpenRouterMediaProvider to the TypeScript SDK, enabling video, image, and audio generation with the same DX as Python.

Context

The TypeScript SDK already has MultimodalResponse.ts and multimodal.ts (input helpers), but it cannot generate media of any kind: no image, video, or audio generation. The Python SDK has a full MediaProvider ABC with Fal, LiteLLM, and OpenRouter implementations.

This issue brings TypeScript to parity by adding the provider interface and an OpenRouter implementation that covers video, image, and audio generation.

Scope

New Files

| File | Purpose |
| --- | --- |
| sdk/typescript/src/ai/MediaProvider.ts | MediaProvider interface + MediaRouter class |
| sdk/typescript/src/ai/OpenRouterMediaProvider.ts | OpenRouter implementation (video, image, audio) |

Modified Files

| File | Change |
| --- | --- |
| sdk/typescript/src/ai/AIClient.ts | Add generateVideo(), generateImage(), generateAudio() methods |
| sdk/typescript/src/ai/index.ts | Export new types |

Interface Design

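// MediaProvider.ts
// Import path is an assumption; adjust to wherever MultimodalResponse.ts lives.
import type { MultimodalResponse } from './MultimodalResponse';
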
export interface VideoRequest {
  prompt: string;
  model?: string;
  duration?: number;
  resolution?: '480p' | '720p' | '1080p' | '1K' | '2K' | '4K';
  aspectRatio?: '16:9' | '9:16' | '1:1' | '4:3' | '3:4' | '21:9' | '9:21';
  generateAudio?: boolean;
  seed?: number;
  frameImages?: Array<{type: string; imageUrl: {url: string}; frameType?: string}>;
  inputReferences?: Array<{type: string; imageUrl: {url: string}}>;
  pollInterval?: number;  // ms, default 30000
  timeout?: number;       // ms, default 600000
}

export interface ImageRequest {
  prompt: string;
  model?: string;
  size?: string;
  quality?: string;
  imageConfig?: {
    aspectRatio?: string;
    imageSize?: string;
    superResolutionReferences?: string[];
    fontInputs?: Array<{fontUrl: string; text: string}>;
  };
}

export interface AudioRequest {
  text: string;
  model?: string;
  voice?: string;
  format?: string;
}

export interface MediaProvider {
  readonly name: string;
  readonly supportedModalities: string[];
  generateImage(request: ImageRequest): Promise<MultimodalResponse>;
  generateAudio(request: AudioRequest): Promise<MultimodalResponse>;
  generateVideo?(request: VideoRequest): Promise<MultimodalResponse>;
}
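
The MediaRouter mentioned under New Files is described only as "prefix-based provider dispatch". A minimal sketch of that idea follows. The class name `PrefixRouter`, its `register`/`resolve` methods, and the choice to strip the matched prefix before handing the model name to the provider are all assumptions made for illustration, not the Python SDK's actual behavior.

```typescript
// Hypothetical sketch of prefix-based dispatch; not the SDK's real MediaRouter.
interface NamedProvider {
  readonly name: string;
}

class PrefixRouter<P extends NamedProvider> {
  private readonly routes: Array<{ prefix: string; provider: P }> = [];

  // Associate a model-name prefix such as 'openrouter/' with a provider.
  register(prefix: string, provider: P): void {
    this.routes.push({ prefix, provider });
    // Keep the longest prefixes first so the most specific route wins.
    this.routes.sort((a, b) => b.prefix.length - a.prefix.length);
  }

  // Find the provider for a model and return the model name without the prefix.
  // (Stripping the prefix is an assumption.)
  resolve(model: string): { provider: P; model: string } {
    for (const { prefix, provider } of this.routes) {
      if (model.startsWith(prefix)) {
        return { provider, model: model.slice(prefix.length) };
      }
    }
    throw new Error(`No media provider registered for model "${model}"`);
  }
}
```

With a provider registered under `'openrouter/'`, resolving `'openrouter/google/veo-3.1'` would select that provider and pass it `'google/veo-3.1'`.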

Developer Experience

import { AIClient } from '@agentfield/sdk';

const ai = new AIClient({ model: 'openai/gpt-4o' });

// Video generation
const video = await ai.generateVideo({
  prompt: 'A golden retriever on a beach',
  model: 'openrouter/google/veo-3.1',
  resolution: '1080p',
  aspectRatio: '16:9',
  duration: 8,
});
await video.saveFile(video.files[0], 'dog.mp4');

// Image generation
const image = await ai.generateImage({
  prompt: 'A sunset over mountains',
  model: 'openrouter/google/gemini-2.5-flash-image',
  imageConfig: { aspectRatio: '16:9', imageSize: '2K' },
});
await image.saveImage(image.images[0], 'sunset.png');

// Audio generation
const audio = await ai.generateAudio({
  text: 'Welcome to AgentField',
  model: 'openrouter/openai/gpt-audio',
  voice: 'nova',
});
await audio.saveAudio(audio.audio!, 'welcome.wav');
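
One detail the calls above leave implicit: generateVideo is optional on MediaProvider, so AIClient.generateVideo() presumably needs to check for it before delegating. The sketch below shows one way to do that. The helper name `generateVideoOrThrow` and the generic `VideoCapable` type are hypothetical stand-ins so the example is self-contained.

```typescript
// Hypothetical stand-in for MediaProvider, generic so the sketch is self-contained.
interface VideoCapable<Req, Res> {
  readonly name: string;
  readonly supportedModalities: string[];
  generateVideo?(request: Req): Promise<Res>;
}

// Delegate to the provider, failing with a clear message when video is unsupported.
async function generateVideoOrThrow<Req, Res>(
  provider: VideoCapable<Req, Res>,
  request: Req,
): Promise<Res> {
  if (!provider.generateVideo) {
    throw new Error(
      `Provider "${provider.name}" does not support video generation`,
    );
  }
  // Called as a method so `this` stays bound to the provider.
  return provider.generateVideo(request);
}
```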

Dependencies

Acceptance Criteria

  • MediaProvider interface defined with generateImage, generateAudio, generateVideo
  • MediaRouter class handles prefix-based provider dispatch
  • OpenRouterMediaProvider implements video gen (async poll), image gen, audio gen (SSE)
  • AIClient exposes generateVideo(), generateImage(), generateAudio() methods
  • Video polling loop handles pending → in_progress → completed/failed
  • Audio SSE streaming collects and concatenates base64 chunks
  • All responses return MultimodalResponse with appropriate content
  • npm run lint passes in sdk/typescript/
  • npm test passes in sdk/typescript/
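
The video polling criterion (pending → in_progress → completed/failed) can be sketched as a small loop. The status names and the 30000 ms / 600000 ms defaults come from this issue. The function name `pollUntilDone`, the `fetchStatus` callback, and the `error` field are hypothetical; the real status request should follow the schemas referenced in #464.

```typescript
type JobStatus = 'pending' | 'in_progress' | 'completed' | 'failed';

interface JobSnapshot {
  status: JobStatus;
  error?: string; // assumed field for a failure message
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Poll a job until it completes, fails, or the timeout would be exceeded.
// `fetchStatus` is a hypothetical callback that performs one status request.
async function pollUntilDone<T extends JobSnapshot>(
  fetchStatus: () => Promise<T>,
  pollInterval = 30_000,
  timeout = 600_000,
): Promise<T> {
  const deadline = Date.now() + timeout;
  for (;;) {
    const job = await fetchStatus();
    if (job.status === 'completed') return job;
    if (job.status === 'failed') {
      throw new Error(job.error ?? 'Video generation failed');
    }
    // Still pending or in_progress: stop if the next wait would pass the deadline.
    if (Date.now() + pollInterval > deadline) {
      throw new Error(`Video generation timed out after ${timeout} ms`);
    }
    await sleep(pollInterval);
  }
}
```

Taking the status request as a callback keeps the loop testable without network access.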

Notes for Contributors

Severity: HIGH. This is a new feature; the TypeScript SDK currently has no media generation at all.

Use fetch (native in Node 18+) for HTTP calls; there is no need for axios or got. For SSE parsing, a lightweight approach is enough: read the response body as a stream, split on \n\n, and parse the data: {...} lines.
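
The SSE approach described above could look roughly like this. The function names `sseJsonEvents` and `collectBase64` are hypothetical. The location of the base64 chunk inside each event is not specified in this issue, so it is supplied through an `extractChunk` callback. The `[DONE]` sentinel check is likewise an assumption about the stream.

```typescript
// Yield the parsed JSON payload of every `data:` line in an SSE byte stream.
async function* sseJsonEvents(
  body: ReadableStream<Uint8Array>,
): AsyncGenerator<unknown> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (value) buffer += decoder.decode(value, { stream: true });
    // Events are separated by a blank line; the last piece may be incomplete.
    const events = buffer.split('\n\n');
    buffer = done ? '' : events.pop() ?? '';
    for (const event of events) {
      for (const line of event.split('\n')) {
        // Anything that is not a data line (e.g. ':' comments) is skipped.
        if (!line.startsWith('data:')) continue;
        const payload = line.slice(5).trim();
        if (payload === '' || payload === '[DONE]') continue; // assumed sentinel
        yield JSON.parse(payload);
      }
    }
    if (done) return;
  }
}

// Concatenate the base64 chunk extracted from each event.
async function collectBase64(
  body: ReadableStream<Uint8Array>,
  extractChunk: (event: unknown) => string | undefined,
): Promise<string> {
  let base64 = '';
  for await (const event of sseJsonEvents(body)) {
    base64 += extractChunk(event) ?? '';
  }
  return base64;
}
```

Plain string concatenation only decodes correctly if no chunk carries `=` padding before the final one. If chunks can be padded individually, decoding each chunk to bytes before joining would be the safer choice.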

Reference the Python implementation in #464 for the exact API request/response schemas. The OpenRouter API is identical regardless of client language.

Metadata

Labels

  • ai-friendly (Well-documented task suitable for AI-assisted development)
  • area:ai (AI/LLM integration)
  • enhancement (New feature or request)
  • sdk:typescript (TypeScript SDK related)
