feat(ai-grok): audio, speech, and realtime adapters + example wiring #506

Merged
AlemTuzlak merged 22 commits into TanStack:main from tombeckenham:503-add-audio-and-speech-support-to-grok-adapter
Apr 24, 2026

Contributor

@tombeckenham tombeckenham commented Apr 24, 2026

Summary

  • Add Grok TTS (grokSpeech), STT (grokTranscription), and realtime voice agent (grokRealtime / grokRealtimeToken) adapters in @tanstack/ai-grok, with model metadata, provider options, and unit tests.
  • Wire Grok into the ts-react-chat example across the realtime, speech, and transcription pages by extending the shared SPEECH_PROVIDERS / TRANSCRIPTION_PROVIDERS catalogs and the buildSpeechAdapter / buildTranscriptionAdapter factories.
  • Adopt the shared @tanstack/ai/adapter-internals debug logger (resolveDebugOption, InternalLogger) across the new adapters so users can flip debug: true / false / DebugConfig the same way they do on other providers. Replaces the remaining console.error / console.warn calls in the realtime adapter with logger.errors.

Test plan

  • pnpm --filter @tanstack/ai-grok test:lib — unit tests for TTS, transcription, and realtime
  • pnpm --filter @tanstack/ai-e2e test:e2e — end-to-end media suite picks up Grok via the feature-support matrix
  • Manual: XAI_API_KEY=... pnpm --filter ts-react-chat dev
    • /realtime — select "Grok Voice Agent", pick a voice (eve/ara/rex/sal/leo), hold a conversation; verify tools and image upload still work
    • /generations/speech — select Grok TTS, try each voice across streaming / direct / server-fn modes
    • /generations/transcription — select Grok STT, upload an audio file across all three modes
  • Manual debug logging: set debug: true on grokRealtime, grokSpeech, and grokTranscription calls; confirm request / provider / errors output appears with the [tanstack-ai:...] prefixes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Grok audio support: text-to-speech, speech-to-text with word timestamps, and realtime voice agent (ephemeral token flow); new Grok model identifiers exported.
  • Improvements

    • Example app: Grok provider UI, voice selection, and expanded realtime controls.
    • Better audio handling: robust conversions for multiple input formats.
    • Server APIs: clearer typed error responses for invalid model overrides and unknown providers.
  • Tests

    • Added unit and e2e tests for Grok audio, realtime tokens, and audio utilities.
  • Documentation

    • Published release note entry for this minor release.

tombeckenham and others added 3 commits April 24, 2026 16:13
Add `grokSpeech` (TTS via /v1/tts), `grokTranscription` (STT via /v1/stt),
and `grokRealtime` + `grokRealtimeToken` (Voice Agent via /v1/realtime)
because xAI's standalone audio APIs were shipped publicly and the
adapter previously exposed only text/image/summarize. The TTS/STT
endpoints are not OpenAI-compatible so these adapters use direct fetch
rather than the OpenAI SDK; the realtime API mirrors OpenAI's shape with
URL/provider swaps. E2E coverage is wired via mock.mount('/v1/tts'...)
on aimock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adopt the @tanstack/ai/adapter-internals logger across grokSpeech,
grokTranscription, grokRealtimeToken, and grokRealtime so users can toggle
debug output the same way they do on other adapters — `debug: true` for full
tracing, `debug: false` to silence, or a DebugConfig for per-category control
and a custom Logger. Replaces the remaining console.error / console.warn
calls in the realtime adapter with logger.errors so nothing is lost when
debugging is off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

coderabbitai Bot commented Apr 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds Grok audio support: TTS, STT (word-level timestamps), and a realtime WebRTC voice-agent with ephemeral tokens; introduces adapters, types, utilities, tests, example app/UI/server wiring, and e2e mocks to exercise Grok audio flows.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Release Note | .changeset/grok-audio-speech-support.md | Changelog entry declaring minor release: Grok TTS, STT, realtime adapters and model/type exports. |
| Grok Adapters & Models | packages/typescript/ai-grok/src/adapters/tts.ts, packages/typescript/ai-grok/src/adapters/transcription.ts, packages/typescript/ai-grok/src/realtime/adapter.ts, packages/typescript/ai-grok/src/realtime/token.ts, packages/typescript/ai-grok/src/model-meta.ts, packages/typescript/ai-grok/src/index.ts | Adds Grok TTS/STT adapters, realtime WebRTC adapter, ephemeral token adapter, model constants/types, and re-exports. Review: HTTP error handling, content-type/codec mapping, and public API surface. |
| Realtime Types & Contract | packages/typescript/ai-grok/src/realtime/types.ts, packages/typescript/ai-grok/src/realtime/realtime-contract.ts, packages/typescript/ai-grok/src/realtime/index.ts | Adds realtime voice/types, VAD/turn-detection shapes, RealtimeAdapter/Connection contracts, token/index barrel exports. |
| Provider Option Types | packages/typescript/ai-grok/src/audio/tts-provider-options.ts, packages/typescript/ai-grok/src/audio/transcription-provider-options.ts | Defines Grok TTS/STT provider option types (voices, codecs, sample rates, audio formats, diarize/inverse normalization). |
| Utilities | packages/typescript/ai-grok/src/utils/audio.ts, packages/typescript/ai-grok/src/utils/index.ts | Adds toAudioFile() for File/Blob/ArrayBuffer/base64/data-URI conversion and arrayBufferToBase64(); re-exported from utils barrel. Review: data-URI/base64 parsing, MIME/extension mapping, and node/browser handling. |
| Example App: Provider & Realtime UI/Hook | examples/ts-react-chat/src/lib/audio-providers.ts, examples/ts-react-chat/src/lib/use-realtime.ts, examples/ts-react-chat/src/routes/realtime.tsx | Adds grok provider, adapter selection helper, optional voice param in useRealtime, UI voice selector and control parity for Grok; image upload robustness fixes. |
| Server Integration & Errors | examples/ts-react-chat/src/lib/server-audio-adapters.ts, examples/ts-react-chat/src/lib/server-fns.ts, examples/ts-react-chat/src/routes/api.generate.speech.ts, examples/ts-react-chat/src/routes/api.transcribe.ts, examples/ts-react-chat/src/routes/api.generate.audio.ts | Wires Grok into server adapter builders; introduces InvalidModelOverrideError and UnknownProviderError; server wrappers map those to 400 responses. Review: error classes and route mappings. |
| E2E & Test Harness | testing/e2e/global-setup.ts, testing/e2e/src/lib/feature-support.ts, testing/e2e/src/lib/media-providers.ts, testing/e2e/tests/test-matrix.ts | Adds mock /v1/tts and /v1/stt mounts, exports feature matrix and marks Grok supported for tts/transcription, and registers Grok adapters in e2e factories. |
| Tests: Unit & Realtime Token | packages/typescript/ai-grok/tests/audio-adapters.test.ts, packages/typescript/ai-grok/tests/realtime-token.test.ts, packages/typescript/ai-grok/tests/realtime-contract.drift.test-d.ts | New Vitest suites covering TTS/STT request/response mapping, codec/form-data logic, toAudioFile conversions, realtime token expiry parsing, and structural drift checks against canonical client types. |
| Package Manifest | packages/typescript/ai-grok/package.json | Adds/adjusts devDependencies (@tanstack/ai, @tanstack/ai-client, zod, vite formatting). |
| Example server files | examples/ts-react-chat/src/lib/server-audio-adapters.ts, examples/ts-react-chat/src/lib/server-fns.ts | Behavior changes: stricter model override validation (throws InvalidModelOverrideError) and explicit unknown provider error (UnknownProviderError); adapters for grok wired. Review error translation paths. |

Sequence Diagrams

sequenceDiagram
    actor User
    participant Client as Browser Client
    participant Server as App Server
    participant Grok as Grok API

    User->>Client: Request speech (text)
    Client->>Server: POST /api/generate-speech (provider: grok, text)
    Server->>Grok: POST /v1/tts (text, voice, output_format) + Bearer
    Grok-->>Server: audio bytes (ArrayBuffer)
    Server->>Server: base64-encode, assemble TTSResult
    Server-->>Client: { audio, format, contentType, id }
    Client->>User: Play audio
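
The "base64-encode" step in the speech flow above can be sketched as follows. This is a minimal, dependency-free illustration and not the adapter's actual arrayBufferToBase64() implementation: the name `bytesToBase64` is hypothetical, and real code would more likely lean on `btoa` or Node's `Buffer`.

```typescript
const BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

// Hypothetical helper: encode raw audio bytes as standard, padded base64.
function bytesToBase64(buffer: ArrayBufferLike): string {
  const bytes = new Uint8Array(buffer)
  let out = ''
  // Process three input bytes (24 bits) at a time, emitting four 6-bit symbols.
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length
    const has2 = i + 2 < bytes.length
    const b0 = bytes[i] ?? 0
    const b1 = bytes[i + 1] ?? 0
    const b2 = bytes[i + 2] ?? 0
    out += BASE64_ALPHABET.charAt(b0 >> 2)
    out += BASE64_ALPHABET.charAt(((b0 & 0x03) << 4) | (b1 >> 4))
    // Missing trailing bytes are signalled with '=' padding.
    out += has1 ? BASE64_ALPHABET.charAt(((b1 & 0x0f) << 2) | (b2 >> 6)) : '='
    out += has2 ? BASE64_ALPHABET.charAt(b2 & 0x3f) : '='
  }
  return out
}
```

In the flow above, the server would pass the ArrayBuffer returned by /v1/tts through a helper like this before placing the string in the `audio` field of the response.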
sequenceDiagram
    actor User
    participant Client as Browser Client
    participant Server as App Server
    participant Grok as Grok API

    User->>Client: Upload/record audio
    Client->>Server: POST /api/transcribe (provider: grok, file)
    Server->>Grok: POST /v1/stt (multipart/form-data) + Bearer
    Grok-->>Server: JSON { text, language, words:[{word,start,end}] }
    Server-->>Client: { text, language, words }
    Client->>User: Show transcript with timestamps
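
The response-mapping step in the transcription flow above might look roughly like this. The field names (`text`, `language`, `words`, `word`, `start`, `end`) come from the diagram; the type names, the `mapSttResponse` function, and the choice to drop words with missing or inverted timestamps are assumptions made for illustration only.

```typescript
// Shapes inferred from the diagram; the real adapter types may differ.
interface SttWordSketch {
  word: string
  start: number
  end: number
}

interface SttResponseSketch {
  text: string
  language?: string
  words?: SttWordSketch[]
}

interface TranscriptSketch {
  text: string
  language: string | null
  words: SttWordSketch[]
}

// Hypothetical mapper from the provider JSON to the shape sent to the client.
function mapSttResponse(raw: SttResponseSketch): TranscriptSketch {
  const words = (raw.words ?? [])
    // Assumption: skip entries whose timestamps are unusable for display.
    .filter(
      (w) =>
        Number.isFinite(w.start) && Number.isFinite(w.end) && w.end >= w.start,
    )
    .map((w) => ({ word: w.word, start: w.start, end: w.end }))

  return { text: raw.text, language: raw.language ?? null, words }
}
```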
sequenceDiagram
    actor User
    participant Client as Browser Client
    participant TokenSvc as App Server (token endpoint)
    participant GrokSvc as Grok Realtime (signaling)

    User->>Client: Start realtime session
    Client->>TokenSvc: Request ephemeral token
    TokenSvc->>GrokSvc: POST /v1/realtime/client_secrets (model) + Bearer
    GrokSvc-->>TokenSvc: { client_secret, expires_at, model, features }
    TokenSvc-->>Client: { client_secret.value, expiresAt }
    Client->>GrokSvc: WebRTC offer/answer, open data channel
    GrokSvc-->>Client: transcripts, assistant audio, events
    Client->>User: Play audio / show partial transcripts

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped to the API with a twitch and a cheer,

Voices turned to bytes and timestamps ringing clear,
WebRTC whispers, ephemeral keys set free,
Grok sings aloud — a rabbit’s jubilee!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: adding audio (TTS/STT), speech, and realtime adapters for Grok with example integration, which aligns with the changeset and file modifications. |
| Description check | ✅ Passed | The PR description provides a detailed summary of changes, includes a test plan section, and addresses the contributing guidelines. However, it does not follow the exact template structure with the required 'Changes' and 'Release Impact' sections, though the information is present in alternate form. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



nx-cloud Bot commented Apr 24, 2026

View your CI Pipeline Execution ↗ for commit 268ad6a

| Command | Status | Duration | Result |
|---|---|---|---|
| nx run-many --targets=build --exclude=examples/** | ✅ Succeeded | 52s | View ↗ |

☁️ Nx Cloud last updated this comment at 2026-04-24 11:55:19 UTC

autofix-ci Bot and others added 2 commits April 24, 2026 08:07
The transcription and TTS adapters were calling super(config, model),
but BaseTranscriptionAdapter/BaseTTSAdapter expect (model, config),
causing TS2345 build errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pkg-pr-new Bot commented Apr 24, 2026

@tanstack/ai

npm i https://pkg.pr.new/@tanstack/ai@506

@tanstack/ai-anthropic

npm i https://pkg.pr.new/@tanstack/ai-anthropic@506

@tanstack/ai-client

npm i https://pkg.pr.new/@tanstack/ai-client@506

@tanstack/ai-code-mode

npm i https://pkg.pr.new/@tanstack/ai-code-mode@506

@tanstack/ai-code-mode-skills

npm i https://pkg.pr.new/@tanstack/ai-code-mode-skills@506

@tanstack/ai-devtools-core

npm i https://pkg.pr.new/@tanstack/ai-devtools-core@506

@tanstack/ai-elevenlabs

npm i https://pkg.pr.new/@tanstack/ai-elevenlabs@506

@tanstack/ai-event-client

npm i https://pkg.pr.new/@tanstack/ai-event-client@506

@tanstack/ai-fal

npm i https://pkg.pr.new/@tanstack/ai-fal@506

@tanstack/ai-gemini

npm i https://pkg.pr.new/@tanstack/ai-gemini@506

@tanstack/ai-grok

npm i https://pkg.pr.new/@tanstack/ai-grok@506

@tanstack/ai-groq

npm i https://pkg.pr.new/@tanstack/ai-groq@506

@tanstack/ai-isolate-cloudflare

npm i https://pkg.pr.new/@tanstack/ai-isolate-cloudflare@506

@tanstack/ai-isolate-node

npm i https://pkg.pr.new/@tanstack/ai-isolate-node@506

@tanstack/ai-isolate-quickjs

npm i https://pkg.pr.new/@tanstack/ai-isolate-quickjs@506

@tanstack/ai-ollama

npm i https://pkg.pr.new/@tanstack/ai-ollama@506

@tanstack/ai-openai

npm i https://pkg.pr.new/@tanstack/ai-openai@506

@tanstack/ai-openrouter

npm i https://pkg.pr.new/@tanstack/ai-openrouter@506

@tanstack/ai-preact

npm i https://pkg.pr.new/@tanstack/ai-preact@506

@tanstack/ai-react

npm i https://pkg.pr.new/@tanstack/ai-react@506

@tanstack/ai-react-ui

npm i https://pkg.pr.new/@tanstack/ai-react-ui@506

@tanstack/ai-solid

npm i https://pkg.pr.new/@tanstack/ai-solid@506

@tanstack/ai-solid-ui

npm i https://pkg.pr.new/@tanstack/ai-solid-ui@506

@tanstack/ai-svelte

npm i https://pkg.pr.new/@tanstack/ai-svelte@506

@tanstack/ai-vue

npm i https://pkg.pr.new/@tanstack/ai-vue@506

@tanstack/ai-vue-ui

npm i https://pkg.pr.new/@tanstack/ai-vue-ui@506

@tanstack/preact-ai-devtools

npm i https://pkg.pr.new/@tanstack/preact-ai-devtools@506

@tanstack/react-ai-devtools

npm i https://pkg.pr.new/@tanstack/react-ai-devtools@506

@tanstack/solid-ai-devtools

npm i https://pkg.pr.new/@tanstack/solid-ai-devtools@506

commit: 268ad6a

tombeckenham and others added 3 commits April 24, 2026 18:23
After the logger was wired into the audio adapters, the unit tests
need to provide one when calling transcribe/generateSpeech directly
(activities normally inject it via resolveDebugOption).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per project convention, tests should not invoke adapter methods
directly — they call generateSpeech()/generateTranscription() with
the adapter instance, so the core function injects logger, emits
events, and exercises the real public surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tombeckenham tombeckenham marked this pull request as ready for review April 24, 2026 08:32
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (10)
packages/typescript/ai-grok/src/utils/audio.ts (1)

20-31: Tighten string-input handling in toAudioFile.

A couple of robustness gaps in the string branch:

  1. The data: path assumes base64 encoding without checking for a ;base64 marker. A percent-encoded data URL (e.g. data:audio/wav,%52%49...) will be fed to atob and throw an opaque InvalidCharacterError.
  2. The non-data: string branch unconditionally treats the input as base64 and labels it audio/mpeg. If a caller accidentally passes a URL, filesystem path, or raw text, atob will throw without any hint about what went wrong.
  3. Minor: for audio/mpeg the derived filename becomes audio.mpeg rather than the more conventional audio.mp3, and MIME subtypes like x-wav yield an invalid extension. Servers typically key off Content-Type, so this is cosmetic, but a small map would be cleaner.
♻️ Suggested hardening
+const MIME_EXTENSION: Record<string, string> = {
+  'audio/mpeg': 'mp3',
+  'audio/mp3': 'mp3',
+  'audio/wav': 'wav',
+  'audio/x-wav': 'wav',
+  'audio/ogg': 'ogg',
+  'audio/flac': 'flac',
+  'audio/aac': 'aac',
+  'audio/mp4': 'mp4',
+  'audio/webm': 'webm',
+}
+
   if (typeof audio === 'string') {
     if (audio.startsWith('data:')) {
       const [header = '', base64Data = ''] = audio.split(',')
+      if (!/;base64/i.test(header)) {
+        throw new Error(
+          'toAudioFile: only base64-encoded data URLs are supported',
+        )
+      }
       const mimeType = header.match(/data:([^;]+)/)?.[1] || 'audio/mpeg'
       const buffer = base64ToArrayBuffer(base64Data)
-      const extension = mimeType.split('/')[1] || 'mp3'
+      const extension = MIME_EXTENSION[mimeType] || 'bin'
       return new File([buffer], `audio.${extension}`, { type: mimeType })
     }

-    const buffer = base64ToArrayBuffer(audio)
-    return new File([buffer], 'audio.mp3', { type: 'audio/mpeg' })
+    try {
+      const buffer = base64ToArrayBuffer(audio)
+      return new File([buffer], 'audio.mp3', { type: 'audio/mpeg' })
+    } catch {
+      throw new Error(
+        'toAudioFile: string input must be a base64 payload or data: URL',
+      )
+    }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/utils/audio.ts` around lines 20 - 31, In
toAudioFile's string handling (the branch that checks typeof audio ===
'string'), ensure data: URLs are checked for a ';base64' marker before using
base64ToArrayBuffer/atob and if not base64 decode the payload via
decodeURIComponent; validate the non-data string input with a conservative
base64 regex before treating it as base64 and throw a clear error if it looks
like a URL/path/plain text (don’t call base64ToArrayBuffer blindly). Also
normalize MIME→extension using a small map (e.g., audio/mpeg -> mp3, audio/wav
or audio/x-wav -> wav, audio/ogg -> ogg) when deriving the filename, and use
base64ToArrayBuffer only for confirmed base64 payloads (update calls in
toAudioFile accordingly).
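
A minimal sketch of the string-input checks described above, kept separate from the real toAudioFile so it stays self-contained. The function name `describeAudioString`, its return shape, and the exact base64 regex are assumptions for illustration; the sketch only classifies the input and derives a filename, and leaves decoding the payload to the caller.

```typescript
const MIME_TO_EXTENSION: Record<string, string> = {
  'audio/mpeg': 'mp3',
  'audio/wav': 'wav',
  'audio/x-wav': 'wav',
  'audio/ogg': 'ogg',
}

interface AudioStringInfo {
  encoding: 'base64' | 'percent'
  mimeType: string
  filename: string
  payload: string // still encoded; decoding is left to the caller
}

// Conservative check: base64 alphabet only, with at most two trailing '='.
const BASE64_PATTERN = /^[A-Za-z0-9+\/]+={0,2}$/

// Hypothetical helper: classify a string before treating it as audio data.
function describeAudioString(input: string): AudioStringInfo {
  if (input.startsWith('data:')) {
    const comma = input.indexOf(',')
    if (comma === -1) {
      throw new Error('toAudioFile: malformed data: URL (missing comma)')
    }
    const parts = input.slice('data:'.length, comma).split(';')
    const mimeType = (parts[0] || 'audio/mpeg').toLowerCase()
    // Only a ';base64' marker means the payload is base64; otherwise it is
    // percent-encoded per the data: URL scheme.
    const isBase64 = parts.slice(1).some((p) => p.toLowerCase() === 'base64')
    const extension = MIME_TO_EXTENSION[mimeType] ?? 'bin'
    return {
      encoding: isBase64 ? 'base64' : 'percent',
      mimeType,
      filename: `audio.${extension}`,
      payload: input.slice(comma + 1),
    }
  }

  // Bare strings must look like base64; URLs and file paths fail this check.
  if (input.length % 4 !== 0 || !BASE64_PATTERN.test(input)) {
    throw new Error(
      'toAudioFile: string input must be a base64 payload or data: URL',
    )
  }
  return {
    encoding: 'base64',
    mimeType: 'audio/mpeg',
    filename: 'audio.mp3',
    payload: input,
  }
}
```

The regex check is deliberately conservative but not airtight: a short alphanumeric string such as a four-letter word would still pass, so it narrows the failure modes rather than eliminating them.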
examples/ts-react-chat/src/routes/realtime.tsx (2)

216-220: Nit: voice <option> labels display lowercase IDs.

The speech providers catalog renders voice labels with capitalized names (e.g., Eve, Ara). Surfacing the raw id here is functional but inconsistent with how voices are presented elsewhere. Consider a small { id, label } map or just titlecasing v.

✨ Optional polish
-                    {GROK_VOICES.map((v) => (
-                      <option key={v} value={v}>
-                        {v}
-                      </option>
-                    ))}
+                    {GROK_VOICES.map((v) => (
+                      <option key={v} value={v}>
+                        {v.charAt(0).toUpperCase() + v.slice(1)}
+                      </option>
+                    ))}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ts-react-chat/src/routes/realtime.tsx` around lines 216 - 220, The
voice <option> labels currently render raw IDs from GROK_VOICES (mapped as v)
and appear lowercase; update the rendering in the map over GROK_VOICES so the
visible label is titlecased or derived from an { id, label } mapping (keep
option value={v or id} unchanged) — locate the GROK_VOICES usage in the
realtime.tsx dropdown mapping and replace the displayed text {v} with a
titlecased version (or use v.label) while preserving the option key/value as the
id.

226-305: Optional: extract the openai || grok predicate.

The (provider === 'openai' || provider === 'grok') check is repeated 5 times for OpenAI-compatible realtime UI (output mode, temperature, eagerness, tools indicator, image upload). A small local constant like const isOpenAICompat = provider === 'openai' || provider === 'grok' would reduce drift risk when a third compatible provider is added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ts-react-chat/src/routes/realtime.tsx` around lines 226 - 305,
Extract the repeated (provider === 'openai' || provider === 'grok') predicate
into a local boolean (e.g., const isOpenAICompat = provider === 'openai' ||
provider === 'grok') and replace all occurrences in this file that gate
OpenAI-compatible UI (the select for OUTPUT_MODE_OPTIONS / setOutputMode, the
temperature range using setTemperature, the semanticEagerness select using
setSemanticEagerness, the tools indicator block, and the image upload block)
with isOpenAICompat; ensure the new constant is declared in the component scope
above these uses so the renders (and calls to getStatusColor/getModeIndicator)
behave identically.
packages/typescript/ai-grok/package.json (1)

15-24: Consider adding /adapters and /realtime subpath exports for tree-shakeability.

The new grokSpeech, grokTranscription, grokRealtime, and grokRealtimeToken adapters (plus the new src/realtime/index.ts barrel) are only reachable via the package root (.). Adding dedicated subpath exports keeps bundles tree-shakeable for consumers that only import a single capability, and aligns with the project-wide pattern for provider adapter packages.

As per coding guidelines: "Export tree-shakeable adapters from /adapters subpath exports in provider adapter packages".

📦 Suggested exports additions
   "exports": {
     ".": {
       "types": "./dist/esm/index.d.ts",
       "import": "./dist/esm/index.js"
     },
     "./tools": {
       "types": "./dist/esm/tools/index.d.ts",
       "import": "./dist/esm/tools/index.js"
+    },
+    "./adapters": {
+      "types": "./dist/esm/adapters/index.d.ts",
+      "import": "./dist/esm/adapters/index.js"
+    },
+    "./realtime": {
+      "types": "./dist/esm/realtime/index.d.ts",
+      "import": "./dist/esm/realtime/index.js"
     }
   },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/package.json` around lines 15 - 24, The package
exports currently only expose the package root, so the new adapters and realtime
barrel aren’t tree-shakeable; add explicit subpath exports for "/adapters"
(pointing to the built adapter barrel(s) that export grokSpeech,
grokTranscription, grokRealtimeToken) and "/realtime" (pointing to the built
realtime barrel that exports grokRealtime / the src/realtime/index.ts barrel) in
package.json exports so consumers can import e.g. "ai-grok/adapters" or
"ai-grok/realtime" directly; update the exports object to map "./adapters": {
"types": "./dist/esm/adapters/index.d.ts", "import":
"./dist/esm/adapters/index.js" } and "./realtime": { "types":
"./dist/esm/realtime/index.d.ts", "import": "./dist/esm/realtime/index.js" }
(ensure the built files exist or update build output paths accordingly).
packages/typescript/ai-grok/src/realtime/types.ts (1)

79-111: GrokRealtimeSessionResponse is over-specified and unvalidated at runtime.

In token.ts only client_secret.value, client_secret.expires_at, and model are actually read, and the JSON response is narrowed with a plain as GrokRealtimeSessionResponse cast. Two downsides:

  1. Most fields are typed as required/non-optional — if xAI omits any of them (or changes tool_choice: string to a differently-shaped value), consumers who destructure them will hit runtime undefineds that TypeScript promised wouldn't exist.
  2. There's no runtime validation, so a malformed response surfaces as a cryptic TypeError deep in the caller instead of a clear "Invalid Grok session response" error.

Recommend either (a) marking the unused-by-SDK fields ?/unknown, or — preferably, per the repo guideline to "Use Zod for schema validation" in packages/typescript/**/src/**/*.ts — defining a small z.object({ model: z.string(), client_secret: z.object({ value: z.string(), expires_at: z.number() }) }) schema in token.ts and parsing the response through it. As per coding guidelines: "Use Zod for schema validation and tool definition across the library".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/types.ts` around lines 79 - 111,
GrokRealtimeSessionResponse is over-specified and unvalidated; update token.ts
to validate only the fields you actually read (model, client_secret.value,
client_secret.expires_at) by replacing the broad cast with a Zod schema and
parsing the fetched JSON through it (e.g., define a z.object({ model:
z.string(), client_secret: z.object({ value: z.string(), expires_at: z.number()
}) }) and use safe parsing), then use the parsed values when accessing
client_secret.value / client_secret.expires_at and model; alternatively mark all
other properties on GrokRealtimeSessionResponse as optional/unknown if you
choose not to add validation.
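
To illustrate the runtime check recommended above without pulling in a dependency, here is a hand-written type guard standing in for the suggested Zod schema. It validates only the three fields the review says token.ts reads; the names `RealtimeSessionCore` and `parseSessionResponse` are hypothetical, and in the repo itself a Zod schema would be the more idiomatic choice per the quoted guideline.

```typescript
// Only the fields the token adapter is said to read.
interface RealtimeSessionCore {
  model: string
  client_secret: { value: string; expires_at: number }
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null
}

// Hypothetical parser: throws a clear error instead of letting a malformed
// response surface later as an unrelated TypeError.
function parseSessionResponse(json: unknown): RealtimeSessionCore {
  const fail = (): never => {
    throw new Error('Invalid Grok session response')
  }
  if (!isRecord(json)) return fail()

  const model = json['model']
  const secret = json['client_secret']
  if (typeof model !== 'string' || !isRecord(secret)) return fail()

  const value = secret['value']
  const expiresAt = secret['expires_at']
  if (typeof value !== 'string' || typeof expiresAt !== 'number') return fail()

  // Extra fields on the response are ignored rather than trusted.
  return { model, client_secret: { value, expires_at: expiresAt } }
}
```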
packages/typescript/ai-grok/src/realtime/token.ts (1)

35-35: API key is read at adapter construction, not at token-generation time.

getGrokApiKeyFromEnv() runs when grokRealtimeToken(options) is called, so any module that imports/builds this adapter at startup will throw if XAI_API_KEY hasn't been loaded yet (e.g., before dotenv/framework config), and subsequent env updates won't be picked up. Consider reading it inside generateToken() for lazy evaluation, matching the pattern used by many request-time adapters:

♻️ Proposed refactor
 export function grokRealtimeToken(
   options: GrokRealtimeTokenOptions = {},
 ): RealtimeTokenAdapter {
-  const apiKey = getGrokApiKeyFromEnv()
   const logger = resolveDebugOption(options.debug)

   return {
     provider: 'grok',

     async generateToken(): Promise<RealtimeToken> {
+      const apiKey = getGrokApiKeyFromEnv()
       const model: GrokRealtimeModel = options.model ?? 'grok-voice-fast-1.0'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/token.ts` at line 35, The adapter
currently calls getGrokApiKeyFromEnv() at construction time inside
grokRealtimeToken(), causing startup-time throws and preventing later env
updates from being picked up; move the call into the generateToken() function so
the API key is read lazily at token-generation time, i.e., remove the
top-level/constructor invocation of getGrokApiKeyFromEnv() and instead call it
inside generateToken() (and handle missing key there with the existing error/log
flow) so the behavior matches request-time adapters.
examples/ts-react-chat/src/lib/use-realtime.ts (1)

49-58: Consider adding an exhaustiveness guard for adapterForProvider.

The switch returns in each case but has no default. Today the Provider union is fully covered, so TypeScript treats the end of the function as unreachable and the inferred return type is fine. If a new provider is ever added to Provider without a matching case, callers will silently receive undefined at runtime and TS may or may not surface it depending on config. Adding a never guard future‑proofs this:

♻️ Proposed refactor
 function adapterForProvider(provider: Provider) {
   switch (provider) {
     case 'openai':
       return openaiRealtime()
     case 'elevenlabs':
       return elevenlabsRealtime()
     case 'grok':
       return grokRealtime()
+    default: {
+      const _exhaustive: never = provider
+      throw new Error(`Unknown provider: ${_exhaustive as string}`)
+    }
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ts-react-chat/src/lib/use-realtime.ts` around lines 49 - 58, The
switch in adapterForProvider(provider: Provider) lacks an exhaustiveness guard
and can return undefined if Provider later gains a member; update
adapterForProvider to include a default branch that asserts unreachable (use a
never-typed variable or throw an Error) so TypeScript will force handling of new
Provider cases at compile time and ensure callers never get undefined at
runtime.
packages/typescript/ai-grok/tests/audio-adapters.test.ts (1)

1-254: Solid black-box coverage for TTS/STT; consider adding a realtime adapter test too.

The unit tests exercise the public request/response contract cleanly (URL, headers, JSON body, codec mapping, FormData fields, error strings) and properly restore globalThis.fetch in afterEach. One gap worth filling in a follow-up: the new grokRealtimeToken POSTs to /v1/realtime/client_secrets and multiplies expires_at * 1000; a mocked-fetch unit test would lock in the URL/body/expiresAt math without needing WebRTC. Not a blocker for this PR.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/tests/audio-adapters.test.ts` around lines 1 -
254, Add a unit test that mocks globalThis.fetch to verify the realtime token
flow: call the function grokRealtimeToken (or the adapter method that obtains
realtime tokens) and assert fetch was POSTed to '/v1/realtime/client_secrets'
with the expected Authorization header and request body, return a mocked JSON
containing client_id, client_secret and expires_at (unix seconds) and assert the
function multiplies expires_at by 1000 to produce the returned expiresAt value;
ensure the test restores globalThis.fetch in afterEach and checks the response
mapping (client_id/client_secret and expiresAt) to lock in the
URL/body/expires_at math.
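
The mocked-fetch idea above can be sketched without Vitest or WebRTC by injecting the fetch function. Everything here is illustrative: `requestRealtimeToken`, the `{ token, expiresAt }` result shape, the `{ model }` request body, and the `https://api.x.ai/v1` base URL are assumptions, not the adapter's confirmed behavior. Only the /realtime/client_secrets path, the Bearer header, and the seconds-to-milliseconds conversion come from the discussion.

```typescript
// Structural stand-ins so the sketch needs no DOM or Node typings.
interface ResponseLike {
  ok: boolean
  status: number
  json(): Promise<unknown>
}

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<ResponseLike>

interface RealtimeTokenSketch {
  token: string
  expiresAt: number // Unix milliseconds
}

// Hypothetical token helper with an injectable fetch for testing.
async function requestRealtimeToken(
  fetchImpl: FetchLike,
  apiKey: string,
  model: string,
  baseURL = 'https://api.x.ai/v1', // assumed default
): Promise<RealtimeTokenSketch> {
  const res = await fetchImpl(`${baseURL}/realtime/client_secrets`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model }),
  })
  if (!res.ok) {
    throw new Error(`Grok realtime token request failed: ${res.status}`)
  }
  const data = (await res.json()) as {
    client_secret: { value: string; expires_at: number }
  }
  // expires_at arrives in Unix seconds; JavaScript dates use milliseconds.
  return {
    token: data.client_secret.value,
    expiresAt: data.client_secret.expires_at * 1000,
  }
}
```

A test can then pass a fake `fetchImpl` that records its arguments and returns a canned `client_secret`, which locks in the URL, headers, body, and expiry math in one place.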
packages/typescript/ai-grok/src/adapters/transcription.ts (2)

61-66: Constructor argument order differs from base class; note OpenAI follows the same pattern.

BaseTranscriptionAdapter expects (model, config), but GrokTranscriptionAdapter uses (config, model) and swaps the arguments in super(model, config). While this works correctly, OpenAITranscriptionAdapter follows the same pattern—suggesting it may be intentional—whereas FalTranscriptionAdapter aligns with the base class signature. Consider aligning all adapters to the base class convention for consistency across the package.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/adapters/transcription.ts` around lines 61 -
66, The GrokTranscriptionAdapter constructor currently declares parameters as
(config, model) but calls super(model, config), which differs from
BaseTranscriptionAdapter's (model, config) signature; update
GrokTranscriptionAdapter to declare its constructor as (model: TModel, config:
GrokTranscriptionConfig) (matching BaseTranscriptionAdapter and
FalTranscriptionAdapter), keep the super(model, config) call, and then read
this.apiKey, this.baseURL and this.defaultHeaders from the config parameter;
also scan and update any callers/tests that instantiate new
GrokTranscriptionAdapter to pass (model, config) order if needed so adapters are
consistent with BaseTranscriptionAdapter and OpenAITranscriptionAdapter.

102-110: Add timeout support to the fetch request.

The fetch call lacks a timeout mechanism, so a stalled xAI /v1/stt response will hang indefinitely. Multipart uploads over unreliable networks are particularly susceptible to this issue. Consider adding a timeout property to GrokTranscriptionConfig and wiring it to an AbortController on the fetch request.

(The suggestion to propagate options.signal from TranscriptionOptions is not currently viable, as that interface does not expose an abort signal.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/adapters/transcription.ts` around lines 102 -
110, The fetch to `${this.baseURL}/stt` needs timeout support: add an optional
timeout:number to GrokTranscriptionConfig, create an AbortController in the
method that calls fetch (use controller.signal in the fetch options), start a
timer (setTimeout) that calls controller.abort() after timeout ms, and clear the
timer once fetch resolves or throws; ensure the abort is handled (AbortError)
and that headers/body usage remains unchanged. Reference:
GrokTranscriptionConfig, the fetch call to `${this.baseURL}/stt`, and
TranscriptionOptions (not exposing a signal) when wiring the AbortController and
timeout.
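
The timeout wiring described above can be sketched as a small wrapper. This is a minimal illustration, not the adapter's code: `withTimeout` is a hypothetical helper name, and it assumes a runtime that provides the standard `AbortController`.

```typescript
// Hypothetical helper, not part of @tanstack/ai-grok: runs an abortable
// operation and aborts it if it has not settled within timeoutMs.
export async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController()
  // Abort the in-flight operation once the deadline passes.
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    return await run(controller.signal)
  } finally {
    // Always clear the timer, whether the operation resolved or threw.
    clearTimeout(timer)
  }
}
```

A call site could then pass the signal straight to fetch, for example `withTimeout((signal) => fetch(url, { method: 'POST', headers, body, signal }), timeoutMs)`, leaving headers and body unchanged.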
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/typescript/ai-grok/src/adapters/transcription.ts`:
- Around line 121-127: The mapping that builds words in the function producing
TranscriptionWord is dropping Grok STT's optional confidence and speaker
(GrokSTTWord) fields—update the code to preserve these by extending the returned
shape (e.g., add optional confidence:number and speaker:string to
TranscriptionWord or create a provider-specific GrokTranscriptionWord extension)
and then include w.confidence and w.speaker in the map that constructs words;
also update any type declarations and downstream consumers to accept the new
optional fields (or document the limitation if you choose not to extend the
type).

In `@packages/typescript/ai-grok/src/adapters/tts.ts`:
- Around line 85-91: The headers object currently spreads this.defaultHeaders
last which allows callers to override critical headers; change the order so
caller-provided headers are spread first (i.e., ...this.defaultHeaders) and then
explicitly set the required headers Authorization: `Bearer ${this.apiKey}` and
'Content-Type': 'application/json' afterwards so the adapter's Authorization and
Content-Type in the TTS request always take precedence; update the headers
construction where it's created in the TTS adapter (look for the headers block
that references this.defaultHeaders and this.apiKey).
- Around line 100-101: GrokSpeechAdapter uses Node's Buffer (`const audio =
Buffer.from(arrayBuffer).toString('base64')`) which throws in browsers; update
GrokSpeechAdapter (and the code around arrayBuffer/audio) to detect environment
and use a fallback: if typeof Buffer !== 'undefined' use
Buffer.from(...).toString('base64'), otherwise convert the ArrayBuffer to base64
via a browser-safe path (e.g. Uint8Array -> String.fromCharCode -> btoa), or
alternatively add a clear JSDoc comment on GrokSpeechAdapter/export indicating
it is server-only so bundlers/tree-shakers don't include it in browser builds.
Ensure the runtime check targets the Buffer usage in the response.arrayBuffer()
handling.
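
A runtime check of the kind suggested for the Buffer usage might look like the sketch below. The function name `arrayBufferToBase64` and the structural `NodeBufferLike` type are assumptions made for illustration; the lookup goes through `globalThis` so the snippet compiles without Node or DOM typings.

```typescript
// Hypothetical helper, not the adapter's code. Minimal structural view of
// Node's Buffer so the example does not depend on @types/node.
type NodeBufferLike = {
  from(data: ArrayBuffer): { toString(encoding: 'base64'): string }
}

export function arrayBufferToBase64(buffer: ArrayBuffer): string {
  const g = globalThis as unknown as {
    Buffer?: NodeBufferLike
    btoa?: (binary: string) => string
  }
  // Server path: Buffer exists in Node-compatible runtimes.
  if (g.Buffer) return g.Buffer.from(buffer).toString('base64')
  // Browser path: build a binary string byte by byte, then encode with btoa.
  if (!g.btoa) throw new Error('No base64 encoder available in this runtime')
  let binary = ''
  new Uint8Array(buffer).forEach((byte) => {
    binary += String.fromCharCode(byte)
  })
  return g.btoa(binary)
}
```

The per-byte loop avoids spreading a large typed array into `String.fromCharCode`, which can exceed argument limits for long audio responses.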

In `@packages/typescript/ai-grok/src/realtime/adapter.ts`:
- Around line 415-425: sendEvent currently pushes events into pendingEvents when
dataChannel isn't open, but after disconnect() sets dataChannel = null those
events are buffered forever; add an explicit closed boolean flag (e.g.,
isClosed) set to true in disconnect(), then update sendEvent to check isClosed
first and either log/error via logger.provider or throw when callers attempt to
send after disconnect, instead of pushing to pendingEvents; keep existing
behavior of buffering only when the channel exists but is not yet open
(dataChannel != null && dataChannel.readyState !== 'open') and ensure
pendingEvents is flushed only when a new dataChannel is established.
- Around line 136-142: The dataChannel.onerror handler is interpolating the
RTCErrorEvent directly (producing “[object Event]”); update the
dataChannel.onerror callback (the handler assigned to dataChannel.onerror) to
extract a useful message from the RTCErrorEvent (e.g., check evt.error?.message,
evt.message, or serialize the event) and include that extracted string in both
logger.errors('grok.realtime fatal', ...) and the emit('error', ...) payload so
the logged and emitted error contain meaningful diagnostics instead of the raw
Event object.
- Around line 682-684: The await dataChannelReady in connect() can hang forever
because dataChannelReady only resolves on dataChannel.onopen and has no timeout
or rejection; update connect() to race dataChannelReady with a timeout (e.g.,
Promise.race with a configurable timeout) and wire rejection paths by attaching
dataChannel.onerror to reject and peer connection state changes
(pc.onconnectionstatechange) to reject when state === 'failed' or
'disconnected'; ensure any rejection triggers the existing cleanup logic to stop
the mic track and close the RTCPeerConnection so resources aren't left active.
- Around line 533-592: The updateSession function currently always sets
sessionUpdate.input_audio_transcription = { model: 'grok-stt' }, which
overwrites caller intent and prevents disabling/overriding STT; change
updateSession (in the updateSession method) to only set
sessionUpdate.input_audio_transcription when the caller explicitly provided
transcription settings (e.g. via
config.providerOptions?.input_audio_transcription or a new
config.inputAudioTranscription field), and allow explicit null to disable
transcription; do not unconditionally write the default model here (keep the
default only during session.create or when no prior value exists elsewhere).
- Around line 173-199: Wrap the SDP negotiation (the sequence using
pc.createOffer, pc.setLocalDescription, fetch to GROK_REALTIME_URL, and
pc.setRemoteDescription) in a try/catch and in the catch perform deterministic
cleanup before rethrowing: stop all MediaStream tracks on localStream, close pc
via pc.close(), close the data channel if present (e.g. dataChannel.close()),
and close the AudioContext if created (audioContext.close()), then rethrow the
original error; also apply the same cleanup in the existing getUserMedia catch
path so that if getUserMedia fails you still close/cleanup pc, dataChannel,
localStream tracks and audioContext if they exist. Ensure each cleanup step
checks for existence (localStream, pc, dataChannel, audioContext) to avoid
additional errors and preserve the original thrown error when rethrowing.

In `@packages/typescript/ai-grok/src/realtime/types.ts`:
- Around line 12-14: The GrokRealtimeModel union is duplicated here causing a
split source of truth; remove the local declaration in realtime/types.ts and
re-export the canonical type from model-meta.ts (which defines
GROK_REALTIME_MODELS and exports type GrokRealtimeModel = (typeof
GROK_REALTIME_MODELS)[number]) so consumers import the single authoritative
GrokRealtimeModel; update the export in realtime/types.ts to re-export the type
symbol GrokRealtimeModel from model-meta.ts and delete the explicit union in
this file.

---

Nitpick comments:
In `@examples/ts-react-chat/src/lib/use-realtime.ts`:
- Around line 49-58: The switch in adapterForProvider(provider: Provider) lacks
an exhaustiveness guard and can return undefined if Provider later gains a
member; update adapterForProvider to include a default branch that asserts
unreachable (use a never-typed variable or throw an Error) so TypeScript will
force handling of new Provider cases at compile time and ensure callers never
get undefined at runtime.
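
One common way to add the exhaustiveness guard is a never-typed helper in the default branch. The sketch below is an assumption-laden illustration: the `Provider` members and the returned names other than `grokRealtime` are placeholders, and the example app's real adapterForProvider returns adapter instances rather than strings.

```typescript
// Hypothetical provider union; the example app's real Provider type may differ.
type Provider = 'openai' | 'elevenlabs' | 'grok'

// Compiles only when every Provider member is handled before the default branch.
function assertNever(value: never): never {
  throw new Error(`Unhandled provider: ${String(value)}`)
}

// Returns a placeholder adapter name to keep the sketch self-contained.
export function adapterNameForProvider(provider: Provider): string {
  switch (provider) {
    case 'openai':
      return 'openaiRealtime'
    case 'elevenlabs':
      return 'elevenlabsRealtime'
    case 'grok':
      return 'grokRealtime'
    default:
      return assertNever(provider)
  }
}
```

If a new member is later added to Provider without a matching case, the call to assertNever stops type-checking, so the gap is caught at compile time instead of surfacing as undefined at runtime.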

In `@examples/ts-react-chat/src/routes/realtime.tsx`:
- Around line 216-220: The voice <option> labels currently render raw IDs from
GROK_VOICES (mapped as v) and appear lowercase; update the rendering in the map
over GROK_VOICES so the visible label is titlecased or derived from an { id,
label } mapping (keep option value={v or id} unchanged) — locate the GROK_VOICES
usage in the realtime.tsx dropdown mapping and replace the displayed text {v}
with a titlecased version (or use v.label) while preserving the option key/value
as the id.
- Around line 226-305: Extract the repeated (provider === 'openai' || provider
=== 'grok') predicate into a local boolean (e.g., const isOpenAICompat =
provider === 'openai' || provider === 'grok') and replace all occurrences in
this file that gate OpenAI-compatible UI (the select for OUTPUT_MODE_OPTIONS /
setOutputMode, the temperature range using setTemperature, the semanticEagerness
select using setSemanticEagerness, the tools indicator block, and the image
upload block) with isOpenAICompat; ensure the new constant is declared in the
component scope above these uses so the renders (and calls to
getStatusColor/getModeIndicator) behave identically.

In `@packages/typescript/ai-grok/package.json`:
- Around line 15-24: The package exports currently only expose the package root,
so the new adapters and realtime barrel aren’t tree-shakeable; add explicit
subpath exports for "/adapters" (pointing to the built adapter barrel(s) that
export grokSpeech, grokTranscription, grokRealtimeToken) and "/realtime"
(pointing to the built realtime barrel that exports grokRealtime / the
src/realtime/index.ts barrel) in package.json exports so consumers can import
e.g. "ai-grok/adapters" or "ai-grok/realtime" directly; update the exports
object to map "./adapters": { "types": "./dist/esm/adapters/index.d.ts",
"import": "./dist/esm/adapters/index.js" } and "./realtime": { "types":
"./dist/esm/realtime/index.d.ts", "import": "./dist/esm/realtime/index.js" }
(ensure the built files exist or update build output paths accordingly).

In `@packages/typescript/ai-grok/src/adapters/transcription.ts`:
- Around line 61-66: The GrokTranscriptionAdapter constructor currently declares
parameters as (config, model) but calls super(model, config), which differs from
BaseTranscriptionAdapter's (model, config) signature; update
GrokTranscriptionAdapter to declare its constructor as (model: TModel, config:
GrokTranscriptionConfig) (matching BaseTranscriptionAdapter and
FalTranscriptionAdapter), keep the super(model, config) call, and then read
this.apiKey, this.baseURL and this.defaultHeaders from the config parameter;
also scan and update any callers/tests that instantiate new
GrokTranscriptionAdapter to pass (model, config) order if needed so the adapter
is consistent with BaseTranscriptionAdapter and FalTranscriptionAdapter
(OpenAITranscriptionAdapter currently uses the (config, model) order as well).
- Around line 102-110: The fetch to `${this.baseURL}/stt` needs timeout support:
add an optional timeout:number to GrokTranscriptionConfig, create an
AbortController in the method that calls fetch (use controller.signal in the
fetch options), start a timer (setTimeout) that calls controller.abort() after
timeout ms, and clear the timer once fetch resolves or throws; ensure the abort
is handled (AbortError) and that headers/body usage remains unchanged.
Reference: GrokTranscriptionConfig, the fetch call to `${this.baseURL}/stt`, and
TranscriptionOptions (not exposing a signal) when wiring the AbortController and
timeout.

In `@packages/typescript/ai-grok/src/realtime/token.ts`:
- Line 35: The adapter currently calls getGrokApiKeyFromEnv() at construction
time inside grokRealtimeToken(), causing startup-time throws and preventing
later env updates from being picked up; move the call into the generateToken()
function so the API key is read lazily at token-generation time, i.e., remove
the top-level/constructor invocation of getGrokApiKeyFromEnv() and instead call
it inside generateToken() (and handle missing key there with the existing
error/log flow) so the behavior matches request-time adapters.

In `@packages/typescript/ai-grok/src/realtime/types.ts`:
- Around line 79-111: GrokRealtimeSessionResponse is over-specified and
unvalidated; update token.ts to validate only the fields you actually read
(model, client_secret.value, client_secret.expires_at) by replacing the broad
cast with a Zod schema and parsing the fetched JSON through it (e.g., define a
z.object({ model: z.string(), client_secret: z.object({ value: z.string(),
expires_at: z.number() }) }) and use safe parsing), then use the parsed values
when accessing client_secret.value / client_secret.expires_at and model;
alternatively mark all other properties on GrokRealtimeSessionResponse as
optional/unknown if you choose not to add validation.

In `@packages/typescript/ai-grok/src/utils/audio.ts`:
- Around line 20-31: In toAudioFile's string handling (the branch that checks
typeof audio === 'string'), ensure data: URLs are checked for a ';base64' marker
before using base64ToArrayBuffer/atob and if not base64 decode the payload via
decodeURIComponent; validate the non-data string input with a conservative
base64 regex before treating it as base64 and throw a clear error if it looks
like a URL/path/plain text (don’t call base64ToArrayBuffer blindly). Also
normalize MIME→extension using a small map (e.g., audio/mpeg -> mp3, audio/wav
or audio/x-wav -> wav, audio/ogg -> ogg) when deriving the filename, and use
base64ToArrayBuffer only for confirmed base64 payloads (update calls in
toAudioFile accordingly).

In `@packages/typescript/ai-grok/tests/audio-adapters.test.ts`:
- Around line 1-254: Add a unit test that mocks globalThis.fetch to verify the
realtime token flow: call the function grokRealtimeToken (or the adapter method
that obtains realtime tokens) and assert fetch was POSTed to
'/v1/realtime/client_secrets' with the expected Authorization header and request
body, return a mocked JSON containing client_id, client_secret and expires_at
(unix seconds) and assert the function multiplies expires_at by 1000 to produce
the returned expiresAt value; ensure the test restores globalThis.fetch in
afterEach and checks the response mapping (client_id/client_secret and
expiresAt) to lock in the URL/body/expires_at math.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bd7948d1-7f46-4bfd-a320-09263fbbc664

📥 Commits

Reviewing files that changed from the base of the PR and between dc71c72 and 2d713c2.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (26)
  • .changeset/grok-audio-speech-support.md
  • examples/ts-react-chat/src/lib/audio-providers.ts
  • examples/ts-react-chat/src/lib/server-audio-adapters.ts
  • examples/ts-react-chat/src/lib/server-fns.ts
  • examples/ts-react-chat/src/lib/use-realtime.ts
  • examples/ts-react-chat/src/routes/api.generate.speech.ts
  • examples/ts-react-chat/src/routes/api.transcribe.ts
  • examples/ts-react-chat/src/routes/realtime.tsx
  • packages/typescript/ai-grok/package.json
  • packages/typescript/ai-grok/src/adapters/transcription.ts
  • packages/typescript/ai-grok/src/adapters/tts.ts
  • packages/typescript/ai-grok/src/audio/transcription-provider-options.ts
  • packages/typescript/ai-grok/src/audio/tts-provider-options.ts
  • packages/typescript/ai-grok/src/index.ts
  • packages/typescript/ai-grok/src/model-meta.ts
  • packages/typescript/ai-grok/src/realtime/adapter.ts
  • packages/typescript/ai-grok/src/realtime/index.ts
  • packages/typescript/ai-grok/src/realtime/token.ts
  • packages/typescript/ai-grok/src/realtime/types.ts
  • packages/typescript/ai-grok/src/utils/audio.ts
  • packages/typescript/ai-grok/src/utils/index.ts
  • packages/typescript/ai-grok/tests/audio-adapters.test.ts
  • testing/e2e/global-setup.ts
  • testing/e2e/src/lib/feature-support.ts
  • testing/e2e/src/lib/media-providers.ts
  • testing/e2e/tests/test-matrix.ts

Comment on lines +12 to +14
export type GrokRealtimeModel =
| 'grok-voice-fast-1.0'
| 'grok-voice-think-fast-1.0'

🛠️ Refactor suggestion | 🟠 Major

GrokRealtimeModel is duplicated between realtime/types.ts and model-meta.ts.

model-meta.ts already defines GROK_REALTIME_MODELS plus export type GrokRealtimeModel = (typeof GROK_REALTIME_MODELS)[number]. Redeclaring the same union here creates two independent sources of truth that will drift (e.g., if a new realtime model is added to GROK_REALTIME_MODELS but not here, consumers importing from ./types will get stale types). Prefer re-exporting from model-meta.ts:

♻️ Proposed refactor
-/**
- * Grok realtime (Voice Agent) model options.
- */
-export type GrokRealtimeModel =
-  | 'grok-voice-fast-1.0'
-  | 'grok-voice-think-fast-1.0'
+import type { GrokRealtimeModel } from '../model-meta'
+export type { GrokRealtimeModel }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/types.ts` around lines 12 - 14, The
GrokRealtimeModel union is duplicated here causing a split source of truth;
remove the local declaration in realtime/types.ts and re-export the canonical
type from model-meta.ts (which defines GROK_REALTIME_MODELS and exports type
GrokRealtimeModel = (typeof GROK_REALTIME_MODELS)[number]) so consumers import
the single authoritative GrokRealtimeModel; update the export in
realtime/types.ts to re-export the type symbol GrokRealtimeModel from
model-meta.ts and delete the explicit union in this file.
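
The single-source-of-truth pattern described in this comment can be sketched as follows. The two model names and the derived type come from the review text; `isGrokRealtimeModel` is a hypothetical helper added only to show that the runtime list and the type stay in sync.

```typescript
// Canonical list, as the review says model-meta.ts defines it.
export const GROK_REALTIME_MODELS = [
  'grok-voice-fast-1.0',
  'grok-voice-think-fast-1.0',
] as const

// The union is derived from the array, so adding a model updates both at once.
export type GrokRealtimeModel = (typeof GROK_REALTIME_MODELS)[number]

// Hypothetical runtime guard built on the same list (not in the PR).
export function isGrokRealtimeModel(value: string): value is GrokRealtimeModel {
  return (GROK_REALTIME_MODELS as ReadonlyArray<string>).indexOf(value) !== -1
}
```

Other modules would then import or re-export GrokRealtimeModel from this one place instead of spelling out the union again.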

@AlemTuzlak

CR Round 1 — cr-loop findings (default mode)

Scope: diff-scoped + load-bearing findings from a 7-agent unbiased review. Pre-existing code is recorded but deferred in default mode.

Round 1: 23a / 12b / 0c findings → fixing bucket (a)

Bucket (a) — MANDATORY (will be addressed in fix commits)

ai-grok package

  1. src/realtime/adapter.ts: connect() leaks pc / localStream / audioContext on failure; microphone indicator stays on for the user.
  2. src/realtime/adapter.ts: dataChannelReady has no reject path (ICE fail, close, error, timeout) → connect() hangs forever.
  3. src/realtime/adapter.ts:141: dataChannel.onerror produces "[object Event]" error messages.
  4. src/realtime/adapter.ts: onmessage JSON parse catch swallows errors silently.
  5. src/realtime/adapter.ts:584: updateSession unconditionally sets input_audio_transcription: { model: 'grok-stt' } on every call; caller cannot disable.
  6. src/realtime/adapter.ts:315-316: response.done flips currentMode back to listening even from idle (post stopAudioCapture).
  7. src/realtime/adapter.ts:369-394: setupOutputAudioAnalysis leaks prior audioElement; autoplay rejection not surfaced.
  8. src/realtime/adapter.ts:385,402: audioContext.resume().catch(() => {}) empty swallow.
  9. src/realtime/adapter.ts: currentMessageId not reset on response.created, stale across tool-only turns.
  10. src/realtime/adapter.ts: missing pc.onconnectionstatechange / oniceconnectionstatechange → disconnection invisible.
  11. src/realtime/adapter.ts:500-519: sendImage sends image_url as a raw string; OpenAI-realtime (which this mirrors) expects an object { url: ... }.
  12. src/realtime/adapter.ts: handleServerEvent silently drops unknown event types (no default branch).
  13. src/realtime/token.ts:71: expires_at * 1000 assumes seconds; add unit-safety guard.
  14. src/adapters/tts.ts:154: getContentType('pcm') returns invalid IANA MIME 'audio/pcm'.
  15. src/utils/audio.ts: toAudioFile bare-base64 branch forces audio/mpeg, mislabels non-mp3 content.
  16. src/audio/transcription-provider-options.ts + src/adapters/transcription.ts: provider option format?: boolean collides with cross-provider TranscriptionOptions.format; rename (e.g. inverse_text_normalization).
  17. src/index.ts + src/realtime/types.ts: duplicate GrokRealtimeModel declaration; consolidate to model-meta.ts.
  18. package.json: missing @tanstack/ai-client in peerDependencies (realtime adapter imports it).
  19. tests/audio-adapters.test.ts:63-77: test title promises opus/aac/flac coverage, body exercises only opus.
  20. tests/audio-adapters.test.ts:43-48,190-196: Content-Type assertions are case-sensitive on a raw Record cast; normalize via new Headers().

examples/ts-react-chat

  21. src/lib/server-audio-adapters.ts:80-91: resolveModel silently substitutes the default when the override is invalid (user gets wrong model output with no error surface); fail-loud.

testing/e2e

  22. src/lib/media-providers.ts:48: createGrokImage('grok-2-image', ...) is not a valid GrokImageModel; the only entry in GROK_IMAGE_MODELS is 'grok-2-image-1212'.
  23. src/lib/feature-support.ts vs tests/test-matrix.ts: near-identical matrices, both edited here; dedupe to prevent future drift.

Bucket (b) — discretionary / speculative, not fixing this round

voice validation table; sample_rate required-for-raw-codec guard; multichannel/channels pair validation; outputModalities wire-field name (cannot verify without xAI docs); /v1/tts vs /v1/audio/speech endpoint path (explicitly chosen by PR author); tool_choice: 'auto' with empty tools; iceServers default; tool-args JSON parse fallback; error-message size/leak; example routes' 500 vs 4xx mapping; agentId typing; extended toAudioFile tests.

Bucket (c) — pre-existing / out of diff: none.

ai-grok realtime adapter:
- cleanup pc/localStream/audioContext/dataChannel on connect() failure
- dataChannelReady rejects on error/close/ICE-failed/timeout
- RTCErrorEvent extracted properly instead of [object Event]
- onmessage parse errors emit to consumers
- input_audio_transcription no longer overrides caller on every update
- response.done preserves idle mode after stopAudioCapture
- setupOutputAudioAnalysis disposes prior audioElement, surfaces autoplay blocks
- audioContext.resume failures emit error instead of silent swallow
- currentMessageId reset on response.created (tool-only turns)
- pc.onconnectionstatechange / oniceconnectionstatechange emit status_change
- sendImage uses object image_url for OpenAI-realtime compatibility
- unknown server events logged via default branch

ai-grok TTS/STT:
- getContentType returns audio/L16 for pcm (valid IANA MIME)
- toAudioFile requires explicit audio_format for bare base64
- transcription option renamed format -> inverse_text_normalization

ai-grok realtime token:
- expires_at unit-safety guard (seconds vs ms)

ai-grok types:
- single source of truth for GrokRealtimeModel (model-meta)

ai-grok tests:
- cover aac/flac in pickCodec test
- normalize header assertions via Headers()
- add realtime-token unit-safety tests

examples/ts-react-chat:
- resolveModel fails loud via InvalidModelOverrideError (no silent fallback)
- audio/speech/transcribe routes return 400 with structured body

testing/e2e:
- media-providers uses valid grok-2-image-1212 model
- test-matrix imports from feature-support (dedupe)
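
The expires_at unit-safety guard listed above could take a form like the sketch below. This is an assumption about the approach, not the code that landed: `toEpochMillis` and the threshold constant are hypothetical names, and the heuristic relies on current epoch seconds being around 1e9 while epoch milliseconds passed 1e12 in 2001.

```typescript
// Values below this are treated as epoch seconds, values at or above it as
// epoch milliseconds. Assumption: timestamps refer to roughly the present day.
const EPOCH_MS_THRESHOLD = 1e12

// Hypothetical helper: normalizes a provider expires_at value to milliseconds.
export function toEpochMillis(expiresAt: number): number {
  if (!isFinite(expiresAt) || expiresAt <= 0) {
    throw new Error(`Invalid expires_at value: ${expiresAt}`)
  }
  return expiresAt < EPOCH_MS_THRESHOLD ? expiresAt * 1000 : expiresAt
}
```

With this shape, a response that already reports milliseconds is passed through unchanged instead of being multiplied a second time.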
@AlemTuzlak

CR Round 2 (confirmation) — new load-bearing findings

Round 2: 12a / 30+b / 2c → fixing bucket (a).

These are mostly fix-regressions introduced by the Round 1 fix batch, plus a few genuine gaps agents newly surfaced.

Bucket (a) — MANDATORY

ai-grok/realtime

  1. adapter.ts: try/catch around SDP handshake does not cover setupInputAudioAnalysis(localStream) + await dataChannelReady. If dataChannelReady rejects after the analyser+AudioContext were allocated, those resources (plus the mic) leak.
  2. adapter.ts: onconnectionstatechange and oniceconnectionstatechange both emit status_change on the same transition → duplicate events for consumers.
  3. adapter.ts: pc.ontrack-allocated audioElement / outputAnalyser / outputSource are not torn down in the SDP failure catch.
  4. adapter.ts: dataChannelReady only rejects on 'failed'. Transitions to 'closed' / 'disconnected' before channel opens leave the caller waiting the full 15s timeout.
  5. adapter.ts:708-719: sendImage isUrl detects only http:// / https://. A data: URI input is double-wrapped into data:mime;base64,data:mime;base64,….

ai-grok/utils

  6. audio.ts:42: the || 'audio/mpeg' silent fallback contradicts the file-header doctrine. Malformed data: URIs without an embedded MIME should throw.
  7. audio.ts: audio.split(',') on e.g. data:audio/mpeg;base64 (no payload) silently produces an empty File.
  8. audio.ts: when a data: URI is present AND the caller passed an explicit audioFormat, the caller-supplied value is ignored. Priority inversion.

ai-grok/adapters

  11. tts.ts:154: audio/L16 content-type is missing the required rate= parameter (RFC 3551/3555). Browsers can't decode the PCM stream.

examples/ts-react-chat

  9. lib/server-audio-adapters.ts:27-31: findConfig throws plain Error('Unknown provider…'). HTTP routes cannot classify this as 400; add a typed UnknownProviderError analogous to InvalidModelOverrideError.
  10. lib/server-fns.ts: generateAudioFn does not map InvalidModelOverrideError to a structured response; error propagates as generic framework error.

tests

  12. tests/realtime-token.test.ts: afterEach unconditionally delete process.env.XAI_API_KEY, clobbering any pre-existing value in the shell / earlier test files. Capture and restore instead.

Bucket (b) — deferred

voice-at-adapter-boundary validation, multichannel/channels pair validation, inverse_text_normalization/language coupling, silent opus/aac→mp3 log, Buffer.from browser compat, as 'tts-1' casts, test-coverage for sample_rate/channels/mulaw/alaw, changeset wording, response.text() throw resilience, etc.

Bucket (c) — pre-existing

openai runtime dep (predates PR), Buffer reliance in other adapters (pre-existing pattern across package).

ai-grok realtime adapter:
- shared teardownConnection() helper runs on SDP and post-SDP failure paths, disposing input/output analysers, audio element, mic, data channel, pc, and audio context
- pre-open dataChannelReady rejection on failed/closed/disconnected pc states
- pc.onconnectionstatechange is sole source of status_change (ice handler only rejects)
- sendImage detects data: prefix (no more double-wrap)

ai-grok audio utils:
- malformed data: URI MIME parse throws instead of silently defaulting to audio/mpeg
- empty/missing base64 payload throws
- explicit audioFormat argument wins over URI-embedded MIME

ai-grok TTS:
- audio/L16 content-type includes required rate= parameter from modelOptions.sample_rate

ai-grok tests:
- realtime-token afterEach restores original XAI_API_KEY
- new coverage for malformed data URIs, audioFormat precedence, and rate= in audio/L16

examples/ts-react-chat:
- new UnknownProviderError typed class, 400 mapping in audio/speech/transcribe routes
- server-fns ServerFnError wraps typed adapter errors with stable code/details
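
The three audio-utils rules in this round (throw on a missing MIME, throw on an empty payload, explicit audioFormat wins) can be illustrated with a small parser. This is a sketch under stated assumptions: `parseAudioDataUri`, its return shape, and the format-to-MIME map are hypothetical and are not the PR's toAudioFile implementation.

```typescript
export interface ParsedAudioDataUri {
  mimeType: string
  base64: string
}

// Hypothetical, deliberately small format map; unknown formats fall back to audio/<format>.
const FORMAT_TO_MIME: Record<string, string> = {
  mp3: 'audio/mpeg',
  wav: 'audio/wav',
  ogg: 'audio/ogg',
}

export function parseAudioDataUri(
  uri: string,
  audioFormat?: string,
): ParsedAudioDataUri {
  // data:<mime>[;param...];base64,<payload>
  const match = /^data:([^;,]*)(?:;[^;,]+)*;base64,([\s\S]*)$/.exec(uri)
  if (!match) throw new Error('Expected a data: URI with a ";base64," section')
  const embeddedMime = match[1]
  const base64 = match[2]
  if (!base64) throw new Error('data: URI has an empty base64 payload')
  // The caller-supplied format takes priority over the URI-embedded MIME.
  const mimeType = audioFormat
    ? FORMAT_TO_MIME[audioFormat] ?? `audio/${audioFormat}`
    : embeddedMime
  if (!mimeType) {
    throw new Error('data: URI has no MIME type and no audioFormat was given')
  }
  return { mimeType, base64 }
}
```

Failing loudly in each malformed case avoids producing a mislabeled or empty File that the STT endpoint would reject later with a less helpful error.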
@AlemTuzlak

CR Round 3 (confirmation) — 4 new load-bearing findings

Round 3: 4a / 40+b / 3c → fixing bucket (a).

Most agent findings this round are speculative xAI-docs-verification concerns, defensive-coding suggestions, or pre-existing patterns (bucket b/c). The 4 truly new bucket (a) items:

  1. examples/ts-react-chat/src/lib/server-fns.ts: Round 2 added typed-error mapping only to generateAudioFn. generateSpeechFn / transcribeFn / generateSpeechStreamFn / transcribeStreamFn still let UnknownProviderError / InvalidModelOverrideError fall through as generic 500. Consistency gap in the stable-code contract.
  2. examples/ts-react-chat/src/routes/realtime.tsx:373-375: image-display src double-wraps a data: URI exactly like sendImage did before the Round 2 fix. Mirror the same startsWith('data:') guard on the display side.
  3. packages/typescript/ai-grok/src/realtime/adapter.ts: pendingEvents is never drained in teardownConnection/disconnect. Post-disconnect sendText/sendImage calls accumulate silently in a dead buffer.
  4. packages/typescript/ai-grok/src/adapters/tts.ts: contentType reports audio/L16;rate=${sampleRate} using the local ?? 24000 fallback, but output_format.sample_rate is only sent when the caller provided it. If xAI's server default ever drifts from 24000, every PCM response is mislabeled. Send the resolved sampleRate unconditionally so body and contentType agree.

Bucket (b) deferred

Extensive list: xAI-docs-verification items (format vs inverse_text_normalization wire name, modalities vs output_modalities, /v1/tts endpoint path, text_normalization default), test-coverage gaps for realtime adapter, as 'tts-1' casts, ElevenLabs 'alloy' voice fallback, voice validation, dropping confidence/speaker from STT response, pc.connectionState closed vs disconnected mapping distinction, non-exhaustive switch statements, numerous stylistic concerns.

Bucket (c) pre-existing

openai dep (predates PR), missing ./adapters subpath export (predates PR), Buffer non-Node reliance (pre-existing pattern).

examples/ts-react-chat:
- generateSpeechFn/transcribeFn/generateSpeechStreamFn/transcribeStreamFn now wrap adapter construction with rethrowAudioAdapterError for consistent typed-error responses
- realtime image display guards against data:/http(s): double-wrap

ai-grok realtime adapter:
- teardownConnection drains pendingEvents; sendEvent logs and skips after teardown

ai-grok TTS:
- sample_rate always forwarded in output_format so body and contentType rate agree
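
The pendingEvents behavior described for the realtime adapter (buffer before the channel opens, drain on teardown, skip sends afterwards) can be sketched in isolation. `RealtimeEventQueue` and `ChannelLike` are hypothetical names for illustration; the real adapter keeps this state inline and logs through its own logger.

```typescript
// Minimal structural stand-in for RTCDataChannel (assumption for the sketch).
type ChannelLike = { readyState: string; send(data: string): void }

export class RealtimeEventQueue {
  private pending: Array<string> = []
  private channel: ChannelLike | null = null
  private tornDown = false

  attach(channel: ChannelLike): void {
    this.channel = channel
  }

  get pendingCount(): number {
    return this.pending.length
  }

  // Returns false when the event was skipped because the queue is torn down.
  send(event: object): boolean {
    if (this.tornDown) return false
    const payload = JSON.stringify(event)
    if (this.channel && this.channel.readyState === 'open') {
      this.channel.send(payload)
    } else {
      this.pending.push(payload)
    }
    return true
  }

  // Call from the channel's open handler to deliver buffered events in order.
  flush(): void {
    if (!this.channel || this.channel.readyState !== 'open') return
    for (const payload of this.pending) this.channel.send(payload)
    this.pending = []
  }

  // Drop buffered events so nothing accumulates in a dead buffer.
  teardown(): void {
    this.tornDown = true
    this.pending = []
    this.channel = null
  }
}
```

Returning a boolean from send gives the caller a hook to log the skipped event, which mirrors the "logs and skips after teardown" wording in the summary.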
@AlemTuzlak

CR Round 4 (confirmation) — 7 new load-bearing findings

Round 4: 7a / 40+b / 2c → fixing bucket (a). Most items are fix-regressions introduced by Rounds 2/3 plus one leak I missed.

Bucket (a) — MANDATORY

  1. packages/typescript/ai-grok/src/utils/audio.tstoAudioFile Blob branch ignores audioFormat argument. A typeless Blob plus explicit audioFormat: 'wav' still produces audio.bin / application/octet-stream. Apply the same "prefer explicit audioFormat" rule already applied to ArrayBuffer and bare-base64.
  2. packages/typescript/ai-grok/src/adapters/tts.ts — Round 3 fix forces sample_rate in output_format for all codecs. For container codecs (mp3/wav/opus/aac/flac) whose contentType doesn't encode the rate, this overrides xAI's server default needlessly. Only force sample_rate when the codec's contentType embeds the rate (pcm / L16).
  3. packages/typescript/ai-grok/src/realtime/adapter.tspc and dataChannel are created BEFORE the getUserMedia try/catch (around lines 79-127 vs 374+). Mic-denial throws with no teardown → pc, dataChannel, listeners, and the 15s dataChannelReadyTimeout all leak. Wrap the getUserMedia block with teardown-on-failure.
  4. packages/typescript/ai-grok/src/realtime/adapter.ts:535-553response.function_call_arguments.done uses event.call_id ?? event.item_id. item_id is NOT a substitute for call_id; the subsequent sendToolResult(callId, …) will submit function_call_output with the wrong correlation and xAI silently drops the tool result. Remove the fallback, log + skip.
  5. packages/typescript/ai-grok/src/realtime/adapter.ts:teardownConnectionisTornDown = true assignment is at the bottom AFTER await audioContext.close(). Events dispatched during that await window (e.g. pc.close() cascade firing onconnectionstatechange) bypass the guard. Move the assignment to the top of teardownConnection.
  6. packages/typescript/ai-grok/src/realtime/adapter.ts:setupOutputAudioAnalysis — no isTornDown guard. A late pc.ontrack after teardown allocates a fresh AudioContext/Audio element that leaks.
  7. packages/typescript/ai-grok/src/realtime/adapter.ts:disconnect: explicit status_change: 'idle' emit duplicates the one triggered by pc.close() → onconnectionstatechange. Consumers see two idle transitions per user-initiated disconnect.
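
Items 5 to 7 come down to one ordering rule: mark the connection as torn down before doing anything that can fire handlers. The sketch below is a simplified model of that rule, not the adapter's code. `createConnection`, `closePeer`, and the status strings are hypothetical stand-ins, and the close step here is synchronous where the real teardown awaits `audioContext.close()`.

```typescript
type StatusListener = (status: string) => void

// Hypothetical, simplified model of the realtime connection lifecycle.
function createConnection(emit: StatusListener) {
  let isTornDown = false

  // Stand-in for pc.onconnectionstatechange.
  function onConnectionStateChange(state: string): void {
    if (isTornDown) return // late events after teardown are ignored
    emit(state)
  }

  // Stand-in for pc.close(): closing cascades into a state-change event.
  function closePeer(): void {
    onConnectionStateChange('closed')
  }

  function teardownConnection(): void {
    if (isTornDown) return // idempotent
    isTornDown = true // set first, so the close cascade hits the guard
    closePeer()
  }

  function disconnect(): void {
    teardownConnection()
    emit('idle') // the single explicit idle transition
  }

  return { onConnectionStateChange, teardownConnection, disconnect }
}
```

With the flag set first, a user-initiated `disconnect()` yields exactly one status event, and anything dispatched afterwards is dropped by the guard.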

Bucket (b) — deferred

xAI docs-verification items (wire field names), diarization data drop, voice validation table, RFC-exact μ-law/A-law content-type, RTCPeerConnection iceServers, example routes err.message leakage, ElevenLabs 'alloy' default, non-exhaustive switches, useMemo for adapter factory, rethrowAudioAdapterError indirection fragility, numerous test-coverage gaps, and other stylistic/verification-required concerns.

Bucket (c) — pre-existing

./adapters subpath export gap, openai runtime dep.

ai-grok realtime adapter:
- teardownConnection on getUserMedia failure (mic/pc/dataChannel leak on mic denial)
- response.function_call_arguments.done drops event if call_id absent (no item_id fallback)
- isTornDown set at top of teardown to guard handlers firing during close() awaits
- setupInputAudioAnalysis/setupOutputAudioAnalysis skip when torn down
- onconnectionstatechange no longer double-emits status_change during disconnect()

ai-grok audio utils:
- toAudioFile Blob/File branch prefers explicit audioFormat over Blob.type

ai-grok TTS:
- sample_rate forwarded only when caller provides one or codec is pcm (don't override server defaults for container codecs)

Tests updated to cover new audioFormat precedence paths and adjusted sample_rate assertions.
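
The narrowed sample_rate rule in this commit can be sketched as a small pure function. This is an illustration under stated assumptions: `buildOutputFormat` and `ASSUMED_PCM_SAMPLE_RATE` are hypothetical names, and the 24000 default is a placeholder, not a value taken from the adapter or from xAI's docs.

```typescript
type Codec = 'mp3' | 'wav' | 'pcm' | 'mulaw' | 'alaw'

interface OutputFormat {
  codec: Codec
  sample_rate?: number
}

// Assumption for illustration only; the adapter's actual default may differ.
const ASSUMED_PCM_SAMPLE_RATE = 24000

function buildOutputFormat(codec: Codec, callerSampleRate?: number): OutputFormat {
  // An explicit caller value always wins.
  if (callerSampleRate !== undefined) {
    return { codec, sample_rate: callerSampleRate }
  }
  // pcm: the contentType embeds the rate, so the body must state the same one.
  if (codec === 'pcm') {
    return { codec, sample_rate: ASSUMED_PCM_SAMPLE_RATE }
  }
  // Everything else: omit sample_rate and let the server default apply.
  return { codec }
}
```

For example, `buildOutputFormat('mp3')` carries no `sample_rate` key at all, while `buildOutputFormat('wav', 44100)` forwards the caller's value unchanged.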
@AlemTuzlak
Contributor

CR Round 5 (confirmation) — 6 new load-bearing findings

Round 5: 6a / 30+b / 4c → fixing bucket (a).

Bucket (a) — MANDATORY

  1. ai-grok/src/realtime/adapter.ts: pc.connectionState === 'failed' emits status_change: 'error' but doesn't call teardownConnection(). Mic tracks / PC / AudioContext keep running after spontaneous failure.
  2. ai-grok/src/realtime/adapter.ts:flushPendingEvents — no try/catch around dataChannel.send(...). A synchronous send failure during the initial flush propagates out of onopen, leaves dataChannelReady pending, and hangs the caller for 15s.
  3. ai-grok/src/realtime/adapter.ts:handleServerEvent case 'error': (event.error as Record<…>).message throws TypeError on error: null or missing field. The outer catch then re-emits the TypeError, masking the original xAI error.
  4. ai-grok/src/realtime/adapter.ts: audio.play().catch and audioContext.resume().catch emit generic error events. Autoplay restrictions pre-user-gesture are routine/recoverable; this surfaces as a fatal error in the UI. Regression from Round 2; log at provider level or emit a dedicated audio_permission_required signal.
  5. ai-grok/src/realtime/adapter.ts:dataChannel.onerror — emits error without an isTornDown guard; post-teardown error events leak after intentional disconnect.
  6. examples/ts-react-chat/src/routes/realtime.tsx:handleImageUpload — no FileReader.onerror; silent drop on corrupt/zero-byte files. result as string can be null on empty files → null.split throws.

Bucket (b) — deferred

xAI wire-name verification items, confidence/speaker drop, ElevenLabs default voice, μ-law/A-law content-type rate, non-exhaustive switches, rethrowAudioAdapterError typing, Zod enum duplication, sendText/sendImage + VAD race, test coverage gaps, error-body leaks, etc.

Bucket (c) — pre-existing

openai runtime dep, missing ./adapters subpath export, sideEffects field, Buffer reliance.

ai-grok realtime adapter:
- pc.connectionState=failed triggers automatic teardownConnection (mic/pc/audioContext no longer leak on spontaneous failure)
- flushPendingEvents wraps send in try/catch; emits error on failure instead of hanging caller
- handleServerEvent case 'error' validates shape of event.error; preserves code/type/param; safe against null/missing fields
- autoplay and audioContext.resume failures log without emitting fatal error events (routine browser-policy outcomes)
- dataChannel.onerror/onclose gated behind isTornDown to suppress post-disconnect error events

examples/ts-react-chat:
- realtime.tsx handleImageUpload validates FileReader result, file.type, and base64 extraction; surfaces errors visibly
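
The base64-extraction part of the handleImageUpload fix can be isolated as a pure function, which keeps it testable without a browser. A minimal sketch, assuming the reader was started with `readAsDataURL`; `parseImageDataUrl` and the `reason` strings are hypothetical, not the example app's actual code.

```typescript
type ImagePayload =
  | { ok: true; mimeType: string; base64: string }
  | { ok: false; reason: string }

// `result` mirrors the type of FileReader.result: string | ArrayBuffer | null.
function parseImageDataUrl(result: string | ArrayBuffer | null): ImagePayload {
  // Guard first: calling .split on a null result is the crash being fixed.
  if (typeof result !== 'string') {
    return { ok: false, reason: 'reader produced no string result' }
  }
  const match = /^data:([^;,]+);base64,(.*)$/.exec(result)
  if (match === null) {
    return { ok: false, reason: 'not a base64 data URL' }
  }
  const mimeType = match[1] ?? ''
  const base64 = match[2] ?? ''
  if (!mimeType.startsWith('image/')) {
    return { ok: false, reason: `unsupported type: ${mimeType}` }
  }
  if (base64.length === 0) {
    return { ok: false, reason: 'empty image payload' }
  }
  return { ok: true, mimeType, base64 }
}
```

Returning a tagged result instead of throwing lets the caller decide how to surface the failure in the UI.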
@AlemTuzlak
Contributor

CR Round 6 (confirmation) — 1 new load-bearing finding

Round 6: 1a / 40+b / 6c → fixing bucket (a).

Bucket (a) — MANDATORY

  1. packages/typescript/ai-grok/src/utils/audio.ts: extensionFor('audio/basic') (mulaw) and extensionFor('audio/x-alaw-basic') (alaw) fall to the default branch and produce garbage filenames audio.basic / audio.x-alaw-basic. GrokSTTAudioFormat explicitly allows 'mulaw' / 'alaw'. Add explicit cases mapping to sensible extensions.

Bucket (b) — deferred

Most Round 6 findings recur from prior rounds and depend on xAI docs verification or fall into stylistic/design-call territory:

  • xAI wire-name verification items (ITN format, output_modalities, /v1/tts endpoint path)
  • ElevenLabs 'alloy' voice fallback, voice validation
  • let adapter narrowing fragility, rethrowAudioAdapterError indirection
  • Non-exhaustive switch defaults, as 'tts-1' casts
  • inverse_text_normalization requires language validation
  • μ-law/A-law contentType RFC-exact naming
  • multichannel/channels pair validation
  • Diarization confidence/speaker data drop
  • Zod enum duplication across routes
  • Test coverage for realtime adapter, boundary conditions
  • Stale closure / re-render concerns in use-realtime
  • oniceconnectionstatechange pre-open 'disconnected' rejection (intentional fail-fast)
  • Numerous other stylistic / speculative items

Bucket (c) — pre-existing

Missing ./adapters subpath export, openai runtime dep, empty author, missing sideEffects: false, Buffer reliance, duplicate voice type declarations — all predate this PR.

…enames

utils/audio.ts produced 'audio.basic' and 'audio.x-alaw-basic' for mulaw/alaw
via the default-branch MIME split. Servers using filename as a format hint
now see 'audio.mulaw' / 'audio.alaw', matching the reverse toMimeType mapping.
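
A sketch of the mapping this fix describes is below. Only the mulaw and alaw cases come from this thread; the other cases, the `bin` fallback, and the `fileNameFor` helper are assumptions added for illustration.

```typescript
function extensionFor(mimeType: string): string {
  // Drop parameters such as ";rate=8000" and normalize case.
  const base = (mimeType.split(';')[0] ?? '').trim().toLowerCase()
  switch (base) {
    case 'audio/basic':
      return 'mulaw' // explicit case from this fix
    case 'audio/x-alaw-basic':
      return 'alaw' // explicit case from this fix
    case 'audio/mpeg':
      return 'mp3' // assumed mapping
    case 'application/octet-stream':
      return 'bin' // assumed mapping
    default: {
      // Fallback: use the MIME subtype, or "bin" when there is none.
      const subtype = base.split('/')[1]
      return subtype !== undefined && subtype.length > 0 ? subtype : 'bin'
    }
  }
}

// Hypothetical helper showing how the extension becomes a filename hint.
function fileNameFor(mimeType: string): string {
  return `audio.${extensionFor(mimeType)}`
}
```

The default branch is what previously turned `audio/basic` into `audio.basic`, which is why the two telephony types need explicit cases ahead of it.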
@AlemTuzlak
Contributor

CR Round 7 (confirmation) — CONVERGED on bucket (a)

Round 7: 0a / 40+b / 9c.

Every finding in this round repeats a previously-deferred bucket (b) item or a pre-existing bucket (c) concern. No new fix-regressions introduced by prior fix batches.

Convergence trajectory: 23 → 12 → 4 → 7 → 6 → 1 → 0 bucket (a) findings over 7 rounds.

Proceeding to final verification (CI on PR HEAD). Bucket (b) items remain explicitly deferred in default mode — mostly xAI-docs verification items (format vs inverse_text_normalization wire name, modalities vs output_modalities, /v1/tts vs /v1/audio/speech, grok-stt realtime transcription model), feature completeness (diarization confidence/speaker forwarding, voice validation tables), and stylistic/defensive concerns not affecting correctness of the diff. Bucket (c) items (openai runtime dep, missing ./adapters subpath export, author empty, generateId edge case in utils/client.ts) predate this PR.

Commits in this CR cycle

  • 5100e73 — Round 1 fixes (23 findings)
  • a7912a5 — Round 2 fixes (12 findings)
  • 3be4261 — Round 3 fixes (4 findings)
  • d86959a — Round 4 fixes (7 findings)
  • d385256 — Round 5 fixes (6 findings)
  • 47ed1fb — Round 6 fix (1 finding)

@AlemTuzlak
Contributor

CR-loop complete ✅

CI status on PR HEAD (47ed1fb): all gating checks green — E2E Tests, Test (PR), autofix, Preview, Provenance, Version Preview, Continuous Releases. (CodeRabbit is an external review bot, not a gating check.)

Convergence: 0 bucket-(a) findings after a 7-agent unbiased confirmation round in Round 7.

Trajectory: 23 → 12 → 4 → 7 → 6 → 1 → 0 bucket-(a) findings over 7 rounds.

6 fix batches:

  • 5100e73 — Round 1 (23 findings): realtime leaks + dataChannelReady rejects + sendImage shape + updateSession defaults + audio/L16 contentType + toAudioFile MIME strictness + format→inverse_text_normalization rename + duplicate GrokRealtimeModel + expires_at unit safety + Content-Type header test normalization + grok-2-image-1212 e2e fix + test-matrix dedup + InvalidModelOverrideError typed + aac/flac test coverage.
  • a7912a5 — Round 2 (12 findings): post-SDP teardown, duplicate status_change, audioElement/output analysers cleanup, dataChannelReady pre-open reject, sendImage data: detection, malformed data: URI handling, audioFormat precedence, audio/L16 rate= parameter, realtime-token test env restore, typed UnknownProviderError + 400 mapping.
  • 3be4261 — Round 3 (4 findings): server-fn typed errors for speech/transcribe, display-side data: double-wrap fix, pendingEvents teardown clear, sample_rate/contentType agreement.
  • d86959a — Round 4 (7 findings): toAudioFile Blob/File audioFormat precedence, sample_rate forwarding narrowed to pcm-or-caller, getUserMedia-failure teardown, tool_call call_id fallback removed, isTornDown to top of teardown, setup*AudioAnalysis isTornDown guard, disconnect() duplicate status_change suppression.
  • d385256 — Round 5 (6 findings): pc.connectionState=failed auto-teardown, flushPendingEvents try/catch, shape-safe event.error handling, autoplay/resume softened, dataChannel.onerror/onclose isTornDown guard, FileReader error/empty-type handling.
  • 47ed1fb — Round 6 (1 finding): extensionFor mulaw/alaw filename mapping.

Deferred (bucket b/c): mostly xAI-docs-verification items (wire field names for ITN/modalities/endpoint paths, grok-stt realtime transcription model), feature completeness (diarization data forwarding, voice validation tables), and pre-existing items (openai runtime dep, missing ./adapters subpath, generateId in utils/client.ts). All recorded in per-round PR comments.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
examples/ts-react-chat/src/lib/server-fns.ts (1)

111-127: Optional: extract the let adapter; try { ... } catch { rethrow } dance into a helper.

The same 6-line pattern is duplicated across five handlers. A small helper like buildOrRethrow(() => buildSpeechAdapter(...)) would deduplicate and make the intent (translate adapter errors) more obvious at call sites, without changing behavior.

Also applies to: 137-153, 164-180, 279-298, 308-327

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ts-react-chat/src/lib/server-fns.ts` around lines 111 - 127, Extract
the repeated try/catch that calls buildSpeechAdapter and
rethrowAudioAdapterError into a small helper (e.g., buildOrRethrow) and replace
each duplicated block with a single call to that helper; specifically, create a
helper that accepts a zero-arg function (or provider string) to call
buildSpeechAdapter, catches any error and calls rethrowAudioAdapterError(err),
and returns the adapter so handlers like the anonymous function in .handler and
the other handlers referencing buildSpeechAdapter (lines around start of
generateSpeech calls) simply call buildOrRethrow(() =>
buildSpeechAdapter(data.provider ?? 'openai')) (or
buildOrRethrow(data.provider)) to get the adapter and then pass it to
generateSpeech.
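
One possible shape for the suggested helper is sketched below. The thread does not show the signature of rethrowAudioAdapterError, so the sketch assumes a translator of type `(err: unknown) => never` and passes it in as a parameter. `rethrowAsBadRequest` and `buildFakeAdapter` are hypothetical stand-ins used only to exercise it.

```typescript
// Assumed shape of the error translator; it must always throw.
type Rethrow = (err: unknown) => never

function buildOrRethrow<T>(build: () => T, rethrow: Rethrow): T {
  try {
    return build()
  } catch (err) {
    // `rethrow` never returns, so T is still satisfied on this path.
    return rethrow(err)
  }
}

// Stand-in translator: wraps any failure in a prefixed Error.
const rethrowAsBadRequest: Rethrow = (err) => {
  const message = err instanceof Error ? err.message : String(err)
  throw new Error(`bad_request: ${message}`)
}

// Stand-in factory: rejects providers it does not know.
function buildFakeAdapter(provider: string): { provider: string } {
  if (provider !== 'openai' && provider !== 'grok') {
    throw new Error(`unknown provider: ${provider}`)
  }
  return { provider }
}

// Call site: one line instead of the let / try / catch block.
const adapter = buildOrRethrow(() => buildFakeAdapter('grok'), rethrowAsBadRequest)
```

Typing the translator as returning `never` is what lets the helper's return type stay `T` without a cast or an unreachable trailing `throw`.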

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7c97a895-bf28-41fb-bd8b-ead54c3f5abb

📥 Commits

Reviewing files that changed from the base of the PR and between 2d713c2 and cdff0cd.

📒 Files selected for processing (19)
  • examples/ts-react-chat/src/lib/server-audio-adapters.ts
  • examples/ts-react-chat/src/lib/server-fns.ts
  • examples/ts-react-chat/src/routes/api.generate.audio.ts
  • examples/ts-react-chat/src/routes/api.generate.speech.ts
  • examples/ts-react-chat/src/routes/api.transcribe.ts
  • examples/ts-react-chat/src/routes/realtime.tsx
  • packages/typescript/ai-grok/src/adapters/transcription.ts
  • packages/typescript/ai-grok/src/adapters/tts.ts
  • packages/typescript/ai-grok/src/audio/transcription-provider-options.ts
  • packages/typescript/ai-grok/src/realtime/adapter.ts
  • packages/typescript/ai-grok/src/realtime/index.ts
  • packages/typescript/ai-grok/src/realtime/token.ts
  • packages/typescript/ai-grok/src/realtime/types.ts
  • packages/typescript/ai-grok/src/utils/audio.ts
  • packages/typescript/ai-grok/tests/audio-adapters.test.ts
  • packages/typescript/ai-grok/tests/realtime-token.test.ts
  • testing/e2e/src/lib/feature-support.ts
  • testing/e2e/src/lib/media-providers.ts
  • testing/e2e/tests/test-matrix.ts
✅ Files skipped from review due to trivial changes (3)
  • packages/typescript/ai-grok/src/audio/transcription-provider-options.ts
  • packages/typescript/ai-grok/src/realtime/index.ts
  • packages/typescript/ai-grok/src/realtime/types.ts
🚧 Files skipped from review as they are similar to previous changes (8)
  • examples/ts-react-chat/src/routes/api.transcribe.ts
  • testing/e2e/src/lib/feature-support.ts
  • testing/e2e/src/lib/media-providers.ts
  • examples/ts-react-chat/src/lib/server-audio-adapters.ts
  • packages/typescript/ai-grok/src/realtime/token.ts
  • packages/typescript/ai-grok/src/adapters/transcription.ts
  • examples/ts-react-chat/src/routes/realtime.tsx
  • testing/e2e/tests/test-matrix.ts

Comment on lines +73 to +92
if (err instanceof InvalidModelOverrideError) {
return jsonError(400, {
error: 'invalid_model_override',
message: err.message,
provider: err.providerId,
requestedModel: err.requestedModel,
allowedModels: err.allowedModels,
})
}
// Defense-in-depth: the Zod enum schema above should already reject
// unknown providers, but surface a typed 400 here in case that
// validation drifts or is bypassed.
if (err instanceof UnknownProviderError) {
return jsonError(400, {
error: 'unknown_provider',
message: err.message,
providerId: err.providerId,
allowedProviders: err.allowedProviders,
})
}
Contributor


⚠️ Potential issue | 🟡 Minor

Minor: payload field name differs between the two 400 cases (provider vs providerId).

The invalid_model_override branch emits the provider under the key provider, while the unknown_provider branch uses providerId. Consumers doing generic error rendering will need a special case for each code. server-fns.ts' rethrowAudioAdapterError uses providerId for both, and the sibling route api.transcribe.ts likely matches — consider aligning to providerId everywhere for a uniform client contract.

🔧 Proposed fix
           if (err instanceof InvalidModelOverrideError) {
             return jsonError(400, {
               error: 'invalid_model_override',
               message: err.message,
-              provider: err.providerId,
+              providerId: err.providerId,
               requestedModel: err.requestedModel,
               allowedModels: err.allowedModels,
             })
           }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ts-react-chat/src/routes/api.generate.audio.ts` around lines 73 -
92, The two 400 error branches return inconsistent payload keys for the
provider; in the InvalidModelOverrideError branch (the block that calls
jsonError with error 'invalid_model_override') change the returned key named
provider to providerId so it matches the UnknownProviderError branch and the
rest of the codebase (see rethrowAudioAdapterError in server-fns.ts and the
sibling route api.transcribe.ts). Locate the jsonError call inside the
InvalidModelOverrideError handling and rename the provider property to
providerId while keeping the same value (err.providerId or err.provider
depending on how the error exposes it), ensuring both branches emit providerId
consistently.

Comment on lines +825 to +834
if (dataChannel?.readyState === 'open') {
logger.provider(
`provider=grok direction=out type=${(event.type as string | undefined) ?? '<unknown>'}`,
{ frame: event },
)
dataChannel.send(JSON.stringify(event))
} else {
pendingEvents.push(event)
}
}
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Can RTCDataChannel.send() throw even when readyState is 'open' due to a race with the state transitioning to 'closing'?

💡 Result:

Yes, RTCDataChannel.send can throw an InvalidStateError even when readyState is 'open' due to a race condition with the state transitioning to 'closing'.

Citations:


🏁 Script executed:

cd /repository && find . -type f -name "adapter.ts" | grep -i realtime | head -5

Repository: TanStack/ai

Length of output: 117


🏁 Script executed:

# Check the file and verify the code snippet and surrounding context
cat -n packages/typescript/ai-grok/src/realtime/adapter.ts | sed -n '820,870p'

Repository: TanStack/ai

Length of output: 2303


🏁 Script executed:

# Search for flushPendingEvents implementation to verify it has try/catch
cat -n packages/typescript/ai-grok/src/realtime/adapter.ts | sed -n '837,861p'

Repository: TanStack/ai

Length of output: 1285


🏁 Script executed:

# Check for error handling patterns in the file - look for emit('error') and logger usage
rg "emit\('error'" packages/typescript/ai-grok/src/realtime/adapter.ts -A 2 -B 2

Repository: TanStack/ai

Length of output: 1423


🏁 Script executed:

# Check logger.errors usage pattern in this file
rg "logger\.errors" packages/typescript/ai-grok/src/realtime/adapter.ts -A 1 -B 1 | head -20

Repository: TanStack/ai

Length of output: 537


sendEvent doesn't guard against dataChannel.send throwing — inconsistent with flushPendingEvents.

RTCDataChannel.send can throw after the readyState === 'open' check — e.g., InvalidStateError if the channel transitions to closing/closed between the check and the call, or TypeError/NetworkError for buffer/network issues. flushPendingEvents (lines 837–861) wraps its send loop in try/catch; sendEvent does not. Since the public methods (sendText, sendImage, updateSession, interrupt) are synchronous fire-and-forget calls typically invoked from React event handlers, an uncaught throw here will propagate into consumer code and likely trip an error boundary.

Proposed fix
    if (dataChannel?.readyState === 'open') {
      logger.provider(
        `provider=grok direction=out type=${(event.type as string | undefined) ?? '<unknown>'}`,
        { frame: event },
      )
-     dataChannel.send(JSON.stringify(event))
+     try {
+       dataChannel.send(JSON.stringify(event))
+     } catch (error) {
+       logger.errors('grok.realtime sendEvent failed', {
+         error,
+         source: 'grok.realtime',
+         eventType: (event.type as string | undefined) ?? '<unknown>',
+       })
+       emit('error', {
+         error: error instanceof Error ? error : new Error(String(error)),
+       })
+     }
    } else {
      pendingEvents.push(event)
    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 825 - 834,
sendEvent currently checks dataChannel?.readyState === 'open' then calls
dataChannel.send without try/catch, which can throw (e.g., InvalidStateError)
and leak into callers; change sendEvent so that the send is wrapped in a
try/catch similar to flushPendingEvents: call logger.provider before sending,
attempt JSON.stringify and dataChannel.send inside a try block, and on any error
catch it, log the error (use logger.provider or logger.error with the same
context including event type), and push the event onto pendingEvents so it will
be retried by flushPendingEvents; reference sendEvent, flushPendingEvents,
dataChannel, pendingEvents, and logger.provider in your change.

…tion, fix xAI realtime event names

Refactors from user review:

adapters:
- tts.ts: extract buildTTSRequestBody helper (codec/sample_rate/voice default
  resolution + body assembly). Export getContentType for consumer use.
- transcription.ts: extract buildTranscriptionFormData helper (wire-field
  mapping including xAI's named 'format' boolean toggle for inverse text
  normalization).

model-meta.ts: audio and realtime models now use the same
`as const satisfies ModelMeta` convention as chat/image models
(GROK_TTS, GROK_STT, GROK_VOICE_FAST_1, GROK_VOICE_THINK_FAST_1) with
input/output modalities and tool_calling / reasoning capabilities.

realtime adapter:
- Replace drive-by 'as' casts on untyped server events with runtime-checked
  readers (readString, readObject, readObjectArray); malformed frames return
  undefined instead of throwing a TypeError.
- Accept both legacy OpenAI-realtime event names and current xAI voice-agent
  names per docs.x.ai: response.output_audio.* / response.output_audio_transcript.* /
  response.text.* (plus existing response.audio.* / response.audio_transcript.* /
  response.output_text.* aliases for compatibility).
- RealtimeServerError type replaces repeated 'as Error & { code?: string }' casts.

realtime token:
- Wrap request body with { session: { model } } per xAI /v1/realtime/client_secrets
  schema (was bare { model } before).
@AlemTuzlak
Contributor

Post-CR refactor — review feedback addressed

Pushed follow-up commits (183a7cc, aa53a9c) responding to maintainer review:

1. Helper extraction

  • buildTTSRequestBody() in adapters/tts.ts — codec / sample_rate / voice defaults + body assembly
  • buildTranscriptionFormData() in adapters/transcription.ts — wire-field mapping for STT form fields
  • getContentType() exported for consumer use

2. as cast reduction in realtime/adapter.ts

  • Added readString / readObject / readObjectArray runtime-checked field readers
  • Replaced every event.X as string / event.X as Record<…> pattern on untyped server events; malformed frames now return undefined instead of throwing TypeError inside handleServerEvent
  • Introduced RealtimeServerError type alias — replaces repeated as Error & { code?: string } casts in the error event handler
  • providerOptions double-cast cleaned up via a single Record<string, unknown> declaration
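
The reader helpers named above could look roughly like this. The function names come from this comment, but the signatures and return types are assumptions: in particular, whether readObjectArray returns `undefined` or an empty array for a missing field is not stated in the thread.

```typescript
type UnknownRecord = Record<string, unknown>

function isRecord(value: unknown): value is UnknownRecord {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

// Returns the field only when it is actually a string.
function readString(source: unknown, key: string): string | undefined {
  if (!isRecord(source)) return undefined
  const value = source[key]
  return typeof value === 'string' ? value : undefined
}

// Returns the field only when it is a plain (non-null, non-array) object.
function readObject(source: unknown, key: string): UnknownRecord | undefined {
  if (!isRecord(source)) return undefined
  const value = source[key]
  return isRecord(value) ? value : undefined
}

// Returns the object elements of an array field, skipping anything else.
function readObjectArray(source: unknown, key: string): UnknownRecord[] | undefined {
  if (!isRecord(source)) return undefined
  const value = source[key]
  if (!Array.isArray(value)) return undefined
  return value.filter(isRecord)
}
```

Because every reader accepts `unknown`, lookups chain safely: `readString(readObject(event, 'error'), 'message')` is `undefined` for `{ error: null }` rather than a TypeError.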

3. model-meta.ts convention alignment
Audio and realtime models now follow the same as const satisfies ModelMeta pattern as chat/image models:

  • GROK_TTS: { input: ['text'], output: ['audio'] }
  • GROK_STT: { input: ['audio'], output: ['text'] }
  • GROK_VOICE_FAST_1: { input: ['audio','text'], output: ['audio','text'], capabilities: ['tool_calling'] }
  • GROK_VOICE_THINK_FAST_1 — adds 'reasoning' capability

4. xAI docs verification (resolved Round 7 bucket (b) items)
Fetched docs.x.ai/developers/rest-api-reference/inference/voice — results:

✅ Confirmed correct:

  • /v1/tts and /v1/stt paths
  • format: 'true'|'false' IS the xAI wire field for inverse text normalization
  • optimize_streaming_latency: 0 | 1 (integer)
  • text_normalization: boolean
  • modalities in session.update
  • Voices eve/ara/rex/sal/leo
  • Codecs mp3/wav/pcm/mulaw/alaw (opus/aac/flac silent downgrade is correct — xAI doesn't support those)

🔧 Fixed per docs:

  • Realtime server event names: xAI uses response.output_audio.*, response.output_audio_transcript.*, response.text.*. Adapter now accepts these AND the legacy OpenAI-realtime aliases (response.audio.*, response.audio_transcript.*, response.output_text.*) for compatibility
  • Token request body: /v1/realtime/client_secrets expects { session: { model } } per docs — previously sent bare { model }. Added regression test.
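
One way to accept both naming schemes is to fold the legacy prefixes onto the current ones before dispatching. This is a sketch of that idea only. The thread does not say whether the adapter normalizes names or lists both in its switch, and `canonicalEventType` is a hypothetical helper.

```typescript
// Legacy OpenAI-realtime prefix -> current xAI prefix, as listed in this comment.
const LEGACY_PREFIXES: ReadonlyArray<readonly [string, string]> = [
  ['response.audio_transcript.', 'response.output_audio_transcript.'],
  ['response.audio.', 'response.output_audio.'],
  ['response.output_text.', 'response.text.'],
]

function canonicalEventType(type: string): string {
  for (const [legacy, current] of LEGACY_PREFIXES) {
    if (type.startsWith(legacy)) {
      // Keep the suffix (delta, done, ...) and swap only the prefix.
      return current + type.slice(legacy.length)
    }
  }
  // Current xAI names and unrelated events pass through unchanged.
  return type
}
```

The trailing dot in each prefix matters: it keeps `response.audio.` from matching `response.audio_transcript.` events.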

5. @tanstack/ai-client in peer+dev
Yes, this is due to realtime: src/realtime/adapter.ts imports RealtimeAdapter and RealtimeConnection types from @tanstack/ai-client. The peerDependencies entry tells consumers to install it; the devDependencies entry (via workspace:*) makes local build/test work inside the monorepo. This is the same pattern used by @tanstack/ai-elevenlabs which also ships a realtime adapter.

autofix-ci Bot and others added 3 commits April 24, 2026 11:02
…time contract

The RealtimeAdapter / RealtimeConnection interfaces are duplicated locally
in src/realtime/realtime-contract.ts. The adapter imports them from there
instead of @tanstack/ai-client, so consumers of @tanstack/ai-grok no longer
have to install @tanstack/ai-client unless they also want to construct a
RealtimeClient from it (structural typing covers that use case).

@tanstack/ai-client stays as a devDependency to run a type-level drift check
(tests/realtime-contract.drift.test-d.ts) that asserts our inlined contract
is bidirectionally assignable to the canonical one. If ai-client ever changes
the interface, that file will fail to compile and we update both in lockstep.
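
The bidirectional-assignability check can be expressed with a pair of conditional types. The sketch below shows the mechanism only: `CanonicalConnection` and `InlinedConnection` are hypothetical stand-ins with made-up members, not the real RealtimeConnection contract.

```typescript
// True when A is assignable to B (tuple-wrapped to avoid distributing over unions).
type IsAssignable<A, B> = [A] extends [B] ? true : false

type MutuallyAssignable<A, B> =
  IsAssignable<A, B> extends true
    ? IsAssignable<B, A> extends true
      ? true
      : false
    : false

// Hypothetical stand-ins for the canonical and inlined interfaces.
interface CanonicalConnection {
  disconnect(): Promise<void>
  sendText(text: string): void
}

interface InlinedConnection {
  disconnect(): Promise<void>
  sendText(text: string): void
}

// Compiles only while both interfaces stay mutually assignable; if they
// drift, the type becomes `false` and the `true` initializer is rejected.
const contractInSync: MutuallyAssignable<InlinedConnection, CanonicalConnection> = true
```

Assignability is looser than identity, so a check of this kind catches removed or retyped required members but would not notice, for example, a newly added optional member.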

publint --strict: clean.
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
packages/typescript/ai-grok/src/adapters/tts.ts (1)

144-146: bit_rate silently dropped for non-mp3 codecs.

If a caller sets modelOptions.bit_rate alongside e.g. codec: 'opus' (or uses a format that maps to mp3 via pickCodec, but explicitly overrides codec to something else), the bit_rate is discarded without any signal. At minimum, consider a debug log via the existing logger so this isn't silently ignored; alternatively document that bit_rate only applies when the resolved codec is mp3.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/adapters/tts.ts` around lines 144 - 146, The
code currently drops modelOptions.bit_rate when the resolved codec (variable
codec from pickCodec) is not 'mp3' with no signal; update the tts adapter to
detect when modelOptions.bit_rate is set but codec !== 'mp3' and emit a clear
logger warning/debug message (use the existing logger instance) stating that
bit_rate is ignored for non-mp3 codecs and showing the provided bit_rate and
resolved codec; alternatively, if you prefer stricter behavior, validate and
throw an error instead—references: codec, modelOptions.bit_rate, outputFormat,
and pickCodec so you can locate the logic to add the log/validation.
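
A minimal sketch of the warning variant, assuming the resolved codec and the requested bit_rate are already known. `resolveBitRate` and the warning text are hypothetical; in the adapter the message would presumably go through the existing logger rather than being returned.

```typescript
interface BitRateResolution {
  bitRate: number | undefined
  warning: string | undefined
}

function resolveBitRate(codec: string, requested: number | undefined): BitRateResolution {
  // Nothing requested: nothing to forward and nothing to warn about.
  if (requested === undefined) {
    return { bitRate: undefined, warning: undefined }
  }
  // bit_rate only applies when the resolved codec is mp3.
  if (codec === 'mp3') {
    return { bitRate: requested, warning: undefined }
  }
  // Still dropped, but no longer silently.
  return {
    bitRate: undefined,
    warning: `bit_rate=${requested} ignored: resolved codec is ${codec}, not mp3`,
  }
}
```

Returning the warning instead of logging inside the function keeps the decision testable and leaves the choice of log level to the caller.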
packages/typescript/ai-grok/src/realtime/adapter.ts (3)

100-107: connectionMode !== 'webrtc' becomes a runtime error with no type-level signal.

If a caller sets connectionMode: 'websocket', the failure happens asynchronously inside connect() rather than at construction. If GrokRealtimeOptions['connectionMode'] is typed to include 'websocket', that's a correctness hazard; if it's typed as 'webrtc' only, this whole else branch is unreachable. Either:

  • narrow the type to 'webrtc' and drop the dead branch, or
  • fail fast in grokRealtime() itself (synchronously) so the error surfaces at setup time rather than after the caller has already obtained a token.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 100 - 107,
The runtime-only error in connect() for non-'webrtc' connectionMode must be
moved to a synchronous validation or removed by narrowing the option type:
either update GrokRealtimeOptions['connectionMode'] to only allow 'webrtc' and
delete the dead else branch in connect(), or add an immediate check in
grokRealtime(options) that throws (and logs via logger.errors('grok.realtime
fatal', ...)) when options.connectionMode !== 'webrtc' so callers fail fast
before token acquisition; reference the connect() implementation and
grokRealtime() entry point to locate and apply the change.

693-711: Nitpick: inner content shadows the outer content array.

On line 695 content is the array of content parts; on line 705 a new const content = readString(part, 'text') shadows it within the else if branch. Readers tracking the loop have to re-bind the name mentally. Renaming the inner one (e.g. textContent or just text) would avoid the shadow:

♻️ Suggested rename
-              } else if (partType === 'text') {
-                const content = readString(part, 'text')
-                if (content) {
-                  message.parts.push({ type: 'text', content })
-                }
-              }
+              } else if (partType === 'text') {
+                const text = readString(part, 'text')
+                if (text) {
+                  message.parts.push({ type: 'text', content: text })
+                }
+              }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 693 - 711,
The inner const named `content` in the `else if (partType === 'text')` branch
shadows the outer `content` array (from `const content = readObjectArray(item,
'content')`) inside the loop `for (const part of content)`, which is confusing;
rename the inner variable (e.g. to `text` or `textContent`) and update the
subsequent `message.parts.push({ type: 'text', content })` to use the new name
so the outer `content` array (and the `for (const part of content)` loop) is not
shadowed; the changes affect the `readString(part, 'text')` usage and the
`message.parts.push` call in this block.

119-121: Add ICE server configuration to RTCPeerConnection for restrictive networks.

new RTCPeerConnection() with no iceServers relies entirely on host candidates + server-advertised ICE endpoints. On restrictive networks (symmetric NAT, corporate firewalls), this fails silently as a 15s dataChannelReady timeout rather than a clear ICE gathering failure.

Consider either:

  1. Default to a public STUN server (e.g., stun:stun.l.google.com:19302)
  2. Expose an iceServers option on GrokRealtimeOptions so consumers can supply TURN credentials when needed

Both approaches match WebRTC production patterns and align with OpenAI realtime adapter best practices.
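
The two options can be combined. The sketch below only builds the configuration object, so it does not depend on DOM typings; `IceServerSketch` and `resolveIceConfig` are illustrative names, and the `iceServers` option is the proposed addition, not an existing field:

```typescript
// Structural stand-in for the browser's RTCIceServer type.
interface IceServerSketch {
  urls: string | string[]
  username?: string
  credential?: string
}

// Public STUN default mentioned in the review comment.
const DEFAULT_ICE_SERVERS: IceServerSketch[] = [
  { urls: 'stun:stun.l.google.com:19302' },
]

// Prefer caller-supplied servers (e.g. TURN with credentials), otherwise
// fall back to the public STUN default.
function resolveIceConfig(
  options: { iceServers?: IceServerSketch[] } = {},
): { iceServers: IceServerSketch[] } {
  return { iceServers: options.iceServers ?? DEFAULT_ICE_SERVERS }
}

// In a browser this would be passed straight to the constructor:
//   const pc = new RTCPeerConnection(resolveIceConfig(options))
```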

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 119 - 121,
The RTCPeerConnection is created without ICE servers which causes silent
failures on restrictive networks; update the RTCPeerConnection instantiation in
adapter.ts (where pc is declared) to accept an iceServers config by: add an
iceServers option to GrokRealtimeOptions and use it when constructing pc (e.g.,
pass { iceServers: options.iceServers ?? [{ urls: 'stun:stun.l.google.com:19302'
}] }) so consumers can supply TURN credentials while defaulting to a public
STUN; ensure types/interfaces for GrokRealtimeOptions are updated to include
iceServers and that any callers pass through or rely on the default.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/typescript/ai-grok/src/adapters/tts.ts`:
- Around line 138-150: The contentType advertised for mulaw/alaw drops an
explicitly provided sample_rate, causing consumers to assume 8kHz; update the
logic that builds the Content-Type (the code using getContentType and the
sampleRateForContentType variable) so that when codec === 'mulaw' or 'alaw' and
callerSampleRate or modelOptions.sample_rate is set you include a rate=
parameter in the Content-Type (i.e., ensure sampleRateForContentType picks up
callerSampleRate and propagate it into the getContentType output) —
alternatively, if you prefer stricter behavior, validate in the same spot (using
codec, callerSampleRate, modelOptions.sample_rate, and outputFormat) and throw a
clear error when a non-default rate is passed for mulaw/alaw; implement one of
these fixes and apply the same change where similar logic exists (the other
block referenced around lines 205-211).

In `@packages/typescript/ai-grok/src/realtime/token.ts`:
- Around line 58-65: generateToken currently calls
fetch(GROK_REALTIME_CLIENT_SECRETS_URL, ...) without a timeout which can hang;
update generateToken in token.ts to create an AbortSignal with a default timeout
(e.g., 15s) and pass it to fetch via the signal option, exposing the timeout as
an optional parameter on generateToken (or options object) so callers can
override; prefer AbortSignal.timeout(...) when running on Node versions that
support it and fall back to a manual AbortController + setTimeout that calls
controller.abort(), and ensure you handle/propagate abort errors consistently
from generateToken.
- Around line 74-89: The code dereferences sessionData.client_secret.expires_at
and sessionData.model without runtime validation, causing opaque TypeErrors on
malformed responses; update the token parsing in token.ts to validate the
response shape (use a Zod schema for GrokRealtimeSessionResponse or at minimum
check sessionData.client_secret?.value and sessionData.client_secret?.expires_at
exist) before using them, normalize expires_at as before (treat >1e12 as ms),
fall back to the requested model if sessionData.model is missing, and throw a
clear, descriptive Error that includes HTTP context (status/code and body) when
validation fails so callers get an actionable error instead of a TypeError.
- Around line 50-57: The comment is accurate: the requestBody in the token
creation code never wires an expires_after override from the caller; add an
optional expiresAfterSeconds?: number property to GrokRealtimeTokenOptions and,
in the function that builds requestBody (the variable requestBody and the
session: { model } assignment), conditionally add requestBody.expires_after = {
seconds: expiresAfterSeconds } when that option is provided so the generated
body matches the xAI docs and the JSDoc comment.

---

Nitpick comments:
In `@packages/typescript/ai-grok/src/adapters/tts.ts`:
- Around line 144-146: The code currently drops modelOptions.bit_rate when the
resolved codec (variable codec from pickCodec) is not 'mp3' with no signal;
update the tts adapter to detect when modelOptions.bit_rate is set but codec !==
'mp3' and emit a clear logger warning/debug message (use the existing logger
instance) stating that bit_rate is ignored for non-mp3 codecs and showing the
provided bit_rate and resolved codec; alternatively, if you prefer stricter
behavior, validate and throw an error instead—references: codec,
modelOptions.bit_rate, outputFormat, and pickCodec so you can locate the logic
to add the log/validation.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts`:
- Around line 100-107: The runtime-only error in connect() for non-'webrtc'
connectionMode must be moved to a synchronous validation or removed by narrowing
the option type: either update GrokRealtimeOptions['connectionMode'] to only
allow 'webrtc' and delete the dead else branch in connect(), or add an immediate
check in grokRealtime(options) that throws (and logs via
logger.errors('grok.realtime fatal', ...)) when options.connectionMode !==
'webrtc' so callers fail fast before token acquisition; reference the connect()
implementation and grokRealtime() entry point to locate and apply the change.
- Around line 693-711: The inner const named `content` in the `else if (partType
=== 'text')` branch shadows the outer `content` array (from `const content =
readObjectArray(item, 'content')`) inside the loop `for (const part of
content)`, which is confusing; rename the inner variable (e.g. to `text` or
`textContent`) and update the subsequent `message.parts.push({ type: 'text',
content })` to use the new name so the outer `content` array (and the `for
(const part of content)` loop) is not shadowed; the changes affect the
`readString(part, 'text')` usage and the `message.parts.push` call in this
block.
- Around line 119-121: The RTCPeerConnection is created without ICE servers
which causes silent failures on restrictive networks; update the
RTCPeerConnection instantiation in adapter.ts (where pc is declared) to accept
an iceServers config by: add an iceServers option to GrokRealtimeOptions and use
it when constructing pc (e.g., pass { iceServers: options.iceServers ?? [{ urls:
'stun:stun.l.google.com:19302' }] }) so consumers can supply TURN credentials
while defaulting to a public STUN; ensure types/interfaces for
GrokRealtimeOptions are updated to include iceServers and that any callers pass
through or rely on the default.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10e51339-3598-4e76-a5d1-79eee5db4732

📥 Commits

Reviewing files that changed from the base of the PR and between cdff0cd and cef4228.

📒 Files selected for processing (6)
  • packages/typescript/ai-grok/src/adapters/transcription.ts
  • packages/typescript/ai-grok/src/adapters/tts.ts
  • packages/typescript/ai-grok/src/model-meta.ts
  • packages/typescript/ai-grok/src/realtime/adapter.ts
  • packages/typescript/ai-grok/src/realtime/token.ts
  • packages/typescript/ai-grok/tests/realtime-token.test.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/typescript/ai-grok/tests/realtime-token.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/typescript/ai-grok/src/model-meta.ts

Comment thread packages/typescript/ai-grok/src/adapters/tts.ts
Comment on lines +50 to +57
// xAI docs (docs.x.ai/developers/rest-api-reference/inference/voice)
// specify the body as `{ expires_after: { seconds }, session: { model } }`.
// `expires_after` defaults to 600s on the server, so we only set it
// if the caller overrides; `session.model` is required to pin the
// voice agent model for this token.
const requestBody: Record<string, unknown> = {
  session: { model },
}

⚠️ Potential issue | 🟡 Minor

Comment references expires_after override that isn't wired up.

The JSDoc-style comment on lines 50–54 says "we only set it if the caller overrides," but requestBody is unconditionally { session: { model } } with no branch reading options.expires_after (or similar) from GrokRealtimeTokenOptions. Either drop the misleading sentence or wire the override through so callers can actually set it:

♻️ Suggested fix (wire through, if that's the intent)
-        const requestBody: Record<string, unknown> = {
-          session: { model },
-        }
+        const requestBody: Record<string, unknown> = {
+          session: { model },
+        }
+        if (options.expiresAfterSeconds !== undefined) {
+          requestBody.expires_after = { seconds: options.expiresAfterSeconds }
+        }

(Requires adding expiresAfterSeconds?: number to GrokRealtimeTokenOptions.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/token.ts` around lines 50 - 57, The
comment is accurate: the requestBody in the token creation code never wires an
expires_after override from the caller; add an optional expiresAfterSeconds?:
number property to GrokRealtimeTokenOptions and, in the function that builds
requestBody (the variable requestBody and the session: { model } assignment),
conditionally add requestBody.expires_after = { seconds: expiresAfterSeconds }
when that option is provided so the generated body matches the xAI docs and the
JSDoc comment.

Comment thread packages/typescript/ai-grok/src/realtime/token.ts
Comment thread packages/typescript/ai-grok/src/realtime/token.ts Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
packages/typescript/ai-grok/src/realtime/adapter.ts (2)

157-167: A throwing handler aborts remaining subscribers in emit.

If any registered handler throws, the for loop unwinds and subsequent handlers for the same event are never invoked. For events like error / status_change / mode_change, one misbehaving subscriber can silently prevent the others (e.g. internal bookkeeping in RealtimeClient) from running. Wrapping each handler call in a try/catch that routes the error through logger.errors would contain the blast radius.

♻️ Proposed fix
   function emit<TEvent extends RealtimeEvent>(
     event: TEvent,
     payload: Parameters<RealtimeEventHandler<TEvent>>[0],
   ) {
     const handlers = eventHandlers.get(event)
     if (handlers) {
       for (const handler of handlers) {
-        handler(payload)
+        try {
+          handler(payload)
+        } catch (error) {
+          logger.errors('grok.realtime event handler threw', {
+            error,
+            source: 'grok.realtime',
+            event,
+          })
+        }
       }
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 157 - 167,
The emit function currently stops invoking remaining subscribers if one handler
throws; update emit (function emit and its use of eventHandlers) to wrap each
handler(payload) call in a try/catch so a thrown error does not abort the loop,
and in the catch call logger.errors (include context such as event and handler
identity if available) to report the error while continuing to invoke remaining
handlers.

1107-1170: Per-call Uint8Array allocations in the visualization getters.

calculateLevel and the four get*Data getters each allocate a fresh Uint8Array every time they're read. These are typically polled at the visualization render rate (~60 FPS) for the session duration, producing a steady stream of short-lived buffers and GC pressure. Consider hoisting per-analyser reusable Uint8Arrays into the outer closure and just re-filling them on each call.
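
One way the hoisting could look, as a sketch only. `AnalyserLike` is a structural stand-in for the Web Audio `AnalyserNode`, and `createTimeDomainReader` is a hypothetical helper, not code from the adapter:

```typescript
// Just the parts of AnalyserNode this sketch needs.
interface AnalyserLike {
  fftSize: number
  getByteTimeDomainData(target: Uint8Array): void
}

// Returns a reader that re-fills one cached buffer instead of allocating
// a new Uint8Array on every call.
function createTimeDomainReader(
  getAnalyser: () => AnalyserLike | null,
): () => Uint8Array {
  let buffer: Uint8Array | null = null
  return () => {
    const analyser = getAnalyser()
    // No analyser yet: hand back a fresh empty array, never a shared one.
    if (!analyser) return new Uint8Array(0)
    // Allocate (or re-allocate) only when the required size changes.
    if (!buffer || buffer.length !== analyser.fftSize) {
      buffer = new Uint8Array(analyser.fftSize)
    }
    analyser.getByteTimeDomainData(buffer)
    return buffer
  }
}
```

The trade-off is aliasing: every call returns the same array, so a consumer that keeps a reference will see it overwritten on the next read.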

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 1107 -
1170, The visualization getters allocate new Uint8Array each call; hoist
reusable buffers into the getAudioVisualization closure (e.g.,
inputTimeDomainBuffer, outputTimeDomainBuffer, inputFreqBuffer,
outputFreqBuffer) and reuse them in calculateLevel, getInputTimeDomainData,
getOutputTimeDomainData, getInputFrequencyData, and getOutputFrequencyData;
ensure you allocate or resize each buffer when its analyser appears or its
fftSize/frequencyBinCount changes (compare analyser.fftSize / frequencyBinCount
before reuse) and fill the existing buffers with analyser.getByteTimeDomainData
/ getByteFrequencyData instead of creating new Uint8Array instances per call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/typescript/ai-grok/src/realtime/adapter.ts`:
- Around line 154-155: emptyFrequencyData and emptyTimeDomainData are
module-scoped shared Uint8Arrays that get returned directly from
getInputFrequencyData/getOutputFrequencyData/getInputTimeDomainData/getOutputTimeDomainData
and can be mutated by callers; change this by replacing the shared buffers with
factory functions (e.g. makeEmptyFrequencyData and makeEmptyTimeDomainData) that
return a new Uint8Array each call (zero-filled for frequency, 128-filled for
time-domain) and update the four getters (getInputFrequencyData,
getOutputFrequencyData, getInputTimeDomainData, getOutputTimeDomainData) to
return makeEmptyFrequencyData()/makeEmptyTimeDomainData() instead of the shared
arrays so callers receive fresh buffers.
- Around line 719-721: The case handling for 'conversation.item.truncated' only
emits 'interrupted' and leaves currentMode (usually 'speaking') unchanged;
mirror the logic in the interrupt() method by setting currentMode to 'listening'
and emitting a 'mode_change' event so listeners see the transition; update the
'conversation.item.truncated' branch (where emit('interrupted', { messageId:
currentMessageId ?? undefined }) is called) to also set currentMode =
'listening' and emit('mode_change', { mode: 'listening' }) (keeping
currentMessageId handling consistent), ensuring consistency with response.done
guards and avoiding a stale speaking state.

In `@packages/typescript/ai-grok/src/realtime/realtime-contract.ts`:
- Line 15: Update the JSDoc reference that currently points to
realtime-contract.drift-check.ts so it correctly names the actual drift-check
file present in this PR (realtime-contract.drift.test-d.ts); locate the comment
in realtime-contract.ts and replace the stale filename with the correct filename
to avoid dead references.

---

Nitpick comments:
In `@packages/typescript/ai-grok/src/realtime/adapter.ts`:
- Around line 157-167: The emit function currently stops invoking remaining
subscribers if one handler throws; update emit (function emit and its use of
eventHandlers) to wrap each handler(payload) call in a try/catch so a thrown
error does not abort the loop, and in the catch call logger.errors (include
context such as event and handler identity if available) to report the error
while continuing to invoke remaining handlers.
- Around line 1107-1170: The visualization getters allocate new Uint8Array each
call; hoist reusable buffers into the getAudioVisualization closure (e.g.,
inputTimeDomainBuffer, outputTimeDomainBuffer, inputFreqBuffer,
outputFreqBuffer) and reuse them in calculateLevel, getInputTimeDomainData,
getOutputTimeDomainData, getInputFrequencyData, and getOutputFrequencyData;
ensure you allocate or resize each buffer when its analyser appears or its
fftSize/frequencyBinCount changes (compare analyser.fftSize / frequencyBinCount
before reuse) and fill the existing buffers with analyser.getByteTimeDomainData
/ getByteFrequencyData instead of creating new Uint8Array instances per call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 81f896d1-85ab-4c4d-844b-3205706da0df

📥 Commits

Reviewing files that changed from the base of the PR and between cef4228 and 1dc19e1.

📒 Files selected for processing (4)
  • packages/typescript/ai-grok/package.json
  • packages/typescript/ai-grok/src/realtime/adapter.ts
  • packages/typescript/ai-grok/src/realtime/realtime-contract.ts
  • packages/typescript/ai-grok/tests/realtime-contract.drift.test-d.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/typescript/ai-grok/package.json

Comment thread packages/typescript/ai-grok/src/realtime/adapter.ts Outdated
Comment thread packages/typescript/ai-grok/src/realtime/adapter.ts
Comment thread packages/typescript/ai-grok/src/realtime/realtime-contract.ts Outdated
- tts.ts / transcription.ts: spread `defaultHeaders` BEFORE Authorization /
  Content-Type so a caller-supplied header can't silently clobber the bearer
  token or auth content-type.
- utils/audio.ts: new `arrayBufferToBase64` helper — Buffer fast path on
  Node, chunked btoa fallback everywhere else (browser, Workers, Bun). Replaces
  the Node-only `Buffer.from(arrayBuffer).toString('base64')` in tts.ts.
- transcription.ts: new `GrokTranscriptionWord` interface extends the core
  `TranscriptionWord` with optional `confidence` and `speaker`. The adapter
  now preserves both fields when xAI returns them, so callers that narrow via
  `as Array<GrokTranscriptionWord>` get the diarization output they asked
  for. Test expectations updated.
- tts.ts: mulaw/alaw `contentType` now includes a `;rate=…` parameter (as
  `audio/PCMU` / `audio/PCMA` per RFC 3551) when the caller requests a
  non-default sample rate, instead of the 8 kHz-implying `audio/basic` /
  `audio/x-alaw-basic`.
- realtime/adapter.ts: `conversation.item.truncated` flips mode back to
  `listening` so the visualiser can't get stuck on `speaking` after an
  interrupt. `sendEvent` wraps `dataChannel.send` in try/catch consistent
  with `flushPendingEvents`. The shared `emptyFrequencyData` /
  `emptyTimeDomainData` buffers are gone — `getAudioVisualization`
  returns a fresh `Uint8Array` per call so consumers can't mutate a
  module-level instance.
- realtime/token.ts: adds a 15s `AbortController` timeout on the
  client_secrets request so a dead endpoint can't hang the caller forever.
  Validates `client_secret.value` / `expires_at` shape at runtime before
  dereferencing so a malformed response throws a descriptive error.
- realtime/realtime-contract.ts: JSDoc filename ref updated.
- examples/ts-react-chat audio/speech/transcribe routes: unify the 400
  unknown_provider payload under the `provider` key (was `providerId`)
  to match the invalid_model_override branch and the request body.
@AlemTuzlak

CodeRabbit feedback addressed (92695f6)

Went through every CodeRabbit inline comment on the PR. Most had already been fixed earlier in the cr-loop; the remaining real items are in this commit.

Fixes landed in this commit

Auth / header precedence (tts.ts, transcription.ts)

  • defaultHeaders now spreads BEFORE Authorization / Content-Type, so a caller-supplied header with the same key can no longer silently clobber the bearer token.
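
The precedence rule comes down to spread order in an object literal: later keys win. A minimal sketch, where `buildHeadersSketch` is an illustrative name and not the adapter's actual helper:

```typescript
// Caller-supplied headers are spread first, so the adapter's own
// Authorization and Content-Type entries overwrite any same-named keys.
function buildHeadersSketch(
  apiKey: string,
  defaultHeaders: Record<string, string> = {},
): Record<string, string> {
  return {
    ...defaultHeaders,
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  }
}
```

Object keys are case-sensitive, so this sketch only protects against a key with exactly the same casing.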

Cross-runtime base64 (utils/audio.ts, tts.ts)

  • New arrayBufferToBase64 helper: Node Buffer fast path + chunked btoa fallback for browser / Cloudflare Workers / Bun / Deno. Replaces the Node-only Buffer.from(arrayBuffer).toString('base64') in generateSpeech.
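
A sketch of what such a helper might look like, assuming only that `Buffer` or `btoa` exists on `globalThis`. The chunk size and the name `arrayBufferToBase64Sketch` are illustrative, and the real helper in `utils/audio.ts` may differ:

```typescript
// Describe only the globals we probe for, so the sketch compiles with or
// without Node or DOM typings.
type RuntimeGlobals = {
  Buffer?: { from(data: ArrayBuffer): { toString(encoding: 'base64'): string } }
  btoa?: (binary: string) => string
}

function arrayBufferToBase64Sketch(data: ArrayBuffer): string {
  const g = globalThis as unknown as RuntimeGlobals
  // Fast path where Buffer is available (Node).
  if (g.Buffer) return g.Buffer.from(data).toString('base64')
  if (!g.btoa) throw new Error('No base64 encoder available in this runtime')

  // Fallback: build the binary string in chunks so a large input does not
  // exceed the argument limit of String.fromCharCode.
  const bytes = new Uint8Array(data)
  const CHUNK = 0x8000
  let binary = ''
  for (let i = 0; i < bytes.length; i += CHUNK) {
    const chunk = Array.from(bytes.subarray(i, i + CHUNK))
    binary += String.fromCharCode.apply(null, chunk)
  }
  return g.btoa(binary)
}
```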

Diarization / confidence data preservation (transcription.ts)

  • New exported GrokTranscriptionWord interface extends the cross-provider TranscriptionWord with optional confidence and speaker. The adapter now preserves both fields on the returned words; callers who know they're using Grok narrow via as Array<GrokTranscriptionWord>. No more silent data drop for diarize: true.

μ-law / A-law contentType (tts.ts)

  • getContentType now emits audio/PCMU;rate=… / audio/PCMA;rate=… (RFC 3551 RTP payload names) when the caller requests non-8kHz mulaw/alaw. Falls back to audio/basic / audio/x-alaw-basic at 8 kHz so the existing RFC 2046 convention still applies where valid.
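
The mapping described above, reduced to a standalone sketch. `g711ContentType` is a hypothetical name, and it takes the already-resolved sample rate as input rather than reproducing how `getContentType` picks a default:

```typescript
type G711Codec = 'mulaw' | 'alaw'

// At 8 kHz the legacy labels apply; any other rate uses the RTP payload
// names with an explicit rate parameter.
function g711ContentType(codec: G711Codec, sampleRate: number): string {
  if (sampleRate === 8000) {
    return codec === 'mulaw' ? 'audio/basic' : 'audio/x-alaw-basic'
  }
  const subtype = codec === 'mulaw' ? 'PCMU' : 'PCMA'
  return `audio/${subtype};rate=${sampleRate}`
}
```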

Realtime interruption state (adapter.ts)

  • conversation.item.truncated now flips currentMode back to 'listening' (unless the user already hit stopAudioCapture()). Previously the visualiser could stay stuck on 'speaking' after an interrupt.

sendEvent error resilience (adapter.ts)

  • Wraps dataChannel.send(JSON.stringify(event)) in try/catch consistent with flushPendingEvents. A synchronous throw (channel flipped to 'closing' between readyState check and send, or JSON.stringify choking) now logs + emits error instead of propagating out of sendText / sendImage / updateSession.
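
The pattern, sketched against a structural stand-in for `RTCDataChannel`. `safeSend` and `ChannelLike` are illustrative names, and the real `sendEvent` also handles buffering, which is omitted here:

```typescript
// Only the parts of RTCDataChannel this sketch touches.
interface ChannelLike {
  readyState: string
  send(data: string): void
}

// Returns true when the event was handed to the channel. Serialization or
// send failures are reported through onError instead of being thrown.
function safeSend(
  channel: ChannelLike,
  event: unknown,
  onError: (error: Error) => void,
): boolean {
  if (channel.readyState !== 'open') return false
  try {
    channel.send(JSON.stringify(event))
    return true
  } catch (err) {
    onError(err instanceof Error ? err : new Error(String(err)))
    return false
  }
}
```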

Audio-visualization buffer aliasing (adapter.ts)

  • Removed the shared module-level emptyFrequencyData / emptyTimeDomainData Uint8Arrays. getAudioVisualization() now returns a fresh Uint8Array on each call, so a consumer drawing into the returned buffer can't mutate a shared instance that every other consumer sees.
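
A sketch of the factory approach suggested in the review, with zero-filled frequency data and 128-filled (silent midpoint) time-domain data. The function names follow the review's suggestion; taking `length` as a parameter is an assumption about how the adapter sizes these buffers:

```typescript
// Each call allocates a new array, so no two consumers share an instance.
function makeEmptyFrequencyData(length: number): Uint8Array {
  return new Uint8Array(length) // zero-filled by default
}

function makeEmptyTimeDomainData(length: number): Uint8Array {
  return new Uint8Array(length).fill(128) // 128 is the silent midpoint for byte data
}
```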

Realtime token robustness (token.ts)

  • 15s AbortController timeout on the /v1/realtime/client_secrets fetch so a dead endpoint can't hang the caller forever.
  • Runtime shape validation on sessionData.client_secret.value / .expires_at before dereferencing; malformed responses throw a descriptive error instead of a cryptic TypeError.
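
The validation step might look roughly like the sketch below. `parseClientSecretSketch` is a hypothetical name, and normalizing `expires_at` to milliseconds is an assumption; the only source-backed detail is that values above 1e12 are treated as already being in milliseconds:

```typescript
interface ParsedClientSecretSketch {
  token: string
  expiresAtMs: number
}

// Check the response shape before dereferencing, and fail with HTTP
// context instead of a bare TypeError.
function parseClientSecretSketch(
  body: unknown,
  httpStatus: number,
): ParsedClientSecretSketch {
  const fail = (): never => {
    throw new Error(
      `Malformed client_secrets response (HTTP ${httpStatus}): ${JSON.stringify(body)}`,
    )
  }
  if (typeof body !== 'object' || body === null) return fail()
  const secret = (body as Record<string, unknown>).client_secret
  if (typeof secret !== 'object' || secret === null) return fail()
  const { value, expires_at } = secret as Record<string, unknown>
  if (typeof value !== 'string' || typeof expires_at !== 'number') return fail()

  // Values above 1e12 are taken to be milliseconds; smaller values are
  // assumed to be seconds and scaled up.
  const expiresAtMs = expires_at > 1e12 ? expires_at : expires_at * 1000
  return { token: value, expiresAtMs }
}
```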

Error payload consistency (examples/ts-react-chat audio/speech/transcribe routes)

  • Unified the 400 unknown_provider branch to use provider: instead of providerId:, matching both the sibling invalid_model_override branch and the request body's provider field.

Doc fix (realtime-contract.ts)

  • JSDoc now points at the correct filename for the drift check (tests/realtime-contract.drift.test-d.ts).

Items already resolved earlier in the cr-loop

  • Data-channel [object Event] error stringification
  • SDP-path resource leak
  • sendEvent buffering into pendingEvents forever after disconnect
  • input_audio_transcription unconditional overwrite
  • dataChannelReady hang without timeout
  • GrokRealtimeModel duplication between types.ts and model-meta.ts

All CodeRabbit inline items are now either fixed or part of an earlier commit on this branch. Typecheck clean, 53/53 tests pass, publint --strict passes.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/typescript/ai-grok/src/adapters/tts.ts (1)

134-154: Slightly misleading comment around pcmDefault — consider tightening.

The comment at lines 135–137 says pcmDefault is "used only when we MUST attach a rate to the contentType (pcm)", but it's also used as the fallback for sampleRateForContentType on non-pcm codecs (line 154), where it happens to coincidentally match xAI's documented default of 24000 Hz for all codecs. That coincidence is what makes the mulaw/alaw audio/PCMU;rate=24000 labeling correct when the caller doesn't pin a rate — worth making explicit so the invariant doesn't silently break if xAI ever changes the default.

♻️ Suggested comment rewording
-  // Default sample rate documented in GrokTTSProviderOptions is 24000 Hz —
-  // used only when we MUST attach a rate to the contentType (pcm) and the
-  // caller didn't pick one.
+  // xAI's documented default sample rate across codecs is 24000 Hz. We use
+  // it in two places:
+  //   1. `output_format.sample_rate` for pcm (where the Content-Type carries
+  //      the rate via `audio/L16;rate=…`, so the label MUST match the bytes).
+  //   2. `sampleRateForContentType` for non-pcm codecs (mulaw/alaw) when the
+  //      caller didn't pin a rate — relies on the server actually producing
+  //      24000 Hz output when no rate is sent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/adapters/tts.ts` around lines 134 - 154,
Update the comment around the pcmDefault constant to state that pcmDefault
(24000) is not only used when embedding a rate for pcm but also serves as the
fallback used by sampleRateForContentType for constructing content-type labels
for other codecs (e.g., mulaw/alaw), and explicitly note this relies on xAI's
documented default of 24000 Hz so the invariant must be revisited if xAI changes
defaults; refer to pcmDefault, sampleRateForContentType, callerSampleRate, codec
and modelOptions when adding the clarification.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/typescript/ai-grok/src/realtime/adapter.ts`:
- Around line 1037-1114: In updateSession, remove the undocumented default
input_audio_transcription payload (the sessionUpdate.input_audio_transcription =
{ model: 'grok-stt' } branch) so we don't send an unsupported parameter to xAI;
preserve the caller override behavior that copies an explicit
providerOptions.inputAudioTranscription / input_audio_transcription value
(including null/false) into sessionUpdate.input_audio_transcription when
present, but if the caller did not provide it, do not inject any default (or set
it explicitly to null if you prefer explicit disablement); update references in
the function (providerOptions, callerTranscription, hasSentInitialSessionUpdate,
sendEvent) accordingly to only send sessionUpdate.input_audio_transcription when
callerTranscription is defined.

---

Nitpick comments:
In `@packages/typescript/ai-grok/src/adapters/tts.ts`:
- Around line 134-154: Update the comment around the pcmDefault constant to
state that pcmDefault (24000) is not only used when embedding a rate for pcm but
also serves as the fallback used by sampleRateForContentType for constructing
content-type labels for other codecs (e.g., mulaw/alaw), and explicitly note
this relies on xAI's documented default of 24000 Hz so the invariant must be
revisited if xAI changes defaults; refer to pcmDefault,
sampleRateForContentType, callerSampleRate, codec and modelOptions when adding
the clarification.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 500e52a3-14ab-4bc3-a4c1-98cc1c845c0d

📥 Commits

Reviewing files that changed from the base of the PR and between 1dc19e1 and 268ad6a.

📒 Files selected for processing (11)
  • examples/ts-react-chat/src/routes/api.generate.audio.ts
  • examples/ts-react-chat/src/routes/api.generate.speech.ts
  • examples/ts-react-chat/src/routes/api.transcribe.ts
  • packages/typescript/ai-grok/src/adapters/transcription.ts
  • packages/typescript/ai-grok/src/adapters/tts.ts
  • packages/typescript/ai-grok/src/realtime/adapter.ts
  • packages/typescript/ai-grok/src/realtime/realtime-contract.ts
  • packages/typescript/ai-grok/src/realtime/token.ts
  • packages/typescript/ai-grok/src/utils/audio.ts
  • packages/typescript/ai-grok/src/utils/index.ts
  • packages/typescript/ai-grok/tests/audio-adapters.test.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/typescript/ai-grok/src/realtime/realtime-contract.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/typescript/ai-grok/src/utils/index.ts
  • examples/ts-react-chat/src/routes/api.generate.audio.ts

Comment on lines +1037 to +1114
updateSession(config: Partial<RealtimeSessionConfig>) {
  const sessionUpdate: Record<string, unknown> = {}

  if (config.instructions) {
    sessionUpdate.instructions = config.instructions
  }

  if (config.voice) {
    sessionUpdate.voice = config.voice
  }

  if (config.vadMode) {
    if (config.vadMode === 'semantic') {
      sessionUpdate.turn_detection = {
        type: 'semantic_vad',
        eagerness: config.semanticEagerness ?? 'medium',
      }
    } else if (config.vadMode === 'server') {
      sessionUpdate.turn_detection = {
        type: 'server_vad',
        threshold: config.vadConfig?.threshold ?? 0.5,
        prefix_padding_ms: config.vadConfig?.prefixPaddingMs ?? 300,
        silence_duration_ms: config.vadConfig?.silenceDurationMs ?? 500,
      }
    } else {
      sessionUpdate.turn_detection = null
    }
  }

  if (config.tools !== undefined) {
    sessionUpdate.tools = config.tools.map((t) => ({
      type: 'function',
      name: t.name,
      description: t.description,
      parameters: t.inputSchema ?? { type: 'object', properties: {} },
    }))
    sessionUpdate.tool_choice = 'auto'
  }

  if (config.outputModalities) {
    sessionUpdate.modalities = config.outputModalities
  }

  if (config.temperature !== undefined) {
    sessionUpdate.temperature = config.temperature
  }

  if (config.maxOutputTokens !== undefined) {
    sessionUpdate.max_response_output_tokens = config.maxOutputTokens
  }

  // Let callers forward an explicit `input_audio_transcription` value
  // through `providerOptions` — including `null` / `false` to disable
  // the feature. Only apply our `grok-stt` default on the first
  // session.update and only if the caller hasn't set it themselves.
  const providerOptions: Record<string, unknown> =
    config.providerOptions ?? {}
  const callerTranscription =
    'inputAudioTranscription' in providerOptions
      ? providerOptions.inputAudioTranscription
      : 'input_audio_transcription' in providerOptions
        ? providerOptions.input_audio_transcription
        : undefined
  if (callerTranscription !== undefined) {
    sessionUpdate.input_audio_transcription =
      callerTranscription === false ? null : callerTranscription
  } else if (!hasSentInitialSessionUpdate) {
    sessionUpdate.input_audio_transcription = { model: 'grok-stt' }
  }

  if (Object.keys(sessionUpdate).length > 0) {
    sendEvent({
      type: 'session.update',
      session: sessionUpdate,
    })
    hasSentInitialSessionUpdate = true
  }
},

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

xAI realtime session update input_audio_transcription model names grok-stt documentation

💡 Result:

In xAI's realtime Voice Agent API (wss://api.x.ai/v1/realtime), input audio transcription is provided automatically via the server event conversation.item.input_audio_transcription.completed, which delivers the transcript of user input audio. Official xAI documentation does not explicitly document a configurable input_audio_transcription parameter (like OpenAI's realtime API) in the session.update event for specifying STT model names. The listed session parameters include instructions, voice, tools, turn_detection, and audio formats, but no STT model field.

The Voice Agent models are grok-voice-fast-1.0 (default, deprecated) and grok-voice-think-fast-1.0 (recommended), specified in the WebSocket URL query (e.g., ?model=grok-voice-think-fast-1.0).

xAI offers a separate standalone Grok STT API (https://api.x.ai/v1/stt for batch, wss://api.x.ai/v1/stt for streaming). Third-party sources suggest using model=grok-stt for this API, but official docs do not list a model parameter for STT endpoints; instead, configuration uses query params like language, format, diarize. No explicit "grok-stt" model name appears in official Voice Agent or STT docs.

LiveKit's xAI plugin sets a default input_audio_transcription but omits the model, implying built-in/default STT for the realtime session without user-configurable model selection.

Remove or reconsider the input_audio_transcription session parameter — it is not documented in xAI's realtime API.

xAI's official realtime session.update documentation does not include an input_audio_transcription parameter. The realtime Voice Agent API provides transcript data automatically through server events (conversation.item.input_audio_transcription.completed). Sending { model: 'grok-stt' } as an undocumented parameter may be rejected by xAI's API, causing first-session connections to fail with a server error frame.

The caller-override mechanism (explicit null/false, object, or default) is sound in design, but the default value should either be removed entirely or set to null to avoid unsupported parameters.
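
One way to read this suggestion is sketched below, under stated assumptions. `applyCallerTranscription` is a hypothetical helper written for illustration, not a function in `@tanstack/ai-grok`; the two option keys (`inputAudioTranscription`, `input_audio_transcription`) are taken from the snippet under review. The sketch keeps the caller override and simply stops injecting a default when the caller is silent.

```typescript
// Hypothetical helper (illustration only, not part of @tanstack/ai-grok).
// Copies a caller-supplied transcription setting into the session update,
// and leaves the update untouched when the caller did not provide one.
function applyCallerTranscription(
  sessionUpdate: Record<string, unknown>,
  providerOptions: Record<string, unknown> = {},
): Record<string, unknown> {
  // Accept either the camelCase or the snake_case key, as in the reviewed code.
  const callerTranscription =
    'inputAudioTranscription' in providerOptions
      ? providerOptions.inputAudioTranscription
      : 'input_audio_transcription' in providerOptions
        ? providerOptions.input_audio_transcription
        : undefined

  // No caller value: send nothing, so no undocumented default reaches the API.
  if (callerTranscription === undefined) {
    return sessionUpdate
  }

  // `false` is treated as an explicit "disable" and normalised to `null`.
  return {
    ...sessionUpdate,
    input_audio_transcription:
      callerTranscription === false ? null : callerTranscription,
  }
}

// Usage: the first update carries no transcription key, the second disables it.
const base = { voice: 'eve' }
console.log(applyCallerTranscription(base, {}))
console.log(applyCallerTranscription(base, { inputAudioTranscription: false }))
```

With this shape the `hasSentInitialSessionUpdate` flag is no longer needed for transcription, although the reviewed function may still use it for other purposes.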

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/typescript/ai-grok/src/realtime/adapter.ts` around lines 1037 -
1114, In updateSession, remove the undocumented default
input_audio_transcription payload (the sessionUpdate.input_audio_transcription =
{ model: 'grok-stt' } branch) so we don't send an unsupported parameter to xAI;
preserve the caller override behavior that copies an explicit
providerOptions.inputAudioTranscription / input_audio_transcription value
(including null/false) into sessionUpdate.input_audio_transcription when
present, but if the caller did not provide it, do not inject any default (or set
it explicitly to null if you prefer explicit disablement); update references in
the function (providerOptions, callerTranscription, hasSentInitialSessionUpdate,
sendEvent) accordingly to only send sessionUpdate.input_audio_transcription when
callerTranscription is defined.

@AlemTuzlak AlemTuzlak merged commit 2e4c942 into TanStack:main Apr 24, 2026
8 checks passed
@github-actions github-actions Bot mentioned this pull request Apr 24, 2026

2 participants