Small FastAPI service that accepts a frontend WebSocket connection, streams audio to ElevenLabs realtime STT, forwards committed transcripts to your existing backend /v1/chat, then synthesizes the assistant reply with ElevenLabs TTS and streams it back to the frontend.
- Accepts a browser or mobile WebSocket connection at `/v1/voice/ws`
- Lets the frontend pass VAD settings per session
- Uses ElevenLabs realtime speech-to-text for transcription
- Calls your backend `/v1/register` and `/v1/chat`
- Uses ElevenLabs text-to-speech for assistant playback
- Returns structured WS events for partial transcripts, committed transcripts, assistant text, and assistant audio
- Create a virtual environment and install dependencies:

  ```shell
  python3 -m venv .venv
  ./.venv/bin/pip install -r requirements.txt
  ```

- Copy `.env.example` to `.env` and fill in:

  ```shell
  cp .env.example .env
  ```

At minimum set:

- `ELEVENLABS_API_KEY`
- `ELEVENLABS_VOICE_ID`
- `LLM_API_KEY` if you want this service to auto-register
- `BACKEND_BASE_URL` should be the backend service root, for example `https://your-backend.example.com`, not a full `/v1/register` or `/v1/chat` URL
- `BACKEND_API_KEY` should be set on the service for production deployments
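Putting those variables together, a minimal `.env` might look like the sketch below. All values are placeholders, and only the variables named above are assumed to exist:

```shell
# ElevenLabs credentials (required)
ELEVENLABS_API_KEY=sk_your_elevenlabs_key
ELEVENLABS_VOICE_ID=your_voice_id

# Backend service root -- no /v1/register or /v1/chat suffix
BACKEND_BASE_URL=https://your-backend.example.com
BACKEND_API_KEY=your_backend_key

# Either let the service auto-register against the backend...
LLM_API_KEY=your_llm_key
# ...or reuse an existing registration and skip the LLM values
# DEFAULT_REGISTRATION_ID=abc123-def456

# Lock backend credentials in production
ALLOW_FRONTEND_BACKEND_AUTH=false
```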
If you already have a stable backend `registration_id`, set `DEFAULT_REGISTRATION_ID` and you can skip the LLM registration env values.
For production, `ALLOW_FRONTEND_BACKEND_AUTH=false` is recommended so browser clients cannot override the deployed backend credentials.
```shell
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Open the browser test console at:
http://localhost:8000/test
Frontend handoff document:
`FRONTEND_HANDOFF.md`
Connect to:
ws://localhost:8000/v1/voice/ws
Send this first:
```json
{
  "type": "session.configure",
  "session_id": "voice-test-1",
  "backend": {
    "registration_id": "abc123-def456"
  },
  "stt": {
    "model_id": "scribe_v2_realtime",
    "sample_rate": 16000,
    "audio_format": "pcm_16000",
    "language_code": "en",
    "commit_strategy": "vad",
    "vad_threshold": 0.62,
    "vad_silence_threshold_secs": 1.4,
    "min_speech_duration_ms": 520,
    "min_silence_duration_ms": 800,
    "include_timestamps": false
  },
  "tts": {
    "voice_id": "YOUR_ELEVENLABS_VOICE_ID",
    "model_id": "eleven_flash_v2_5",
    "output_format": "pcm_24000",
    "voice_settings": {
      "stability": 0.4,
      "similarity_boost": 0.8,
      "speed": 1.0
    }
  }
}
```

You can also omit `backend.registration_id` and instead pass:
```json
{
  "backend": {
    "provider": "custom",
    "api_key": "YOUR_LLM_API_KEY",
    "model": "llama-3.3-70b-versatile",
    "base_url": "https://api.groq.com/openai",
    "system_prompt": "You are a helpful assistant."
  }
}
```

The service will call your backend `/v1/register` and cache the returned registration.
If your backend does not expose `/v1/register`, provide an existing `registration_id` instead.
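As a sketch, the configure handshake can be driven from Python. The payload builder below uses only fields documented above; the `run_handshake` helper is a hypothetical usage that assumes the third-party `websockets` package and a locally running server:

```python
import json


def build_session_configure(registration_id: str,
                            session_id: str = "voice-test-1",
                            voice_id: str = "YOUR_ELEVENLABS_VOICE_ID") -> str:
    """Build the first frame the frontend must send after connecting."""
    return json.dumps({
        "type": "session.configure",
        "session_id": session_id,
        "backend": {"registration_id": registration_id},
        "stt": {
            "model_id": "scribe_v2_realtime",
            "sample_rate": 16000,
            "audio_format": "pcm_16000",
            "commit_strategy": "vad",
        },
        "tts": {
            "voice_id": voice_id,
            "model_id": "eleven_flash_v2_5",
            "output_format": "pcm_24000",
        },
    })


async def run_handshake(registration_id: str) -> dict:
    """Connect, configure, and wait for the session.configured reply.

    Requires `pip install websockets` and a running server; imported
    lazily so the builder above stays dependency-free.
    """
    import websockets
    async with websockets.connect("ws://localhost:8000/v1/voice/ws") as ws:
        await ws.recv()  # session.ready greeting
        await ws.send(build_session_configure(registration_id))
        return json.loads(await ws.recv())  # expect session.configured
```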
Send JSON frames with base64 PCM audio:
```json
{
  "type": "audio.append",
  "audio": "BASE64_PCM_CHUNK",
  "sample_rate": 16000
}
```

You can also send binary WebSocket frames directly after configuration. Binary frames are treated as raw PCM audio and forwarded using the configured sample rate.
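The audio path can be sketched as a small helper that slices 16-bit mono PCM into roughly 100 ms chunks and wraps each one in an `audio.append` frame. The chunk size follows the 0.1 to 1.0 second guidance below; the frame fields match the example above and nothing else is assumed:

```python
import base64
import json

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 100         # ~0.1 s per frame, at the low end of the recommended range
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3200 bytes


def pcm_to_frames(pcm: bytes):
    """Yield audio.append JSON frames for a raw little-endian PCM buffer."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        yield json.dumps({
            "type": "audio.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": SAMPLE_RATE,
        })
```

Each yielded string can be sent as a text WebSocket frame; alternatively, the raw `chunk` bytes can be sent as binary frames once the session is configured.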
If you use manual commit instead of VAD, send:
```json
{
  "type": "audio.commit"
}
```

To send a text-only prompt to the backend chat (no audio required), send:

```json
{
  "type": "text.prompt",
  "text": "Summarize what I just said."
}
```

Examples:
```json
{ "type": "session.ready", "message": "Send session.configure to begin." }
{ "type": "session.configured", "session_id": "voice-test-1" }
{ "type": "stt.session_started", "session_id": "elevenlabs-session-id", "config": { "...": "..." } }
{ "type": "transcript.partial", "text": "hello wor" }
{ "type": "transcript.committed", "text": "hello world" }
{ "type": "assistant.started", "text": "hello world" }
{ "type": "assistant.message", "text": "Hi there!", "metadata": { "...": "..." } }
{ "type": "assistant.audio.start", "turn_id": 1, "content_type": "application/octet-stream", "output_format": "pcm_24000" }
{ "type": "assistant.audio.chunk", "turn_id": 1, "audio": "BASE64_AUDIO_CHUNK", "content_type": "application/octet-stream", "output_format": "pcm_24000", "chunk_index": 1 }
{ "type": "assistant.audio.end", "turn_id": 1, "content_type": "application/octet-stream", "output_format": "pcm_24000", "chunk_count": 42 }
{ "type": "assistant.audio", "audio": "BASE64_AUDIO", "content_type": "audio/mpeg", "output_format": "mp3_44100_128" }
{ "type": "error", "message": "Readable error for the frontend" }
```

- Prefer 16 kHz mono PCM for the simplest STT path.
- Send chunks around 0.1 to 1.0 seconds long for smoother streaming and lower latency.
- `commit_strategy: "vad"` is the default recommended path for the browser test client. The backend now owns turn boundaries by default.
- Use `commit_strategy: "manual"` only for debugging or custom clients that want to force commits explicitly.
- Only send `previous_text` with the first audio chunk after a new segment starts.
- For low-latency browser playback, prefer `pcm_24000` so the client can start speaking from streamed chunks immediately.
- Assistant audio may arrive as streamed `assistant.audio.start` / `assistant.audio.chunk` / `assistant.audio.end` events, or as a fallback single `assistant.audio` blob.
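One way a client could consume these events (a sketch, not part of the service) is to buffer `assistant.audio.chunk` frames keyed by `turn_id` and emit the concatenated PCM on `assistant.audio.end`, handling the fallback single `assistant.audio` blob the same way:

```python
import base64


class AssistantAudioAssembler:
    """Collects streamed assistant audio chunks into complete per-turn buffers."""

    def __init__(self):
        self._buffers = {}  # turn_id -> list of decoded PCM chunks

    def handle(self, event: dict):
        """Feed one decoded WS event; return finished audio bytes or None."""
        etype = event.get("type")
        if etype == "assistant.audio.start":
            self._buffers[event["turn_id"]] = []
        elif etype == "assistant.audio.chunk":
            self._buffers.setdefault(event["turn_id"], []).append(
                base64.b64decode(event["audio"]))
        elif etype == "assistant.audio.end":
            return b"".join(self._buffers.pop(event["turn_id"], []))
        elif etype == "assistant.audio":
            # Fallback: the whole reply arrives as a single blob.
            return base64.b64decode(event["audio"])
        return None
```

For lowest perceived latency a real client would play each decoded chunk as it arrives rather than waiting for `assistant.audio.end`; the assembler shape is mainly useful for testing and for the fallback blob.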
The service includes a built-in browser console at `/test`.

Recommended test order:

- Start the API server.
- Open http://localhost:8000/test.
- Paste a `registration_id` into the page if you already have one.
- Click `Connect`.
- Click `Send session.configure`.
- Click `Send text.prompt` first to verify backend chat plus TTS.
- Click `Start Mic` to test realtime STT and voice playback.

If you need to force a commit while testing, click `Send audio.commit`.
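To sanity-check assistant audio outside the browser, raw `pcm_24000` bytes (for example, the output of an assembler like the one sketched above) can be wrapped in a WAV container with the stdlib `wave` module. This assumes 16-bit signed mono PCM, which is what the ElevenLabs PCM output formats produce:

```python
import wave


def save_pcm_as_wav(pcm: bytes, path: str, sample_rate: int = 24000) -> None:
    """Write raw 16-bit mono PCM to a playable WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)         # mono
        wav.setsampwidth(2)         # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```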