Small FastAPI service that accepts a frontend WebSocket connection, streams audio to ElevenLabs realtime STT, forwards committed transcripts to your existing backend /v1/chat, then synthesizes the assistant reply with ElevenLabs TTS and streams it back to the frontend.
- Accepts a browser or mobile WebSocket connection at `/v1/voice/ws`
- Lets the frontend pass VAD settings per session
- Uses ElevenLabs realtime speech-to-text for transcription
- Calls your backend `/v1/register` and `/v1/chat`
- Uses ElevenLabs text-to-speech for assistant playback
- Returns structured WS events for partial transcripts, committed transcripts, assistant text, and assistant audio
- Create a virtual environment and install dependencies:

  ```shell
  python3 -m venv .venv
  ./.venv/bin/pip install -r requirements.txt
  ```

- Copy `.env.example` to `.env` and fill in:

  ```shell
  cp .env.example .env
  ```

At minimum set:

- `ELEVENLABS_API_KEY`
- `ELEVENLABS_VOICE_ID`
- `LLM_API_KEY` if you want this service to auto-register
- `BACKEND_BASE_URL` should be the backend service root, for example `https://your-backend.example.com`, not a full `/v1/register` or `/v1/chat` URL
- `BACKEND_API_KEY` should be set on the service for production deployments
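Putting those variables together, a minimal `.env` might look like the sketch below. All values are placeholders, and only the variables named above are assumed to exist:

```shell
# ElevenLabs credentials (required)
ELEVENLABS_API_KEY=sk_your_elevenlabs_key
ELEVENLABS_VOICE_ID=your_voice_id

# Backend service root -- no /v1/register or /v1/chat suffix
BACKEND_BASE_URL=https://your-backend.example.com
BACKEND_API_KEY=your_backend_key

# Either let the service auto-register against the backend...
LLM_API_KEY=your_llm_key
# ...or reuse an existing registration and skip the LLM values
# DEFAULT_REGISTRATION_ID=abc123-def456

# Lock backend credentials in production
ALLOW_FRONTEND_BACKEND_AUTH=false
```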
If you already have a stable backend `registration_id`, set `DEFAULT_REGISTRATION_ID` and you can skip the LLM registration env values.
For production, `ALLOW_FRONTEND_BACKEND_AUTH=false` is recommended so browser clients cannot override the deployed backend credentials.
```shell
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Open the browser test console at:
http://localhost:8000/test
Frontend handoff document:
`FRONTEND_HANDOFF.md`
Connect to:
ws://localhost:8000/v1/voice/ws
Send this first:
```json
{
  "type": "session.configure",
  "session_id": "voice-test-1",
  "backend": {
    "registration_id": "abc123-def456"
  },
  "stt": {
    "model_id": "scribe_v2_realtime",
    "sample_rate": 16000,
    "audio_format": "pcm_16000",
    "language_code": "en",
    "commit_strategy": "vad",
    "vad_threshold": 0.62,
    "vad_silence_threshold_secs": 1.4,
    "min_speech_duration_ms": 520,
    "min_silence_duration_ms": 800,
    "include_timestamps": false
  },
  "tts": {
    "voice_id": "YOUR_ELEVENLABS_VOICE_ID",
    "model_id": "eleven_flash_v2_5",
    "output_format": "pcm_24000",
    "voice_settings": {
      "stability": 0.4,
      "similarity_boost": 0.8,
      "speed": 1.0
    }
  }
}
```

You can also omit `backend.registration_id` and instead pass:
```json
{
  "backend": {
    "provider": "custom",
    "api_key": "YOUR_LLM_API_KEY",
    "model": "llama-3.3-70b-versatile",
    "base_url": "https://api.groq.com/openai",
    "system_prompt": "You are a helpful assistant."
  }
}
```

The service will call your backend `/v1/register` and cache the returned registration.
If your backend does not expose `/v1/register`, provide an existing `registration_id` instead.
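As a sketch, the configure handshake can be driven from Python. The payload builder below uses only fields documented above; the `run_handshake` helper is a hypothetical usage that assumes the third-party `websockets` package and a locally running server:

```python
import json


def build_session_configure(registration_id: str,
                            session_id: str = "voice-test-1",
                            voice_id: str = "YOUR_ELEVENLABS_VOICE_ID") -> str:
    """Build the first frame the frontend must send after connecting."""
    return json.dumps({
        "type": "session.configure",
        "session_id": session_id,
        "backend": {"registration_id": registration_id},
        "stt": {
            "model_id": "scribe_v2_realtime",
            "sample_rate": 16000,
            "audio_format": "pcm_16000",
            "commit_strategy": "vad",
        },
        "tts": {
            "voice_id": voice_id,
            "model_id": "eleven_flash_v2_5",
            "output_format": "pcm_24000",
        },
    })


async def run_handshake(registration_id: str) -> dict:
    """Connect, configure, and wait for the session.configured reply.

    Requires `pip install websockets` and a running server; imported
    lazily so the builder above stays dependency-free.
    """
    import websockets
    async with websockets.connect("ws://localhost:8000/v1/voice/ws") as ws:
        await ws.recv()  # session.ready greeting
        await ws.send(build_session_configure(registration_id))
        return json.loads(await ws.recv())  # expect session.configured
```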
Send JSON frames with base64 PCM audio:
```json
{
  "type": "audio.append",
  "audio": "BASE64_PCM_CHUNK",
  "sample_rate": 16000
}
```

You can also send binary WebSocket frames directly after configuration. Binary frames are treated as raw PCM audio and forwarded using the configured sample rate.
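The audio path can be sketched as a small helper that slices 16-bit mono PCM into roughly 100 ms chunks and wraps each one in an `audio.append` frame. The chunk size follows the 0.1 to 1.0 second guidance below; the frame fields match the example above and nothing else is assumed:

```python
import base64
import json

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 100         # ~0.1 s per frame, at the low end of the recommended range
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000  # 3200 bytes


def pcm_to_frames(pcm: bytes):
    """Yield audio.append JSON frames for a raw little-endian PCM buffer."""
    for start in range(0, len(pcm), CHUNK_BYTES):
        chunk = pcm[start:start + CHUNK_BYTES]
        yield json.dumps({
            "type": "audio.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": SAMPLE_RATE,
        })
```

Each yielded string can be sent as a text WebSocket frame; alternatively, the raw `chunk` bytes can be sent as binary frames once the session is configured.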
If you use manual commit instead of VAD, send:
```json
{
  "type": "audio.commit"
}
```

To send a text-only prompt to the backend chat (no audio required), send:

```json
{
  "type": "text.prompt",
  "text": "Summarize what I just said."
}
```

Examples:
```json
{ "type": "session.ready", "message": "Send session.configure to begin." }
{ "type": "session.configured", "session_id": "voice-test-1" }
{ "type": "stt.session_started", "session_id": "elevenlabs-session-id", "config": { "...": "..." } }
{ "type": "transcript.partial", "text": "hello wor" }
{ "type": "transcript.committed", "text": "hello world" }
{ "type": "assistant.started", "text": "hello world" }
{ "type": "assistant.message", "text": "Hi there!", "metadata": { "...": "..." } }
{ "type": "assistant.audio.start", "turn_id": 1, "content_type": "application/octet-stream", "output_format": "pcm_24000" }
{ "type": "assistant.audio.chunk", "turn_id": 1, "audio": "BASE64_AUDIO_CHUNK", "content_type": "application/octet-stream", "output_format": "pcm_24000", "chunk_index": 1 }
{ "type": "assistant.audio.end", "turn_id": 1, "content_type": "application/octet-stream", "output_format": "pcm_24000", "chunk_count": 42 }
{ "type": "assistant.audio", "audio": "BASE64_AUDIO", "content_type": "audio/mpeg", "output_format": "mp3_44100_128" }
{ "type": "error", "message": "Readable error for the frontend" }
```

- Prefer 16 kHz mono PCM for the simplest STT path.
- Send chunks around 0.1 to 1.0 seconds long for smoother streaming and lower latency.
- `commit_strategy: "vad"` is the default recommended path for the browser test client. The backend now owns turn boundaries by default.
- Use `commit_strategy: "manual"` only for debugging or custom clients that want to force commits explicitly.
- Only send `previous_text` with the first audio chunk after a new segment starts.
- For low-latency browser playback, prefer `pcm_24000` so the client can start speaking from streamed chunks immediately.
- Assistant audio may arrive as streamed `assistant.audio.start` / `assistant.audio.chunk` / `assistant.audio.end` events, or as a fallback single `assistant.audio` blob.
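One way a client could consume these events (a sketch, not part of the service) is to buffer `assistant.audio.chunk` frames keyed by `turn_id` and emit the concatenated PCM on `assistant.audio.end`, handling the fallback single `assistant.audio` blob the same way:

```python
import base64


class AssistantAudioAssembler:
    """Collects streamed assistant audio chunks into complete per-turn buffers."""

    def __init__(self):
        self._buffers = {}  # turn_id -> list of decoded PCM chunks

    def handle(self, event: dict):
        """Feed one decoded WS event; return finished audio bytes or None."""
        etype = event.get("type")
        if etype == "assistant.audio.start":
            self._buffers[event["turn_id"]] = []
        elif etype == "assistant.audio.chunk":
            self._buffers.setdefault(event["turn_id"], []).append(
                base64.b64decode(event["audio"]))
        elif etype == "assistant.audio.end":
            return b"".join(self._buffers.pop(event["turn_id"], []))
        elif etype == "assistant.audio":
            # Fallback: the whole reply arrives as a single blob.
            return base64.b64decode(event["audio"])
        return None
```

For lowest perceived latency a real client would play each decoded chunk as it arrives rather than waiting for `assistant.audio.end`; the assembler shape is mainly useful for testing and for the fallback blob.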
The service includes a built-in browser console at `/test`.

Recommended test order:

- Start the API server.
- Open http://localhost:8000/test.
- Paste a `registration_id` into the page if you already have one.
- Click `Connect`.
- Click `Send session.configure`.
- Click `Send text.prompt` first to verify backend chat plus TTS.
- Click `Start Mic` to test realtime STT and voice playback.

If you need to force a commit while testing, click `Send audio.commit`.
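To sanity-check assistant audio outside the browser, raw `pcm_24000` bytes (for example, the output of an assembler like the one sketched above) can be wrapped in a WAV container with the stdlib `wave` module. This assumes 16-bit signed mono PCM, which is what the ElevenLabs PCM output formats produce:

```python
import wave


def save_pcm_as_wav(pcm: bytes, path: str, sample_rate: int = 24000) -> None:
    """Write raw 16-bit mono PCM to a playable WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)         # mono
        wav.setsampwidth(2)         # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
```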