Voice Pipeline

CortexPrism edited this page Jun 17, 2026 · 1 revision

CortexPrism includes a full voice pipeline: speech-to-text (STT), text-to-speech (TTS), voice activity detection (VAD), and real-time audio streaming.

Components

| Component | File | Description |
|-----------|------|-------------|
| STT | src/voice/stt.ts | Speech-to-text via OpenAI Whisper |
| TTS | src/voice/tts.ts | Text-to-speech via OpenAI TTS or ElevenLabs |
| VAD | src/voice/vad.ts | Energy-based voice activity detection |
| Audio | src/voice/audio.ts | Audio format conversion (ffmpeg fallback) |
| Channel | src/voice/channel.ts | Voice channel plugin implementing ChannelPlugin |
| Pipeline | src/voice/pipeline.ts | Auto-TTS post-output hook |
| Manager | src/voice/manager.ts | Voice mode state management |

Voice Mode

cortex voice enable         # Enable voice mode
cortex voice disable        # Disable voice mode
cortex voice status         # Show current voice config
cortex voice set-voice <id> # Change default voice

Voice Settings

In ~/.cortex/config.json:

{
  "voice": {
    "enabled": false,
    "provider": "openai",
    "defaultVoice": "alloy",
    "autoTTS": false
  }
}

OpenAI TTS voices: alloy, echo, fable, onyx, nova, shimmer

ElevenLabs is supported as an alternative TTS provider. Set voice.provider to elevenlabs and configure your API key in settings.
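
As a rough illustration of how the voice block above could be read and validated, the sketch below merges a parsed config with the defaults shown in the example. The names VoiceConfig, DEFAULT_VOICE_CONFIG, and resolveVoiceConfig are hypothetical and are not taken from the CortexPrism source; the check against the OpenAI voice list is likewise an assumption about desirable behavior.

```typescript
// Hypothetical types mirroring the "voice" block in ~/.cortex/config.json.
type VoiceProvider = "openai" | "elevenlabs";

interface VoiceConfig {
  enabled: boolean;
  provider: VoiceProvider;
  defaultVoice: string;
  autoTTS: boolean;
}

// Defaults copied from the example config above.
const DEFAULT_VOICE_CONFIG: VoiceConfig = {
  enabled: false,
  provider: "openai",
  defaultVoice: "alloy",
  autoTTS: false,
};

// OpenAI TTS voices listed on this page.
const OPENAI_VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

// Parse the text of config.json and return a validated voice config.
// Missing keys fall back to the defaults (assumed behavior).
function resolveVoiceConfig(configJson: string): VoiceConfig {
  const parsed = JSON.parse(configJson) as { voice?: Partial<VoiceConfig> };
  const merged: VoiceConfig = { ...DEFAULT_VOICE_CONFIG, ...(parsed.voice ?? {}) };

  if (merged.provider !== "openai" && merged.provider !== "elevenlabs") {
    throw new Error(`Unknown voice.provider: ${String(merged.provider)}`);
  }
  // Only OpenAI voice names are listed here, so ElevenLabs IDs pass through unchecked.
  if (merged.provider === "openai" && !OPENAI_VOICES.includes(merged.defaultVoice)) {
    throw new Error(`Unknown OpenAI voice: ${merged.defaultVoice}`);
  }
  return merged;
}
```

Calling resolveVoiceConfig with the file contents yields a complete config object even when some keys are omitted, which keeps the rest of a pipeline free of undefined checks.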

Auto-TTS

When voice.autoTTS is enabled, a post-output pipeline hook automatically synthesizes each agent text response to audio. The synthesized audio is forwarded to the WebSocket client before the done signal is sent.
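
The ordering described above (audio first, then the done signal) can be sketched as follows. This is a minimal illustration, not the code in src/voice/pipeline.ts: finishResponse, AutoTtsDeps, and the { "type": "done" } message shape are all hypothetical, and skipping audio on a TTS failure is an assumed policy.

```typescript
// Hypothetical outbound messages; the "done" shape is an assumption.
type OutboundMessage =
  | { type: "audio"; data: string }
  | { type: "done" };

// Dependencies are injected so the sketch stays provider-agnostic.
interface AutoTtsDeps {
  autoTTS: boolean;
  synthesize: (text: string) => Promise<string>; // resolves to base64 audio
  send: (msg: OutboundMessage) => void;
}

// Post-output step: optionally synthesize the reply, then signal completion.
async function finishResponse(text: string, deps: AutoTtsDeps): Promise<void> {
  if (deps.autoTTS && text.trim().length > 0) {
    try {
      const data = await deps.synthesize(text);
      deps.send({ type: "audio", data }); // audio goes out before "done"
    } catch {
      // Assumed policy: a TTS failure should not block the text response.
    }
  }
  deps.send({ type: "done" });
}
```

Awaiting synthesis before sending done guarantees the client never sees the completion signal while audio for the same response is still pending.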

VAD (Voice Activity Detection)

The VAD is energy-based and exposes the following configurable parameters:

  • Frame size: Duration of each audio frame
  • Speech threshold: Energy threshold to detect speech
  • Silence timeout: Duration of silence before ending capture
  • Minimum speech duration: Minimum length to consider valid speech
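
To show how the four parameters above interact, here is a minimal energy-based VAD sketch. It is not the implementation in src/voice/vad.ts: the option names, the use of RMS as the energy measure, and the event names are assumptions chosen for illustration.

```typescript
// Hypothetical option names mapping to the four parameters listed above.
interface VadOptions {
  frameMs: number;          // frame size: duration of each audio frame
  speechThreshold: number;  // RMS energy at or above which a frame is speech
  silenceTimeoutMs: number; // silence required before capture ends
  minSpeechMs: number;      // utterances shorter than this are discarded
}

type VadEvent = "none" | "speech_start" | "speech_end" | "discarded";

// Root-mean-square energy of one frame of samples in [-1, 1].
function rmsEnergy(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class EnergyVad {
  private opts: VadOptions;
  private speaking = false;
  private speechMs = 0;
  private silenceMs = 0;

  constructor(opts: VadOptions) {
    this.opts = opts;
  }

  // Feed one frame; returns an event when the capture state changes.
  process(frame: Float32Array): VadEvent {
    const isSpeech = rmsEnergy(frame) >= this.opts.speechThreshold;
    if (!this.speaking) {
      if (!isSpeech) return "none";
      this.speaking = true;
      this.speechMs = this.opts.frameMs;
      this.silenceMs = 0;
      return "speech_start";
    }
    if (isSpeech) {
      this.speechMs += this.opts.frameMs;
      this.silenceMs = 0; // any speech frame resets the silence timer
      return "none";
    }
    this.silenceMs += this.opts.frameMs;
    if (this.silenceMs < this.opts.silenceTimeoutMs) return "none";
    // Silence timeout reached: end capture and check the minimum duration.
    const valid = this.speechMs >= this.opts.minSpeechMs;
    this.speaking = false;
    return valid ? "speech_end" : "discarded";
  }
}
```

In this sketch a short click that crosses the threshold still starts a capture, but it is reported as discarded once the silence timeout elapses, so only utterances meeting the minimum speech duration reach transcription.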

WebSocket Audio Protocol

Audio is streamed in real time over WebSocket as JSON messages with base64-encoded audio payloads:

Client → Server:

{ "type": "audio_chunk", "data": "<base64>" }
{ "type": "audio_end" }

Server → Client:

{ "type": "speak", "text": "...", "voice": "alloy" }
{ "type": "audio", "data": "<base64>" }
{ "type": "voice_state", "enabled": true }

Transcribed speech is dispatched directly into the agent loop as a user message.
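
The message shapes above translate naturally into discriminated unions. The sketch below types both directions, frames an utterance as a series of audio_chunk messages followed by audio_end, and validates incoming server messages. The function names encodeUtterance and parseServerMessage are hypothetical, and returning null for unrecognized messages is an assumed policy.

```typescript
// Message shapes copied from the protocol listing above.
type ClientMessage =
  | { type: "audio_chunk"; data: string }
  | { type: "audio_end" };

type ServerMessage =
  | { type: "speak"; text: string; voice: string }
  | { type: "audio"; data: string }
  | { type: "voice_state"; enabled: boolean };

// Frame base64 chunks as the JSON strings a client would send,
// terminated by a single audio_end message.
function encodeUtterance(base64Chunks: string[]): string[] {
  const messages: ClientMessage[] = base64Chunks.map(
    (data): ClientMessage => ({ type: "audio_chunk", data }),
  );
  messages.push({ type: "audio_end" });
  return messages.map((m) => JSON.stringify(m));
}

// Validate a raw server frame; returns null for anything unrecognized.
function parseServerMessage(raw: string): ServerMessage | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) return null;
  const m = msg as Record<string, unknown>;
  const text = m["text"], voice = m["voice"], data = m["data"], enabled = m["enabled"];
  switch (m["type"]) {
    case "speak":
      return typeof text === "string" && typeof voice === "string"
        ? { type: "speak", text, voice }
        : null;
    case "audio":
      return typeof data === "string" ? { type: "audio", data } : null;
    case "voice_state":
      return typeof enabled === "boolean" ? { type: "voice_state", enabled } : null;
    default:
      return null;
  }
}
```

A client would pass each string from encodeUtterance to its WebSocket send call and route parsed server messages by their type field.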

Agent Tools

| Tool | Description |
|------|-------------|
| speak | Text-to-speech via configured TTS provider |
| listen | Speech-to-text via configured STT provider |

REST API

| Endpoint | Description |
|----------|-------------|
| POST /api/voice/transcribe | Speech-to-text transcription |
| POST /api/voice/synthesize | Text-to-speech synthesis |
| GET /api/voice/synthesize/:text | TTS via GET request |
| GET /api/voice/providers | List available TTS providers |
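
Because the GET variant carries the text in the URL path, the text has to be percent-encoded before it is placed in the :text segment. The helpers below sketch that step. The names synthesizePath and synthesizeUrl are hypothetical, and the base URL in the usage note is a placeholder; the request and response formats of the endpoints are not specified on this page and are not assumed here.

```typescript
// Build the path for GET /api/voice/synthesize/:text.
// encodeURIComponent escapes spaces, slashes, "?", and other reserved characters
// so the text stays inside a single path segment.
function synthesizePath(text: string): string {
  return `/api/voice/synthesize/${encodeURIComponent(text)}`;
}

// Join the path onto a server base URL (hypothetical; supplied by the caller).
function synthesizeUrl(baseUrl: string, text: string): string {
  return baseUrl.replace(/\/+$/, "") + synthesizePath(text);
}
```

For example, synthesizeUrl("http://localhost:3000/", "Hello, world?") yields http://localhost:3000/api/voice/synthesize/Hello%2C%20world%3F, where the host and port are placeholders. For long passages the POST endpoint is likely the better fit, since URL length limits vary across servers and proxies.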

Web UI

The Web UI provides a voice settings tab and in-chat voice controls:

  • Provider selection (OpenAI / ElevenLabs)
  • Default voice choice
  • Language setting
  • Auto-TTS toggle
  • Microphone button in chat (with recording animation)
  • Speaker button on each assistant message for on-demand TTS
  • Speaking pulse animation indicator
