Voice Pipeline

CortexPrism includes a full voice pipeline: speech-to-text (STT), text-to-speech (TTS), voice activity detection (VAD), and real-time audio streaming.

Components

| Component | File | Description |
| --- | --- | --- |
| STT | `src/voice/stt.ts` | Speech-to-text via OpenAI Whisper |
| TTS | `src/voice/tts.ts` | Text-to-speech via OpenAI TTS or ElevenLabs |
| VAD | `src/voice/vad.ts` | Energy-based voice activity detection |
| Audio | `src/voice/audio.ts` | Audio format conversion (ffmpeg fallback) |
| Channel | `src/voice/channel.ts` | Voice channel plugin implementing ChannelPlugin |
| Pipeline | `src/voice/pipeline.ts` | Auto-TTS post-output hook |
| Manager | `src/voice/manager.ts` | Voice mode state management |

Voice Mode

cortex voice enable         # Enable voice mode
cortex voice disable        # Disable voice mode
cortex voice status         # Show current voice config
cortex voice set-voice <id> # Change default voice

Voice Settings

In ~/.cortex/config.json:

{
  "voice": {
    "enabled": false,
    "provider": "openai",
    "defaultVoice": "alloy",
    "autoTTS": false
  }
}

OpenAI TTS voices: alloy, echo, fable, onyx, nova, shimmer

ElevenLabs is supported as an alternative TTS provider. To use it, set voice.provider to "elevenlabs" and configure your API key in settings.
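As a rough illustration, the voice block could be typed and validated on load as sketched below. The names `VoiceConfig`, `DEFAULT_VOICE_CONFIG`, and `parseVoiceConfig` are hypothetical and not taken from the CortexPrism source; the field names, defaults, and OpenAI voice list mirror what this page shows. ElevenLabs voice IDs are not listed here, so the sketch only checks voice names for the OpenAI provider.

```typescript
// Hypothetical types; field names mirror the config.json example on this page.
type VoiceProvider = "openai" | "elevenlabs";

interface VoiceConfig {
  enabled: boolean;
  provider: VoiceProvider;
  defaultVoice: string;
  autoTTS: boolean;
}

// Voice names listed on this page for the OpenAI provider.
const OPENAI_VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

// Defaults taken from the example config.
const DEFAULT_VOICE_CONFIG: VoiceConfig = {
  enabled: false,
  provider: "openai",
  defaultVoice: "alloy",
  autoTTS: false,
};

// Merge a parsed config.json object over the defaults and reject bad values.
function parseVoiceConfig(raw: unknown): VoiceConfig {
  const voice = (raw as { voice?: Partial<VoiceConfig> } | null)?.voice ?? {};
  const merged: VoiceConfig = { ...DEFAULT_VOICE_CONFIG, ...voice };
  if (merged.provider !== "openai" && merged.provider !== "elevenlabs") {
    throw new Error(`Unknown voice.provider: ${String(merged.provider)}`);
  }
  if (merged.provider === "openai" && OPENAI_VOICES.indexOf(merged.defaultVoice) === -1) {
    throw new Error(`Unknown OpenAI voice: ${merged.defaultVoice}`);
  }
  return merged;
}
```

Calling `parseVoiceConfig(JSON.parse(fileContents))` on a file with no voice block returns the defaults unchanged.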

Auto-TTS

When voice.autoTTS is enabled, a post-output pipeline hook synthesizes each agent text response into audio and sends it to the WebSocket client before the done signal.
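The ordering described above (audio first, then done) can be sketched as follows. This is not the actual hook in `src/voice/pipeline.ts`: `finishResponse`, `Synthesize`, and `Send` are hypothetical names, the `{ "type": "done" }` shape is an assumption, and swallowing TTS errors is a design choice made for this sketch. Only the `audio` message shape comes from this page.

```typescript
// Hypothetical shapes for illustration; only the "audio" message is documented on this page.
type OutboundMessage =
  | { type: "audio"; data: string } // base64 audio, as in the WebSocket protocol
  | { type: "done" };               // assumed shape of the done signal

type Send = (msg: OutboundMessage) => void;
type Synthesize = (text: string) => Promise<string>; // resolves to base64 audio

// Post-output step: synthesize first, send the audio, then signal completion.
async function finishResponse(
  text: string,
  autoTTS: boolean,
  synthesize: Synthesize,
  send: Send,
): Promise<void> {
  if (autoTTS && text.trim().length > 0) {
    try {
      send({ type: "audio", data: await synthesize(text) });
    } catch (err) {
      // Assumption: a TTS failure should not block the text response from completing.
      console.error("auto-TTS failed:", err);
    }
  }
  send({ type: "done" });
}
```

Because `send({ type: "done" })` runs only after the awaited synthesis, a client that stops listening on done still receives the audio.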

VAD (Voice Activity Detection)

Energy-based VAD with configurable parameters:

Frame size: Duration of each audio frame
Speech threshold: Energy threshold to detect speech
Silence timeout: Duration of silence before ending capture
Minimum speech duration: Minimum length to consider valid speech
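How the four parameters interact is easiest to see in code. The sketch below is a generic energy-based VAD, not the implementation in `src/voice/vad.ts`: the option names, the use of RMS as the energy measure, and the event names are all assumptions, and the page does not state default values.

```typescript
// Hypothetical option names; this page lists the parameters but not their identifiers or defaults.
interface VadOptions {
  frameMs: number;          // duration of each audio frame
  speechThreshold: number;  // RMS energy at or above which a frame counts as speech
  silenceTimeoutMs: number; // silence needed to end a capture
  minSpeechMs: number;      // captures with less speech than this are discarded
}

type VadEvent = "none" | "speech_start" | "speech_end" | "discarded";

// Root-mean-square energy of one frame of PCM samples in [-1, 1].
function rmsEnergy(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

class EnergyVad {
  private opts: VadOptions;
  private capturing = false;
  private speechMs = 0;
  private silenceMs = 0;

  constructor(opts: VadOptions) {
    this.opts = opts;
  }

  // Feed one frame; returns the state transition it caused, if any.
  process(frame: Float32Array): VadEvent {
    const isSpeech = rmsEnergy(frame) >= this.opts.speechThreshold;
    if (!this.capturing) {
      if (!isSpeech) return "none";
      this.capturing = true;
      this.speechMs = this.opts.frameMs;
      this.silenceMs = 0;
      return "speech_start";
    }
    if (isSpeech) {
      this.speechMs += this.opts.frameMs;
      this.silenceMs = 0; // any speech frame resets the silence timer
      return "none";
    }
    this.silenceMs += this.opts.frameMs;
    if (this.silenceMs < this.opts.silenceTimeoutMs) return "none";
    this.capturing = false;
    return this.speechMs >= this.opts.minSpeechMs ? "speech_end" : "discarded";
  }
}
```

A lower speech threshold picks up quieter speakers but also more background noise; a longer silence timeout tolerates pauses at the cost of a slower end-of-utterance.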

WebSocket Audio Protocol

Real-time audio streaming over WebSocket:

Client → Server:

{ "type": "audio_chunk", "data": "<base64>" }
{ "type": "audio_end" }

Server → Client:

{ "type": "speak", "text": "...", "voice": "alloy" }
{ "type": "audio", "data": "<base64>" }
{ "type": "voice_state", "enabled": true }

Transcribed speech is dispatched directly into the agent loop as a user message.
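For reference, the message shapes above can be expressed as TypeScript discriminated unions. The sketch below also shows one way to validate incoming frames and buffer chunks until `audio_end`. The type names, `parseClientMessage`, `encodeServerMessage`, and `UtteranceBuffer` are hypothetical; only the JSON shapes come from this page.

```typescript
// Message shapes copied from the examples on this page; the type names are hypothetical.
type ClientMessage =
  | { type: "audio_chunk"; data: string } // base64-encoded audio
  | { type: "audio_end" };

type ServerMessage =
  | { type: "speak"; text: string; voice: string }
  | { type: "audio"; data: string }
  | { type: "voice_state"; enabled: boolean };

// Serialize a server message for sending as a WebSocket text frame.
function encodeServerMessage(msg: ServerMessage): string {
  return JSON.stringify(msg);
}

// Parse one text frame; returns null for anything that is not a known client message.
function parseClientMessage(raw: string): ClientMessage | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) return null;
  const m = msg as { type?: unknown; data?: unknown };
  if (m.type === "audio_end") return { type: "audio_end" };
  if (m.type === "audio_chunk" && typeof m.data === "string") {
    return { type: "audio_chunk", data: m.data };
  }
  return null;
}

// Buffers chunks for one utterance and hands them over on audio_end.
class UtteranceBuffer {
  private chunks: string[] = [];

  // Returns the buffered chunks when the utterance ends, otherwise null.
  accept(msg: ClientMessage): string[] | null {
    if (msg.type === "audio_chunk") {
      this.chunks.push(msg.data);
      return null;
    }
    const done = this.chunks;
    this.chunks = [];
    return done;
  }
}
```

The chunks are kept as separate base64 strings rather than concatenated, since joining base64 strings is only safe when each chunk's byte length is a multiple of three; decode each chunk first, then join the bytes.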

Agent Tools

| Tool | Description |
| --- | --- |
| `speak` | Text-to-speech via configured TTS provider |
| `listen` | Speech-to-text via configured STT provider |

REST API

| Endpoint | Description |
| --- | --- |
| `POST /api/voice/transcribe` | Speech-to-text transcription |
| `POST /api/voice/synthesize` | Text-to-speech synthesis |
| `GET /api/voice/synthesize/:text` | TTS via GET request |
| `GET /api/voice/providers` | List available TTS providers |
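The GET variant carries the text as a path segment, so the text has to be percent-encoded before it goes into the URL. A small helper along these lines would do it; `synthesizeUrl` is a hypothetical name, and the base URL is an assumption that depends on where your CortexPrism server is listening.

```typescript
// Build the URL for GET /api/voice/synthesize/:text.
// baseUrl is an assumption: use the host and port your server actually listens on.
function synthesizeUrl(baseUrl: string, text: string): string {
  // Strip trailing slashes from the base, then percent-encode the text
  // so spaces, "?", "/" and similar characters stay inside the path segment.
  const base = baseUrl.replace(/\/+$/, "");
  return `${base}/api/voice/synthesize/${encodeURIComponent(text)}`;
}
```

The resulting URL can be passed to `fetch`. This page does not document the response format or the request body of the POST endpoints, so check the server source before relying on either.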

Web UI

Voice settings tab includes:

Provider selection (OpenAI / ElevenLabs)
Default voice choice
Language setting
Auto-TTS toggle
Microphone button in chat (with recording animation)
Speaker button on each assistant message for on-demand TTS
Speaking pulse animation indicator
