Voice Pipeline
CortexPrism includes a full voice pipeline: speech-to-text (STT), text-to-speech (TTS), voice activity detection (VAD), and real-time audio streaming.
| Component | File | Description |
|---|---|---|
| STT | src/voice/stt.ts | Speech-to-text via OpenAI Whisper |
| TTS | src/voice/tts.ts | Text-to-speech via OpenAI TTS or ElevenLabs |
| VAD | src/voice/vad.ts | Energy-based voice activity detection |
| Audio | src/voice/audio.ts | Audio format conversion (ffmpeg fallback) |
| Channel | src/voice/channel.ts | Voice channel plugin implementing ChannelPlugin |
| Pipeline | src/voice/pipeline.ts | Auto-TTS post-output hook |
| Manager | src/voice/manager.ts | Voice mode state management |
cortex voice enable # Enable voice mode
cortex voice disable # Disable voice mode
cortex voice status # Show current voice config
cortex voice set-voice <id> # Change default voice

In ~/.cortex/config.json:
{
"voice": {
"enabled": false,
"provider": "openai",
"defaultVoice": "alloy",
"autoTTS": false
}
}

OpenAI TTS voices: alloy, echo, fable, onyx, nova, shimmer
ElevenLabs is supported as an alternative TTS provider. Set voice.provider to elevenlabs and configure your API key in settings.
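As a rough illustration of the config shape above, the following sketch types the `voice` section and checks it against the voice list given here. The field names come from the sample config; `RawVoiceConfig`, `OPENAI_VOICES`, and `validateVoiceConfig` are hypothetical names introduced for this example and are not part of CortexPrism.

```typescript
// Shape of the "voice" section, taken from the sample config above.
// `provider` is typed as string because it arrives unvalidated from JSON.
interface RawVoiceConfig {
  enabled: boolean;
  provider: string;
  defaultVoice: string;
  autoTTS: boolean;
}

// OpenAI voice ids as listed in this document.
const OPENAI_VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"];

// Hypothetical helper: returns a list of problems, empty when nothing looks wrong.
function validateVoiceConfig(cfg: RawVoiceConfig): string[] {
  const problems: string[] = [];
  if (cfg.provider !== "openai" && cfg.provider !== "elevenlabs") {
    problems.push("unknown provider: " + cfg.provider);
  }
  // Only OpenAI voice ids are listed in this document, so ElevenLabs
  // voice ids are not checked here.
  if (cfg.provider === "openai" && OPENAI_VOICES.indexOf(cfg.defaultVoice) === -1) {
    problems.push("unknown OpenAI voice: " + cfg.defaultVoice);
  }
  return problems;
}
```

A check like this could run when the config file is loaded, so that a mistyped voice id is reported before the first synthesis request.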
When voice.autoTTS is enabled, a post-output pipeline hook automatically synthesizes agent text responses to audio. Audio is forwarded to the WebSocket client before the done signal.
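The ordering described above (audio first, then the done signal) can be sketched as follows. This is a minimal illustration, not the code in src/voice/pipeline.ts: `autoTtsHook`, `HookMessage`, the injected `synthesizeBase64` function, and the `{ type: "done" }` message shape are all assumptions made for the example.

```typescript
// Messages the hook may emit. The "audio" shape matches the streaming
// protocol in this document; the "done" shape is an assumption.
type HookMessage =
  | { type: "audio"; data: string }
  | { type: "done" };

// Hypothetical post-output hook. `synthesizeBase64` stands in for the
// configured TTS provider and is assumed to return base64-encoded audio.
async function autoTtsHook(
  text: string,
  autoTTS: boolean,
  synthesizeBase64: (text: string) => Promise<string>,
  send: (msg: HookMessage) => void,
): Promise<void> {
  if (autoTTS && text.trim() !== "") {
    const data = await synthesizeBase64(text);
    // Audio goes out before the done signal.
    send({ type: "audio", data });
  }
  send({ type: "done" });
}
```

Awaiting synthesis before sending the done signal means the client can treat done as "nothing more will arrive for this turn", at the cost of delaying that signal by the synthesis time.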
Energy-based VAD with configurable parameters:
- Frame size: Duration of each audio frame
- Speech threshold: Energy threshold to detect speech
- Silence timeout: Duration of silence before ending capture
- Minimum speech duration: Minimum length to consider valid speech
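To show how the four parameters above interact, here is a minimal energy-based VAD sketch. It is not the implementation in src/voice/vad.ts: the option names, the use of RMS as the energy measure, and expressing durations in frames rather than milliseconds are assumptions chosen to keep the example short.

```typescript
// Hypothetical option names mirroring the parameters listed above.
interface VadOptions {
  frameSize: number;            // samples per frame
  speechThreshold: number;      // RMS energy at or above which a frame counts as speech
  silenceTimeoutFrames: number; // consecutive silent frames that end a capture
  minSpeechFrames: number;      // shorter captures are discarded
}

// Root-mean-square energy of one frame.
function rms(frame: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i];
  return Math.sqrt(sum / frame.length);
}

// Returns speech segments as [startFrame, endFrameExclusive] pairs.
function detectSpeech(samples: Float32Array, opts: VadOptions): Array<[number, number]> {
  const segments: Array<[number, number]> = [];
  const frameCount = Math.floor(samples.length / opts.frameSize);
  let start = -1;      // first frame of the current capture, -1 when idle
  let lastSpeech = -1; // most recent frame classified as speech

  const close = () => {
    if (start >= 0 && lastSpeech - start + 1 >= opts.minSpeechFrames) {
      segments.push([start, lastSpeech + 1]);
    }
    start = -1;
  };

  for (let i = 0; i < frameCount; i++) {
    const frame = samples.subarray(i * opts.frameSize, (i + 1) * opts.frameSize);
    if (rms(frame) >= opts.speechThreshold) {
      if (start < 0) start = i;
      lastSpeech = i;
    } else if (start >= 0 && i - lastSpeech >= opts.silenceTimeoutFrames) {
      close(); // silence lasted long enough to end the capture
    }
  }
  close(); // flush a capture still open at end of input
  return segments;
}
```

The silence timeout keeps short pauses inside one capture, while the minimum speech duration filters out clicks and other brief bursts that cross the threshold.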
Real-time audio streaming over WebSocket:
Client → Server:
{ "type": "audio_chunk", "data": "<base64>" }
{ "type": "audio_end" }

Server → Client:
{ "type": "speak", "text": "...", "voice": "alloy" }
{ "type": "audio", "data": "<base64>" }
{ "type": "voice_state", "enabled": true }

Transcribed speech is dispatched directly into the agent loop as a user message.
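The message shapes above map naturally onto TypeScript discriminated unions. The sketch below types both directions and shows one way a server might buffer chunks until `audio_end`. The type and function names (`ClientMessage`, `ServerMessage`, `parseClientMessage`, `AudioAssembler`) are hypothetical; only the JSON fields come from this document.

```typescript
// Client → Server messages, as documented above.
type ClientMessage =
  | { type: "audio_chunk"; data: string } // base64-encoded audio
  | { type: "audio_end" };

// Server → Client messages, as documented above.
type ServerMessage =
  | { type: "speak"; text: string; voice: string }
  | { type: "audio"; data: string }
  | { type: "voice_state"; enabled: boolean };

// Parses one raw WebSocket text frame; returns null for anything unrecognized.
function parseClientMessage(raw: string): ClientMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const msg = value as Record<string, unknown>;
  if (msg.type === "audio_chunk" && typeof msg.data === "string") {
    return { type: "audio_chunk", data: msg.data };
  }
  if (msg.type === "audio_end") return { type: "audio_end" };
  return null;
}

// Buffers base64 chunks and hands them back when the client signals the end.
class AudioAssembler {
  private chunks: string[] = [];

  // Returns the buffered chunks on audio_end, otherwise null.
  push(msg: ClientMessage): string[] | null {
    if (msg.type === "audio_chunk") {
      this.chunks.push(msg.data);
      return null;
    }
    const done = this.chunks;
    this.chunks = [];
    return done;
  }
}
```

Validating each frame before use keeps malformed client input out of the transcription step; the chunks returned on `audio_end` would then be decoded and passed to the STT provider.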
| Tool | Description |
|---|---|
| speak | Text-to-speech via configured TTS provider |
| listen | Speech-to-text via configured STT provider |
| Endpoint | Description |
|---|---|
| POST /api/voice/transcribe | Speech-to-text transcription |
| POST /api/voice/synthesize | Text-to-speech synthesis |
| GET /api/voice/synthesize/:text | TTS via GET request |
| GET /api/voice/providers | List available TTS providers |
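The following sketch shows how a client might build requests for the two synthesis endpoints above. The paths come from the table; everything else is an assumption, in particular the JSON body fields `text` and `voice` (borrowed from the `speak` message shape) and the base URL used in the usage note. Check the REST API page for the actual request format.

```typescript
// Builds the URL for GET /api/voice/synthesize/:text.
// The text is percent-encoded so spaces and punctuation survive as one path segment.
function synthesizeGetUrl(baseUrl: string, text: string): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return root + "/api/voice/synthesize/" + encodeURIComponent(text);
}

// Builds a request for POST /api/voice/synthesize.
// ASSUMPTION: the endpoint accepts a JSON body with `text` and `voice` fields.
function synthesizePostRequest(baseUrl: string, text: string, voice: string) {
  const root = baseUrl.replace(/\/+$/, "");
  return {
    url: root + "/api/voice/synthesize",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text, voice }),
    },
  };
}
```

With a server at a hypothetical `http://localhost:3000`, the POST variant would be sent as `const req = synthesizePostRequest("http://localhost:3000", "Hello", "alloy"); await fetch(req.url, req.init);`. The GET variant is convenient for short text in an audio element's source URL, while POST avoids URL length limits for longer text.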
Voice settings tab includes:
- Provider selection (OpenAI / ElevenLabs)
- Default voice choice
- Language setting
- Auto-TTS toggle

In the chat view:
- Microphone button (with recording animation)
- Speaker button on each assistant message for on-demand TTS
- Speaking pulse animation indicator
- REST API — Voice API endpoints
- Configuration — Voice config options
CortexPrism — Open-source agentic AI harness · MIT License · Built with Deno 2.x + TypeScript