Voice-to-action AI assistant that captures voice input, transcribes with Whisper, generates responses with GPT, and plays spoken output using ElevenLabs.
- Voice recording in browser (MediaRecorder API)
- Whisper speech-to-text transcription
- GPT-based conversational response generation
- ElevenLabs text-to-speech audio playback
- Futuristic responsive UI with animated gradients
- Modular backend AI service architecture
- Frontend: React + Vite
- Backend: Node.js + Express
- AI APIs: OpenAI (Whisper + GPT), ElevenLabs (TTS)

`npm install`

Copy and populate:
- `server/.env.example` -> `server/.env`

Required variables:
- `OPENAI_API_KEY`
- `OPENAI_MODEL` (defaults to `gpt-4.1-mini`)
- `WHISPER_MODEL` (defaults to `whisper-1`)
- `ELEVENLABS_API_KEY`
- `ELEVENLABS_VOICE_ID`
- `PORT` (defaults to `5000`)
- `CLIENT_ORIGIN` (defaults to `http://localhost:5173`)
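
As a rough illustration of how the server might read these variables, here is a minimal sketch. The `loadConfig` function and the returned property names are hypothetical and not taken from the project. It also assumes that the three variables listed without a default must be set, and it applies the defaults documented above to the rest.

```javascript
// Hypothetical config loader (illustration only; the real server may differ).
// Takes the environment as an argument so it can be exercised without touching process.env.
function loadConfig(env = process.env) {
  // Assumption: variables documented without a default are mandatory.
  const required = ['OPENAI_API_KEY', 'ELEVENLABS_API_KEY', 'ELEVENLABS_VOICE_ID'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }

  return {
    openaiApiKey: env.OPENAI_API_KEY,
    openaiModel: env.OPENAI_MODEL || 'gpt-4.1-mini',
    whisperModel: env.WHISPER_MODEL || 'whisper-1',
    elevenLabsApiKey: env.ELEVENLABS_API_KEY,
    elevenLabsVoiceId: env.ELEVENLABS_VOICE_ID,
    // Environment values are strings, so the port is converted to a number.
    port: Number(env.PORT || 5000),
    clientOrigin: env.CLIENT_ORIGIN || 'http://localhost:5173',
  };
}
```

Failing fast on missing keys at startup tends to give a clearer error than a rejected API call in the middle of a voice request.
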

`npm run dev`

- Frontend: http://localhost:5173
- Backend: http://localhost:5000

This project demonstrates end-to-end voice AI orchestration:
- Speech capture
- Speech-to-text
- LLM response generation
- Text-to-speech synthesis
- Audio playback UX
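
The server-side portion of these steps (speech-to-text, response generation, speech synthesis) can be sketched as a small orchestration function. This is a minimal sketch, not the project's actual code: `createVoicePipeline` and the injected stage functions `transcribe`, `generateReply`, and `synthesize` are hypothetical names standing in for the Whisper, GPT, and ElevenLabs calls.

```javascript
// Hypothetical orchestration of the three server-side stages.
// Each stage is injected, so real API clients can be swapped for stubs in tests.
function createVoicePipeline({ transcribe, generateReply, synthesize }) {
  return async function runVoicePipeline(audioBuffer) {
    // 1. Speech-to-text: audio bytes in, transcript string out.
    const transcript = (await transcribe(audioBuffer)).trim();
    if (!transcript) {
      // Avoid spending an LLM and TTS call on silence.
      throw new Error('No speech detected in the recording');
    }

    // 2. LLM response generation: transcript in, reply text out.
    const reply = await generateReply(transcript);

    // 3. Text-to-speech: reply text in, audio bytes out.
    const audio = await synthesize(reply);

    return { transcript, reply, audio };
  };
}

// Usage with stand-in stages (no network calls):
const runDemoPipeline = createVoicePipeline({
  transcribe: async () => 'what time is it',
  generateReply: async (text) => `You asked: ${text}`,
  synthesize: async (text) => Buffer.from(text, 'utf8'),
});
```

Injecting the stages keeps the orchestration independent of any particular SDK, which is one way to read the "modular backend" goal.
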
flowchart LR
subgraph client [Client]
UI[React UI]
Hook[Voice session hook]
APIc[API client]
UI --> Hook
Hook --> APIc
end
subgraph server [Server]
R[routes]
C[voice controller]
S[voice pipeline service]
W[Whisper service]
G[GPT service]
T[TTS service]
R --> C
C --> S
S --> W
S --> G
S --> T
end
APIc -->|multipart audio| R
T -->|MPEG audio| APIc
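
To make the routes, controller, and service boundary in the diagram more concrete, here is a minimal sketch of what the voice controller could look like. It is an assumption-laden illustration: `createVoiceController` is a hypothetical name, `req.file.buffer` assumes an upload middleware such as multer with in-memory storage, and the error payloads are invented for the example. Only `res.status`, `res.set`, `res.json`, and `res.send` are standard Express response methods.

```javascript
// Hypothetical controller: sits between the route and the voice pipeline service.
// `runVoicePipeline` is any async function resolving to an object with an `audio` Buffer.
function createVoiceController(runVoicePipeline) {
  return async function handleVoice(req, res) {
    // Assumption: upload middleware has parsed the multipart body into req.file.
    if (!req.file || !req.file.buffer) {
      res.status(400).json({ error: 'Expected an audio file upload' });
      return;
    }

    try {
      const { audio } = await runVoicePipeline(req.file.buffer);
      // The diagram labels the response as MPEG audio.
      res.set('Content-Type', 'audio/mpeg');
      res.send(audio);
    } catch (err) {
      // Keep upstream error details out of the client response.
      res.status(500).json({ error: 'Voice pipeline failed' });
    }
  };
}
```

Because the controller only receives the pipeline as a function, it can be tested with a fake `res` object and no running server.
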