A C++ voice assistant client designed for multi-platform deployment, from desktop to embedded systems (Raspberry Pi, ESP32, Arduino).
JotaClient is the client component of the Jota ecosystem - an AI-powered voice assistant platform. This implementation serves as a desktop prototype with a clear path for embedded deployment, demonstrating the core architecture for:
- Real-time audio capture and streaming
- WebSocket-based communication with a transcription server (in progress)
- Automatic silence detection
- Dual-mode operation (file recording + real-time streaming)
The system is designed to eventually run on embedded hardware (Raspberry Pi, ESP32, Arduino), enabling voice-controlled smart home and IoT applications.
graph TB
subgraph Client["JotaClient (C++)"]
Mic[Microphone] --> AudioListener[AudioListener<br/>PortAudio]
AudioListener --> FileMode[File Recording]
AudioListener --> StreamMode[Real-time Streaming]
FileMode --> WAV[WAV Files]
StreamMode --> Callback[Audio Callback]
end
subgraph Server["Transcription Server (Future)"]
WS[WebSocket Server] --> Whisper[Whisper.cpp<br/>Speech-to-Text]
Whisper --> LLM[LLM Integration<br/>GPT/Local Model]
LLM --> TTS[Text-to-Speech]
TTS --> Response[Audio Response]
end
Callback -->|Audio Chunks| WS
Response -->|Audio Stream| Client
style Server fill:#f0f0f0,stroke:#999,stroke-dasharray: 5 5
- ✅ Audio Capture: High-quality 16kHz mono audio recording using PortAudio
- ✅ Dual Mode Operation:
  - File-only: Record to WAV file
  - Stream-only: Real-time audio streaming via callbacks
  - Both: Simultaneous recording and streaming
- ✅ Silence Detection: Automatic recording stop after 1 second of silence
- ✅ Real-time Visualization: Volume level meter during recording
- ✅ Exception Handling: Custom exception hierarchy for robust error management
- ✅ Path Management: Organized data storage with automatic directory creation
- 🔄 ESP32/Arduino Support: Port to embedded hardware
- 🔄 WebSocket Client: Real-time audio streaming to server
- 🔄 Server Integration: Whisper.cpp transcription service
- 🔄 LLM Integration: GPT or local model for intelligent responses
- 🔄 TTS Playback: Audio response playback on device
- 🔄 Power Management: Optimized for battery-powered operation
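The real-time volume meter listed among the implemented features can be pictured as an RMS level mapped onto a text bar. The following is a minimal sketch under assumptions, not the project's actual rendering code: `computeRms`, `renderVolumeBar`, the bar width, and the `fullScale` value are all hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Root-mean-square level of one audio chunk (samples expected in [-1.0, 1.0]).
float computeRms(const std::vector<float>& chunk) {
    if (chunk.empty()) return 0.0f;
    double sumSquares = 0.0;
    for (float s : chunk) sumSquares += static_cast<double>(s) * s;
    return static_cast<float>(std::sqrt(sumSquares / static_cast<double>(chunk.size())));
}

// Render a fixed-width text bar such as "[#####-----]".
// fullScale is the (assumed) RMS level that fills the whole bar.
std::string renderVolumeBar(float rms, std::size_t width = 20, float fullScale = 0.5f) {
    const float ratio = std::clamp(rms / fullScale, 0.0f, 1.0f);
    const auto filled =
        static_cast<std::size_t>(std::lround(ratio * static_cast<float>(width)));
    return "[" + std::string(filled, '#') + std::string(width - filled, '-') + "]";
}
```

In a streaming callback, each incoming chunk would be passed through `computeRms` and the resulting bar printed on a single, repeatedly overwritten console line.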
JotaClient/
├── src/
│   ├── AudioListener.h/cpp          # Audio capture and streaming
│   ├── TranscriptionWrapper.h/cpp   # WebSocket client (in progress)
│   ├── main.cpp                     # Entry point with visualization
│   ├── exceptions/
│   │   ├── JotaException.h/cpp      # Base exception class
│   │   ├── AudioExceptions.h        # Audio-specific exceptions
│   │   └── FileExceptions.h         # File I/O exceptions
│   ├── types/
│   │   ├── DataTypes.h              # Data type enumerations
│   │   └── ErrorCodes.h             # Error code definitions
│   └── utils/
│       ├── PathManager.h/cpp        # Path resolution utilities
│       └── DataManager.h/cpp        # Data organization
├── scripts/
│   └── download_models.sh           # Download Whisper and VAD models
├── CMakeLists.txt                   # Build configuration
└── README.md                        # This file
- CMake 3.16 or higher
- C++17 compatible compiler (GCC, Clang, MSVC)
- PortAudio: Cross-platform audio I/O library
- libwebsockets: WebSocket client/server library (for future streaming)
macOS:
brew install cmake portaudio libwebsockets

Ubuntu/Debian:
sudo apt-get install cmake libportaudio2 portaudio19-dev libwebsockets-dev

Windows:
Use vcpkg or build dependencies from source.
git clone <repository-url>
cd JotaClient
mkdir build && cd build
cmake ..
make

# Debug build with symbols
cmake .. -DCMAKE_BUILD_TYPE=Debug
# Release build with optimizations
cmake .. -DCMAKE_BUILD_TYPE=Release
# Strict warnings (recommended for development)
cmake .. -DCMAKE_CXX_FLAGS="-Wall -Wextra -Werror"

./JotaClient

The application will:
- Start recording audio from the default microphone
- Display real-time volume visualization
- Automatically stop after 1 second of silence
- Save the recording to data/audio_recordings/YYYY-MM-DD_HH-MM-SS.wav
Press q + Enter to stop recording manually.
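The save step can be pictured as two small pieces: building the timestamped file name and converting the internal float samples to 16-bit PCM for the WAV payload. The sketch below is an illustration under assumptions, not the code in `AudioListener` or `PathManager`; `makeRecordingPath` and `toPcm16` are hypothetical helpers, and the scaling factor of 32767 is one common convention.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <ctime>
#include <string>
#include <vector>

// Build "data/audio_recordings/YYYY-MM-DD_HH-MM-SS.wav" from a broken-down time.
std::string makeRecordingPath(const std::tm& when) {
    char stamp[32];
    std::strftime(stamp, sizeof(stamp), "%Y-%m-%d_%H-%M-%S", &when);
    return std::string("data/audio_recordings/") + stamp + ".wav";
}

// Convert float samples in [-1.0, 1.0] to signed 16-bit PCM.
// Out-of-range input is clamped so it cannot wrap around.
std::vector<std::int16_t> toPcm16(const std::vector<float>& samples) {
    std::vector<std::int16_t> pcm;
    pcm.reserve(samples.size());
    for (float s : samples) {
        const float clamped = std::clamp(s, -1.0f, 1.0f);
        pcm.push_back(static_cast<std::int16_t>(std::lround(clamped * 32767.0f)));
    }
    return pcm;
}
```

A caller would typically obtain the `std::tm` from `std::time` and `std::localtime` at the moment recording starts, then write the converted samples after the WAV header.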
The AudioListener class supports three modes:
AudioListener listener;
// File only - record to WAV file
listener.start(AudioMode::FILE_ONLY);
// Stream only - real-time callback processing
listener.setCallback([](const std::vector<float>& chunk) {
// Process audio chunk
});
listener.start(AudioMode::STREAM_ONLY);
// Both - record AND stream simultaneously
listener.start(AudioMode::BOTH);

This project follows the Google C++ Style Guide, with these project-specific naming conventions:
- Class names: PascalCase
- Function names: camelCase
- Constants: UPPER_SNAKE_CASE
- Private members: trailing underscore (member_)
Custom exception hierarchy based on JotaException:
- AudioDeviceException: Audio hardware errors
- AudioStreamException: Audio stream errors
- FileWriteException: File I/O errors
- FileReadException: File reading errors
All exceptions include error codes and contextual information.
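To make the shape of such a hierarchy concrete, here is a minimal sketch. It is an assumption about how the pieces could fit together, not a copy of `JotaException.h`: the `ErrorCode` values, the constructor signatures, and the message format are hypothetical, and only two of the four derived classes are shown.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical error codes; the real definitions live in types/ErrorCodes.h.
enum class ErrorCode { AUDIO_DEVICE = 100, AUDIO_STREAM = 101, FILE_WRITE = 200, FILE_READ = 201 };

// Base class: carries an error code plus optional free-form context.
class JotaException : public std::runtime_error {
public:
    JotaException(ErrorCode code, const std::string& message, const std::string& context = "")
        : std::runtime_error(context.empty() ? message : message + " (" + context + ")"),
          code_(code) {}

    ErrorCode code() const noexcept { return code_; }

private:
    ErrorCode code_;
};

// Audio hardware errors.
class AudioDeviceException : public JotaException {
public:
    explicit AudioDeviceException(const std::string& context = "")
        : JotaException(ErrorCode::AUDIO_DEVICE, "Audio device error", context) {}
};

// File I/O errors while writing.
class FileWriteException : public JotaException {
public:
    explicit FileWriteException(const std::string& context = "")
        : JotaException(ErrorCode::FILE_WRITE, "File write error", context) {}
};
```

Catching `const JotaException&` at the top level handles every derived type, while `code()` lets the caller branch on the specific failure without string matching.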
- Create a feature branch
- Follow existing code structure
- Add appropriate exception handling
- Update documentation
- Test thoroughly before merging
- Implement audio streaming via WebSocket
- Send audio chunks in real-time to server
- Receive transcribed text from server
- Display transcription results
- Handle reconnection and error cases
- Integrate Porcupine Wakeword Detection (Next Step)
Important: Preliminary version. The current architecture is designed to be wakeword-only. The client starts in a LISTENING state and waits for a wakeword trigger (currently a placeholder) before connecting to the server and streaming audio. This keeps bandwidth usage low and protects privacy, because no audio leaves the device until the wakeword is detected.
Note: The server will handle Whisper transcription. LLM orchestration and TTS will be implemented later in a separate orchestrator component.
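The wakeword-only flow described above can be sketched as a small state machine. Only the LISTENING state is named in this README; the `CONNECTING` and `STREAMING` states, the event names, and the `nextState` function are assumptions made for illustration.

```cpp
// Client states. LISTENING comes from the description above; the rest are assumed.
enum class ClientState { LISTENING, CONNECTING, STREAMING };

// Hypothetical events that drive the transitions.
enum class ClientEvent { WAKEWORD_DETECTED, CONNECTED, UTTERANCE_ENDED, CONNECTION_LOST };

// Pure transition function: returns the state that follows `current` on `event`.
// Events that make no sense in the current state are ignored.
ClientState nextState(ClientState current, ClientEvent event) {
    switch (current) {
        case ClientState::LISTENING:
            // No audio is sent anywhere until the wakeword fires.
            if (event == ClientEvent::WAKEWORD_DETECTED) return ClientState::CONNECTING;
            break;
        case ClientState::CONNECTING:
            if (event == ClientEvent::CONNECTED) return ClientState::STREAMING;
            if (event == ClientEvent::CONNECTION_LOST) return ClientState::LISTENING;
            break;
        case ClientState::STREAMING:
            // Silence or a dropped connection both return the client to idle listening.
            if (event == ClientEvent::UTTERANCE_ENDED ||
                event == ClientEvent::CONNECTION_LOST) {
                return ClientState::LISTENING;
            }
            break;
    }
    return current;
}
```

Keeping the transition logic in a pure function makes it easy to unit-test independently of PortAudio and the WebSocket layer.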
- Port to Raspberry Pi platform
- Optimize for ARM architecture
- Power management optimization
- Create installation script
- Port to ESP32 platform
- Replace PortAudio with I2S microphone
- Optimize memory usage for embedded systems
- Implement WiFi configuration
- Add OTA update support
- Evaluate feasibility for Arduino Nano/Uno
- Extreme memory optimization if viable
- Consider ESP32 as primary embedded target
- Home automation commands
- Multi-device coordination
- Cloud synchronization
- Mobile app companion
- Sample Rate: 16,000 Hz (the input rate Whisper expects, and a common choice for speech recognition)
- Channels: Mono
- Format: 32-bit float (internal), 16-bit PCM (WAV output)
- Buffer Size: 512 frames
- Silence Threshold: 0.005 RMS
- Silence Timeout: 1000ms
- CPU Usage: ~5-10% on modern desktop (recording + visualization)
- Memory: ~10MB typical usage
- Latency: <50ms audio processing latency
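The silence parameters above combine as follows: at 16,000 Hz a 512-frame buffer covers 512 / 16000 = 32 ms, so a 1000 ms timeout is reached after 32 consecutive silent buffers (31 buffers cover only 992 ms). The sketch below shows one way to implement that check. `SilenceDetector` and its interface are hypothetical and not taken from `AudioListener`; only the default values come from the specification list.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical RMS-based silence detector using the documented defaults.
class SilenceDetector {
public:
    explicit SilenceDetector(float threshold = 0.005f, int timeoutMs = 1000,
                             int sampleRate = 16000)
        : threshold_(threshold),
          timeoutMs_(static_cast<std::size_t>(timeoutMs)),
          sampleRate_(static_cast<std::size_t>(sampleRate)) {
        assert(timeoutMs >= 0 && sampleRate > 0);
    }

    // Feed one chunk; returns true once continuous silence has reached the timeout.
    bool process(const std::vector<float>& chunk) {
        if (rms(chunk) < threshold_) {
            silentSamples_ += chunk.size();
        } else {
            silentSamples_ = 0;  // Any loud chunk restarts the countdown.
        }
        // silentSamples / sampleRate >= timeoutMs / 1000, kept in integer arithmetic.
        return silentSamples_ * 1000 >= timeoutMs_ * sampleRate_;
    }

private:
    static float rms(const std::vector<float>& chunk) {
        if (chunk.empty()) return 0.0f;
        double sumSquares = 0.0;
        for (float s : chunk) sumSquares += static_cast<double>(s) * s;
        return static_cast<float>(std::sqrt(sumSquares / static_cast<double>(chunk.size())));
    }

    float threshold_;
    std::size_t timeoutMs_;
    std::size_t sampleRate_;
    std::size_t silentSamples_ = 0;
};
```

Counting samples rather than wall-clock time keeps the detector deterministic and independent of callback scheduling jitter.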
Error: Failed to initialize PortAudio
Solution: Check that your microphone is connected and not in use by another application.
fatal error: portaudio.h: No such file or directory
Solution: Install PortAudio development headers (see Dependencies section).
Error: Failed to open audio stream
Solution: On macOS, grant microphone permissions in System Preferences → Security & Privacy → Privacy → Microphone.
This is a personal portfolio project by Sito.
Sito
Note: This is currently a desktop prototype. The multi-platform architecture is designed to be portable to Raspberry Pi, ESP32, and potentially Arduino with minimal modifications. See the development roadmap for platform-specific adaptation plans.
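One way to keep such a port small is to hide the capture backend behind an interface, so that PortAudio on the desktop and an I2S microphone on ESP32 become interchangeable implementations. The sketch below is an assumption about how that could look, not the current structure of the code: `AudioSource`, `BufferSource`, and `drain` are hypothetical names, and `BufferSource` is only a stand-in backend that replays a fixed buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical platform-neutral capture interface.
class AudioSource {
public:
    virtual ~AudioSource() = default;
    // Fill `out` with up to `frames` mono float samples.
    // Returns the number of samples read; 0 means no more input.
    virtual std::size_t read(std::vector<float>& out, std::size_t frames) = 0;
};

// Stand-in backend that replays a fixed buffer. A PortAudio or I2S backend
// would implement the same interface.
class BufferSource : public AudioSource {
public:
    explicit BufferSource(std::vector<float> samples) : samples_(std::move(samples)) {}

    std::size_t read(std::vector<float>& out, std::size_t frames) override {
        const std::size_t n = std::min(frames, samples_.size() - position_);
        const auto first = samples_.begin() + static_cast<std::ptrdiff_t>(position_);
        out.assign(first, first + static_cast<std::ptrdiff_t>(n));
        position_ += n;
        return n;
    }

private:
    std::vector<float> samples_;
    std::size_t position_ = 0;
};

// Platform-independent logic depends only on the interface.
// Here it simply counts how many samples the source delivers.
std::size_t drain(AudioSource& source, std::size_t framesPerChunk) {
    std::vector<float> chunk;
    std::size_t total = 0;
    while (std::size_t n = source.read(chunk, framesPerChunk)) {
        total += n;
    }
    return total;
}
```

With this split, silence detection, streaming, and file output would operate on `AudioSource` and remain unchanged when the backend is replaced.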