A modular Python-based voice assistant robot that uses AI for speech-to-text, natural language processing, and text-to-speech. The agent integrates with a backend API using x402 payment protocol for AI interactions.
- Voice Activity Detection (VAD): Automatically detects when you're speaking using multi-criteria analysis (RMS energy, zero-crossing rate, spectral centroid)
- Speech-to-Text: Transcribes your speech using the OpenAI Whisper API
- AI Chat Integration: Sends transcriptions to an AI backend API with x402 payment protocol support
- Text-to-Speech: Converts AI responses to natural-sounding speech using ElevenLabs
- Webcam Support: Vision module for camera input (extensible)
- Modular Architecture: Event-driven system with pluggable modules
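The multi-criteria VAD combines several frame-level audio features. The exact thresholds and weighting live in `modules/audio_input/wake_word_vad.py`; the sketch below only illustrates how the three named metrics can be computed with NumPy:

```python
import numpy as np

def frame_features(frame: np.ndarray, rate: int = 16000):
    """Compute three per-frame VAD criteria for a frame of float samples."""
    # RMS energy: overall loudness of the frame
    rms = float(np.sqrt(np.mean(frame ** 2)))
    # Zero-crossing rate: fraction of adjacent samples that change sign
    zcr = float(np.mean(np.abs(np.diff(np.signbit(frame).astype(int)))))
    # Spectral centroid: magnitude-weighted mean frequency of the frame
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    centroid = float(np.sum(freqs * spectrum) / np.sum(spectrum)) if spectrum.sum() > 0 else 0.0
    return rms, zcr, centroid

# A 437.5 Hz tone (exactly 14 cycles per 512-sample frame at 16 kHz)
t = np.arange(512) / 16000
rms, zcr, centroid = frame_features(0.5 * np.sin(2 * np.pi * 437.5 * t))
# rms ≈ 0.354 and centroid ≈ 437.5 Hz, well above typical silence values
```

Speech tends to show moderate RMS, low-to-moderate ZCR, and a centroid in the voice band, which is why combining the three criteria is more robust than a single energy threshold.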
The project uses an event-driven architecture with the following components:
```
                ┌─────────────────┐
                │    AgentCore    │  ← Main orchestrator
                └────────┬────────┘
                         │
                    ┌────┴────┐
                    │EventBus │  ← Event communication hub
                    └────┬────┘
                         │
        ┌────────────────┼────────────────┐
        │                │                │
┌───────▼──────┐  ┌──────▼───────┐  ┌─────▼─────────┐
│ WakeWordVAD  │  │ OpenAIWhisper│  │ ElevenLabsTTS │
│    Module    │─▶│     STT      │─▶│    Module     │
└──────────────┘  └──────────────┘  └───────────────┘
        │                │                │
        ▼                ▼                ▼
   Microphone      Transcription      Speakers
```
- WakeWordVADModule continuously listens for speech
- When speech is detected, it records audio until silence
- Emits `audio_ready_for_stt` event with audio data
- OpenAIWhisperSTTModule receives audio and transcribes it
- Emits `transcription` event with text
- AgentCore handles the transcription and sends it to the backend API
- The backend responds with AI-generated text
- AgentCore emits `agent_response` event
- ElevenLabsTTSModule converts the text to speech and plays it
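The pipeline above can be sketched with a toy event bus. This is not the project's `core/event_bus.py` (whose API may differ); it is a minimal asyncio illustration of how the `audio_ready_for_stt`, `transcription`, and `agent_response` events chain the modules together:

```python
import asyncio
from collections import defaultdict

class ToyEventBus:
    """Minimal pub/sub over asyncio queues, one queue per event type."""
    def __init__(self):
        self.queues = defaultdict(asyncio.Queue)

    def emit(self, event_type: str, data):
        self.queues[event_type].put_nowait(data)

    async def listen(self, event_type: str):
        return await self.queues[event_type].get()

async def demo():
    bus = ToyEventBus()
    bus.emit("audio_ready_for_stt", b"\x00\x01")     # VAD module records audio
    audio = await bus.listen("audio_ready_for_stt")  # STT module picks it up
    bus.emit("transcription", "hello robot")         # ...and emits the text
    text = await bus.listen("transcription")         # AgentCore sends to backend
    bus.emit("agent_response", f"You said: {text}")  # reply becomes TTS input
    return await bus.listen("agent_response")

print(asyncio.run(demo()))  # → You said: hello robot
```

Because each event type gets its own queue, modules stay decoupled: a producer never needs a reference to its consumer, only to the bus.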
- Python 3.8 or higher
- Microphone and speakers/headphones
- API keys for:
- OpenAI (for Whisper STT)
- ElevenLabs (for TTS)
- Backend API access
```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```bash
# Required: x402 payment mnemonic (seed phrase)
MNEMONIC=your twelve word seed phrase here

# Required: OpenAI API key for speech-to-text
OPENAI_API_KEY=sk-your-openai-api-key

# Required: ElevenLabs API key and voice ID for text-to-speech
ELEVENLABS_API_KEY=your-elevenlabs-api-key
ELEVENLABS_VOICE_ID=your-voice-id

# Optional: Agent configuration
AGENT_NAME=0xdacd02dd0ce8a923ad26d4c49bb94ece09306c3e  # Default Wiz token ID
SENDER_NAME=User
```

Run the agent:
```bash
python main.py
```

The agent will:
- Initialize all modules
- List available audio input devices
- Start listening for speech
- Process your voice input and respond with AI-generated speech
Press Ctrl+C to gracefully stop the agent and all modules.
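Before the modules start, it can help to fail fast on missing configuration. A small sketch along these lines (the variable names match the `.env` keys above; the helper itself is not part of the project):

```python
import os

# Required keys, matching the .env file described above
REQUIRED_VARS = ["MNEMONIC", "OPENAI_API_KEY", "ELEVENLABS_API_KEY", "ELEVENLABS_VOICE_ID"]

def check_env(env=os.environ):
    """Return the required variables that are missing or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

# Example: only one key set
print(check_env({"OPENAI_API_KEY": "sk-test"}))
# → ['MNEMONIC', 'ELEVENLABS_API_KEY', 'ELEVENLABS_VOICE_ID']
```

Calling such a check at the top of `main.py` and exiting with a clear message is friendlier than letting a module fail mid-startup with an authentication error.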
By default, the agent uses the system's default microphone. To specify a different device:
- Run the agent once to see available devices
- Edit `modules/audio_input/wake_word_vad.py` and set `device_index` in the `WakeWordVADModule` constructor
Example:
```python
WakeWordVADModule(
    event_bus,
    device_index=2,          # Use device index 2
    rate=16000,
    chunk_size=512,
    vad_threshold=0.01,
    silence_timeout_ms=1000
)
```

Adjust voice activity detection sensitivity in `main.py`:
```python
WakeWordVADModule(
    event_bus,
    vad_threshold=0.01,      # Lower = more sensitive (0.005-0.02)
    silence_timeout_ms=1000  # Milliseconds of silence before ending speech
)
```

Change the voice by updating `ELEVENLABS_VOICE_ID` in your `.env` file. You can find available voices in your ElevenLabs dashboard.
- AgentCore (`core/agent_core.py`): Main orchestrator that manages modules and handles events
- EventBus (`core/event_bus.py`): Event-driven communication system
- BackendConnector (`core/backend_connector.py`): Handles HTTP requests and x402 payments
- ModuleBase (`core/module_base.py`): Base class for all modules
- WakeWordVADModule (`modules/audio_input/wake_word_vad.py`): Voice activity detection using multi-criteria analysis
- MicrophoneModule (`modules/audio_input/microphone.py`): Basic microphone input (if needed)
- OpenAIWhisperSTTModule (`modules/ai/openai_whisper_stt.py`): Speech-to-text using the OpenAI Whisper API
- ElevenLabsTTSModule (`modules/ai/elevenlabs_tts.py`): Text-to-speech using the ElevenLabs API
- SpeakersModule (`modules/audio_output/speakers.py`): Audio playback through speakers
- WebCamModule (`modules/vision/web_cam.py`): Webcam input (extensible for future use)
- Create a new file in the appropriate module directory
- Inherit from `ModuleBase` or a more specific base class
- Implement the required methods: `start()`, `stop()`, `loop()`
- Use `event_bus.emit()` to send events
- Use `event_bus.listen()` to receive events
- Add your module to the `modules` list in `main.py`
Example:
```python
import asyncio

from core.module_base import ModuleBase


class MyModule(ModuleBase):
    async def start(self):
        self.running = True
        print("MyModule started")

    async def stop(self):
        self.running = False
        print("MyModule stopped")

    async def loop(self):
        while self.running:
            event = await self.event_bus.listen("my_event_type")
            # Process event...
            await asyncio.sleep(0.1)
```

Problem: No audio input detected
- Check microphone permissions
- Verify device index is correct
- Ensure microphone is not muted
- Try adjusting `vad_threshold` (lower = more sensitive)
Problem: Audio playback not working
- Check speakers/headphones are connected
- Verify audio output device in system settings
Problem: OpenAI API errors
- Verify `OPENAI_API_KEY` is set correctly
- Check the API key has sufficient credits
- Ensure internet connection is stable
Problem: ElevenLabs API errors
- Verify `ELEVENLABS_API_KEY` and `ELEVENLABS_VOICE_ID` are set
- Check the API key has sufficient credits
- Verify voice ID exists in your account
Problem: Backend connection errors
- Verify `BACKEND_URL` is correct
- Check the backend server is running
- Ensure `MNEMONIC` is set for x402 payments
- Check network connectivity
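Transient backend failures are often worth retrying with backoff before surfacing an error. A generic stdlib sketch (whether the project's `BackendConnector` retries is not specified; this helper is purely illustrative):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying on exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the error surface
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...

# Example: a flaky call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend unreachable")
    return "ok"

print(with_retries(flaky, base_delay=0.05))  # → ok
```

Wrapping the backend request in such a helper smooths over brief network blips while still raising a real error after the final attempt.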
```
python-robot/
├── core/                     # Core system components
│   ├── agent_core.py         # Main orchestrator
│   ├── backend_connector.py  # API and payment handling
│   ├── event_bus.py          # Event system
│   └── module_base.py        # Base module class
├── modules/                  # Pluggable modules
│   ├── ai/                   # AI services (STT, TTS)
│   ├── audio_input/          # Audio input modules
│   ├── audio_output/         # Audio output modules
│   └── vision/               # Vision modules
├── main.py                   # Entry point
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
- `eth-account`: Ethereum account management for x402 payments
- `x402`: x402 payment protocol client
- `python-dotenv`: Environment variable management
- `opencv-python`: Computer vision (for webcam)
- `pyaudio`: Audio I/O
- `numpy`: Numerical operations
- `openai`: OpenAI API client (Whisper)
- `elevenlabs`: ElevenLabs API client (TTS)
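For reference, a `requirements.txt` matching the list above might look like the following (unpinned; the project's actual file may pin specific versions):

```text
eth-account
x402
python-dotenv
opencv-python
pyaudio
numpy
openai
elevenlabs
```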
[Add your license here]
[Add contribution guidelines here]
For issues and questions, please [open an issue on GitHub] or contact [your contact information].