A production-grade, low-latency Voice AI system for telephony and video rooms. Powered by the VideoSDK Agents Framework and Gemini 2.5 Flash Native Audio.
This repository implements a Real-Time Telephony Agent designed for high-concurrency voice interactions. By integrating VideoSDK with Google's Gemini 2.5 Flash Native Audio model, the system achieves sub-second response times and human-like voice quality without the latency typical of traditional STT (Speech-to-Text) and TTS (Text-to-Speech) pipelines.
Unlike traditional voice bots that convert audio to text before processing, this agent uses native audio modalities. The model "hears" and "speaks" audio directly, which lets it capture emotional nuance, tone, and pacing in real time.
- Gemini 2.5 Flash Native Audio: Leveraging the latest multimodal preview models for direct audio-to-audio reasoning.
- VideoSDK Agent Pipeline: Automated room joining, media stream handling, and session management.
- High Concurrency Orchestration: Configured to handle up to 10 concurrent calls per worker instance, making it suitable for scalable customer support applications.
- Stateful Event Handling: Specialized logic for agent entry (`on_enter`) and exit (`on_exit`) to ensure a professional and friendly user experience.
- Telephony Routing: Registered with a unique `agent_id` for seamless global telephony routing and job dispatching.
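The concurrency cap and the entry/exit hooks listed above can be pictured with a small, library-free sketch. It deliberately does not call the VideoSDK API: `CallAgent`, `handle_call`, `simulate`, and `MAX_CONCURRENT_CALLS` are hypothetical stand-ins, and an `asyncio.Semaphore` plays the role of the worker's 10-call limit.

```python
import asyncio

MAX_CONCURRENT_CALLS = 10  # assumption: mirrors the per-worker limit described above


class CallAgent:
    """Hypothetical stand-in for an agent with entry and exit hooks."""

    def __init__(self, call_id: int) -> None:
        self.call_id = call_id
        self.transcript: list[str] = []

    async def on_enter(self) -> None:
        # A real agent would greet the caller here.
        self.transcript.append("greeting")

    async def on_exit(self) -> None:
        # A real agent would say goodbye here.
        self.transcript.append("farewell")


async def handle_call(call_id: int, limiter: asyncio.Semaphore, stats: dict) -> list[str]:
    # The semaphore blocks the 11th call until one of the first 10 finishes.
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        agent = CallAgent(call_id)
        try:
            await agent.on_enter()
            await asyncio.sleep(0.01)  # placeholder for the live conversation
        finally:
            # on_exit runs even if the conversation raises.
            await agent.on_exit()
            stats["active"] -= 1
        return agent.transcript


async def simulate(num_calls: int) -> tuple[int, list[list[str]]]:
    limiter = asyncio.Semaphore(MAX_CONCURRENT_CALLS)
    stats = {"active": 0, "peak": 0}
    transcripts = await asyncio.gather(
        *(handle_call(i, limiter, stats) for i in range(num_calls))
    )
    return stats["peak"], list(transcripts)


if __name__ == "__main__":
    peak, _ = asyncio.run(simulate(25))
    print(f"peak concurrent calls: {peak}")
```

In the actual project the limit is applied by the VideoSDK worker configuration rather than by a hand-written semaphore; the sketch only shows the behaviour that limit produces.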
- `main.py`: The core agent implementation, including model configuration and VideoSDK worker orchestration.
- `.env`: Management of secure credentials (VideoSDK Keys, Google API Keys).
- `requirements.txt`: Minimalist dependency list for optimized containerization.
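Because the agent cannot start without its credentials, a startup check on the `.env` contents can fail fast with a clear message. The following is a minimal sketch, assuming the three variable names used in this README's setup section; `parse_env_text` and `missing_keys` are hypothetical helpers, and how `main.py` actually loads the file is not shown here.

```python
from pathlib import Path

# Variable names taken from the setup section of this README.
REQUIRED_KEYS = ("VIDEO_SDK_API_KEY", "VIDEO_SDK_SECRET_KEY", "GOOGLE_API_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    missing = missing_keys(values)
    if missing:
        raise SystemExit(f"Missing credentials in .env: {', '.join(missing)}")
    print("All credentials present.")
```

A dedicated loader library would handle more edge cases (multi-line values, variable expansion); the hand-rolled parser keeps the sketch dependency-free.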
- Framework: VideoSDK Agents (RealTimePipeline).
- AI Model: Google Gemini 2.5 Flash (Native Audio Preview).
- Concurrency: Python `asyncio` & VideoSDK `WorkerJob`.
- Infrastructure: Telephony-to-WebRTC bridging via VideoSDK.
Add your credentials to the .env file:
VIDEO_SDK_API_KEY="your_api_key"
VIDEO_SDK_SECRET_KEY="your_secret_key"
GOOGLE_API_KEY="your_google_api_key"

# Setup environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

python main.py

- Telephony Ready: Built specifically for integration with phone numbers and SIP trunks.
- Low Latency: Optimized media pipelines ensure minimal "dead air" during conversations.
- Flexible Personality: Easy-to-configure system instructions for various industry personas (Support, Sales, Information).
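As an illustration of the configurable personality, the system instructions for each persona can be kept in a small lookup table and assembled at startup. This is only a sketch: `PERSONAS`, `build_instructions`, the prompt wording, and the company name are hypothetical, and the way the resulting string reaches the agent in `main.py` is not shown in this README.

```python
# Hypothetical persona prompts for the three examples named above.
PERSONAS = {
    "support": "You are a patient customer support agent who resolves issues step by step.",
    "sales": "You are a friendly sales representative who explains products clearly.",
    "information": "You are a concise information desk assistant who answers factual questions.",
}


def build_instructions(persona: str, company: str) -> str:
    """Assemble system instructions for the chosen persona."""
    key = persona.strip().lower()
    if key not in PERSONAS:
        known = ", ".join(sorted(PERSONAS))
        raise ValueError(f"Unknown persona '{persona}'. Choose one of: {known}")
    return (
        f"{PERSONAS[key]} You represent {company}. "
        "Keep replies brief, because this is a live phone call."
    )


if __name__ == "__main__":
    print(build_instructions("Support", "Example Corp"))
```

Keeping the prompts in data rather than in the agent class means a new persona can be added without touching the call-handling logic.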
"Redefining voice interaction through native audio intelligence and real-time orchestration."