VoiceAgents: Real-Time Telephony AI with VideoSDK

A production-grade, low-latency Voice AI system for telephony and video rooms. Powered by the VideoSDK Agents Framework and Gemini 2.5 Flash Native Audio.

Project Overview

This repository implements a real-time telephony agent designed for high-concurrency voice interactions. By pairing VideoSDK with Google's Gemini 2.5 Flash Native Audio model, the system targets sub-second response times and natural-sounding speech, avoiding the extra latency of a traditional pipeline that chains separate STT (Speech-to-Text) and TTS (Text-to-Speech) stages.

The Native Audio Advantage:

Unlike traditional voice bots that transcribe audio to text before processing it, this agent uses the model's native audio modality. The model "hears" and "speaks" audio directly, so it can pick up on emotional nuance, tone, and pacing in real time.

Key Technical Features

Gemini 2.5 Flash Native Audio: Leveraging the latest multimodal preview models for direct audio-to-audio reasoning.
VideoSDK Agent Pipeline: Automated room joining, media stream handling, and session management.
High Concurrency Orchestration: Configured to handle up to 10 concurrent calls per worker instance, making it suitable for scalable customer support applications.
Stateful Event Handling: Specialized logic for agent entry (on_enter) and exit (on_exit) to ensure a professional and friendly user experience.
Telephony Routing: Registered with a unique agent_id for seamless global telephony routing and job dispatching.
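
The concurrency cap and the entry/exit hooks described above can be illustrated with a small standard-library sketch. This is not the VideoSDK API: CallSession, handle_call, and run_worker are hypothetical stand-ins, and the asyncio.Semaphore is only one possible way to enforce a limit of 10 calls per worker. The framework's own worker presumably handles this internally.

```python
import asyncio

MAX_CONCURRENT_CALLS = 10  # mirrors the per-worker limit stated above


class CallSession:
    """Hypothetical stand-in for one call handled by the agent."""

    def __init__(self, call_id, log):
        self.call_id = call_id
        self.log = log

    async def on_enter(self):
        # Runs once when the agent joins the call (e.g. greet the caller).
        self.log.append(f"{self.call_id}: enter")

    async def on_exit(self):
        # Runs once when the call ends (e.g. say goodbye, clean up).
        self.log.append(f"{self.call_id}: exit")

    async def converse(self):
        # Placeholder for the real audio exchange with the model.
        await asyncio.sleep(0.01)


async def handle_call(call_id, slots, stats, log):
    # Wait for a free slot, so at most `limit` calls run at once.
    async with slots:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        session = CallSession(call_id, log)
        await session.on_enter()
        try:
            await session.converse()
        finally:
            # on_exit runs even if the conversation raises.
            await session.on_exit()
            stats["active"] -= 1


async def run_worker(call_ids, limit=MAX_CONCURRENT_CALLS):
    # Create the semaphore inside the running event loop.
    slots = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    log = []
    await asyncio.gather(
        *(handle_call(cid, slots, stats, log) for cid in call_ids)
    )
    return stats["peak"], log


if __name__ == "__main__":
    peak, log = asyncio.run(run_worker([f"call-{i}" for i in range(25)]))
    print(f"peak concurrent calls: {peak}, events logged: {len(log)}")
```

Calls beyond the limit wait for a free slot rather than being rejected. A real deployment might instead route overflow calls to another worker instance.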

Project Structure

main.py: The core agent implementation, including model configuration and VideoSDK worker orchestration.
.env: Management of secure credentials (VideoSDK Keys, Google API Keys).
requirements.txt: Minimalist dependency list for optimized containerization.

Technical Stack

Framework: VideoSDK Agents (RealTimePipeline).
AI Model: Google Gemini 2.5 Flash (Native Audio Preview).
Concurrency: Python asyncio & VideoSDK WorkerJob.
Infrastructure: Telephony-to-WebRTC bridging via VideoSDK.

Setup & Execution

1. Environment Configuration

Add your credentials to the .env file:

VIDEO_SDK_API_KEY="your_api_key"
VIDEO_SDK_SECRET_KEY="your_secret_key"
GOOGLE_API_KEY="your_google_api_key"
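
Before starting the worker, it can help to verify that all three credentials are present. The following is a minimal standard-library sketch, assuming the .env format shown above. The helper names (parse_env_text, missing_keys, load_credentials) are hypothetical, and main.py may load its configuration differently, for example with a dotenv library.

```python
import os
from pathlib import Path

REQUIRED_KEYS = ("VIDEO_SDK_API_KEY", "VIDEO_SDK_SECRET_KEY", "GOOGLE_API_KEY")


def parse_env_text(text):
    """Parse KEY="value" lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


def load_credentials(path=".env"):
    """Read credentials from the file, letting real env vars override it."""
    env_file = Path(path)
    values = parse_env_text(env_file.read_text()) if env_file.exists() else {}
    for key in REQUIRED_KEYS:
        if os.environ.get(key):
            values[key] = os.environ[key]
    missing = missing_keys(values)
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return values
```

Calling load_credentials() at startup fails fast with a clear message instead of surfacing an authentication error mid-call.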

2. Installation

# Setup environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Run the Agent Worker

python main.py

Why This Framework?

Telephony Ready: Built specifically for integration with phone numbers and SIP trunks.
Low Latency: Optimized media pipelines ensure minimal "dead air" during conversations.
Flexible Personality: Easy-to-configure system instructions for various industry personas (Support, Sales, Information).
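
One way to organize the configurable personas mentioned above is a small lookup table of system instructions. This is a sketch only: the Persona class, the PERSONAS table, the instruction texts, and get_instructions are all hypothetical and do not come from main.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    instructions: str  # system instructions passed to the agent


# Illustrative instruction texts; adjust to your own use case.
PERSONAS = {
    "support": Persona(
        "support",
        "You are a patient customer support agent. Keep answers short "
        "and confirm the caller's issue before proposing a fix.",
    ),
    "sales": Persona(
        "sales",
        "You are a friendly sales assistant. Ask about the caller's "
        "needs before describing any product.",
    ),
    "information": Persona(
        "information",
        "You are an information desk agent. Answer factual questions "
        "briefly and offer to repeat details.",
    ),
}


def get_instructions(persona_key, default="support"):
    """Return the instructions for a persona, falling back to the default."""
    persona = PERSONAS.get(persona_key.strip().lower(), PERSONAS[default])
    return persona.instructions
```

The returned string would then be supplied wherever the agent accepts its system instructions, so switching industries only requires changing the persona key.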

"Redefining voice interaction through native audio intelligence and real-time orchestration."
