MABSSSSS/vagents


VoiceAgents: Real-Time Telephony AI with VideoSDK

A production-grade, low-latency Voice AI system for telephony and video rooms. Powered by the VideoSDK Agents Framework and Gemini 2.5 Flash Native Audio.


Project Overview

This repository implements a real-time telephony agent designed for high-concurrency voice interactions. By pairing VideoSDK with Google's Gemini 2.5 Flash Native Audio model, the system targets sub-second response times and natural-sounding speech, avoiding the added latency of a traditional pipeline that chains STT (speech-to-text), a text model, and TTS (text-to-speech).

The Native Audio Advantage:

Unlike traditional voice bots, which convert audio to text before processing it, this agent uses the model's native audio modality. The model "hears" and "speaks" audio directly, which lets it pick up emotional nuance, tone, and pacing in real time.


Key Technical Features

  • Gemini 2.5 Flash Native Audio: Uses the native audio preview model for direct audio-to-audio reasoning.
  • VideoSDK Agent Pipeline: Automated room joining, media stream handling, and session management.
  • High Concurrency Orchestration: Configured to handle up to 10 concurrent calls per worker instance, making it suitable for scalable customer support applications.
  • Stateful Event Handling: Dedicated handlers for agent entry (on_enter) and exit (on_exit) keep the start and end of each call professional and friendly.
  • Telephony Routing: Registered with a unique agent_id for seamless global telephony routing and job dispatching.
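
The entry and exit hooks listed above can be pictured with a small, framework-free sketch. SketchAgent, say, and run_call are hypothetical stand-ins written for illustration; they are not the VideoSDK Agents classes, and the greeting and goodbye text is assumed rather than taken from main.py.

```python
import asyncio


class SketchAgent:
    """Hypothetical stand-in for the framework's agent base class."""

    def __init__(self, instructions):
        self.instructions = instructions
        self.spoken = []  # records what the agent would have said

    async def say(self, text):
        # A real agent would stream audio; this sketch only records the text.
        self.spoken.append(text)

    async def on_enter(self):
        # Runs once when the agent joins the call (assumed greeting).
        await self.say("Hello, thanks for calling. How can I help you today?")

    async def on_exit(self):
        # Runs once when the agent leaves the call (assumed goodbye).
        await self.say("Thanks for calling. Goodbye!")


async def run_call(agent):
    """Drive one call: greet on entry, always say goodbye on exit."""
    await agent.on_enter()
    try:
        await asyncio.sleep(0)  # placeholder for the live conversation
    finally:
        await agent.on_exit()
    return agent.spoken
```

Putting on_exit in a finally block is a design choice of this sketch: the goodbye still runs if the conversation step raises.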

Project Structure

  • main.py: The core agent implementation, including model configuration and VideoSDK worker orchestration.
  • .env: Stores secret credentials (VideoSDK keys, Google API key).
  • requirements.txt: Minimal dependency list, which helps keep container images small.

Technical Stack

  • Framework: VideoSDK Agents (RealTimePipeline).
  • AI Model: Google Gemini 2.5 Flash (Native Audio Preview).
  • Concurrency: Python asyncio & VideoSDK WorkerJob.
  • Infrastructure: Telephony-to-WebRTC bridging via VideoSDK.
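
To illustrate what a per-worker limit of 10 concurrent calls means in asyncio terms, here is a minimal sketch using asyncio.Semaphore. It is not how VideoSDK enforces the limit internally; handle_call, run_worker, and the stats dictionary are hypothetical names introduced for this example.

```python
import asyncio

MAX_CONCURRENT_CALLS = 10  # the per-worker limit stated in this README


async def handle_call(call_id, limiter, stats):
    """Simulate one call while holding a slot from the limiter."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for the live conversation
        stats["active"] -= 1
    return call_id


async def run_worker(num_calls, limit=MAX_CONCURRENT_CALLS):
    """Start num_calls calls but let at most `limit` run at the same time."""
    limiter = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    done = await asyncio.gather(
        *(handle_call(i, limiter, stats) for i in range(num_calls))
    )
    return done, stats["peak"]
```

Running asyncio.run(run_worker(25)) completes all 25 simulated calls, with extra calls waiting for a free slot instead of being rejected.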

Setup & Execution

1. Environment Configuration

Add your credentials to the .env file:

VIDEO_SDK_API_KEY="your_api_key"
VIDEO_SDK_SECRET_KEY="your_secret_key"
GOOGLE_API_KEY="your_google_api_key"
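
A missing key usually surfaces only once a call is already in progress, so it can help to check the three variables at startup. The following is a minimal sketch, assuming the values have already been loaded into the process environment; missing_credentials and require_credentials are hypothetical helpers, not part of main.py or VideoSDK.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ("VIDEO_SDK_API_KEY", "VIDEO_SDK_SECRET_KEY", "GOOGLE_API_KEY")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_credentials(env=None):
    """Raise a clear error at startup instead of failing mid-call."""
    missing = missing_credentials(env)
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
```

Calling require_credentials() before starting the worker turns a misconfigured .env into an immediate, readable error.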

2. Installation

# Create and activate a virtual environment, then install dependencies
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

3. Run the Agent Worker

python main.py

Why This Framework?

  • Telephony Ready: Built specifically for integration with phone numbers and SIP trunks.
  • Low Latency: Optimized media pipelines ensure minimal "dead air" during conversations.
  • Flexible Personality: Easy-to-configure system instructions for various industry personas (Support, Sales, Information).
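
As a rough illustration of persona-specific system instructions, the sketch below keeps one instruction string per persona and selects it by name. The PERSONAS table, its wording, the build_instructions helper, and the "Example Co" default are all hypothetical; the instructions actually used in main.py may look quite different.

```python
# Illustrative persona prompts; the wording is an assumption, not from main.py.
PERSONAS = {
    "support": "You are a patient customer support agent who resolves issues step by step.",
    "sales": "You are a friendly sales agent who asks about needs before suggesting products.",
    "information": "You are a concise information desk agent who answers factual questions.",
}


def build_instructions(persona, company="Example Co"):
    """Return the system instructions for a persona, or fail on unknown names."""
    key = persona.strip().lower()
    if key not in PERSONAS:
        raise ValueError(f"Unknown persona: {persona!r}")
    return (
        f"{PERSONAS[key]} You represent {company}. "
        "Keep replies short and natural for a phone call."
    )
```

The resulting string would then be passed to the agent as its system instructions, so switching personas becomes a one-word configuration change.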

"Redefining voice interaction through native audio intelligence and real-time orchestration."
