# LiveKit Voice Agent with Comprehensive Metrics

This notebook contains a complete voice agent implementation that:
- Listens to user speech (STT - Speech to Text)
- Processes text with AI (LLM - Large Language Model)
- Responds with synthesized speech (TTS - Text to Speech)
- Tracks detailed performance metrics

## Required API Keys
Before running, make sure you have:
- OpenAI API key (for GPT and Whisper)
- ElevenLabs API key (for voice synthesis)

Add these to a `.env` file in your project directory.

## Installation Requirements

Run this cell first to install required packages:

In [None]:
# Install required packages
%pip install livekit-agents livekit-plugins-openai livekit-plugins-elevenlabs livekit-plugins-silero python-dotenv

## Environment Setup and Configuration

In [None]:
import logging
import os
from dotenv import load_dotenv

# Load environment variables from the specified .env file
# This file should contain your API keys (OPENAI_API_KEY, ELEVEN_API_KEY)
# The override=True parameter ensures that environment variables are 
# overwritten if they're already set
load_dotenv(dotenv_path=".env", override=True)

# Configure logging for the DeepLearning.AI agent
# This helps track the agent's behavior and debug issues
logger = logging.getLogger("dlai-agent")
logger.setLevel(logging.INFO)

# Add a console handler for notebook output.
# The guard prevents duplicate log lines when this cell is re-run.
if not logger.handlers:
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    console_handler.setFormatter(formatter)
    logger.addHandler(console_handler)

print("Environment setup complete!")

## Import LiveKit Voice Agent Components

In [None]:
from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, jupyter
from livekit.plugins import (
    openai,      # For LLM (Language Model) and STT (Speech-to-Text)
    elevenlabs,  # For TTS (Text-to-Speech) - high quality voice synthesis
    silero,      # For VAD (Voice Activity Detection) - detects when user is speaking
)
from livekit.agents.metrics import LLMMetrics, STTMetrics, TTSMetrics, EOUMetrics
import asyncio

print("LiveKit imports successful!")

## Main Voice Agent Class with Comprehensive Metrics Tracking

In [None]:
class MetricsAgent(Agent):
    """
    A voice agent that can:
    1. Listen to user speech (STT - Speech to Text)
    2. Process the text with an AI model (LLM - Large Language Model)
    3. Respond with synthesized speech (TTS - Text to Speech)
    4. Track detailed performance metrics for all components
    
    This agent is designed for real-time voice conversations with comprehensive
    performance monitoring to help optimize latency and quality.
    """
    
    def __init__(self) -> None:
        """
        Initialize the voice agent with all necessary components.
        
        Components initialized:
        - LLM: Language model for generating responses
        - STT: Speech-to-text for understanding user input
        - TTS: Text-to-speech for voice responses
        - VAD: Voice activity detection for better conversation flow
        """
        
        # API KEY VALIDATION
        
        # Check if OpenAI API key is available
        # This is required for both LLM (GPT) and STT (Whisper) functionality
        if not os.getenv("OPENAI_API_KEY"):
            raise ValueError(
                "OPENAI_API_KEY environment variable is not set. "
                "Please add your OpenAI API key to the .env file."
            )
        
        # Check if ElevenLabs API key is available
        # This is required for high-quality text-to-speech synthesis
        if not os.getenv("ELEVEN_API_KEY"):
            raise ValueError(
                "ELEVEN_API_KEY environment variable is not set. "
                "Please add your ElevenLabs API key to the .env file."
            )
        
        # COMPONENT INITIALIZATION
        
        # Initialize the Language Model (LLM)
        # Using gpt-4o-mini for cost efficiency while maintaining good quality
        # Alternative: gpt-4o for higher quality but increased cost
        llm = openai.LLM(model="gpt-4o-mini")
        
        # Initialize Speech-to-Text (STT)
        # Whisper-1 is OpenAI's speech recognition model
        # It supports multiple languages and handles various audio qualities well
        stt = openai.STT(model="whisper-1")
        
        # Initialize Text-to-Speech (TTS)
        # ElevenLabs provides high-quality, natural-sounding voice synthesis
        try:
            tts = elevenlabs.TTS()
        except Exception as e:
            # Log the error with detailed information for debugging
            logger.error("Failed to initialize ElevenLabs TTS: %s", e)
            logger.error("This usually indicates an invalid API key or a network issue")
            raise
        
        # Initialize Voice Activity Detection (VAD)
        # Silero VAD helps detect when the user is speaking vs. silent
        # This improves conversation flow by reducing false triggers
        silero_vad = silero.VAD.load()
        
        # AGENT INITIALIZATION
        
        # Initialize the parent Agent class with all components
        super().__init__(
            # Instructions define the agent's personality and behavior
            instructions=(
                "You are a helpful assistant communicating via voice. "
                "Keep responses concise and natural for spoken conversation. "
                "Avoid overly long responses that might lose the user's attention. "
                "Be conversational and friendly."
            ),
            stt=stt,           # Speech-to-text component
            llm=llm,           # Language model component
            tts=tts,           # Text-to-speech component
            vad=silero_vad,    # Voice activity detection component
        )
        
        # Set up metrics collection for performance monitoring
        self._setup_metrics_callbacks()
    
    def _setup_metrics_callbacks(self):
        """
        Set up event listeners for metrics collection from all components.
        
        This allows us to track:
        - LLM performance (token usage, tokens per second, time to first token)
        - STT performance (processing time, audio duration)
        - TTS performance (time to first byte, synthesis time, audio duration)
        - End-of-utterance detection delay
        
        Metrics help identify bottlenecks and optimize the user experience.
        """
        
        # LLM Metrics Callback
        # Tracks language model performance (tokens, speed, time-to-first-token)
        def llm_metrics_wrapper(metrics: LLMMetrics):
            # Use asyncio.create_task to handle async callback without blocking
            asyncio.create_task(self.on_llm_metrics_collected(metrics))
        self.llm.on("metrics_collected", llm_metrics_wrapper)
        
        # STT Metrics Callback
        # Tracks speech-to-text performance (processing time, audio duration)
        def stt_metrics_wrapper(metrics: STTMetrics):
            asyncio.create_task(self.on_stt_metrics_collected(metrics))
        self.stt.on("metrics_collected", stt_metrics_wrapper)
        
        # End-of-Utterance (EOU) Metrics Callback
        # Tracks how quickly the system detects when user stops speaking
        def eou_metrics_wrapper(metrics: EOUMetrics):
            asyncio.create_task(self.on_eou_metrics_collected(metrics))
        self.stt.on("eou_metrics_collected", eou_metrics_wrapper)
        
        # TTS Metrics Callback
        # Tracks text-to-speech performance (synthesis speed, audio generation time)
        def tts_metrics_wrapper(metrics: TTSMetrics):
            asyncio.create_task(self.on_tts_metrics_collected(metrics))
        self.tts.on("metrics_collected", tts_metrics_wrapper)
    
    # METRICS COLLECTION HANDLERS
    
    async def on_llm_metrics_collected(self, metrics: LLMMetrics) -> None:
        """
        Handle Language Model metrics collection.
        
        Key metrics tracked:
        - prompt_tokens: Number of tokens in the prompt sent to the model
          (instructions and conversation history as well as the user's input)
        - completion_tokens: Number of tokens in the AI's response
        - tokens_per_second: Speed of token generation (higher = faster)
        - ttft: Time To First Token (lower = more responsive)
        
        These metrics help optimize cost (token usage) and user experience (speed).
        """
        logger.info(
            "LLM Metrics - Prompt: %d tokens, Completion: %d tokens, "
            "Speed: %.4f tok/s, TTFT: %.4f s",
            metrics.prompt_tokens, 
            metrics.completion_tokens, 
            metrics.tokens_per_second, 
            metrics.ttft
        )
    
    async def on_stt_metrics_collected(self, metrics: STTMetrics) -> None:
        """
        Handle Speech-to-Text metrics collection.
        
        Key metrics tracked:
        - duration: Total time for speech recognition process
        - audio_duration: Length of the audio that was processed
        - streamed: Whether the transcription was streamed (real-time) or batch
        
        These metrics help optimize transcription latency.
        """
        logger.info(
            "STT Metrics - Processing: %.4f s, Audio Length: %.4f s, "
            "Streamed: %s",
            metrics.duration, 
            metrics.audio_duration, 
            "Yes" if metrics.streamed else "No"
        )
    
    async def on_eou_metrics_collected(self, metrics: EOUMetrics) -> None:
        """
        Handle End-of-Utterance metrics collection.
        
        Key metrics tracked:
        - end_of_utterance_delay: Time to detect user stopped speaking
        - transcription_delay: Time from end of speech to transcription completion
        
        These metrics are crucial for conversation flow - shorter delays mean
        more natural conversation rhythm.
        """
        logger.info(
            "EOU Metrics - End Detection: %.4f s, Transcription Delay: %.4f s",
            metrics.end_of_utterance_delay, 
            metrics.transcription_delay
        )
    
    async def on_tts_metrics_collected(self, metrics: TTSMetrics) -> None:
        """
        Handle Text-to-Speech metrics collection.
        
        Key metrics tracked:
        - ttfb: Time To First Byte of audio (lower = more responsive)
        - duration: Total time for speech synthesis
        - audio_duration: Length of the generated audio
        - streamed: Whether audio was streamed or generated as a complete file
        
        These metrics help optimize how quickly the agent starts and finishes speaking.
        """
        logger.info(
            "TTS Metrics - TTFB: %.4f s, Processing: %.4f s, "
            "Audio Length: %.4f s, Streamed: %s",
            metrics.ttfb, 
            metrics.duration, 
            metrics.audio_duration, 
            "Yes" if metrics.streamed else "No"
        )

print("MetricsAgent class defined successfully!")
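
The wrappers in `_setup_metrics_callbacks` schedule each async handler with `asyncio.create_task` and discard the returned task. The asyncio documentation recommends keeping a reference to such tasks so they cannot be garbage-collected before they finish. The following is a minimal sketch of that pattern, not part of the original agent: `spawn`, `_background_tasks`, `handle_metrics`, and `demo` are hypothetical names, and `handle_metrics` is only a stand-in for a handler such as `on_llm_metrics_collected`.

```python
import asyncio

# Hypothetical registry that holds strong references to in-flight tasks.
_background_tasks: set = set()


def spawn(coro) -> asyncio.Task:
    """Schedule a coroutine and keep a reference until it completes."""
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    # Drop the reference once the task is done so the set does not grow.
    task.add_done_callback(_background_tasks.discard)
    return task


async def handle_metrics(store: list, value: float) -> None:
    """Stand-in for an async metrics handler; records the value it receives."""
    await asyncio.sleep(0)
    store.append(value)


async def demo() -> list:
    """Fire two synchronous callbacks that each schedule an async handler."""
    seen: list = []

    def on_metrics(value: float) -> None:
        # Synchronous callback, as an event emitter would call it.
        spawn(handle_metrics(seen, value))

    on_metrics(0.25)
    on_metrics(0.5)
    # Wait for the scheduled handlers before inspecting the results.
    await asyncio.gather(*list(_background_tasks))
    return seen
```

In a notebook, `await demo()` runs the sketch on the existing event loop. Inside the agent, the same idea would mean calling a helper like `spawn` from each wrapper instead of calling `asyncio.create_task` directly.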

## Main Entrypoint Function

In [None]:
async def entrypoint(ctx: JobContext):
    """
    Main entrypoint function for the voice agent.
    
    This function is called by LiveKit when a new session starts.
    It handles:
    1. Connecting to the LiveKit room
    2. Creating and starting the agent session
    3. Error handling and logging
    
    Args:
        ctx (JobContext): LiveKit job context containing room and connection info
    """
    try:
        # Connect to the LiveKit room
        # This establishes the WebRTC connection for real-time audio
        await ctx.connect()
        logger.info("Successfully connected to LiveKit room")
        
        # Create a new agent session
        # This manages the lifecycle of the conversation
        session = AgentSession()
        
        # Create an instance of our MetricsAgent
        agent = MetricsAgent()
        logger.info("MetricsAgent initialized successfully")
        
        # Start the agent session
        # This begins listening for audio and handling conversations
        await session.start(
            agent=agent,
            room=ctx.room,
        )
        logger.info("Agent session started successfully - ready for conversations")
        
    except Exception as e:
        # Log any errors that occur while connecting or starting the session
        logger.error("Error in entrypoint: %s", e)
        logger.error("This could be due to network issues, invalid API keys, or room connection problems")
        raise

print("Entrypoint function defined!")

## Pre-flight Checks and Agent Startup

In [None]:
# PRE-FLIGHT CHECKS

# Verify OpenAI API key is available
if not os.getenv("OPENAI_API_KEY"):
    print("ERROR: OPENAI_API_KEY not found in environment variables")
    print("Please add your OpenAI API key to the .env file:")
    print("OPENAI_API_KEY=your_api_key_here")
else:
    print("✅ OpenAI API key found")

# Verify ElevenLabs API key is available
if not os.getenv("ELEVEN_API_KEY"):
    print("ERROR: ELEVEN_API_KEY not found in environment variables")
    print("Please add your ElevenLabs API key to the .env file:")
    print("ELEVEN_API_KEY=your_api_key_here")
else:
    print("✅ ElevenLabs API key found")

# Check if both keys are available
if os.getenv("OPENAI_API_KEY") and os.getenv("ELEVEN_API_KEY"):
    print("\n✅ All API keys loaded successfully")
    print("🚀 Ready to start voice agent...")
else:
    print("\n❌ Please set up your API keys before proceeding")
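
The per-key checks above repeat the same test for each variable. As a hedged alternative, the sketch below collapses them into one helper that reports which keys are missing. `REQUIRED_KEYS`, `missing_keys`, and `preflight_report` are hypothetical names introduced for illustration; only the two environment variable names come from this notebook.

```python
import os
from typing import Iterable, List, Mapping

# Variable names used by this notebook; the helpers below are illustrative only.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ELEVEN_API_KEY")


def missing_keys(
    required: Iterable[str] = REQUIRED_KEYS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


def preflight_report(env: Mapping[str, str] = os.environ) -> str:
    """Build a one-line status message without revealing any key values."""
    missing = missing_keys(env=env)
    if missing:
        return "Missing keys: " + ", ".join(missing)
    return "All API keys loaded"


print(preflight_report())
```

Passing `env` explicitly keeps the helper easy to check with a plain dictionary, and the report never echoes the key values themselves.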

## Start the Voice Agent

**Important:** This cell will start the voice agent and open a web interface for testing. Make sure your API keys are properly configured before running this cell.

In [None]:
# Only run if API keys are available
if os.getenv("OPENAI_API_KEY") and os.getenv("ELEVEN_API_KEY"):
    print("Starting LiveKit Voice Agent...")
    print("This will open a web interface for testing the voice agent.")
    
    # Run the LiveKit agent application
    # WorkerOptions configures how the agent worker behaves
    # The jupyter_url provides a web interface for testing and debugging
    try:
        jupyter.run_app(
            WorkerOptions(entrypoint_fnc=entrypoint), 
            jupyter_url="https://jupyter-api-livekit.vercel.app/api/join-token"
        )
    except KeyboardInterrupt:
        print("\nAgent stopped by user")
    except Exception as e:
        print(f"\nError starting agent: {e}")
        print("Check your internet connection and API keys")
else:
    print("Please configure your API keys first!")
    print("Add them to a .env file in the same directory as this notebook:")
    print("OPENAI_API_KEY=your_openai_key_here")
    print("ELEVEN_API_KEY=your_elevenlabs_key_here")

## Usage Instructions

1. **Setup Environment:**
   - Create a `.env` file in the same directory as this notebook
   - Add your API keys:
     ```
     OPENAI_API_KEY=your_openai_api_key_here
     ELEVEN_API_KEY=your_elevenlabs_api_key_here
     ```

2. **Run the Cells:**
   - Execute cells in order from top to bottom
   - The installation cell only needs to be run once
   - The final cell will start the voice agent

3. **Test the Agent:**
   - When the agent starts, it will provide a web interface URL
   - Click the URL to open the testing interface
   - Allow microphone access when prompted
   - Start speaking to interact with the voice agent

4. **Monitor Metrics:**
   - Watch the notebook output for detailed performance metrics
   - Metrics include response times, token usage, and audio processing stats
   - Use these metrics to optimize performance
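
One way to act on the logged metrics is to combine them per conversational turn. The sketch below assumes that the delay a user perceives is roughly the sum of `end_of_utterance_delay`, the LLM `ttft`, and the TTS `ttfb`, all in seconds. `TurnLatency` is a hypothetical container, not a LiveKit class, and the numbers are made-up sample values rather than measurements.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """Hypothetical per-turn record; all values are in seconds."""

    end_of_utterance_delay: float  # from EOUMetrics
    ttft: float                    # from LLMMetrics
    ttfb: float                    # from TTSMetrics

    def total(self) -> float:
        """Approximate delay between end of user speech and first agent audio."""
        return self.end_of_utterance_delay + self.ttft + self.ttfb

    def bottleneck(self) -> str:
        """Name of the stage that contributes the most delay."""
        stages = {
            "end_of_utterance": self.end_of_utterance_delay,
            "llm": self.ttft,
            "tts": self.ttfb,
        }
        return max(stages, key=stages.get)


# Sample values for illustration only.
turn = TurnLatency(end_of_utterance_delay=0.5, ttft=0.25, ttfb=0.125)
print(f"total={turn.total():.3f}s bottleneck={turn.bottleneck()}")
```

Tracking the largest contributor per turn gives a starting point for tuning: a dominant end-of-utterance delay points at turn detection, while a dominant `ttft` or `ttfb` points at the LLM or TTS provider.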

## Troubleshooting

- **API Key Errors:** Double-check your `.env` file format and key validity
- **Connection Issues:** Make sure you have a stable internet connection
- **Audio Problems:** Check browser microphone permissions
- **Performance Issues:** Monitor the metrics output for bottlenecks