## **Building Your Voice-to-Voice Agent with Azure AI Speech and AOAI**

This notebook provides a step-by-step guide to creating a voice-to-voice agent using Azure AI Speech services and Azure OpenAI. It walks you through configuring speech recognition, integrating external tools, and generating human-like responses for real-time interactions. The pipeline has four stages:

1. **Audio Ingestion**: Captures live audio from your microphone.  
2. **Azure Speech-to-Text (STT)**: Converts live audio into transcribed text for LLM processing.  
3. **Azure OpenAI with Function Calling & Streaming**: Understands user intent, routes queries, and dynamically calls backend tools in real time.  
4. **Azure Text-to-Speech (TTS)**: Delivers natural, empathetic voice responses back to the user in chunks.  
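
To make the data flow between these stages concrete, here is a minimal sketch of a single conversational turn. The function `run_turn` and the three `fake_*` stages are hypothetical stand-ins invented for illustration; they are not part of the ARTAgent framework or of any Azure SDK, and the real notebook streams audio and text rather than passing whole values around.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one turn: captured audio -> transcript -> reply text -> reply audio."""
    user_text = transcribe(audio)    # stage 2: speech-to-text
    reply_text = respond(user_text)  # stage 3: LLM (tool calls omitted here)
    return synthesize(reply_text)    # stage 4: text-to-speech


# Stand-in stages (assumptions) so the sketch runs without any Azure resources.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"where is my order", fake_stt, fake_llm, fake_tts)
print(reply_audio)
```

Passing the stages in as callables keeps the turn logic testable on its own; each stand-in can later be swapped for a real client.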

## **Prerequisites & Environment Setup**

Before we start building our first ARTAgent, make sure you have the following set up:

**🔧 Environment Setup**

1. **Python 3.11+** - Required for the ARTAgent framework
2. **Dependencies** - Install the required packages:

```bash
pip install -r requirements.txt
```

**☁️ Required Azure Services**

This notebook requires **two main Azure services** to function properly:

### **1. Azure Speech Services** 🎤🔊
For Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities.

**Create Azure Speech Service:**
- 🔗 **Azure Portal**: [Create Speech Service](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)
- 📖 **Documentation**: [Speech Service Setup Guide](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/overview)

**What you'll need from this service:**
- API Key (`AZURE_SPEECH_KEY`)
- Endpoint URL (`AZURE_SPEECH_ENDPOINT`)
- Region (`AZURE_SPEECH_REGION`)

### **2. Azure OpenAI Service** 🤖
For GPT-4o chat completions with streaming and function calling.

**Create Azure OpenAI Service:**
- 🔗 **Azure Portal**: [Create Azure OpenAI](https://portal.azure.com/#create/Microsoft.CognitiveServicesOpenAI)
- 📖 **Documentation**: [Azure OpenAI Setup Guide](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource)

**Required Model Deployment:**
- Deploy the **GPT-4o-mini** (or GPT-4o) model in your Azure OpenAI resource, or use another chat model that supports streaming and function calling.
- 📖 **Model Deployment Guide**: [Deploy Models in Azure OpenAI](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource#deploy-a-model)

**What you'll need from this service:**
- API Key (`AZURE_OPENAI_KEY`)
- Endpoint URL (`AZURE_OPENAI_ENDPOINT`)
- Deployment Name (`AZURE_OPENAI_CHAT_DEPLOYMENT_ID`)

**🔐 Security & Environment Variables**

**IMPORTANT:** Never hardcode API keys in your notebooks or code files. Always use environment variables or `.env` files.

**Option 1: Create a `.env` file in the main directory**

Create a file named `.env` in the project root directory with the following structure:

```bash
# Azure Speech Services
AZURE_SPEECH_KEY=your_azure_speech_key_here
AZURE_SPEECH_ENDPOINT=https://your-speech-service.cognitiveservices.azure.com
AZURE_SPEECH_REGION=eastus

# Azure OpenAI Services  
AZURE_OPENAI_KEY=your_azure_openai_key_here
AZURE_OPENAI_ENDPOINT=https://your-openai-service.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-12-01-preview
AZURE_OPENAI_CHAT_DEPLOYMENT_ID=gpt-4o-mini
```

**Option 2: Set system environment variables**

You can also set these as system environment variables in your operating system.

**📂 Project Structure**

A later cell (under **Define Clients**) sets the working directory to the project root so the notebook can import the ARTAgent framework and its dependencies.

**🛡️ Security Best Practices**

- ✅ Use environment variables or `.env` files for sensitive data
- ✅ Add `.env` to your `.gitignore` file
- ✅ Use different keys for development, staging, and production
- ❌ Never commit API keys to version control
- ❌ Never share API keys in screenshots or documentation

**💰 Cost Considerations**

- **Azure Speech Services**: Pay-per-use pricing for STT/TTS operations
- **Azure OpenAI**: Pay-per-token pricing for GPT model usage
- 📖 **Pricing Details**: [Azure Speech Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/) | [Azure OpenAI Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/)

In [1]:
# 🔐 Load Environment Variables Securely
import os
from dotenv import load_dotenv

# Load environment variables from .env file if it exists
load_dotenv()

# Required environment variables for Azure services
REQUIRED_ENV_VARS = [
    "AZURE_SPEECH_ENDPOINT", 
    "AZURE_SPEECH_REGION",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_ID"
]

# Validate that all required environment variables are set
missing_vars = []
for var in REQUIRED_ENV_VARS:
    if not os.getenv(var):
        missing_vars.append(var)

if missing_vars:
    print("❌ Missing required environment variables:")
    for var in missing_vars:
        print(f"   - {var}")
    print("\n💡 Please set these variables in your .env file or system environment.")
    print("📖 See the previous cell for instructions on setting up environment variables.")
else:
    print("✅ All required environment variables are set!")
    print("🔒 API keys are properly loaded from environment variables.")
    
# Display non-sensitive configuration for verification
print(f"\n📋 Configuration Summary:")
print(f"   Azure Speech Region: {os.getenv('AZURE_SPEECH_REGION', 'Not set')}")
print(f"   Azure OpenAI Endpoint: {os.getenv('AZURE_OPENAI_ENDPOINT', 'Not set')}")
print(f"   Azure Speech TTS Endpoint: {os.getenv('AZURE_SPEECH_ENDPOINT', 'Not set')}")
print(f"   OpenAI API Version: {os.getenv('AZURE_OPENAI_API_VERSION', 'Not set')}")
print(f"   Chat Deployment ID: {os.getenv('AZURE_OPENAI_CHAT_DEPLOYMENT_ID', 'Not set')}")
print(f"   🔐 API Keys: {'✅ Loaded' if os.getenv('AZURE_OPENAI_KEY') else '❌ Missing'}")

✅ All required environment variables are set!
🔒 API keys are properly loaded from environment variables.

📋 Configuration Summary:
   Azure Speech Region: eastus
   Azure OpenAI Endpoint: https://aoai-ai-factory-eus-dev.openai.azure.com/
   Azure Speech TTS Endpoint: https://azure-ai-services-eastus-test.cognitiveservices.azure.com/
   OpenAI API Version: 2024-12-01-preview
   Chat Deployment ID: gpt-4o-mini
   🔐 API Keys: ✅ Loaded


## **Test Audio Capture from Your Microphone**

In [2]:
import pyaudio


def list_audio_devices():
    """
    List all available audio devices using PyAudio.

    This function initializes PyAudio, retrieves the list of audio devices,
    and prints their names. It also includes error handling to ensure proper
    cleanup of resources.
    """
    p = None
    try:
        p = pyaudio.PyAudio()
        print("Available audio devices:")
        for ii in range(p.get_device_count()):
            device_name = p.get_device_info_by_index(ii).get("name")
            print(f"{ii}: {device_name}")
    except Exception as e:
        print(f"An error occurred while listing audio devices: {e}")
    finally:
        # Ensure PyAudio resources are released, even if initialization failed midway
        if p is not None:
            p.terminate()


# Call the function to list audio devices
list_audio_devices()

Available audio devices:
0: Microsoft Sound Mapper - Input
1: Headset (Shiva’s AirPods Pro #2
2: Surface Stereo Microphones (Sur
3: Microphone (Microsoft Surface T
4: Microphone (Lumina Camera - Raw
5: Microsoft Sound Mapper - Output
6: Headphones (Shiva’s AirPods Pro
7: Speakers (Dell USB Audio)
8: Surface Omnisonic Speakers (Sur
9: Headset (Microsoft Surface Thun
10: Primary Sound Capture Driver
11: Headset (Shiva’s AirPods Pro #2)
12: Surface Stereo Microphones (Surface High Definition Audio)
13: Microphone (Microsoft Surface Thunderbolt(TM) 4 Dock Audio)
14: Microphone (Lumina Camera - Raw)
15: Primary Sound Driver
16: Headphones (Shiva’s AirPods Pro #2)
17: Speakers (Dell USB Audio)
18: Surface Omnisonic Speakers (Surface High Definition Audio)
19: Headset (Microsoft Surface Thunderbolt(TM) 4 Dock Audio)
20: Speakers (Dell USB Audio)
21: Headphones (Shiva’s AirPods Pro #2)
22: Surface Omnisonic Speakers (Surface High Definition Audio)
23: Headset (Microsoft Surface Thunderbolt(TM)
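
The list above mixes microphones and speakers. PyAudio's device info dictionaries include a `maxInputChannels` field, so a small filter can narrow the list to devices that can actually record. The sketch below works on plain dictionaries so it runs without audio hardware; `input_devices` is a hypothetical helper and the sample entries are made up for illustration.

```python
from typing import Any, Dict, List


def input_devices(device_infos: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only devices that expose at least one input (recording) channel."""
    return [d for d in device_infos if int(d.get("maxInputChannels", 0)) > 0]


# Made-up sample entries shaped like PyAudio device info dictionaries.
sample_devices = [
    {"index": 0, "name": "Example Microphone", "maxInputChannels": 2, "maxOutputChannels": 0},
    {"index": 1, "name": "Example Speakers", "maxInputChannels": 0, "maxOutputChannels": 2},
    {"index": 2, "name": "Example Headset", "maxInputChannels": 1, "maxOutputChannels": 2},
]

for device in input_devices(sample_devices):
    print(f"{device['index']}: {device['name']}")
```

With real hardware, build the input list from `p.get_device_info_by_index(i)` for each `i` in `range(p.get_device_count())`, as in the cell above.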

In [3]:
import pyaudio
import wave


def test_microphone():
    """
    Test the microphone by recording audio and playing it back.

    This function captures audio from the default input device (microphone),
    saves it to a temporary WAV file, and plays it back to ensure the microphone
    is working correctly.
    """
    # Audio configuration
    chunk = 1024  # Number of frames per buffer
    format = pyaudio.paInt16  # 16-bit audio format
    channels = 1  # Mono audio
    rate = 44100  # Sampling rate (44.1 kHz)
    record_seconds = 5  # Duration of the recording
    output_filename = "test_audio.wav"

    # Initialize PyAudio
    p = pyaudio.PyAudio()

    try:
        # Open the microphone stream
        print("Recording...")
        stream = p.open(
            format=format,
            channels=channels,
            rate=rate,
            input=True,
            frames_per_buffer=chunk,
        )

        frames = []

        # Record audio in chunks
        for _ in range(0, int(rate / chunk * record_seconds)):
            data = stream.read(chunk)
            frames.append(data)

        print("Recording complete. Saving audio...")

        # Save the recorded audio to a WAV file
        with wave.open(output_filename, "wb") as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(p.get_sample_size(format))
            wf.setframerate(rate)
            wf.writeframes(b"".join(frames))

        # Release the input stream before starting playback
        stream.stop_stream()
        stream.close()

        print(f"Audio saved to {output_filename}. Playing back...")

        # Open the WAV file for playback; the context manager closes it afterwards
        with wave.open(output_filename, "rb") as wf:
            playback_stream = p.open(
                format=p.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True,
            )

            # Read and play audio data chunk by chunk
            data = wf.readframes(chunk)
            while data:
                playback_stream.write(data)
                data = wf.readframes(chunk)

            playback_stream.stop_stream()
            playback_stream.close()

        print("Playback complete.")

    except Exception as e:
        print(f"An error occurred: {e}")

    finally:
        # Terminate PyAudio
        p.terminate()


# Run the microphone test
test_microphone()

Recording...
Recording complete. Saving audio...
Audio saved to test_audio.wav. Playing back...
Playback complete.
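
Listening to the playback is the real test, but a numeric check can help when you are unsure whether the recording is just very quiet or completely silent. The sketch below computes the RMS amplitude of a 16-bit PCM WAV file using only the standard library. `wav_rms` and `write_test_wav` are hypothetical helpers, and the two files are synthetic so the example runs without a microphone.

```python
import array
import math
import os
import sys
import tempfile
import wave


def wav_rms(path: str) -> float:
    """Return the RMS amplitude of a 16-bit PCM WAV file (0.0 means pure silence)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        raw = wf.readframes(wf.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def write_test_wav(path: str, samples: list, rate: int = 16000) -> None:
    """Write mono 16-bit samples to a WAV file (used only to build test input)."""
    data = array.array("h", samples)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(data.tobytes())


tmp_dir = tempfile.mkdtemp()
silent_path = os.path.join(tmp_dir, "silent.wav")
tone_path = os.path.join(tmp_dir, "tone.wav")
write_test_wav(silent_path, [0] * 16000)
write_test_wav(
    tone_path,
    [int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(16000)],
)

silent_rms = wav_rms(silent_path)
tone_rms = wav_rms(tone_path)
print(f"silent: {silent_rms:.1f}, tone: {tone_rms:.1f}")
```

To check the recording from the cell above, call `wav_rms("test_audio.wav")`; a value at or near zero suggests the wrong input device or a muted microphone.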


## **Define Clients**

In [4]:
# 📂 Setup Working Directory for ARTAgent Framework Access
import os
# Navigate to the project root directory
# This ensures we can import ARTAgent framework modules properly
try:
    # Move up two directories from samples/hello_world/ to project root
    os.chdir("../../")
    
    # Allow override via environment variable for different setups
    target_directory = os.getenv(
        "TARGET_DIRECTORY", os.getcwd()
    )  # Use environment variable if available
    
    # Verify the target directory exists before changing
    if os.path.exists(target_directory):
        os.chdir(target_directory)
        print(f"✅ Changed directory to: {os.getcwd()}")
    else:
        print(f"❌ Directory does not exist: {target_directory}")
        
except Exception as e:
    print(f"❌ Error changing directory: {e}")
    
# Verify we're in the correct location
print(f"📁 Current working directory: {os.getcwd()}")

✅ Changed directory to: c:\Users\pablosal\Desktop\gbb-ai-audio-agent
📁 Current working directory: c:\Users\pablosal\Desktop\gbb-ai-audio-agent


In [5]:
# Import logger
import os
import time
import threading

from utils.ml_logging import get_logger

timestamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
pid = os.getpid()
tid = threading.get_ident()
user = os.getenv("USER") or os.getenv("USERNAME") or "unknown"

logger = get_logger(f"run_test_{user}_{timestamp}_{pid}_{tid}")

In [6]:
# Settings

VOICE = "en-US-Ava:DragonHDLatestNeural" 
VAD_SILENCE_TIMEOUT_MS = 800
USE_SEMANTIC_VAD = False
CANDIDATE_LANGUAGES = ["en-US", "fr-FR", "de-DE", "es-ES", "it-IT"]
AOAI_TEMPERATURE = 1
AOAI_MODEL = "gpt-4o"  # Default model, can be overridden in agent config
TTS_ENDS = [".", "!", "?"]

PROMPT_STORE_DIR = "samples/hello_world/agents/prompt_store"
PROMPT_LOCATION = "samples/hello_world/agents/prompt_store/templates/customer_support_agent.jinja"
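
`TTS_ENDS` controls where the streamed LLM output is cut into sentences before being handed to text-to-speech. The sketch below illustrates the idea in isolation: text fragments are buffered until one ends in sentence punctuation and the buffer is long enough to be worth speaking. `sentences_from_chunks` and `MIN_TTS_CHARS` are hypothetical names used only here, and the logic is a simplified approximation of what the agent cell does later.

```python
from typing import Iterable, Iterator, List

TTS_ENDS = (".", "!", "?")
MIN_TTS_CHARS = 8  # assumed threshold: skip tiny fragments such as "99."


def sentences_from_chunks(chunks: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into speakable sentences."""
    buffer: List[str] = []
    for chunk in chunks:
        buffer.append(chunk)
        text = "".join(buffer).strip()
        # Flush only on sentence punctuation and only when the text is long enough.
        if chunk.rstrip().endswith(TTS_ENDS) and len(text) >= MIN_TTS_CHARS:
            yield text
            buffer.clear()
    tail = "".join(buffer).strip()
    if tail:
        yield tail  # whatever is left when the stream ends


stream = ["Your", " total", " is", " 99", ".", " Thanks", " for", " calling", "!"]
for sentence in sentences_from_chunks(stream):
    print(sentence)
```

Speaking sentence by sentence lets audio start before the full reply has arrived, at the cost of slightly less natural prosody across sentence boundaries.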

In [7]:
from src.speech.text_to_speech import SpeechSynthesizer
from src.speech.speech_recognizer import StreamingSpeechRecognizerFromBytes
from openai import AzureOpenAI
from samples.hello_world.agents.prompt_store.prompt_manager import PromptManager

if "az_speech_recognizer_stream_client" not in locals():
    az_speech_recognizer_stream_client = StreamingSpeechRecognizerFromBytes(
        region=os.getenv("AZURE_SPEECH_REGION"), 
        vad_silence_timeout_ms=VAD_SILENCE_TIMEOUT_MS,
        use_semantic_segmentation=USE_SEMANTIC_VAD,
        audio_format="pcm",
        candidate_languages=CANDIDATE_LANGUAGES,
        enable_diarisation=True,
        speaker_count_hint=2,
        enable_neural_fe=False,
    )
    

if "az_speech_synthesizer_client" not in locals():
    az_speech_synthesizer_client = SpeechSynthesizer(
        region=os.getenv("AZURE_SPEECH_REGION"),
        voice=VOICE,
    )

# Ensure Azure OpenAI client is initialized only if not already defined
if "client" not in locals():
    client = AzureOpenAI(
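        # Honor the validated AZURE_OPENAI_API_VERSION setting; fall back to the previous default
        api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2025-02-01-preview"),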
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_key=os.getenv("AZURE_OPENAI_KEY"),
    )

if "prompt_manager" not in locals():
    prompt_manager = PromptManager()

[2025-08-20 12:51:33,491] INFO - src.speech.speech_recognizer: Azure Monitor tracing initialized for speech recognizer
[2025-08-20 12:51:33,495] INFO - src.speech.speech_recognizer: Creating SpeechConfig with API key authentication
[2025-08-20 12:51:33,500] INFO - src.speech.text_to_speech: Azure Monitor tracing initialized for speech synthesizer
[2025-08-20 12:51:33,505] INFO - src.speech.text_to_speech: Creating SpeechConfig with API key authentication
[2025-08-20 12:51:33,513] INFO - src.speech.text_to_speech: Speech synthesizer initialized successfully


Templates found: ['customer_support_agent.jinja']


In [8]:
# Load the system prompt from the prompt store
import pprint

PROMPT = prompt_manager.get_prompt("customer_support_agent.jinja")

pprint.pprint(PROMPT)

('\n'
 'You are a helpful customer support agent for Demo Corp. Your role is to:\n'
 '\n'
 '🎯 **Primary Responsibilities:**\n'
 '- Answer product questions accurately and helpfully\n'
 '- Help customers check order status and tracking\n'
 '- Process return and exchange requests\n'
 '- Provide basic troubleshooting guidance\n'
 '- Escalate complex issues to human agents when needed\n'
 '\n'
 '🗨️ **Communication Style:**\n'
 '- Be friendly, professional, and empathetic\n'
 '- Use clear, concise language\n'
 '- Always confirm understanding before taking action\n'
 '- Provide specific next steps when possible\n'
 '\n'
 '🛠️ **Available Tools:**\n'
 '- `search_product_catalog`: Find product information, specs, pricing\n'
 '- `check_order_status`: Look up order details and shipping status  \n'
 '- `create_return_request`: Initiate return/exchange process\n'
 '- `escalate_to_human`: Transfer to live agent for complex issues\n'
 '\n'
 '🚫 **Important Constraints:**\n'
 '- Only use the tools prov

In [9]:
# import Tools 

from samples.hello_world.agents.tool_store.tool_registry import TOOL_REGISTRY

TOOL_REGISTRY

{'search_product_catalog': {'type': 'function',
  'function': {'name': 'search_product_catalog',
   'description': 'Search the product catalog for information about products including specs, pricing, and availability',
   'parameters': {'type': 'object',
    'properties': {'query': {'type': 'string',
      'description': 'Search term or product ID to look up in catalog'}},
    'required': ['query']}}},
 'check_order_status': {'type': 'function',
  'function': {'name': 'check_order_status',
   'description': 'Check the status and tracking information for a customer order',
   'parameters': {'type': 'object',
    'properties': {'order_id': {'type': 'string',
      'description': 'The order ID to look up (e.g., ORD123456)'}},
    'required': ['order_id']}}},
 'create_return_request': {'type': 'function',
  'function': {'name': 'create_return_request',
   'description': 'Create a return request for a customer order',
   'parameters': {'type': 'object',
    'properties': {'order_id': {'type
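
The registry is a dictionary keyed by tool name, while the chat completions API takes tools as a list; at run time the model returns a tool name plus a JSON string of arguments that must be routed to a Python callable. The sketch below shows both steps with a one-entry registry. `tools_as_list`, `dispatch`, and `fake_check_order_status` are hypothetical, and the handler's return value is invented for illustration rather than taken from the real tool implementation.

```python
import json
from typing import Any, Callable, Dict, List

# One-entry registry shaped like the output above (trimmed for brevity).
registry: Dict[str, Dict[str, Any]] = {
    "check_order_status": {
        "type": "function",
        "function": {
            "name": "check_order_status",
            "description": "Check the status and tracking information for a customer order",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
}


def tools_as_list(reg: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert a name -> tool mapping into the list form passed as `tools`."""
    return list(reg.values())


def fake_check_order_status(args: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in handler (assumption): the real tool would query a backend."""
    return {"order_id": args["order_id"], "status": "shipped"}


handlers: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "check_order_status": fake_check_order_status,
}


def dispatch(name: str, raw_args: str, table: Dict[str, Callable]) -> str:
    """Parse JSON arguments, call the matching handler, and return a JSON string."""
    handler = table.get(name)
    if handler is None:
        return json.dumps({"error": f"Unknown tool {name}"})
    try:
        args = json.loads(raw_args) if raw_args.strip() else {}
        return json.dumps(handler(args))
    except Exception as exc:  # report failures to the model instead of crashing
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})


print(tools_as_list(registry)[0]["function"]["name"])
print(dispatch("check_order_status", '{"order_id": "ORD123456"}', handlers))
```

Returning errors as JSON strings rather than raising lets the model see what went wrong and respond to the user accordingly.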

In [10]:
# 🎯 Complete Voice-to-Voice Agent with Streaming Tool Calls (FINAL PROD)
# - Solid barge-in (partials stop TTS cleanly, debounced)
# - Parallel tool-calls support
# - Tools passed as a LIST (not dict) to Azure OpenAI
# - Mic loop race fixed
# - Avoids micro-fragment TTS (e.g., "99.")

import os, time, threading, json, asyncio
from typing import Dict, List, Any, Optional

# Audio capture
RATE, CHANNELS, CHUNK = 16000, 1, 1024

# Barge-in tuning
_TTS_STOP_DEBOUNCE_SEC = 0.3
_MIN_TTS_CHARS = 8  # don't speak super tiny fragments

# Tools & registry
from samples.hello_world.agents.tool_store.customer_support_tools import (
    search_product_catalog,
    check_order_status,
    create_return_request,
    escalate_to_human
)
from samples.hello_world.agents.tool_store.tool_registry import TOOL_REGISTRY

# ──────────────────────────────────────────────────────────────────────────────
# Clients (Speech + Azure OpenAI)
# ──────────────────────────────────────────────────────────────────────────────
if "az_speech_synthesizer_client" not in locals():
    az_speech_synthesizer_client = SpeechSynthesizer(
        key=os.getenv("AZURE_SPEECH_KEY"),
        region=os.getenv("AZURE_SPEECH_REGION"),
        voice=VOICE,
        # Make sure your wrapper routes audio to the default speaker, or lets you pass an output_device_id
        use_default_speaker=True,
    )

if "az_speech_recognizer_stream_client" not in locals():
    az_speech_recognizer_stream_client = StreamingSpeechRecognizerFromBytes(
        region=os.getenv("AZURE_SPEECH_REGION"),
        vad_silence_timeout_ms=VAD_SILENCE_TIMEOUT_MS,
        use_semantic_segmentation=USE_SEMANTIC_VAD,
        audio_format="pcm",
        candidate_languages=CANDIDATE_LANGUAGES,
        enable_diarisation=True,
        speaker_count_hint=2,
        enable_neural_fe=False,
    )

if "client" not in locals():
    client = AzureOpenAI(
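        # Honor the validated AZURE_OPENAI_API_VERSION setting; fall back to the previous default
        api_version=os.getenv("AZURE_OPENAI_API_VERSION", "2025-02-01-preview"),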
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_key=os.getenv("AZURE_OPENAI_KEY"),
    )

# ──────────────────────────────────────────────────────────────────────────────
# Tools map (function name -> callable)
# ──────────────────────────────────────────────────────────────────────────────
function_mapping = {
    "search_product_catalog": search_product_catalog,
    "check_order_status": check_order_status,
    "create_return_request": create_return_request,
    "escalate_to_human": escalate_to_human,
}

# ──────────────────────────────────────────────────────────────────────────────
# Global state
# ──────────────────────────────────────────────────────────────────────────────
user_buffer = ""
is_synthesizing = False
conversation_active = False
audio_stream = None
audio_interface = None
_mic_thread = None  # guard for mic thread
_tts_future = None
_last_tts_stop = 0.0

# ──────────────────────────────────────────────────────────────────────────────
# Utilities: TTS control (barge-in)
# ──────────────────────────────────────────────────────────────────────────────
def speak(text: str):
    """Start TTS in a controlled way (sets flag, keeps future, ignores micro-fragments)."""
    global is_synthesizing, _tts_future
    if not text or len(text.strip()) < _MIN_TTS_CHARS:
        return
    is_synthesizing = True
    try:
        _tts_future = az_speech_synthesizer_client.start_speaking_text(text)
    except Exception as e:
        print(f"❌ TTS start error: {e}")
        is_synthesizing = False
        _tts_future = None

def _stop_tts(reason: str = ""):
    """Stop/cancel current TTS with debounce to avoid flapping on tiny partials."""
    global is_synthesizing, _tts_future, _last_tts_stop
    now = time.time()
    if now - _last_tts_stop < _TTS_STOP_DEBOUNCE_SEC:
        return
    _last_tts_stop = now

    try:
        if _tts_future:
            try:
                _tts_future.cancel()
            except Exception:
                pass
            _tts_future = None
        az_speech_synthesizer_client.stop_speaking()
        if reason:
            print(f"🛑 TTS stopped (barge-in): {reason}")
    except Exception as e:
        print(f"⚠️ stop_speaking() error (ignored): {e}")
    finally:
        is_synthesizing = False

# ──────────────────────────────────────────────────────────────────────────────
# Tool-call streaming state (parallel-friendly)
# ──────────────────────────────────────────────────────────────────────────────
class _SingleToolState:
    def __init__(self, call_id: str):
        self.call_id = call_id
        self.name = ""
        self.args_json = []  # fragments

    @property
    def args_str(self) -> str:
        return "".join(self.args_json)

def _ensure_tools_list(tools_like) -> List[Dict[str, Any]]:
    """Azure OpenAI expects a list of tool objects; convert dict registries."""
    if tools_like is None:
        return []
    if isinstance(tools_like, dict):
        return list(tools_like.values())
    if isinstance(tools_like, list):
        return tools_like
    raise TypeError("tools must be list[tool] or dict[name->tool]")

# ──────────────────────────────────────────────────────────────────────────────
# LLM streaming with tool-calls
# ──────────────────────────────────────────────────────────────────────────────
async def process_streaming_response_with_tools(
    messages: List[Dict[str, Any]],
    tools: Optional[List[Dict[str, Any]]] = None,
) -> None:
    """Streams assistant text, handles tool-calls, then streams a follow-up."""
    tools = _ensure_tools_list(tools or TOOL_REGISTRY)

    print("🤖 Processing GPT response...")

    response = client.chat.completions.create(
        stream=True,
        messages=messages,
        tools=tools,
        tool_choice="auto",
        max_tokens=4096,
        temperature=AOAI_TEMPERATURE,
        top_p=1.0,
        model=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_ID"),
    )

    collected_text: List[str] = []  # text not yet sent to TTS
    full_text: List[str] = []       # complete assistant reply, kept for the history
    tool_states: Dict[Any, _SingleToolState] = {}

    for chunk in response:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # Tool-calls (may be multiple). Only the first fragment of a call carries
        # its id; later argument fragments are matched by their index.
        if hasattr(delta, "tool_calls") and delta.tool_calls:
            for tc in delta.tool_calls:
                key = getattr(tc, "index", None)
                if key is None:
                    key = tc.id
                if key not in tool_states:
                    if not tc.id:
                        continue
                    tool_states[key] = _SingleToolState(tc.id)
                st = tool_states[key]
                if hasattr(tc, "function") and tc.function:
                    if getattr(tc.function, "name", None):
                        st.name = tc.function.name
                        print(f"🛠️ Tool call detected: {st.name} (id={st.call_id})")
                    if getattr(tc.function, "arguments", None):
                        st.args_json.append(tc.function.arguments)

        # Text streaming
        elif hasattr(delta, "content") and delta.content:
            text_chunk = delta.content
            collected_text.append(text_chunk)
            full_text.append(text_chunk)
            print(text_chunk, end="", flush=True)

            # Speak on sentence boundaries (but avoid tiny fragments). A chunk may
            # carry the punctuation together with other characters or whitespace.
            if text_chunk.rstrip().endswith(tuple(TTS_ENDS)) and sum(len(x) for x in collected_text) >= _MIN_TTS_CHARS:
                sentence = "".join(collected_text).strip()
                if sentence:
                    print(f"\n🔊 Speaking: {sentence}")
                    speak(sentence)
                    collected_text.clear()

    # Flush any remaining text
    if collected_text:
        remaining = "".join(collected_text).strip()
        if remaining:
            print(f"\n🔊 Speaking final: {remaining}")
            speak(remaining)

    # Record the whole reply (not just the last unsent fragment) in the history
    assistant_text = "".join(full_text).strip()
    if assistant_text:
        messages.append({"role": "assistant", "content": assistant_text})

    print()  # newline

    # If tools were called, add assistant tool_calls message, execute, then follow up
    if tool_states:
        messages.append({
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {"id": st.call_id, "type": "function",
                 "function": {"name": st.name, "arguments": st.args_str}}
                for st in tool_states.values()
            ],
        })
        await _execute_tools_and_followup(tool_states, messages, tools)

async def _execute_tools_and_followup(
    tool_states: Dict[str, _SingleToolState],
    messages: List[Dict[str, Any]],
    tools: List[Dict[str, Any]],
) -> None:
    for st in tool_states.values():
        print(f"\n🔧 Executing tool: {st.name} (id={st.call_id})")
        print(f"📝 Arguments (raw): {st.args_str}")

        # Parse streamed args (with gentle repair)
        try:
            args = json.loads(st.args_str) if st.args_str.strip() else {}
        except json.JSONDecodeError as e:
            repaired = st.args_str.strip()
            if repaired and not repaired.startswith("{"):
                repaired = "{" + repaired
            if repaired and not repaired.endswith("}"):
                repaired = repaired + "}"
            try:
                args = json.loads(repaired)
                print("⚠️ JSON repaired for tool args.")
            except Exception:
                print(f"❌ Error parsing tool arguments: {e}")
                args = {}

        if st.name in function_mapping:
            fn = function_mapping[st.name]
            try:
                result = await fn(args) if asyncio.iscoroutinefunction(fn) else fn(args)
                print(f"✅ Tool result: {result}")
                messages.append({
                    "tool_call_id": st.call_id, "role": "tool",
                    "name": st.name,
                    "content": json.dumps(result) if isinstance(result, (dict, list)) else str(result),
                })
            except Exception as e:
                err_payload = {"error": f"{type(e).__name__}: {e}"}
                print(f"❌ Tool execution error: {err_payload}")
                messages.append({
                    "tool_call_id": st.call_id, "role": "tool",
                    "name": st.name, "content": json.dumps(err_payload),
                })
        else:
            print(f"❌ Unknown tool: {st.name}")
            messages.append({
                "tool_call_id": st.call_id, "role": "tool",
                "name": st.name, "content": json.dumps({"error": f"Unknown tool {st.name}"}),
            })

    print("\n🤖 Getting follow-up response...")
    await _process_followup_response(messages)

async def _process_followup_response(messages: List[Dict[str, Any]]) -> None:
    response = client.chat.completions.create(
        stream=True,
        messages=messages,
        temperature=AOAI_TEMPERATURE,
        top_p=1.0,
        max_tokens=4096,
        model=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_ID"),
    )

    collected_text: List[str] = []  # text not yet sent to TTS
    full_text: List[str] = []       # complete reply, kept for the history
    for chunk in response:
        if chunk.choices and hasattr(chunk.choices[0].delta, "content"):
            content = chunk.choices[0].delta.content
            if content:
                collected_text.append(content)
                full_text.append(content)
                print(content, end="", flush=True)
                if content.rstrip().endswith(tuple(TTS_ENDS)) and sum(len(x) for x in collected_text) >= _MIN_TTS_CHARS:
                    sentence = "".join(collected_text).strip()
                    if sentence:
                        print(f"\n🔊 Speaking: {sentence}")
                        speak(sentence)
                        collected_text.clear()

    if collected_text:
        final_text = "".join(collected_text).strip()
        if final_text:
            print(f"\n🔊 Speaking final: {final_text}")
            speak(final_text)

    # Store the whole follow-up reply, not only the last unsent fragment
    reply = "".join(full_text).strip()
    if reply:
        messages.append({"role": "assistant", "content": reply})

    print()

# ──────────────────────────────────────────────────────────────────────────────
# Speech setup (partials + final with language)
# ──────────────────────────────────────────────────────────────────────────────
def setup_speech_recognition():
    """Wire STT callbacks. Partials will barge-in (stop TTS) when meaningful."""
    global user_buffer

    def on_final(text: str, lang: str):
        global user_buffer
        print(f"\n🧾 User (final) in {lang}: {text}")
        user_buffer += text.strip() + "\n"

    def on_partial(text: str, lang: str):
        print(f"🗣️ User (partial) in {lang}: {text}")
        # Only barge-in on meaningful speech (>=3 chars)
        if is_synthesizing and len(text.strip()) >= 3:
            _stop_tts("user started speaking")

    az_speech_recognizer_stream_client.set_partial_result_callback(on_partial)
    az_speech_recognizer_stream_client.set_final_result_callback(on_final)

# ──────────────────────────────────────────────────────────────────────────────
# Microphone loop
# ──────────────────────────────────────────────────────────────────────────────
def setup_microphone():
    """Stream mic PCM to recognizer."""
    global audio_stream, audio_interface, _mic_thread, conversation_active
    try:
        import pyaudio
        audio_interface = pyaudio.PyAudio()
        audio_stream = audio_interface.open(
            format=pyaudio.paInt16,
            channels=CHANNELS,
            rate=RATE,
            input=True,
            frames_per_buffer=CHUNK,
        )

        def mic_loop():
            while conversation_active and audio_stream:
                try:
                    data = audio_stream.read(CHUNK, exception_on_overflow=False)
                    az_speech_recognizer_stream_client.write_bytes(data)
                except Exception as e:
                    print(f"❌ Microphone error: {e}")
                    break

        if not _mic_thread or not _mic_thread.is_alive():
            _mic_thread = threading.Thread(target=mic_loop, daemon=True)
            _mic_thread.start()
            print("✅ Microphone setup complete (thread started)")
        else:
            print("ℹ️ Microphone thread already running")

    except Exception as e:
        print(f"❌ Microphone setup failed: {e}")

# ──────────────────────────────────────────────────────────────────────────────
# Agent lifecycle
# ──────────────────────────────────────────────────────────────────────────────
def start_voice_agent():
    """Start agent, ensure mic loop and STT are ready before talking."""
    global conversation_active, user_buffer
    print("🎯 Starting Voice-to-Voice Agent...")

    conversation_active = True   # set active BEFORE mic thread starts
    user_buffer = ""

    setup_speech_recognition()
    setup_microphone()

    az_speech_recognizer_stream_client.start()
    print("🎙️ Speech recognition started")
    time.sleep(0.1)  # tiny delay for SDK readiness

    print("\n✅ Voice-to-Voice Agent Ready!")
    print("💡 Speak to interact with the customer support agent")
    print("🛑 Use stop_voice_agent() to end the conversation")

def stop_voice_agent():
    """Stop agent and clean up resources."""
    global conversation_active, audio_stream, audio_interface
    print("🛑 Stopping Voice-to-Voice Agent...")
    conversation_active = False

    try:
        az_speech_recognizer_stream_client.stop()
        print("✅ Speech recognition stopped")
    except Exception as e:
        print(f"⚠️ recognizer.stop() error (ignored): {e}")

    try:
        _stop_tts("agent stopping")  # ensures synthesis is halted
        print("✅ Text-to-speech stopped")
    except Exception as e:
        print(f"⚠️ tts.stop() error (ignored): {e}")

    try:
        if audio_stream:
            audio_stream.stop_stream()
            audio_stream.close()
        if audio_interface:
            audio_interface.terminate()
        print("✅ Microphone stream closed")
    except Exception as e:
        print(f"⚠️ mic close error (ignored): {e}")

    print("🎯 Voice-to-Voice Agent stopped. All resources cleaned up.")

# ──────────────────────────────────────────────────────────────────────────────
# Conversation loops
# ──────────────────────────────────────────────────────────────────────────────
async def process_user_input():
    """Process any pending user input from STT buffer (single turn)."""
    global user_buffer
    if user_buffer.strip():
        user_input = user_buffer.strip()
        user_buffer = ""
        print(f"\n👤 Processing user input: {user_input}")
        messages = [{"role": "system", "content": PROMPT},
                    {"role": "user", "content": user_input}]
        await process_streaming_response_with_tools(messages, TOOL_REGISTRY)
        return True
    return False

async def full_conversation_loop():
    """Continuous conversation: start agent, then react to user speech."""
    global user_buffer, messages
    greeting()
    print("🎯 Starting Complete Voice-to-Voice Conversation...")
    start_voice_agent()

    messages = [{"role": "system", "content": PROMPT}]
    try:
        while conversation_active:
            if user_buffer.strip():
                user_input = user_buffer.strip()
                user_buffer = ""
                print(f"\n👤 Processing user input: {user_input}")
                messages.append({"role": "user", "content": user_input})
                await process_streaming_response_with_tools(messages, TOOL_REGISTRY)
            await asyncio.sleep(0.1)
    except KeyboardInterrupt:
        print("\n⚠️ Conversation interrupted by user")
    except Exception as e:
        print(f"\n❌ Error in conversation loop: {e}")
    finally:
        stop_voice_agent()

# ──────────────────────────────────────────────────────────────────────────────
# Optional: quick TTS sanity check (play once to confirm device/creds)
# ──────────────────────────────────────────────────────────────────────────────
def greeting():
    """Speak a fixed greeting; also confirms the output device and credentials work."""
    az_speech_synthesizer_client.start_speaking_text(
        "Hi there from XYZ customer service. How can I help you today?"
    )

print("✅ Voice-to-Voice Agent functions loaded!")
print("\n🚀 Usage Options:")
print("1. start_voice_agent()")
print("2. await process_user_input()  # one-shot")
print("3. stop_voice_agent()")
print("4. await full_conversation_loop()  # continuous")


✅ Voice-to-Voice Agent functions loaded!

🚀 Usage Options:
1. start_voice_agent()
2. await process_user_input()  # one-shot
3. stop_voice_agent()
4. await full_conversation_loop()  # continuous
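
One detail of streamed function calling is easy to miss: in the Chat Completions streaming format, the first fragment of a tool call typically carries its `id` and function name, while later fragments carry only an `index` and a piece of the JSON arguments. The sketch below shows one way to reassemble calls by `index`. It uses `SimpleNamespace` objects as stand-ins for the SDK's delta objects, so the field names are an assumption about their shape, and `accumulate_tool_calls` is a hypothetical helper.

```python
import json
from types import SimpleNamespace as NS
from typing import Dict, Iterable, List


def accumulate_tool_calls(deltas: Iterable[NS]) -> List[dict]:
    """Merge streamed tool-call fragments into complete calls, keyed by index."""
    calls: Dict[int, dict] = {}
    for tc in deltas:
        call = calls.setdefault(tc.index, {"id": None, "name": "", "arguments": ""})
        if tc.id:
            call["id"] = tc.id
        fn = tc.function
        if fn is not None:
            if fn.name:
                call["name"] = fn.name
            if fn.arguments:
                call["arguments"] += fn.arguments  # JSON arrives in pieces
    return [calls[i] for i in sorted(calls)]


# Simulated fragments for two parallel tool calls (made-up ids and arguments).
fragments = [
    NS(index=0, id="call_1", function=NS(name="check_order_status", arguments="")),
    NS(index=0, id=None, function=NS(name=None, arguments='{"order_')),
    NS(index=0, id=None, function=NS(name=None, arguments='id": "ORD123456"}')),
    NS(index=1, id="call_2", function=NS(name="search_product_catalog", arguments="")),
    NS(index=1, id=None, function=NS(name=None, arguments='{"query": "headphones"}')),
]

tool_calls = accumulate_tool_calls(fragments)
for call in tool_calls:
    print(call["id"], call["name"], json.loads(call["arguments"]))
```

Keying on `index` rather than `id` is what keeps the argument fragments of parallel calls from being dropped or mixed up.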


In [None]:
# 🎯 Start Complete Voice-to-Voice Conversation
# This will start the agent and continuously process voice input
# Press Ctrl+C or run the stop cell to end the conversation
# test order id: ORD123456

await full_conversation_loop()
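
Awaiting the loop directly keeps the cell busy until the conversation ends, which in a typical Jupyter kernel means other cells wait in the queue. One possible alternative is to schedule the loop as a background task and stop it by clearing the flag it polls. The sketch below demonstrates the pattern with a stand-in loop; `conversation_loop`, `request_stop`, and `demo` are hypothetical and do not touch the microphone or any Azure service.

```python
import asyncio

conversation_active = False
turns_processed = 0


async def conversation_loop(poll_seconds: float = 0.01) -> None:
    """Stand-in for the real loop: polls a flag until asked to stop."""
    global conversation_active, turns_processed
    conversation_active = True
    try:
        while conversation_active:
            turns_processed += 1  # the real loop would process user speech here
            await asyncio.sleep(poll_seconds)
    finally:
        conversation_active = False  # the real loop would release resources here


def request_stop() -> None:
    """Ask the loop to exit at its next poll."""
    global conversation_active
    conversation_active = False


async def demo() -> int:
    task = asyncio.create_task(conversation_loop())  # returns immediately
    await asyncio.sleep(0.05)                        # other work can happen here
    request_stop()
    await task                                       # wait for a clean shutdown
    return turns_processed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the notebook, the equivalent would be `task = asyncio.create_task(full_conversation_loop())` in one cell and `stop_voice_agent()` in a later one, since `stop_voice_agent()` clears the `conversation_active` flag that the loop checks.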