### Voice Chatbot with ASR (Automatic Speech Recognition)

In this cookbook, we will walk through the process of creating a simple sales chatbot with automatic speech recognition (ASR) and text-to-speech (TTS) capabilities. We'll use a GPT model via the Chat Completions API to drive the conversation with the user. At the end of the interaction, the chatbot will present an order cart containing the items the user wishes to purchase.

Voice chatbots based on ASR/TTS can introduce latency due to the speech-to-text and text-to-speech conversion processes. We will explore strategies to minimize this lag to ensure a better conversational flow.

Creating an ASR/TTS-based voice chatbot is a three-step process, as outlined below:

**1. Set Up the GPT Model (Text-to-Text Modality)**  
Initialize the GPT model with system prompts that define the goal of the conversation, guiding the chatbot's responses toward assisting with sales order placement. The prompts can be set up for a multi-assistant system, where one assistant drives the conversation with the customer and another assistant manages the cart in parallel. Also, set up tools for assistants to use when asking for human help or interacting with each other (such as cart pricing).

**2. Develop Audio Modules for ASR (Automatic Speech Recognition) and TTS (Text-To-Speech)**  
Create an audio interface that listens to the user, records their speech, and forwards the audio to an ASR solution (such as Whisper) for transcription. Implement a **VAD (Voice Activity Detection) module** for **hands-free operation**. This module detects the user's speech and segments the audio at silence intervals before sending it to the Whisper model. Keep the VAD parameters (the amplitude threshold below which audio counts as silence, and the number of consecutive silent chunks that ends a segment) configurable so they can be tuned to the environment. Set up a TTS function that converts a given text to audio and plays it back to the user. You can pre-record common phrases to reduce lag.

**3. Create a Conversation Loop and Manage the Order Cart**  
Implement a conversation loop where the agent listens to the user and responds back, continuing until an event occurs that breaks the loop, such as a request to speak with a human or another indication of the end of the conversation.


The overall solution architecture is as follows:   
![ASR/TTS](./images/asr-text-to-speech.png)

For the purposes of this cookbook, we will use an example of an office stationery ordering bot. You can interact with the bot to order general-purpose office products such as pencils, pens, paper clips, writing pads, printing paper, and envelopes.

The key challenges we want to address are:

1. Ensure customers can only order items that are available.
2. Escalate to a human in the loop if the customer requests help or engages in non-order-related conversation.
3. Provide an accurate summary of the order with prices to the customer.
4. Minimize the lag in the conversation.


Before we get started, make sure the following libraries are installed: `pyaudio`, `numpy`, `openai`, and `playsound`. Also make sure your OpenAI API key is set as an environment variable (the `openai` client reads `OPENAI_API_KEY` by default).
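
As a quick sanity check before running the notebook, the prerequisites can be verified programmatically. The sketch below is only illustrative: the helper names `missing_modules` and `api_key_configured` are hypothetical and not part of the cookbook.

```python
import importlib.util
import os

# Import names of the libraries this cookbook relies on.
REQUIRED_MODULES = ["pyaudio", "numpy", "openai", "playsound"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def api_key_configured(env=os.environ):
    """Return True when OPENAI_API_KEY is set to a non-empty value."""
    return bool(env.get("OPENAI_API_KEY", "").strip())


print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
print("API key configured:", api_key_configured())
```

If any module is reported as missing, install it before continuing. Note that `pyaudio` typically also needs the PortAudio system library.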

### 1. Set Up the GPT Model (Text-to-Text Modality)

The first step is to lay the foundation for the GPT model to operate as a sales chatbot in the office stationery domain. Carefully crafted prompts and function definitions help the bot handle customer interactions smoothly, maintain the flow of conversation, and give accurate assistance aligned with the goals of this project.

We define a sales bot prompt, `SALES_BOT_PROMPT`, that drives the interaction with the user, and a `SALES_CART_PROMPT` that manages the cart. The items available for sale are provided as a list of JSON-style objects, `office_stationery_items`. Both prompts embed this list, so the sales bot and the cart assistant know which items are available and can respond to the user accordingly.
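
One detail worth knowing: interpolating a Python list directly into an f-string inserts its Python `repr` (single-quoted keys and values), which models generally read without trouble but which is not strict JSON. If strict JSON is preferred, the list can be serialized first. The following is a minimal sketch under that assumption. `render_items_block` and `sample_items` are hypothetical names used only for illustration.

```python
import json

# Two sample entries in the same shape as the item list used in this cookbook.
sample_items = [
    {"item-id": "0001", "item-name": "pencil", "item-price": "$0.50"},
    {"item-id": "0002", "item-name": "pen", "item-price": "$1.00"},
]


def render_items_block(items):
    """Wrap the item list, serialized as JSON, in the tags the prompts use."""
    return "<LIST OF ITEMS>\n" + json.dumps(items, indent=2) + "\n</LIST OF ITEMS>"


print(render_items_block(sample_items))
```

The returned string can then be placed into a prompt in the same position as the raw list.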


In [71]:
import json

# Creates a list of dictionaries, where each dictionary represents an office stationery item available for purchase.
office_stationery_items = [
    {"item-id": "0001", "item-name": "pencil", "item-price": "$0.50"},
    {"item-id": "0002", "item-name": "pen", "item-price": "$1.00"},
    {"item-id": "0003", "item-name": "clip", "item-price": "$0.05"},
    {"item-id": "0004", "item-name": "writing pad", "item-price": "$2.00"},
    {"item-id": "0005", "item-name": "printing paper", "item-price": "$5.00"},
    {"item-id": "0006", "item-name": "envelope", "item-price": "$0.10"}
]

# Defines the system prompt that instructs the GPT model on how to behave during the conversation.
SALES_BOT_PROMPT = f"""You are an office stationery sales bot. The customer will ask to buy one of the following items. Follow the rules below:
1. Be succinct in your responses up to 10 words or less if possible.  
2. If the customer asks for an item that is not available, you should let the customer know that item is not available.
3. Once the customer has placed an order, reply with ANYTHING ELSE
4. If the customer wants to chat with a human, call the function  'get_human_help'
5. If the customer discusses any other topic, other than ordering office stationery, call the function 'get_human_help'
6. When the order is final, call the function `get_order_details` and let the customer know the price.
<LIST OF ITEMS>
{office_stationery_items}
</LIST OF ITEMS>  
"""

# Provides a separate prompt to guide the bot in generating the final order cart
# An example is provided to illustrate the desired output format, ensuring consistency and accuracy in the bot's response
# This could be further enhanced by structured output, but one shot example is sufficient in this context 
SALES_CART_PROMPT = f"""You are an office stationery sales bot, that will generate a cart based on a conversation between a user and an agent. The list of items available for purchase is provided below. Output the cart in JSON format. Include quantity and total price of the order. 

<LIST OF ITEMS>
{office_stationery_items}
</LIST OF ITEMS> 

<EXAMPLE OF A CART> 
{{
  "cart": [
    {{
      "item-id": "0001",
      "item-name": "pencil",
      "quantity": 4,
      "item-price": "$0.50",
      "total-item-price": "$2.00"
    }}
  ],
  "total-price": "$2.00"
}}
</EXAMPLE OF A CART> 
"""

# Defines functions that the bot can "call" during the conversation to handle specific situations such as to get order details and get human help 
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_order_details",
            "description": "Use this function once the customer has finished ordering to get the order price."
        }
    }, 
    {
        "type": "function",
        "function": {
            "name": "get_human_help",
            "description": "Use this function if customer discusses topics other than the order or wants to speak with a human."
        }
    }]

# Initialize the prompt for sales agent 
sales_agent_prompt = [{"role": "system", "content": SALES_BOT_PROMPT}]

# Initialize the prompt for pricing agent 
pricing_agent_prompt = [{"role": "system", "content": SALES_CART_PROMPT}]

### 2. Develop Audio Modules for ASR (Automatic Speech Recognition) and TTS (Text-To-Speech)

The following Python code implements an interactive voice agent that facilitates customer interactions for ordering office stationery.  
 
The `listen()` function implements the **VAD (Voice Activity Detection) module**. It uses `PyAudio` to stream audio in buffers of `CHUNK` frames (`frames_per_buffer=CHUNK`, here `1024`). The helper `is_silent(input_data)` classifies a chunk as silent when its mean absolute amplitude falls below `SILENCE_THRESHOLD`, which filters out low-level background noise such as breathing. Once more than `SILENT_CHUNKS` (here `50`) consecutive chunks are silent, the function treats the customer as having finished speaking and saves the audio to a WAV file. The input only counts as valid if the user actually spoke, that is, if more than `SPOKEN_CHUNKS` chunks were non-silent. When both conditions hold, the audio is sent to OpenAI's Whisper model for **ASR (Automatic Speech Recognition)**, and `listen()` returns the transcribed text (English in this example).
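
The stopping rule can be explored without a microphone by replaying a list of per-chunk amplitude levels. The sketch below mirrors the two counters described above in plain Python. The function name `end_of_utterance_index` and the synthetic levels are assumptions made for illustration only.

```python
def end_of_utterance_index(chunk_levels, silence_threshold=20,
                           silent_chunks=50, spoken_chunks=50):
    """Return the index of the chunk at which recording would stop, or None.

    chunk_levels holds one mean absolute amplitude per audio chunk, i.e. the
    value that is compared against the silence threshold.
    """
    silent_run = 0   # consecutive silent chunks seen so far
    spoken = 0       # total non-silent chunks seen so far
    for index, level in enumerate(chunk_levels):
        if level < silence_threshold:
            silent_run += 1
        else:
            silent_run = 0
            spoken += 1
        # Stop only when the user has spoken enough AND then stayed quiet.
        if silent_run > silent_chunks and spoken > spoken_chunks:
            return index
    return None


# 10 quiet chunks, 60 chunks of speech, then 51 quiet chunks.
levels = [5] * 10 + [300] * 60 + [5] * 51
print(end_of_utterance_index(levels))

# Only 20 chunks of speech: the spoken-chunk condition is never met.
print(end_of_utterance_index([300] * 20 + [5] * 200))
```

The second call shows a practical consequence of the default values: a very short reply on its own does not accumulate enough non-silent chunks to end the recording, so `SPOKEN_CHUNKS` may need to be lowered if one-word answers are expected.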

The variables `SPOKEN_CHUNKS`, `SILENCE_THRESHOLD`, and `SILENT_CHUNKS` can be adjusted to the environment and use case to control how the input audio is segmented.
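
When tuning these values it helps to think in seconds rather than chunks. With `CHUNK = 1024` and `RATE = 44100`, one chunk lasts about 23 ms, so 50 chunks correspond to roughly 1.16 seconds. The helper names below are hypothetical and serve only to show the arithmetic.

```python
import math

CHUNK = 1024   # frames per buffer, as in this cookbook
RATE = 44100   # samples per second, as in this cookbook


def chunk_seconds(chunk=CHUNK, rate=RATE):
    """Duration of one audio chunk in seconds."""
    return chunk / rate


def chunks_for_seconds(seconds, chunk=CHUNK, rate=RATE):
    """Smallest number of whole chunks that covers the given duration."""
    return math.ceil(seconds * rate / chunk)


print(f"one chunk: {chunk_seconds() * 1000:.1f} ms")
print(f"50 chunks: {50 * chunk_seconds():.2f} s")
print(f"chunks for 0.8 s of silence: {chunks_for_seconds(0.8)}")
```

A shorter silence window makes the bot respond sooner but risks cutting off users who pause mid-sentence, so the right value depends on the use case.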

The `speak(agent_message)` function takes the agent's text response, converts it into spoken audio using OpenAI's text-to-speech model, saves it as a WAV file, and plays it back to the customer. Together, `listen()` and `speak()` provide a conversational interface that integrates speech recognition and synthesis. Note that in this implementation the audio cannot be interrupted once the agent starts speaking; the user must wait for their turn to speak.

To reduce lag, we have pre-recorded sound snippets and stored them in the `sounds` folder. If the agent's response matches one of these pre-recorded phrases, we play the snippet immediately instead of generating new audio, which reduces the perceived lag.
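
Matching on the exact phrase is fragile, because the model may add punctuation or change capitalization (for example `ANYTHING ELSE?` instead of `Anything else`). One possible refinement, shown here as a sketch rather than as the cookbook's implementation, is to normalize the text before looking it up in a table. The names `PRERECORDED`, `normalize`, and `prerecorded_path` are hypothetical.

```python
import string

# Phrase text -> audio file, mirroring the pre-recorded snippets in this cookbook.
PRERECORDED = {
    "What would you like to order?": "./sounds/initial_message.wav",
    "Let me get you a human to help!": "./sounds/human_help_message.wav",
    "Thank you for your order, let me calculate the total price.": "./sounds/finalize_order.wav",
    "Anything else": "./sounds/anything_else.wav",
}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    without_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punct.lower().split())


# Index the phrases by their normalized form once, up front.
_LOOKUP = {normalize(phrase): path for phrase, path in PRERECORDED.items()}


def prerecorded_path(agent_message):
    """Return the audio path for a pre-recorded phrase, or None if there is none."""
    return _LOOKUP.get(normalize(agent_message))


print(prerecorded_path("ANYTHING ELSE?"))
print(prerecorded_path("Order: 4 pencils. ANYTHING ELSE?"))
```

A lookup table also keeps the list of phrases in one place, so adding a new snippet does not require another branch in the playback function. Messages that merely contain a pre-recorded phrase still fall through to generated audio.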


In [77]:
import pyaudio
import numpy as np
import wave
from openai import OpenAI
from playsound import playsound

CHUNK = 1024  # CHUNK sets the number of frames per buffer.
FORMAT = pyaudio.paInt16  # FORMAT specifies the sample format (16-bit in this case).
CHANNELS = 1  # CHANNELS sets the number of audio channels: 1 for mono, 2 for stereo
RATE = 44100  # RATE sets the sample rate to 44100 Hz
SILENCE_THRESHOLD = 20  # Adjust this threshold based on your environment
SILENT_CHUNKS = 50  # Number of chunks of silence to trigger stop
SPOKEN_CHUNKS = 50  # Number of spoken chunks to have a valid response from the user

oai_client = OpenAI()


# List of pre-recorded messages 
initial_message = "What would you like to order?"
human_help_message = "Let me get you a human to help!"
finalize_order = "Thank you for your order, let me calculate the total price."
anything_else = "Anything else"


def listen():
    """Listen to the customer. Return the text from the speech"""
    print("Agent listening ...")

    def is_silent(input_data):
        """Check if the given data chunk is silent."""
        audio_data = np.frombuffer(input_data, dtype=np.int16)
        return np.abs(audio_data).mean() < SILENCE_THRESHOLD

    output = "user_response.wav"
    with wave.open(output, 'wb') as wf:
        p = pyaudio.PyAudio()
        wf.setsampwidth(p.get_sample_size(FORMAT))
        wf.setnchannels(CHANNELS)
        wf.setframerate(RATE)

        stream = p.open(format=FORMAT,
                        channels=CHANNELS,
                        rate=RATE,
                        input=True,
                        frames_per_buffer=CHUNK)

        # print("* recording")
        frames = []
        silent_chunks = 0
        speech_chunks = 0

        while True:
            data = stream.read(CHUNK)
            frames.append(data)

            if is_silent(data):
                silent_chunks += 1
            else:
                silent_chunks = 0
                speech_chunks += 1

            if silent_chunks > SILENT_CHUNKS and speech_chunks > SPOKEN_CHUNKS:
                break

        print("* done listening")
        stream.stop_stream()
        stream.close()
        p.terminate()
        wf.writeframes(b''.join(frames))

    # Leaving the `with` block closes the WAV file, so the header and all frames
    # are flushed to disk before the file is read back for transcription.
    # Upload the recorded audio file to OpenAI whisper-1 model for transcription
    with open(output, "rb") as audio_file:
        transcription = oai_client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            prompt="This is a customer trying to order office stationery"
        )

    return transcription.text


def speak(agent_message):
    # Common phrases can be pre-recorded to reduce the lag 
    if agent_message.lower() == initial_message.lower():
        print("Agent speaking ... (pre-recorded audio)")
        playsound("./sounds/initial_message.wav")
    
    elif agent_message.lower() == human_help_message.lower(): 
        print("Agent speaking ... (pre-recorded audio)")
        playsound("./sounds/human_help_message.wav")
        
    elif agent_message.lower() == finalize_order.lower(): 
        print("Agent speaking ... (pre-recorded audio)")
        playsound("./sounds/finalize_order.wav")
    
    elif agent_message.lower() == anything_else.lower():
        print("Agent speaking ... (pre-recorded audio)")
        playsound("./sounds/anything_else.wav")
    
    else:
        print("Agent speaking ...(new audio file generated)")
        # Convert text to speech
        agent_voice_response = oai_client.audio.speech.create(
            model="tts-1",
            voice="nova",
            input=agent_message,
            response_format="wav"  # the API defaults to MP3; request WAV to match the file name
        )
        # Save to the file
        agent_voice_response.write_to_file("agent_response.wav")

        # Play the audio file
        playsound("agent_response.wav")
        

### 3. Create a Conversation Loop and Manage the Order Cart

The code below uses the `listen()` and `speak()` functions together to handle ASR with VAD, and TTS, as discussed in the previous section. This loop enables **hands-free voice-based communication** with the application.
 
The application starts the conversation by asking the user what they would like to order, using the `speak()` function, and then repeatedly listens for the user's input with `listen()`. The voice conversation is managed by a sales agent that processes each user input and generates a response. The conversation history is kept in `messages_dictionary`, a list of message dictionaries that serves as the log of the entire interaction.

The sales agent generates the response to each user input. It is guided by `SALES_BOT_PROMPT`, which includes the list of available items so that the agent only takes orders for items on that list. If the sales agent's response involves a tool call, either to finalize the order by getting the order details or to bring in a human for help, the loop breaks, signaling the end of the interaction.
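
A prompt alone does not guarantee that only listed items end up in the order, or that the model's arithmetic is right. A deterministic check on the generated cart can back it up. The sketch below assumes a cart in the JSON shape used in this cookbook and a two-item excerpt of the catalog. `check_cart`, `parse_price`, and `CATALOG` are hypothetical names, and `Decimal` is used to avoid floating-point rounding on prices.

```python
from decimal import Decimal

# Excerpt of the catalog, keyed by item-id, with unit prices as Decimals.
CATALOG = {
    "0001": Decimal("0.50"),  # pencil
    "0005": Decimal("5.00"),  # printing paper
}


def parse_price(price_str):
    """Convert a string such as '$0.50' to a Decimal."""
    return Decimal(price_str.replace("$", "").strip())


def check_cart(cart, catalog=CATALOG):
    """Return (problems, recomputed_total) for a cart in the cookbook's format."""
    problems = []
    total = Decimal("0")
    for line in cart.get("cart", []):
        item_id = line.get("item-id")
        if item_id not in catalog:
            problems.append(f"unknown item-id: {item_id}")
            continue
        total += catalog[item_id] * int(line.get("quantity", 0))
    # Compare the model's stated total against the recomputed one.
    if "total-price" in cart and parse_price(cart["total-price"]) != total:
        problems.append(f"total-price {cart['total-price']} != recomputed ${total:.2f}")
    return problems, total


sample_cart = {
    "cart": [
        {"item-id": "0001", "quantity": 4},
        {"item-id": "0005", "quantity": 1},
    ],
    "total-price": "$7.00",
}
print(check_cart(sample_cart))
```

If the list of problems is non-empty, the application could ask the pricing agent to regenerate the cart or escalate to a human instead of reading out a wrong total.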

The cart management agent is guided by `SALES_CART_PROMPT` and generates a detailed cart in JSON format, including item quantities and total prices, so the user receives a summary of their order. Both agents receive the same item inventory in JSON format and the same conversation history. In production applications, the cart management agent can run in parallel as an asynchronous process to reduce lag.
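
One simple way to get this overlap, sketched here with a worker thread rather than a full asynchronous setup, is to start the pricing request first and play the holding phrase while it runs. The functions `price_cart_stub` and `speak_stub` are stand-ins that only sleep. In a real application they would be replaced by the pricing-agent request and the audio playback.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def price_cart_stub(conversation):
    """Stand-in for the pricing-agent request; sleeps to simulate network latency."""
    time.sleep(0.2)
    return {"cart": [], "total-price": "$0.00"}


def speak_stub(message):
    """Stand-in for audio playback; sleeps to simulate the clip duration."""
    time.sleep(0.2)


def finalize_in_parallel(conversation, price_fn=price_cart_stub, speak_fn=speak_stub):
    """Start pricing in a worker thread, speak the holding phrase, then collect the cart."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(price_fn, conversation)  # pricing starts immediately
        speak_fn("Thank you for your order, let me calculate the total price.")
        return future.result()  # blocks only if pricing is still running


start = time.perf_counter()
cart = finalize_in_parallel([{"role": "user", "content": "Four pencils, please."}])
elapsed = time.perf_counter() - start
print(f"cart ready after {elapsed:.2f} s")
```

With both stand-ins taking about 0.2 seconds, the combined wait is close to the longer of the two rather than their sum, which is the effect the parallel setup is meant to achieve.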

It is also a good idea to convert the numeric dollar price (e.g., $3.00) into a spoken description of the amount (e.g., 3 dollars and 0 cents) for a more natural-sounding response. The `convert_to_words` function below does this.

In [78]:
# Helper Util to convert the $ amount into spoken description for a natural sounding response  
def convert_to_words(amount_str):
    try: 
        # Remove the dollar sign and convert the string to a float
        amount = float(amount_str.replace('$', '').strip())
    
        # Separate the dollar and cent parts
        dollars = int(amount)
        cents = int(round((amount - dollars) * 100))
    
        # Convert dollars and cents to words
        dollar_word = f"{dollars} dollar{'s' if dollars != 1 else ''}"
        cent_word = f"{cents} cent{'s' if cents != 1 else ''}"
    
        # Construct the final string
        return f"{dollar_word} and {cent_word}"
    except ValueError: 
        print("Unable to convert to spoken description")
        return amount_str
    

# Set the messages dictionary with initial welcome message to the customer
messages_dictionary = [{
    "role": "assistant",
    "content": initial_message
}] 

# Set cart to empty 
cart = [] 

# Initiate the conversation with the user 
speak(initial_message)


# Loop until the user has completed the order or asks for human help 
while True:
    # listen to the user input 
    user_input = listen()

    # Append the message to messages dictionary to pass on the model 
    messages_dictionary.append({
        "role": "user",
        "content": user_input
    })
    
    # Response from the model to user input 
    response = oai_client.chat.completions.create(
        model='gpt-4o',
        messages=sales_agent_prompt + messages_dictionary, 
        tools=TOOLS
    )
    
    tool_calls = response.choices[0].message.tool_calls
    
    # Check if model wants to call a tool  
    if tool_calls: 
        tool_function_name = tool_calls[0].function.name
        if tool_function_name == "get_order_details":
            # The pricing agent generates a detailed cart in JSON format, including item quantities and total prices, ensuring the user receives an accurate summary of their order.
            # Let the user know bot is calculating the price 
            speak(finalize_order)
            
            response = oai_client.chat.completions.create(
                model='gpt-4o',
                messages=pricing_agent_prompt + messages_dictionary, 
                response_format={"type": "json_object"}
            )
            # Get the cart 
            # At this point the cart can be sent to the Point-of-sale system 
            cart = json.loads(response.choices[0].message.content)
    
            # Extracting the total price of the entire order
            total_price = cart["total-price"]
            final_message = f".. your total is {convert_to_words(total_price)}"
            
            speak(final_message)
            messages_dictionary.append({
                "role": "assistant",
                "content": finalize_order + " " + final_message
                })
            break
        elif tool_function_name == "get_human_help":
            #  get_human_help function allows the assistant to gracefully transfer the conversation to a human agent if the user requests assistance or deviates from the order process.
            speak(human_help_message)
            messages_dictionary.append({
                "role": "assistant",
                "content": human_help_message
                })
            break
        else: 
            print(f"Tool does not exist: {response.choices[0].message.tool_calls}")
            # A tool-call response usually carries no text content, so skip the speak step below
            continue
    
    # Get message content 
    response_message = response.choices[0].message.content
    
    # Append the message to messages dictionary 
    messages_dictionary.append({
        "role": "assistant",
        "content": response_message
    })
    speak(response_message)
    
    
# Print the conversation
print("*" * 10 + " Conversation log: " + "*" * 10)
print(json.dumps(messages_dictionary, indent=4))

# Print the cart 
print("*" * 10 + " Cart: " + "*" * 10)
print(json.dumps(cart, indent=4))

Agent speaking ... (pre-recorded audio)
Agent listening ...
* done listening
Agent speaking ...(new audio file generated)
Agent listening ...
* done listening
Agent speaking ... (pre-recorded audio)
Agent speaking ...(new audio file generated)
********** Conversation log: **********
[
    {
        "role": "assistant",
        "content": "What would you like to order?"
    },
    {
        "role": "user",
        "content": "Hi, can I get four pencils and printing paper?"
    },
    {
        "role": "assistant",
        "content": "Order: 4 pencils, and printing paper. ANYTHING ELSE?"
    },
    {
        "role": "user",
        "content": "That will be all. Thank you."
    },
    {
        "role": "assistant",
        "content": "Thank you for your order, let me calculate the total price. Your total is 7 dollars and 0 cents"
    }
]
********** Cart: **********
{
    "cart": [
        {
            "item-id": "0001",
            "item-name": "pencil",
            "quantity": 4,
      

### Conclusion

In this notebook, we developed an interactive voice chatbot that handles sales orders for office stationery. The conversation loop manages user interactions, uses the GPT model to generate responses, and ends either by completing an order or by escalating to a human agent. We pre-recorded common phrases to reduce latency, and printed status messages that show when the agent is speaking or listening so that turns stay coordinated. This setup provides a foundation for building more advanced voice-based chatbots with order management capabilities.

### Tips and Tricks to Improve User Experience for STT/TTS Voice Bot Solutions

Due to the inherent lag introduced by speech-to-text (STT) and text-to-speech (TTS) conversions, optimizing the user experience is crucial. Here are some strategies to enhance the responsiveness and fluidity of the conversation:

- **Provide Visual Cues**: Implement visual indicators, when possible, to show when the model is speaking or listening, keeping users informed about the bot's status.  
- **Chunk Incoming Audio**: Segment incoming audio at silence intervals before processing with Whisper or any ASR solution to reduce latency.  
- **Pre-Record Common Phrases**: Use pre-recorded audio for frequently used phrases like welcome messages or notifications about escalating to a human agent, which can save processing time.  
- **Keep Responses Concise**: Encourage the model to generate shorter text outputs and abbreviate common responses to speed up TTS processing.  
- **Stream Output Audio**: Stream the audio output as it is generated, rather than waiting for the entire audio file to be ready, to provide a more seamless conversational experience.  
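
Where true audio streaming is not available, a coarser approximation of the last tip is to split the response into sentences and synthesize them one at a time, so playback of the first sentence can begin while later ones are still being generated. The sketch below shows only the splitting step. `split_sentences` is a hypothetical helper, and the simple rule it uses will mis-split abbreviations such as "e.g.".

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


for sentence in split_sentences("Your total is $7.00. Anything else?"):
    print(sentence)
```

Because the split requires whitespace after the punctuation mark, the decimal point in a price such as `$7.00` does not break the sentence.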
