# Azure AI Speech - Text to Speech with Avatar rendering
In this notebook we look at how to build a chatbot with avatar rendering using the Azure AI Speech service (speech to text and text to speech).

We will use Azure AI Speech to convert the user's spoken question into text, Azure OpenAI to generate an answer, and Azure AI Speech again to render that answer as a talking avatar video.

In [1]:
from IPython.display import display, HTML, Audio, Video
import os
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI
from dotenv import load_dotenv
import json
import requests
import time

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_DEPLOYMENT_ENDPOINT = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")
OPENAI_GPT35_DEPLOYMENT_NAME = os.getenv("OPENAI_GPT35_DEPLOYMENT_NAME")

client = AzureOpenAI(
    azure_endpoint=OPENAI_DEPLOYMENT_ENDPOINT,
    api_key=OPENAI_API_KEY,
    api_version="2023-05-15"
)

SPEECH_KEY = os.getenv("SPEECH_KEY")
SPEECH_REGION = os.getenv("SPEECH_REGION")
SPEECH_HOST = "customvoice.api.speech.microsoft.com"
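
Every later cell depends on the settings loaded above, so it can help to check them before making any service call. The following is a minimal sketch, not part of the original notebook: `find_missing_settings` is a hypothetical helper, and the variable names are the ones read with `os.getenv` in the setup cell.

```python
import os

# Names taken from the setup cell above.
REQUIRED_SETTINGS = [
    "OPENAI_API_KEY",
    "OPENAI_DEPLOYMENT_ENDPOINT",
    "OPENAI_GPT35_DEPLOYMENT_NAME",
    "SPEECH_KEY",
    "SPEECH_REGION",
]


def find_missing_settings(names, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    with a plain dict for testing.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```

Running this right after `load_dotenv()` reports a forgotten `.env` entry immediately, rather than as an authentication error several cells later.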

In [2]:
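def call_openAI(messages):
    # messages: list of {"role": ..., "content": ...} dicts (system, user and assistant turns)
    response = client.chat.completions.create(
        model=OPENAI_GPT35_DEPLOYMENT_NAME,
        messages=messages,
        temperature=0.5,
        max_tokens=100,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )

    # Return only the text of the first completion
    return response.choices[0].message.content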

## Azure AI Speech - Speech to Text

In [3]:
'''
Converts an audio file (containing the user's question) into text.
'''
def voice_to_text_from_file(file_name):
    speech_config = speechsdk.SpeechConfig(subscription=SPEECH_KEY, region=SPEECH_REGION)
    speech_config.speech_recognition_language = "en-US"

    audio_config = speechsdk.audio.AudioConfig(filename=file_name)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # Recognizes a single utterance from the file
    speech_recognition_result = speech_recognizer.recognize_once_async().get()

    if speech_recognition_result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print("Text: {}".format(speech_recognition_result.text))
    else:
        # No speech was recognized or the request was canceled; the returned text will be empty
        print("Speech was not recognized: {}".format(speech_recognition_result.reason))
    return speech_recognition_result.text

In [4]:
# Audio that we want to convert
Audio("./data/user question.wav")

In [5]:
user_message = voice_to_text_from_file("./data/user question.wav")

Text: Hello is it possible to create a real time avatar using Azure AI services?


## Simulating a RAG chatbot
We already looked at how to build a RAG (retrieval-augmented generation) solution. In this notebook we simulate that approach by loading the necessary context from a file.

In this example, the context is the [documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/what-is-text-to-speech-avatar) of the Azure AI Speech text to speech avatar feature.

In [6]:
with open("./data/tts-avatar.md", "r") as file:
    context = file.read()

print(context)

# Text to speech avatar overview

Text to speech avatar converts text into a digital video of a photorealistic human (either a prebuilt avatar or a custom text to speech avatar) speaking with a natural-sounding voice. The text to speech avatar video can be synthesized asynchronously or in real time. Developers can build applications integrated with text to speech avatar through an API, or use a content creation tool on Speech Studio to create video content without coding.

With text to speech avatar's advanced neural network models, the feature empowers users to deliver life-like and high-quality synthetic talking avatar videos for various applications while adhering to responsible AI practices.

> [!NOTE]
> The text to speech avatar feature is only available in the following service regions: West US 2, West Europe, and Southeast Asia. 

Azure AI text to speech avatar feature capabilities include:

- Converts text into a digital video of a photorealistic human speaking with natural-soun

Let's define the system prompt and create a mock conversation history.

Sending these messages to the model produces the answer for the user in text format.

In [7]:
# prepare prompt
system_message = f"""
You are a HELPFUL assistant answering user questions about Azure services. Answer in a clear and concise manner.

Below you have all the information you require to answer the user's question. Do not answer anything which is not in this information.

Information starts here:
{context}
"""


messages = [{"role": "system", "content": system_message},
            {"role": "user", "content": "Good morning, how are you today?"},
            {"role": "assistant", "content": "Hello, I am doing well. How can I help you today?"},
            {"role": "user", "content": user_message}]
 
result = call_openAI(messages)
display(HTML(result))
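
The message list above follows a fixed pattern: the system prompt, then alternating user and assistant turns, then the new question. As a minimal sketch of that pattern, the hypothetical helper `build_messages` below (not part of the original notebook) assembles the same structure from a list of earlier turns.

```python
def build_messages(system_message, history, user_message):
    """Assemble a chat payload: system prompt, prior turns, then the new question.

    Hypothetical helper. `history` is a list of (user_text, assistant_text) pairs.
    """
    messages = [{"role": "system", "content": system_message}]
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_message})
    return messages


example = build_messages(
    "You are a helpful assistant.",
    [("Good morning, how are you today?",
      "Hello, I am doing well. How can I help you today?")],
    "Is it possible to create a real time avatar?",
)
print([m["role"] for m in example])
```

In a real chatbot the `history` list would grow by one pair after each exchange, while the system prompt stays first.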

## Azure AI Speech - Render avatar
We are going to use the [batch generation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/batch-synthesis-avatar) of the Avatar but there is also a [real-time generation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech-avatar/real-time-synthesis-avatar) available that you can use in a real-time chatbot.

In [8]:
'''
Function which implements a REST API call to the AI Speech service to initiate the batch generation of the Avatar.
'''
def submit_synthesis(input_text):
    url = f'https://{SPEECH_REGION}.{SPEECH_HOST}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar'
    header = {
        'Ocp-Apim-Subscription-Key': SPEECH_KEY,
        'Content-Type': 'application/json'
    }

    payload = {
        'displayName': "Azure QnA Avatar",
        'description': "This bot answers questions about Azure services.",
        "textType": "PlainText",
        'synthesisConfig': {
            "voice": "en-US-MonicaNeural",
        },
        # Replace with your custom voice name and deployment ID if you want to use custom voice.
        # Multiple voices are supported, the mixture of custom voices and platform voices is allowed.
        # Invalid voice name or deployment ID will be rejected.
        'customVoices': {
            # "YOUR_CUSTOM_VOICE_NAME": "YOUR_CUSTOM_VOICE_ID"
        },
        "inputs": [
            {
                "text": input_text,
            },
        ],
        "properties": {
            "customized": False, # set to True if you want to use customized avatar
            "talkingAvatarCharacter": "lisa",  # talking avatar character
            "talkingAvatarStyle": "graceful-sitting",  # talking avatar style, required for prebuilt avatar, optional for custom avatar
            "videoFormat": "mp4",  # mp4 or webm, webm is required for transparent background
            "videoCodec": "h264",  # hevc, h264 or vp9, vp9 is required for transparent background; default is hevc
            "subtitleType": "soft_embedded",
            "backgroundColor": "#FFFFFFFF", # background color in RGBA format, default is white; can be set to 'transparent' for transparent background
        }
    }

    response = requests.post(url, json.dumps(payload), headers=header)
    if response.status_code < 400:
        print('Batch avatar synthesis job submitted successfully')
        print(f'Job ID: {response.json()["id"]}')
        return response.json()["id"]
    else:
        print(f'Failed to submit batch avatar synthesis job: {response.text}')
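
The comments in the payload above note that a transparent background requires the webm format with the vp9 codec. A minimal sketch of how those linked settings could be kept consistent is shown below. `build_avatar_payload` and its `transparent` flag are hypothetical names introduced for illustration. The field names and values are copied from the cell above.

```python
import json


def build_avatar_payload(input_text, voice="en-US-MonicaNeural",
                         character="lisa", style="graceful-sitting",
                         transparent=False):
    """Build the request body for a batch avatar synthesis job (hypothetical helper)."""
    properties = {
        "customized": False,
        "talkingAvatarCharacter": character,
        "talkingAvatarStyle": style,
        "subtitleType": "soft_embedded",
    }
    if transparent:
        # Per the comments in the cell above: webm and vp9 are required
        # for a transparent background.
        properties.update(videoFormat="webm", videoCodec="vp9",
                          backgroundColor="transparent")
    else:
        properties.update(videoFormat="mp4", videoCodec="h264",
                          backgroundColor="#FFFFFFFF")
    return {
        "displayName": "Azure QnA Avatar",
        "description": "This bot answers questions about Azure services.",
        "textType": "PlainText",
        "synthesisConfig": {"voice": voice},
        "customVoices": {},
        "inputs": [{"text": input_text}],
        "properties": properties,
    }


# The result is plain JSON-serializable data, so it can be inspected before sending.
print(json.dumps(build_avatar_payload("Hello!", transparent=True)["properties"], indent=2))
```

Building the payload in a separate function keeps the request body inspectable without calling the service.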


In [9]:
'''
Function that, given the job_id of a batch synthesis job, gets the status of that job.
'''
def get_synthesis(job_id):
    url = f'https://{SPEECH_REGION}.{SPEECH_HOST}/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar/{job_id}'
    header = {
        'Ocp-Apim-Subscription-Key': SPEECH_KEY
    }
    response = requests.get(url, headers=header)
    if response.status_code < 400:
        if response.json()['status'] == 'Succeeded':
            print(f'Batch synthesis job succeeded, download URL: {response.json()["outputs"]["result"]}')
        return response.json()
    else:
        print(f'Failed to get batch synthesis job: {response.text}')


In [10]:
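job_id = submit_synthesis(result)
if job_id is not None:
    # Poll the job status every 5 seconds until it finishes
    while True:
        job_info = get_synthesis(job_id)
        if job_info is None:
            # get_synthesis returns None when the status request itself fails
            print('could not retrieve the batch avatar synthesis job status')
            break
        if job_info['status'] == 'Succeeded':
            print('batch avatar synthesis job succeeded')
            break
        elif job_info['status'] == 'Failed':
            print('batch avatar synthesis job failed')
            break
        else:
            print(f"batch avatar synthesis job is still running, status [{job_info['status']}]")
            time.sleep(5)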

Batch avatar synthesis job submitted successfully
Job ID: ec041e89-2896-4cd3-84fe-39897025de9a
batch avatar synthesis job is still running, status [Running]
batch avatar synthesis job is still running, status [Running]
batch avatar synthesis job is still running, status [Running]
batch avatar synthesis job is still running, status [Running]
batch avatar synthesis job is still running, status [Running]
Batch synthesis job succeeded, download URL: https://cvoiceprodweu.blob.core.windows.net/batch-synthesis-output/ec041e89-2896-4cd3-84fe-39897025de9a/0001.mp4?skoid=85130dbe-2390-4897-a9e9-5c88bb59daff&sktid=33e01921-4d64-4f8c-a055-5bdaffd5e33d&skt=2024-03-13T17%3A23%3A22Z&ske=2024-03-19T17%3A28%3A22Z&sks=b&skv=2023-11-03&sv=2023-11-03&st=2024-03-13T17%3A23%3A22Z&se=2024-03-14T17%3A28%3A22Z&sr=b&sp=rl&sig=LQ3PRcrmSOsiuDEWzbydBfcS%2FyxIl1WVCA56bhZQ4sM%3D
batch avatar synthesis job succeeded
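
The polling loop above runs until the job reports `Succeeded` or `Failed`, so a job that never finishes would block the notebook. A minimal sketch of a bounded variant follows. `wait_for_job`, `timeout_s` and `poll_interval_s` are hypothetical names, and the status function is passed in as a parameter so the logic can be exercised without calling the service.

```python
import time


def wait_for_job(fetch_status, job_id, timeout_s=600, poll_interval_s=5,
                 sleep=time.sleep):
    """Poll fetch_status(job_id) until the job finishes or the timeout elapses.

    fetch_status is expected to return a dict with a 'status' key, or None
    when the request fails (the shape get_synthesis returns above).
    Returns the final job dict, or None on timeout.
    """
    waited = 0
    while waited <= timeout_s:
        job_info = fetch_status(job_id)
        if job_info is not None and job_info.get("status") in ("Succeeded", "Failed"):
            return job_info
        sleep(poll_interval_s)
        waited += poll_interval_s
    return None


# Possible usage in this notebook (assumes get_synthesis and job_id from above):
# job_info = wait_for_job(get_synthesis, job_id)
```

A failed status request is treated as "not finished yet" here, so a transient error does not end the wait. Whether that is the right choice depends on how the bot should react to outages.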


In [11]:
# Download avatar file to local system
response = requests.get(job_info["outputs"]["result"])
file_Path = "./data/avatar_answer.mp4"
 
if response.status_code == 200:
    with open(file_Path, 'wb') as file:
        file.write(response.content)
    print('File downloaded successfully')
else:
    print('Failed to download file')

File downloaded successfully
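
The download URL printed by the job is a time-limited link: its `se` query parameter carries the expiry time of the shared access signature. The sketch below shows one way to read that value before attempting a download. `sas_expiry` and `is_expired` are hypothetical helpers, and the timestamp format is assumed from the URL shown in the output above.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs


def sas_expiry(url):
    """Return the expiry time from the URL's `se` parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("se")
    if not values:
        return None
    # Assumes the 'YYYY-MM-DDTHH:MM:SSZ' form seen in the printed download URL.
    parsed = datetime.strptime(values[0], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(url, now=None):
    """True if the URL carries an expiry time that is already in the past."""
    expiry = sas_expiry(url)
    now = now or datetime.now(timezone.utc)
    return expiry is not None and now >= expiry


# Placeholder URL for illustration; in the notebook this would be job_info["outputs"]["result"].
sample_url = "https://example.blob.core.windows.net/out/0001.mp4?se=2024-03-14T17%3A28%3A22Z&sp=rl"
print(sas_expiry(sample_url))
```

If the link has expired, querying the job again with `get_synthesis` is a reasonable first step, although whether the service issues a fresh link is not confirmed by this notebook.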


In [12]:
# The video may play without audio inside VS Code; open the downloaded file locally to hear it.
Video(file_Path)