# Process and Translate Speech with Azure AI Services
The Azure AI Speech service enables you to build speech-enabled applications. This module focuses on the speech to text and text to speech APIs, which let you create apps capable of speech recognition and speech synthesis.

## Learning objectives
In this module, you'll learn how to:

- Provision an Azure resource for the Azure AI Speech service
- Use the speech to text API to implement speech recognition
- Use the text to speech API to implement speech synthesis
- Configure audio format and voices
- Use Speech Synthesis Markup Language (SSML)

## Provision an Azure resource for Speech
- Sign in to the Azure portal at https://portal.azure.com with your Microsoft account.
- Create an Azure AI Speech resource in your Azure subscription. You can use either a dedicated Azure AI Speech resource or a multi-service Azure AI Services resource.
- Install the Speech SDK and `playsound` packages by running the cell below, if they are not already installed.

In [1]:
%pip install azure-cognitiveservices-speech==1.28.0
%pip install playsound

Note: you may need to restart the kernel to use updated packages.


- Update the `.env` file with your Azure Speech resource key and region.
- Run the cell below to import the required modules.

In [6]:
from dotenv import load_dotenv
from datetime import datetime
import os

# Import namespaces
import azure.cognitiveservices.speech as speech_sdk
from playsound import playsound
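
The `.env` file mentioned above is expected to hold two settings, which the later cells read as `COG_SERVICE_KEY` and `COG_SERVICE_REGION`. As a minimal sketch, the helper below checks that both values are present before any SDK call is made, so a missing entry fails with a clear message. `require_settings` is a hypothetical name introduced for illustration; it is not part of the Speech SDK or the original notebook, and the values shown are placeholders.

```python
import os

REQUIRED_SETTINGS = ('COG_SERVICE_KEY', 'COG_SERVICE_REGION')

def require_settings(env=None):
    """Return (key, region) from env, raising ValueError if either is missing.

    Hypothetical helper for illustration; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ValueError('Missing settings in .env: ' + ', '.join(missing))
    return env['COG_SERVICE_KEY'], env['COG_SERVICE_REGION']

# Example with placeholder values (not real credentials)
key, region = require_settings({
    'COG_SERVICE_KEY': 'your-key',
    'COG_SERVICE_REGION': 'eastus',
})
print(region)
```

Because the helper accepts any mapping, it could be called with no arguments after `load_dotenv()` to validate the real environment.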

## Build a speaking clock app

- The `main` function below performs the following tasks:
    - Loads environment variables from the `.env` file
    - Creates an instance of the `SpeechConfig` class, which holds the key and region for your Speech resource
    - Calls the `TranscribeCommand` function to transcribe the spoken command to text
    - Calls the `TellTime` function to speak the current time if the command is "What time is it?"

In [3]:
def main():
    try:
        global speech_config

        # Get Configuration Settings
        load_dotenv()
        cog_key = os.getenv('COG_SERVICE_KEY')
        cog_region = os.getenv('COG_SERVICE_REGION')

        # Configure speech service
        speech_config = speech_sdk.SpeechConfig(cog_key, cog_region)
        print('Ready to use speech service in:', speech_config.region)

        # Get spoken input
        command = TranscribeCommand()
        print(command)
        if command.lower() == 'what time is it?':
            TellTime()

    except Exception as ex:
        print(ex)

- The `TranscribeCommand` function below plays the audio file `time.wav` and transcribes the speech in it to text. (You can replace it with your own audio file.)

In [7]:
def TranscribeCommand():
    command = ''
    # Configure speech recognition
    audioFile = 'time.wav'
    playsound(audioFile)
    audio_config = speech_sdk.AudioConfig(filename=audioFile)
    speech_recognizer = speech_sdk.SpeechRecognizer(speech_config, audio_config)

    # Process speech input
    speech = speech_recognizer.recognize_once_async().get()
    if speech.reason == speech_sdk.ResultReason.RecognizedSpeech:
        command = speech.text
        print(command)
    else:
        print(speech.reason)
        if speech.reason == speech_sdk.ResultReason.Canceled:
            cancellation = speech.cancellation_details
            print(cancellation.reason)
            print(cancellation.error_details)
    # Return the command
    return command
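
The `main` function compares the transcription against the exact string `'what time is it?'`, so a result with different punctuation or spacing would not match. One possible way to make the check more tolerant is to normalize the text first. The sketch below uses only the standard library; `normalize_command` and `is_time_request` are hypothetical helpers, not part of the original notebook.

```python
import string

def normalize_command(text):
    """Lowercase, drop punctuation, and collapse whitespace in a transcription."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return ' '.join(cleaned.split())

def is_time_request(text):
    """Return True when the normalized text asks for the time."""
    return normalize_command(text) == 'what time is it'

print(is_time_request('What time is it?'))
print(is_time_request('Where is the station?'))
```

With helpers like these, the condition in `main` could become `if is_time_request(command):`.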


- The `TellTime` function below gets the current time and converts it to speech output using the text to speech API.
- The call to the `main` function at the end of the cell runs the app.

In [8]:
def TellTime():
    now = datetime.now()
    response_text = 'The time is {}:{:02d}'.format(now.hour,now.minute)

    # Configure speech synthesis
    speech_config.speech_synthesis_voice_name = "en-GB-RyanNeural"
    speech_synthesizer = speech_sdk.SpeechSynthesizer(speech_config)

    # Synthesize spoken output
    speak = speech_synthesizer.speak_text_async(response_text).get()
    if speak.reason != speech_sdk.ResultReason.SynthesizingAudioCompleted:
        print(speak.reason)

    # Print the response
    print(response_text)

main()

Ready to use speech service in: eastus
What time is it?
What time is it?
The time is 1:03
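
The learning objectives list Speech Synthesis Markup Language (SSML), which the clock app above does not use. As a rough sketch of what an SSML request could look like, the hypothetical helper `build_ssml` below wraps the response text in a minimal SSML document that names a voice and a speaking rate. The helper and its parameters are assumptions for illustration; the voice name is the one already set in `TellTime`.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NAMESPACE = 'http://www.w3.org/2001/10/synthesis'

def build_ssml(text, voice='en-GB-RyanNeural', lang='en-GB', rate='medium'):
    """Wrap text in a minimal SSML document with a voice and a prosody rate.

    Hypothetical helper; text and attribute values are XML-escaped.
    """
    return (
        '<speak version="1.0" xmlns="{ns}" xml:lang={lang}>'
        '<voice name={voice}>'
        '<prosody rate={rate}>{text}</prosody>'
        '</voice>'
        '</speak>'
    ).format(ns=SSML_NAMESPACE, lang=quoteattr(lang), voice=quoteattr(voice),
             rate=quoteattr(rate), text=escape(text))

ssml = build_ssml('The time is 1:03', rate='slow')
print(ssml)

# With a configured synthesizer, the SSML string would be passed to
# speak_ssml_async instead of speak_text_async:
# speak = speech_synthesizer.speak_ssml_async(ssml).get()
```

Building the markup in a separate function keeps it easy to inspect before sending it to the service.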


## Translate speech with the Azure AI Speech service

- Create an Azure AI Speech resource in your Azure subscription if not already present.
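- Install the Speech SDK package by running the command below if not already installed. Note that the command pins version 1.24.0, which replaces the 1.28.0 installed earlier in this notebook.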
- Update the `.env` file with your Azure Speech resource key and region.

In [10]:
%pip install azure-cognitiveservices-speech==1.24.0

Collecting azure-cognitiveservices-speech==1.24.0
  Downloading azure_cognitiveservices_speech-1.24.0-py3-none-manylinux1_x86_64.whl (2.5 MB)
     |████████████████████████████████| 2.5 MB 234 kB/s eta 0:00:01
Installing collected packages: azure-cognitiveservices-speech
  Attempting uninstall: azure-cognitiveservices-speech
    Found existing installation: azure-cognitiveservices-speech 1.28.0
    Uninstalling azure-cognitiveservices-speech-1.28.0:
      Successfully uninstalled azure-cognitiveservices-speech-1.28.0
Successfully installed azure-cognitiveservices-speech-1.24.0
Note: you may need to restart the kernel to use updated packages.


- Run the cell below to import the required modules.

In [15]:
from dotenv import load_dotenv
from datetime import datetime
import os

# Import namespaces
import azure.cognitiveservices.speech as speech_sdk
from playsound import playsound


The `main` function below performs the following tasks:
- Loads environment variables from the `.env` file
- Creates an instance of the `SpeechTranslationConfig` class with the key and region for your Speech resource, sets `en-US` as the source language, and adds French, Spanish, and Hindi as target languages
- Creates an instance of the `SpeechConfig` class, which is used later to synthesize the translation as speech
- Prompts the user for a target language
- Calls the `Translate` function to translate the speech

In [19]:
def main():
    try:
        global speech_config
        global translation_config

        # Get Configuration Settings
        load_dotenv()
        cog_key = os.getenv('COG_SERVICE_KEY')
        cog_region = os.getenv('COG_SERVICE_REGION')

        # Configure translation
        translation_config = speech_sdk.translation.SpeechTranslationConfig(cog_key, cog_region)
        translation_config.speech_recognition_language = 'en-US'
        translation_config.add_target_language('fr')
        translation_config.add_target_language('es')
        translation_config.add_target_language('hi')
        print('Ready to translate from',translation_config.speech_recognition_language)

        # Configure speech
        speech_config = speech_sdk.SpeechConfig(cog_key, cog_region)

        # Get user input
        targetLanguage = ''
        while targetLanguage != 'quit':
            targetLanguage = input('\nEnter a target language\n fr = French\n es = Spanish\n hi = Hindi\n Enter anything else to stop\n').lower()
            if targetLanguage in translation_config.target_languages:
                Translate(targetLanguage)
            else:
                targetLanguage = 'quit'

    except Exception as ex:
        print(ex)


- The `Translate` function below plays the audio file `station.wav`, recognizes the speech in it, and translates it to text in the target language. (You can replace it with your own audio file.)
- It then synthesizes the translated text as speech, using a voice that matches the target language.

In [20]:
def Translate(targetLanguage):
    translation = ''

    # Translate speech
    audioFile = 'station.wav'
    playsound(audioFile)
    audio_config = speech_sdk.AudioConfig(filename=audioFile)
    translator = speech_sdk.translation.TranslationRecognizer(translation_config, audio_config = audio_config)
    print("Getting speech from file...")
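    result = translator.recognize_once_async().get()
    if result.reason != speech_sdk.ResultReason.TranslatedSpeech:
        # Recognition or translation failed, so there is nothing to synthesize
        print(result.reason)
        return
    print('Translating "{}"'.format(result.text))
    translation = result.translations[targetLanguage]
    print(translation)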


    # Synthesize translation
    voices = {
            "fr": "fr-FR-HenriNeural",
            "es": "es-ES-ElviraNeural",
            "hi": "hi-IN-MadhurNeural"
    }
    speech_config.speech_synthesis_voice_name = voices.get(targetLanguage)
    speech_synthesizer = speech_sdk.SpeechSynthesizer(speech_config)
    speak = speech_synthesizer.speak_text_async(translation).get()
    if speak.reason != speech_sdk.ResultReason.SynthesizingAudioCompleted:
        print(speak.reason)
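
In `Translate`, `voices.get(targetLanguage)` returns `None` for a language code that has no entry in the dictionary. That cannot happen with the three languages configured in `main`, but it could if a target language were added without a matching voice. As a small sketch of failing early in that case, the hypothetical helper `pick_voice` below raises a descriptive error instead of passing `None` on to the synthesizer. The voice names are the ones from the cell above.

```python
VOICES = {
    'fr': 'fr-FR-HenriNeural',
    'es': 'es-ES-ElviraNeural',
    'hi': 'hi-IN-MadhurNeural',
}

def pick_voice(language, voices=VOICES):
    """Return the voice mapped to a language code, or raise ValueError if none exists."""
    try:
        return voices[language]
    except KeyError:
        raise ValueError('No synthesis voice configured for language: ' + language) from None

print(pick_voice('fr'))
```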


In [21]:
main()

Ready to translate from en-US
Getting speech from file...
Translating "Where is the station?"
Où est la gare?
