title: Language support - Speech service
titleSuffix: Azure AI services
description: The Speech service supports numerous languages for speech to text and text to speech conversion, along with speech translation. This article provides a comprehensive list of language support by service feature.
author: eric-urban
manager: nitinme
ms.service: azure-ai-speech
ms.topic: conceptual
ms.date: 2/1/2024
ms.author: eur
ms.custom: references_regions, build-2024

Language and voice support for the Speech service

The following tables summarize language support for speech to text, text to speech, pronunciation assessment, speech translation, speaker recognition, and other service features.

You can also get the list of locales and voices supported for a specific region or endpoint through the Speech SDK or the Speech REST APIs.
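As one illustration, the following Python sketch queries the text to speech voices list for a region and filters it by locale. It's a minimal example rather than the definitive way to call the service. It assumes the `https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list` endpoint, the `Ocp-Apim-Subscription-Key` header, and `ShortName` and `Locale` fields in the JSON response. The function names are hypothetical.

```python
import json
import urllib.request


def voices_list_url(region: str) -> str:
    """Build the voices list URL for a Speech resource region (assumed URL pattern)."""
    return f"https://{region}.tts.speech.microsoft.com/cognitiveservices/voices/list"


def fetch_voices(region: str, key: str) -> list:
    """Download the voice list. Needs network access and a valid resource key."""
    request = urllib.request.Request(
        voices_list_url(region),
        headers={"Ocp-Apim-Subscription-Key": key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def voices_for_locale(voices: list, locale: str) -> list:
    """Return the sorted short names of voices whose primary locale is `locale`."""
    return sorted(v["ShortName"] for v in voices if v.get("Locale") == locale)
```

For example, `voices_for_locale(fetch_voices("westus", key), "en-GB")` would list the en-GB voices available in that region, where `key` is your own resource key.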

Supported languages

Language support varies by Speech service functionality.

Note

See Speech Containers and Embedded Speech separately for their supported languages.

Choose a Speech feature

The table in this section summarizes the locales supported for Speech to text. See the table footnotes for more details.

More remarks for Speech to text locales are included in the custom speech section of this article.

Tip

Try out the real-time speech to text tool without having to use any code.

[!INCLUDE Language support include]

Custom speech

To improve Speech to text recognition accuracy, customization is available for some languages and base models. Depending on the locale, you can upload audio + human-labeled transcripts, plain text, structured text, and pronunciation data. By default, plain text customization is supported for all available base models. To learn more about customization, see custom speech.

These are the locales that support the display text format feature: da-DK, de-DE, en-AU, en-CA, en-GB, en-HK, en-IE, en-IN, en-NG, en-NZ, en-PH, en-SG, en-US, es-ES, es-MX, fi-FI, fr-CA, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, nb-NO, nl-NL, pl-PL, pt-BR, pt-PT, sv-SE, tr-TR, zh-CN, zh-HK.

Fast transcription

The supported locales for the fast transcription API are: en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. You can only specify one locale per transcription request.
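The one-locale rule can be checked on the client before a request is sent. The following Python sketch is illustrative only: the locale set is copied from the preceding list, and the function name is hypothetical rather than part of any SDK.

```python
# Locales copied from the fast transcription list in this article.
FAST_TRANSCRIPTION_LOCALES = frozenset({
    "en-US", "es-ES", "es-MX", "fr-FR", "hi-IN",
    "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN",
})


def validate_fast_transcription_locale(locales):
    """Return the single supported locale for a request, or raise ValueError."""
    if isinstance(locales, str):
        locales = [locales]
    if len(locales) != 1:
        raise ValueError("Specify exactly one locale per transcription request.")
    (locale,) = locales
    if locale not in FAST_TRANSCRIPTION_LOCALES:
        raise ValueError(f"Locale {locale!r} isn't supported for fast transcription.")
    return locale
```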

The table in this section summarizes the locales and voices supported for Text to speech. See the table footnotes for more details.

More remarks for text to speech locales are included in the voice styles and roles, prebuilt neural voices, custom neural voice, and personal voice sections in this article.

Tip

Check the Voice Gallery and determine the right voice for your business needs.

[!INCLUDE Language support include]

Multilingual voices

Multilingual voices can support more languages than their primary locale. This expansion lets you express content in various languages, which helps overcome language barriers and fosters more inclusive global communication.

Use this table to understand all supported speaking languages for each multilingual neural voice. If the voice doesn't speak the language of the input text, the Speech service doesn't output synthesized audio. The table is sorted by the number of supported languages in descending order. The primary locale for each voice is indicated by the prefix in its name. For example, the primary locale of en-US-AndrewMultilingualNeural is en-US.
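The naming convention can be sketched in Python as follows. This is a small helper under one assumption: the voice name is the locale, a hyphen, and then a voice identifier that itself contains no hyphen. The function name is hypothetical.

```python
def primary_locale(voice_name: str) -> str:
    """Return the locale prefix of a voice name, such as en-US for en-US-AndrewMultilingualNeural."""
    # Everything before the last hyphen is treated as the locale (assumption).
    locale, separator, voice_id = voice_name.rpartition("-")
    if not separator or not locale or not voice_id:
        raise ValueError(f"Unexpected voice name format: {voice_name!r}")
    return locale
```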

[!INCLUDE Language support include]

Voice styles and roles

In some cases, you can adjust the speaking style to express different emotions like cheerfulness, empathy, and calm. All prebuilt voices with speaking styles and multi-style custom voices support style degree adjustment. You can optimize the voice for different scenarios like customer service, newscast, and voice assistant. With roles, the same voice can act as a different age and gender.

To learn how you can configure and adjust neural voice styles and roles, see Speech Synthesis Markup Language.
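As a rough illustration, the following Python sketch builds an SSML document that sets a style, a style degree, and optionally a role. It assumes the `mstts:express-as` element with `style`, `styledegree`, and `role` attributes, the `https://www.w3.org/2001/mstts` namespace, and a style degree range of 0.01 to 2. Check those details against the Speech Synthesis Markup Language reference. The function name is hypothetical, and not every voice supports every style or role.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_styled_ssml(text, voice, style, style_degree=1.0, role=None, lang="en-US"):
    """Return SSML that speaks `text` with the given voice, style, and optional role."""
    if not 0.01 <= style_degree <= 2:  # assumed valid range for styledegree
        raise ValueError("style_degree must be between 0.01 and 2")
    attrs = f"style={quoteattr(style)} styledegree={quoteattr(str(style_degree))}"
    if role:
        attrs += f" role={quoteattr(role)}"
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as {attrs}>{escape(text)}</mstts:express-as>"
        "</voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    return ssml
```

For example, `build_styled_ssml("Welcome back!", "en-US-AriaNeural", "cheerful", 1.5)` produces markup that you can pass to a synthesis request, provided the chosen voice supports that style.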

Use the following table to determine supported styles and roles for each neural voice.

[!INCLUDE Language support include]

Viseme

This table lists all the locales supported for Viseme. For more information about Viseme, see Get facial position with viseme and Viseme element.

[!INCLUDE Language support include]

Prebuilt neural voices

Each prebuilt neural voice supports a specific language and dialect, identified by locale. You can try the demo and hear the voices in the Voice Gallery.

Important

Pricing varies for prebuilt neural voice (see Neural on the pricing page) and custom neural voice (see Custom Neural on the pricing page). For more information, see the Pricing page.

Each prebuilt neural voice model is available at 24 kHz and high-fidelity 48 kHz. Other sample rates can be obtained through upsampling or downsampling when synthesizing.

Note that the following neural voices are retired.

  • The English (United Kingdom) voice en-GB-MiaNeural was retired on October 30, 2021. As of that date, all service requests to en-GB-MiaNeural are automatically redirected to en-GB-SoniaNeural. If you're using container Neural TTS, download and deploy the latest version. Requests with previous versions fail starting October 30, 2021.
  • The en-US-JessaNeural voice is retired and replaced by en-US-AriaNeural. If you were using "Jessa" before, convert to "Aria."
  • The Chinese (Mandarin, Simplified) voice zh-CN-XiaoxuanNeural was retired on February 29, 2024. As of that date, all service requests to zh-CN-XiaoxuanNeural are automatically redirected to zh-CN-XiaoyiNeural. If you're using container Neural TTS, download and deploy the latest version. Requests with previous versions fail starting February 29, 2024.
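If your application stores voice names in configuration, you might update them on the client instead of relying on service-side redirects. The following Python sketch maps the retired voices listed above to their replacements. The function name is hypothetical.

```python
# Retired voices and their replacements, as listed in this article.
RETIRED_VOICE_REPLACEMENTS = {
    "en-GB-MiaNeural": "en-GB-SoniaNeural",
    "en-US-JessaNeural": "en-US-AriaNeural",
    "zh-CN-XiaoxuanNeural": "zh-CN-XiaoyiNeural",
}


def current_voice(voice_name: str) -> str:
    """Return the replacement for a retired voice, or the name unchanged."""
    return RETIRED_VOICE_REPLACEMENTS.get(voice_name, voice_name)
```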

Custom neural voice

Custom neural voice lets you create synthetic voices that are rich in speaking styles. You can create a unique brand voice in multiple languages and styles by using a small set of recording data. Multi-style custom neural voices support style degree adjustment. There are two custom neural voice (CNV) project types: CNV Pro and CNV Lite (preview).

Select the right locale that matches your training data to train a custom neural voice model. For example, if the recording data is spoken in English with a British accent, select en-GB.

With the cross-lingual feature, you can transfer your custom neural voice model to speak a second language. For example, with zh-CN data, you can create a voice that speaks en-AU or any other language with cross-lingual support. For the cross-lingual feature, locales fall into two tiers: locales that can serve as both the source and the target of a cross-lingual transfer, and locales that can serve only as the target. The following table distinguishes the two tiers.

[!INCLUDE Language support include]

Personal voice

Personal voice is a feature that lets you create a voice that sounds like you or your users. The following table summarizes the locales supported for personal voice.

[!INCLUDE Language support include]

The table in this section summarizes the 33 locales supported for pronunciation assessment. Each language is available in all Speech to text regions. The latest update extends support from English to 32 more languages and adds quality enhancements to existing features, including accuracy, fluency, and miscue assessment.

Specify the language that you're learning or whose pronunciation you're practicing. The default language is en-US. If you know your target learning language, set the locale accordingly. For example, if you're learning British English, specify en-GB. If you're teaching a broader language, such as Spanish, and you're uncertain which locale to select, you can run several accent models (es-ES, es-MX) and choose the one that achieves the highest score for your scenario. If you're interested in languages not listed in the following table, fill out this intake form for further assistance.
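Choosing among accent models by score can be sketched as follows. This Python example is only an outline: `score` stands in for whatever code runs a pronunciation assessment for a locale and returns its score, and the numbers in the usage example are made up for illustration.

```python
from typing import Callable, Iterable


def best_scoring_locale(candidates: Iterable[str], score: Callable[[str], float]) -> str:
    """Return the candidate locale with the highest assessment score.

    `score` is called once per candidate. It's a placeholder for your own
    assessment call, not a Speech SDK function.
    """
    return max(candidates, key=score)


# Hypothetical scores for the same recording under two accent models.
example_scores = {"es-ES": 81.0, "es-MX": 88.5}
chosen = best_scoring_locale(example_scores, example_scores.__getitem__)
```

With these made-up scores, `chosen` is `es-MX`.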

[!INCLUDE Language support include]

Real-time speech translation

The table in this section summarizes the locales supported for Speech translation. Speech translation supports different languages for speech to speech and speech to text translation. The available target languages depend on whether the translation target is speech or text.

Translate from language

To set the input speech recognition language, specify the full locale with a dash (-) separator. See the speech to text language table. All languages are supported except jv-ID and wuu-CN. The default language is en-US if you don't specify a language.

Translate to text language

To set the translation target language, with few exceptions, specify only the language code that precedes the locale dash (-) separator. For example, use es for Spanish (Spain) instead of es-ES. See the speech translation target language table below. The default language is en if you don't specify a language.
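The two conventions can be sketched in Python as follows. The helper names are hypothetical. The `exceptions` parameter is a placeholder for the target codes that must keep their region or script part, which this article doesn't enumerate, so take them from the target language table rather than from this example.

```python
# Source locales that speech translation doesn't support, per this article.
UNSUPPORTED_SOURCE_LOCALES = frozenset({"jv-ID", "wuu-CN"})


def source_locale(locale=None):
    """Return the full source locale, defaulting to en-US."""
    locale = locale or "en-US"
    if "-" not in locale:
        raise ValueError("Specify the full locale with a dash separator, such as en-US.")
    if locale in UNSUPPORTED_SOURCE_LOCALES:
        raise ValueError(f"{locale} isn't supported as a translation source.")
    return locale


def target_language(locale=None, exceptions=()):
    """Return the target language code, defaulting to en.

    Codes listed in `exceptions` (supplied by the caller) are returned unchanged.
    """
    if not locale:
        return "en"
    if locale in exceptions:
        return locale
    return locale.split("-", 1)[0]
```

For example, `target_language("es-ES")` returns `es`, and `source_locale()` returns `en-US`.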

[!INCLUDE Language support include]

Video translation

The following table shows the fixed mapping between source and target locales, along with the full locales associated with each language.

[!INCLUDE Language support include]

The table in this section summarizes the locales supported for Language identification.

Note

Language Identification compares speech at the language level, such as English and German. Do not include multiple locales of the same language in your candidate list.
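A candidate list can be checked for this problem before it's sent to the service. The following Python sketch groups candidate locales by the language code before the first dash and reports any language that appears more than once. The function name is hypothetical.

```python
from collections import defaultdict


def same_language_conflicts(candidates):
    """Return {language: [locales]} for languages with more than one candidate locale."""
    by_language = defaultdict(list)
    for locale in candidates:
        language = locale.split("-", 1)[0].lower()
        by_language[language].append(locale)
    return {lang: locs for lang, locs in by_language.items() if len(locs) > 1}
```

An empty result means that each candidate locale belongs to a different language.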

[!INCLUDE Language support include]

The table in this section summarizes the locales supported for Speaker recognition. Speaker recognition is mostly language agnostic. The universal model for text-independent speaker recognition combines various data sources from multiple languages. We tuned and evaluated the model on these languages and locales. For more information on speaker recognition, see the overview.

[!INCLUDE Language support include]

The table in this section summarizes the locales supported for custom keyword and keyword verification.

[!INCLUDE Language support include]

The table in this section summarizes the locales supported for the Intent Recognizer Pattern Matcher.

[!INCLUDE Language support include]


Next steps