title | titleSuffix | description | author | manager | ms.service | ms.topic | ms.date | ms.author | ms.custom |
---|---|---|---|---|---|---|---|---|---|
Language support - Speech service | Azure AI services | The Speech service supports numerous languages for speech to text and text to speech conversion, along with speech translation. This article provides a comprehensive list of language support by service feature. | eric-urban | nitinme | azure-ai-speech | conceptual | 2/1/2024 | eur | references_regions, build-2024 |
The following tables summarize language support for speech to text, text to speech, pronunciation assessment, speech translation, speaker recognition, and more service features.
You can also get a list of locales and voices supported for each specific region or endpoint via:
Language support varies by Speech service functionality.
Note
See Speech Containers and Embedded Speech separately for their supported languages.
The table in this section summarizes the locales supported for Speech to text. See the table footnotes for more details.
More remarks for Speech to text locales are included in the custom speech section of this article.
Tip
Try out the real-time speech to text tool without having to use any code.
[!INCLUDE Language support include]
To improve Speech to text recognition accuracy, customization is available for some languages and base models. Depending on the locale, you can upload audio + human-labeled transcripts, plain text, structured text, and pronunciation data. By default, plain text customization is supported for all available base models. To learn more about customization, see custom speech.
These are the locales that support the display text format feature: da-DK, de-DE, en-AU, en-CA, en-GB, en-HK, en-IE, en-IN, en-NG, en-NZ, en-PH, en-SG, en-US, es-ES, es-MX, fi-FI, fr-CA, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, nb-NO, nl-NL, pl-PL, pt-BR, pt-PT, sv-SE, tr-TR, zh-CN, zh-HK.
The supported locales for the fast transcription API are: en-US, es-ES, es-MX, fr-FR, hi-IN, it-IT, ja-JP, ko-KR, pt-BR, and zh-CN. You can only specify one locale per transcription request.
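The two constraints above (a fixed locale list and one locale per request) can be checked on the client before a request is sent. The following is a minimal sketch in plain Python; the function name `check_fast_transcription_locales` is hypothetical and isn't part of any Speech SDK or REST API.

```python
# Locales listed in this article for the fast transcription API.
FAST_TRANSCRIPTION_LOCALES = frozenset({
    "en-US", "es-ES", "es-MX", "fr-FR", "hi-IN",
    "it-IT", "ja-JP", "ko-KR", "pt-BR", "zh-CN",
})


def check_fast_transcription_locales(locales):
    """Return the single requested locale, or raise ValueError.

    `locales` is the list of locales a caller intends to send with one
    transcription request (hypothetical helper, not a service API).
    """
    if len(locales) != 1:
        raise ValueError("Specify exactly one locale per transcription request.")
    locale = locales[0]
    if locale not in FAST_TRANSCRIPTION_LOCALES:
        raise ValueError(f"{locale!r} is not listed for fast transcription.")
    return locale
```

The locale set is copied from the list above, so it needs updating whenever the supported locales change.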
The table in this section summarizes the locales and voices supported for Text to speech. See the table footnotes for more details.
More remarks for text to speech locales are included in the voice styles and roles, prebuilt neural voices, Custom neural voice, and personal voice sections in this article.
Tip
Check the Voice Gallery and determine the right voice for your business needs.
[!INCLUDE Language support include]
Multilingual voices can support more languages. This expansion enhances your ability to express content in various languages, to overcome language barriers and foster a more inclusive global communication environment.
Use this table to understand all supported speaking languages for each multilingual neural voice. If the voice doesn’t speak the language of the input text, the Speech service doesn’t output synthesized audio. The table is sorted by the number of supported languages in descending order. The primary locale for each voice is indicated by the prefix in its name. For example, the primary locale of the voice en-US-AndrewMultilingualNeural is en-US.
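The naming convention can be illustrated with a short sketch. It assumes the locale prefix is exactly the first two hyphen-separated segments of the voice name, as in the example above; voice names whose locale prefix has more segments aren't handled. The helper name `primary_locale` is hypothetical.

```python
def primary_locale(voice_name: str) -> str:
    """Return the locale prefix of a voice name such as
    'en-US-AndrewMultilingualNeural'.

    Assumption: the prefix is the first two hyphen-separated segments.
    """
    parts = voice_name.split("-")
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"Unexpected voice name format: {voice_name!r}")
    return "-".join(parts[:2])
```

For example, `primary_locale("en-US-AndrewMultilingualNeural")` returns `"en-US"`.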
[!INCLUDE Language support include]
In some cases, you can adjust the speaking style to express different emotions like cheerfulness, empathy, and calm. All prebuilt voices with speaking styles and multi-style custom voices support style degree adjustment. You can optimize the voice for different scenarios like customer service, newscast, and voice assistant. With roles, the same voice can act as a different age and gender.
To learn how you can configure and adjust neural voice styles and roles, see Speech Synthesis Markup Language.
Use the following table to determine supported styles and roles for each neural voice.
[!INCLUDE Language support include]
This table lists all the locales supported for Viseme. For more information about Viseme, see Get facial position with viseme and Viseme element.
[!INCLUDE Language support include]
Each prebuilt neural voice supports a specific language and dialect, identified by locale. You can try the demo and hear the voices in the Voice Gallery.
Important
Pricing varies for Prebuilt Neural Voice (see Neural on the pricing page) and custom neural voice (see Custom Neural on the pricing page). For more information, see the Pricing page.
Each prebuilt neural voice model is available at 24kHz and high-fidelity 48kHz. Other sample rates can be obtained through upsampling or downsampling when synthesizing.
The following neural voices are retired:
- The English (United Kingdom) voice en-GB-MiaNeural was retired on October 30, 2021. All service requests to en-GB-MiaNeural are redirected to en-GB-SoniaNeural automatically as of October 30, 2021. If you're using container Neural TTS, download and deploy the latest version. Requests with previous versions don't succeed as of October 30, 2021.
- The en-US-JessaNeural voice is retired and replaced by en-US-AriaNeural. If you were using "Jessa" before, convert to "Aria."
- The Chinese (Mandarin, Simplified) voice zh-CN-XiaoxuanNeural was retired on February 29, 2024. All service requests to zh-CN-XiaoxuanNeural are redirected to zh-CN-XiaoyiNeural automatically as of February 29, 2024. If you're using container Neural TTS, download and deploy the latest version. Requests with previous versions don't succeed as of February 29, 2024.
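If you keep voice names in configuration, you might want to replace retired names explicitly rather than rely on redirection. The following sketch maps the three retired voices listed above to their replacements; `resolve_voice` is a hypothetical helper, not a Speech SDK function.

```python
# Retired voice -> replacement voice, as listed in this article.
RETIRED_VOICES = {
    "en-GB-MiaNeural": "en-GB-SoniaNeural",
    "en-US-JessaNeural": "en-US-AriaNeural",
    "zh-CN-XiaoxuanNeural": "zh-CN-XiaoyiNeural",
}


def resolve_voice(voice_name: str) -> str:
    """Return the replacement for a retired voice, or the name unchanged."""
    return RETIRED_VOICES.get(voice_name, voice_name)
```

Voice names that aren't in the mapping pass through unchanged, so the helper is safe to apply to every configured voice.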
Custom neural voice lets you create synthetic voices that are rich in speaking styles. You can create a unique brand voice in multiple languages and styles by using a small set of recording data. Multi-style custom neural voices support style degree adjustment. There are two custom neural voice (CNV) project types: CNV Pro and CNV Lite (preview).
Select the locale that matches your training data to train a custom neural voice model. For example, if the recording data is spoken in English with a British accent, select en-GB.
With the cross-lingual feature, you can transfer your custom neural voice model to speak a second language. For example, with the zh-CN data, you can create a voice that speaks en-AU or any of the languages with Cross-lingual support. For the cross-lingual feature, we categorize locales into two tiers: one includes source languages that support the cross-lingual feature, and the other comprises locales designated as target languages for cross-lingual transfer. In the following table, note which locales function as both cross-lingual sources and targets and which are eligible only as the target locale for cross-lingual transfer.
[!INCLUDE Language support include]
Personal voice is a feature that lets you create a voice that sounds like you or your users. The following table summarizes the locales supported for personal voice.
[!INCLUDE Language support include]
The table in this section summarizes the 33 locales supported for pronunciation assessment. Each language is available in all Speech to text regions. The latest update extends support from English to 32 more languages and adds quality enhancements to existing features, including accuracy, fluency, and miscue assessment. Specify the language that you're learning or practicing. The default language is en-US. If you know your target learning language, set the locale accordingly. For example, if you're learning British English, specify the language as en-GB. If you're teaching a broader language, such as Spanish, and are uncertain about which locale to select, you can run various accent models (es-ES, es-MX) to determine the one that achieves the highest score for your specific scenario. If you're interested in languages not listed in the following table, fill out this intake form for further assistance.
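Choosing among accent models by score can be sketched as a simple comparison. The example assumes you have already run the assessment once per candidate locale and collected one overall score for each; the helper name `best_scoring_locale` and the score values in the usage note are made up for illustration.

```python
def best_scoring_locale(scores_by_locale):
    """Return the locale with the highest score.

    `scores_by_locale` maps a locale such as 'es-ES' to a numeric score
    obtained elsewhere. On a tie, the locale inserted first wins.
    """
    if not scores_by_locale:
        raise ValueError("Provide at least one locale score.")
    return max(scores_by_locale, key=scores_by_locale.get)
```

With illustrative scores, `best_scoring_locale({"es-ES": 82.5, "es-MX": 88.0})` returns `"es-MX"`.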
[!INCLUDE Language support include]
The table in this section summarizes the locales supported for Speech translation. Speech translation supports different languages for speech to speech and speech to text translation. The available target languages depend on whether the translation target is speech or text.
To set the input speech recognition language, specify the full locale with a dash (-) separator. See the speech to text language table. All languages are supported except jv-ID and wuu-CN. The default language is en-US if you don't specify a language.
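The rules for the source language can be expressed as a small client-side check. This is a sketch only: `resolve_source_locale` is a hypothetical helper, and it doesn't verify that the locale appears in the speech to text language table, because that table isn't reproduced here.

```python
DEFAULT_SOURCE_LOCALE = "en-US"
# Speech to text locales that this article excludes from speech translation.
UNSUPPORTED_FOR_TRANSLATION = frozenset({"jv-ID", "wuu-CN"})


def resolve_source_locale(locale=None):
    """Return the source locale to use, applying the documented default."""
    if locale is None:
        return DEFAULT_SOURCE_LOCALE
    if "-" not in locale:
        raise ValueError(f"Use a full locale with a dash separator, not {locale!r}.")
    if locale in UNSUPPORTED_FOR_TRANSLATION:
        raise ValueError(f"{locale!r} isn't supported as a translation source.")
    return locale
```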
To set the translation target language, with few exceptions you only specify the language code that precedes the locale dash (-) separator. For example, use es for Spanish (Spain) instead of es-ES. See the speech translation target language table below. The default language is en if you don't specify a language.
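Deriving a target language code from a locale can be sketched as follows. The `TARGET_EXCEPTIONS` map is a hypothetical placeholder for the "few exceptions" mentioned above; it's left empty because this section doesn't list them, so consult the target language table before relying on the result.

```python
DEFAULT_TARGET_LANGUAGE = "en"
# Hypothetical override map for locales whose target code isn't simply
# the part before the dash. Intentionally empty in this sketch.
TARGET_EXCEPTIONS = {}


def target_language_code(locale=None):
    """Return the translation target code for a locale such as 'es-ES'."""
    if locale is None:
        return DEFAULT_TARGET_LANGUAGE
    if locale in TARGET_EXCEPTIONS:
        return TARGET_EXCEPTIONS[locale]
    # Keep only the language code that precedes the first dash.
    return locale.split("-", 1)[0]
```

For example, `target_language_code("es-ES")` returns `"es"`.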
[!INCLUDE Language support include]
The following table illustrates the fixed mapping relationship between source and target locales, along with the full locales associated with each language.
[!INCLUDE Language support include]
The table in this section summarizes the locales supported for Language identification.
Note
Language Identification compares speech at the language level, such as English and German. Do not include multiple locales of the same language in your candidate list.
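A candidate list can be checked for this problem before it's used. The sketch below groups candidate locales by the language code before the first dash and reports any language that appears more than once. Treating that prefix as "the language" is an assumption of this example, and `duplicate_language_candidates` is a hypothetical helper.

```python
from collections import defaultdict


def duplicate_language_candidates(candidates):
    """Return {language: [locales]} for languages listed more than once."""
    by_language = defaultdict(list)
    for locale in candidates:
        language = locale.split("-", 1)[0]
        by_language[language].append(locale)
    return {lang: locs for lang, locs in by_language.items() if len(locs) > 1}
```

An empty result means the list has at most one locale per language. For example, `["en-US", "en-GB", "de-DE"]` yields `{"en": ["en-US", "en-GB"]}`.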
[!INCLUDE Language support include]
The table in this section summarizes the locales supported for Speaker recognition. Speaker recognition is mostly language agnostic. The universal model for text-independent speaker recognition combines various data sources from multiple languages. We tuned and evaluated the model on these languages and locales. For more information on speaker recognition, see the overview.
[!INCLUDE Language support include]
The table in this section summarizes the locales supported for custom keyword and keyword verification.
[!INCLUDE Language support include]
The table in this section summarizes the locales supported for the Intent Recognizer Pattern Matcher.
[!INCLUDE Language support include]