title: About Speech to Text - Speech Services
titleSuffix: Azure Cognitive Services
description: The Speech-to-Text API transcribes audio streams into text that your app can display or act on as an input. The service is available via the SDK and a RESTful endpoint.
services: cognitive-services
author: erhopf
manager: cgronlun
ms.service: cognitive-services
ms.component: speech-service
ms.topic: conceptual
ms.date: 05/07/2018
ms.author: erhopf
ms.custom: seodec18

About the Speech to Text API

The Speech to Text API transcribes audio streams into text that your application can display to the user or act upon as command input. You can call the API either through an SDK client library (for supported platforms and languages) or through a REST API.

The Speech to Text API offers the following features:

  • Advanced speech recognition technology from Microsoft—the same used by Cortana, Office, and other Microsoft products.

  • Real-time continuous recognition. Speech to Text allows users to transcribe audio into text in real time. It also supports receiving intermediate results of the words that have been recognized so far. The service automatically recognizes the end of speech. Users can also choose additional formatting options, including capitalization and punctuation, profanity masking, and inverse text normalization.

  • Results returned in both Lexical and Display forms (for Lexical results, see DetailedSpeechRecognitionResult in the examples or API).

  • Support for many spoken languages and dialects. For the full list of supported languages in each recognition mode, see Supported languages.

  • Customized language and acoustic models, so you can tailor your application to your users' specialized domain vocabulary, speaking environment and way of speaking.

  • Natural-language understanding. Through integration with Language Understanding (LUIS), you can derive intents and entities from speech. Users don't have to know your app's vocabulary, but can describe what they want in their own words.

  • Confidence scores. The service returns a confidence score if you request detailed output on the speech configuration object (the SpeechConfig.OutputFormat property). You can then either call the Best() method on the result or read the score directly from the JSON returned by the service (for example, result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult)).
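
As a rough illustration of the Lexical and Display forms and the confidence score described above, the following Python sketch parses a detailed-format JSON result and picks the highest-confidence hypothesis. The sample payload is invented for illustration, and the field names (`NBest`, `Confidence`, `Lexical`, `Display`) are assumed to follow the service's detailed output format; check them against the JSON your own requests return. The helper name `best_hypothesis` is hypothetical and not part of any SDK.

```python
import json

# Illustrative payload only: the values are invented, and the field names
# are assumed to match the service's detailed output format.
SAMPLE_DETAILED_RESULT = """
{
  "RecognitionStatus": "Success",
  "NBest": [
    {"Confidence": 0.62, "Lexical": "what's the whether like", "Display": "What's the whether like?"},
    {"Confidence": 0.91, "Lexical": "what's the weather like", "Display": "What's the weather like?"}
  ]
}
"""


def best_hypothesis(detailed_json):
    """Return the highest-confidence entry of NBest, or None if there is none."""
    result = json.loads(detailed_json)
    candidates = result.get("NBest", [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.get("Confidence", 0.0))


best = best_hypothesis(SAMPLE_DETAILED_RESULT)
if best is not None:
    # Print the score alongside the Lexical and Display forms.
    print(f"{best['Confidence']:.2f} | {best['Lexical']} | {best['Display']}")
```

An application might compare the returned score against a threshold of its own choosing and ask the user to repeat the utterance when the score falls below it.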

API capabilities

Some capabilities of the Speech to Text API, especially those around customization, are available only via REST. The following table summarizes the capabilities of each method of accessing the API. For a full list of capabilities and API details, see the Swagger reference.

| Use case | REST | SDKs |
|-----|-----|-----|
| Transcribe a short utterance, such as a command (length < 15 s); no interim results | Yes | Yes |
| Transcribe a longer utterance (> 15 s) | No | Yes |
| Transcribe streaming audio with optional interim results | No | Yes |
| Understand speaker intents via LUIS | No* | Yes |
| Create accuracy tests | Yes | No |
| Upload datasets for model adaptation | Yes | No |
| Create & manage speech models | Yes | No |
| Create & manage model deployments | Yes | No |
| Manage subscriptions | Yes | No |

[!NOTE] The REST API implements throttling that limits API requests to 25 per 5 seconds. Message headers report the limits.

* LUIS intents and entities can be derived using a separate LUIS subscription. With this subscription, the SDK can call LUIS for you and provide entity and intent results as well as speech transcriptions. With the REST API, you can call LUIS yourself to derive intents and entities with your LUIS subscription.
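
To stay under the REST throttling limit noted above, a client can pace its own requests. The following is a minimal sketch of a client-side sliding-window limiter, assuming the limit is 25 requests in any 5-second window; the class name `SlidingWindowLimiter` and its parameters are hypothetical and not part of the Speech SDK or REST API. The limits reported in the service's message headers remain the authoritative source.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side limiter: at most max_calls within any window_s-second window."""

    def __init__(self, max_calls=25, window_s=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window_s = window_s
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._sent = deque()     # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds to wait before the next call is allowed (0.0 if allowed now)."""
        now = self._clock()
        # Drop timestamps that have fallen out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_calls:
            return 0.0
        return self.window_s - (now - self._sent[0])

    def acquire(self):
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        while delay > 0:
            self._sleep(delay)
            delay = self.wait_time()
        self._sent.append(self._clock())
```

Calling `acquire()` on a shared limiter instance before each REST request delays the request only when the window is already full.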

Next steps