
SDK v1.13 draft: add KeywordRecognizer support to UWP VA #500

Draft: trrwilson wants to merge 6 commits into main

Conversation

trrwilson (Member)

Purpose

Speech SDK v1.12 introduced a new KeywordRecognizer object that enables standalone on-device keyword matching without an active connection to Azure Speech Services. The audio associated with results from this object can then be routed into existing objects (such as the DialogServiceConnector) for use in existing scenarios.
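
For orientation, standalone use of the new object looks roughly like the sketch below (the model path and default-microphone input are illustrative placeholders, not part of this change):

```csharp
// Minimal sketch of standalone on-device keyword spotting (Speech SDK v1.12+).
// "keyword.table" and the default-microphone input are illustrative placeholders.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
var keywordModel = KeywordRecognitionModel.FromFile("keyword.table");

using var keywordRecognizer = new KeywordRecognizer(audioConfig);
KeywordRecognitionResult result = await keywordRecognizer.RecognizeOnceAsync(keywordModel);

if (result.Reason == ResultReason.RecognizedKeyword)
{
    // Audio from just before the detected keyword onward is available here and can be
    // replayed into other objects, such as a DialogServiceConnector.
    AudioDataStream keywordAudio = AudioDataStream.FromResult(result);
}
```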

This functionality provides a significant benefit to voice assistant applications that may be launched in a "cold start" situation:

  1. The user speaks an activating utterance and expects something to happen ASAP
  2. The assistant application is activated in response to the "detected but not yet confirmed" keyword utterance
  3. Audio/app lifecycle spin-up occurs (latency hit of at least a few hundred milliseconds)
  4. Before connecting to the speech service, an access token must be retrieved from another off-device source (latency hit, potentially 1s or more)
  5. Once an access token is available, DialogServiceConnector won't begin processing keyword audio until a connection is established (latency hit, several hundred milliseconds)
  6. The Speech SDK then processes the queued audio and catches up to the detected keyword (another few hundred milliseconds before the on-device result)
  7. Only at this point (with an on-device confirmation result available) is it appropriate for the waiting user to receive feedback

KeywordRecognizer allows us to parallelize and skip (4) and (5) above, typically saving more than 500ms in cold start and often saving multiple seconds (depending on token retrieval and connection establishment speeds). An on-device result can be obtained in parallel with the networking work, and the DialogServiceConnector, as a consumer of the KeywordRecognitionResult's audio, can catch up after user-facing action has already begun.
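
Continuing the sketch above, the parallel pattern is approximately the following (GetDialogTokenAsync is a hypothetical app-side helper standing in for steps 4 and 5; only the Speech SDK calls are real API):

```csharp
// Sketch of the cold-start parallelization described above. GetDialogTokenAsync is a
// hypothetical app-side helper representing off-device token retrieval.
var keywordTask = keywordRecognizer.RecognizeOnceAsync(keywordModel);
var tokenTask = GetDialogTokenAsync();

// User-facing feedback (earcon, UI) can fire as soon as the on-device result lands,
// while token retrieval and connection establishment continue in the background.
KeywordRecognitionResult keywordResult = await keywordTask;
string token = await tokenTask;
```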

This addresses #486.

Caveats: chaining a KeywordRecognizer into a DialogServiceConnector isn't trivial and requires both audio adapters and some state management. Investigation with v1.12 also revealed that multi-turn use of an audio stream derived from a KeywordRecognitionResult did not automatically consume recognized audio, which made effective use additionally challenging. This automatic consumption behavior is fixed in v1.13 and this change takes a dependency on that fix.
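
For orientation, the kind of adapter involved is shaped roughly like the sketch below (simplified; the real adapters in this change also manage multi-turn state and the transition back to live microphone audio, and KeywordResultAudioAdapter is an illustrative name):

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Simplified sketch: replay a KeywordRecognitionResult's audio through a pull stream
// that a DialogServiceConnector can consume.
public class KeywordResultAudioAdapter : PullAudioInputStreamCallback
{
    private readonly AudioDataStream source;

    public KeywordResultAudioAdapter(KeywordRecognitionResult result)
        => this.source = AudioDataStream.FromResult(result);

    public override int Read(byte[] dataBuffer, uint size)
        => (int)this.source.ReadData(dataBuffer); // 0 indicates end of stream

    public override void Close() => this.source.Dispose();
}

// Usage: route the adapter into the connector's AudioConfig, then start the turn.
// var connectorAudio = AudioConfig.FromStreamInput(new KeywordResultAudioAdapter(keywordResult));
// var connector = new DialogServiceConnector(botFrameworkConfig, connectorAudio);
```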

Further, since audio adapters were already necessary, this change also applies said adapters to improve the keyword rejection behavior (and remove the so-called "failsafe timer" approach):

  • Prior to this change, all audio is pushed into the Speech SDK objects (DialogServiceConnector) as fast as possible, meaning we have no accounting of how much data is/has been consumed at any point
  • This means we have no way of knowing if we've already evaluated enough audio to determine that there's no keyword in the input -- we instead rely on a wall clock timer ("2.0 real seconds after the 'start audio' call, fire an event that deduces no keyword recognition is going to happen")
  • The wall clock, failsafe approach isn't ideal: many variables impact the actual amount of audio we get a chance to process, so we need to be very conservative (usually evaluating a lot of extra audio) to ensure we don't give up too quickly in slower configurations/situations. Being conservative and consuming extra audio in turn means greater periods of "deafness" or unresponsiveness when evaluating false activations, directly harming end-to-end accuracy
  • With this change, audio is now pulled into the Speech SDK objects and we can directly monitor how much audio has been requested (and therefore processed)
  • This means we can deterministically conclude when a certain duration of audio has been evaluated and reject on that basis, rather than relying on an error-prone wall-clock assessment
  • This is currently hard-coded to 2.0s of audio, calculated after the existing 2s preroll trim in AgentAudioProducer -- this means we'll evaluate an audio range from approximately 1200ms before a keyword detection threshold to approximately 800ms after it and conclude "no keyword" if no confirmation result is obtained from that evaluation. A sketch of this pull-side accounting follows this list.
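
In code terms, the accounting amounts to something like the sketch below (the member names MeteredPullAudioSink, BookmarkPosition, BytesPerSecond, and BookmarkReached are illustrative stand-ins, not the exact members this change introduces):

```csharp
using System;
using Microsoft.CognitiveServices.Speech.Audio;

// Sketch of pull-side accounting for deterministic keyword rejection: count the bytes the
// Speech SDK pulls and raise an event once a configured duration of audio has been evaluated.
public class MeteredPullAudioSink : PullAudioInputStreamCallback
{
    private const int BytesPerSecond = 16000 * 2; // 16 kHz, 16-bit mono PCM
    private readonly Func<byte[], int> readSource; // underlying audio source (e.g. the keyword audio adapter)
    private long bytesRead;

    public MeteredPullAudioSink(Func<byte[], int> readSource) => this.readSource = readSource;

    // Once this much audio has been pulled (2.0s here), fire BookmarkReached; if no keyword
    // confirmation has arrived by then, the caller can reject without a wall-clock timer.
    public TimeSpan BookmarkPosition { get; set; } = TimeSpan.FromSeconds(2.0);

    public event EventHandler BookmarkReached;

    public override int Read(byte[] dataBuffer, uint size)
    {
        int read = this.readSource(dataBuffer);
        long bookmarkBytes = (long)(this.BookmarkPosition.TotalSeconds * BytesPerSecond);

        if (this.bytesRead < bookmarkBytes && this.bytesRead + read >= bookmarkBytes)
        {
            this.BookmarkReached?.Invoke(this, EventArgs.Empty);
        }

        this.bytesRead += read;
        return read;
    }
}
```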

Does this introduce a breaking change?

[ ] Yes
[ ] No
[X] Maybe

Keyword detection metrics are likely impacted by the introduction of the new objects. Efforts were made to preserve the existing logic, but there's likely some regression that can/should be addressed in a subsequent submission.

Pull Request Type

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test / What to Check

Note: as of draft time, validation is still in progress

  • Voice activations work: single & multi turn, cold & warm start
  • Push-to-talk works, both independently as well as in conjunction with voice activation

@trrwilson trrwilson self-assigned this Jul 11, 2020
@dargilco dargilco requested review from olmidy and chschrae July 13, 2020 15:05
```csharp
this.audioIntoConnectorSink.BookmarkPosition = KeywordRejectionTimeout;
this.EnsureConnectorReady();
this.logger.Log($"Starting connector");
_ = this.connector.StartKeywordRecognitionAsync(this.ConfirmationModel as KeywordRecognitionModel);
```

tomh05

@trrwilson Does this line mean 2nd stage recognition will be run a second time? If so, is there a way to tell the connector that 2nd stage has already been evaluated and to skip to 3rd stage?

(I presume if 3rd stage isn't enabled, then a call to ListenOnceAsync() would work here, but that StartKeywordRecognitionAsync is being used instead to ensure 3rd stage gets called)

trrwilson (Member, Author)

Hey @tomh05! Thanks for piling on here and apologies I've been away a bit--between a week of vacation and then getting pulled into a few things, I've been more absent on this front than I'd like.

You are 100% correct that the DialogServiceConnector will unnecessarily repeat the on-device portion of the keyword spotting. Very rarely (but still occasionally) they'll even disagree, with the DialogServiceConnector not deigning to fire after the KeywordRecognizer does. That last part is likely attributable to subtle differences in the byte alignment based on how the KeywordRecognitionResult selects the audio start position, but I'm rambling.

Two parts of that:

  • The SDK should support a means of doing this. We have an item on our backlog to design and implement a way to chain a KeywordRecognizer and DialogServiceConnector together to get KWV without repeating on-device KWS. I'd be curious to know from you as one of our most informed consumers, though: how do you think it should work, i.e. what code would make sense to write? There are a lot of ways to specify this--something in the config, a parameter to the Start() call, a different way of creating/annotating the input, and more--and part of the debate is what the most intuitive and clear way to expose this option/capability would be.
  • In the interim, one thing that can be experimented with is using ListenOnce instead of KWS on the DialogServiceConnector. That will not be eligible for KWV right now, but at least as an exploration/prototyping step, it'd allow a full observation of what latency benefits skipping that second step would have. In my own ad hoc testing for this change, I saw that the KWS delayed the start of stream to the service by ~200-400ms depending on configuration; that doesn't translate to a full 200-400ms of extra time until the KWV result arrives (it runs faster than real-time by a good margin), but it's still going to be a considerable increase. A rough sketch of that interim swap follows below.
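
For anyone trying that interim experiment, the change relative to the snippet above is roughly the following (reusing the same members shown there; note this path gets no 3rd-stage verification):

```csharp
// Interim experiment sketch: trust the KeywordRecognizer's on-device result for activation
// and go straight to a listen turn, skipping the connector-side keyword pass entirely.
this.EnsureConnectorReady();
this.logger.Log("Starting connector (ListenOnce, skipping connector-side KWS)");
_ = this.connector.ListenOnceAsync();
```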

wilhen01 (Aug 13, 2020)

Hi @trrwilson! Will throw my 2 cents in here 🙂

Re: SDK provision for avoiding the repetition of 2nd stage, this feels like a good fit for a setting on the Connector, because it should be consistent across an application given its architecture. Adding a parameter on the Start call would imply that it could differ between conversations initiated within the same app, but surely any given app would either use the KeywordRecognizer or it wouldn't? If there is a use-case for apps to switch between online and offline recognition, then a parameter to the Start call seems like a reasonable fallback position 👍

Since we can't do 3rd stage verification just now, we can experiment with using ListenOnce in the short term and see what the latency improvements are like. Is there a timeline for the SDK changes to avoid repeating 2nd stage? Once 3rd stage is available to us, it would be a shame to have to pick between "have 3rd stage, but gain latency from repeating 2nd" and "lower latency, but no 3rd stage".

@microsoft-github-updates microsoft-github-updates bot changed the base branch from master to main January 4, 2021 17:28