
SDK v1.13 draft: add KeywordRecognizer support to UWP VA #500

Draft: trrwilson wants to merge 6 commits into main

Conversation

trrwilson (Member)

Purpose

Speech SDK v1.12 introduced a new KeywordRecognizer object that enables standalone on-device keyword matching without an active connection to Azure Speech Services. The audio associated with results from this object can then be routed into existing objects (such as the DialogServiceConnector) for use in existing scenarios.
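
For orientation, standalone use of the new object looks roughly like the sketch below (the model path and default-microphone input are illustrative placeholders, not part of this change):

```csharp
// Minimal sketch of standalone on-device keyword spotting (Speech SDK v1.12+).
// "keyword.table" and the default-microphone input are illustrative placeholders.
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
var keywordModel = KeywordRecognitionModel.FromFile("keyword.table");

using var keywordRecognizer = new KeywordRecognizer(audioConfig);
KeywordRecognitionResult result = await keywordRecognizer.RecognizeOnceAsync(keywordModel);

if (result.Reason == ResultReason.RecognizedKeyword)
{
    // Audio from just before the detected keyword onward is available here and can be
    // replayed into other objects, such as a DialogServiceConnector.
    AudioDataStream keywordAudio = AudioDataStream.FromResult(result);
}
```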

This functionality provides a significant benefit to voice assistant applications that may be launched in a "cold start" situation:

  1. The user speaks an activating utterance and expects something to happen ASAP
  2. The assistant application is activated in response to the "detected but not yet confirmed" keyword utterance
  3. Audio/app lifecycle spin-up occurs (latency hit of at least a few hundred milliseconds)
  4. Before connecting to the speech service, an access token must be retrieved from another off-device source (latency hit, potentially 1s or more)
  5. Once an access token is available, DialogServiceConnector won't begin processing keyword audio until a connection is established (latency hit, several hundred milliseconds)
  6. The Speech SDK then processes the queued audio and catches up to the detected keyword (another few hundred milliseconds before the on-device result)
  7. Only at this point (with an on-device confirmation result available) is it appropriate for the waiting user to receive feedback

KeywordRecognizer allows us to parallelize and skip (4) and (5) above, typically saving more than 500ms in cold start and often saving multiple seconds (depending on token retrieval and connection establishment speeds). An on-device result can be obtained in parallel with the networking work, and the DialogServiceConnector, as a consumer of the KeywordRecognitionResult's audio, can catch up after user-facing action has already begun.
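
Continuing the sketch above, the parallel pattern is approximately the following (GetDialogTokenAsync is a hypothetical app-side helper standing in for steps 4 and 5; only the Speech SDK calls are real API):

```csharp
// Sketch of the cold-start parallelization described above. GetDialogTokenAsync is a
// hypothetical app-side helper representing off-device token retrieval.
var keywordTask = keywordRecognizer.RecognizeOnceAsync(keywordModel);
var tokenTask = GetDialogTokenAsync();

// User-facing feedback (earcon, UI) can fire as soon as the on-device result lands,
// while token retrieval and connection establishment continue in the background.
KeywordRecognitionResult keywordResult = await keywordTask;
string token = await tokenTask;
```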

This addresses #486.

Caveats: chaining a KeywordRecognizer into a DialogServiceConnector isn't trivial and requires both audio adapters and some state management. Investigation with v1.12 also revealed that multi-turn use of an audio stream derived from a KeywordRecognitionResult did not automatically consume recognized audio, which made effective use additionally challenging. This automatic consumption behavior is fixed in v1.13 and this change takes a dependency on that fix.
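
For orientation, the kind of adapter involved is shaped roughly like the sketch below (simplified; the real adapters in this change also manage multi-turn state and the transition back to live microphone audio, and KeywordResultAudioAdapter is an illustrative name):

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Simplified sketch: replay a KeywordRecognitionResult's audio through a pull stream
// that a DialogServiceConnector can consume.
public class KeywordResultAudioAdapter : PullAudioInputStreamCallback
{
    private readonly AudioDataStream source;

    public KeywordResultAudioAdapter(KeywordRecognitionResult result)
        => this.source = AudioDataStream.FromResult(result);

    public override int Read(byte[] dataBuffer, uint size)
        => (int)this.source.ReadData(dataBuffer); // 0 indicates end of stream

    public override void Close() => this.source.Dispose();
}

// Usage: route the adapter into the connector's AudioConfig, then start the turn.
// var connectorAudio = AudioConfig.FromStreamInput(new KeywordResultAudioAdapter(keywordResult));
// var connector = new DialogServiceConnector(botFrameworkConfig, connectorAudio);
```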

Further, since audio adapters were already necessary, this change also applies said adapters to improve the keyword rejection behavior (and remove the so-called "failsafe timer" approach):

  • Prior to this change, all audio is pushed into the Speech SDK objects (DialogServiceConnector) as fast as possible, meaning we have no accounting of how much data is/has been consumed at any point
  • This means we have no way of knowing if we've already evaluated enough audio to determine that there's no keyword in the input -- we instead rely on a wall clock timer ("2.0 real seconds after the 'start audio' call, fire an event that deduces no keyword recognition is going to happen")
  • The wall clock, failsafe approach isn't ideal: many variables impact the actual amount of audio we get a chance to process, so we need to be very conservative (usually evaluating a lot of extra audio) to ensure we don't give up too quickly in slower configurations/situations. Being conservative and consuming extra audio in turn means greater periods of "deafness" or unresponsiveness when evaluating false activations, directly harming end-to-end accuracy
  • With this change, audio is now pulled into the Speech SDK objects and we can directly monitor how much audio has been requested (and therefore processed)
  • This means we can deterministically conclude when a certain duration of audio has been evaluated and reject on that basis, rather than relying on an error-prone wall-clock assessment
  • This is currently hard-coded to 2.0s of audio, calculated after the existing 2s preroll trim in AgentAudioProducer -- this means we'll evaluate an audio range from approximately 1200ms before a keyword detection threshold to approximately 800ms after it and conclude "no keyword" if no confirmation result is obtained from that evaluation. A sketch of this pull-side accounting follows this list.
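
In code terms, the accounting amounts to something like the sketch below (the member names MeteredPullAudioSink, BookmarkPosition, BytesPerSecond, and BookmarkReached are illustrative stand-ins, not the exact members this change introduces):

```csharp
using System;
using Microsoft.CognitiveServices.Speech.Audio;

// Sketch of pull-side accounting for deterministic keyword rejection: count the bytes the
// Speech SDK pulls and raise an event once a configured duration of audio has been evaluated.
public class MeteredPullAudioSink : PullAudioInputStreamCallback
{
    private const int BytesPerSecond = 16000 * 2; // 16 kHz, 16-bit mono PCM
    private readonly Func<byte[], int> readSource; // underlying audio source (e.g. the keyword audio adapter)
    private long bytesRead;

    public MeteredPullAudioSink(Func<byte[], int> readSource) => this.readSource = readSource;

    // Once this much audio has been pulled (2.0s here), fire BookmarkReached; if no keyword
    // confirmation has arrived by then, the caller can reject without a wall-clock timer.
    public TimeSpan BookmarkPosition { get; set; } = TimeSpan.FromSeconds(2.0);

    public event EventHandler BookmarkReached;

    public override int Read(byte[] dataBuffer, uint size)
    {
        int read = this.readSource(dataBuffer);
        long bookmarkBytes = (long)(this.BookmarkPosition.TotalSeconds * BytesPerSecond);

        if (this.bytesRead < bookmarkBytes && this.bytesRead + read >= bookmarkBytes)
        {
            this.BookmarkReached?.Invoke(this, EventArgs.Empty);
        }

        this.bytesRead += read;
        return read;
    }
}
```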

Does this introduce a breaking change?

[ ] Yes
[ ] No
[X] Maybe

Keyword detection metrics are likely impacted by the introduction of the new objects. Efforts were made to preserve the existing logic, but there's likely some regression that can/should be addressed in a subsequent submission.

Pull Request Type

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test / What to Check

Note: as of draft time, validation is still in progress

  • Voice activations work: single & multi turn, cold & warm start
  • Push-to-talk works, both independently as well as in conjunction with voice activation

@trrwilson trrwilson self-assigned this Jul 11, 2020
@dargilco dargilco requested review from olmidy and chschrae July 13, 2020 15:05
```csharp
this.audioIntoConnectorSink.BookmarkPosition = KeywordRejectionTimeout;
this.EnsureConnectorReady();
this.logger.Log($"Starting connector");
_ = this.connector.StartKeywordRecognitionAsync(this.ConfirmationModel as KeywordRecognitionModel);
```

tomh05

@trrwilson Does this line mean 2nd stage recognition will be run a second time? If so, is there a way to tell the connector that 2nd stage has already been evaluated and to skip to 3rd stage?

(I presume if 3rd stage isn't enabled, then a call to ListenOnceAsync() would work here, but that StartKeywordRecognitionAsync is being used instead to ensure 3rd stage gets called)

trrwilson (Member, Author)

Hey @tomh05! Thanks for piling on here and apologies I've been away a bit--between a week of vacation and then getting pulled into a few things, I've been more absent on this front than I'd like.

You are 100% correct that the DialogServiceConnector will unnecessarily repeat the on-device portion of the keyword spotting. Very rarely (but still occasionally) they'll even disagree, with the DialogServiceConnector not deigning to fire after the KeywordRecognizer does. That last part is likely attributable to subtle differences in the byte alignment based on how the KeywordRecognitionResult selects the audio start position, but I'm rambling.

Two parts of that:

  • The SDK should support a means of doing this. We have an item on our backlog to design and implement a way to chain a KeywordRecognizer and DialogServiceConnector together to get KWV without repeating on-device KWS. I'd be curious to know from you as one of our most informed consumers, though: how do you think it should work, i.e. what code would make sense to write? There are a lot of ways to specify this--something in the config, a parameter to the Start() call, a different way of creating/annotating the input, and more--and part of the debate is what the most intuitive and clear way to expose this option/capability would be.
  • In the interim, one thing that can be experimented with is using ListenOnce instead of KWS on the DialogServiceConnector. That will not be eligible for KWV right now, but at least as an exploration/prototyping step, it'd allow a full observation of what latency benefits skipping that second step would have. In my own ad hoc testing for this change, I saw that the KWS delayed the start of stream to the service by ~200-400ms depending on configuration; that doesn't translate to a full 200-400ms of extra time until the KWV result arrives (it runs faster than real-time by a good margin), but it's still going to be a considerable increase. A rough sketch of that interim swap follows below.
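
For anyone trying that interim experiment, the change relative to the snippet above is roughly the following (reusing the same members shown there; note this path gets no 3rd-stage verification):

```csharp
// Interim experiment sketch: trust the KeywordRecognizer's on-device result for activation
// and go straight to a listen turn, skipping the connector-side keyword pass entirely.
this.EnsureConnectorReady();
this.logger.Log("Starting connector (ListenOnce, skipping connector-side KWS)");
_ = this.connector.ListenOnceAsync();
```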

wilhen01 (Aug 13, 2020)

Hi @trrwilson! Will throw my 2 cents in here 🙂

Re: SDK provision for avoiding the repetition of 2nd stage, this feels like a good fit for a setting on the Connector, because it should be consistent across an application given its architecture. Adding a parameter on the Start call would imply that it could differ between conversations initiated within the same app, but surely any given app would either use the KeywordRecognizer or it wouldn't? If there is a use-case for apps to switch between online and offline recognition, then a parameter to the Start call seems like a reasonable fallback position 👍

Since we can't do 3rd stage verification just now, we can experiment with using ListenOnce in the short term and see what the latency improvements are like. Is there a timeline for the SDK changes to avoid repeating 2nd stage? Once 3rd stage is available to us, it would be a shame to have to pick between "have 3rd stage, but gain latency from repeating 2nd" and "lower latency, but no 3rd stage".

@microsoft-github-updates microsoft-github-updates bot changed the base branch from master to main January 4, 2021 17:28