"Alexa style" continuous speech instruction #1

Open
mph070770 opened this issue May 13, 2016 · 19 comments

@mph070770

mph070770 commented May 13, 2016

Hi - great software!

I have your demo working on Ubuntu. What I'd like to do is detect the keyword in continuous speech, in a similar way to the Amazon Echo. Is that possible? For example, this:

"Alexa, turn on the lights"

instead of

"Alexa" [ding] "turn on the lights"

Ideally, I'd also want to know where in the audio the keyword was spoken so that it can be removed from audio before I send it to an online engine (such as api.ai or AVS).

Any suggestions would be great.

Thanks

@xuchen
Collaborator

xuchen commented May 13, 2016

The [ding] sound is actually a callback function you can define yourself. Here's an idea:

  1. keep an audio buffer and a global variable is_triggered = False
  2. when triggered, set is_triggered = True in your callback
  3. send any audio after this point in your buffer to AVS for speech recognition.

Does it make sense?
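
A minimal sketch of the three steps above in Node.js, assuming audio arrives as chunks and that you control the hotword callback. The names `createTriggerGate`, `onHotword`, `onAudio` and `sendToAsr` are hypothetical and not part of Snowboy or AVS.

```javascript
// Hypothetical helper: forwards audio to `sendToAsr` only after the hotword fired.
function createTriggerGate(sendToAsr) {
  let isTriggered = false;           // step 1: the flag starts out false

  return {
    // step 2: call this from your hotword callback instead of playing [ding]
    onHotword() {
      isTriggered = true;
    },
    // step 3: call this for every audio chunk coming from the microphone
    onAudio(chunk) {
      if (isTriggered) {
        sendToAsr(chunk);
      }
    },
    // call this when the command is over so the next hotword starts fresh
    reset() {
      isTriggered = false;
    },
  };
}

// Usage with a stand-in for the ASR upload:
const sent = [];
const gate = createTriggerGate((chunk) => sent.push(chunk));
gate.onAudio('before hotword');      // dropped
gate.onHotword();
gate.onAudio('turn on the lights');  // forwarded
console.log(sent);                   // [ 'turn on the lights' ]
```

Keeping the flag inside a closure rather than a true global is only a design choice for the sketch; a module-level variable works the same way.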

@chenguoguo
Collaborator

What Xuchen said was correct. You may have to experiment with the audio buffer a little to make sure you send all the audio after hotword detection to the ASR.

Guoguo


@chenguoguo
Collaborator

Looks like it has been resolved, so closing this.

@mph070770
Author

Thanks for the feedback. Are you suggesting a new audio buffer or utilising the ring buffer?

@chenguoguo
Collaborator

Re-opening this since there's ongoing discussion. Let me explain in more detail.

  1. To remove the [ding] sound, you only have to modify the callback function as Xuchen said. You do not need another buffer. If your ASR server does online decoding, you can start transmitting your audio data to the server right after the hotword triggers.
  2. You may need another buffer if:
    2.1. there is a delay in hotword detection. In this case, you need a buffer that keeps some data from before the hotword triggered, so that you have a "complete" sentence for your ASR.
    2.2. your ASR server can only do offline decoding. In this case you need a buffer for the whole sentence after the hotword triggers. You will have to detect the end of the sentence (I can explain more on this if necessary), and then send the whole sentence to your ASR server (this may not be your case).

Does this solve your problem?
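
A rough sketch of case 2.2 in Node.js: collect every chunk after the hotword and hand over the whole sentence once a run of silent chunks marks its end. Everything here is an assumption for illustration: `createUtteranceCollector` is a hypothetical helper, `isSilent` is a predicate you would supply yourself, and "N silent chunks in a row" is just one simple way to detect the end of a sentence.

```javascript
// Hypothetical collector for case 2.2: buffer the whole sentence after the
// hotword, then hand it over in one piece once the speaker stops.
// `isSilent(chunk)` is an assumed predicate you supply (e.g. an energy check).
function createUtteranceCollector(isSilent, maxSilentChunks, onSentence) {
  let chunks = [];
  let silentRun = 0;
  let collecting = false;

  return {
    start() {                      // call when the hotword triggers
      chunks = [];
      silentRun = 0;
      collecting = true;
    },
    push(chunk) {                  // call for every incoming audio chunk
      if (!collecting) return;
      chunks.push(chunk);
      silentRun = isSilent(chunk) ? silentRun + 1 : 0;
      if (silentRun >= maxSilentChunks) {
        collecting = false;
        // drop the trailing silence before sending
        onSentence(Buffer.concat(chunks.slice(0, chunks.length - silentRun)));
      }
    },
  };
}

const results = [];
const collector = createUtteranceCollector(
  (chunk) => chunk.every((byte) => byte === 0), // toy silence test
  2,                                            // end after 2 silent chunks
  (sentence) => results.push(sentence)
);
collector.start();
collector.push(Buffer.from([1, 2]));
collector.push(Buffer.from([0, 0]));   // a short pause, not the end yet
collector.push(Buffer.from([3, 4]));
collector.push(Buffer.from([0, 0]));
collector.push(Buffer.from([0, 0]));   // second silent chunk in a row: done
console.log(results[0]);               // <Buffer 01 02 00 00 03 04>
```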

@chenguoguo chenguoguo reopened this May 16, 2016
@chenguoguo
Collaborator

Closing this as it has been integrated into AlexaPi. See:

https://youtu.be/wLbsAQDmN-c

https://github.com/sammachin/AlexaPi/pull/85

@jwhite

jwhite commented Dec 31, 2016

I don't think this is closed. This issue is about continuous detection using a buffer. As far as I can tell, AlexaPi only uses the hotword-then-record method at this time.

@chenguoguo
Collaborator

OK, re-opening it. What I suggested above should still stand.

@dmc6297

dmc6297 commented Mar 7, 2017

I did this by customizing the snowboy_index.js. In the processDetectionResult function I set a "command" flag once the hotword is detected and emit all chunks until silence is detected. Another script builds a buffer from all the chunks and sends them to Microsoft LUIS for recognition.

So you can say "Alexa turn off the lights" all in one phrase without pausing.

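_write(chunk, encoding, callback) {
    const index = this.nativeInstance.RunDetection(chunk);

    // processDetectionResult is where the "command" flag (bufferingCommand) gets set
    this.processDetectionResult(index, chunk);

    // While a command is being captured, pass every chunk on to listeners
    if (this.bufferingCommand === true) {
        this.emit('chunk', chunk, encoding);
    }
    return callback();
}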
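
The "until silence is detected" part is not shown in the snippet above. One generic way to decide whether a chunk is silent, independent of whatever the detector itself reports, is to measure its RMS energy. This is only a sketch: it assumes 16-bit signed little-endian PCM, and both the helper names and the threshold of 500 are placeholders rather than anything taken from Snowboy.

```javascript
// Root-mean-square level of a chunk of 16-bit signed little-endian PCM.
function rmsLevel(chunk) {
  const sampleCount = Math.floor(chunk.length / 2);
  if (sampleCount === 0) return 0;
  let sumOfSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = chunk.readInt16LE(i * 2);
    sumOfSquares += sample * sample;
  }
  return Math.sqrt(sumOfSquares / sampleCount);
}

// The threshold is an assumption; tune it for your microphone and room.
function isSilent(chunk, threshold = 500) {
  return rmsLevel(chunk) < threshold;
}

const quiet = Buffer.alloc(8);             // four samples of value 0
const loud = Buffer.alloc(8);
for (let i = 0; i < 4; i++) loud.writeInt16LE(i % 2 ? -8000 : 8000, i * 2);
console.log(rmsLevel(quiet), isSilent(quiet)); // 0 true
console.log(rmsLevel(loud), isSilent(loud));   // 8000 false
```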

@evancohen
Contributor

@dmc6297 you might want to check out Sonus. There's an implementation on the audio-buffer branch which uses a ring buffer + stream transformation (basically what @chenguoguo described in this thread).

The only drawback of my ring buffer implementation is that it doesn't perform well on low-powered devices (like the Pi Zero, where detection lag increases by about 1/3 of a second).

@Stan92

Stan92 commented Mar 8, 2017

Hi,

I'm looking for something like this too, using Node.js, but less sophisticated :-)

@evancohen, I've seen your project, and it seems it could probably satisfy my needs (except for MS Cognitive Services).

There are several steps that I can manage using 2 "audio buffers" (one for Snowboy, one for Bing).
But I don't think I'm on the right path.

This is the workflow I'd like to implement.
I have several hotwords
- For local actions (Time, Light, etc...)
- 1 for activating online action (Go Online)
- 1 for stopping online action (Bye)

a) if it's "Time, Light,..." then I run my "local action"
b) if "Go Online" is detected then I say to the user I'm listening

c.1) if the word/sentence does not exist within the Snowboy model and I'm in "listening mode", I would like to send the word/sentence online (using MS Cognitive Services).

c.2) if the word/sentence exists within the Model and I'm in "listening mode", I don't want to send the data online.

d) if it's "Bye", no word/sentence will be sent online until the user says "Go Online"
e) When a silence of x seconds is detected, I need to go back "offline" (meaning no word/sentence will be sent online until the user says "Go Online")
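
The workflow above can be read as a small state machine with a single "listening mode" flag. The sketch below is one interpretation, not a tested Snowboy integration: it treats both "Bye" and the silence timeout as "stop sending online", and `createRouter`, `runLocal`, `sendOnline` and `localHotwords` are hypothetical names.

```javascript
// Hypothetical router for the workflow above. Hotword names are the ones from
// the list; `runLocal` and `sendOnline` are placeholders you would supply.
function createRouter({ runLocal, sendOnline, localHotwords }) {
  let online = false;

  return {
    isOnline: () => online,
    // call when the detector reports a hotword
    onHotword(name) {
      if (name === 'Go Online') {
        online = true;                 // (b) start listening mode
      } else if (name === 'Bye') {
        online = false;                // (d) leave listening mode
      } else if (localHotwords.includes(name)) {
        runLocal(name);                // (a) and (c.2): handled locally
      }
    },
    // call for speech that matched no hotword
    onSpeech(audio) {
      if (online) sendOnline(audio);   // (c.1) only while listening
    },
    // call after x seconds of silence
    onSilenceTimeout() {
      online = false;                  // (e) fall back to offline
    },
  };
}

const log = [];
const router = createRouter({
  runLocal: (name) => log.push('local:' + name),
  sendOnline: (audio) => log.push('online:' + audio),
  localHotwords: ['Time', 'Light'],
});
router.onSpeech('hello');        // ignored, not in listening mode
router.onHotword('Go Online');
router.onSpeech('weather?');     // sent online
router.onHotword('Light');       // handled locally even while listening
router.onHotword('Bye');
router.onSpeech('ignored');      // offline again
console.log(log);                // [ 'online:weather?', 'local:Light' ]
```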

@Stan92

Stan92 commented Mar 12, 2017

@dmc6297 I tried your customized snowboy_index.js but it doesn't work for me.
I push each chunk into an array of buffers and concatenate them into an array of bytes, but the resulting wav file is inaudible.

    detector.on('chunk', function (chunk, encoding) {
        if (chunk){
            buffers.push(chunk);
            if ((new Date()-timeStart)/1000 > timerInSecond ) {
                detector.bufferingCommand=false;
                getText(buffers); 
            }
        }
    });

getText concatenates the buffers into a single array of bytes and sends it to an API:
var bytes = Buffer.concat(buffers);

Could you please give me a hand?
Thanks

@dmc6297

dmc6297 commented Mar 13, 2017

@Stan92 The data is raw PCM audio, so you will need to prepend a WAV header to the buffer, or convert it to another format. This is how I made it work.

Start the command buffer

var audioCommandBuffer;

detector.on('commandStart', function (hotwordChunk) {
    // The real length is unknown until the command ends, so this is a placeholder
    var dataLength = 15728640;
    var bytesPerSample = 2; // 16-bit samples
    var blockAlign = detector.numChannels() * bytesPerSample;

    // A PCM WAV header is exactly 44 bytes
    var header = Buffer.alloc(44);
    header.write('RIFF', 0);

    // RIFF chunk length (file length minus the first 8 bytes)
    header.writeUInt32LE(36 + dataLength, 4);
    header.write('WAVE', 8);

    // format chunk identifier
    header.write('fmt ', 12);

    // format chunk length
    header.writeUInt32LE(16, 16);

    // sample format (1 = uncompressed PCM)
    header.writeUInt16LE(1, 20);

    // channel count
    header.writeUInt16LE(detector.numChannels(), 22);

    // sample rate
    header.writeUInt32LE(detector.sampleRate(), 24);

    // byte rate (sample rate * block align)
    header.writeUInt32LE(detector.sampleRate() * blockAlign, 28);

    // block align (channel count * bytes per sample)
    header.writeUInt16LE(blockAlign, 32);

    // bits per sample
    header.writeUInt16LE(16, 34);

    // data chunk identifier
    header.write('data', 36);

    // data chunk length
    header.writeUInt32LE(dataLength, 40);

    audioCommandBuffer = header;

    // Comment this out to omit the hotword chunk of audio
    audioCommandBuffer = Buffer.concat([audioCommandBuffer, hotwordChunk]);
});

Append to the buffer

detector.on('chunk', function (chunk, encoding) {
audioCommandBuffer = Buffer.concat([audioCommandBuffer,chunk]);
});

And to output the buffer to a file

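detector.on('commandStop', function () {
    audioCommandBuffer.writeUInt32LE(audioCommandBuffer.length - 8, 4);   // actual RIFF chunk length
    audioCommandBuffer.writeUInt32LE(audioCommandBuffer.length - 44, 40); // actual data chunk length
    fs.writeFile('/home/pi/Speech/audio.wav', audioCommandBuffer, function (err) { if (err) console.error(err); });
});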
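
If the audio is collected first and the header is built afterwards, the header fields can be computed from the final length in one place. This is a sketch of the standard 44-byte PCM WAV header; `buildWavHeader` is a hypothetical helper name, and the 16 kHz mono 16-bit values in the usage lines are assumptions about the audio format.

```javascript
// Builds a canonical 44-byte PCM WAV header for `dataLength` bytes of audio.
function buildWavHeader(dataLength, sampleRate, numChannels, bitsPerSample) {
  const blockAlign = numChannels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);
  header.write('RIFF', 0, 'ascii');
  header.writeUInt32LE(36 + dataLength, 4);           // file length minus 8
  header.write('WAVE', 8, 'ascii');
  header.write('fmt ', 12, 'ascii');
  header.writeUInt32LE(16, 16);                       // fmt chunk size
  header.writeUInt16LE(1, 20);                        // 1 = uncompressed PCM
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * blockAlign, 28);  // byte rate
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36, 'ascii');
  header.writeUInt32LE(dataLength, 40);
  return header;
}

const pcm = Buffer.alloc(32000);                   // 1 s of silence at 16 kHz mono
const wav = Buffer.concat([buildWavHeader(pcm.length, 16000, 1, 16), pcm]);
console.log(wav.length, wav.readUInt32LE(4), wav.readUInt32LE(28)); // 32044 32036 32000
```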

@Stan92

Stan92 commented Mar 13, 2017

@dmc6297 ... I don't know how to thank you... :-).. I'll make a try asap
Thanks once again

@zikphil

zikphil commented Oct 15, 2017

Hey you guys, I think this thread is exactly what I am trying to do, but in Python. On top of being able to say the full sentence without stopping, I'd also like to keep a 3-second buffer from before hotword detection (HWD) kicks in, so I can say things like "Goodnight Snowboy" or "What do you think, Snowboy" and run them through the Google Speech API. Any suggestions on how to achieve that?

@chenguoguo
Collaborator

As you said you can maintain a buffer before the hotword, and when the hotword is detected, you send the buffer to Google Speech API, and see if there's anything meaningful there.
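
A minimal sketch of such a pre-hotword buffer, written in Node.js to match the other code in this thread (the same idea carries over to Python). It keeps only the most recent audio, so that a snapshot taken when the hotword fires holds the last few seconds. `createPreRollBuffer` is a hypothetical name, and the byte count for 3 seconds assumes 16 kHz, mono, 16-bit audio.

```javascript
// Hypothetical rolling buffer that keeps only the most recent `maxBytes`.
function createPreRollBuffer(maxBytes) {
  let chunks = [];
  let totalBytes = 0;

  return {
    push(chunk) {
      chunks.push(chunk);
      totalBytes += chunk.length;
      // discard whole chunks from the front while the rest still covers maxBytes
      while (chunks.length > 1 && totalBytes - chunks[0].length >= maxBytes) {
        totalBytes -= chunks.shift().length;
      }
    },
    // returns the buffered audio, trimmed to at most the newest maxBytes
    snapshot() {
      const all = Buffer.concat(chunks);
      return all.subarray(Math.max(0, all.length - maxBytes));
    },
  };
}

// For 3 seconds of assumed 16 kHz mono 16-bit audio you would use:
//   createPreRollBuffer(3 * 16000 * 2)
// A tiny 4-byte window makes the behaviour easy to see:
const preRoll = createPreRollBuffer(4);
preRoll.push(Buffer.from([1, 2]));
preRoll.push(Buffer.from([3, 4]));
preRoll.push(Buffer.from([5, 6]));
preRoll.push(Buffer.from([7]));
console.log(preRoll.snapshot());     // <Buffer 04 05 06 07>
```

On hotword detection, the snapshot would be sent to the speech API ahead of (or instead of) the audio that follows.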

@sintetico82

Can someone write an example for Node.js?

@evancohen
Contributor

evancohen commented Oct 22, 2017 via email

@uchagani

uchagani commented Mar 3, 2018

@zikphil Were you able to get this working? I am trying to do the same thing. Any help is appreciated. thanks.
