
Speech need to be primed using touch/tap in Safari #995

Open
rodmcleay opened this Issue May 30, 2018 · 15 comments

@rodmcleay

rodmcleay commented May 30, 2018

I have a Web Chat control for a bot that is up and running and working well in Chrome. The link How to enable speech in Web Chat shows how to set this up, and we have done it exactly like this.

It mentions multiple browsers, but does not mention Safari in any way.

We need this working on an iPhone, but it just doesn't seem to work, and there is not a lot of feedback from the browser; the icon changes, and the microphone appears to turn on after access is approved.

Nothing spoken is recorded or recognized, and the text area of the bot stays empty; there is no 'Listening...' or any other indication it's working, other than the red microphone icon in the browser header. Clicking the icon mutes and unmutes as you'd expect; it just doesn't seem to be connected to the Web Chat control in the browser.

All of my investigation appears to go around in circles.

  1. Has anyone achieved this with Web Chat, or with any other Direct Line component?
  2. Or can anyone confirm it definitely doesn't work with Safari, so I can stop banging my head against it?
  3. Are there any alternatives to Web Chat that do work on iPhone/Safari?

Thanks for taking the time to read this; any assistance would be much appreciated. I'm at the end of this investigation and pulling my hair out.

@compulim


Collaborator

compulim commented Jun 5, 2018

@rodmcleay, we have just tested it on an iPhone with iOS 11.4, running Safari and using Cognitive Services Speech. It works.

Can you check for a few things?

  1. Your iPhone is running iOS 11+
  2. You are using Safari, not the Chrome or Edge app
  3. Settings app > Safari > Camera & Microphone Access is enabled
  4. Your web site is on HTTPS; Safari blocks the microphone on insecure HTTP
  5. You are running on an iPhone
    • We tested that it doesn't run on an iPod with iOS 11.4; we haven't tested an iPad yet
  6. Your page is using Cognitive Services, not browser speech (a.k.a. the WebSpeech API)

I agree we need to make the speech detection more robust and informative. But we also need to make sure detection doesn't pop up the "Access to Microphone" dialog too early. Unfortunately, in some cases, you can't have both.
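As an editorial aside, most of the checklist above can be expressed as a small feature check that runs before the microphone dialog is ever triggered. This is a hypothetical sketch (the function name and the env shape are assumptions, not Web Chat API); in a real page you would feed it window.isSecureContext and navigator.mediaDevices:

```javascript
// Hypothetical helper (not part of Web Chat): report the first failed
// prerequisite for microphone access so the page can tell the user,
// instead of failing silently as described in this issue.
function checkSpeechPrereqs(env) {
    // env: { isSecureContext, hasMediaDevices, hasGetUserMedia }
    if (!env.isSecureContext) {
        return "Page must be served over HTTPS"; // Safari blocks the mic on HTTP
    }
    if (!env.hasMediaDevices || !env.hasGetUserMedia) {
        return "Browser has no microphone API"; // e.g. older iOS versions
    }
    return null; // prerequisites met; still subject to the user's permission
}
```

In the browser you would call it with the real values, e.g. checkSpeechPrereqs({ isSecureContext: window.isSecureContext, hasMediaDevices: !!navigator.mediaDevices, hasGetUserMedia: !!(navigator.mediaDevices && navigator.mediaDevices.getUserMedia) }). Because the check never touches getUserMedia itself, it does not pop up the permission dialog.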

@shubhamchawla


shubhamchawla commented Jun 5, 2018

How can I get it working on Chrome for iOS? Any help would be greatly appreciated.
Thanks in advance.

@compulim


Collaborator

compulim commented Jun 5, 2018

@shubhamchawla It doesn't work in Chrome for iOS because Chrome does not support WebRTC on iOS. The only browser on iOS that supports WebRTC right now is Safari.

@rodmcleay


rodmcleay commented Jun 6, 2018

Hi @compulim,
I don't mind the popup asking for access; that is understandable and expected.
I'm using Cognitive Services, as per the code below, and it is on HTTPS, working fine in Chrome on Windows and on Android phones. The iPhone is on iOS 11.3.1.

const speechOptions = {
    speechRecognizer: new CognitiveServices.SpeechRecognizer({
        fetchCallback: (authFetchEventId) => getToken(),
        fetchOnExpiryCallback: (authFetchEventId) => getToken()
    }),
    speechSynthesizer: new CognitiveServices.SpeechSynthesizer({
        gender: CognitiveServices.SynthesisGender.Female,
        subscriptionKey: '@System.Configuration.ConfigurationManager.AppSettings["CognitiveKey"]',
        voiceName: 'Microsoft Server Speech Text to Speech Voice (en-US, JessaRUS)'
    })
};

Is that the config you would expect?

getToken is on the client at the moment.

function getToken() {
    // Normally this token fetch is done from your secured backend to avoid exposing the API key, and this call
    // would be to your backend, or it would retrieve a token that was served as part of the original page.
    return fetch(
        'https://api.cognitive.microsoft.com/sts/v1.0/issueToken',
        {
            headers: {
                'Ocp-Apim-Subscription-Key': '@System.Configuration.ConfigurationManager.AppSettings["CognitiveKey"]'
            },
            method: 'POST'
        }
    ).then(res => res.text());
}
@rosskyl


Contributor

rosskyl commented Jun 8, 2018

I got it working with Safari and Firefox with the following JavaScript. I just include this in a JavaScript file while still using the linked CognitiveServices.js file from the CDN. I use the Bing speech recognizer and the browser speech synthesizer.

This works because the current version uses window.navigator.getUserMedia, which is being deprecated, so I change that to use window.navigator.mediaDevices.getUserMedia. Then, Safari has problems playing audio from the speech synthesizer programmatically, so I register an event on the microphone click to play a sound from the speech synthesizer and then remove that event. Finally, Safari also has problems recording audio programmatically, so I create the audio context before actually needing it and connect the processor. Safari doesn't allow recording audio or playing audio with the speech synthesizer unless it is a direct result of a touch or tap. This includes the .then part of the promise returned from window.navigator.mediaDevices.getUserMedia.

I've tested this with the latest versions of Chrome, Firefox, and Edge on Windows 10, Chrome on Android, and Safari on an iPad Pro. The only browser I haven't gotten it to work on is Internet Explorer.

// Necessary for safari
// Safari will only speak after speaking from a button click
var isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

function SpeakText() {
    var msg = new SpeechSynthesisUtterance();
    window.speechSynthesis.speak(msg);

    document.getElementsByClassName("wc-mic")[0].removeEventListener("click", SpeakText);
}

if (isSafari) {

    window.addEventListener("load", function () {
        document.getElementsByClassName("wc-mic")[0].addEventListener("click", SpeakText);
    });
}

// Needed to change between the two audio contexts
var AudioContext = window.AudioContext || window.webkitAudioContext;

var context;
var processor;

// Overrides the base constructor to use a singleton like structure
// Needed for Safari
var BasePrototype = AudioContext.prototype;
AudioContext = function () {
    return context;
};
AudioContext.prototype = BasePrototype;

// Redirects the old-style getUserMedia to the new style, which is supported in more browsers
window.navigator.getUserMedia = function (constraints, successCallback, errorCallback) {
    context = new BasePrototype.constructor();
    processor = context.createScriptProcessor(1024, 1, 1);
    processor.connect(context.destination);

    window.navigator.mediaDevices.getUserMedia(constraints)
        .then(function (e) {
            successCallback(e);
        })
        .catch(function (e) {
            errorCallback(e);
        });
};
@compulim


Collaborator

compulim commented Jun 13, 2018

@rosskyl this is a good hack, without the need to touch the Web Chat code.

Can you explain the synthesis part a little more? Do you mean Safari requires a touch/tap for both the synthesis and the recognition parts?

@rosskyl


Contributor

rosskyl commented Jun 13, 2018

The first time you use either the speech synthesizer or the recognizer, it needs to be triggered by a user touch or tap. After the speech synthesis has been triggered once, I was able to get it to work without needing a touch or tap. Apple requires this to prevent the web page from automatically playing or recording audio, even though all of the other browsers allow it.

The speech synthesizer or recognizer will not work if it is triggered from a setTimeout or from the .then portion of a promise (which is what the newer version of getUserMedia uses). For getUserMedia, the AudioContext object must be created from the tap, and the processor created and connected from the tap. The recording itself can be done later.
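The two requirements described above, priming the synthesizer with a gesture and creating the AudioContext plus processor during that same gesture, can be combined into one tap handler. A minimal sketch follows, with the dependencies passed in as parameters so the sequencing is explicit; primeSafariAudio is a hypothetical name, not part of Web Chat:

```javascript
// Hypothetical sketch: everything here must run synchronously inside a
// user tap/click handler for Safari to allow later programmatic audio.
function primeSafariAudio(synth, UtteranceCtor, AudioContextCtor) {
    // 1. Speak an empty utterance so later programmatic synthesis works.
    synth.speak(new UtteranceCtor(""));

    // 2. Create the AudioContext and connect a processor during the tap,
    //    so recording can start later without another gesture.
    var context = new AudioContextCtor();
    var processor = context.createScriptProcessor(1024, 1, 1);
    processor.connect(context.destination);

    return context;
}

// In a page you would wire it to the first tap only, e.g.:
// document.addEventListener("click", function handler() {
//     primeSafariAudio(window.speechSynthesis, SpeechSynthesisUtterance,
//                      window.AudioContext || window.webkitAudioContext);
//     document.removeEventListener("click", handler);
// });
```

The key design point is that nothing in the function is deferred: no setTimeout, no promise .then, because those run outside the gesture's call stack and Safari then refuses the audio operations.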

@compulim


Collaborator

compulim commented Jun 13, 2018

@rosskyl Thanks for the explanation. I totally understand the recognizer requiring a tap/touch, but the synthesis part feels weird to me. I bet one doesn't need to tap/touch for WebAudio.

Anyway, if it's Apple's requirement, then we need to work with it. 😉

@rosskyl


Contributor

rosskyl commented Jun 13, 2018

You could try it without adding the event listener, but I couldn't get it to work without it. You could also write your own custom speech synthesizer and try it with WebAudio. I originally wrote my own that used the speech synthesizer, but it ended up with the same problem the BrowserSpeechSynthesizer had. I fixed it with the event listener and figured out that it worked with the BrowserSpeechSynthesizer as well.

@compulim compulim added Bug and removed investigating labels Jun 16, 2018

@compulim


Collaborator

compulim commented Jun 16, 2018

Thanks @rosskyl. I will make this a bug.

BTW, we are planning to polyfill the HTML WebSpeech API using Cognitive Services. Then we won't need to maintain two different APIs, and we can bring Cognitive Services to platforms that do not support WebSpeech (e.g. Edge, desktop Firefox).

As always, we welcome contributions, and we will take quality projects as dependencies.

Anyway, note to bug fixer:

  • Safari requires a touch/tap to enable both speech recognition and speech synthesis
  • We need to work around this; one possible move:
    1. On any touch/tap on Web Chat, synthesize an empty string to prime the browser

@compulim compulim added this to To do in (Deprecated) Ignite: Others via automation Jun 18, 2018

@compulim compulim changed the title from Bot Framework Webchat microphone speech not working in Safari to Speech need to be primed using touch/tap in Safari Jun 18, 2018

@serpino


serpino commented Aug 6, 2018

Hi @rosskyl,
I am using the chat, and without your JavaScript code the voice conversation works correctly, except on iOS.

If I add your code to the project, it gives me an error when I press the microphone. Can you please help me?

The error I get in Chrome is this:

export function __awaiter(thisArg, _arguments, P, generator) {
    return new (P || (P = Promise))(function (resolve, reject) {
        function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
        function rejected(value) { try { step(generator.throw(value)); } catch (e) { reject(e); } } // <-- the error points to this line
        function step(result) { result.done ? resolve(result.value) : new P(function (resolve) { resolve(result.value); }).then(fulfilled, rejected); }
        step((generator = generator.apply(thisArg, _arguments || [])).next());
    });
}
Uncaught (in promise) TypeError: Illegal invocation
    at MicAudioSource.TurnOn (MicAudioSource.ts:110)
    at MicAudioSource.Listen (MicAudioSource.ts:182)
    at MicAudioSource.Attach (MicAudioSource.ts:131)
    at Recognizer.Recognize (Recognizer.ts:97)
    at SpeechRecognizer.<anonymous> (SpeechRecognition.ts:153)
    at step (tslib.es6.js:91)
    at Object.next (tslib.es6.js:72)
    at tslib.es6.js:65
    at new Promise (<anonymous>)
    at Object.__awaiter (tslib.es6.js:61)

And in Firefox it is:
TypeError: 'get state' called on an object that does not implement interface BaseAudioContext

I am using Cognitive Services. What could be failing?

Thanks

@rosskyl


Contributor

rosskyl commented Aug 6, 2018

I believe that is because some of the internals of Cognitive Services changed. The following is what I currently use:

var isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

function SpeakText() {
    var msg = new SpeechSynthesisUtterance();
    window.speechSynthesis.speak(msg);

    document.getElementsByClassName("wc-mic")[0].removeEventListener("click", SpeakText);
}

if (isSafari) {

    window.addEventListener("load", function () {
        document.getElementsByClassName("wc-mic")[0].addEventListener("click", SpeakText);
    });
}

// Needed to change between the two audio contexts
var AudioContext = window.AudioContext || window.webkitAudioContext;

// Sets the old style getUserMedia to use the new style that is supported in more browsers even though the framework uses the new style
if (window.navigator.mediaDevices && window.navigator.mediaDevices.getUserMedia && !window.navigator.getUserMedia) {
    window.navigator.getUserMedia = function (constraints, successCallback, errorCallback) {
        window.navigator.mediaDevices.getUserMedia(constraints)
            .then(function (e) {
                successCallback(e);
            })
            .catch(function (e) {
                errorCallback(e);
            });
    };
}

I have this working in all of the major browsers on Windows, Android, macOS, and iOS.
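A side note on the isSafari test used in both snippets: the regex matches user agents that contain "Safari" with no occurrence of "Chrome" or "Android" before it. It can be sanity-checked against representative user-agent strings (the strings below are illustrative excerpts, not authoritative):

```javascript
// Same regex as in the snippets above, wrapped so it can be exercised
// against arbitrary user-agent strings.
function isSafariUA(ua) {
    return /^((?!chrome|android).)*safari/i.test(ua);
}

// Illustrative user-agent excerpts (assumptions, trimmed for readability):
var iphoneSafari = "Mozilla/5.0 (iPhone; CPU iPhone OS 11_4 like Mac OS X) AppleWebKit/605.1.15 Version/11.0 Mobile/15E148 Safari/604.1";
var windowsChrome = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/67.0.3396.99 Safari/537.36";
```

One caveat: Chrome on iOS identifies itself with "CriOS" rather than "Chrome", so this test treats it as Safari. Since Chrome on iOS is also WebKit-based, that may be acceptable for this workaround.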

@serpino


serpino commented Aug 7, 2018

@rosskyl This works much better, at least for the rest of the browsers.

I have already made it work on any mobile device.

In the end, I did it the following way:
when a response to the user's message arrives, I call this function, passing the text of the response and the language in which it should be spoken.

function playMessage(msgText, locale) {
    var msg = new SpeechSynthesisUtterance();
    msg.text = msgText;
    msg.volume = 1;    // 0 to 1
    msg.rate = 1;      // 0.1 to 9
    msg.pitch = 1;     // 0 to 2, 1 = normal
    msg.lang = locale; // e.g. "en-US"
    speechSynthesis.speak(msg);
}

Apart from that, I do some other checks, such as whether the user is on a mobile device and whether the message came from the microphone or not.
Thank you very much!

@rosskyl


Contributor

rosskyl commented Sep 25, 2018

Just a note that this will only work with the browser speech synthesizer. It does not work with the Cognitive Services speech synthesizer.

I tried to prime it as above by creating an audio context and playing a tone, but that does not work. I can get the tone to play on the mic tap, but I can't get it to work programmatically.

@serpino


serpino commented Sep 26, 2018

@rosskyl
Right. I only use this process when Cognitive Services does not work, which depends on the browser. So I use both methods, depending on the browser.
