Speech recognition with visual feedback #252

Open
Cuperino opened this issue May 22, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@Cuperino
Owner

Cuperino commented May 22, 2024

Michael Tunnell wrote in our Telegram chat:

what are the chances you could add a feature that automatically scrolls as you speak, stops when you pause or improvise, and seamlessly resumes when you return to your script?

To which I replied:

This can be added. Thank you for the suggestion.

It would take some time, since it's a beefy task and there are other tasks people have voted for more, which therefore have higher priority. If you or anyone else here would like this to get higher priority, please vote for it in our Patreon Poll: https://www.patreon.com/posts/choose-where-is-78344217

It would be this voting option: "Voice control with visual feedback indicating to slow down or increase speed".
The visual feedback aspect is more of an implementation detail. A "pause listening" button could appear when a tangent is detected, so the prompter isn't moved ahead prematurely.

The one thing that had been holding me back from implementing this (besides the move to Qt 6) is the tradeoff between latency, accuracy, and multi-language support. Until very recently, there was no way to produce accurate speech transcriptions locally in real time.

The best that could be done until last year was to incorporate a third-party STT API from Azure or Google. These transcribe quite well but suffer from latency depending on your location. Unfortunately, they also have a cost for the user, and if I tried to abstract that behind a subscription and make something out of it, it could go horribly wrong, since someone could pirate it; so I've steered away from this option. There are also privacy concerns with those third-party services.

Now the state of the art in local STT is OpenAI's Whisper. It wasn't an option until recently because it processes audio in 30-second chunks. The workaround is to feed it smaller chunks and pad them to 30 seconds with silence. The Whisper "small" model is impressively accurate and fast enough that it could be employed here.
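
To make the padding workaround concrete, here is a minimal sketch using the open-source `openai-whisper` Python package. It is illustration only: QPrompt itself is C++/QML and would more likely embed a native port such as whisper.cpp. The function name and the 16 kHz mono float32 input format are my own assumptions.

```python
# Sketch of the workaround: feed Whisper short chunks padded to its
# fixed 30-second window with silence. Assumes `pip install openai-whisper`.
import numpy as np
import whisper

model = whisper.load_model("small")  # the "small" model mentioned above

def transcribe_chunk(samples: np.ndarray) -> str:
    # pad_or_trim pads with zeros (i.e. silence) or trims to exactly
    # the 30 s of audio Whisper's encoder expects.
    audio = whisper.pad_or_trim(samples.astype(np.float32))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
    return result.text
```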

Having said that, QPrompt users would need a beefy GPU to use this feature. The real-time performance I'm describing was achieved on a laptop with an Intel Core i7-1280P and an NVIDIA Quadro T550. The good news is that CUDA cores are separate from the shader cores, so running this should have no impact on QPrompt's fluidity, provided the CPU multi-threading is programmed correctly.
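
On the multi-threading point, the idea is simply to keep transcription off the UI thread so scrolling stays smooth. A hypothetical sketch follows; the queue, sentinel, and `on_text` callback are my own names, and it reuses `transcribe_chunk` from the sketch above:

```python
# Hypothetical worker-thread layout: the prompter's render loop never
# blocks; audio chunks are queued and transcribed in the background.
import queue
import threading

audio_queue: queue.Queue = queue.Queue()

def transcription_worker(on_text) -> None:
    while True:
        samples = audio_queue.get()   # short audio chunk from the mic
        if samples is None:           # sentinel value: shut down cleanly
            break
        on_text(transcribe_chunk(samples))  # from the sketch above

threading.Thread(target=transcription_worker, args=(print,),
                 daemon=True).start()
```

In QPrompt proper, the equivalent would presumably be a worker QThread posting recognized text back to the QML scroller via signals.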

He then wrote:

That makes a lot of sense in regards to limitations and that OpenAI Whisper thing sounds pretty freaking cool.

That is very promising, and I hadn't considered how much would be involved in it, so I'm glad it is actually doable now 😄 That part about needing a powerful enough computer is interesting. Do you think high-end phones are powerful enough to do it?

I am asking about phones because I saw you got Apple Developer status, and I think making an iOS app would be very beneficial for this, because I think you should absolutely make these features subscription-based. Something that automates the scrolling is undoubtedly a premium feature.

Regarding phones I replied:

Not yet. Some kind of hardware acceleration would have to be involved. I read that someone modified the small model to run on a TPU, but with reduced accuracy. A flagship with a powerful TPU might run something like this, but I don't think the kind of TPU or compute cores that most phones carry can handle this today.

Even if a phone could handle it, the model is too large to be distributed with the app; it would have to be downloaded afterwards, as an asset, for the app to be accepted into the Play Store.

What most software does is use a server to do the processing, but the number of paying users needs to outweigh the cost, and there would need to be multiple servers; otherwise, latency could be too high depending on your region.

For the rest of the conversation, see our Telegram chat room. Let's keep this issue on topic.

@Cuperino Cuperino self-assigned this May 22, 2024
@Cuperino Cuperino added the enhancement New feature or request label May 22, 2024
Projects
Status: Remote Control
Development

No branches or pull requests

1 participant