
my adventures with GpWhisper: log to files the different commands ? #122

Open
teto opened this issue Mar 25, 2024 · 18 comments
Labels
enhancement New feature or request

Comments

@teto
Collaborator

teto commented Mar 25, 2024

Now that localai is using my GPU, I wanted to try whisper locally via :GpWhisper. After installing sox, all I got was a not-very-helpful:

Gp: Whisper query exited: 2, 0

I had installed sox because checkhealth asked for it:

- OK sox is installed
- OK sox is compiled with mp3 support

Note that the mp3 check is invalid: sox -h | grep -i mp3 did return mp3, but there seems to be a distinction between reading and writing mp3 (https://bugs.launchpad.net/ubuntu/+source/sox/+bug/223783).
I am on nix and I had to install (sox.override({ enableLame = true; })) for sox to be able to generate mp3.

In order to debug my setup I added print statements; it would be nice if gp.nvim could log some of its operations to a file instead. I don't like plenary much but it has some facilities for this. With package managers like https://github.com/nvim-neorocks/rocks.nvim/ it should become more tractable to use dependencies in the future.
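
For what it's worth, a minimal sketch of what such a file logger could look like (the module layout, log path and function name here are all made up, not gp.nvim code):

```lua
-- minimal file-logger sketch; module name, path and API are hypothetical
local M = {}

M.logfile = vim.fn.stdpath("state") .. "/gp.nvim.log"

-- append one timestamped line to the log file
function M.log(msg)
	local f = io.open(M.logfile, "a")
	if not f then
		return
	end
	f:write(os.date("%Y-%m-%d %H:%M:%S") .. " " .. msg .. "\n")
	f:close()
end

return M
```

Every external command (and its exit code) could then be written there before/after it runs.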

So anyway GpWhisper was trying to run:

cd /tmp/gp_whisper && export LC_NUMERIC='C' &&
sox --norm=-3 rec.wav norm.wav &&
t=$(sox 'norm.wav' -n channels 1 stats 2>&1 | grep 'RMS lev dB' | sed -e 's/.* //' | awk '{print $1*1.75}') &&
sox -q norm.wav -C 196.5 final.mp3 silence -l 1 0.05 $t'dB' -1 1.0 $t'dB' pad 0.1 0.1 tempo 1.75 &&
curl --max-time 20 https://api.openai.com/v1/audio/transcriptions -s -H "Authorization: Bearer sk-..." -H "Content-Type: multipart/form-data" -F model="whisper-1" -F language="en" -F file="@final.mp3" -F response_format="json"

Gp: Whisper query exited: 2, 0

So I found out that rec.wav did not exist / was empty. Checking the size of the recording could help diagnose a failed recording.
Then I had to split the command apart to find the issue. It turns out the conversion to mp3 failed because of what I mentioned earlier: my version of sox listed mp3 in sox -h, but it was not able to generate mp3 until I enabled the "lame" library.
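
A check like the following could catch the empty-recording case early (just a sketch; the path and message are illustrative, not the plugin's actual code):

```lua
-- sketch: fail fast if the recording is missing or empty
local rec = "/tmp/gp_whisper/rec.wav"
if vim.fn.getfsize(rec) <= 0 then
	-- getfsize() returns -1 when the file does not exist and 0 when it is empty
	vim.notify("Gp: recording " .. rec .. " is missing or empty", vim.log.levels.ERROR)
	return
end
```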

So now it works (yeah \o/). But initially I wanted to run it locally, so I changed the hardcoded endpoint to my local localai endpoint
.. " --max-time 20 http://localhost:11111/v1/audio/transcriptions -s "
and it works so fast it's scary (with an RTX 3060, so nothing that fancy).
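
For reference, the same thing is possible without patching the code by using the whisper_api_endpoint option that comes up later in this thread (a sketch; 11111 is just my local localai port):

```lua
-- sketch: point GpWhisper at a local OpenAI-compatible endpoint instead of api.openai.com
require("gp").setup({
	whisper_api_endpoint = "http://localhost:11111/v1/audio/transcriptions",
})
```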

My first attempt was in my native language (!= English) and the result was garbage ^^
Maybe the default whisper_language = "en" could be chosen via the locale instead? But I nitpick.
It took me a few (2?) hours to get there, so I'll pause for now :)
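
On the locale idea, something as simple as this could pick the default (a sketch, assuming the two-letter code from $LANG is good enough for whisper):

```lua
-- sketch: derive a default whisper_language from the environment locale
local locale = os.getenv("LANG") or ""                    -- e.g. "fr_FR.UTF-8"
local whisper_language = locale:match("^(%a%a)") or "en"  -- fall back to "en"
```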

My USB mic needed some custom config that I am listing more for my future self than for the maintainers (sry ^^'):

$ arecord -l                                                                                         
**** List of CAPTURE Hardware Devices ****
card 1: PCH [HDA Intel PCH], device 0: ALC892 Analog [ALC892 Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: PCH [HDA Intel PCH], device 2: ALC892 Alt Analog [ALC892 Alt Analog]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 2: Microphones [Blue Microphones], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

The help/doc of arecord is not great, so it was not clear how to specify the device.
I found the answer in https://unix.stackexchange.com/questions/360192/alsa-error-channel-count-2-not-available-for-playback-invalid-argument: plughw accepts more options than hw, it seems, and in the end this worked:

arecord  "-D" "plughw:2,0" "-c" "1" "-f" "S16_LE" "-r" "48000" "-d" 3600 "/tmp/gp_whisper/rec.wav"
@mecattaf

@teto I would really appreciate a short guide on how to get local whisper working for English transcription on your device!

@teto
Collaborator Author

teto commented Jul 16, 2024

I would rather make the plugin more approachable with things such as #125, more logging, etc. The merge of the multiple-providers support is very good news. Now it depends on my available time :'( and merges.

@mecattaf

Is there a way I can help with the logging part?

@teto
Collaborator Author

teto commented Jul 23, 2024

@mecattaf that would be awesome. We talked a bit about it at #125 (comment). It's important to discuss the implementation with @Robitx first.

@Robitx
Owner

Robitx commented Jul 23, 2024

@mecattaf I was just working on it this evening ( #166 ) but if you're willing to help, I'm sure we can find something. What is your current biggest pain point concerning Gp?

@mecattaf

Hey @Robitx, my end goal here is to get local whisper (or any speech-to-text) into gp.nvim.
I want to work on logging to unblock @teto from creating the PR that enables that :)

@teto
Collaborator Author

teto commented Jul 24, 2024

@mecattaf what's not working for you? main works nicely for me, and some of the configuration is setup-dependent. We can add checks to verify, for instance, that the generated mp3 is not empty. More logging should help too, so maybe add a PR that logs at crucial times, for instance when running external commands.
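
A rough sketch of that "log around external commands" idea (none of these names are gp.nvim APIs; logger.log stands in for whatever file logger ends up existing):

```lua
-- sketch: wrap external command execution so the command line and its exit
-- code always end up in the log (logger.log is a hypothetical helper)
local function run_logged(cmd, on_exit)
	logger.log("running: " .. table.concat(cmd, " "))
	return vim.fn.jobstart(cmd, {
		on_exit = function(_, code)
			logger.log(("%s exited with code %d"):format(cmd[1], code))
			if on_exit then
				on_exit(code)
			end
		end,
	})
end
```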

@mecattaf

mecattaf commented Jul 24, 2024

@teto could you be so kind to share your config file? Or perhaps a short guide on how to set it up? Many thanks!

@gonzaloserrano

@Robitx suggested in Discord to share these config examples in the repo wiki. This would be a great start.

@teto
Collaborator Author

teto commented Jul 24, 2024

As you can see I have nothing specific anymore; the commented code is a remnant from before the multiple-providers MR was merged: https://github.com/teto/home/blob/577d3a6cbb37ad874b601bc0af73e7162486d479/config/nvim/lua/teto/gp/setup.lua#L24

@Robitx
Owner

Robitx commented Jul 24, 2024

Just to back up some of my thoughts concerning whisper.

TLDR: The cross platform audio world sucks.

SoX is the only cross-platform candidate usable for recording, but an unknown number of people were hitting latency issues causing cut-offs at the beginning/end of the recording. That led me to eventually introduce ffmpeg with avfoundation for macOS and arecord for Linux, which are prioritized over SoX for recording if found.

Yes, ffmpeg could potentially replace SoX completely, but the cross-platform mess of possible incantations based on available input devices is something I'd like to avoid (https://ffmpeg.org/ffmpeg-devices.html#Input-Devices).

Whisper (at least through the OpenAI API) is limited to 25MB input files in mostly proprietary formats (mp3, mp4, mpeg, mpga, m4a, wav, and webm). Wav is around 10MB per minute of mono recording, mp3 is around 1MB per minute, so the 25MB cap is roughly 2.5 minutes of wav versus about 25 minutes of mp3. That means compression for any non-trivial length of audio and dealing with SoX potentially missing mp3 support (at least NixOS and Ubuntu both have this problem).

I haven't tested/looked up whether and how transcription speed depends on the format - we might be able to avoid mp3 for simple whisper instructions like GpWhisperRewrite. But for use cases such as dictating something for transcription, wav is unusable without some splitting mechanism, which would again complicate things.

Then there is the question of what to use for running whisper locally. The Whisper model is small enough that I might consider bundling some cross-platform solution directly with the Gp plugin.

Ollama can't be expected anytime soon (ollama/ollama#1168), but there are other candidates such as https://github.com/fedirz/faster-whisper-server.

Running

docker run --publish 8000:8000 --volume /tmp/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:latest-cpu

and setting in the conf:

whisper_api_endpoint = "http://localhost:8000/v1/audio/transcriptions",

basically works already (although it is slow and the first call times out since it pulls the model), and it uses customized models, so the currently hardcoded whisper-1 doesn't match (making it fall back to the default Systran/faster-whisper-medium.en).

$ docker run --publish 8000:8000 --volume /tmp/huggingface:/root/.cache/huggingface fedirz/faster-whisper-server:latest-cpu
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
2024-07-24 17:15:22,710:INFO:faster_whisper_server.logger:handle_default_openai_model:whisper-1 is not a valid model name. Using Systran/faster-whisper-medium.en instead.
2024-07-24 17:16:25,328:INFO:faster_whisper_server.logger:load_model:Loaded Systran/faster-whisper-medium.en loaded in 62.60 seconds. cpu(int8) will be used for inference.
2024-07-24 17:16:54,145:INFO:faster_whisper_server.logger:handle_default_openai_model:whisper-1 is not a valid model name. Using Systran/faster-whisper-medium.en instead.
INFO:     172.17.0.1:42792 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
2024-07-24 17:17:26,107:INFO:faster_whisper_server.logger:handle_default_openai_model:whisper-1 is not a valid model name. Using Systran/faster-whisper-medium.en instead.
INFO:     172.17.0.1:52242 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
2024-07-24 17:20:51,740:INFO:faster_whisper_server.logger:load_model:Max models (1) reached. Unloading the oldest model: Systran/faster-whisper-medium.en
2024-07-24 17:21:34,795:INFO:faster_whisper_server.logger:load_model:Loaded Systran/faster-distil-whisper-large-v3 loaded in 43.04 seconds. cpu(int8) will be used for inference.
INFO:     172.17.0.1:56894 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
INFO:     172.17.0.1:40354 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK
INFO:     172.17.0.1:54184 - "POST /v1/audio/transcriptions HTTP/1.1" 200 OK

@mecattaf

@Robitx do you think this should be done in a separate plugin?
Specifically local whisper. It looks like there is some shared/overlapping functionality but I do not want to keep spamming you guys here as it looks like a lot of moving pieces are out of your control. I would not change the UI/UX that you created.
I can start by forking gp.nvim (perhaps calling it "whisper.nvim") and start working on a bespoke splitting mechanism. Any guidance, or, if you're up for it, a full PRD, would be very helpful.
And again thank you so much for the continued support. Cracking local speech to text for audio of arbitrary length straight into nvim is something that I would use 5 hours per day, and I think this is relevant to enough users that I can tackle the associated complexity.

@Robitx
Owner

Robitx commented Jul 25, 2024

@mecattaf I don't think there is a need to separate it; I've added GpWhisper exactly because voice control/dictation is important to me too.

Concerning the splitting mechanism - SoX itself can do it; the trouble is the silence thresholds, which differ from device to device and often with the time of day on the same device.

  • the first step would be continuous recording without any post-processing, just splitting it into files of a certain duration: sox -d raw.wav trim 0.0 0.1 : newfile : restart (this would keep making a new rawN.wav file for each chunk; a rough sketch of this follows after the list)
  • old raw files can be periodically removed
  • when the user calls some whisper command, we can start from files created after, let's say, now - 2 seconds, concat them together using sox, play with some post-processing and shove the result into the whisper endpoint
  • for long dictation we could take the N oldest unprocessed chunks, concat them together and try to split them on some sufficient pause/silence; the first part would go to whisper, and the second would become the first unprocessed chunk for the next iteration
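
A very rough sketch of the chunked-recording part of this idea (all names, paths and durations are made up, and it uses the same newfile/restart trick as above, just with 1-second chunks; this is not gp.nvim code):

```lua
-- sketch: record continuously into ~1-second chunk files, then concatenate the
-- most recent ones when a whisper command is invoked
local chunk_dir = "/tmp/gp_whisper/chunks"

local function start_recording()
	vim.fn.mkdir(chunk_dir, "p")
	-- sox -d reads from the default input device; trim/newfile/restart splits
	-- the recording into raw001.wav, raw002.wav, ... of roughly 1 second each
	return vim.fn.jobstart({
		"sox", "-d", chunk_dir .. "/raw.wav",
		"trim", "0", "1.0", ":", "newfile", ":", "restart",
	})
end

local function concat_recent_chunks(out_file, seconds)
	local chunks = vim.fn.sort(vim.fn.glob(chunk_dir .. "/raw*.wav", false, true))
	if #chunks == 0 then
		return
	end
	-- keep only the last `seconds` chunks (one second per chunk in this sketch)
	local recent = vim.list_slice(chunks, math.max(1, #chunks - seconds + 1))
	local cmd = { "sox" }
	vim.list_extend(cmd, recent)
	table.insert(cmd, out_file)
	vim.fn.system(cmd)
end
```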


@mecattaf

Got it!

Three things I really like from this project: https://github.com/mkiol/dsnote

  1. re: silence thresholds, the speech-to-text functionality waits a few seconds before transcribing, and the user can keep talking in the meantime. Can we draw inspiration from how they do the splitting?
  2. Huge collection of self-hosted whisper-like models in many different languages. I can prepare a companion repo with download/install instructions if that helps anybody.
  3. Add-ons for NVIDIA and AMD GPUs. I know this is not the purpose of this project, but I am also happy to provide a companion repo for those.

I do not like the Speech Note UI; the gp.nvim experience is second to none imo. Hopefully we can get the best of both worlds :)

@Robitx
Owner

Robitx commented Jul 25, 2024

@mecattaf Gp uses rudimentary threshold detection already: it reads the RMS level ("average loudness", for example -10dB) and multiplies it by some constant, e.g. RMS*1.75 = -17.5 dB, and everything below that is considered silence (audio under the threshold for the specified duration causes a split).

gp.nvim/lua/gp/init.lua

Lines 3370 to 3379 in 0e7a4a2

.. "export LC_NUMERIC='C' && "
-- normalize volume to -3dB
.. "sox --norm=-3 rec.wav norm.wav && "
-- get RMS level dB * silence threshold
.. "t=$(sox 'norm.wav' -n channels 1 stats 2>&1 | grep 'RMS lev dB' "
.. " | sed -e 's/.* //' | awk '{print $1*"
.. M.config.whisper_silence
.. "}') && "
-- remove silence, speed up, pad and convert to mp3
.. "sox -q norm.wav -C 196.5 final.mp3 silence -l 1 0.05 $t'dB' -1 1.0 $t'dB'"

But there is a lot of SoX magic not utilized yet, since I didn't have time to play with it. For example the compand effect, which could get voice/silence to a predictable level before splitting and make it easier for whisper to process.

I'll try to spend some time on it during weekend.

@mecattaf

mecattaf commented Aug 4, 2024

Three things I really like from this project: https://github.com/mkiol/dsnote

  1. Huge collection of self-hosted whisper-like models with many different languages. Can prepare a companion repo with download/install instructions if helpful to anybody.

@teto I think it would be useful to compile all the info related to offline whisper somewhere. Should this be a self-contained markdown file, or a separate repo? I would like to
A) document the steps needed to make local whisper work with gp.nvim
B) offer a large collection of models that we can try locally, inspired by dsnote's docs
Let me know your thoughts

@Robitx Robitx added the enhancement New feature or request label Aug 4, 2024
@teto
Collaborator Author

teto commented Aug 5, 2024

I think it would be useful to compile all the info related to offline whisper somewhere.

This is up to the whisper project IMO.

document the steps needed to make local whisper work with gp.nvim

This feels like the hundreds of tutorials on how to run LLMs locally. Maybe just link one of those with some comments?

I think the wiki is the most appropriate place to do so. Go ahead, but gp.nvim != whisper, so link the whisper docs where appropriate rather than duplicating them and risking them getting outdated.

@Robitx
Owner

Robitx commented Aug 5, 2024

@mecattaf sorry I haven't got around to it yet, I spent the last two week(s/ends) cleaning up the code base and squashing some bug reports.
