encoding errors on example program #23

Open
Masaiki opened this issue Feb 14, 2023 · 3 comments

Masaiki commented Feb 14, 2023

Thank you for your contribution. The DirectML version of Whisper is much faster than the pure-CPU version of whisper.cpp, but I ran into a few issues while using it. The first is an encoding problem: in the debug output of the desktop version, the text is sometimes missing a few characters. I suspect it is being converted from UTF-8 to CP_ACP (Windows-936 / GB2312-80 on my system). Similar encoding errors also happen in the CLI, where the transcribed text comes out as almost unreadable '?' characters.
The second issue is that almost every audio file I transcribe reports the error “runFullImpl: failed to generate timestamp token - skipping one second”.
The third problem is similar to #18: after recognizing for a while it always stops working and repeatedly outputs the last recognized sentence.
If you want to track the last two problems, I can open separate issues for them.
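
For what it's worth, missing characters and '?' output are exactly what appears when UTF-8 bytes are pushed through the ANSI code page (CP_ACP). Below is only a sketch of one way a Win32 console program can avoid that, by converting to UTF-16 and writing wide characters; it is not the project's actual output path, and the sample string is made up:

    #include <windows.h>
    #include <string>

    // Sketch: print UTF-8 text (e.g. from the model) on a Windows console
    // without routing it through CP_ACP such as Windows-936.
    static void printUtf8( const char* utf8 )
    {
        // UTF-8 -> UTF-16
        const int len = MultiByteToWideChar( CP_UTF8, 0, utf8, -1, nullptr, 0 );
        if( len <= 0 )
            return;
        std::wstring wide( (size_t)len, L'\0' );
        MultiByteToWideChar( CP_UTF8, 0, utf8, -1, &wide[ 0 ], len );

        // WriteConsoleW ignores the console code page entirely.
        DWORD written = 0;
        WriteConsoleW( GetStdHandle( STD_OUTPUT_HANDLE ), wide.c_str(),
                       (DWORD)( len - 1 ), &written, nullptr );
    }

    int main()
    {
        printUtf8( "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E" );  // UTF-8 bytes for a sample Japanese word
        return 0;
    }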

Masaiki commented Feb 14, 2023

Example file that reproduces the problems above: https://drive.google.com/file/d/19WnNJLL1IThoVznUog6hyMQUwTZKU2T2/view?usp=share_link
The audio file was produced with the command line "ffmpeg -i input_video -vn -ar 16000 -ac 1 -c:a pcm_s16le output.wav".
The model is ggml-medium.bin from whisper.cpp, and the audio language is Japanese.

@rsmith02ct

I've noticed the same with Japanese: I get the same line repeated again and again, using both the medium and large models.

emcodem commented Feb 18, 2023

Feeding audio with a higher sample rate than 16 kHz helped me get rid of "runFullImpl: failed to generate timestamp token - skipping one second". I only fed 16 kHz because the CPU version of whisper.cpp required it, but Const-me's version doesn't seem to have that limit.
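For example, starting from the extraction command quoted above, a higher-rate file is just a matter of changing -ar (48 kHz and the output name are only an illustration):

    ffmpeg -i input_video -vn -ar 48000 -ac 1 -c:a pcm_s16le output48k.wav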
The "repeating" sentences AKA "sequence-to-sequence architecture failure loop" however is not influenced by the source audio rate. Instead i can influence it by feeding different parts of the source audio, e.g. when feeding just 1 minute of the portion that caused the issue, the issue does not occur but when feeding 8 minutes before more, it occurs. (Of course this must be done using a binary exact portion of the source audio, e.g. ffmpeg -codec copy). The issue is also mentioned here: openai/whisper#192
These two problems don't seem to be related, in my case.

Would anyone like to confirm that a higher audio sample rate solves the "runFullImpl" error?
As for the failure-loop issue, setting condition_on_previous_text is the recommended way to handle it, but that option is not available in this project.
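
For comparison, upstream whisper.cpp exposes the closest equivalent as the no_context flag on whisper_full_params. The following is only a sketch against whisper.cpp's C API (with a hypothetical helper name), not something this project currently exposes:

    #include "whisper.h"   // upstream whisper.cpp header

    // Sketch only: in upstream whisper.cpp, disabling conditioning on the
    // previous segment's text is the usual workaround for the repeat loop.
    static int transcribeNoContext( struct whisper_context* ctx,
                                    const float* pcm, int nSamples )
    {
        whisper_full_params params = whisper_full_default_params( WHISPER_SAMPLING_GREEDY );
        params.no_context = true;   // don't feed previous text back into the decoder
        params.language   = "ja";   // Japanese, matching the sample audio above
        return whisper_full( ctx, params, pcm, nSamples );
    }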
