
text output looping/repeating (until end) #26

Closed
emcodem opened this issue Feb 22, 2023 · 27 comments

emcodem commented Feb 22, 2023

Also mentioned here:
#23

I ran a few tests on longer video clips (e.g. 2 hours), and mostly it tends to repeat the same sentence from a certain point until the end; e.g. after one hour, you see repeated output forever. The timestamps seem to indicate that new text was detected, but the text content is the same as before.

https://1drv.ms/u/s!AkS-A9Jqq09FgzEX78lvh7SiMAYu?e=C6f8GW

In this example, after about 2 minutes, the sentence repeats: "Jagt mich mal mit frahmen nudelholz".
The exact command I use:
C:\dev\whisper\Whisper\x64\Release\main.exe -f C:\temp\test.wav -l de -m C:\temp\whisper\ggml-large.bin

I ran different tests, cutting portions out of an affected file, and it turns out the looping is not caused by the audio content itself: the affected area transcribes just fine if I cut away, e.g., 1 minute before it, but if I leave 2 minutes before it, the looping happens.

In ContextImpl.cpp, I tried to catch "repeated text" by copying the latest "text" to the heap as soon as it is complete (around line 740) and, before that, comparing whether the text is the same as last time. If yes, seek a little:
// Compare the freshly completed segment against the previous one;
// if it is identical, seek ahead a little to break the loop.
if (lasttext && 0 == strncmp(lasttext->c_str(), text.c_str(), text.length())) {
    logDebug(u8"last text repeated");
    seek += 100;
}

// Remember the current text for the next comparison.
delete lasttext;
lasttext = new std::string(text);

This seems to work around the issue (it still needs a lot of testing), but I am really not sure whether this is the correct way to do it.
Also, a question: is it correct to place such workarounds there (there is another workaround a few lines above), or should the cause be found and fixed somewhere else? (Where?)


emcodem commented Feb 23, 2023

A few tests with the workaround mentioned above are done now. It looks like we still get repeated text output, but it is a lot better than before: e.g. the issue no longer persists for a full hour, but only for about 1 minute, after which it catches up with the content.
I feel we would need to re-initialize or flush with zero data instead of just seeking. Or, of course, find out what is actually causing the bug, but I have no idea yet how to do that.


GrahamboJangles commented Mar 1, 2023

I'm gonna leave this here because it's relevant, but I don't know how this fix would translate to this Windows port, as I don't see any way to edit the parameters.

openai/whisper#192


VRCWizard commented Mar 8, 2023

For my use case (transcribing from a microphone with the C# NuGet package), setting the max-context parameter to 0 stopped the text from repeating. In the C# example code, max-context is set to -1 by default.

Relevant Posts that mention this solution:
ggerganov/whisper.cpp#508
ggerganov/whisper.cpp#408 (reply in thread)


emcodem commented Mar 9, 2023

Thanks @VRCWizard, that opened up a lot of insights for me. I believe setting the max-context (-mc) option to 0 is the equivalent of the condition_on_previous_text option of the Python implementation, which they propose as a solution (in the issue @GrahamboJangles mentioned above).
First testing shows good results, and I believe that when -mc is 0, instead of getting a loop, I get a warning message "failed to generate timestamp token - skipping one second" at the spot where it would normally start repeating the output.
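
For reference, this is the command from my first post with that option added (assuming -mc behaves like whisper.cpp's max-context flag):

C:\dev\whisper\Whisper\x64\Release\main.exe -f C:\temp\test.wav -l de -m C:\temp\whisper\ggml-large.bin -mc 0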

Also, I played with adopting this into my workaround: I leave -mc at the default in order to keep the improved detection quality, but when I detect repeated output, I clear the history instead of seeking:

if (0 == strncmp(lasttext->c_str(), text.c_str(), text.length())) {
    logDebug(u8"last text repeated, clearing token/prompt past");
    prompt_past.clear();
}

This also seems to work and it feels much better than seeking.

EDIT: the following sentences are superseded by new info, which I posted later.
I am far from understanding everything in depth, but I have a feeling the program should take more care of this prompt_past; e.g. maybe the past should also be cleared if nothing was decoded for a certain time period. Is it really helpful to feed in the past from 30 seconds ago? Also, what about * Music * and silence: is it really helpful to keep a past when there was a clear break in the conversation/speech?

Also, if I see correctly, the default setting for this is really high:

ContextImpl.misc.cpp:
rdi->n_max_text_ctx = 16384;

I don't know what exactly that counts; tokens or characters or something?

Anyway, this explains why the issue does not happen for me when feeding starts 1 minute before the spot of interest, but does happen when feeding starts 2 minutes before.
As far as I understand, this "context" or "past" stuff helps the model get better results, but is it really helpful, or intended, for the past to be that long?

Just for reference, here is the original description of the python implementation on the subject:

condition_on_previous_text: bool
    if True, the previous output of the model is provided as a prompt for the next window;
    disabling may make the text inconsistent across windows, but the model becomes less prone to
    getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.

This sounds like that context is only helpful within one sentence, not within the past 10 minutes...

@bilo1967

It seems like there's an attempt to fix this repetition issue in the official OpenAI-whisper: openai/whisper#1052


emcodem commented Apr 25, 2023

Just adding some relevant information. I just found that OpenAI says "The model will only consider the final 224 tokens of the prompt and ignore anything earlier."
https://platform.openai.com/docs/guides/speech-to-text/longer-inputs
That seconds my assumption that the default prompt size of 16384 I mentioned above might be a little too high.
Also, OpenAI provides the option to submit a prompt in text form, which I find very interesting. I ask myself whether they keep the same prompt over the whole session, or follow the same strategy as the Const-me version and just replace the prompt with the most recent tokens from the currently running transcription.
EDIT: I just saw that we basically never use the 16384 default prompt size, because the decision in run_full_impl is "take the lower number: either the -mc user input or model.parameters.n_text_ctx/2", which is 224. If I am correct, it would be worth mentioning in the command-line parameters that the default AND maximum value of -mc is 224.
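
In code terms, that decision boils down to roughly this (a minimal sketch; the names are partly mine and not the actual run_full_impl source):

    // Take the lower number: the -mc user input (n_max_text_ctx, 16384 by
    // default) or half the model's text context. Whisper's n_text_ctx is 448,
    // so the effective default and maximum is 448 / 2 = 224 tokens.
    // Needs <algorithm> for std::min.
    const int effectiveContext = std::min( params.n_max_text_ctx, model.parameters.n_text_ctx / 2 );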

On another topic: I cannot yet 100% confirm that clearing the prompt past always helps when the output starts repeating. I hope I'll be able to confirm that soon.


emcodem commented Apr 27, 2023

As most readers here might not be able to apply the code changes I mentioned above for testing, I thought it might be a good idea to share what I am working with.
The result is not perfect: e.g. I still get repeated text, but at least it does not repeat forever. Also, it can only work around "repeated" text, not other prompt-related issues, such as the model counting 11, 12, 13... forever because the human voice counted from 1 to 10. So the base issue of course remains; this is just a pill for the pain, it doesn't heal the wound.

Anyone who feels adventurous, please try this main.exe; I use it in production. It helps additionally when you set -mc 223, as I mention above; that seems to be what OpenAI does too. Also, as mentioned above, feeding 48 kHz wav files instead of 16 kHz (only the original whisper.cpp is limited to 16 kHz) seems to help too.

Whisper_last_text_repeated_workaround.zip

Source code changes:
0001-test-for-detect-repeated-text-and-reset-prompt-past.patch

@eagleftw023

> As most readers here might not be able to apply the code changes I mentioned above for testing [...] It helps additionally when you set -mc 223, as I mention above [...]

-mc 223 or 224?


Highlander1536 commented Apr 27, 2023

> As most readers here might not be able to apply the code changes I mentioned above for testing [...] Anyone who feels adventurous, please try this main.exe [...]

Tried it out, and it does catch repeats and refresh; definitely a good fix. I was able to do a 3+ hour video without it getting permanently stuck! It also helps with repeats in live transcribing.


albino1 commented Apr 28, 2023

Yeah, I also tested it, and it caught a bunch of repeats and restarted without issues.

I'm not sure how different it is from just running -mc 0, as that's really hard to compare with the really long examples I used for testing. It certainly spits out a message about repeated lines and does some cleanup, though.

Either way, it's a huge improvement over the base version as it stands, so thank you for making it.


emcodem commented Apr 28, 2023

@eagleftw023

> -mc 223 or 224?

Honestly, a very good question. I wrote 223 on purpose because in my last test it made a difference compared to 224, so I thought we must start counting at 0, meaning 223 is actually 224 (pretty sure I'm wrong here). But I also wrote above that I believe 224 is the maximum anyway, no matter if you pass a higher number. Also, there is the question, as albino1 says: it is not yet 100% clear whether -mc > 0 makes any difference at all, especially for material that cannot easily be put into context, like a movie (compared to news or a discussion; in a movie anything can be spoken at any time, so what kind of context could help get it right?)...

@eagleftw023

Are you still getting runFullImpl: failed to generate timestamp token - skipping one second?


emcodem commented Apr 28, 2023

Yup, I didn't really look into that part yet; it's a minor issue for me.

@eagleftw023

How do you transcribe the parts that were skipped when runFullImpl: failed to generate timestamp token - skipping one second occurs, or when it catches repeats?


emcodem commented Apr 28, 2023

I don't; I'm happy to get 99% of the output of a 3-hour clip :D In my case, a human corrects the output afterwards anyway; the timecodes and lots of the words are garbage in any case.


emcodem commented Apr 30, 2023

Just adding another puzzle piece from the whisper.cpp project, but this one only applies to the last 500 samples, so not very helpful, I fear (except maybe for streaming mode?):

    // if there is a very short audio segment left to process, we remove any past prompt since it tends
    // to confuse the decoder and often make it repeat or hallucinate stuff
    if (seek > seek_start && seek + 500 >= seek_end) {
        prompt_past.clear();
    }
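
(If I read whisper.cpp correctly, seek there is counted in 10 ms frames rather than raw samples, so seek + 500 >= seek_end covers roughly the last 5 seconds of the file.)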


eagleftw023 commented Apr 30, 2023

Did you update the source code?
If yes, can you upload the updated build like the previous one you uploaded?


emcodem commented Jul 21, 2023

Update: the patch above works, but we can do much better.
This updated version only deletes a single token from the prompt past instead of the whole prompt, which can potentially preserve accuracy. Also, I strongly believe we must check whether we are comparing "empty" text, because the check fired far too often. I am not sure whether an empty text is even expected, but it is what I saw in production pretty often.
Further thoughts: maybe we should also check whether the lengths of the texts match, because currently the check can also hit if only a substring matches.

Let's just hope we don't get into a repeating "empty" text output loop, because that one would no longer be caught.

if (!prompt_past.empty()) {
    // Note: strncmp with text.length() is a prefix match; see the caveat above.
    if (lasttext->length() > 0 && 0 == strncmp(lasttext->c_str(), text.c_str(), text.length())) {
        logDebug(u8"Last text repeated ('%s'), clearing token/prompt past. Prompt size before: %d",
            lasttext->c_str(), (int)prompt_past.size());
        prompt_past.erase(prompt_past.begin());
        logDebug(u8"Size after: %d", (int)prompt_past.size());
    }
}
delete lasttext;
lasttext = new std::string(text);
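
The substring caveat could be addressed by comparing the full strings instead of a prefix; a sketch, not part of the patch above:

    // Exact comparison instead of strncmp's prefix match, so a repeated short
    // segment no longer matches the beginning of a longer, different one:
    const bool repeated = !lasttext->empty() && *lasttext == text;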

And @eagleftw023, sorry for not replying to your request, but I guess it wouldn't really make sense to publish a custom build with the whisper.cpp lines I posted above, because they only help in the last x milliseconds. (Hmm, maybe that helps in live mode, but I'm not sure about that.)

@eagleftw023

> Update: the patch above works, but we can do much better. This updated version only deletes a single token from the prompt past instead of the whole prompt [...]

Thanks a lot! Can you share the new patch with an exe?


emcodem commented Jul 21, 2023

Nah, sorry, I just realized this code was total garbage; it needs to remove much more from the history. I am currently testing a more sophisticated method and will post here when it works.

From my current understanding, the whole problem comes from the Whisper model, for some reason, returning the same text twice. This is not a problem as such, but our program collects all the output and feeds it as the prompt for the next text generation.
From the model's perspective, the prompt can be read as: "these words and this lowercase/punctuation style are important to you".

Basically, it is no problem if the last text is used as the "prompt" for the model, but when the same words/sentences occur repeatedly in the prompt, we kind of force the model to output only that.

So as far as I understand, it is very similar to this:
https://www.youtube.com/watch?v=wpSr-ZOLYGA

Anyway, I'll update here as soon as I have a working update.


albino1 commented Jul 21, 2023

I gave up on Const-me and base whisper.cpp when I found this Whisper fork, which seems to work with minimal repetition issues:

https://github.com/Purfview/whisper-standalone-win/

Maybe it's not as accurate, I don't know, but it's CUDA-accelerated, so it's still very fast, and it works without my having to spend ages cleaning up hallucinations afterward.


zlbbme commented Jul 28, 2023

> As most readers here might not be able to apply the code changes I mentioned above for testing [...] Anyone who feels adventurous, please try this main.exe [...]

Can we get the effect of this parameter change just by running the exe file directly?


emcodem commented Jul 28, 2023

@zlbbme sorry, I am not sure I understand the question. The .zip download I provided contains my modification; there is no parameter to turn it on or off...


emcodem commented Aug 3, 2023

OK, so here is my latest attempt.

Download Whisper Desktop and main.exe for users:
WhisperCM_prevent-and-remove-duplicated-tokens-from-prompt.zip

Instead of just detecting repeated text, I now always prevent repeated content from being put into the prompt, and IF something repeats AND it is already contained in the prompt, I remove that sentence from the prompt; see the sketch below. From my current experiments, it should be working. The benefit over the strategy above is that we delete from the prompt history much less frequently, and only when really needed.
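
For documentation purposes, the core of the idea looks roughly like this (a simplified sketch; newTokens is a name of my own for the freshly decoded segment's tokens, and the actual patch is more involved):

    // Don't append a segment's tokens to prompt_past when its text repeats the
    // previous segment; if it repeats and the tokens are already in the
    // history, remove that one occurrence instead of clearing everything.
    // Needs <algorithm> for std::search.
    if( text == *lasttext )
    {
        auto it = std::search( prompt_past.begin(), prompt_past.end(),
            newTokens.begin(), newTokens.end() );
        if( it != prompt_past.end() )
            prompt_past.erase( it, it + newTokens.size() );
    }
    else
        prompt_past.insert( prompt_past.end(), newTokens.begin(), newTokens.end() );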

Developer note: I also prevent pushing any non-text tokens into the prompt; I am not sure whether that leads to the model no longer outputting any timecode tokens at all.
Also, to logDebug (STDERR), I print the prompt for the next iteration in the form token_text(token_id)token_text(token_id). If you see a space anywhere, it is actually part of the token. This has proven very helpful for debugging and for finding out whether an issue is prompt-related or not; see the sketch below.
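
Producing such a dump takes only a few lines (a sketch; tokenToText is a hypothetical helper standing in for whatever token-to-string lookup the build uses):

    // Print the next iteration's prompt as token_text(token_id) pairs;
    // spaces inside tokens stay visible in the output.
    std::string dump;
    for( const int tok : prompt )
        dump += tokenToText( tok ) + "(" + std::to_string( tok ) + ")";
    logDebug( u8"prompt for next iteration: %s", dump.c_str() );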

The reason I post this kind of unfinished work is that it looks like I will not work much with the Const-me version for a longer time and will instead experiment with faster-whisper (which is more for developers; if you want a GUI for all Whisper versions, you can e.g. use Subtitle Edit).

However, it currently looks like I will need to apply the exact same mitigation concepts to every other version of Whisper, because it still looks like all versions except mine above suffer from the repeat-forever issue. The reason might be that all projects mostly wait for the MAIN whisper project to come up with solutions.

I am sorry for not taking care of the Desktop version users so far. I do not use desktop programs for transcribing, but the changes I made are valid for both the CLI and the Desktop version, so it is no additional effort for me to provide the desktop version as well. Actually, they share the same Whisper.dll anyway...

The patch for developers:
0001-prevent-and-remove-duplicated-tokens-from-prompt.patch
Unfortunately, the patch was made starting from another custom version, so there is no way to apply it to the existing source code. But it shows the ideas and how I did it, mainly for documentation purposes.

Last but not least, I am sure this kind of trouble will settle over time. More and more puzzle pieces are coming in from month to month, in the form of code commits in the original whisper project, but lots of research work remains to be done. I am sure that, over time, they will find adequate solutions that prevent forever-repeated output in a much better way than I currently do.

If there is no more activity in this thread, I will close it in a few weeks, because it is no longer relevant for me. There are enough other issues open on the same topic, and I don't think it is really helpful to keep this issue with my beginner's experiences open forever.


ghost commented Sep 8, 2023

#26 (comment)

@Const-me Is emcodem's duplicated-tokens-from-prompt.zip safe to use? Just being cautious because I'm a noob.


emcodem commented Sep 8, 2023

@softlypink If you are asking for security/virus-related reasons: there is no way for anyone to tell whether any compiled exe file is safe to use. Antivirus software attempts to do this, but from my perspective with only very limited success. What A/V software can do for you is detect "known" viruses/behaviour, but in this case you are facing an exe that maybe 10-20 people are using, so even your antivirus might have a hard time.

If you want to be sure about security, there is no way around reading (and understanding) the whole source code of the original program, then applying the changes I posted above and compiling your own version.

It is only up to you to decide which sources of compiled binaries are OK to use and which are not (some people use Linux for that exact reason).


emcodem commented Sep 8, 2023

Closed, because this is solved for me in the code I posted above.
Final note: all "forever" related issues (same sentence, lowercase, counting up forever) can only be solved by taking care of what is thrown into the prompt past. This is why -mc 0 always solves the forever-repeat issue, without any doubt.
These issues are not "model-related", because it is a decision of the inference alone whether and what it uses for prompting the next decoding.
There might be some people out there who understand the "importance" of each prompted token the model uses, and the problem could be solved by playing with these importance values. However, I don't dive that deep into the decoding process, as the problem is very easy to solve by just detecting repeated output and deleting the prompt in that case.
The same concept is used e.g. in faster-whisper; look at the "compression threshold". The problem with that, however, is that it only clears the prompt when the currently decoded output is too repetitive (instead of the current plus the last decoded output).
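
The idea behind that threshold can be sketched like this (assuming zlib; faster-whisper itself does this in Python):

    // Repetitive text compresses very well, so a high raw/compressed ratio
    // signals a repetition loop. Needs <string>, <vector> and zlib.h.
    double compressionRatio( const std::string& text )
    {
        uLongf destLen = compressBound( (uLong)text.size() );
        std::vector<Bytef> dest( destLen );
        compress( dest.data(), &destLen, (const Bytef*)text.data(), (uLong)text.size() );
        return (double)text.size() / (double)destLen;
    }
    // faster-whisper treats a ratio above roughly 2.4 as "too repetitive".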

Usually when we face the forever-repeat issue, the output is e.g.:

This
This
This

In this case, we also have a prompt past like "This This This", so the "model" thinks that "This" is very important to us, because we have it a hundred times in the prompt past. Obviously, the confidence of the prompted tokens is much higher than the confidence of any "really detected" tokens, so we get into a loop. But again: the inference decided to throw "This" into the prompt past 100 times; the model is not to blame for that.
