
Timestamp of first word after long silence is not accurate sometimes #125

Closed

marlon-br opened this issue Apr 7, 2023 · 18 comments

@marlon-br

Hello,

I am testing faster-whisper on different audio files and have noticed one situation that happens rather frequently. After a long silence, the first word in a segment has a very inaccurate timestamp - it can be 10 seconds earlier than when it is actually pronounced.

The example audio is 90 seconds long and the first word is pronounced at around the 75-second mark. Link to the file: https://drive.google.com/file/d/1malodLzzI7WJNKv_NNyqIhHBBPjbccYO/view?usp=share_link

Code is:

from faster_whisper import WhisperModel
model = WhisperModel("/large-v2-gpu", device='cuda', compute_type="float16")

segments, info = model.transcribe("short.wav", language="en", word_timestamps=True, vad_filter=True)
words = []
for segment in segments:
    for word in segment.words:
        words.append([word.word.strip(), word.start, word.end])
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word.strip()))

    print("segment: [%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

And the result is:

[65.57s -> 66.53s] Oh
[75.09s -> 75.37s] my
[75.37s -> 76.05s] god,
[76.23s -> 76.71s] Danny.
segment: [65.57s -> 76.71s] Oh my god, Danny.
[77.05s -> 77.79s] Hey
[77.79s -> 78.07s] mom.
segment: [77.05s -> 78.07s] Hey mom.
[79.45s -> 79.79s] Hi.
[79.75s -> 80.13s] It's
[80.13s -> 80.35s] been
[80.35s -> 80.87s] years.
[81.05s -> 81.17s] What
[81.17s -> 81.37s] are
[81.37s -> 81.61s] you
[81.61s -> 81.91s] doing
[81.91s -> 82.25s] here?
segment: [79.45s -> 82.25s] Hi. It's been years. What are you doing here?
etc

"Oh" is expected 9 seconds later than is being computed. And I see such results rather frequently.

Any idea how to address this issue?

@guillaumekln
Contributor

Hi,

Yes, word timestamps after VAD are not very accurate at the moment. See also #120.

Because the timestamps predicted by Whisper are not perfectly accurate to start with, some words get assigned to the incorrect speech chunk in the original audio, which can result in a shift of several seconds.
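
Here is a toy illustration of what happens (the chunk times and the mapping helper below are made up for the example, not the actual implementation):

# Two speech chunks kept by the VAD, in original-audio seconds (hypothetical):
chunks = [
    {"start": 10.0, "end": 20.0},  # 10 s of speech
    {"start": 65.0, "end": 80.0},  # next speech region after a long silence
]

def to_original_time(t_filtered, chunks):
    """Map a timestamp in the silence-stripped audio back to the original audio."""
    offset = 0.0
    for chunk in chunks:
        duration = chunk["end"] - chunk["start"]
        if t_filtered <= offset + duration:
            return chunk["start"] + (t_filtered - offset)
        offset += duration
    return chunks[-1]["end"]

# The word really starts 0.1 s into the second chunk (10.1 s in filtered time),
# but Whisper predicts it slightly early, at 9.9 s:
print(to_original_time(10.1, chunks))  # 65.1 -> correct chunk
print(to_original_time(9.9, chunks))   # 19.9 -> previous chunk, off by ~45 seconds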

We should implement a different approach.

@junchen6072

junchen6072 commented Apr 13, 2023

I think I encountered this issue as well. I split a stereo audio file into two mono channels, and there are expected long silences in each mono track. Using vad_filter=True fixed a few cases, but quite a few are still not fixed.

@mayeaux
Contributor

mayeaux commented Apr 14, 2023

I and some users of Freesubtitles.ai have noticed this as well

@mayeaux
Contributor

mayeaux commented Apr 16, 2023

The timestamps from Silero look OK, and so do the word timestamps from Whisper, but some logic seems a bit off when the timeline is being rebuilt. For example, this word:

{
    "start": 111.7,
    "end": 112.06,
    "word": " Nema",
    "probability": 0.437408447265625
},

Is turned into this:

{
    "start": 127.93,
    "end": 136.11,
    "word": " Nema",
    "probability": 0.277252197265625
},

Actually, the issue seems to be caused by the use of word_timestamps. Without word timestamps, when I transcribe the output from the VAD implementation:

[01:11.900 --> 01:14.400]  Mi trobiš sa slažim dušata
[01:22.400 --> 01:23.900]  Ne imalo problemi?
[01:23.900 --> 01:25.400]  Her še it šovka je gečti

With word timestamps:

[01:11.820 --> 01:13.520]  Mi trobiš sa slažim dušata
[01:13.520 --> 01:23.760]  Ne imalo problemi?
[01:24.880 --> 01:25.920]  Svičko na red?

'Ne imalo problemi?' is an example where it shows up as a subtitle long before it's said.

Update: In my case it seems to be largely an issue with the align function of the Whisper model. For example, if there is noise or music that makes its way through the VAD, Whisper will start the timestamp at the beginning of that sound rather than when the word actually begins. This leads to words starting earlier or dragging on later than they should.

@mayeaux
Contributor

mayeaux commented Apr 17, 2023

I am implementing a hacky but effective solution for freesubtitles.ai: I scan the first word of each segment and, if it's obscenely long (say over 5 seconds), I trim it to one second. This fixes the very broken cases where words can sometimes have durations of 30 seconds or more. I tried whisperx, stable-ts, and whisper-timestamps, but none of them really fixed the problem properly; the timestamps from Whisper are just too inaccurate sometimes to really be dealt with, but this hack will fix the worst cases.
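
Roughly something like this (just a sketch; segments and words are shown here as plain dicts with start/end keys, which is not the exact structure the library returns):

MAX_FIRST_WORD_DURATION = 5.0  # seconds; anything longer is treated as broken
CLAMPED_DURATION = 1.0         # seconds; what the first word gets trimmed to

def clamp_first_word(segment):
    """Shrink an implausibly long first word: keep its end, pull its start close to it."""
    words = segment.get("words") or []
    if not words:
        return segment
    first = words[0]
    if first["end"] - first["start"] > MAX_FIRST_WORD_DURATION:
        first["start"] = max(0.0, first["end"] - CLAMPED_DURATION)
        segment["start"] = first["start"]
    return segment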

@marlon-br
Author

@mayeaux I have implemented pretty much the same workaround for now in my app

@mayeaux
Contributor

mayeaux commented Apr 20, 2023

It seems to me that the issue is that Whisper tends to place the beginning of a segment a bit (400-500 ms) before the first word is actually uttered. That is good for the user experience, since the subtitles show up slightly early and feel easier to follow as a viewer, but I believe it is also the cause of the subtitle getting assigned to the previous (n-1) VAD chunk.

One possible solution would be to transcribe each chunk separately, but for implementations that use a small value for min_silence_duration this could end up not being feasible or advisable, since the chunks would be too small to be coherent (WhisperX seems to transcribe the VAD chunks as separate 30 s chunks, but I can't remember how it's done offhand).

Another possible solution could be to add time padding between each chunk to account for Whisper's tendency to mark the start time 400-500 ms earlier than the speech is actually uttered. Silence padding exists in the current API, but to my understanding it is lost after the chunks are reassembled. I think if there were silence padding of, say, 500 ms between the different speech chunks, and a larger minimum silence were used (1000 ms+), you could get fully accurate timings after applying VAD. I tried to implement it but ran into some issues; I may take another stab at it. This is all theoretical spitballing though, so I'm open to anyone else's input on these ideas.
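
Something like this is what I had in mind (a rough sketch only, assuming 16 kHz mono audio as a NumPy array and chunk boundaries given in samples; this is not how the library currently builds the filtered audio):

import numpy as np

def concatenate_with_silence_padding(audio, chunks, sampling_rate=16000, pad_ms=500):
    """Join the VAD speech chunks, inserting artificial silence between them."""
    pad = np.zeros(int(sampling_rate * pad_ms / 1000), dtype=audio.dtype)
    pieces = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            pieces.append(pad)  # e.g. 500 ms of silence between chunks
        pieces.append(audio[chunk["start"]:chunk["end"]])
    return np.concatenate(pieces)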

@marlon-br
Author

marlon-br commented Apr 21, 2023

Today I've got one more interesting case from a user:

[0.19s -> 2.95s] Can
[2.95s -> 3.44s] you
[3.44s -> 4.10s] see
[4.10s -> 4.94s] what
[4.94s -> 5.72s] you've
[5.72s -> 6.68s] done
[6.68s -> 7.34s] to
[7.34s -> 8.48s] my
[8.48s -> 11.90s] heart?
segment: [0.19s -> 11.90s] Can you see what you've done to my heart?
[0.19s -> 1.11s] And
[1.11s -> 2.19s] so,
[4.46s -> 6.55s] this
[6.55s -> 6.94s] is
[6.94s -> 7.38s] your
[7.38s -> 9.38s] wasteland
[9.38s -> 11.90s] now
segment: [0.19s -> 11.90s] And so, this is your wasteland now
[0.19s -> 11.90s] And
[14.90s -> 15.50s] you
[15.50s -> 15.73s] put
[15.73s -> 16.04s] the
[16.04s -> 16.63s] wings
[16.63s -> 17.35s] around
[17.35s -> 18.69s] yourself
segment: [0.19s -> 18.69s] And you put the wings around yourself

All the segments in the video start at 0.19 s.

To repro:

from faster_whisper import WhisperModel
model = WhisperModel("/large-v2-gpu", device='cuda', compute_type="float16")

segments, info = model.transcribe("lv_0_20230421072029.mp4", language="en", word_timestamps=True, vad_filter=True)
words = []

for segment in segments:
    for word in segment.words:
        words.append([word.word.strip(), word.start, word.end])
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word.strip()))

    print("segment: [%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

short video is here: https://drive.google.com/file/d/17U-vGCkwnYpjHzWxGf2hZiXd71x0TYef/view?usp=sharing

@guillaumekln
Contributor

@marlon-br I can't reproduce this output with version 0.4.1. Here's what I get:

[0.34s -> 1.22s] Can
[1.22s -> 1.38s] you
[1.38s -> 1.60s] see
[1.60s -> 1.86s] what
[1.86s -> 2.14s] you've
[2.14s -> 2.46s] done
[2.46s -> 2.68s] to
[2.68s -> 3.04s] my
[3.04s -> 4.20s] heart?
segment: [0.34s -> 4.20s]  Can you see what you've done to my heart?
[5.02s -> 5.44s] And
[5.44s -> 5.94s] so,
[6.98s -> 7.94s] this
[7.94s -> 8.12s] is
[8.12s -> 8.32s] a
[8.32s -> 9.24s] wasteland
[9.24s -> 10.24s] now
segment: [5.02s -> 10.24s]  And so, this is a wasteland now
[10.24s -> 12.42s] And
[15.22s -> 15.66s] you
[15.66s -> 15.84s] put
[15.84s -> 16.08s] the
[16.08s -> 16.54s] wings
[16.54s -> 17.10s] around
[17.10s -> 18.42s] yourself
segment: [10.24s -> 18.42s]  And you put the wings around yourself

@mayeaux Thank you for exploring this. I've also been trying a few things, but there are always other corner cases that are not handled well.

Note that the current speech_pad_ms option does not pad with silence. It just expands the speech boundaries in the original audio. Inserting silence does not work well in practice because the model was not trained on audio with artificially inserted silence.

Still, the default value for speech_pad_ms should probably be increased.
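
In the meantime you can already experiment with a larger pad yourself by passing VAD options to transcribe (the values below are just an example, not a recommended setting):

segments, info = model.transcribe(
    "short.wav",
    language="en",
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(speech_pad_ms=400, min_silence_duration_ms=1000),
)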

Also, we currently assign each word to a speech chunk based on its start timestamp:

https://github.com/guillaumekln/faster-whisper/blob/358d373691c95205021bd4bbf28cde7ce4d10030/faster_whisper/transcribe.py#L781

But it's probably better to consider the timestamp at the middle ((word.start + word.end) / 2).
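
Something along these lines (a sketch only, with the chunk bookkeeping simplified; chunk_end_filtered would hold each chunk's end time in the silence-stripped audio):

import bisect

def chunk_index_for_word(word_start, word_end, chunk_end_filtered):
    # Assign the word to the chunk that contains the middle of the word,
    # instead of the chunk that contains its (possibly too early) start.
    middle = (word_start + word_end) / 2
    return min(bisect.bisect(chunk_end_filtered, middle), len(chunk_end_filtered) - 1)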

@marlon-br
Author

marlon-br commented Apr 22, 2023

@guillaumekln Oh, you are right, I was using the fit-words-timestamps branch to reproduce the issue with this video. I confirm that with 0.4.1 everything is OK.

@mayeaux
Contributor

mayeaux commented Apr 22, 2023

I was testing different solutions and found that re-using the original 'hack' of comparing the first word of the segment against the median word duration of the segment, and then shortening it if it is 2x as long (I used 1.8x), is actually a solid approach for solving this issue. The problem is that it's done before the timestamps are realigned based on the VAD chunks; you can re-run the same algorithm afterwards though and get a lot out of it.

https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py#L743

        # hack: ensure the first and second word is not longer than twice the median word duration.
        # a better segmentation algorithm based on VAD should be able to replace this.
        word_durations = end_times - start_times
        word_durations = word_durations[word_durations.nonzero()]
        if len(word_durations) > 0:
            median_duration = np.median(word_durations)
            max_duration = median_duration * 2
            if len(word_durations) >= 2 and word_durations[1] > max_duration:
                boundary = max(end_times[2] / 2, end_times[2] - max_duration)
                end_times[0] = start_times[1] = boundary
            if (
                len(word_durations) >= 1
                and end_times[0] - start_times[0] > max_duration
            ):
                start_times[0] = max(0, end_times[0] - max_duration)

Update:

This might be an even better approach:

https://github.com/openai/whisper/pull/1114/files

@guillaumekln
Contributor

Actually I'm not sure I understand the issue related to the word duration. In the current implementation, the word timestamps are only shifted to a VAD chunk, but the duration is unchanged. So I don't know why we would need to compare against the median duration again.

Is it possible for you to share an audio file with this type of issue?

@guillaumekln
Contributor

#180 #179

I merged 2 small changes that prevent some words from being assigned to the N-1 speech chunk. At least this fixes the incorrect timestamps for the audio shared in the first post.

@marlon-br
Author

@guillaumekln I confirm that the original issue is not reproduced anymore on master

thanks a lot for the discussion and for the fix!

@junchen6072

Thanks @guillaumekln! Yeah, in my experience the end timestamp is more accurate than the start, so taking the end into account should give better results.
I have one question about speech_pad_ms: does it just expand the speech boundaries between chunks? So as long as it's smaller than min_speech_duration_ms, it should be fine?

@guillaumekln
Contributor

Does it just expand the speech boundaries between chunks?

Yes.

So as long as it's smaller than min_speech_duration_ms, it should be fine?

Did you mean min_silence_duration_ms?

@junchen6072

So as long as it's smaller than min_speech_duration_ms, it should be fine?

Did you mean min_silence_duration_ms?

Sorry, yes, I meant min_silence_duration_ms =.=

Actually I have another question: why do we need to do padding?

@guillaumekln
Contributor

I think it is mostly to have a safety margin, as the VAD may not be perfectly accurate.
