
Timestamp of first word after long silence is not accurate sometimes #125

Closed

marlon-br opened this issue Apr 7, 2023 · 18 comments

@marlon-br

Hello,

I am testing faster-whisper on different audio files and have noticed one situation that happens rather frequently. After a long silence, the first word in a segment has a very inaccurate timestamp - it can be 10 seconds earlier than when it is actually pronounced.

The example audio is 90 seconds long and the first word is pronounced at around the 75-second mark. Link to the file: https://drive.google.com/file/d/1malodLzzI7WJNKv_NNyqIhHBBPjbccYO/view?usp=share_link

Code is:

from faster_whisper import WhisperModel
model = WhisperModel("/large-v2-gpu", device='cuda', compute_type="float16")

segments, info = model.transcribe("short.wav", language="en", word_timestamps=True, vad_filter=True)
words = []
for segment in segments:
    for word in segment.words:
        words.append([word.word.strip(), word.start, word.end])
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word.strip()))

    print("segment: [%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

And the result is:

[65.57s -> 66.53s] Oh
[75.09s -> 75.37s] my
[75.37s -> 76.05s] god,
[76.23s -> 76.71s] Danny.
segment: [65.57s -> 76.71s] Oh my god, Danny.
[77.05s -> 77.79s] Hey
[77.79s -> 78.07s] mom.
segment: [77.05s -> 78.07s] Hey mom.
[79.45s -> 79.79s] Hi.
[79.75s -> 80.13s] It's
[80.13s -> 80.35s] been
[80.35s -> 80.87s] years.
[81.05s -> 81.17s] What
[81.17s -> 81.37s] are
[81.37s -> 81.61s] you
[81.61s -> 81.91s] doing
[81.91s -> 82.25s] here?
segment: [79.45s -> 82.25s] Hi. It's been years. What are you doing here?
etc

"Oh" is expected 9 seconds later than is being computed. And I see such results rather frequently.

Any idea how to address this issue?

@guillaumekln
Contributor

Hi,

Yes, word timestamps after VAD are not very accurate at the moment. See also #120.

Because the timestamps predicted by Whisper are not perfectly accurate to start with, some words get assigned to the incorrect speech chunk in the original audio, which can result in a shift of several seconds.
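
Here is a toy illustration of what happens (the chunk times and the mapping helper below are made up for the example, not the actual implementation):

# Two speech chunks kept by the VAD, in original-audio seconds (hypothetical):
chunks = [
    {"start": 10.0, "end": 20.0},  # 10 s of speech
    {"start": 65.0, "end": 80.0},  # next speech region after a long silence
]

def to_original_time(t_filtered, chunks):
    """Map a timestamp in the silence-stripped audio back to the original audio."""
    offset = 0.0
    for chunk in chunks:
        duration = chunk["end"] - chunk["start"]
        if t_filtered <= offset + duration:
            return chunk["start"] + (t_filtered - offset)
        offset += duration
    return chunks[-1]["end"]

# The word really starts 0.1 s into the second chunk (10.1 s in filtered time),
# but Whisper predicts it slightly early, at 9.9 s:
print(to_original_time(10.1, chunks))  # 65.1 -> correct chunk
print(to_original_time(9.9, chunks))   # 19.9 -> previous chunk, off by ~45 seconds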

We should implement a different approach.

@junchen6072

junchen6072 commented Apr 13, 2023

I think I encountered this issue as well. I split a stereo audio file into two mono channels, and there are expected long silences in each mono track. Using vad_filter=True fixed a few cases, but quite a few are still not fixed.

@mayeaux
Contributor

mayeaux commented Apr 14, 2023

I and some users of Freesubtitles.ai have noticed this as well

@mayeaux
Contributor

mayeaux commented Apr 16, 2023

The timestamps from Silero look OK, and so do the word timestamps from Whisper, but some logic seems a bit off when the timeline is being rebuilt. For example, this word:

{
    "start": 111.7,
    "end": 112.06,
    "word": " Nema",
    "probability": 0.437408447265625
},

Is turned into this:

{
    "start": 127.93,
    "end": 136.11,
    "word": " Nema",
    "probability": 0.277252197265625
},

Actually, the issue seems to be caused by the use of word_timestamps. Without word timestamps, when I transcribe the output from the VAD implementation:

[01:11.900 --> 01:14.400]  Mi trobiš sa slažim dušata
[01:22.400 --> 01:23.900]  Ne imalo problemi?
[01:23.900 --> 01:25.400]  Her še it šovka je gečti

With word timestamps:

[01:11.820 --> 01:13.520]  Mi trobiš sa slažim dušata
[01:13.520 --> 01:23.760]  Ne imalo problemi?
[01:24.880 --> 01:25.920]  Svičko na red?

'Ne imalo problemi?' is an example where it shows up as a subtitle long before it's said.

Update: In my case it seems to be largely an issue with the align function of the Whisper model. For example, if there is noise or music that makes its way through the VAD, Whisper will start the timestamp at the beginning of that sound rather than when the word actually begins. This leads to words starting earlier or dragging on later than they should.

@mayeaux
Contributor

mayeaux commented Apr 17, 2023

I am implementing a hacky but effective solution for freesubtitles.ai: I scan the first word of each segment and, if it's obscenely long (say over 5 seconds), I trim it to one second. This fixes the very broken cases where words can sometimes have durations of 30 seconds or more. I tried whisperx, stable-ts, and whisper-timestamps, but none of them really fixed the problem properly; the timestamps from Whisper are just too inaccurate sometimes to really be dealt with, but this hack will fix the worst cases.
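
Roughly something like this (just a sketch; segments and words are shown here as plain dicts with start/end keys, which is not the exact structure the library returns):

MAX_FIRST_WORD_DURATION = 5.0  # seconds; anything longer is treated as broken
CLAMPED_DURATION = 1.0         # seconds; what the first word gets trimmed to

def clamp_first_word(segment):
    """Shrink an implausibly long first word: keep its end, pull its start close to it."""
    words = segment.get("words") or []
    if not words:
        return segment
    first = words[0]
    if first["end"] - first["start"] > MAX_FIRST_WORD_DURATION:
        first["start"] = max(0.0, first["end"] - CLAMPED_DURATION)
        segment["start"] = first["start"]
    return segment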

@marlon-br
Author

@mayeaux I have implemented pretty much the same workaround for now in my app

@mayeaux
Contributor

mayeaux commented Apr 20, 2023

It seems to me that the issue is that Whisper tends to place the beginning of a segment a bit (400-500 ms) before the first word is actually uttered. That is good for the user experience, since the subtitles show up slightly early and feel easier to follow as a viewer, but I believe it is also the cause of the subtitle getting assigned to the previous (n-1) VAD chunk.

One possible solution would be to transcribe each chunk separately, but for implementations that use a small value for min_silence_duration this could end up not being feasible or advisable, since the chunks would be too small to be coherent (WhisperX seems to transcribe the VAD chunks as separate 30 s chunks, but I can't remember how it's done offhand).

Another possible solution could be to add time padding between each chunk to account for Whisper's tendency to mark the start time 400-500 ms earlier than the speech is actually uttered. Silence padding exists in the current API, but to my understanding it is lost after the chunks are reassembled. I think if there were silence padding of, say, 500 ms between the different speech chunks, and a larger minimum silence were used (1000 ms+), you could get fully accurate timings after applying VAD. I tried to implement it but ran into some issues; I may take another stab at it. This is all theoretical spitballing though, so I'm open to anyone else's input on these ideas.
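
Something like this is what I had in mind (a rough sketch only, assuming 16 kHz mono audio as a NumPy array and chunk boundaries given in samples; this is not how the library currently builds the filtered audio):

import numpy as np

def concatenate_with_silence_padding(audio, chunks, sampling_rate=16000, pad_ms=500):
    """Join the VAD speech chunks, inserting artificial silence between them."""
    pad = np.zeros(int(sampling_rate * pad_ms / 1000), dtype=audio.dtype)
    pieces = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            pieces.append(pad)  # e.g. 500 ms of silence between chunks
        pieces.append(audio[chunk["start"]:chunk["end"]])
    return np.concatenate(pieces)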

@marlon-br
Author

marlon-br commented Apr 21, 2023

Today I've got one more interesting case from a user:

[0.19s -> 2.95s] Can
[2.95s -> 3.44s] you
[3.44s -> 4.10s] see
[4.10s -> 4.94s] what
[4.94s -> 5.72s] you've
[5.72s -> 6.68s] done
[6.68s -> 7.34s] to
[7.34s -> 8.48s] my
[8.48s -> 11.90s] heart?
segment: [0.19s -> 11.90s] Can you see what you've done to my heart?
[0.19s -> 1.11s] And
[1.11s -> 2.19s] so,
[4.46s -> 6.55s] this
[6.55s -> 6.94s] is
[6.94s -> 7.38s] your
[7.38s -> 9.38s] wasteland
[9.38s -> 11.90s] now
segment: [0.19s -> 11.90s] And so, this is your wasteland now
[0.19s -> 11.90s] And
[14.90s -> 15.50s] you
[15.50s -> 15.73s] put
[15.73s -> 16.04s] the
[16.04s -> 16.63s] wings
[16.63s -> 17.35s] around
[17.35s -> 18.69s] yourself
segment: [0.19s -> 18.69s] And you put the wings around yourself

All the segments in the video start at 0.19 s.

To repro:

from faster_whisper import WhisperModel
model = WhisperModel("/large-v2-gpu", device='cuda', compute_type="float16")

segments, info = model.transcribe("lv_0_20230421072029.mp4", language="en", word_timestamps=True, vad_filter=True)
words = []

for segment in segments:
    for word in segment.words:
        words.append([word.word.strip(), word.start, word.end])
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word.strip()))

    print("segment: [%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

short video is here: https://drive.google.com/file/d/17U-vGCkwnYpjHzWxGf2hZiXd71x0TYef/view?usp=sharing

@guillaumekln
Contributor

@marlon-br I can't reproduce this output with version 0.4.1. Here's what I get:

[0.34s -> 1.22s] Can
[1.22s -> 1.38s] you
[1.38s -> 1.60s] see
[1.60s -> 1.86s] what
[1.86s -> 2.14s] you've
[2.14s -> 2.46s] done
[2.46s -> 2.68s] to
[2.68s -> 3.04s] my
[3.04s -> 4.20s] heart?
segment: [0.34s -> 4.20s]  Can you see what you've done to my heart?
[5.02s -> 5.44s] And
[5.44s -> 5.94s] so,
[6.98s -> 7.94s] this
[7.94s -> 8.12s] is
[8.12s -> 8.32s] a
[8.32s -> 9.24s] wasteland
[9.24s -> 10.24s] now
segment: [5.02s -> 10.24s]  And so, this is a wasteland now
[10.24s -> 12.42s] And
[15.22s -> 15.66s] you
[15.66s -> 15.84s] put
[15.84s -> 16.08s] the
[16.08s -> 16.54s] wings
[16.54s -> 17.10s] around
[17.10s -> 18.42s] yourself
segment: [10.24s -> 18.42s]  And you put the wings around yourself

@mayeaux Thank you for exploring this. I've also been trying a few things, but there are always other corner cases that are not handled well.

Note that the current speech_pad_ms option does not pad with silence. It just expands the speech boundaries in the original audio. Inserting silence does not work well in practice because the model was not trained on audio with artificially inserted silence.

Still, the default value for speech_pad_ms should probably be increased.
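
In the meantime you can already experiment with a larger pad yourself by passing VAD options to transcribe (the values below are just an example, not a recommended setting):

segments, info = model.transcribe(
    "short.wav",
    language="en",
    word_timestamps=True,
    vad_filter=True,
    vad_parameters=dict(speech_pad_ms=400, min_silence_duration_ms=1000),
)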

Also, we currently assign each word to a speech chunk based on its start timestamp:

https://github.com/guillaumekln/faster-whisper/blob/358d373691c95205021bd4bbf28cde7ce4d10030/faster_whisper/transcribe.py#L781

But it's probably better to consider the timestamp at the middle ((word.start + word.end) / 2).
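
Something along these lines (a sketch only, with the chunk bookkeeping simplified; chunk_end_filtered would hold each chunk's end time in the silence-stripped audio):

import bisect

def chunk_index_for_word(word_start, word_end, chunk_end_filtered):
    # Assign the word to the chunk that contains the middle of the word,
    # instead of the chunk that contains its (possibly too early) start.
    middle = (word_start + word_end) / 2
    return min(bisect.bisect(chunk_end_filtered, middle), len(chunk_end_filtered) - 1)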

@marlon-br
Author

marlon-br commented Apr 22, 2023

@guillaumekln Oh, you are right, I was using the fit-words-timestamps branch to reproduce the issue with this video. I confirm that with 0.4.1 everything is OK.

@mayeaux
Contributor

mayeaux commented Apr 22, 2023

I was testing different solutions and found that re-using the original 'hack' of comparing the first word of the segment against the median word duration of the segment, and then shortening it if it is 2x as long (I used 1.8x), is actually a solid approach for solving this issue. The problem is that it's done before the timestamps are realigned based on the VAD chunks; you can re-run the same algorithm afterwards though and get a lot out of it.

https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py#L743

        # hack: ensure the first and second word is not longer than twice the median word duration.
        # a better segmentation algorithm based on VAD should be able to replace this.
        word_durations = end_times - start_times
        word_durations = word_durations[word_durations.nonzero()]
        if len(word_durations) > 0:
            median_duration = np.median(word_durations)
            max_duration = median_duration * 2
            if len(word_durations) >= 2 and word_durations[1] > max_duration:
                boundary = max(end_times[2] / 2, end_times[2] - max_duration)
                end_times[0] = start_times[1] = boundary
            if (
                len(word_durations) >= 1
                and end_times[0] - start_times[0] > max_duration
            ):
                start_times[0] = max(0, end_times[0] - max_duration)

Update:

This might be an even better approach:

https://github.com/openai/whisper/pull/1114/files

@guillaumekln
Contributor

Actually I'm not sure I understand the issue related to the word duration. In the current implementation, the word timestamps are only shifted to a VAD chunk, but the duration is unchanged. So I don't know why we would need to compare against the median duration again.

Is it possible for you to share an audio file with this type of issue?

@guillaumekln
Contributor

#180 #179

I merged 2 small changes that prevent some words from being assigned to the N-1 speech chunk. At least this fixes the incorrect timestamps for the audio shared in the first post.

@marlon-br
Author

@guillaumekln I confirm that the original issue is not reproduced anymore on master

thanks a lot for the discussion and for the fix!

@junchen6072

Thanks @guillaumekln! Yeah, in my experience the end timestamp is more accurate than the start, so taking the end into account should give better results.
I have one question about speech_pad_ms: does it just expand the speech boundaries between chunks? So as long as it's smaller than min_speech_duration_ms, it should be fine?

@guillaumekln
Contributor

Does it just expand the speech boundaries between chunks?

Yes.

So as long as it's smaller than min_speech_duration_ms, it should be fine?

Did you mean min_silence_duration_ms?

@junchen6072

So as long as it's smaller than min_speech_duration_ms, it should be fine?

Did you mean min_silence_duration_ms?

Sorry, yes, I meant min_silence_duration_ms =.=

Actually I have another question: why do we need to do padding?

@guillaumekln
Contributor

I think it is mostly to have a safety margin, as the VAD may not be perfectly accurate.
