Timestamp of first word after long silence is sometimes not accurate #125
Comments
Hi. Yes, word timestamps after VAD are not very accurate at the moment; see also #120. Because the timestamps predicted by Whisper are not perfectly accurate to start with, some words get assigned to the incorrect speech chunk in the original audio, which can result in a shift of several seconds. We should implement a different approach.
I think I encountered this issue as well. I split the stereo audio into two mono channels; there are expected long silences in each mono audio. Using vad_filter=True fixed a few cases, but quite a few are still not fixed.
I and some users of Freesubtitles.ai have noticed this as well.
The timestamps from Silero look OK, and so do the word timestamps from Whisper, but some logic seems a bit off when they are being rebuilt. For example, this word:
is turned into this:
Actually, the issue seems to be caused by the use of
With word timestamps:
'Ne imalo problemi?' is an example where it shows up as a subtitle long before it's said. Update: in my case it seems to be largely an issue with the
I am implementing a hacky but effective solution for freesubtitles.ai: basically, I will scan the first word of each segment and, if it's obscenely long (say, over 5 seconds), trim it to one second. This will fix the very broken case where words can sometimes have durations of 30 seconds or more. I tried
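The workaround described above can be sketched roughly as follows. This is an illustrative approximation, not the actual freesubtitles.ai code; the `Word` dataclass and function names are hypothetical stand-ins for whatever word structure the caller uses.

```python
# Hypothetical sketch of the workaround: if the first word of a segment
# has an absurdly long duration (over 5 seconds), clamp it to 1 second.
from dataclasses import dataclass

@dataclass
class Word:  # illustrative stand-in, not the library's type
    start: float
    end: float
    word: str

def clamp_first_word(words, max_duration=5.0, clamped_duration=1.0):
    """Shorten the first word of a segment if its duration is broken."""
    if words and (words[0].end - words[0].start) > max_duration:
        # Keep the end timestamp (usually the more reliable one) and
        # move the start up so the word spans only `clamped_duration`.
        words[0].start = words[0].end - clamped_duration
    return words

words = clamp_first_word([Word(65.57, 75.37, "Oh"), Word(75.37, 76.05, "my")])
print(round(words[0].start, 2))  # prints 74.37
```

Anchoring on the end timestamp matches the observation later in this thread that end times tend to be more accurate than start times.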
@mayeaux I have implemented pretty much the same workaround for now in my app |
It seems to me that the issue is that Whisper has a tendency to mark the beginning of a segment a bit (400-500 ms) before the first word is actually uttered. This is good for user experience, since the subtitles show up slightly early and feel easier to follow as a viewer, but I believe it is also what causes the subtitle to get marked as belonging to the previous (n-1) VAD chunk.

One possible solution could be to transcribe each chunk separately, but especially for implementations that use a small value for

Another possible solution could be adding time padding between each chunk to accommodate Whisper's tendency to mark the start time 400-500 ms earlier than the speech is actually uttered. Silence padding exists in the current API, but to my understanding it's lost after the chunks are reassembled. I think that with silence padding of, say, 500 ms between the different speech chunks, and a larger minimum silence (1000 ms+), you could get fully accurate timings after applying VAD. I tried to implement it but ran into some issues; I may take another stab at it. This is all theoretical spitballing though, so I'm open to anyone else's input on these ideas.
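The padding idea above might be sketched as follows. This is not faster-whisper's actual code; the chunk format (dicts with sample indices) and the 500 ms constant are assumptions for illustration.

```python
# Illustrative sketch: insert 500 ms of silence between speech chunks
# before concatenation, so Whisper's tendency to start segments
# 400-500 ms early cannot bleed a word into the previous chunk.
import numpy as np

SAMPLE_RATE = 16000
PAD = int(0.5 * SAMPLE_RATE)  # 500 ms of silence between chunks

def concatenate_with_padding(audio, chunks):
    """chunks: list of dicts with 'start'/'end' sample indices (assumed)."""
    pieces, offsets, cursor = [], [], 0
    silence = np.zeros(PAD, dtype=audio.dtype)
    for chunk in chunks:
        piece = audio[chunk["start"]:chunk["end"]]
        # Record where each chunk begins in the concatenated signal, so
        # timestamps can later be mapped back to the original audio.
        offsets.append(cursor)
        pieces.append(piece)
        cursor += len(piece)
        pieces.append(silence)
        cursor += PAD
    return np.concatenate(pieces), offsets
```

The recorded offsets are what would let the realignment step subtract the accumulated padding when restoring timestamps to the original timeline.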
Today I've got one more interesting case from a user:
[0.19s -> 2.95s] Can
All the segments in the video start at 0.19s. To repro:
A short video is here: https://drive.google.com/file/d/17U-vGCkwnYpjHzWxGf2hZiXd71x0TYef/view?usp=sharing
@marlon-br I can't reproduce this output with version 0.4.1. Here's what I get:
@mayeaux Thank you for exploring this. I've also been trying a few things, but there are always other corner cases that are not well handled.

Note that the current

Still, the default value for

Also, we currently assign the word to a speech chunk based on its start timestamp: but it's probably better to consider the timestamp at the middle (
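The idea of assigning a word by its middle timestamp rather than its start could be sketched like this. The chunk representation and function names are hypothetical, not faster-whisper's internals.

```python
# Hedged sketch: pick the speech chunk for a word using the midpoint
# (start + end) / 2 instead of the start timestamp, since Whisper often
# predicts the start a few hundred ms too early.
def find_chunk_index(chunks, timestamp):
    """Return the index of the first chunk whose end is past `timestamp`."""
    for i, chunk in enumerate(chunks):
        if timestamp < chunk["end"]:
            return i
    return len(chunks) - 1

def assign_word_to_chunk(chunks, word_start, word_end):
    middle = (word_start + word_end) / 2
    return find_chunk_index(chunks, middle)

# Chunk boundaries in the concatenated (VAD-filtered) timeline.
chunks = [{"start": 0.0, "end": 9.0}, {"start": 9.0, "end": 15.0}]
# A word predicted to start 0.4 s before the second chunk but ending
# inside it: the start-based rule picks chunk 0, the midpoint picks 1.
print(assign_word_to_chunk(chunks, 8.6, 10.0))  # prints 1
```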
@guillaumekln Oh, you are right, I was using the fit-words-timestamps branch to repro the issue with this video. I confirm that with 0.4.1 everything is OK.
I was testing different solutions and found that re-using the original 'hack' solution of comparing the first word of the segment against the median word duration of the segment, and then shortening the duration if it is 2x as long (I used 1.8x), was actually a solid approach to solving this issue. The problem is that it's done before the realigning of timestamps based on the VAD chunks; you can re-run the same algorithm afterwards, though, and get a lot out of it. https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py#L743
Update: this might be an even better approach:
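The median-based check described above might look roughly like this. It is a rough sketch under assumptions (words as `(start, end)` tuples, the 1.8x ratio from the comment), not the actual patch.

```python
# Rough sketch of the median-duration check: after VAD realignment,
# compare the first word of each segment against the median word
# duration and shorten it when it is more than 1.8x as long.
import statistics

def trim_outlier_first_word(words, ratio=1.8):
    """words: list of (start, end) tuples for one segment (assumed)."""
    if len(words) < 2:
        return words
    durations = [end - start for start, end in words]
    median = statistics.median(durations)
    start, end = words[0]
    if (end - start) > ratio * median:
        # Anchor on the end timestamp, which tends to be more reliable,
        # and give the word a median-length duration.
        words[0] = (end - median, end)
    return words
```

Running this after the VAD realignment step is the point of the comment above: the existing check happens before timestamps are shifted back, so outliers introduced by the shift are never caught.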
Actually, I'm not sure I understand the issue related to the word duration. In the current implementation, the word timestamps are only shifted to a VAD chunk, but the duration is unchanged, so I don't know why we need to compare against the median duration again. Is it possible for you to share an audio file with this type of issue?
@guillaumekln I confirm that the original issue is not reproduced anymore on thanks a lot for the discussion and for the fix! |
Thanks @guillaumekln! Yeah, in my experience end is more accurate than start, so accounting for end will be more accurate.
Yes.
Did you mean |
Sorry, yes, I mean
Actually, I have another question: why do we need to do padding?
I think it is mostly to have a safety margin, as the VAD may not be perfectly accurate. |
Hello,
I am testing faster-whisper on different audio files and have noticed one situation that happens rather frequently: after a long silence, the first word in a segment has a very inaccurate timestamp; it can be 10 seconds earlier than it is actually pronounced.
The example audio has a duration of 90 seconds and the first word is pronounced at the 75-second mark. Link to the file: https://drive.google.com/file/d/1malodLzzI7WJNKv_NNyqIhHBBPjbccYO/view?usp=share_link
Code is:
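(The original code block did not survive the page extraction. A minimal repro using faster-whisper's documented API might look like the following; the model size and file name are guesses, not the reporter's actual values.)

```python
# Guessed minimal repro; "medium" and "audio.mp3" are placeholders.
from faster_whisper import WhisperModel

model = WhisperModel("medium")
segments, info = model.transcribe("audio.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
    print("segment: [%.2fs -> %.2fs]%s" % (segment.start, segment.end, segment.text))
```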
And the result is:
[65.57s -> 66.53s] Oh
[75.09s -> 75.37s] my
[75.37s -> 76.05s] god,
[76.23s -> 76.71s] Danny.
segment: [65.57s -> 76.71s] Oh my god, Danny.
[77.05s -> 77.79s] Hey
[77.79s -> 78.07s] mom.
segment: [77.05s -> 78.07s] Hey mom.
[79.45s -> 79.79s] Hi.
[79.75s -> 80.13s] It's
[80.13s -> 80.35s] been
[80.35s -> 80.87s] years.
[81.05s -> 81.17s] What
[81.17s -> 81.37s] are
[81.37s -> 81.61s] you
[81.61s -> 81.91s] doing
[81.91s -> 82.25s] here?
segment: [79.45s -> 82.25s] Hi. It's been years. What are you doing here?
etc
"Oh" is expected 9 seconds later than is being computed. And I see such results rather frequently.
Any idea how to address this issue?