whisper-cli : align token timestamps with VAD ts #3218
base: master
b23c671 to 75db936
Has this issue been resolved? It doesn't seem to have been merged into the main branch. Or has it already been fixed in the branch (vad-token-timestamp-alignment), so that I can use that instead?
No, it has not been resolved yet. I changed it to a draft (which might have sent a notification) as I noticed the token-level timestamps are still not correct and I need to revisit this.
75db936 to 12e44a1
@chriswang- It would be great if you could try this out with the audio sample in your original issue report.
@danbev Sorry, the issue was not filed by me, but I can try to verify it.
@chriswang- Ah my bad, I should have checked to be sure and not just assumed.
subtitle-master-with-vad.json @danbev
This commit aligns the token timestamps with the VAD timestamps when VAD is enabled. The motivation is that the token timestamps currently reported in the full JSON output are the timestamps that whisper sees after VAD has processed the audio. Whisper only sees the possibly filtered audio, so the token timestamps are relative to the filtered audio, not the original audio. The segment timestamps are already mapped/aligned to the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address. Resolves: ggml-org#3174
12e44a1 to c5e33f4
This commit aligns the token timestamps with the VAD timestamps when VAD is enabled.
The motivation is that the token timestamps currently reported in the full JSON output are the timestamps that whisper sees after VAD has processed the audio. Whisper only sees the possibly filtered audio, so the token timestamps are relative to the filtered audio, not the original audio. The segment timestamps are already mapped/aligned to the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address.
Resolves: #3174
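To make the alignment idea concrete, here is a minimal Python sketch of mapping a timestamp in the VAD-filtered audio back to the original audio. It is not the code from this PR: the names `SpeechSegment`, `build_filtered_starts`, and `to_original_time` are hypothetical, and it assumes the filtered audio is simply the detected speech segments concatenated back to back, with no padding or overlap between them.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class SpeechSegment:
    """A speech region detected by VAD, in seconds of the ORIGINAL audio (hypothetical type)."""
    orig_start: float
    orig_end: float


def build_filtered_starts(segments):
    """Return where each segment begins in the filtered audio.

    Assumption: the filtered audio is the speech segments concatenated
    with nothing inserted between them.
    """
    starts = []
    offset = 0.0
    for seg in segments:
        starts.append(offset)
        offset += seg.orig_end - seg.orig_start
    return starts


def to_original_time(t_filtered, segments, filtered_starts):
    """Map a timestamp in the filtered audio to the original audio."""
    # Find the last segment whose filtered start is <= t_filtered.
    i = max(bisect_right(filtered_starts, t_filtered) - 1, 0)
    seg = segments[i]
    t = seg.orig_start + (t_filtered - filtered_starts[i])
    # Clamp so a token never lands outside the speech segment it belongs to.
    return min(max(t, seg.orig_start), seg.orig_end)


# Two speech regions separated by 7 seconds of removed silence.
segments = [SpeechSegment(1.0, 3.0), SpeechSegment(10.0, 12.5)]
starts = build_filtered_starts(segments)

# A token at 2.5 s in the filtered audio falls 0.5 s into the second region.
print(to_original_time(2.5, segments, starts))
```

Applying the same mapping to both the start and end time of every token would keep token timestamps consistent with the segment timestamps, which are already reported relative to the original audio.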
Example of token level timestamps prior to this PR:
And with this PR: