whisper-cli : align token timestamps with VAD ts #3218

Open: danbev wants to merge 1 commit into master from vad-token-timestamp-alignment


@danbev (Collaborator) commented Jun 2, 2025

This commit aligns the token timestamps with the VAD timestamps when VAD is enabled.

The motivation is that the token timestamps currently reported in the full JSON output are the timestamps that whisper sees after the VAD has processed the audio. Whisper only sees the (possibly filtered) audio, so the token timestamps are relative to the filtered audio, not the original audio. The segment timestamps are already mapped back to the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address.
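To make the filtered-versus-original distinction concrete, here is a minimal Python sketch of mapping a timestamp on the filtered timeline back to the original one. It is not the code in this PR: the segment values, `build_mapping`, and `filtered_to_original` are all hypothetical, and it assumes the VAD simply concatenates the kept speech segments with nothing inserted between them.

```python
from bisect import bisect_right

# Hypothetical speech segments kept by a VAD, in milliseconds of the ORIGINAL
# audio. These values are made up for illustration, not taken from gb1.ogg.
SPEECH_SEGMENTS = [(1000, 3000), (5000, 6000)]

def build_mapping(segments):
    """Return (filtered_start, original_start, length) for each kept segment.

    Assumes the filtered audio is the kept segments laid back to back.
    """
    mapping = []
    filtered_pos = 0
    for orig_start, orig_end in segments:
        length = orig_end - orig_start
        mapping.append((filtered_pos, orig_start, length))
        filtered_pos += length
    return mapping

def filtered_to_original(t_ms, mapping):
    """Map a timestamp on the filtered timeline back to the original timeline."""
    starts = [m[0] for m in mapping]
    # Pick the last kept segment whose filtered start is <= t_ms.
    i = max(bisect_right(starts, t_ms) - 1, 0)
    filtered_start, orig_start, length = mapping[i]
    # Clamp so a timestamp past the end of the segment stays inside it.
    return orig_start + min(t_ms - filtered_start, length)

mapping = build_mapping(SPEECH_SEGMENTS)
print(filtered_to_original(20, mapping))    # inside the first kept segment
print(filtered_to_original(2500, mapping))  # inside the second kept segment
```

With these made-up segments, 20 ms of filtered audio falls 20 ms into the first kept segment, while 2500 ms falls 500 ms into the second. Without such a mapping, both would be reported as if the removed silence never existed, which is the symptom described above for token timestamps.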

Resolves: #3174


Example of token level timestamps prior to this PR:

$ ./build/bin/whisper-cli -m models/ggml-medium.en.bin -f samples/gb1.ogg --vad -vm models/for-tests-silero-v5.1.2-ggml.bin -ojf -of gb1
...
[00:00:00.990 --> 00:00:07.800]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:07.800 --> 00:00:15.860]   At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle
...

  "transcription": [                                                            
    {                                                                           
      "timestamps": {                                                           
        "from": "00:00:00,990",                                                 
        "to": "00:00:07,800"                                                    
      },                                                                        
      "offsets": {                                                              
        "from": 990,                                                            
        "to": 7800                                                              
      },                                                                        
      "text": " My fellow Americans, this day has brought terrible news and great sadness to our country.",
      "tokens": [                                                               
        {                                                                       
          "text": "[_BEG_]",                                                    
          "timestamps": {                                                       
            "from": "00:00:00,000",                                             
            "to": "00:00:00,000"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 0,                                                          
            "to": 0                                                             
          },                                                                    
          "id": 50363,                                                          
          "p": 0.994401,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " My",                                                        
          "timestamps": {                                                       
            "from": "00:00:00,020",                                             
            "to": "00:00:00,100"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 20,                                                         
            "to": 100                                                           
          },                                                                    
          "id": 2011,                                                           
          "p": 0.883255,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " fellow",                                                    
          "timestamps": {                                                       
            "from": "00:00:00,170",                                             
            "to": "00:00:00,610"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 170,                                                        
            "to": 610                                                           
          },                                                                    
          "id": 5891,                                                           
          "p": 0.989602,                                                        
          "t_dtw": -1                                                           
        },                      
        ....

And with this PR:

[00:00:00.990 --> 00:00:07.800]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:07.800 --> 00:00:15.860]   At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle
[00:00:15.860 --> 00:00:18.510]   Columbia.
...
  "transcription": [                                                            
    {                                                                           
      "timestamps": {                                                           
        "from": "00:00:00,990",                                                 
        "to": "00:00:07,800"                                                    
      },                                                                        
      "offsets": {                                                              
        "from": 990,                                                            
        "to": 7800                                                              
      },                                                                        
      "text": " My fellow Americans, this day has brought terrible news and great sadness to our country.",
      "tokens": [                                                               
        {                                                                       
          "text": "[_BEG_]",                                                    
          "timestamps": {                                                       
            "from": "00:00:00,990",                                             
            "to": "00:00:00,990"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 990,                                                        
            "to": 990                                                           
          },                                                                    
          "id": 50363,                                                          
          "p": 0.994401,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " My",                                                        
          "timestamps": {                                                       
            "from": "00:00:01,000",                                             
            "to": "00:00:01,080"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 1000,                                                       
            "to": 1080                                                          
          },                                                                    
          "id": 2011,                                                           
          "p": 0.883255,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " fellow",                                                    
          "timestamps": {                                                       
            "from": "00:00:01,140",                                             
            "to": "00:00:01,540"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 1140,                                                       
            "to": 1540                                                          
          },                                                                    
          "id": 5891,                                                           
          "p": 0.989602,                                                        
          "t_dtw": -1                                                           
        },                  
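As a quick check on the two excerpts above, the following Python sketch computes how far each token's offsets moved between the "prior" and "with this PR" outputs. The offsets are copied from the JSON shown in this description; the `shifts` helper is only an illustration and is not part of whisper.cpp.

```python
# Token offsets (ms) copied from the two JSON excerpts in this description.
before = {"[_BEG_]": (0, 0), " My": (20, 100), " fellow": (170, 610)}
after = {"[_BEG_]": (990, 990), " My": (1000, 1080), " fellow": (1140, 1540)}

def shifts(before, after):
    """Return {token: (shift of 'from', shift of 'to')} in milliseconds."""
    return {
        tok: (after[tok][0] - before[tok][0], after[tok][1] - before[tok][1])
        for tok in before
    }

for tok, (d_from, d_to) in shifts(before, after).items():
    print(f"{tok!r}: from +{d_from} ms, to +{d_to} ms")
```

The shifts come out as 990/990 ms for `[_BEG_]`, 980/980 ms for ` My`, and 970/930 ms for ` fellow`, so the tokens are not all moved by one constant amount. The description does not spell out the exact mapping, so this is only an observation about the excerpts.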

@danbev danbev force-pushed the vad-token-timestamp-alignment branch from b23c671 to 75db936 Compare June 2, 2025 14:47
@danbev danbev marked this pull request as ready for review June 3, 2025 04:28
@danbev danbev marked this pull request as draft June 16, 2025 05:29
@chriswang-

Has this issue been resolved? It doesn't seem to have been merged into the main branch. Or has it already been fixed in the vad-token-timestamp-alignment branch, so that I can use that instead?

danbev commented Jun 16, 2025

Has this issue been resolved?

No, it has not been resolved yet. I changed it to a draft (which might have sent a notification) as I noticed the token level timestamps are still not correct and I need to revisit this.

@danbev danbev force-pushed the vad-token-timestamp-alignment branch from 75db936 to 12e44a1 Compare June 16, 2025 11:42
@danbev danbev marked this pull request as ready for review June 16, 2025 11:42
danbev commented Jun 16, 2025

@chriswang- It would be great if you could try this out with the audio sample in your original issue report.

@chriswang-

@danbev Sorry, the issue was not filed by me, but I can try to verify it.

danbev commented Jun 16, 2025

@chriswang- Ah my bad, I should have checked to be sure and not just assumed.

@chriswang-

subtitle-master-with-vad.json
subtitle-master-without-vad.json
subtitle-PR.json

@danbev
I have uploaded three files: the master branch result with the VAD feature enabled, the master branch result without the VAD feature, and the result of transcribing with your PR. After a brief comparison, it seems the issue has been resolved. However, I have honestly forgotten the testing context and related details from when I first discovered the bug. I only noticed the bug and confirmed that the same issue had already been reported on GitHub.
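A comparison like the one described can be partly automated. The sketch below is one possible sanity check, not something used in this thread: it flags tokens whose offsets fall outside their segment's offsets, using the JSON layout shown in the PR description (`transcription`, `offsets`, `tokens`). The helper name `tokens_outside_segment` is hypothetical, and the inline data is a trimmed copy of the excerpts from the description rather than the contents of the attached files.

```python
def tokens_outside_segment(result):
    """List (segment_index, token_text) for tokens outside their segment's offsets."""
    bad = []
    for i, seg in enumerate(result["transcription"]):
        seg_from, seg_to = seg["offsets"]["from"], seg["offsets"]["to"]
        for tok in seg.get("tokens", []):
            t_from, t_to = tok["offsets"]["from"], tok["offsets"]["to"]
            if t_from < seg_from or t_to > seg_to:
                bad.append((i, tok["text"]))
    return bad

def make_result(token_offsets):
    """Build a one-segment result in the layout shown in the PR description."""
    texts = ["[_BEG_]", " My", " fellow"]
    return {
        "transcription": [{
            "offsets": {"from": 990, "to": 7800},
            "tokens": [
                {"text": t, "offsets": {"from": a, "to": b}}
                for t, (a, b) in zip(texts, token_offsets)
            ],
        }]
    }

# Offsets copied from the "prior to this PR" and "with this PR" excerpts.
before_pr = make_result([(0, 0), (20, 100), (170, 610)])
with_pr = make_result([(990, 990), (1000, 1080), (1140, 1540)])

print(tokens_outside_segment(before_pr))  # all three tokens are flagged
print(tokens_outside_segment(with_pr))    # no tokens are flagged
```

To run it on a real output file, load the file with `json.load` and pass the resulting dictionary to `tokens_outside_segment`, assuming the file was produced with `-ojf` and has the same layout. Passing this check does not prove the token timestamps are correct; it only catches the kind of mismatch visible in the "prior to this PR" excerpt.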

@danbev danbev force-pushed the vad-token-timestamp-alignment branch from 12e44a1 to c5e33f4 Compare June 24, 2025 11:17
accessiblepixel added a commit to accessiblepixel/whisper.cpp that referenced this pull request Jul 5, 2025
Successfully merging this pull request may close these issues.

bug: Whisper VAD - Token Timestamp Issue