whisper-cli : align token timestamps with VAD ts #3218

danbev · 2025-06-02T09:17:39Z

This commit aligns the token timestamps with the VAD timestamps when VAD is enabled.

The motivation is that the token timestamps currently reported in the full JSON output are the timestamps that whisper sees after the VAD has processed the audio. Whisper only sees the (possibly filtered) audio, so the token timestamps are relative to the filtered audio, not the original audio. The segment timestamps are already mapped/aligned to the original timestamps, but this is not currently done for the token timestamps, which is what this commit aims to address.

Resolves: #3174
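
To illustrate the idea of mapping a timestamp from the filtered audio back to the original audio, here is a minimal Python sketch. It is not the code in this PR: it assumes the filtered audio is simply the detected speech segments concatenated back to back, and the names `build_mapping` and `filtered_to_original` as well as the segment values are made up for illustration. The actual alignment in whisper.cpp may differ (for example in how the gaps between speech segments are handled).

```python
from bisect import bisect_right

def build_mapping(speech_segments_ms):
    """Build (filtered_start, original_start, duration) triples.

    speech_segments_ms: sorted, non-overlapping (start, end) pairs in
    original-audio milliseconds (hypothetical VAD output).
    Assumption: the filtered audio is these segments concatenated.
    """
    mapping = []
    filtered_pos = 0
    for start, end in speech_segments_ms:
        mapping.append((filtered_pos, start, end - start))
        filtered_pos += end - start
    return mapping

def filtered_to_original(t_ms, mapping):
    """Map a timestamp in the filtered audio to the original timeline."""
    if not mapping:
        return t_ms  # nothing was filtered, so the timelines coincide
    starts = [m[0] for m in mapping]
    # Find the last speech segment that starts at or before t_ms.
    i = max(bisect_right(starts, t_ms) - 1, 0)
    filtered_start, original_start, duration = mapping[i]
    # Clamp so a timestamp never lands past the end of its segment.
    offset = min(max(t_ms - filtered_start, 0), duration)
    return original_start + offset

# Hypothetical speech segments: 1.0s-3.0s and 5.0s-6.0s of the original audio.
mapping = build_mapping([(1000, 3000), (5000, 6000)])
print(filtered_to_original(500, mapping))   # 1500
print(filtered_to_original(2500, mapping))  # 5500
```

The same lookup would have to be applied to both the `from` and `to` offset of every token, not just to the segment boundaries.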

Example of token level timestamps prior to this PR:

$ ./build/bin/whisper-cli -m models/ggml-medium.en.bin -f samples/gb1.ogg --vad -vm models/for-tests-silero-v5.1.2-ggml.bin -ojf -of gb1
...
[00:00:00.990 --> 00:00:07.800]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:07.800 --> 00:00:15.860]   At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle
...

  "transcription": [                                                            
    {                                                                           
      "timestamps": {                                                           
        "from": "00:00:00,990",                                                 
        "to": "00:00:07,800"                                                    
      },                                                                        
      "offsets": {                                                              
        "from": 990,                                                            
        "to": 7800                                                              
      },                                                                        
      "text": " My fellow Americans, this day has brought terrible news and great sadness to our country.",
      "tokens": [                                                               
        {                                                                       
          "text": "[_BEG_]",                                                    
          "timestamps": {                                                       
            "from": "00:00:00,000",                                             
            "to": "00:00:00,000"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 0,                                                          
            "to": 0                                                             
          },                                                                    
          "id": 50363,                                                          
          "p": 0.994401,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " My",                                                        
          "timestamps": {                                                       
            "from": "00:00:00,020",                                             
            "to": "00:00:00,100"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 20,                                                         
            "to": 100                                                           
          },                                                                    
          "id": 2011,                                                           
          "p": 0.883255,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " fellow",                                                    
          "timestamps": {                                                       
            "from": "00:00:00,170",                                             
            "to": "00:00:00,610"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 170,                                                        
            "to": 610                                                           
          },                                                                    
          "id": 5891,                                                           
          "p": 0.989602,                                                        
          "t_dtw": -1                                                           
        },                      
        ....

And with this PR:

[00:00:00.990 --> 00:00:07.800]   My fellow Americans, this day has brought terrible news and great sadness to our country.
[00:00:07.800 --> 00:00:15.860]   At 9 o'clock this morning, Mission Control in Houston lost contact with our space shuttle
[00:00:15.860 --> 00:00:18.510]   Columbia.
...
  "transcription": [                                                            
    {                                                                           
      "timestamps": {                                                           
        "from": "00:00:00,990",                                                 
        "to": "00:00:07,800"                                                    
      },                                                                        
      "offsets": {                                                              
        "from": 990,                                                            
        "to": 7800                                                              
      },                                                                        
      "text": " My fellow Americans, this day has brought terrible news and great sadness to our country.",
      "tokens": [                                                               
        {                                                                       
          "text": "[_BEG_]",                                                    
          "timestamps": {                                                       
            "from": "00:00:00,990",                                             
            "to": "00:00:00,990"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 990,                                                        
            "to": 990                                                           
          },                                                                    
          "id": 50363,                                                          
          "p": 0.994401,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " My",                                                        
          "timestamps": {                                                       
            "from": "00:00:01,000",                                             
            "to": "00:00:01,080"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 1000,                                                       
            "to": 1080                                                          
          },                                                                    
          "id": 2011,                                                           
          "p": 0.883255,                                                        
          "t_dtw": -1                                                           
        },                                                                      
        {                                                                       
          "text": " fellow",                                                    
          "timestamps": {                                                       
            "from": "00:00:01,140",                                             
            "to": "00:00:01,540"                                                
          },                                                                    
          "offsets": {                                                          
            "from": 1140,                                                       
            "to": 1540                                                          
          },                                                                    
          "id": 5891,                                                           
          "p": 0.989602,                                                        
          "t_dtw": -1                                                           
        },
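
A quick way to spot the misalignment in the JSON output is to check whether each token's offsets fall inside the offsets of the segment that contains it. The following Python sketch does that for the structure shown above (`transcription` → `offsets` / `tokens` → `offsets`). The function names are made up for illustration, the sample dictionaries only reproduce the three tokens quoted above, and the check is a rough heuristic rather than a proof that the timestamps are correct.

```python
import json

def load_result(path):
    """Load a whisper-cli full JSON output file (as produced with -ojf)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def tokens_outside_segment(result):
    """Yield (segment_index, token_text, token_from, token_to) for every
    token whose offsets lie outside the offsets of its segment."""
    for i, seg in enumerate(result.get("transcription", [])):
        seg_from = seg["offsets"]["from"]
        seg_to = seg["offsets"]["to"]
        for tok in seg.get("tokens", []):
            t_from = tok["offsets"]["from"]
            t_to = tok["offsets"]["to"]
            if t_from < seg_from or t_to > seg_to:
                yield (i, tok["text"], t_from, t_to)

def make_sample(token_offsets):
    """Build a one-segment result using the segment offsets quoted above."""
    texts = ["[_BEG_]", " My", " fellow"]
    tokens = [{"text": t, "offsets": {"from": a, "to": b}}
              for t, (a, b) in zip(texts, token_offsets)]
    return {"transcription": [{"offsets": {"from": 990, "to": 7800},
                               "tokens": tokens}]}

# Token offsets copied from the output before and with this PR.
before = make_sample([(0, 0), (20, 100), (170, 610)])
after = make_sample([(990, 990), (1000, 1080), (1140, 1540)])

print(list(tokens_outside_segment(before)))  # all three tokens are flagged
print(list(tokens_outside_segment(after)))   # []
```

For a real run, pass the result of `load_result(...)` on the generated JSON file to `tokens_outside_segment` instead of the sample dictionaries.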

chriswang- · 2025-06-16T06:47:13Z

Has this issue been resolved? It doesn't seem to have been merged into the main branch. Or has it already been fixed in the branch (vad-token-timestamp-alignment), so that I can use that instead?

danbev · 2025-06-16T07:00:44Z

> Has this issue been resolved?

No, it has not been resolved yet. I changed it to a draft (which might have sent a notification) as I noticed the token level timestamps are still not correct and I need to revisit this.

danbev · 2025-06-16T11:44:04Z

@chriswang- It would be great if you could try this out with the audio sample in your original issue report.

chriswang- · 2025-06-16T11:58:13Z

@danbev Sorry, the issue was not reported by me, but I can try to verify it.

danbev · 2025-06-16T12:17:51Z

@chriswang- Ah my bad, I should have checked to be sure and not just assumed.

chriswang- · 2025-06-16T13:28:14Z

subtitle-master-with-vad.json
subtitle-master-without-vad.json
subtitle-PR.json

@danbev
I have uploaded three files: one with the result from the master branch with VAD enabled, one from the master branch without VAD, and one transcribed using your PR. After a brief comparison, it seems the issue has been resolved. However, I have forgotten the testing context and related details from when I first discovered the bug; I only noticed it and confirmed that the same issue had already been reported on GitHub.


danbev force-pushed the vad-token-timestamp-alignment branch from b23c671 to 75db936 Compare June 2, 2025 14:47

danbev marked this pull request as ready for review June 3, 2025 04:28

danbev marked this pull request as draft June 16, 2025 05:29

danbev force-pushed the vad-token-timestamp-alignment branch from 75db936 to 12e44a1 Compare June 16, 2025 11:42

danbev marked this pull request as ready for review June 16, 2025 11:42

danbev force-pushed the vad-token-timestamp-alignment branch from 12e44a1 to c5e33f4 Compare June 24, 2025 11:17

accessiblepixel added a commit to accessiblepixel/whisper.cpp that referenced this pull request Jul 5, 2025

Add vad corrections as per ggml-org#3218 to my own branch

50739c6
