
VAD is relatively slow #364

Open
AlexandderGorodetski opened this issue Jul 20, 2023 · 25 comments

Comments

@AlexandderGorodetski

AlexandderGorodetski commented Jul 20, 2023

Hello guys,

I am using the VAD of faster-whisper with the code below. On the TED-LIUM benchmark, I found that VAD takes 8% of the time and transcription takes 92%. I would prefer to reduce the VAD time so that it takes no more than 1%. Is it possible to optimize the VAD procedure in terms of real time? Maybe it is possible to run VAD on several CPUs? By the way, I see that VAD runs on the CPU; is it possible to run it on the GPU somehow?

# VAD
audio_buffer = decode_audio(audio_filename, sampling_rate=whisper_sampling_rate)

# Get the speech chunks in the given audio buffer, and create a reduced audio buffer that contains only speech.
speech_chunks = get_speech_timestamps(audio_buffer)
vad_audio_buffer = collect_chunks(audio_buffer, speech_chunks)

# Transcribe the reduced audio buffer.
init_segments, _ = whisper_model.transcribe(vad_audio_buffer, language=language_code, beam_size=beam_size)

# Restore the true timestamps for the segments.
segments = restore_speech_timestamps(init_segments, speech_chunks, whisper_sampling_rate)
@hoonlight
Contributor

hoonlight commented Jul 20, 2023

Lowering the window_size_samples value may help.
In faster-whisper, the default is 1024, and you can choose between 512, 1024, and 1536.

snakers4/silero-vad#322 (comment)
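
For reference, a minimal sketch of setting this option through the regular transcribe API (this assumes vad_parameters accepts the VadOptions fields, as used later in this thread; the model size and file path are placeholders):

import onnxruntime  # only needed if you also tweak the VAD session; not required here
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# vad_parameters is forwarded to VadOptions; window_size_samples controls how
# many samples the Silero VAD model sees per forward pass (512, 1024 or 1536).
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(window_size_samples=512),
)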

@guillaumekln
Contributor

The VAD model is also run on a single CPU core:

https://github.com/guillaumekln/faster-whisper/blob/e786e26f75f49b7d638412f3bf2b2b75a9c3c9e8/faster_whisper/vad.py#L254-L255

Can you try changing these values and see how they impact the performance?
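
For illustration, a sketch of the kind of change being suggested, using the standard onnxruntime.SessionOptions thread settings (the exact surrounding code is at the link above; the thread counts and model path here are only examples):

import onnxruntime

opts = onnxruntime.SessionOptions()
# faster-whisper pins the VAD session to a single core by default;
# raising these lets onnxruntime use more CPU threads per inference.
opts.intra_op_num_threads = 4
opts.inter_op_num_threads = 4

# "silero_vad.onnx" is a placeholder for the bundled VAD model path.
session = onnxruntime.InferenceSession("silero_vad.onnx", sess_options=opts)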

@phineas-pta

You can make VAD run on GPU:

1. Install the dependencies:

pip uninstall onnxruntime
pip install onnxruntime-gpu

2. Edit the code:

In vad.py, replace lines 253-262 with:

        opts = onnxruntime.SessionOptions()
        opts.log_severity_level = 4
        opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
        # https://github.com/microsoft/onnxruntime/issues/11548#issuecomment-1158314424

        self.session = onnxruntime.InferenceSession(
            path,
            providers=["CUDAExecutionProvider"],
            sess_options=opts,
        )
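
If you try this, it may be worth confirming that onnxruntime actually sees the GPU provider after reinstalling (standard onnxruntime API):

import onnxruntime

# Should include "CUDAExecutionProvider" when onnxruntime-gpu is installed correctly.
print(onnxruntime.get_available_providers())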

@Purfview
Contributor

Lowering the window_size_samples value may help.

I get faster speed with a higher value; is lower faster for you?

 512: VAD speed  58 audio seconds/s - removed 01:37.831 of audio
1024: VAD speed 107 audio seconds/s - removed 01:36.495 of audio
1536: VAD speed 134 audio seconds/s - removed 01:45.383 of audio

Not sure about precision either. 1024 included insignificantly more non-voice area vs 1536, but 1536 excluded one voice line in a music/song area.

Can you try changing these values and see how they impact the performance?

No impact for me.

You can make VAD run on GPU

Could you benchmark VAD, CPU vs GPU?

@phineas-pta

Do you have any benchmark code & data?

@Purfview
Contributor

No.

@hoonlight
Contributor

hoonlight commented Jul 22, 2023

I get faster speed with a higher value; is lower faster for you?

After seeing your results, I tested it too, and it took longer for lower values of window_size_samples.

512: 23.8 seconds - 296 speech chunks
1024: 12.7 seconds - 288 speech chunks
1536: 10.9 seconds - 298 speech chunks

Not sure about precision either. 1024 included insignificantly more non-voice area vs 1536, but 1536 excluded one voice line in a music/song area.

I'm not sure about the precision, I'll check it later.

benchmark code:
import time
from typing import NamedTuple

from faster_whisper import vad, audio


class VadOptions(NamedTuple):
    threshold: float = 0.5
    min_speech_duration_ms: int = 250
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    window_size_samples: int = 1024
    speech_pad_ms: int = 400

decoded_audio = audio.decode_audio("test.mp4")

start = time.time()
speech_chunks_512 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=512)
)
end = time.time()
duration_512 = end - start

start = time.time()
speech_chunks_1024 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=1024)
)
end = time.time()
duration_1024 = end - start

start = time.time()
speech_chunks_1536 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=1536)
)
end = time.time()
duration_1536 = end - start


print(f"512: {duration_512}", len(speech_chunks_512))
print(f"1024: {duration_1024}", len(speech_chunks_1024))
print(f"1536: {duration_1536}", len(speech_chunks_1536))

@Purfview
Contributor

I did tests on various samples to see the effects of "1536" on transcriptions.
I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it default in r139.2.

@iorilu

iorilu commented Aug 28, 2023

I did tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it default in r139.2.

Does your application use Demucs now?

How do you use Demucs to preprocess audio?

@Purfview
Contributor

Purfview commented Aug 28, 2023

Does your application use Demucs now?

No. And I won't include it, as it uses PyTorch; that's gigabytes of additional files...
EDIT: Or maybe I could, if pyinstaller can do hybrid onefile/onedir compiles; then I could make an optional separate download for torch...

How do you use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs

@iorilu

iorilu commented Aug 28, 2023

Does your application use Demucs now?

No. And I won't include it, as it uses PyTorch; that's gigabytes of additional files... EDIT: Or maybe I could, if pyinstaller can do hybrid onefile/onedir compiles; then I could make an optional separate download for torch...

How do you use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs

I just checked Demucs, it can run on CPU; you could make it run on CPU by default.

@Purfview
Contributor

Purfview commented Aug 28, 2023

Still, CPU-only torch would increase the current 70 MB .exe about 6 times...
And while Demucs has positive effects on accuracy, it can have negative effects too, like missing punctuation and wrong sentence separation on Demucs'ed files.

Currently I'm not interested in bundling it in.

@ozancaglayan
Contributor

ozancaglayan commented Sep 4, 2023

A couple of comments from personal experience:

  • The VAD model may be quite slow compared to ASR when processing relatively short audio files. The main reason is that it's an RNN-based model.
  • Last time I tried it on GPU, there were no substantial speed-ups compared to CPU.
  • The effect of intra_op_num_threads on CPU inference is limited. I get slightly better runtime with 4 threads compared to 1, but > 4 is basically useless in my case/CPU. It's not even a 2x speed-up with 4 threads.
  • A larger window_size_samples is the easiest way to improve the speed, as there are fewer windows to process and forward-pass through the model (see the sketch below).
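
To make that last point concrete, a back-of-the-envelope count of VAD forward passes per hour of audio (assuming the 16 kHz sampling rate used by Whisper and Silero VAD):

SAMPLING_RATE = 16000  # Hz
audio_seconds = 3600   # one hour of audio

for window_size_samples in (512, 1024, 1536):
    n_windows = audio_seconds * SAMPLING_RATE // window_size_samples
    print(window_size_samples, n_windows)
# 512  -> 112500 windows
# 1024 -> 56250
# 1536 -> 37500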

@guillaumekln
Contributor

guillaumekln commented Sep 8, 2023

I think it's not very useful to measure the % of time used by the VAD. You should instead compare the total execution time with and without VAD.

The VAD can remove non-speech sections which would otherwise trigger the slow temperature fallback in Whisper. In that case, the total execution time is reduced even though the VAD took X% of it.
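
A minimal sketch of that comparison (file path and model size are placeholders; note that transcribe returns a lazy generator, so the segments have to be consumed for the transcription to actually run):

import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

def timed_transcribe(path, **kwargs):
    start = time.time()
    segments, _ = model.transcribe(path, **kwargs)
    list(segments)  # transcribe is lazy; iterate to actually run the model
    return time.time() - start

t_plain = timed_transcribe("audio.wav")
t_vad = timed_transcribe("audio.wav", vad_filter=True)
print(f"without VAD: {t_plain:.1f}s, with VAD: {t_vad:.1f}s")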

@AvivSham

Hi all,
We also see a degradation in performance when using the vad_filter=True flag. Like others, we tried to play with the number of threads, without improvement. Is there any progress on enabling GPU support for the VAD model? Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

Thanks @guillaumekln!

@Purfview
Contributor

Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

But it's already lightweight and superfast.

Is there any progress on enabling GPU support for the VAD model?

People reported that there is no significant performance increase when running it on GPU.

@AvivSham

Hi @Purfview,
Thank you for your fast response.
When running the following code, it seems like the overhead of adding VAD is not negligible.

import time

from faster_whisper import WhisperModel

files_list = [
    "/home/ec2-user/datasets/vad_debug/no_speech_1.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_2.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_3.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_4.wav",
]

model_size = "large-v2"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

for f in files_list:
    t_i = time.time()
    segments, _ = model.transcribe(f, beam_size=5, language="fr")
    t_i = time.time() - t_i
    time.sleep(20)
    t_j = time.time()
    segments_vad, _ = model.transcribe(
        f,
        beam_size=5,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=2000),
        language="fr",
    )
    t_j = time.time() - t_j
    print(t_j / t_i)

These are the prints of the above script:

File 1:
0.5270593472686265

File 2:
1.0318930571300973

File 3:
1.0178552937839627

File 4:
2.4939251070712145

when reducing min_silence_duration_ms to 200:

File 1:
0.5422778267655759

File 2:
1.0773890526952445

File 3:
1.083032817349901

File 4:
2.499190581616007

Note that the first 3 files are ~1 second long and the 4th is ~38 seconds long.

Any suggestions on how to make it faster for long files?
@guillaumekln

@Purfview
Contributor

the overhead of adding VAD is not negligible

Obviously. Why would anyone expect it to be negligible?

@AvivSham

AvivSham commented Oct 30, 2023

@Purfview, let me clarify.

  1. Of course there will be overhead, but not one that more than doubles the runtime for a ~38 second file.
  2. In addition to (1), Whisper large-v2 has ~1.5B parameters while Silero VAD has roughly 100K parameters.

Given the two points above, how can we make it run faster? And if there is such a difference in parameter count, why does it add such overhead to the runtime?

@guillaumekln

@Purfview
Contributor

Purfview commented Oct 30, 2023

From the benchmarks posted in this thread you can see that VAD runs at 134 audio seconds/s, and that's on an ancient CPU.

You can use window_size_samples=1536 to make VAD faster.

...doubles the runtime for a ~38 second file.

But you don't measure the whole runtime in your code example: transcribe returns a lazy generator, so the transcription itself only runs when you iterate over the segments.
Btw, print(t_j / t_i) doesn't make sense; this -> print(t_j - t_i) will give a meaningful measurement of VAD performance.

In addition to (1), Whisper large-v2 has...

You don't measure large-v2's performance there.

@AvivSham

AvivSham commented Oct 30, 2023

We want to measure the performance as a percentage, therefore t_j / t_i is calculated.

You don't measure there large-v2's performance.

What do you mean? Can you please suggest how to measure it correctly?

@Purfview
Contributor

We want to measure the performance as a percentage, therefore t_j / t_i is calculated.

Now it shows something like a car's speed as a percentage of the coolant's flow speed. ;)

What do you mean? Can you please suggest how to measure it correctly?

You were told how to do it there -> #271

@AvivSham

I forgot about that ;).
Final question: is it possible to make the transcribe call faster, besides providing the language? Did you benchmark the performance w.r.t CPU threads?
If the gain from running on GPU is insignificant, I think we can close this issue.

@Purfview
Contributor

Did you benchmark the performance w.r.t CPU threads?

I didn't notice any impact when adjusting options related to threads.

@saddy001

The default VAD takes 55s for a 2-hour audio file with speech on my system before the actual transcription begins.
