
VAD is relatively slow #364

Open
AlexandderGorodetski opened this issue Jul 20, 2023 · 25 comments

Comments

@AlexandderGorodetski

AlexandderGorodetski commented Jul 20, 2023

Hello guys,

I am using the VAD of faster-whisper with the code below. On the TED-LIUM benchmark, I found that VAD takes 8% of the time and transcription takes 92%. I would prefer to reduce the VAD time so that it takes no more than 1%. Is it possible to optimize the VAD procedure in terms of real time? Maybe it is possible to run VAD on several CPUs? By the way, I see that VAD runs on the CPU; is it possible to run it on the GPU somehow?

# VAD
audio_buffer = decode_audio(audio_filename, sampling_rate=whisper_sampling_rate)

# Get the speech chunks in the given audio buffer, and create a reduced audio buffer that contains only speech.
speech_chunks = get_speech_timestamps(audio_buffer)
vad_audio_buffer = collect_chunks(audio_buffer, speech_chunks)

# Transcribe the reduced audio buffer.
init_segments, _ = whisper_model.transcribe(vad_audio_buffer, language=language_code, beam_size=beam_size)

# Restore the true timestamps for the segments.
segments = restore_speech_timestamps(init_segments, speech_chunks, whisper_sampling_rate)
@hoonlight
Contributor

hoonlight commented Jul 20, 2023

Lowering the window_size_samples value may help.
In faster-whisper, the default is 1024, and you can choose between 512, 1024, and 1536.

snakers4/silero-vad#322 (comment)
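
For reference, a minimal sketch of setting this option through the regular transcribe API (this assumes vad_parameters accepts the VadOptions fields, as used later in this thread; the model size and file path are placeholders):

import onnxruntime  # only needed if you also tweak the VAD session; not required here
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# vad_parameters is forwarded to VadOptions; window_size_samples controls how
# many samples the Silero VAD model sees per forward pass (512, 1024 or 1536).
segments, info = model.transcribe(
    "audio.wav",
    vad_filter=True,
    vad_parameters=dict(window_size_samples=512),
)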

@guillaumekln
Contributor

The VAD model is also run on a single CPU core:

https://github.com/guillaumekln/faster-whisper/blob/e786e26f75f49b7d638412f3bf2b2b75a9c3c9e8/faster_whisper/vad.py#L254-L255

Can you try changing these values and see how they impact the performance?
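
For illustration, a sketch of the kind of change being suggested, using the standard onnxruntime.SessionOptions thread settings (the exact surrounding code is at the link above; the thread counts and model path here are only examples):

import onnxruntime

opts = onnxruntime.SessionOptions()
# faster-whisper pins the VAD session to a single core by default;
# raising these lets onnxruntime use more CPU threads per inference.
opts.intra_op_num_threads = 4
opts.inter_op_num_threads = 4

# "silero_vad.onnx" is a placeholder for the bundled VAD model path.
session = onnxruntime.InferenceSession("silero_vad.onnx", sess_options=opts)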

@phineas-pta

You can make VAD run on GPU:

1. Install the dependencies:

pip uninstall onnxruntime
pip install onnxruntime-gpu

2. Edit the code:

In vad.py, replace lines 253-262 with:

        opts = onnxruntime.SessionOptions()
        opts.log_severity_level = 4
        opts.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
        # https://github.com/microsoft/onnxruntime/issues/11548#issuecomment-1158314424

        self.session = onnxruntime.InferenceSession(
            path,
            providers=["CUDAExecutionProvider"],
            sess_options=opts,
        )
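
If you try this, it may be worth confirming that onnxruntime actually sees the GPU provider after reinstalling (standard onnxruntime API):

import onnxruntime

# Should include "CUDAExecutionProvider" when onnxruntime-gpu is installed correctly.
print(onnxruntime.get_available_providers())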

@Purfview
Contributor

Lowering the window_size_samples value may help.

I get faster speed with a higher value; is lower faster for you?

 512: VAD speed  58 audio seconds/s - removed 01:37.831 of audio
1024: VAD speed 107 audio seconds/s - removed 01:36.495 of audio
1536: VAD speed 134 audio seconds/s - removed 01:45.383 of audio

Not sure about precision either. 1024 included insignificantly more non-voice area vs 1536, but 1536 excluded one voice line in a music/song area.

Can you try changing these values and see how they impact the performance?

No impact for me.

You can make VAD run on GPU

Could you benchmark VAD, CPU vs GPU?

@phineas-pta

Do you have any benchmark code & data?

@Purfview
Contributor

No.

@hoonlight
Contributor

hoonlight commented Jul 22, 2023

I get faster speed with a higher value; is lower faster for you?

After seeing your results, I tested it too, and it took longer for lower values of window_size_samples.

512: 23.8 seconds - 296 speech chunks
1024: 12.7 seconds - 288 speech chunks
1536: 10.9 seconds - 298 speech chunks

Not sure about precision either. 1024 included insignificantly more non-voice area vs 1536, but 1536 excluded one voice line in a music/song area.

I'm not sure about the precision, I'll check it later.

benchmark code:
import time
from typing import NamedTuple

from faster_whisper import vad, audio


class VadOptions(NamedTuple):
    threshold: float = 0.5
    min_speech_duration_ms: int = 250
    max_speech_duration_s: float = float("inf")
    min_silence_duration_ms: int = 2000
    window_size_samples: int = 1024
    speech_pad_ms: int = 400

decoded_audio = audio.decode_audio("test.mp4")

start = time.time()
speech_chunks_512 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=512)
)
end = time.time()
duration_512 = end - start

start = time.time()
speech_chunks_1024 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=1024)
)
end = time.time()
duration_1024 = end - start

start = time.time()
speech_chunks_1536 = vad.get_speech_timestamps(
    decoded_audio, vad_options=VadOptions(window_size_samples=1536)
)
end = time.time()
duration_1536 = end - start


print(f"512: {duration_512}", len(speech_chunks_512))
print(f"1024: {duration_1024}", len(speech_chunks_1024))
print(f"1536: {duration_1536}", len(speech_chunks_1536))

@Purfview
Contributor

I did tests on various samples to see the effects of "1536" on transcriptions.
I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it default in r139.2.

@iorilu

iorilu commented Aug 28, 2023

I did tests on various samples to see the effects of "1536" on transcriptions. I see fewer fallbacks, much better timestamps in some cases, and very positive effects on Demucs'ed files.

I made it default in r139.2.

Does your application use Demucs now?

How do you use Demucs to preprocess audio?

@Purfview
Contributor

Purfview commented Aug 28, 2023

Does your application use Demucs now?

No. And I won't include it, as it uses PyTorch; that's gigabytes of additional files...
EDIT: Or maybe I could, if pyinstaller can do hybrid onefile/onedir compiles; then I could make an optional separate download for torch...

How do you use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs

@iorilu

iorilu commented Aug 28, 2023

Does your application use Demucs now?

No. And I won't include it, as it uses PyTorch; that's gigabytes of additional files... EDIT: Or maybe I could, if pyinstaller can do hybrid onefile/onedir compiles; then I could make an optional separate download for torch...

How do you use Demucs to preprocess audio?

Read and ask there: https://github.com/facebookresearch/demucs

I just checked Demucs, it can run on CPU; you could make it run on CPU by default.

@Purfview
Contributor

Purfview commented Aug 28, 2023

Still, CPU-only torch would increase the current 70 MB .exe about 6 times...
And while Demucs has positive effects on accuracy, it can have negative effects too, like missing punctuation and wrong sentence separation on Demucs'ed files.

Currently I'm not interested in bundling it in.

@ozancaglayan
Contributor

ozancaglayan commented Sep 4, 2023

A couple of comments from personal experience:

  • The VAD model may be quite slow compared to ASR when processing relatively short audio files. The main reason is that it's an RNN-based model.
  • Last time I tried it on GPU, there were no substantial speed-ups compared to CPU.
  • The effect of intra_op_num_threads on CPU inference is limited. I get slightly better runtime with 4 threads compared to 1, but > 4 is basically useless in my case/CPU. It's not even a 2x speed-up with 4 threads.
  • A larger window_size_samples is the easiest way to improve the speed, as there are fewer windows to process and forward-pass through the model (see the sketch below).
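
To make that last point concrete, a back-of-the-envelope count of VAD forward passes per hour of audio (assuming the 16 kHz sampling rate used by Whisper and Silero VAD):

SAMPLING_RATE = 16000  # Hz
audio_seconds = 3600   # one hour of audio

for window_size_samples in (512, 1024, 1536):
    n_windows = audio_seconds * SAMPLING_RATE // window_size_samples
    print(window_size_samples, n_windows)
# 512  -> 112500 windows
# 1024 -> 56250
# 1536 -> 37500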

@guillaumekln
Contributor

guillaumekln commented Sep 8, 2023

I think it's not very useful to measure the % of time used by the VAD. You should instead compare the total execution time with and without VAD.

The VAD can remove non-speech sections which would otherwise trigger the slow temperature fallback in Whisper. In that case, the total execution time is reduced even though the VAD took X% of it.
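
A minimal sketch of that comparison (file path and model size are placeholders; note that transcribe returns a lazy generator, so the segments have to be consumed for the transcription to actually run):

import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

def timed_transcribe(path, **kwargs):
    start = time.time()
    segments, _ = model.transcribe(path, **kwargs)
    list(segments)  # transcribe is lazy; iterate to actually run the model
    return time.time() - start

t_plain = timed_transcribe("audio.wav")
t_vad = timed_transcribe("audio.wav", vad_filter=True)
print(f"without VAD: {t_plain:.1f}s, with VAD: {t_vad:.1f}s")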

@AvivSham

Hi all,
We also see a degradation in performance when using the vad_filter=True flag. Like others, we tried to play with the number of threads, without improvement. Is there any progress on enabling GPU support for the VAD model? Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

Thanks @guillaumekln!

@Purfview
Contributor

Maybe you could add a different VAD model that is equally robust but more lightweight than the current one?

But it's already lightweight and superfast.

Is there any progress on enabling GPU support for the VAD model?

People reported that there is no significant performance increase when running it on GPU.

@AvivSham

Hi @Purfview,
Thank you for your fast response.
When running the following code, it seems like the overhead of adding VAD is not negligible.

import time

from faster_whisper import WhisperModel

files_list = [
    "/home/ec2-user/datasets/vad_debug/no_speech_1.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_2.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_3.wav",
    "/home/ec2-user/datasets/vad_debug/no_speech_4.wav",
]

model_size = "large-v2"

model = WhisperModel(model_size, device="cuda", compute_type="float16")

for f in files_list:
    t_i = time.time()
    segments, _ = model.transcribe(f, beam_size=5, language="fr")
    t_i = time.time() - t_i
    time.sleep(20)
    t_j = time.time()
    segments_vad, _ = model.transcribe(
        f,
        beam_size=5,
        vad_filter=True,
        vad_parameters=dict(min_silence_duration_ms=2000),
        language="fr",
    )
    t_j = time.time() - t_j
    print(t_j / t_i)

These are the prints of the above script:

File 1:
0.5270593472686265

File 2:
1.0318930571300973

File 3:
1.0178552937839627

File 4:
2.4939251070712145

when reducing min_silence_duration_ms to 200:

File 1:
0.5422778267655759

File 2:
1.0773890526952445

File 3:
1.083032817349901

File 4:
2.499190581616007

Note that the first 3 files are ~1 second long and the 4th is ~38 seconds long.

Any suggestions on how to make it faster for long files?
@guillaumekln

@Purfview
Contributor

the overhead of adding VAD is not negligible

Obviously. Why would anyone expect it to be negligible?

@AvivSham

AvivSham commented Oct 30, 2023

@Purfview, let me clarify.

  1. Of course there will be overhead, but not one that more than doubles the runtime for a ~38 second file.
  2. In addition to (1), Whisper large-v2 has ~1.5B parameters while Silero VAD has roughly 100K parameters.

Given the two points above, how can we make it run faster? And if there is such a difference in parameter count, why does it add such overhead to the runtime?

@guillaumekln

@Purfview
Contributor

Purfview commented Oct 30, 2023

From the benchmarks posted in this thread you can see that VAD runs at 134 audio seconds/s, and that's on an ancient CPU.

You can use window_size_samples=1536 to make VAD faster.

...doubles the runtime for a ~38 second file.

But you don't measure the whole runtime in your code example: transcribe returns a lazy generator, so the transcription itself only runs when you iterate over the segments.
Btw, print(t_j / t_i) doesn't make sense; this -> print(t_j - t_i) will give a meaningful measurement of VAD performance.

In addition to (1), Whisper large-v2 has...

You don't measure large-v2's performance there.

@AvivSham

AvivSham commented Oct 30, 2023

We want to measure the performance as a percentage, therefore t_j / t_i is calculated.

You don't measure there large-v2's performance.

What do you mean? Can you please suggest how to measure it correctly?

@Purfview
Contributor

We want to measure the performance as a percentage, therefore t_j / t_i is calculated.

Now it shows something like a car's speed as a percentage of the coolant's flow speed. ;)

What do you mean? Can you please suggest how to measure it correctly?

You were told how to do it there -> #271

@AvivSham

I forgot about that ;).
Final question: is it possible to make the transcribe call faster, besides providing the language? Did you benchmark the performance w.r.t CPU threads?
If the gain from running on GPU is insignificant, I think we can close this issue.

@Purfview
Contributor

Did you benchmark the performance w.r.t CPU threads?

I didn't notice any impact when adjusting options related to threads.

@saddy001

The default VAD takes 55s for a 2-hour audio file with speech on my system before the actual transcription begins.
