
VAD Benchmark

Benchmarking different VAD models on AVA-Speech dataset.


Dataset

You can download & pre-process audio from Google's AVA-Speech dataset using the following bash script:

$ cd dataset
$ bash download_ava_speech.sh

For more information, see the published AVA-Speech paper.
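
Once the script finishes, you can sanity-check the pre-processed audio from Python. The snippet below is a minimal sketch, assuming the script leaves WAV files under dataset/ava_speech (the same path used in the benchmarking example below); adjust the glob pattern if your layout differs:

from glob import glob
from utils import load_audio

# Assumed output location of download_ava_speech.sh; adjust if needed.
wav_files = sorted(glob("dataset/ava_speech/**/*.wav", recursive=True))
print(f"Found {len(wav_files)} audio files")

# Load one clip to confirm it decodes correctly.
audio, sr = load_audio(wav_files[0])
print(f"{wav_files[0]}: {len(audio)} samples at {sr} Hz")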

Models

The following VAD models are currently available:

  • WebRTC
  • Silero
  • SpeechBrain

Inside the ./vads directory, you will find a script for each VAD model.

The following is an example of how to use a VAD model (e.g., WebRTC) to trim silence from a given audio file (e.g., samples/example_48k.wav) and write the trimmed audio back to disk:

from vads import WebRTC
from utils import load_audio, save_audio

# Load the audio file
audio, sr = load_audio("samples/example_48k.wav")

# Initialize WebRTC VAD with default values
vad = WebRTC()
trimmed_audio, sr = vad.trim_silence(audio, sr)

# Save the trimmed audio into the current directory
save_audio(trimmed_audio, sr, "webrtc_example_48k.wav")
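
All models expose the same trim_silence() interface, so switching to a different model is a one-line change. Here is a minimal sketch using Silero instead of WebRTC with default settings; constructor parameters for window size and aggressiveness (suggested by the evaluate.py flags) are assumptions, so check the scripts under ./vads for the exact names:

from vads import Silero
from utils import load_audio, save_audio

audio, sr = load_audio("samples/example_48k.wav")

# Silero with default settings; window size and aggressiveness threshold
# can presumably be tuned via constructor arguments (see ./vads for the
# exact parameter names).
vad = Silero()
trimmed_audio, sr = vad.trim_silence(audio, sr)

save_audio(trimmed_audio, sr, "silero_example_48k.wav")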

Benchmarking

To benchmark any group of VAD models against AVA-Speech, run the evaluate.py script like so:

python evaluate.py \
    --dataset-path /private/home/anwarvic/VAD_Benchmark/dataset/ava_speech \
    --vad-models Silero WebRTC \
    --window-sizes-ms 48 64 96 \
    --agg-thresholds 0.3 0.6 0.9 \
    --speech-labels CLEAN_SPEECH SPEECH_WITH_MUSIC SPEECH_WITH_NOISE

This command benchmarks two VAD models (Silero & WebRTC) on the AVA-Speech dataset with three different window sizes ([48, 64, 96] ms) and three different aggressiveness thresholds ([0.3, 0.6, 0.9]); the higher the threshold, the less sensitive the VAD gets.
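
In other words, the evaluation presumably sweeps every combination of model, window size, and threshold. The sketch below merely enumerates that grid for illustration; the actual sweep logic lives in evaluate.py and may differ:

from itertools import product

models = ["Silero", "WebRTC"]
window_sizes_ms = [48, 64, 96]
agg_thresholds = [0.3, 0.6, 0.9]

# 2 models x 3 window sizes x 3 thresholds = 18 configurations.
for model, win_ms, thresh in product(models, window_sizes_ms, agg_thresholds):
    print(f"{model}: window={win_ms}ms, threshold={thresh}")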

To know all available arguments that you can use, run the following command:

python evaluate.py --help

usage: evaluate.py [-h] --dataset-path DATASET_PATH [--speech-labels [SPEECH_LABELS ...]] --vad-models [{WebRTC,Silero,SpeechBrain} ...] --window-sizes-ms [WINDOW_SIZES_MS ...]
                   --agg-thresholds [AGG_THRESHOLDS ...] [--out-path OUT_PATH] [--num-workers NUM_WORKERS]

options:
  -h, --help            show this help message and exit
  --dataset-path DATASET_PATH
                        Relative/Absolute path where AVA-Speech audio files are located.
  --speech-labels [SPEECH_LABELS ...]
                        List (space separated) of the true labels (case-sensitive) that we are considering as 'speech'.
  --vad-models [{WebRTC,Silero,SpeechBrain} ...]
                        List of vad models to be used.
  --window-sizes-ms [WINDOW_SIZES_MS ...]
                        List of window-sizes (in milliseconds) to be used.
  --agg-thresholds [AGG_THRESHOLDS ...]
                        List of aggressiveness thresholds to be used. The higher the value is, the less sensitive the model gets.
  --out-path OUT_PATH   Relative/Absolute path where the out labels will be located.
  --num-workers NUM_WORKERS
                        Number of workers working in parallel.
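
For a sense of what the benchmark measures: at each aggressiveness threshold, frame-level VAD decisions are compared against the AVA-Speech ground-truth labels (with --speech-labels deciding which labels count as speech), which yields one precision/recall point per threshold. The sketch below illustrates that idea on toy data with scikit-learn; it is not the repository's implementation, and evaluate.py remains the reference:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy frame-level ground truth: 1 = speech, 0 = not speech.
true_frames = np.array([1, 1, 0, 0, 1, 0, 1, 1, 0, 0])
# Hypothetical per-frame speech scores from a VAD model.
vad_scores = np.array([0.9, 0.8, 0.4, 0.2, 0.7, 0.5, 0.95, 0.6, 0.1, 0.3])

# One precision/recall point per aggressiveness threshold.
for threshold in [0.3, 0.6, 0.9]:
    pred_frames = (vad_scores >= threshold).astype(int)
    p = precision_score(true_frames, pred_frames)
    r = recall_score(true_frames, pred_frames)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")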

Running the benchmark will produce a precision/recall (P/R) curve that looks like the one shown below:

TODO

  • create viz function for VAD.
  • create test cases.
