Voxseg is a Python package for voice activity detection (VAD), for speech/non-speech audio segmentation. It provides a full VAD pipeline, including a pretrained VAD model, and it is based on work presented here.
Use of this VAD may be cited as follows:
@inproceedings{cnnbilstm_vad,
title = {A hybrid {CNN-BiLSTM} voice activity detector},
author = {Wilkinson, N. and Niesler, T.},
booktitle = {Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
year = {2021},
address = {Toronto, Canada},
}
To install this package, clone the repository from GitHub to a directory of your choice and install using pip:
git clone https://github.com/NickWilkinson37/voxseg.git
pip install ./voxseg
In future, installation directly from the package manager will be supported.
To test the installation run:
cd voxseg
python -m unittest
The test will run the full VAD pipeline on two example audio files. This pipeline includes feature extraction, VAD and evaluation. The output should include the following*:
- A progress bar monitoring the feature extraction process, followed by a DataFrame containing normalized-features and metadata.
- A DataFrame containing model generated endpoints, indicating the starts and ends of discovered speech utterances.
- A confusion matrix of speech vs non-speech, with the following values: TPR 0.935, FPR 0.137, FNR 0.065, FPR 0.863
*The order in which these outputs appear may vary.
Before using the VAD, a number of files need to be created to specify the audio that one wishes to process. These files are the same as those used by the Kaldi toolkit. Extensive documentation on the data preparation process for Kaldi may be found here. Only the files required by the Voxseg toolkit are described here.
-
wav.scp
- this file provides the paths to the audio files one wishes to process, and assigns them a unique recording-id. It is structured as follows:<recording-id> <extended-filename>
. Each entry should appear on a new line, for example:rec_000 wavs/some_raw_audio.wav rec_001 wavs/some_more_raw_audio.wav rec_002 wavs/yet_more_raw_audio.wav
Note that the
<extended-filename>
may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used. -
segments
- this file is optional and specifies segments within the audio file to be processed by the VAD (useful if one only wants to run the VAD on a subset of the full audio files). If this file is not present the full audio files will be processed. This file is structured as follows:<utterance-id> <recording-id> <segment-begin> <segment-end>
, where<segment-begin>
and<segment-end>
are in seconds. Each entry should appear on a new line, for example:rec_000_part_1 rec_000 20.5 142.6 rec_000_part_2 rec_000 362.1 421.0 rec_001_part_1 rec_001 45.9 89.4 rec_001_part_2 rec_001 97.7 130.0 rec_001_part_3 rec_001 186.9 241.8 rec_002_full rec_002 0.0 350.0
These two files should be placed in the same directory, usually named data
, however you may give it any name. This is the directory that is provided as input to voxseg’s feature extraction.
The package may be used in a number of ways:
- The full VAD can be run with a single script.
- Smaller scripts may be called to run different parts of the pipeline separately, for example feature extraction, then VAD. Useful if one is tuning the parameters of the VAD, and would like to avoid recomputing the features for every experiment.
- As a module within python, useful if one would like to integrate parts of the system into one's own python code.
This package may be used through a basic command-line interface. To run the full VAD pipeline with default settings, navigate to the voxseg directory and call:
# data_directory is Kaldi-style data directory and output_directory is destination for segments file
python3 voxseg/main.py data_directory output_directory
To explore the available flags for changing settings navigate to the voxseg directory and call:
python3 voxseg/main.py -h
The most commonly used flags are:
- -s sets the speech vs non-speech decision threshold (accepts float between 0 and 1, default is 0.5)
- -f: adds median filtering to smooth the output (accepts odd integer for kernal size, default is 1)
- -e: allows a reference directory to be given, against which the VAD output is scored (accepts path to Kaldi-style directory containing ground truth segments file)
To run the smaller, individual scripts, navigate to the voxseg directory and call:
# reads Kaldi-style data directory and extracts features to .h5 file in output directory
python3 voxseg/extract_feats.py data_directory output_directory
# runs VAD and saves output segments file in ouput directory
python3 voxseg/run_cnnlstm.py -m model_path features_directory output_directory
# reads Kaldi-style data directory used as VAD input, the VAD output directory and a directory
# contining a ground truth segments file reference.
python3 voxseg/evaluate.py vad_input_directory vad_out_directory ground_truth_directory
To import the module an use it within custom Python scripts/modules:
import voxseg
from tensorflow.keras import models
# feature extraction
data = extract_feats.prep_data('path/to/input/data') # prepares audio from Kaldi-style data directory
feats = extract_feats.extract(data) # extracts log-mel filterbank spectrogram features
normalized_feats = extract_feats.normalize(norm_feats) # normalizes the features
#model execution
model = models.load_model('path/to/model.h5') # loads a pretrained VAD model
predicted_labels = run_cnnlstm.predict_labels(model, normalized_feats) # runs the VAD model on features
utils.save(predicted_labels, 'path/for/output/labels.h5') # saves predicted labels to .h5 file
A basic training script is provided in the file train.py
in the root directory of the project.
To use this script the following files are required in a Kaldi style data directory:
-
wav.scp
- this file provides the paths to the audio files one wishes to use for training, and assigns them a unique recording-id. It is structured as follows:<recording-id> <extended-filename>
. Each entry should appear on a new line, for example:rec_000 wavs/some_raw_audio.wav rec_001 wavs/some_more_raw_audio.wav
Note that the
<extended-filename>
may be an absolute path or relative path, except when using Docker or Singularity, where paths relative to the mount point must be used. -
segments
- this file specifies the start and end points of each labelled segment within the audio file. Note, this is different to the way this file is used when provided for decoding. This file is structured as follows:<utterance-id> <recording-id> <segment-begin> <segment-end>
, where<segment-begin>
and<segment-end>
are in seconds. Each entry should appear on a new line, for example:rec_000_00 rec_000 0.0 4.3 rec_000_01 rec_000 4.3 7.2 rec_000_02 rec_000 7.2 14.8 rec_000_03 rec_000 14.8 19.5 rec_001_00 rec_001 0.0 8.5 rec_001_01 rec_001 8.5 12.2 rec_001_02 rec_001 12.2 16.1 rec_001_03 rec_001 16.1 18.9 rec_001_04 rec_001 18.9 22.0
-
utt2spk
- this file specifies the label attached to each segment defined within thesegments
file. This file is structured as follows:<utterance-id> <label>
. Each entry should appear on a new line, for example:rec_000_00 speech rec_000_01 non_speech rec_000_02 speech rec_000_03 non_speech rec_001_00 non_speech rec_001_01 speech rec_001_02 non_speech rec_001_03 speech rec_001_04 non_speech
Note, that the model may be trained with 2 classes ('speech', 'non_speech')
as shown in the above example, or with the 4 classes from AVA-Speech dataset ('clean_speech', 'no_speech', 'speech_with_music', 'speech_with_noise')
, as is the case for the default model used by the toolkit.
To use the training script with a specific validation set run:
# use -v to specify a Kaldi style data directory to be used as validation set
python3 train.py -v val_dir train_dir model_name out_dir
To use the training script with a percetage of the training data as a validation set run:
# use -s to specify a percetage of the training data to be used as a validation set
python3 train.py -s 0.1 train_dir model_name out_dir
The training script may also be used without any flags, however this is not recommended, as it makes it difficult to tell whether the model is starting to overfit. When a validation set is provided the model with the best validation accuracy is saved. When no validation set is provided the model is saved after the final training epoch.