JaVAD is a state-of-the-art Voice Activity Detection package, lightweight and fast, built on PyTorch with minimal dependencies. Core functionality (without audio loading) requires only NumPy and PyTorch, with no registration, tokens, or installation of large unnecessary packages. While it is built using sliding windows over mel spectrograms, it supports streaming. You can also export results to RTTM, CSV, or TextGrid.
There are three models:
- tiny: 0.64s window, best for the quickest voice detection
- balanced: 1.92s window, the fastest model while staying small
- precise: 3.84s window, with an extra DirectionalAlignment layer and 2 additional transformer layers for the best accuracy
All models use mono audio with a sample rate of 16,000 Hz.
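If you load audio yourself rather than through `javad.extras.load_audio`, make sure it is mono and sampled at 16,000 Hz first. Below is a minimal sketch using soundfile and NumPy; it assumes the source file is already at 16 kHz, since neither library resamples.

```python
import numpy as np
import soundfile as sf

# Read the file as float32; sf.read returns (samples, sample_rate)
audio, sr = sf.read("path/to/audio.wav", dtype="float32")
assert sr == 16000, "JaVAD models expect 16 kHz audio; resample beforehand if needed"

# Down-mix multi-channel recordings to mono
if audio.ndim > 1:
    audio = audio.mean(axis=1)
```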
- For evaluation, Google's AVA Speech dataset was used, or, to be precise, only the clips that are still available (74 in total). Since AVA Speech labels only 15 minutes of each clip, 18.5 hours of audio were used in total.
- The JaVAD models in this comparison were trained on a custom, manually labelled dataset built from a separate collection of YouTube clips.
| Model | Precision | Recall | F1 Score | AUROC | Time, GPU (Nvidia RTX 3090) | Time, CPU (Ryzen 3900XT) |
|---|---|---|---|---|---|---|
| NVIDIA NeMo | 0.7676 | 0.9526 | 0.8502 | 0.9201 | 26.24s | 56.94s |
| WebRTC (via py-webrtc) | 0.6099 | 0.9454 | 0.7415 | -¹ | -² | 59.85s |
| SpeechBrain | 0.8213 | 0.8534 | 0.8370 | 0.8961 | 1371.00s | 1981.40s |
| Pyannote | 0.9173 | 0.8463 | 0.8804 | 0.9495 | 75.49s | 823.19s |
| Silero | 0.9678 | 0.6503 | 0.9050 | 0.9169 | 830.27s³ | 695.58s |
| JaVAD tiny⁴* | 0.9263 | 0.8846 | 0.8961 | 0.9550 | 22.32s | 476.93s |
| JaVAD balanced* | 0.9284 | 0.8938 | 0.9108 | 0.9642 | 16.38s | 220.00s |
| JaVAD precise* | 0.9359 | 0.8980 | 0.9166 | 0.9696 | 18.58s | 236.61s |
¹ WebRTC does not return logits.
² WebRTC (via py-webrtc) can be run only on CPU.
³ The Silero JIT model is slower on GPU, and the ONNX model cannot be run on GPU.
⁴ The tiny model is the slowest of the JaVAD models here due to its smaller 0.64s window; it is best suited for immediate speech detection in a streaming pipeline.
\* For information about the training dataset, see the text above the table.
- Python 3.8+
- PyTorch 2.0.0+
- NumPy 1.20.0+
- Optional: `soundfile` for loading audio and simplified processing
If you already have audio loaded and just want to process it with minimum dependencies:

```bash
pip install javad
```

If you want to load audio as well:

```bash
pip install javad[extras]  # with audio loading

# if you're using zsh, add quotes
pip install 'javad[extras]'
```
Full documentation is available at https://javad.readthedocs.io/en/latest/.
```python
from javad.extras import get_speech_intervals

intervals = get_speech_intervals("path/to/audio.wav")
print(intervals)
```
```python
import torch
from javad import Processor
from javad.extras import load_audio

# Load audio file
audio = load_audio("path/to/audio.wav")

# Initialize Processor with the default 'balanced' model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Processor(device=device)
print(processor)

# Get raw logits
logits = processor.logits(audio).cpu().numpy()
print(logits)

# Get boolean predictions based on a threshold
predictions = processor.predict(audio).cpu().numpy()
print(predictions)

# Get speech intervals
intervals = processor.intervals(audio)
print(intervals)
```
You can increase accuracy by specifying a smaller step size for the sliding window. The smaller the step, the more overlapping predictions are computed and averaged, so inference takes longer but the result is more accurate.
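For example, the step could be passed when constructing the Processor. Note that the `step` keyword and its units below are assumptions for illustration only; check the API reference for the exact parameter name.

```python
import torch
from javad import Processor
from javad.extras import load_audio

audio = load_audio("path/to/audio.wav")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# NOTE: `step` is a hypothetical keyword used here for illustration;
# consult the JaVAD documentation for the actual parameter name and units.
processor = Processor(device=device, step=0.25)  # slide the window every 0.25 s

# Overlapping windows are averaged, trading extra compute for accuracy
intervals = processor.intervals(audio)
print(intervals)
```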
```python
import torch
from javad.stream import Pipeline
from javad.extras import load_audio

# Initialize pipeline
pipeline = Pipeline()  # by default, Pipeline uses the 'balanced' model
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
print(pipeline)

# Load audio file
audio = load_audio("path/to/audio.wav")

# Process audio in chunks
chunk_size = int(pipeline.config.sample_rate * 0.5)  # 0.5-second chunks
for i in range(0, len(audio), chunk_size):
    audio_chunk = audio[i : i + chunk_size]
    predictions = pipeline.intervals(audio_chunk)
    print(predictions)
```
There are two modes for streaming: `instant` and `gradual`. The `instant` mode returns results only for the current chunk pushed into the pipeline, while the `gradual` mode keeps updating and averaging predictions for as long as the chunk remains in the audio buffer. For example, with a chunk size of 0.25s and the balanced model's window size of 1.92s, each chunk receives 8 updates.

TL;DR: use `instant` mode for the fastest response, or `gradual` mode for the most accurate results when streaming.
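As a quick sanity check on that arithmetic (plain Python, not part of the JaVAD API), the number of updates a chunk receives in `gradual` mode is roughly the window length divided by the chunk length, rounded up:

```python
import math

window_sec = 1.92  # 'balanced' model window
chunk_sec = 0.25   # streaming chunk size

# A chunk stays inside the model window for roughly this many pushes,
# so gradual mode can refine its prediction about this many times.
updates = math.ceil(window_sec / chunk_sec)
print(updates)  # 8
```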
```python
import torch
from javad.stream import Pipeline
from javad.extras import load_audio

# Initialize pipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline(device=device)  # by default, Pipeline uses the 'balanced' model

# Take a 0.5-second chunk of audio
audio = load_audio("path/to/audio.wav")
chunk_size = int(pipeline.config.sample_rate * 0.5)  # 0.5-second chunks
audio_chunk = audio[:chunk_size]

# Process the chunk and detect speech (once per stream)
bool_prediction = pipeline.detect(audio_chunk)
print(bool_prediction)

# Reset the Pipeline before starting a new stream
pipeline.reset()
```
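For a live source, the same `detect`/`reset` calls can be driven from a microphone. The sketch below uses the third-party sounddevice package (not a JaVAD dependency) and makes two assumptions: that the pipeline accepts a 1-D float32 NumPy array at 16 kHz, as returned by `load_audio`, and that `detect` may be called once per pushed chunk until it returns True.

```python
import sounddevice as sd
import torch
from javad.stream import Pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline = Pipeline(device=device)

sample_rate = pipeline.config.sample_rate
blocksize = int(sample_rate * 0.5)  # 0.5-second chunks

# Capture mono float32 audio from the default microphone
with sd.InputStream(samplerate=sample_rate, channels=1, dtype="float32") as stream:
    for _ in range(20):  # roughly 10 seconds of audio
        chunk, _overflowed = stream.read(blocksize)
        # chunk has shape (frames, 1); pass a flat 1-D array to the pipeline
        if pipeline.detect(chunk[:, 0]):
            print("speech detected")
            break

pipeline.reset()  # start fresh before the next stream
```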
To re-train or fine-tune a model, the Trainer class provides a complete solution. You will also need to prepare your dataset (see the metadata format below).
```python
import torch
import logging
from javad.extras import Trainer

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    trainer = Trainer(
        run_name="balanced_test",
        dataset_settings={
            "audio_root": "path-to-dir-with-audio",
            "spectrograms_root": "path-to-dir-to-save-generated-spectrograms-aka-cache",
            "index_root": "path-to-dir-to-save-index-files-of-the-dataset",
            "metadata_json_path": "path-to-metadata-json-file",
            "max_memory_cache": 16000,  # allow up to 16 GB of RAM for caching spectrograms
        },
        use_mixed_precision=True,
        use_scheduler=True,
        window_min_content_ratio=0.5,
        window_offset_sec=0.5,
        device=torch.device("cuda:0"),
        learning_rate=1e-4,
        num_workers=2,
        total_epochs=20,
        augmentations={
            "mix_chance": 0.0,
            "erase_chance": 0.0,
            "zero_chance": 0.0,
        },
    )
    trainer.train()
```
```json
{
    "farsi/01.flac": {
        "length": 14393728,
        "intervals": {
            "speech": [
                [22347, 165856],
                [178426, 247214]
            ]
        },
        "metadata": {}
    }
}
```
It is a dictionary covering every audio file in the dataset, keyed as `{relative_to_audio_root_path: data}`, where `data` is a dict with the following fields:
```python
{
    "length": ...,       # length of the audio file in samples
    "intervals": {       # dictionary of interval types, just "speech" in our case
        "speech": ...,   # list of [start, end] speech intervals in samples
    },
    "metadata": {},      # extra data for the SpectrogramDataset class to extract,
                         # such as gender or language; not used in our case
}
```
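As an illustration of how such a metadata file might be assembled, here is a minimal sketch (not part of JaVAD). It assumes you already have per-file speech intervals in samples; the file and directory names are placeholders.

```python
import json
from pathlib import Path

import soundfile as sf

audio_root = Path("path-to-dir-with-audio")

# Hypothetical labels: speech intervals in samples, keyed by path relative to audio_root
labels = {
    "farsi/01.flac": [[22347, 165856], [178426, 247214]],
}

metadata = {}
for rel_path, speech in labels.items():
    info = sf.info(str(audio_root / rel_path))
    metadata[rel_path] = {
        "length": info.frames,            # length in samples
        "intervals": {"speech": speech},
        "metadata": {},                   # optional extras such as gender or language
    }

# Point metadata_json_path in the Trainer settings to the file written here
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=4)
```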
This project is licensed under the MIT License.
If you use this package in your research, please cite it as follows:
```bibtex
@misc{JaVAD,
  author       = {Sergey Skrebnev},
  title        = {JaVAD: Just Another Voice Activity Detector},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/skrbnv/javad}},
}
```
Alternatively, you can use the following BibTeX entry:
```bibtex
@software{JaVAD,
  author = {Sergey Skrebnev},
  title  = {JaVAD: Just Another Voice Activity Detector},
  year   = {2024},
  url    = {https://github.com/skrbnv/javad},
}
```