# Wake Word Benchmark


Made in Vancouver, Canada by Picovoice

The purpose of this benchmarking framework is to provide a scientific comparison between different wake word detection engines in terms of accuracy and runtime metrics. While working on Porcupine, we noted a need for such a tool to empower customers to make data-driven decisions.

## Data

LibriSpeech (test_clean portion) is used as the background dataset. It can be downloaded from OpenSLR.

Furthermore, more than 300 recordings of six keywords (alexa, computer, jarvis, smart mirror, snowboy, and view glass) from more than 50 distinct speakers are used. The crowd-sourced recordings are stored within this repository, under the audio directory.

In order to simulate real-world situations, the data is mixed with noise (at 10 dB SNR). For this purpose, we use the DEMAND dataset, which has noise recordings of 18 different environments (e.g., kitchen, office, traffic). It can be downloaded from Kaggle.
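Mixing at a fixed SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below illustrates the idea; the function name and the tiling strategy are illustrative assumptions, not the repository's actual mixer.py implementation.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a speech signal with noise at a target SNR in dB (illustrative sketch).

    Both inputs are 1-D float arrays; noise is tiled/truncated to match speech."""
    if len(noise) < len(speech):
        # Repeat the noise recording until it covers the speech signal.
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that speech_power / (scale^2 * noise_power) == 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

With `snr_db=10`, the added noise carries one tenth of the speech signal's power.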

## Wake Word Engines

Three wake-word engines are used: PocketSphinx, which can be installed via PyPI, and Porcupine and Snowboy, which are included as submodules in this repository. The Snowboy engine has an audio frontend component, which is not normally a part of a wake word engine but rather a separate stage of the audio processing chain; the other two engines have no such component. We enabled this component in Snowboy for this benchmark, as this is the optimal way of running it.

## Metric

We measure the accuracy of the wake word engines using two metrics: false alarms per hour and miss detection rate. False alarms per hour is the number of false positives in an hour of background audio. Miss detection rate is the percentage of wake word utterances an engine incorrectly rejects. Using these definitions, we compare the engines at a given false alarm rate; the engine with the lower miss detection rate performs better.
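Under these definitions, both accuracy metrics reduce to simple ratios over the detection counts. A minimal sketch (function names are illustrative, not taken from benchmark.py):

```python
def false_alarms_per_hour(num_false_alarms, background_audio_sec):
    """False positives per hour of background audio."""
    return num_false_alarms / (background_audio_sec / 3600.0)

def miss_rate(num_missed, num_utterances):
    """Fraction of wake word utterances the engine incorrectly rejected."""
    return num_missed / num_utterances
```

For example, 5 false alarms over 10 hours of background audio gives 0.5 false alarms per hour, and 2 misses out of 100 utterances gives a 2% miss rate.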

The measured runtime metric is real time factor. Real time factor is computed by dividing the processing time by the length of the input audio. It can be thought of as average CPU usage. The engine with a lower real time factor is more computationally efficient (faster).
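Real time factor can be estimated by timing an engine over an audio buffer of known duration. A rough sketch, where `engine_process` is an illustrative stand-in for an engine's processing call, not an actual engine API:

```python
import time

def real_time_factor(engine_process, audio, sample_rate):
    """Processing time divided by audio duration; lower means faster."""
    start = time.perf_counter()
    engine_process(audio)
    elapsed = time.perf_counter() - start
    # Audio duration in seconds is the sample count over the sample rate.
    return elapsed / (len(audio) / sample_rate)
```

An engine that takes 0.1 s to process 1 s of audio has a real time factor of 0.1, i.e., roughly 10% average CPU usage.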

## Usage

### Prerequisites

The benchmark has been developed on Ubuntu 18.04 with Python 3.6. Clone the repository using:

```bash
git clone --recurse-submodules git@github.com:Picovoice/wakeword-benchmark.git
```

Make sure the Python packages in requirements.txt are properly installed for your Python version, as Python bindings are used to run the engines. The repositories for Porcupine and Snowboy are cloned into the engines directory. Follow the instructions in their repositories and make sure you can run their Python demos before proceeding to the next step.

### Running the Accuracy Benchmark

Usage information can be retrieved via:

```bash
python benchmark.py -h
```

The benchmark can be run using the following command from the root of the repository:

```bash
python benchmark.py --librispeech_dataset_path ${LIBRISPEECH_DATASET_PATH} --demand_dataset_path ${DEMAND_DATASET_PATH} --keyword ${KEYWORD}
```

### Running the Runtime Benchmark

Refer to the documentation in the runtime directory.

## Results

### Accuracy

Below is the result of running the benchmark framework, averaged over six different keywords. The plot shows the miss rate of each engine at 1 false alarm per 10 hours. The lower the miss rate, the more accurate the engine.

### Runtime

Below are the runtime measurements on a Raspberry Pi 3. For Snowboy, the runtime highly depends on the keyword; therefore, we measured the CPU usage for each keyword and used the average.
