
Wakeword-benchmark


Made in Vancouver, Canada by Picovoice

The primary purpose of this benchmark framework is to provide a scientific comparison of wake-word detection engines in terms of accuracy and runtime metrics. Currently, the framework is configured for Alexa as the test wake-word, but it can be configured for any other wake-word as described here.

Why did we make this?

The benchmark framework provides a definitive answer to the question "which engine provides the best performance for a given wake-word?". While working on Porcupine, we noted the need for such a tool to empower customers to make data-driven decisions. The framework

  • uses hundreds of crowd-sourced utterances of the wake-word.
  • uses tens of hours of data made publicly available by Common Voice as the background model (i.e. what is not the wake-word and needs to be ignored).
  • allows simulating real-world conditions by adding noise to clean speech.
  • runs wake-word engines at different detection thresholds (aka sensitivities), which results in different miss detection and false alarm rates, and creates an ROC curve for each engine.
  • measures real time factor, CPU usage, and memory usage for each engine on Raspberry Pi 3.


Data

Common Voice is used as the background dataset, i.e., a dataset without utterances of the wake-word. It can be downloaded from here. Only recordings with at least two up-votes and no down-votes are used (this reduces the size of the dataset to ~125 hours).

Furthermore, 369 recordings of the word Alexa from 89 distinct speakers are used. The recordings were crowd-sourced using an Android mobile application and can be downloaded here.

In order to closely simulate real-world situations, the data is mixed with noise. For this purpose, we use the DEMAND dataset, which has noise recordings in 18 different environments (e.g. kitchen, office, traffic). It can be downloaded from here.
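As an illustration of how such mixing works (a simplified sketch, not the repository's mixer), the noise can be scaled so that the combined signal has a desired signal-to-noise ratio:

import numpy as np

def mix(speech, noise, snr_db):
    # speech and noise are float arrays of equal length sampled at 16 kHz
    speech_energy = np.sum(speech ** 2)
    noise_energy = np.sum(noise ** 2)
    # scale noise so that 10 * log10(speech_energy / scaled_noise_energy) == snr_db
    scale = np.sqrt(speech_energy / (noise_energy * (10 ** (snr_db / 10))))
    return speech + scale * noise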

Wake-word engines

Three wake-word engines are used in this benchmark: PocketSphinx, which can be installed via PyPI, and Porcupine and Snowboy, which are included as submodules in this repository.


Metrics

We measure the accuracy of the wake-word engines using false alarm per hour and miss detection rates. False alarm per hour is the number of false positives in an hour of background audio. Miss detection is the percentage of wake-word utterances an engine rejects incorrectly. Using these definitions, we compare the engines at a given false alarm rate; the engine with the lower miss detection rate has the better performance.
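In code, both metrics are simple ratios (a minimal sketch assuming the detection counts have already been collected; these helper names are not part of the framework):

def false_alarm_per_hour(num_false_alarms, background_audio_seconds):
    # number of false positives normalized to one hour of background audio
    return num_false_alarms / (background_audio_seconds / 3600)

def miss_rate(num_missed, num_wake_word_utterances):
    # fraction of wake-word utterances the engine incorrectly rejected
    return num_missed / num_wake_word_utterances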

Two runtime metrics are measured: real time factor and memory usage. Real time factor is computed by dividing the processing time by the length of the input audio; it can be thought of as average CPU usage. The engine with a lower real time factor is more computationally efficient (faster).
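For example, real time factor could be measured along these lines (a sketch; engine.process is a hypothetical call standing in for whatever API a given engine exposes):

import time

def real_time_factor(engine, audio, sample_rate=16000):
    start = time.time()
    engine.process(audio)  # hypothetical: run the engine over the whole recording
    processing_time = time.time() - start
    audio_length_seconds = len(audio) / sample_rate
    return processing_time / audio_length_seconds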



Usage

The benchmark has been developed on Ubuntu 16.04 with Python 3.5. It should be possible to run it on macOS or other Linux distributions, but this has not been tested. Clone the repository using

git clone --recurse-submodules

Install SoX for Ubuntu using

sudo apt-get install sox
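The Python dependencies listed in requirements.txt can be installed with pip (assuming pip3 points at the Python 3 interpreter you will run the benchmark with)

pip3 install -r requirements.txt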

Make sure the Python packages in requirements.txt are properly installed for your Python version. Then install the MP3 handler for SoX

sudo apt-get install libsox-fmt-mp3

Python bindings are used to run Porcupine and Snowboy. Their repositories are cloned under engines. Make sure to follow the instructions in their repositories so that you can run their Python demos before proceeding to the next step.

For memory profiling valgrind is used. It can be installed using

sudo apt-get install valgrind
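As an illustration (not necessarily how the runtime benchmark scripts invoke it), peak memory of a process can be profiled with valgrind's massif tool and inspected with ms_print

valgrind --tool=massif <command to profile>
ms_print massif.out.<pid>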

Running the accuracy benchmark

Usage information can be retrieved via

python -h

The benchmark can be run using the following command from the root directory of the repository

python --common_voice_directory <root directory of Common Voice dataset> --alexa_directory <root directory of Alexa dataset> \
--demand_directory <root directory of Demand dataset>

This runs the benchmark in a clean environment and creates ROC curves for the different engines. When --output_directory <output directory to save the results> is passed on the command line, the framework saves the results in CSV format. This is going to take a while (it takes 48 hours on a quad-core Intel machine).

To run the benchmark in a noisy environment, pass the --add_noise flag. Noise is mixed into the audio samples at an SNR of 10 dB, which simulates environments with moderate noise.

Running the runtime benchmark

Please refer to the runtime documentation.



Results

Below is the result (ROC curve) of running the benchmark framework for clean and noisy environments. As expected, for a given false alarm rate the miss rate increases across the different engines when noise is added to the data.

A more illustrative way of comparing the results is to compare the miss rates at a fixed false alarm per hour value. The engine with the smallest miss rate performs best. This is shown below for the clean speech scenario.

Also shown below is the result in the presence of noise.


Below are the runtime measurements (on a Raspberry Pi 3). Two metrics are measured: (1) real time factor and (2) memory usage. For ease of interpretation, we have also added average CPU usage.

Engine           Real Time Factor   Average CPU Usage   Memory Usage
PocketSphinx     0.32               31.75%              15.58 MB
Snowboy          0.19               18.94%              2.43 MB
Porcupine        0.07               7.39%               1.38 MB
Porcupine Tiny   0.03               3.42%               0.24 MB
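Note that real time factor roughly corresponds to average single-core CPU usage expressed as a percentage (e.g. PocketSphinx's real time factor of 0.32 translates to roughly 32% average CPU usage, close to the 31.75% measured above), assuming the engine runs on a single core.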


How can I reproduce the results?

The results presented above are completely reproducible given that exactly the same datasets and engines are used.

Datasets are taken from the sources listed in the Data section above.

The engines used in the benchmark are:

  • PocketSphinx 0.1.3 from PyPI.
  • Snowboy, cloned from its repository at the commit pinned by the submodule.
  • Porcupine, cloned from its repository at the commit pinned by the submodule.

How can I use the framework for my wake-word?

The framework is currently configured for Alexa as the wake-word. It can be configured for a different wake-word by following the steps below

  1. Collect recordings of your wake-word with a sample rate of 16000 Hz in WAV format.
  2. Implement the Dataset interface for the new collection of wake-word recordings (a sketch is given after this list).
  3. Instantiate the newly-created dataset instead of AlexaDataset.
  4. Instantiate CommonVoiceDataset with the exclude_words constructor parameter set to the new test word.
  5. Make sure all wake-word engines have a model for the newly-selected wake-word. Refer to each engine's documentation and add your models to the engines.
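Below is a hypothetical sketch of steps 2 to 4. The method names size() and get(), the import path, and the constructor arguments are assumptions made for illustration; check dataset.py in the repository for the actual abstract methods of Dataset and the real constructor signatures.

import glob
import os

import soundfile  # any library that reads 16 kHz WAV files works here

from dataset import Dataset, CommonVoiceDataset  # hypothetical import path

class MyWakeWordDataset(Dataset):
    # serves the crowd-sourced recordings of the new wake-word
    def __init__(self, root):
        self._paths = sorted(glob.glob(os.path.join(root, '*.wav')))

    def size(self):
        return len(self._paths)

    def get(self, index):
        pcm, sample_rate = soundfile.read(self._paths[index])
        assert sample_rate == 16000
        return pcm

# step 3: use the new dataset where AlexaDataset was used
wake_word_dataset = MyWakeWordDataset('<root directory of your recordings>')

# step 4: exclude the new word from the background data
background_dataset = CommonVoiceDataset('<root directory of Common Voice dataset>', exclude_words='<your wake-word>')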