Speaker Diarization Benchmark

Made in Vancouver, Canada by Picovoice

This repo is a minimalist and extensible framework for benchmarking different speaker diarization engines.

Data

VoxConverse

VoxConverse is a well-known dataset in the speaker diarization field, featuring speakers conversing in multiple languages. Because this benchmark includes cloud-based Speech-to-Text engines with speaker diarization capabilities, we use only the English subset of the dataset's test section.

Setup

Clone the VoxConverse repository. This repository contains only the labels in the form of .rttm files.
Download the test set from the links provided in the README.md file of the cloned repository and extract the downloaded files.
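
Each line of an .rttm label file describes one speaker turn. As a rough illustration of what the labels contain, the sketch below parses such lines into speaker turns. It assumes the standard ten-field RTTM layout (type, file ID, channel, onset, duration, two unused fields, speaker name, two unused fields); the names `Turn` and `parse_rttm` are illustrative and are not part of this repository.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    file_id: str
    speaker: str
    start: float  # seconds
    end: float  # seconds


def parse_rttm(text: str) -> List[Turn]:
    """Parse RTTM text into speaker turns, skipping blank and non-SPEAKER lines."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])  # turn onset in seconds
        duration = float(fields[4])  # turn duration in seconds
        turns.append(Turn(fields[1], fields[7], start, start + duration))
    return turns


# Made-up example line, not taken from the dataset.
sample = "SPEAKER abcde 1 0.50 2.25 <NA> <NA> spk00 <NA> <NA>\n"
for turn in parse_rttm(sample):
    print(turn)
```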

Metrics

Diarization Error Rate (DER)

The Diarization Error Rate (DER) is the most common metric for evaluating speaker diarization systems. It sums the durations of three distinct error types (speaker confusion, false alarms, and missed detections) and divides that total by the overall duration of reference speech.
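
As a worked example of this ratio, the sketch below computes DER from the three error durations and the duration used as the denominator. The function name and the numbers are illustrative assumptions; this is not the scoring code used by the benchmark.

```python
def diarization_error_rate(
    confusion: float, false_alarm: float, missed: float, total_duration: float
) -> float:
    """Return DER as a fraction; all arguments are durations in seconds."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return (confusion + false_alarm + missed) / total_duration


# Made-up durations: 60 seconds of errors over 600 seconds.
der = diarization_error_rate(
    confusion=30.0, false_alarm=12.0, missed=18.0, total_duration=600.0
)
print(f"{der:.1%}")  # 10.0%
```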

Jaccard Error Rate (JER)

The Jaccard Error Rate (JER) is a more recent metric for evaluating speaker diarization, introduced for the Second DIHARD Challenge (DIHARD II). It is based on the Jaccard similarity index, which measures the similarity between two sets of segments. In short, JER gives every speaker equal weight, regardless of how long each one speaks. For a more in-depth understanding, refer to the DIHARD II paper.
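
To make the equal weighting concrete, the sketch below computes a per-speaker Jaccard error (one minus the Jaccard index) and averages it over speakers. It is a simplification under stated assumptions: the audio is discretized into frame indices, and each reference speaker has already been paired with a system speaker, whereas the full metric also has to find that pairing. The function name and data are illustrative.

```python
from typing import Dict, Set, Tuple


def jaccard_error_rate(pairs: Dict[str, Tuple[Set[int], Set[int]]]) -> float:
    """Average per-speaker Jaccard error.

    `pairs` maps each reference speaker to a tuple of
    (reference frames, frames of the system speaker matched to it).
    """
    errors = []
    for ref_frames, sys_frames in pairs.values():
        union = ref_frames | sys_frames
        if not union:
            continue  # speaker has no frames at all; nothing to score
        overlap = ref_frames & sys_frames
        errors.append(1.0 - len(overlap) / len(union))
    if not errors:
        raise ValueError("no speakers to score")
    return sum(errors) / len(errors)


pairs = {
    # Long speaker, mostly correct: overlap 90 of 100 frames, error 0.1.
    "A": (set(range(0, 100)), set(range(0, 90))),
    # Short speaker, poorly matched: overlap 5 of 15 frames, error 2/3.
    "B": (set(range(100, 110)), set(range(105, 115))),
}
print(f"{jaccard_error_rate(pairs):.1%}")  # 38.3%
```

Speaker B accounts for only 10 of the 110 reference frames, yet contributes half of the final score, which is the behavior that distinguishes JER from a duration-weighted metric.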

Total Memory Usage

This metric provides insight into the memory consumption of the diarization engine during its processing of audio files. It presents the total memory utilized, measured in gigabytes (GB).
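
One way to picture this measurement is a loop that samples memory usage while the engine runs and keeps the largest value seen. The sketch below shows that idea only; it is not the repository's mem_monitor.py. The two callables are hypothetical hooks that the caller would supply (for instance, one that reads the engine process's resident memory), and the conversion assumes 1 GB = 1024^3 bytes.

```python
import time
from typing import Callable


def peak_memory_gb(
    sample_bytes: Callable[[], int],
    is_running: Callable[[], bool],
    interval_sec: float = 0.1,
) -> float:
    """Poll `sample_bytes` while `is_running` is true and return the peak in GB."""
    peak = 0
    while is_running():
        peak = max(peak, sample_bytes())
        time.sleep(interval_sec)
    return peak / 1024 ** 3


# Demonstration with made-up readings instead of a real process.
readings = [512 * 1024 ** 2, 1536 * 1024 ** 2, 768 * 1024 ** 2]
peak = peak_memory_gb(
    sample_bytes=lambda: readings.pop(0),
    is_running=lambda: len(readings) > 0,
    interval_sec=0.0,
)
print(f"{peak:.1f} GB")  # 1.5 GB
```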

Core-Hour

The Core-Hour metric is used to evaluate the computational efficiency of the diarization engine, indicating the number of hours required to process one hour of audio on a single CPU core.
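
Read literally, this definition is a ratio of CPU time to audio duration. The sketch below measures it for an arbitrary callable, assuming the engine runs inside the current Python process; `time.process_time()` does not count child processes, so an engine that spawns workers would need a different measurement. The function name is an illustrative assumption and this is not the benchmark's own timing code.

```python
import time
from typing import Callable


def core_hours_per_audio_hour(
    process: Callable[[], None], audio_seconds: float
) -> float:
    """Run `process` and return CPU time spent per unit of audio duration.

    The CPU time of the current process is summed over all threads, so the
    ratio corresponds to the time a single core would have needed.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.process_time()
    process()
    cpu_seconds = time.process_time() - start
    return cpu_seconds / audio_seconds


# Stand-in workload; a real run would call the diarization engine here.
ratio = core_hours_per_audio_hour(lambda: sum(range(1_000_000)), audio_seconds=60.0)
print(f"{ratio:.6f} core-hours per hour of audio")
```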

Note

Total Memory Usage and Core-Hour metrics are not applicable to cloud-based engines.

Engines

Amazon Transcribe
Azure Speech-to-Text
Google Speech-to-Text
Picovoice Falcon
pyannote.audio

Usage

This benchmark has been developed and tested on Ubuntu 20.04 using Python 3.8.

Set up your dataset as described in the Data section.
Install the requirements:

pip3 install -r requirements.txt

In the commands that follow, replace ${DATASET} with a supported dataset, ${DATA_FOLDER} with the path to the dataset folder, and ${LABEL_FOLDER} with the path to the label folder. For further details, refer to the Data section. Replace ${TYPE} with ACCURACY, CPU, or MEMORY to run the accuracy, CPU, or memory benchmark, respectively.

python3 benchmark.py \
--type ${TYPE} \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine ${ENGINE} \
...
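
When scripting several runs, the same invocation can be assembled from Python. The sketch below only builds the argument list from the flags shown above and prints it; `build_benchmark_command` is a hypothetical helper, not part of this repository, and the dataset name, folder paths, and access key in the example are placeholders.

```python
import shlex
from typing import Dict, List, Optional

VALID_TYPES = {"ACCURACY", "CPU", "MEMORY"}


def build_benchmark_command(
    benchmark_type: str,
    dataset: str,
    data_folder: str,
    label_folder: str,
    engine: str,
    extra_flags: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the benchmark.py argument list; engine-specific flags go in extra_flags."""
    if benchmark_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    command = [
        "python3", "benchmark.py",
        "--type", benchmark_type,
        "--dataset", dataset,
        "--data-folder", data_folder,
        "--label-folder", label_folder,
        "--engine", engine,
    ]
    for flag, value in (extra_flags or {}).items():
        command += [flag, value]
    return command


command = build_benchmark_command(
    benchmark_type="ACCURACY",
    dataset="MY_DATASET",  # placeholder for a supported dataset name
    data_folder="/path/to/data",  # placeholder path
    label_folder="/path/to/labels",  # placeholder path
    engine="PICOVOICE_FALCON",
    extra_flags={"--picovoice-access-key": "MY_ACCESS_KEY"},  # placeholder key
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.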

For the memory benchmark, you should also run mem_monitor.py in a separate terminal window. This script will monitor the memory usage of the diarization engine.

python3 mem_monitor.py --engine ${ENGINE}

When the benchmark is complete, press Ctrl + C to stop the memory monitor.

Additionally, specify the desired engine using the --engine flag. For instructions on each engine and the required flags, consult the section below.

Amazon Transcribe Instructions

Create an S3 bucket. Then, substitute ${AWS_PROFILE} with your AWS profile name and ${AWS_S3_BUCKET_NAME} with the created S3 bucket name.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine AWS_TRANSCRIBE \
--aws-profile ${AWS_PROFILE} \
--aws-s3-bucket-name ${AWS_S3_BUCKET_NAME}

Azure Speech-to-Text Instructions

First, generate a client library for the Speech to Text REST API, as outlined in the documentation.

Then, create an Azure storage account and container, and replace ${AZURE_STORAGE_ACCOUNT_NAME} with your Azure storage account name, ${AZURE_STORAGE_ACCOUNT_KEY} with your Azure storage account key, and ${AZURE_STORAGE_CONTAINER_NAME} with your Azure storage container name.

Finally, replace ${AZURE_SUBSCRIPTION_KEY} with your Azure subscription key and ${AZURE_REGION} with your Azure region.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-storage-account-name ${AZURE_STORAGE_ACCOUNT_NAME} \
--azure-storage-account-key ${AZURE_STORAGE_ACCOUNT_KEY} \
--azure-storage-container-name ${AZURE_STORAGE_CONTAINER_NAME} \
--azure-subscription-key ${AZURE_SUBSCRIPTION_KEY} \
--azure-region ${AZURE_REGION}

Google Speech-to-Text Instructions

Create a Google Cloud Storage bucket. Then, replace ${GCP_CREDENTIALS} with the path to your GCP credentials file (.json) and ${GCP_BUCKET_NAME} with your GCP bucket name.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--gcp-credentials ${GCP_CREDENTIALS} \
--gcp-bucket-name ${GCP_BUCKET_NAME}

To utilize the enhanced model, replace the GOOGLE_SPEECH_TO_TEXT engine with GOOGLE_SPEECH_TO_TEXT_ENHANCED.

Picovoice Falcon Instructions

Replace ${PICOVOICE_ACCESS_KEY} with your AccessKey obtained from the Picovoice Console.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine PICOVOICE_FALCON \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

pyannote.audio Instructions

Obtain an authentication token for downloading the pretrained models by visiting the pyannote.audio Hugging Face page. Then replace ${PYANNOTE_AUTH_TOKEN} with that token.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine PYANNOTE \
--pyannote-auth-token ${PYANNOTE_AUTH_TOKEN}

Results

Measurements were carried out on an Ubuntu 20.04 machine with an AMD CPU (AMD Ryzen 7 5700X (16) @ 3.400 GHz), 64 GB of RAM, and NVMe storage.

Diarization Error Rate (DER)

Engine	VoxConverse (English)
Amazon	11.1%
Azure	15.7%
Google	50.2%
Google - Enhanced	24.0%
Picovoice Falcon	10.3%
pyannote.audio	9.0%

Jaccard Error Rate (JER)

Engine	VoxConverse (English)
Amazon	29.8%
Azure	30.1%
Google	83.4%
Google - Enhanced	57.6%
Picovoice Falcon	19.9%
pyannote.audio	27.4%

Total Memory Usage

To obtain these results, we ran the benchmark across the entire VoxConverse dataset and recorded the maximum memory usage during that period. Because the conversations vary in length and number of speakers, this method provides a reliable estimate of each engine's memory usage.

Engine	Memory Usage (GB)
pyannote.audio	1.5
Picovoice Falcon	0.1

Core-Hour

Engine	Core-Hour
pyannote.audio	442
Picovoice Falcon	4
