
Bespoken Benchmarking Project

This is Bespoken's open-source benchmarking project. It provides a general mechanism for testing and evaluating NLP platforms.

We have conducted two tests so far: a General Knowledge test and a Speech Recognition test (see the results table below).

Process

We interact with the voice assistants using the Bespoken Device Service, which allows us to interact exactly as a real person would with an actual device. Read more here.

For running the tests and collecting the results, we leverage our batch testing framework:
https://gitlab.com/bespoken/batch-tester

Benchmark Results

Results are intended to be published on a bi-monthly basis. The table below summarizes our tests and results to date:

| Date       | Test Type          | Data Set     | Platforms                                       | Results |
|------------|--------------------|--------------|-------------------------------------------------|---------|
| 7/26/2020  | General Knowledge  | ComQA        | Alexa, Google Assistant, Siri                   | Link    |
| 11/20/2020 | Speech Recognition | DefinedCrowd | Amazon Connect, Google Dialogflow, Twilio Voice | Link    |

The published results are viewable here:
https://benchmark.bespoken.io

Methodology

General Knowledge

We classify a response as correct or incorrect based on whether it contains the expected answer from the dataset.

Where the dataset lists multiple acceptable answers, we count the response as correct if any one of them is present, as sketched below. Read more here
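
As an illustration, here is a minimal sketch of that matching rule. The function name, normalization, and substring comparison are assumptions made for clarity, not the project's actual code:

```typescript
// Hypothetical sketch of the answer-matching rule described above.
// Lowercasing and substring matching are assumptions; the project's
// actual comparison logic may differ.
function isCorrect(response: string, expectedAnswers: string[]): boolean {
  const normalized = response.toLowerCase();
  // The response counts as correct if ANY of the dataset's accepted
  // answers appears within it.
  return expectedAnswers.some((answer) =>
    normalized.includes(answer.toLowerCase().trim())
  );
}

// Example: a ComQA-style question with multiple accepted answers.
const response = "Mark Twain wrote The Adventures of Tom Sawyer.";
console.log(isCorrect(response, ["Mark Twain", "Samuel Clemens"])); // true
```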

Speech Recognition Accuracy

We take datasets from DefinedCrowd and run them through the various platforms using our Virtual Devices for IVR. Read more here
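
The README does not specify the scoring metric, but word error rate (WER) is the conventional measure of speech recognition accuracy. A minimal sketch, assuming a standard WER computed via word-level edit distance (an illustrative assumption, not the project's confirmed method):

```typescript
// Hypothetical word error rate (WER) computation: edit distance between
// the reference transcript and the platform's hypothesis, divided by the
// number of reference words. Assumes a non-empty reference.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // Levenshtein distance over words: substitutions, insertions, deletions.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // deletion
        d[i][j - 1] + 1,       // insertion
        d[i - 1][j - 1] + cost // substitution
      );
    }
  }
  return d[ref.length][hyp.length] / ref.length;
}

// Example: one substitution ("their" for "there") in a 4-word reference.
console.log(wordErrorRate("put it over there", "put it over their")); // 0.25
```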

Contact

We appreciate all feedback. Open an issue to suggest additional datasets as well as improvements to our methodology.

Contact us at contact@bespoken.io.