Speakerbox v1.0.0

Speakerbox is a library for few-shot fine-tuning of a Transformer for speaker identification. This initial release has all the functionality needed to quickly generate a training set and fine-tune a model for use in downstream analysis tasks.

Given a set of multi-speaker recordings:

example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav

Where each recording contains some or all of a known set of speakers, for example:

0.wav -- contains speakers: A, B, C, D, E
1.wav -- contains speakers: B, D, E
2.wav -- contains speakers: A, B, C
3.wav -- contains speakers: A, B, C, D, E
4.wav -- contains speakers: A, C, D
5.wav -- contains speakers: A, B, C, D, E

You want to train a model that classifies portions of audio as one of the N known speakers
in future recordings not included in your original training set:

f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]

e.g. f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]
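
As a rough illustration of how output in this shape might be consumed in downstream analysis, the sketch below sums speaking time per speaker from a list of (start_time, end_time, speaker) tuples. The `Segment` type and the `total_speaking_time` helper are hypothetical names introduced here for illustration and are not part of the speakerbox API; the segment values are the ones from the example above.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical container mirroring (start_time, end_time, speaker)
    start_time: float
    end_time: float
    speaker: str


def total_speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the duration (in seconds) of all segments for each speaker."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end_time - seg.start_time
    return dict(totals)


# Segments taken from the example output above
segments = [
    Segment(2.4, 10.5, "A"),
    Segment(10.8, 14.1, "D"),
    Segment(14.8, 22.7, "B"),
]
totals = total_speaking_time(segments)
```

Because `Segment` is a `NamedTuple`, plain tuples in the same field order can be converted with `Segment(*t)`.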

The speakerbox library contains methods for both generating datasets for annotation
and for utilizing multiple audio annotation schemes to train such a model.

The following table shows model performance results as the dataset size increases:

dataset_size  mean_accuracy  mean_precision  mean_recall    mean_training_duration_seconds
15-minutes    0.874 ± 0.029  0.881 ± 0.037   0.874 ± 0.029  101 ± 1
30-minutes    0.929 ± 0.006  0.94 ± 0.007    0.929 ± 0.006  186 ± 3
60-minutes    0.937 ± 0.02   0.94 ± 0.017    0.937 ± 0.02   453 ± 7
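
To make the trade-off in these results easier to read, the following sketch computes the accuracy gain and the extra training time between consecutive dataset sizes. It uses only the mean values reported above and ignores the ± spreads; the variable and function names are illustrative.

```python
# Mean values copied from the results above:
# (dataset size in minutes, mean accuracy, mean training duration in seconds)
results = [
    (15, 0.874, 101),
    (30, 0.929, 186),
    (60, 0.937, 453),
]


def marginal_gains(rows):
    """Return (from_minutes, to_minutes, accuracy_gain, extra_seconds) per step."""
    gains = []
    # Pair each row with the next one to compare consecutive dataset sizes
    for (m0, acc0, t0), (m1, acc1, t1) in zip(rows, rows[1:]):
        gains.append((m0, m1, round(acc1 - acc0, 3), t1 - t0))
    return gains


for m0, m1, d_acc, d_t in marginal_gains(results):
    print(f"{m0} -> {m1} min: accuracy +{d_acc:.3f}, training +{d_t} s")
```

On these means, going from 15 to 30 minutes adds 0.055 accuracy for 85 extra seconds of training, while going from 30 to 60 minutes adds 0.008 for 267 extra seconds.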

Please see our documentation for more details.
