Speakerbox v1.0.0

@evamaxfield evamaxfield released this 25 Oct 02:42

Speakerbox is a library for few-shot fine-tuning of a Transformer for speaker identification. This initial release has all the functionality needed to quickly generate a training set and fine-tune a model for use in downstream analysis tasks.

Given a set of multi-speaker recordings:

example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav

Where each recording has some or all of a set of speakers, for example:

  • 0.wav -- contains speakers: A, B, C, D, E
  • 1.wav -- contains speakers: B, D, E
  • 2.wav -- contains speakers: A, B, C
  • 3.wav -- contains speakers: A, B, C, D, E
  • 4.wav -- contains speakers: A, C, D
  • 5.wav -- contains speakers: A, B, C, D, E
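
As a minimal sketch, the example layout above can be written down as a plain mapping and inverted to see which recordings contain each speaker. This is not part of the speakerbox API: the names `recording_speakers` and `recordings_per_speaker` are hypothetical and used only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical mapping from file name to the speakers heard in that file,
# copied from the example list above.
recording_speakers: Dict[str, Set[str]] = {
    "0.wav": {"A", "B", "C", "D", "E"},
    "1.wav": {"B", "D", "E"},
    "2.wav": {"A", "B", "C"},
    "3.wav": {"A", "B", "C", "D", "E"},
    "4.wav": {"A", "C", "D"},
    "5.wav": {"A", "B", "C", "D", "E"},
}


def recordings_per_speaker(mapping: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Invert a file -> speakers mapping into speaker -> sorted list of files."""
    index: Dict[str, List[str]] = {}
    for file_name, speakers in mapping.items():
        for speaker in speakers:
            index.setdefault(speaker, []).append(file_name)
    return {speaker: sorted(files) for speaker, files in sorted(index.items())}


by_speaker = recordings_per_speaker(recording_speakers)
for speaker, files in by_speaker.items():
    print(speaker, files)
```

An index like this makes it easy to check that every known speaker appears in more than one recording before any annotation starts.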

You want to train a model to classify portions of audio as one of the N known speakers
in future recordings not included in your original training set.

f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]

e.g. f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]
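
To make the shape of that output concrete, here is a small sketch of how such a list of segments might be consumed downstream. It does not call speakerbox itself: the `Segment` type and `total_speaking_time` helper are hypothetical, and the times are assumed to be in seconds.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    """One labeled span of audio (hypothetical type; times assumed to be seconds)."""

    start_time: float
    end_time: float
    speaker: str


def total_speaking_time(segments: List[Segment]) -> Dict[str, float]:
    """Sum the duration of every segment attributed to each speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for segment in segments:
        totals[segment.speaker] += segment.end_time - segment.start_time
    return dict(totals)


# The example output shown above, written as Segment records.
segments = [
    Segment(2.4, 10.5, "A"),
    Segment(10.8, 14.1, "D"),
    Segment(14.8, 22.7, "B"),
]

speaking_time = total_speaking_time(segments)
for speaker, seconds in sorted(speaking_time.items()):
    print(f"{speaker}: {seconds:.1f}")
```

Keeping each segment as a `(start_time, end_time, speaker)` record means aggregate measures such as per-speaker talk time reduce to a single pass over the list.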

The speakerbox library contains methods both for generating datasets for annotation
and for using multiple audio annotation schemes to train such a model.

[Figure: Speakerbox example workflow]

The following table shows model performance results as the dataset size increases:

dataset_size   mean_accuracy   mean_precision   mean_recall     mean_training_duration_seconds
15-minutes     0.874 ± 0.029   0.881 ± 0.037    0.874 ± 0.029   101 ± 1
30-minutes     0.929 ± 0.006   0.94 ± 0.007     0.929 ± 0.006   186 ± 3
60-minutes     0.937 ± 0.02    0.94 ± 0.017     0.937 ± 0.02    453 ± 7

Please see our documentation for more details.