Question about ClusteringDiarizer model #2514

francescodaq · 2021-07-20T09:25:08Z

Hello,

I'm trying the tutorial

tutorials/speaker_recognition/Speaker_Diarization_Inference.ipynb

It works well at recognizing different voices in the example audio file an4_diarize_test.wav, but it does not always return good results when recognizing different speakers in other audio files.

Looking at the configuration file provided in the tutorial, speaker_diarization.yaml, the only two parameters available to tune speaker_verification with the out-of-the-box speakerdiarization_speakernet model in ClusteringDiarizer seem to be:

    window_length_in_sec: 1.5 # window length in sec for speaker embedding extraction
    shift_length_in_sec: 0.75 # shift length in sec for speaker embedding extraction

How does changing these parameters affect the speaker verification process? What are they used for?

In cases where speaker identification was not accurate, a single stretch of speech from one speaker is often identified as coming from two or more speakers. In those cases, increasing these two parameters seems to improve model performance.

Are there some best practices to determine the correct value for these params?

Thank you
Francesco

nithinraok · 2021-07-20T17:05:09Z

Maintainer

How does changing these parameters affect the speaker verification process? What are they used for?

These are speaker embedding extractor parameters. The window length is the length of the speech signal sent to the speaker embedding extractor for each forward pass, and the shift length is the hop used to move on and choose the next window. A 1.5 sec window with a 0.75 sec shift should work for most cases. Smaller windows give finer-grained embeddings, which can help avoid a single window picking up contiguous speech segments from different speakers. I would also recommend trying the speakerverification_speakernet model; it is best suited to normal non-telephonic conversations.
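To make the window/shift relationship concrete, here is a minimal sketch of how a speech segment could be cut into overlapping windows. It is only an illustration, not NeMo's actual segmentation code: the function name `sliding_windows` and the way the last window is clipped to the segment end are assumptions.

```python
def sliding_windows(duration, window=1.5, shift=0.75):
    """Return (start, end) times in seconds of the windows covering [0, duration].

    Hypothetical helper for illustration only; the real pipeline may treat
    the tail of a segment differently.
    """
    windows = []
    i = 0
    while True:
        start = i * shift
        if start >= duration:
            break
        # Clip the last window so it does not run past the segment.
        end = min(start + window, duration)
        windows.append((start, end))
        if end >= duration:
            break
        i += 1
    return windows


if __name__ == "__main__":
    # Each window would be passed to the embedding extractor once.
    for start, end in sliding_windows(3.0, window=1.5, shift=0.75):
        print(f"{start:.2f}s -> {end:.2f}s")
```

With the default values, consecutive windows overlap by half their length, so every part of the segment away from the edges is seen by two windows.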

Are there some best practices to determine the correct value for these params?

There are no established best practices for choosing these parameters. The speaker embedding extractor works best when it is given a longer speech sample, but a longer window is also more likely to run across a speaker change and pick up the next speaker's contiguous segment as well, so it is a trade-off.
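The trade-off can be sketched with a toy calculation: for a recording with known speaker change points, count how many windows contain a change inside them. This is a rough illustration under stated assumptions, not a NeMo utility: `mixed_window_fraction` is a hypothetical helper, and the single change point at 4.0 sec in an 8 sec recording is made-up example data.

```python
def mixed_window_fraction(change_points, duration, window, shift):
    """Fraction of windows that contain a speaker change strictly inside them.

    Hypothetical helper for illustration only.
    """
    total = 0
    mixed = 0
    i = 0
    while True:
        start = i * shift
        if start >= duration:
            break
        end = min(start + window, duration)
        total += 1
        # A window is "mixed" if a speaker change falls strictly inside it.
        if any(start < c < end for c in change_points):
            mixed += 1
        if end >= duration:
            break
        i += 1
    return mixed / total if total else 0.0


if __name__ == "__main__":
    changes = [4.0]  # made-up example: one speaker change at 4.0 sec
    for window, shift in [(1.5, 0.75), (3.0, 1.5)]:
        frac = mixed_window_fraction(changes, 8.0, window, shift)
        print(f"window={window}s shift={shift}s mixed fraction={frac:.2f}")
```

In this toy setup, doubling the window and shift doubles the share of windows that straddle the speaker change, which is the cost paid for giving the extractor more speech per embedding.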
