Question about ClusteringDiarizer model #2514
-
These are speaker embedding extractor parameters. The window length is the length of the speech signal sent to the speaker embedding extractor for each forward pass, and the shift length is the hop by which the window moves to select the next one. A 1.5 sec window with a 0.75 sec shift should work for most cases; smaller windows give finer-grained embeddings (which can help avoid picking up contiguous segments from different speakers). I would recommend you to try with
There are no best practices for determining these parameters. The speaker embedding extractor works best with a longer speech sample, but a longer window is also more likely to run into the next speaker's contiguous segment, so it's a trade-off.
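To make the window/shift trade-off concrete, here is a minimal Python sketch of how a speech segment could be cut into overlapping windows. It is an illustration only, not NeMo's actual segmentation code: the function names `sliding_windows` and `windows_spanning_change`, the clipping of the last window to the segment end, and the 6-second example with a speaker change at 3 seconds are all assumptions made for the example.

```python
def sliding_windows(duration, window=1.5, shift=0.75):
    """Return (start, end) times in seconds for windows over a speech segment.

    Assumption: the last window is clipped to the end of the segment.
    """
    if duration <= 0 or window <= 0 or shift <= 0:
        raise ValueError("duration, window and shift must be positive")
    windows = []
    i = 0
    while True:
        start = i * shift
        end = min(start + window, duration)
        windows.append((start, end))
        if end >= duration:
            break
        i += 1
    return windows


def windows_spanning_change(windows, change_time):
    """Return the windows that contain speech from both sides of a speaker change."""
    return [(s, e) for (s, e) in windows if s < change_time < e]


if __name__ == "__main__":
    # Hypothetical 6-second recording with a speaker change at 3.0 seconds.
    for window, shift in [(1.5, 0.75), (3.0, 1.5)]:
        wins = sliding_windows(6.0, window, shift)
        mixed = windows_spanning_change(wins, 3.0)
        print(f"window={window}s shift={shift}s: "
              f"{len(wins)} windows, {len(mixed)} contain both speakers")
```

In this toy case the longer window gives each embedding more speech to work with, but a larger share of the windows mixes two speakers, which is the trade-off described above.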
-
Hello,
I'm trying the tutorial tutorials/speaker_recognition/Speaker_Diarization_Inference.ipynb.
It works well at recognizing different voices with the example audio file an4_diarize_test.wav, but it does not always return good results when recognizing different speakers in other audio files.
Checking the configuration file provided in the tutorial, speaker_diarization.yaml, the only two parameters to tune speaker_verification using the out-of-the-box model speakerdiarization_speakernet in the ClusteringDiarizer model seem to be:
How does changing these parameters affect the speaker verification process? What are they used for?
In cases where speaker identification was not accurate, a single stretch of speech from one speaker is often identified as coming from two or more speakers. In those cases, increasing the two parameters seems to improve model performance.
Are there any best practices for determining the correct values for these parameters?
Thank you
Francesco