Adding long-form audio speaker diarization (clustering) class and functions #7737
Conversation
Please see if we can make the general case of number of embeddings < unit_window_len (i.e., n_windows=1) a special case of LongFormClustering. The core algorithm of computing the affinity matrix and performing spectral clustering is re-written; is it possible to reuse it by making modifications in forward_infer of SpeakerClustering?
examples/speaker_tasks/diarization/conf/inference/diar_infer_general.yaml
        Y (LongTensor):
            Speaker labels for the segments in the given input embeddings.
    """
    embeddings_in_scales = param_dict['embeddings']
Will this function be directly called by the user? If so, do we need to add some value-checking statements for param_dict, so that proper error messages are thrown if invalid or missing values are found?
Added a dimension check in the split_input_data function in offline_clustering.py.
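For illustration, a hedged sketch (not the actual NeMo implementation; function and message wording are invented) of the kind of dimension check discussed here: fail early with a clear message instead of a cryptic downstream shape error.

```python
import torch

# Hypothetical validation helper in the spirit of the check added to
# split_input_data in offline_clustering.py.
def check_input_dims(embeddings_in_scales: torch.Tensor,
                     timestamps_in_scales: torch.Tensor) -> None:
    if embeddings_in_scales.dim() != 2:
        raise ValueError(
            f"embeddings_in_scales must be 2-D (n_segments, emb_dim), "
            f"got {embeddings_in_scales.dim()}-D"
        )
    if timestamps_in_scales.dim() != 2 or timestamps_in_scales.shape[1] != 2:
        raise ValueError(
            f"timestamps_in_scales must have shape (n_segments, 2), "
            f"got {tuple(timestamps_in_scales.shape)}"
        )

check_input_dims(torch.randn(100, 192), torch.rand(100, 2))  # valid input passes
```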
    original labels for all `org_len` input embeddings: `Y_unpack`.

    Args:
        embeddings_in_scales (Tensor):
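As a toy illustration of the unpacking idea this docstring describes (all values invented; the real unpack_labels also tracks window ranges and merge mappings), the final per-embedding labels can be recovered by indexing the aggregated labels with each embedding's fine-grained cluster id:

```python
import torch

# Toy example: 6 embeddings were assigned to 3 fine-grained clusters (0..2),
# and clustering those fine-grained clusters produced one speaker label each.
fine_grained_ids = torch.tensor([0, 0, 1, 2, 2, 1])  # per-embedding cluster id
Y_aggr = torch.tensor([0, 1, 0])                     # speaker label per fine-grained cluster

# "Unpack": map aggregated labels back to all org_len original embeddings.
Y_unpack = Y_aggr[fine_grained_ids]
print(Y_unpack.tolist())  # [0, 0, 1, 0, 0, 1]
```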
Based on the docstring, should the type for embeddings_in_scales and timestamp_in_scales be List[torch.Tensor]?
These are passed as a concatenated torch.Tensor, so this form is correct. If the type hint were wrong, the torch JIT compiler would break at this line.
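A small sketch (shapes invented) of why a single concatenated Tensor plus per-scale counts works here: it keeps the forward signature TorchScript-friendly while remaining trivially reversible with torch.split.

```python
import torch

# Two scales of hypothetical 192-dim speaker embeddings.
emb_scale0 = torch.randn(40, 192)
emb_scale1 = torch.randn(60, 192)

# Concatenated into one torch.Tensor (not List[torch.Tensor]),
# accompanied by the per-scale segment counts.
embeddings_in_scales = torch.cat([emb_scale0, emb_scale1], dim=0)
multiscale_segment_counts = torch.tensor([40, 60])

# Splitting recovers the per-scale tensors inside the model.
per_scale = torch.split(embeddings_in_scales,
                        multiscale_segment_counts.tolist(), dim=0)
assert per_scale[0].shape == (40, 192) and per_scale[1].shape == (60, 192)
```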
def forward_infer(
    self,
    embeddings_in_scales: torch.Tensor,
    timestamps_in_scales: torch.Tensor,
Is it List[torch.Tensor] for these two args?
It is passed as concatenated torch Tensor so this form is correct.
If Type hint is wrong, torch jit compiler breaks at this line.
Before checking on real recordings of 4+ hour meetings (which I cannot share), I multiplied a short video with 5 speakers (on which NeMo previously had great results); on that multiplied video, the long-form speaker clustering estimated only 2 speakers. I don't know why. rttm: (links may expire in 7 days)
Hi. Try changing the following parameters in
If you increase I have to inform you that building a system that counts the number of speakers in a very long audio recording is an extremely challenging task, since we do not have data points to tune it properly.
@tango4j
Very nice to see this PR being merged. I wanted to test it, but I am unable to figure out how (or if) it is possible to use this with the I am experiencing OOM problems with that approach and was hoping this might offer a solution.
Ditto on the above comment. We are actually using the ClusteringDiarizer directly and getting CUDA memory issues (on an AWS g5 instance). Code is something like: sd_model = ClusteringDiarizer(cfg=config).to(device) I tried cloning / pip installing directly from the main branch, but it does not seem to be calling this new class. So the question is: how can we use this class / functionality on long audio files > 8 hours?
Actually, you can ignore my previous comment. It was my mistake: I had a pre-existing installation of NeMo, and the version number does not seem to have been updated on the main branch, so running the install did nothing. I recreated the venv, installed from source, and now I have the latest code running. For others, the solution (for now, until it's released) is to:
I have run the diarizer on an 8-hour audio file, and it ran in under 10 minutes with under 10 GB of GPU memory. So great work :)
@lasmith
How much CPU RAM did this use?
…ctions (NVIDIA#7737)

* Adding long-form audio clustering for diarization
* Adding unit test changes
* Added tests for torch jit script
* Added variable value checking line
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Added needed params to all yamls
* Merged latest main and updated speaker utils
* Fixed code formatting error in speaker_utils.py
* Some minor fixes for doc-strings
* Removed unnecessary comments
* Reflected comments and made changes
* Minor changes on typos and comments
* Fixes for code QL
* Fixed docstring errors
* Reflected the second batch of comments
* Updating all yamls for inference
* Added None-checker to forward to prevent type errors

Signed-off-by: Taejin Park <tango4j@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@lasmith Would you mind sharing how you were able to adapt the config or the code you posted to use LongFormSpeakerClustering? I keep trying to read the code and get lost in all the classes and configs. Or is the long-form version somehow the default now?
@mfalt
Yes, it's the default now. The instructions above are no longer needed, as the code has been merged into master and released. So all you need to do is install the latest version, e.g.
To use long-form audio clustering, you just need to add "chunk_cluster_count" and "embeddings_per_chunk" to your config. See here for an example. To answer the question on accuracy: in general, accuracy is identical for shorter audio (we have not found any issues). We had an issue raised for longer audio. We have audio where we don't know the number of speakers a priori, and had set "max_num_speakers" to 8. On longer audio with more than 8 speakers, this resulted in all the audio being assigned to the same speaker (so only 1 speaker detected). Increasing the limit resolved this. Not sure if this is also true on shorter audio with lots of speakers, as it may not be specific to long-form.
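For reference, a hedged sketch of the clustering parameters discussed in this thread, written as a plain Python dict (parameter names come from the comments above; in practice they live in the diarizer's YAML config, and the values here are illustrative, not authoritative defaults):

```python
# Illustrative values only; check your NeMo version's diar_infer_*.yaml.
clustering_parameters = {
    "oracle_num_speakers": False,   # speaker count unknown a priori
    "max_num_speakers": 20,         # raised from 8, as suggested above
    "chunk_cluster_count": 50,      # fine-grained clusters per chunk
    "embeddings_per_chunk": 10000,  # window size for long-form clustering
}

# Memory-reduction factor mentioned in the PR description:
# embeddings_per_chunk / chunk_cluster_count.
factor = (clustering_parameters["embeddings_per_chunk"]
          // clustering_parameters["chunk_cluster_count"])
print(factor)  # 200
```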
Hello @lasmith
The setting I changed was "max_num_speakers", which I increased from 8 to 20. The other settings you mention seem to only affect memory usage. However, the values we have used were:
Whilst these seem to alleviate the issue, they do not get rid of it completely. I have not had much time to investigate this further, as fortunately we don't get many long-form audio files and it seems to only affect files above X hours which have large numbers of speakers. So I am not sure of the exact cause at the moment, but my suspicion is that it's something to do with the number of speakers and how these are detected. FYI, for testing we are using UK parliament recordings of Hansard (see here), as these are the closest public-domain data to our legal use case.
What does this PR do?
This PR adds a long-form audio speaker diarization feature via a new LongFormSpeakerClustering class.
Two parameters are added to the YAML files for long-form audio clustering.
This is the forward function for long-form speaker clustering. Please refer to the SpeakerClustering class for the original argument information. Please read the following text to understand the long-form clustering.

In the long_forward_infer function of the LongFormSpeakerClustering class:
1. The input embeddings are divided into smaller windows of size unit_window_len. This is usually on the order of 10K; unit_window_len=10000 is the default value.
2. Each window is over-clustered into sub_cluster_n fine-grained clusters. Default sub_cluster_n=50.
3. The fine-grained cluster results across windows are merged and clustered again to produce the aggregated labels Y_aggr.
4. The unpack_labels function is then employed to map the aggregated labels Y_aggr back to the original labels for all org_len input embeddings: Y_unpack.
5. Memory reduction: long-form clustering reduces memory usage by a factor of unit_window_len/sub_cluster_n. For instance, by default, if the original clustering hits the memory limit at the 1-hour mark, the long-form clustering could handle up to 20 hours without exhausting the memory.

Unit tests for the forward_unit_infer clustering execution function and the unpack_labels function are added to test_diar_utils.py.
(1) The forward_unit_infer function performs a simple 1-scale clustering without time-stamp processing.
(2) LongFormSpeakerClustering has a forward_infer function that chooses one of these two functions:
-- long_forward_infer if the input number of segments is larger than self.unit_window_len.
-- short_forward_infer if the input number of segments is smaller than or equal to self.unit_window_len.
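The long/short dispatch described above can be sketched as follows (simplified, hypothetical signatures; the real forward_infer also takes multi-scale timestamps and session parameters, and the placeholder label tensors stand in for actual clustering results):

```python
import torch

class LongFormSpeakerClusteringSketch:
    """Simplified illustration of the forward_infer dispatch."""

    def __init__(self, unit_window_len: int = 10000):
        self.unit_window_len = unit_window_len

    def long_forward_infer(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Real path: split into windows -> over-cluster each window ->
        # merge -> unpack_labels back to per-embedding labels.
        return torch.zeros(embeddings.shape[0], dtype=torch.long)

    def short_forward_infer(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Real path: ordinary short-form clustering.
        return torch.zeros(embeddings.shape[0], dtype=torch.long)

    def forward_infer(self, embeddings: torch.Tensor) -> torch.Tensor:
        # Long-form path only when the segment count exceeds unit_window_len.
        if embeddings.shape[0] > self.unit_window_len:
            return self.long_forward_infer(embeddings)
        return self.short_forward_infer(embeddings)
```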
.Collection: ASR - Speaker Tasks - Speaker Diarization
Changelog
Updated files:
LongFormSpeakerClustering
Usage
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the ASR