
Adding long-form audio speaker diarization (clustering) class and functions #7737

Merged

tango4j merged 40 commits into NVIDIA:main from tango4j:long_clus on Nov 7, 2023

Conversation

@tango4j (Collaborator) commented Oct 17, 2023

What does this PR do ?

This PR adds a long-form audio speaker diarization feature by adding the LongFormSpeakerClustering class.
Two parameters are added to the yaml files for long-form audio clustering.

This PR adds the forward function for long-form speaker clustering.
Please refer to the SpeakerClustering class for the original argument information.

To understand the long-form clustering, please read the following description of the long_forward_infer function in the LongFormSpeakerClustering class:

  1. Input embeddings are divided into smaller windows of size unit_window_len. This is usually on the order of 10K; the default is unit_window_len=10000.
  2. Each window undergoes overclustering, resulting in sub_cluster_n fine-grained clusters (default: sub_cluster_n=50).
  3. These fine-grained clusters are merged to form the aggregated clustering labels Y_aggr.
  4. The unpack_labels function then maps the aggregated labels Y_aggr back to the original labels for all org_len input embeddings: Y_unpack.
  • Theoretically, the maximum length of audio can be extended by a factor of unit_window_len/sub_cluster_n. For instance, with the default values, if the original clustering hits the memory limit at the 1-hour mark, the long-form clustering could handle up to 20 hours without exhausting the memory.
  • The added long-form audio clustering reuses (repurposes) the existing online and offline clustering functions, except for unpack_labels.
  • Tests for the main forward_unit_infer clustering execution function and the unpack_labels function are added to test_diar_utils.py.
    (1) The forward_unit_infer function performs a simple 1-scale clustering without time-stamp processing.
    (2) LongFormSpeakerClustering has a forward_infer function that chooses one of these two paths (a minimal sketch follows this list):
    -- long_forward_infer if the input number of segments is larger than self.unit_window_len;
    -- short_forward_infer if the input number of segments is smaller than or equal to self.unit_window_len.
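A minimal, self-contained sketch of this divide-and-conquer flow (illustrative only, not the NeMo implementation: it substitutes a tiny k-means for NeMo's actual spectral clustering, and the names long_form_cluster and kmeans are hypothetical):

import torch

def kmeans(x: torch.Tensor, k: int, iters: int = 20) -> torch.Tensor:
    """Tiny k-means used here as a stand-in for NeMo's spectral clustering."""
    centers = x[torch.randperm(x.shape[0])[:k]].clone()
    labels = torch.zeros(x.shape[0], dtype=torch.long)
    for _ in range(iters):
        labels = torch.cdist(x, centers).argmin(dim=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    return labels

def long_form_cluster(
    emb: torch.Tensor,             # (org_len, dim) speaker embeddings
    n_speakers: int,
    unit_window_len: int = 10000,  # window size (default from this PR)
    sub_cluster_n: int = 50,       # forced clusters per window
) -> torch.Tensor:
    org_len = emb.shape[0]
    if org_len <= unit_window_len:
        # short_forward_infer path: one-shot clustering of everything.
        return kmeans(emb, n_speakers)

    per_window, centroids = [], []
    for start in range(0, org_len, unit_window_len):
        window = emb[start:start + unit_window_len]
        k = min(sub_cluster_n, window.shape[0])
        labels = kmeans(window, k)  # step 2: overcluster each window
        per_window.append((labels, k))
        for c in range(k):
            members = window[labels == c]
            centroids.append(members.mean(dim=0) if len(members) else window.mean(dim=0))

    # Step 3: merge the fine-grained clusters into aggregated labels Y_aggr.
    y_aggr = kmeans(torch.stack(centroids), n_speakers)

    # Step 4 (unpack_labels): map Y_aggr back to all org_len inputs.
    out, offset = [], 0
    for labels, k in per_window:
        out.append(y_aggr[offset + labels])
        offset += k
    return torch.cat(out)  # Y_unpack

With the defaults, each 10000-embedding window is compressed to 50 centroids before the final clustering step, which is what keeps the affinity-matrix memory bounded regardless of recording length.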

Collection: ASR - Speaker Tasks - Speaker Diarization

Changelog

Updated files:

  • All three diarization inference yaml files
  • speaker_utils.py
  • test_diar_utils.py
  • longform_clustering.py - major change: adds the LongFormSpeakerClustering class
  • offline_clustering.py

Usage

python offline_diar_with_asr_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.out_dir='demo_asr_output' \
    diarizer.speaker_embeddings.model_path=<pretrained modelname or path to .nemo> \
    diarizer.asr.model_path=<pretrained modelname or path to .nemo> \
    diarizer.asr.parameters.asr_based_vad=True \
    diarizer.speaker_embeddings.parameters.save_embeddings=False
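
To exercise the long-form path explicitly, the two new parameters can also be overridden on the command line. Note that in the merged yaml files they are named chunk_cluster_count and embeddings_per_chunk (see the author's comment further down in this thread); the diarizer.clustering.parameters path below is an assumption based on the standard diar_infer_<domain>.yaml layout:

python offline_diar_with_asr_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.clustering.parameters.chunk_cluster_count=50 \
    diarizer.clustering.parameters.embeddings_per_chunk=10000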

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the ASR team

Signed-off-by: Taejin Park <tango4j@gmail.com>
@tango4j tango4j marked this pull request as ready for review October 18, 2023 19:50
tango4j and others added 2 commits October 18, 2023 13:49
@nithinraok (Collaborator) left a comment:

Please see if we can make the general case of number of embeddings < unit_window_len (i.e., n_windows=1) a special case of LongFormClustering. The core algorithm of getting the affinity matrix and performing spectral clustering is re-written; is it possible to reuse it by making modifications in forward_infer of SpeakerClustering?

Review threads (outdated, resolved) on:
  • nemo/collections/asr/parts/utils/offline_clustering.py
  • nemo/collections/asr/parts/utils/online_clustering.py (five threads)
Diff context (docstring and first line of the function under review):

    Y (LongTensor):
        Speaker labels for the segments in the given input embeddings.
    """
    embeddings_in_scales = param_dict['embeddings']
Collaborator: Will this function be directly called by the user? If so, do we need to add some value-checking statements for param_dict, so that proper error messages are thrown if invalid or missing values are found?

tango4j (Collaborator, Author): Added a dimension check in the split_input_data function in offline_clustering.py.
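
Such a check might look like the following sketch (illustrative only; check_clustering_inputs is a hypothetical name, not the code actually added to split_input_data):

import torch

def check_clustering_inputs(emb: torch.Tensor, timestamps: torch.Tensor) -> None:
    """Fail fast with readable messages instead of a cryptic downstream error."""
    if emb.dim() != 2:
        raise ValueError(f"Expected 2-D embeddings (n_segments, dim); got shape {tuple(emb.shape)}")
    if timestamps.dim() != 2:
        raise ValueError(f"Expected 2-D timestamps (n_segments, 2); got shape {tuple(timestamps.shape)}")
    if emb.shape[0] != timestamps.shape[0]:
        raise ValueError(f"Embedding count ({emb.shape[0]}) does not match timestamp count ({timestamps.shape[0]})")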

Diff context (from the docstring and signature under review):

    original labels for all `org_len` input embeddings: `Y_unpack`.

    Args:
        embeddings_in_scales (Tensor):
Collaborator: Based on the docstring, should the type for embeddings_in_scales and timestamp_in_scales be List[torch.Tensor]?

tango4j (Collaborator, Author): It is passed as a concatenated torch Tensor, so this form is correct. If the type hint were wrong, the torch JIT compiler would break at this line.

Diff context (the forward_infer signature in question):

    def forward_infer(
        self,
        embeddings_in_scales: torch.Tensor,
        timestamps_in_scales: torch.Tensor,
Collaborator: Is it List[torch.Tensor] for these two args?

tango4j (Collaborator, Author): It is passed as a concatenated torch Tensor, so this form is correct. If the type hint were wrong, the torch JIT compiler would break at this line.
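
To illustrate the constraint: TorchScript validates the annotations when compiling and again when the scripted function is called, so a List[torch.Tensor] hint on a value that actually arrives as one concatenated Tensor breaks. A minimal, hypothetical example (not the PR's code):

import torch

@torch.jit.script
def takes_concatenated(embeddings_in_scales: torch.Tensor,
                       timestamps_in_scales: torch.Tensor) -> torch.Tensor:
    # The multi-scale data arrives concatenated along dim 0, so plain
    # torch.Tensor is the correct annotation; declaring List[torch.Tensor]
    # while passing a Tensor would raise a RuntimeError at call time.
    return embeddings_in_scales.sum() + timestamps_in_scales.sum()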

@tango4j tango4j removed the request for review from KunalDhawan November 2, 2023 00:45
@nithinraok (Collaborator) commented: jenkins

@nithinraok (Collaborator) commented: jenkins

@tango4j tango4j merged commit df9f0d1 into NVIDIA:main Nov 7, 2023
11 checks passed
@tttalshaul

Before checking on real recordings of 4+ hour meetings (which I cannot share), I looped a short video with 5 speakers (on which NeMo previously had great results) to make it long; on that repeated video, LongFormSpeakerClustering estimated only 2 speakers, and I don't know why.
Video:
https://www.sendbig.com/view-files/?Id=93e8d76b-470a-1c62-4b8e-48acb0328f59-4dMJ

rttm:
https://www.sendbig.com/view-files/?Id=384cdc42-903d-6aba-2996-c4e388eb0bc7

(links may expire in 7 days..)

@tango4j (Collaborator, Author) commented Nov 13, 2023

@tttalshaul

Hi.
Since there is no high-quality annotated dataset containing 2~5 hours of multi-talker recordings, we tuned the parameters on a set of simulated datasets (audio mixtures).

Try changing the following parameters in diar_infer_<domain>.yaml:

chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio) 

If you increase embeddings_per_chunk, the threshold for applying long-form clustering goes up.
If the algorithm is under-counting, try increasing chunk_cluster_count; this can change the outcome.

I have to inform you that building a system that counts the number of speakers in a very long audio recording is an extremely challenging task, since we do not have data points to tune it properly.
At this point, the best way to handle this problem is to tweak the above two parameters until they deliver the expected number of speakers.

@tttalshaul

@tango4j
Thank you. Could you maybe use movies to tune it? Many movies are longer than 2-3 hours and have subtitles, which are different from speaker labels but would help with tagging.
And if you had enough data, could you train a neural network to separate different speakers (without labeling their names) and skip the clustering?

@mfalt commented Dec 6, 2023

Very nice to see this PR merged. I wanted to test it, but I am unable to figure out how (or whether) it is possible to use this with the NeuralDiarizer class (with diar_msdd_telephonic) as shown here.

I am experiencing OOM problems with that approach and was hoping this might offer a solution.

@tango4j tango4j deleted the long_clus branch December 6, 2023 21:26
@lasmith commented Dec 13, 2023

Ditto on the above comment. We are actually using the ClusteringDiarizer directly and getting CUDA memory issues (on an AWS g5 instance). The code is something like:

from nemo.collections.asr.models import ClusteringDiarizer

sd_model = ClusteringDiarizer(cfg=config).to(device)
sd_model.diarize()

I tried cloning / pip installing directly from the main branch, but it does not seem to be calling this new class. So the question is: how can we use this class/functionality on long audio files (> 8 hours)?

@lasmith commented Dec 13, 2023

Actually, you can ignore my previous comment. It was my mistake: I had a pre-existing installation of NeMo, and the version number does not seem to have been updated on the main branch, so running the install did nothing... I recreated the venv, installed from source, and now I have the latest code running.

For others, the solution (for now, until it is released) is to:

  1. Uninstall nemo
  2. Install from source, e.g. pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]

I have run the diarizer on an 8-hour audio file and it ran in < 10 mins with < 10 GB of GPU memory. So great work :)

@tttalshaul

@lasmith
How many speakers were there, and how were the results?

@hanifaudah

(quoting @lasmith's comment above about installing from source and the 8-hour run)

How much CPU RAM did this use?

pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request on Jan 3, 2024: "Adding long-form audio speaker diarization (clustering) class and functions (NVIDIA#7737)". The squashed commit message lists the individual changes (long-form clustering for diarization, unit tests, torch jit script tests, value checking, yaml parameter updates, docstring and formatting fixes), signed off by Taejin Park and Piotr Żelasko, with co-authors pre-commit-ci[bot], He Huang (Steve), and Nithin Rao.
@mfalt commented Mar 13, 2024

@lasmith Would you mind sharing how you adapted the config, or the code you posted, to use LongFormSpeakerClustering? I keep trying to read the code and get lost in all the classes and configs. Or is the long-form version somehow the default now?

@tango4j (Collaborator, Author) commented Mar 14, 2024

@mfalt
Hi. I am the main contributor of the long-form audio speaker diarization.
It always goes through the long-form audio class, but if the input is shorter than the pre-determined threshold it uses the regular clustering; if it goes above the threshold, it employs the long-form audio clustering mechanism, which is basically a divide-and-conquer way of clustering.
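
For a rough sense of where that threshold sits: the yaml comment above equates embeddings_per_chunk=10000 with roughly 40 minutes of audio, i.e., about 4 embeddings per second at the finest scale. Under that assumption, anything longer than about 40 minutes takes the long-form path, and a 4-hour recording, for instance, would be split into roughly six chunks of 10000 embeddings each.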

@lasmith commented Mar 14, 2024

(quoting @mfalt's question above)

Yes, it's the default now. The instructions above are no longer needed, as the code has been merged into master and released. So all you need to do is install the latest version, e.g.

pip install nemo-toolkit[all]==1.23.0

To use the long-form audio path, you just need to add "chunk_cluster_count" and "embeddings_per_chunk" to your config. See here for an example.

To answer the question on accuracy: in general, accuracy is identical for shorter audio (we have not found any issues). We did have an issue on longer audio. We have audio where we don't know the number of speakers a priori, and had set "max_num_speakers" to 8. On longer audio with more than 8 speakers, this resulted in all the audio being assigned to the same speaker (so only 1 speaker detected). Increasing the limit resolved this. I am not sure whether this also holds on shorter audio with lots of speakers, as it may not be specific to long-form.
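
For anyone wiring this up programmatically (as in the ClusteringDiarizer snippet earlier in the thread), setting these parameters on a loaded config might look like the sketch below; the exact diarizer.clustering.parameters path and the yaml filename are assumptions based on the standard diar_infer layout:

from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

config = OmegaConf.load("diar_infer_telephonic.yaml")  # example path
# Long-form clustering parameters introduced by this PR (assumed config path):
config.diarizer.clustering.parameters.chunk_cluster_count = 50
config.diarizer.clustering.parameters.embeddings_per_chunk = 10000
# Raising max_num_speakers helps when recordings contain many speakers
# (see the discussion below):
config.diarizer.clustering.parameters.max_num_speakers = 20

sd_model = ClusteringDiarizer(cfg=config)
sd_model.diarize()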

@trojanof commented Apr 26, 2024

(quoting @lasmith's accuracy comment above)

Hello @lasmith
I faced a similar problem, with all the audio (1 hour 5 min) assigned to one speaker. I didn't get from your comment which limits exactly were increased to obtain a good result with proper diarization. Could you please describe in more detail what values of "embeddings_per_chunk" and maybe "chunk_cluster_count" you used?

@lasmith commented Apr 29, 2024

(quoting @trojanof's question above)

The setting I changed was "max_num_speakers", which I increased from 8 to 20. The other settings you mention seem to only affect memory usage. The values we used were:

  • chunk_cluster_count: 75
  • embeddings_per_chunk: 20000

Whilst these seem to alleviate the issue, they do not get rid of it completely. I have not had much time to investigate further, as fortunately we don't get many long-form audio files, and it seems to only affect files above X hours that have large numbers of speakers. So I am not sure of the exact cause at the moment, but my suspicion is that it is something to do with the number of speakers and how these are detected.

FYI: for testing we are using UK parliament recordings of Hansard (see here), as these are the closest public-domain data to our legal use case.
