
Adding long-form audio speaker diarization (clustering) class and functions #7737

Merged

tango4j merged 40 commits into NVIDIA:main from tango4j:long_clus on Nov 7, 2023

Conversation

@tango4j (Collaborator) commented Oct 17, 2023

What does this PR do ?

This PR adds a long-form audio speaker diarization feature by adding the LongFormSpeakerClustering class.
Two parameters are added to the yaml files for long-form audio clustering.

This PR adds the forward function for long-form speaker clustering.
Please refer to the SpeakerClustering class for the original argument information.

To understand the long-form clustering, please read the following description of the long_forward_infer function in the LongFormSpeakerClustering class:

  1. Input embeddings are divided into smaller windows of size unit_window_len. This is usually on the order of 10K; the default is unit_window_len=10000.
  2. Each window undergoes overclustering, resulting in sub_cluster_n fine-grained clusters (default: sub_cluster_n=50).
  3. These fine-grained clusters are merged to form the aggregated clustering labels Y_aggr.
  4. The unpack_labels function then maps the aggregated labels Y_aggr back to the original labels for all org_len input embeddings: Y_unpack.
  • Theoretically, the maximum length of audio can be extended by a factor of unit_window_len/sub_cluster_n. For instance, with the default values, if the original clustering hits the memory limit at the 1-hour mark, the long-form clustering could handle up to 20 hours without exhausting the memory.
  • The added long-form audio clustering reuses (repurposes) the existing online and offline clustering functions, except for unpack_labels.
  • Tests for the main forward_unit_infer clustering execution function and the unpack_labels function are added to test_diar_utils.py.
    (1) The forward_unit_infer function performs a simple 1-scale clustering without time-stamp processing.
    (2) LongFormSpeakerClustering has a forward_infer function that chooses one of these two paths (a minimal sketch follows this list):
    -- long_forward_infer if the input number of segments is larger than self.unit_window_len;
    -- short_forward_infer if the input number of segments is smaller than or equal to self.unit_window_len.
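A minimal, self-contained sketch of this divide-and-conquer flow (illustrative only, not the NeMo implementation: it substitutes a tiny k-means for NeMo's actual spectral clustering, and the names long_form_cluster and kmeans are hypothetical):

import torch

def kmeans(x: torch.Tensor, k: int, iters: int = 20) -> torch.Tensor:
    """Tiny k-means used here as a stand-in for NeMo's spectral clustering."""
    centers = x[torch.randperm(x.shape[0])[:k]].clone()
    labels = torch.zeros(x.shape[0], dtype=torch.long)
    for _ in range(iters):
        labels = torch.cdist(x, centers).argmin(dim=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(dim=0)
    return labels

def long_form_cluster(
    emb: torch.Tensor,             # (org_len, dim) speaker embeddings
    n_speakers: int,
    unit_window_len: int = 10000,  # window size (default from this PR)
    sub_cluster_n: int = 50,       # forced clusters per window
) -> torch.Tensor:
    org_len = emb.shape[0]
    if org_len <= unit_window_len:
        # short_forward_infer path: one-shot clustering of everything.
        return kmeans(emb, n_speakers)

    per_window, centroids = [], []
    for start in range(0, org_len, unit_window_len):
        window = emb[start:start + unit_window_len]
        k = min(sub_cluster_n, window.shape[0])
        labels = kmeans(window, k)  # step 2: overcluster each window
        per_window.append((labels, k))
        for c in range(k):
            members = window[labels == c]
            centroids.append(members.mean(dim=0) if len(members) else window.mean(dim=0))

    # Step 3: merge the fine-grained clusters into aggregated labels Y_aggr.
    y_aggr = kmeans(torch.stack(centroids), n_speakers)

    # Step 4 (unpack_labels): map Y_aggr back to all org_len inputs.
    out, offset = [], 0
    for labels, k in per_window:
        out.append(y_aggr[offset + labels])
        offset += k
    return torch.cat(out)  # Y_unpack

With the defaults, each 10000-embedding window is compressed to 50 centroids before the final clustering step, which is what keeps the affinity-matrix memory bounded regardless of recording length.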

Collection: ASR - Speaker Tasks - Speaker Diarization

Changelog

Updated files:

  • All three diarization inference yaml files
  • speaker_utils.py
  • test_diar_utils.py
  • longform_clustering.py - major change: adds the LongFormSpeakerClustering class
  • offline_clustering.py

Usage

python offline_diar_with_asr_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.out_dir='demo_asr_output' \
    diarizer.speaker_embeddings.model_path=<pretrained modelname or path to .nemo> \
    diarizer.asr.model_path=<pretrained modelname or path to .nemo> \
    diarizer.asr.parameters.asr_based_vad=True \
    diarizer.speaker_embeddings.parameters.save_embeddings=False
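
To exercise the long-form path explicitly, the two new parameters can also be overridden on the command line. Note that in the merged yaml files they are named chunk_cluster_count and embeddings_per_chunk (see the author's comment further down in this thread); the diarizer.clustering.parameters path below is an assumption based on the standard diar_infer_<domain>.yaml layout:

python offline_diar_with_asr_infer.py \
    diarizer.manifest_filepath=<path to manifest file> \
    diarizer.clustering.parameters.chunk_cluster_count=50 \
    diarizer.clustering.parameters.embeddings_per_chunk=10000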

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the ASR team

Signed-off-by: Taejin Park <tango4j@gmail.com>
@tango4j tango4j marked this pull request as ready for review October 18, 2023 19:50
tango4j and others added 2 commits October 18, 2023 13:49
@nithinraok (Collaborator) left a comment:

Please see if we can make the general case of number of embeddings < unit_window_len (i.e., n_windows=1) a special case of LongFormClustering. The core algorithm of getting the affinity matrix and performing spectral clustering is re-written; is it possible to reuse it by making modifications in forward_infer of SpeakerClustering?

Review threads (outdated, resolved) on:
  • nemo/collections/asr/parts/utils/offline_clustering.py
  • nemo/collections/asr/parts/utils/online_clustering.py (five threads)
Diff context (docstring and first line of the function under review):

    Y (LongTensor):
        Speaker labels for the segments in the given input embeddings.
    """
    embeddings_in_scales = param_dict['embeddings']
Collaborator: Will this function be directly called by the user? If so, do we need to add some value-checking statements for param_dict, so that proper error messages are thrown if invalid or missing values are found?

tango4j (Collaborator, Author): Added a dimension check in the split_input_data function in offline_clustering.py.
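
Such a check might look like the following sketch (illustrative only; check_clustering_inputs is a hypothetical name, not the code actually added to split_input_data):

import torch

def check_clustering_inputs(emb: torch.Tensor, timestamps: torch.Tensor) -> None:
    """Fail fast with readable messages instead of a cryptic downstream error."""
    if emb.dim() != 2:
        raise ValueError(f"Expected 2-D embeddings (n_segments, dim); got shape {tuple(emb.shape)}")
    if timestamps.dim() != 2:
        raise ValueError(f"Expected 2-D timestamps (n_segments, 2); got shape {tuple(timestamps.shape)}")
    if emb.shape[0] != timestamps.shape[0]:
        raise ValueError(f"Embedding count ({emb.shape[0]}) does not match timestamp count ({timestamps.shape[0]})")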

Diff context (from the docstring and signature under review):

    original labels for all `org_len` input embeddings: `Y_unpack`.

    Args:
        embeddings_in_scales (Tensor):
Collaborator: Based on the docstring, should the type for embeddings_in_scales and timestamp_in_scales be List[torch.Tensor]?

tango4j (Collaborator, Author): It is passed as a concatenated torch Tensor, so this form is correct. If the type hint were wrong, the torch JIT compiler would break at this line.

Diff context (the forward_infer signature in question):

    def forward_infer(
        self,
        embeddings_in_scales: torch.Tensor,
        timestamps_in_scales: torch.Tensor,
Collaborator: Is it List[torch.Tensor] for these two args?

tango4j (Collaborator, Author): It is passed as a concatenated torch Tensor, so this form is correct. If the type hint were wrong, the torch JIT compiler would break at this line.
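
To illustrate the constraint: TorchScript validates the annotations when compiling and again when the scripted function is called, so a List[torch.Tensor] hint on a value that actually arrives as one concatenated Tensor breaks. A minimal, hypothetical example (not the PR's code):

import torch

@torch.jit.script
def takes_concatenated(embeddings_in_scales: torch.Tensor,
                       timestamps_in_scales: torch.Tensor) -> torch.Tensor:
    # The multi-scale data arrives concatenated along dim 0, so plain
    # torch.Tensor is the correct annotation; declaring List[torch.Tensor]
    # while passing a Tensor would raise a RuntimeError at call time.
    return embeddings_in_scales.sum() + timestamps_in_scales.sum()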

@tango4j tango4j removed the request for review from KunalDhawan November 2, 2023 00:45
@nithinraok (Collaborator) commented: jenkins

@nithinraok (Collaborator) commented: jenkins

@tango4j tango4j merged commit df9f0d1 into NVIDIA:main Nov 7, 2023
11 checks passed
@tttalshaul

Before checking on real recordings of 4+ hour meetings (which I cannot share), I looped a short video with 5 speakers (on which NeMo previously had great results) to make it long; on that repeated video, LongFormSpeakerClustering estimated only 2 speakers, and I don't know why.
Video:
https://www.sendbig.com/view-files/?Id=93e8d76b-470a-1c62-4b8e-48acb0328f59-4dMJ

rttm:
https://www.sendbig.com/view-files/?Id=384cdc42-903d-6aba-2996-c4e388eb0bc7

(links may expire in 7 days..)

@tango4j (Collaborator, Author) commented Nov 13, 2023

@tttalshaul

Hi.
Since there is no high-quality annotated dataset containing 2~5 hours of multi-talker recordings, we tuned the parameters on a set of simulated datasets (audio mixtures).

Try changing the following parameters in diar_infer_<domain>.yaml:

chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio) 

If you increase embeddings_per_chunk, the threshold for applying long-form clustering goes up.
If the algorithm is under-counting, try increasing chunk_cluster_count; this can change the outcome.

I have to inform you that building a system that counts the number of speakers in a very long audio recording is an extremely challenging task, since we do not have data points to tune it properly.
At this point, the best way to handle this problem is to tweak the above two parameters until they deliver the expected number of speakers.

@tttalshaul

@tango4j
Thank you. Could you maybe use movies to tune it? Many movies are longer than 2-3 hours and have subtitles, which are different from speaker labels but would help with tagging.
And if you had enough data, could you train a neural network to separate different speakers (without labeling their names) and skip the clustering?

@mfalt commented Dec 6, 2023

Very nice to see this PR merged. I wanted to test it, but I am unable to figure out how (or whether) it is possible to use this with the NeuralDiarizer class (with diar_msdd_telephonic) as shown here.

I am experiencing OOM problems with that approach and was hoping this might offer a solution.

@tango4j tango4j deleted the long_clus branch December 6, 2023 21:26
@lasmith commented Dec 13, 2023

Ditto on the above comment. We are actually using the ClusteringDiarizer directly and getting CUDA memory issues (on an AWS g5 instance). The code is something like:

from nemo.collections.asr.models import ClusteringDiarizer

sd_model = ClusteringDiarizer(cfg=config).to(device)
sd_model.diarize()

I tried cloning / pip installing directly from the main branch, but it does not seem to be calling this new class. So the question is: how can we use this class/functionality on long audio files (> 8 hours)?

@lasmith commented Dec 13, 2023

Actually, you can ignore my previous comment. It was my mistake: I had a pre-existing installation of NeMo, and the version number does not seem to have been updated on the main branch, so running the install did nothing... I recreated the venv, installed from source, and now I have the latest code running.

For others, the solution (for now, until it is released) is to:

  1. Uninstall nemo
  2. Install from source, e.g. pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]

I have run the diarizer on an 8-hour audio file and it ran in < 10 mins with < 10 GB of GPU memory. So great work :)

@tttalshaul

@lasmith
How many speakers were there, and how were the results?

@hanifaudah

(quoting @lasmith's comment above about installing from source and the 8-hour run)

How much CPU RAM did this use?

pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request on Jan 3, 2024: "Adding long-form audio speaker diarization (clustering) class and functions (NVIDIA#7737)". The squashed commit message lists the individual changes (long-form clustering for diarization, unit tests, torch jit script tests, value checking, yaml parameter updates, docstring and formatting fixes), signed off by Taejin Park and Piotr Żelasko, with co-authors pre-commit-ci[bot], He Huang (Steve), and Nithin Rao.
@mfalt commented Mar 13, 2024

@lasmith Would you mind sharing how you adapted the config, or the code you posted, to use LongFormSpeakerClustering? I keep trying to read the code and get lost in all the classes and configs. Or is the long-form version somehow the default now?

@tango4j (Collaborator, Author) commented Mar 14, 2024

@mfalt
Hi. I am the main contributor of the long-form audio speaker diarization.
It always goes through the long-form audio class, but if the input is shorter than the pre-determined threshold it uses the regular clustering; if it goes above the threshold, it employs the long-form audio clustering mechanism, which is basically a divide-and-conquer way of clustering.
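
For a rough sense of where that threshold sits: the yaml comment above equates embeddings_per_chunk=10000 with roughly 40 minutes of audio, i.e., about 4 embeddings per second at the finest scale. Under that assumption, anything longer than about 40 minutes takes the long-form path, and a 4-hour recording, for instance, would be split into roughly six chunks of 10000 embeddings each.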

@lasmith commented Mar 14, 2024

(quoting @mfalt's question above)

Yes, it's the default now. The instructions above are no longer needed, as the code has been merged into master and released. So all you need to do is install the latest version, e.g.

pip install nemo-toolkit[all]==1.23.0

To use the long-form audio path, you just need to add "chunk_cluster_count" and "embeddings_per_chunk" to your config. See here for an example.

To answer the question on accuracy: in general, accuracy is identical for shorter audio (we have not found any issues). We did have an issue on longer audio. We have audio where we don't know the number of speakers a priori, and had set "max_num_speakers" to 8. On longer audio with more than 8 speakers, this resulted in all the audio being assigned to the same speaker (so only 1 speaker detected). Increasing the limit resolved this. I am not sure whether this also holds on shorter audio with lots of speakers, as it may not be specific to long-form.
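
For anyone wiring this up programmatically (as in the ClusteringDiarizer snippet earlier in the thread), setting these parameters on a loaded config might look like the sketch below; the exact diarizer.clustering.parameters path and the yaml filename are assumptions based on the standard diar_infer layout:

from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

config = OmegaConf.load("diar_infer_telephonic.yaml")  # example path
# Long-form clustering parameters introduced by this PR (assumed config path):
config.diarizer.clustering.parameters.chunk_cluster_count = 50
config.diarizer.clustering.parameters.embeddings_per_chunk = 10000
# Raising max_num_speakers helps when recordings contain many speakers
# (see the discussion below):
config.diarizer.clustering.parameters.max_num_speakers = 20

sd_model = ClusteringDiarizer(cfg=config)
sd_model.diarize()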

@trojanof commented Apr 26, 2024

(quoting @lasmith's accuracy comment above)

Hello @lasmith
I faced a similar problem, with all the audio (1 hour 5 min) assigned to one speaker. I didn't get from your comment which limits exactly were increased to obtain a good result with proper diarization. Could you please describe in more detail what values of "embeddings_per_chunk" and maybe "chunk_cluster_count" you used?

@lasmith commented Apr 29, 2024

(quoting @trojanof's question above)

The setting I changed was "max_num_speakers", which I increased from 8 to 20. The other settings you mention seem to only affect memory usage. The values we used were:

  • chunk_cluster_count: 75
  • embeddings_per_chunk: 20000

Whilst these seem to alleviate the issue, they do not get rid of it completely. I have not had much time to investigate further, as fortunately we don't get many long-form audio files, and it seems to only affect files above X hours that have large numbers of speakers. So I am not sure of the exact cause at the moment, but my suspicion is that it is something to do with the number of speakers and how these are detected.

FYI: for testing we are using UK parliament recordings of Hansard (see here), as these are the closest public-domain data to our legal use case.
