Update diarization data loader to train meeting data #4567

tango4j · 2022-07-20T03:52:04Z

Signed-off-by: Taejin Park <tango4j@gmail.com>

What does this PR do ?

This PR adds a new feature to audio_to_diar_label.py.
Until now, we have trained only on 2-speaker audio in which one of the two speakers is always present.
This PR lets the data loader take a 2-speaker subset from audio clips that contain more than 2 speakers, together with the corresponding ground truth.
As a result, the ground-truth cluster label is "-1" for segments whose speaker is neither of the two targeted speakers.
ex) Two-speaker ground-truth label: [0,0,1,1,-1,-1,-1,1,1,1,0]
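
The relabeling described above can be sketched as follows. This is a minimal illustration, assuming each segment carries a session-level speaker name and that the two target speakers have already been chosen; the function name `relabel_for_target_pair` and the speaker names are hypothetical and are not taken from the code in this PR.

```python
from typing import List, Sequence


def relabel_for_target_pair(seg_speakers: Sequence[str], target_spks: Sequence[str]) -> List[int]:
    """Map per-segment speaker names to 0/1 for the two targets, -1 otherwise."""
    if len(target_spks) != 2:
        raise ValueError("exactly two target speakers are expected")
    # The position of a speaker in target_spks becomes its cluster label (0 or 1).
    spk_to_idx = {spk: idx for idx, spk in enumerate(target_spks)}
    # Any speaker outside the target pair receives the label -1.
    return [spk_to_idx.get(spk, -1) for spk in seg_speakers]


# Hypothetical 3-speaker meeting clip; "C" is not one of the two targets.
seg_speakers = ["A", "A", "B", "B", "C", "C", "C", "B", "B", "B", "A"]
labels = relabel_for_target_pair(seg_speakers, target_spks=("A", "B"))
print(labels)  # [0, 0, 1, 1, -1, -1, -1, 1, 1, 1, 0]
```

With this input the result matches the example label sequence in the description.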

Also removed unnecessary variables.

Collection: ASR

Changelog

audio_to_diar_label.py

-- Removed unused variables in the functions.
-- base_clus_label can now contain the label -1.
-- A few cosmetic changes.
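
One way downstream code could consume a base_clus_label that contains -1 is sketched below: every segment becomes a two-element target row, and a segment labeled -1 gets an all-zero row because neither target speaker is active. This is only an illustration under that assumption; the helper name `labels_to_two_speaker_targets` is hypothetical, and the actual target construction in audio_to_diar_label.py may differ.

```python
from typing import List, Sequence


def labels_to_two_speaker_targets(clus_labels: Sequence[int]) -> List[List[float]]:
    """Turn labels in {-1, 0, 1} into rows of [speaker0_active, speaker1_active]."""
    targets = []
    for label in clus_labels:
        if label not in (-1, 0, 1):
            raise ValueError(f"unexpected cluster label: {label}")
        row = [0.0, 0.0]
        if label >= 0:
            # Mark the active target speaker; -1 leaves both entries at zero.
            row[label] = 1.0
        targets.append(row)
    return targets


targets = labels_to_two_speaker_targets([0, 1, -1, 1])
print(targets)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 1.0]]
```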

collections.py

A few changes include:

-- Added get_rttm_speaker_index() function since sess_spk_dict could have more than 2 speakers.

speaker_utils.py

-- Added get_rttm_speaker_index that takes speakers from RTTM files.
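
A rough sketch of what such a mapping could look like, assuming each label is a string of the form "start end speaker" (this input format is an assumption, not something stated in this PR). The name `build_speaker_index` is hypothetical, and this is not the merged implementation of get_rttm_speaker_index.

```python
from typing import Dict, List


def build_speaker_index(rttm_labels: List[str]) -> Dict[int, str]:
    """Return {integer index: speaker name}, with speaker names in sorted order."""
    # The speaker name is assumed to be the last whitespace-separated field.
    speakers = {line.split()[-1] for line in rttm_labels if line.strip()}
    # Sorting makes the index assignment deterministic across runs.
    return dict(enumerate(sorted(speakers)))


# Hypothetical labels for a 3-speaker session.
rttm_labels = ["0.00 1.50 spk_b", "1.50 3.20 spk_a", "3.20 4.00 spk_c", "4.00 5.10 spk_a"]
speaker_index = build_speaker_index(rttm_labels)
print(speaker_index)  # {0: 'spk_a', 1: 'spk_b', 2: 'spk_c'}
```

Sorting the speaker names before enumerating them keeps the mapping stable regardless of the order in which speakers first appear in the RTTM file.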

Usage

Usage will be provided once the model and module code is published.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: Taejin Park <tango4j@gmail.com>

lgtm-com · 2022-07-20T04:02:23Z

This pull request introduces 1 alert when merging 08dfec2 into fea3775 - view on LGTM.com

new alerts:

1 for Unused local variable


lgtm-com · 2022-07-20T04:37:00Z

This pull request introduces 1 alert when merging 932ece6 into fea3775 - view on LGTM.com

new alerts:

1 for Unused local variable


nithinraok · 2022-07-25T16:49:07Z

nemo/collections/asr/data/audio_to_diar_label.py

@@ -61,7 +61,7 @@ def get_scale_mapping_list(uniq_timestamps):
    return scale_mapping_argmat


-def extract_seg_info_from_rttm(uniq_id, rttm_lines, emb_dict=None, target_spks=None):


You replaced emb_dict with mapping_dict, but I still see the emb_dict variable at line 84.

Also, please add docstrings for these variables.

This was a mistake, and I didn't catch it because I didn't test seq_eval_mode. Removed emb_dict. Adding docstrings.

nithinraok · 2022-07-25T16:50:57Z

nemo/collections/asr/data/audio_to_diar_label.py

@@ -274,7 +259,7 @@ def assign_labels_to_longer_segs(self, uniq_id, base_scale_clus_label):
        per_scale_clus_label = torch.tensor(per_scale_clus_label)
        return per_scale_clus_label, uniq_scale_mapping

-    def get_diar_target_labels(self, uniq_id, fr_level_target):
+    def get_diar_target_labels(self, uniq_id, sample, fr_level_target):


Add a docstring for sample.

Is it possible to initialize these variables with default values? That comes in handy when debugging issues in the future.

This needs discussion. I don't fully understand which variables should be initialized with default values. These arguments should not be assigned any default values. Adding a docstring for sample.

nithinraok · 2022-07-25T16:54:59Z

nemo/collections/asr/parts/utils/speaker_utils.py

@@ -359,6 +359,19 @@ def rttm_to_labels(rttm_filename):
    return labels


+def get_rttm_speaker_index(rttm_labels):
+    """
+    Generate speaker mapping between integer index to RTTM speaker label names.


Add docstrings for the input and return values.

Adding docstrings for this function.


nithinraok

LGTM, thanks.
This needs to be tested after we merge the MSDD model classes.

* Update audio_to_diar_label to train meeting data
* Style fix with --scope=nemo
* Style fix problem, re-run style fix
* Removed remaining commented lines
* Remove an unused variable
* Reflected comments
* style fixed
* style fix for no reason

Each item above: Signed-off-by: Taejin Park <tango4j@gmail.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>

Final sign-off, one per copy of this commit message:
Signed-off-by: David Mosallanezhad <dmosallanezh@nvidia.com>
Signed-off-by: Anas Abou Allaban <aabouallaban@pm.me>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>

Update audio_to_diar_label to train meeting data

08dfec2

Signed-off-by: Taejin Park <tango4j@gmail.com>

tango4j added 4 commits July 19, 2022 21:13

Style fix with --scope=nemo

028defb

Signed-off-by: Taejin Park <tango4j@gmail.com>

Merge branch 'update_diar_to_label' of https://github.com/NVIDIA/NeMo …

f70c8ac

…into update_diar_to_label

Style fix problem, re-run style fix

d0758a1

Signed-off-by: Taejin Park <tango4j@gmail.com>

Removed remaining commented lines

932ece6

Signed-off-by: Taejin Park <tango4j@gmail.com>

tango4j added 3 commits July 19, 2022 23:58

Remove an unused variable

050381f

Signed-off-by: Taejin Park <tango4j@gmail.com>

Merge branch 'main' into update_diar_to_label

8ac6618

Merge branch 'main' into update_diar_to_label

9c00a0a

tango4j requested a review from nithinraok July 22, 2022 21:58

tango4j marked this pull request as ready for review July 22, 2022 22:01

Merge branch 'main' into update_diar_to_label

333cfbe

nithinraok requested changes Jul 25, 2022


tango4j and others added 5 commits July 26, 2022 15:23

Reflected comments

81cd88c

Signed-off-by: Taejin Park <tango4j@gmail.com>

style fixed

2d761a8

Signed-off-by: Taejin Park <tango4j@gmail.com>

style fix for no reason

20cbf6a

Signed-off-by: Taejin Park <tango4j@gmail.com>

Merge branch 'main' into update_diar_to_label

d0500ee

Merge branch 'main' into update_diar_to_label

89e55b2

nithinraok approved these changes Jul 28, 2022


tango4j merged commit 16c96ba into main Jul 28, 2022

tango4j deleted the update_diar_to_label branch August 2, 2022 01:38
