Standalone diarization+ASR evaluation script #5439

Merged
merged 50 commits into from
Nov 21, 2022
19cbe6f
first commit on eval_diar_with_asr.py
tango4j Nov 16, 2022
d85efa2
Add a standalone diarization-ASR evaluation transcript
tango4j Nov 16, 2022
c1e7cf4
Fixed examples in docstrings
tango4j Nov 16, 2022
aea2e6c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2022
c48bf07
Fixed staticmethod error
tango4j Nov 16, 2022
cdc3235
merged main
tango4j Nov 16, 2022
b9feb6b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2022
af94a2e
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 17, 2022
902c088
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 17, 2022
11a7188
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 17, 2022
979ec60
Added description on eval modes
tango4j Nov 17, 2022
2731413
Merge branch 'mulspk_asr_eval_script' of https://github.com/NVIDIA/Ne…
tango4j Nov 17, 2022
9a48743
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 17, 2022
e449287
adding diar_infer_general.yaml
tango4j Nov 17, 2022
a4cbd7a
Merge branch 'mulspk_asr_eval_script' of https://github.com/NVIDIA/Ne…
tango4j Nov 17, 2022
e35629d
fix msdd_model in general yaml file
tango4j Nov 17, 2022
db31725
fixed errors in yaml file
tango4j Nov 17, 2022
50f8417
combine into 1 commit
tango4j Nov 17, 2022
61b4809
Added description on eval modes
tango4j Nov 17, 2022
250468c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2022
9e2acca
Add MoE support for T5 model (w/o expert parallel) (#5409)
aklife97 Nov 15, 2022
0abcbaf
Fix args (#5410) (#5416)
github-actions[bot] Nov 15, 2022
54a87d2
Fix for concat map dataset (#5133)
1-800-BAD-CODE Nov 15, 2022
dd3f3aa
Add temporary fix for CUDA issue in Dockerfile (#5421) (#5422)
github-actions[bot] Nov 15, 2022
a2145ba
Fix GPT generation when using sentencepiece tokenizer (#5413) (#5428)
github-actions[bot] Nov 16, 2022
4d7b8cf
Support for finetuning and finetuning inference with .ckpt files & ba…
MaximumEntropy Nov 16, 2022
cb6f339
Revert "Add temporary fix for CUDA issue in Dockerfile (#5421)" (#543…
github-actions[bot] Nov 16, 2022
0f56bdb
[ITN] fix year date graph, cardinals extension for hundreds (#5435)
ekmb Nov 16, 2022
6a97bc9
update doc in terms of get_label for lang id model (#5366)
fayejf Nov 16, 2022
cec4715
Revert workaround for T5 that sets number of workers to 0 & sync_batc…
github-actions[bot] Nov 16, 2022
785426c
Fixed bug in notebook (#5382) (#5394)
github-actions[bot] Nov 16, 2022
6b018da
Fixing bug in Megatron BERT when loss mask is all zeros (#5424)
shanmugamr1992 Nov 16, 2022
51250e9
Use updated API for overlapping grad sync with pipeline parallelism (…
timmoon10 Nov 16, 2022
7041f09
support to disable sequence length + 1 input tokens for each sample i…
anmolgupt Nov 17, 2022
9de694a
[TTS] Create script for processing TTS training audio (#5262)
rlangman Nov 17, 2022
b75fc1a
[TTS] remove useless logic for set_tokenizer. (#5430)
XuesongYang Nov 17, 2022
40e2fdf
Fix setting up of `ReduceLROnPlateau` learning rate scheduler (#5444)
PeganovAnton Nov 17, 2022
841ed52
Create codeql.yml (#5445)
titu1994 Nov 17, 2022
db6c136
Fix for getting tokenizer in character-based ASR models when using ta…
jonghwanhyeon Nov 17, 2022
a615aff
Combine 5 commits
tango4j Nov 17, 2022
8f60d87
resolved conflict
tango4j Nov 17, 2022
8f37ecb
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 17, 2022
1402c97
moved eval_der function and fixed tqdm options
tango4j Nov 18, 2022
5390636
Merge branch 'mulspk_asr_eval_script' of https://github.com/NVIDIA/Ne…
tango4j Nov 18, 2022
f00250a
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 18, 2022
92f4159
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 18, 2022
1fa70e8
Changed minor error in docstrings
tango4j Nov 18, 2022
e2d519d
Merge branch 'mulspk_asr_eval_script' of https://github.com/NVIDIA/Ne…
tango4j Nov 18, 2022
037b61c
removed score_labels and changed leave=True
tango4j Nov 19, 2022
2b838af
Merge branch 'main' into mulspk_asr_eval_script
tango4j Nov 19, 2022
@@ -64,9 +64,32 @@ def main(cfg):
     # If RTTM is provided and DER evaluation
     if diar_score is not None:
         metric, mapping_dict, _ = diar_score
-        der_results = asr_diar_offline.gather_eval_results(metric, mapping_dict, trans_info_dict)
-        wer_results = asr_diar_offline.evaluate(trans_info_dict)
-        asr_diar_offline.print_errors(der_results, wer_results)
+
+        # Get session-level diarization error rate and speaker counting error
+        der_results = OfflineDiarWithASR.gather_eval_results(
+            diar_score=diar_score,
+            audio_rttm_map_dict=asr_diar_offline.AUDIO_RTTM_MAP,
+            trans_info_dict=trans_info_dict,
+            root_path=asr_diar_offline.root_path,
+        )
+
+        # Calculate WER and cpWER if reference CTM files exist
+        wer_results = OfflineDiarWithASR.evaluate(
+            hyp_trans_info_dict=trans_info_dict,
+            audio_file_list=asr_diar_offline.audio_file_list,
+            ref_ctm_file_list=asr_diar_offline.ctm_file_list,
+        )
+
+        # Print average DER, WER and cpWER
+        OfflineDiarWithASR.print_errors(der_results=der_results, wer_results=wer_results)
+
+        # Save detailed session-level evaluation results in `root_path`.
+        OfflineDiarWithASR.write_session_level_result_in_csv(
+            der_results=der_results,
+            wer_results=wer_results,
+            root_path=asr_diar_offline.root_path,
+            csv_columns=asr_diar_offline.csv_columns,
+        )


 if __name__ == '__main__':
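
The hunk above computes cpWER (concatenated minimum-permutation WER) when reference CTM files exist. As a rough illustration of what that metric measures, the sketch below tries every assignment of hypothesis speakers to reference speakers and keeps the one with the fewest word errors. It is a minimal stdlib-only sketch, not NeMo's implementation: the names `word_edit_distance` and `cpwer_bruteforce` are hypothetical, and padding missing speakers with empty transcripts is an assumption made here.

```python
from itertools import permutations
from typing import List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cpwer_bruteforce(spk_hypothesis: List[str], spk_reference: List[str]) -> Tuple[float, Tuple[int, ...]]:
    """Return (cpWER, best permutation), where best_perm[k] is the hypothesis
    speaker index assigned to reference speaker k."""
    # Assumption: pad the shorter side with empty transcripts so that both
    # sides have the same number of speakers.
    n = max(len(spk_hypothesis), len(spk_reference))
    hyps = [h.split() for h in spk_hypothesis] + [[] for _ in range(n - len(spk_hypothesis))]
    refs = [r.split() for r in spk_reference] + [[] for _ in range(n - len(spk_reference))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcripts contain no words")

    best_errors, best_perm = None, None
    for perm in permutations(range(n)):
        errors = sum(word_edit_distance(refs[k], hyps[p]) for k, p in enumerate(perm))
        if best_errors is None or errors < best_errors:
            best_errors, best_perm = errors, perm
    return best_errors / total_ref_words, best_perm


if __name__ == "__main__":
    hyp = ["hello there", "how are you today"]
    ref = ["how are you", "hello there"]
    print(cpwer_bruteforce(hyp, ref))
```

The search is factorial in the number of speakers, so a brute-force pass like this is only practical for small speaker counts.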
@@ -0,0 +1,90 @@
# This YAML file is created for all types of offline speaker diarization inference tasks in the `<NeMo git root>/examples/speaker_tasks/diarization` folder.
# The inference parameters for the VAD, speaker embedding extractor, clustering module, MSDD module, and ASR decoder are all included in this YAML file.
# Each key under the `diarizer` key (`vad`, `speaker_embeddings`, `clustering`, `msdd_model`, `asr`) is used only by its own module and can be ignored if that module is not used.
# The configuration in this YAML file is tuned for balanced performance across various domains. The VAD is optimized on multilingual ASR datasets and the diarizer is optimized on the DIHARD3 development set.
# An example line in an input manifest file (`.json` format):
# {"audio_filepath": "/path/to/audio_file", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/file", "uem_filepath": "/path/to/uem/file"}
name: &name "ClusterDiarizer"

num_workers: 1
sample_rate: 16000
batch_size: 64

diarizer:
  manifest_filepath: ???
  out_dir: ???
  oracle_vad: False # If True, uses RTTM files provided in the manifest file to get speech activity (VAD) timestamps
  collar: 0.25 # Collar value for scoring
  ignore_overlap: True # Consider or ignore overlap segments while scoring

  vad:
Collaborator: This part looks good to me.

Collaborator (author): Thanks for checking.

    model_path: vad_multilingual_marblenet # .nemo local model path or pretrained VAD model name
    external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. Only one of model_path or external_vad_manifest should be set

    parameters: # Tuned by detection error rate (false alarm + miss) on multilingual ASR evaluation datasets
      window_length_in_sec: 0.63 # Window length in sec for VAD context input
      shift_length_in_sec: 0.08 # Shift length in sec for generating frame-level VAD predictions
      smoothing: False # False or type of smoothing method (e.g., median)
      overlap: 0.5 # Overlap ratio for overlapped mean/median smoothing filter
      onset: 0.5 # Onset threshold for detecting the beginning of a speech segment
      offset: 0.3 # Offset threshold for detecting the end of a speech segment
      pad_onset: 0.2 # Duration (in sec) added before each speech segment
      pad_offset: 0.2 # Duration (in sec) added after each speech segment
      min_duration_on: 0.5 # Threshold for short speech segment deletion
      min_duration_off: 0.5 # Threshold for small non-speech deletion
      filter_speech_first: True

  speaker_embeddings:
    model_path: titanet_large # .nemo local model path or pretrained model name (titanet_large, ecapa_tdnn or speakerverification_speakernet)
Collaborator: Is it titanet_large or titanet_small?

Collaborator (author): If titanet_small is published, we should change this right away.

    parameters:
      window_length_in_sec: [1.9,1.2,0.5] # Window length(s) in sec (floating-point number). Either a number or a list, e.g., 1.5 or [1.5,1.0,0.5]
      shift_length_in_sec: [0.95,0.6,0.25] # Shift length(s) in sec (floating-point number). Either a number or a list, e.g., 0.75 or [0.75,0.5,0.25]
      multiscale_weights: [1,1,1] # Weight for each scale. Should be null (for single scale) or a list matched with window/shift scale count, e.g., [0.33,0.33,0.33]
      save_embeddings: True # If True, save speaker embeddings in pickle format. This should be True if clustering result is used for other models, such as `msdd_model`.

  clustering:
    parameters:
      oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
      max_num_speakers: 8 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
      enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
      max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
      sparse_search_volume: 10 # The higher the number, the more values are examined, at the cost of more time.
      maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.

  msdd_model:
    model_path: null # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
    parameters:
      use_speaker_model_from_ckpt: True # If True, use speaker embedding model in checkpoint. If False, the provided speaker embedding model in config will be used.
      infer_batch_size: 25 # Batch size for MSDD inference.
      sigmoid_threshold: [0.7] # Sigmoid threshold for generating binarized speaker labels. The smaller the value, the more generous the overlap detection.
      seq_eval_mode: False # If True, use oracle number of speakers and evaluate F1 score for the given speaker sequences. Default is False.
      split_infer: True # If True, break the input audio clip into short sequences and calculate cluster average embeddings for inference.
      diar_window_length: 50 # The length of each split short sequence when split_infer is True.
      overlap_infer_spk_limit: 5 # If the estimated number of speakers is larger than this number, overlap speech is not estimated.

  asr:
    model_path: null # Provide NGC cloud ASR model name. stt_en_conformer_ctc_* models are recommended for diarization purposes.
    parameters:
      asr_based_vad: False # If True, speech segmentation for diarization is based on word-timestamps from ASR inference.
      asr_based_vad_threshold: 1.0 # Threshold (in sec) that caps the gap between two words when generating VAD timestamps using ASR based VAD.
      asr_batch_size: null # Batch size can be dependent on each ASR model. Default batch sizes are applied if set to null.
      decoder_delay_in_sec: null # Native decoder delay. null is recommended to use the default values for each ASR model.
      word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05, 0.2].
      word_ts_anchor_pos: "start" # Select which part of the word timestamp we want to use. The options are: 'start', 'end', 'mid'.
      fix_word_ts_with_VAD: False # Fix the word timestamp using VAD output. You must provide a VAD model to use this feature.
      colored_text: False # If True, use colored text to distinguish speakers in the output transcript.
      print_time: True # If True, the start and end time of each speaker turn is printed in the output transcript.
      break_lines: False # If True, the output transcript breaks the line to fix the line width (default is 90 chars)

    ctc_decoder_parameters: # Optional beam search decoder (pyctcdecode)
      pretrained_language_model: null # KenLM model file: .arpa model file or .bin binary file.
      beam_width: 32
      alpha: 0.5
      beta: 2.5

    realigning_lm_parameters: # Experimental feature
      arpa_language_model: null # Provide a KenLM language model in .arpa format.
      min_number_of_words: 3 # Min number of words for the left context.
      max_number_of_words: 10 # Max number of words for the right context.
      logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.
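
The `onset` and `offset` values under `vad.parameters` describe a two-threshold (hysteresis) rule for turning frame-level speech probabilities into segments. The sketch below illustrates that idea only. It is not NeMo's post-processing code: the function name `binarize` is hypothetical, and padding, smoothing and minimum-duration filtering are left out on purpose.

```python
from typing import List, Tuple


def binarize(
    probs: List[float],
    frame_shift: float = 0.08,  # matches shift_length_in_sec in the config above
    onset: float = 0.5,
    offset: float = 0.3,
) -> List[Tuple[float, float]]:
    """Turn per-frame speech probabilities into (start, end) segments in seconds.

    A segment opens when the probability reaches `onset` and closes only when
    it drops below `offset`, so brief dips between the two thresholds do not
    split a segment.
    """
    segments = []
    in_speech = False
    start = 0.0
    for i, p in enumerate(probs):
        t = i * frame_shift
        if not in_speech and p >= onset:
            in_speech, start = True, t
        elif in_speech and p < offset:
            segments.append((start, t))
            in_speech = False
    if in_speech:
        # Close a segment that is still open at the end of the signal.
        segments.append((start, len(probs) * frame_shift))
    return segments


if __name__ == "__main__":
    frame_probs = [0.1, 0.6, 0.4, 0.35, 0.2, 0.1, 0.7, 0.9]
    # frame_shift=1.0 keeps the arithmetic easy to follow.
    print(binarize(frame_probs, frame_shift=1.0))  # [(1.0, 4.0), (6.0, 8.0)]
```

Frames 2 and 3 fall below `onset` but stay above `offset`, so the first segment continues until frame 4.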

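The header comment of the YAML file gives one example manifest line. A small helper for producing such lines might look like the sketch below. The field names and default values are copied from that example line; the helper names `make_manifest_line` and `write_manifest` are hypothetical and not part of NeMo.

```python
import json
from typing import Iterable, Optional


def make_manifest_line(
    audio_filepath: str,
    rttm_filepath: Optional[str] = None,
    uem_filepath: Optional[str] = None,
    num_speakers: Optional[int] = None,
) -> str:
    """Build one JSON line in the manifest format shown in the YAML header comment."""
    entry = {
        "audio_filepath": audio_filepath,
        "offset": 0,
        "duration": None,  # serialized as null
        "label": "infer",
        "text": "-",
        "num_speakers": num_speakers,
        "rttm_filepath": rttm_filepath,
        "uem_filepath": uem_filepath,
    }
    return json.dumps(entry)


def write_manifest(lines: Iterable[str], path: str) -> None:
    """Write one JSON object per line, as the manifest format expects."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


if __name__ == "__main__":
    print(make_manifest_line("/path/to/audio_file", rttm_filepath="/path/to/rttm/file"))
```

The resulting file path is what `diarizer.manifest_filepath` (marked `???` in the config) would point to.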
48 changes: 48 additions & 0 deletions nemo/collections/asr/metrics/der.py
@@ -108,6 +108,54 @@ def score_labels(
return None


def evaluate_der(audio_rttm_map_dict, all_reference, all_hypothesis, diar_eval_mode='all'):
    """
    Evaluate with a selected diarization evaluation scheme

    Args:
        audio_rttm_map_dict (dict):
            Dictionary containing information provided from the manifest file
        all_reference (list[uniq_name,annotation]):
            reference annotations for score calculation
        all_hypothesis (list[uniq_name,annotation]):
            hypothesis annotations for score calculation
        diar_eval_mode (str):
            Diarization evaluation mode

            diar_eval_mode == "full":
                DIHARD challenge style evaluation, the strictest way of evaluating diarization
                (collar, ignore_overlap) = (0.0, False)
            diar_eval_mode == "fair":
                Evaluation setup used in VoxSRC challenge
                (collar, ignore_overlap) = (0.25, False)
            diar_eval_mode == "forgiving":
                Traditional evaluation setup
                (collar, ignore_overlap) = (0.25, True)
            diar_eval_mode == "all":
                Compute all three modes (default)

    Returns:
        diar_score:
            Output of `score_labels` for the last evaluated (collar, ignore_overlap) setting
    """
    eval_settings = []
    if diar_eval_mode == "full":
        eval_settings = [(0.0, False)]
    elif diar_eval_mode == "fair":
        eval_settings = [(0.25, False)]
    elif diar_eval_mode == "forgiving":
        eval_settings = [(0.25, True)]
    elif diar_eval_mode == "all":
        eval_settings = [(0.0, False), (0.25, False), (0.25, True)]
    else:
        raise ValueError("`diar_eval_mode` variable contains an unsupported value")

    # Each setting is scored in turn; only the result of the last setting is returned.
    for collar, ignore_overlap in eval_settings:
        diar_score = score_labels(
            AUDIO_RTTM_MAP=audio_rttm_map_dict,
            all_reference=all_reference,
            all_hypothesis=all_hypothesis,
            collar=collar,
            ignore_overlap=ignore_overlap,
        )
    return diar_score


def calculate_session_cpWER_bruteforce(spk_hypothesis: List[str], spk_reference: List[str]) -> Tuple[float, str, str]:
    """
    Calculate cpWER with actual permutations in brute-force way when LSA algorithm cannot deliver the correct result.
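
The mode table in the `evaluate_der` docstring maps each mode name to a `(collar, ignore_overlap)` pair. For a caller that wants to keep the score of every mode rather than a single one, the same table can be expressed as a lookup, as in the sketch below. This is an illustrative alternative rather than the code in this PR: `EVAL_MODES`, `resolve_eval_settings`, `evaluate_all_modes` and the `score_fn` callback are hypothetical names, and the scorer is passed in so the example runs without NeMo.

```python
from typing import Any, Callable, Dict, List, Tuple

# (collar, ignore_overlap) per evaluation mode, as listed in the docstring.
EVAL_MODES: Dict[str, Tuple[float, bool]] = {
    "full": (0.0, False),       # DIHARD challenge style
    "fair": (0.25, False),      # VoxSRC challenge setup
    "forgiving": (0.25, True),  # traditional setup
}


def resolve_eval_settings(diar_eval_mode: str = "all") -> List[Tuple[str, float, bool]]:
    """Expand a mode name into a list of (mode, collar, ignore_overlap) tuples."""
    if diar_eval_mode == "all":
        return [(name, collar, ign) for name, (collar, ign) in EVAL_MODES.items()]
    if diar_eval_mode not in EVAL_MODES:
        raise ValueError("`diar_eval_mode` variable contains an unsupported value")
    collar, ign = EVAL_MODES[diar_eval_mode]
    return [(diar_eval_mode, collar, ign)]


def evaluate_all_modes(score_fn: Callable[..., Any], diar_eval_mode: str = "all") -> Dict[str, Any]:
    """Call score_fn(collar=..., ignore_overlap=...) once per mode and keep every result."""
    return {
        name: score_fn(collar=collar, ignore_overlap=ign)
        for name, collar, ign in resolve_eval_settings(diar_eval_mode)
    }


if __name__ == "__main__":
    # Stand-in scorer that simply echoes its settings.
    def fake_score(collar: float, ignore_overlap: bool) -> Tuple[float, bool]:
        return (collar, ignore_overlap)

    for mode, score in evaluate_all_modes(fake_score).items():
        print(mode, score)
```

In practice `score_fn` could be a `functools.partial` over a real scoring function with the reference and hypothesis annotations already bound.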
4 changes: 2 additions & 2 deletions nemo/collections/asr/models/clustering_diarizer.py
@@ -213,7 +213,7 @@ def _run_vad(self, manifest_file):
             data.append(get_uniqname_from_filepath(file))

         status = get_vad_stream_status(data)
-        for i, test_batch in enumerate(tqdm(self._vad_model.test_dataloader(), desc='vad', leave=False)):
+        for i, test_batch in enumerate(tqdm(self._vad_model.test_dataloader(), desc='vad', leave=True)):
             test_batch = [x.to(self._device) for x in test_batch]
             with autocast():
                 log_probs = self._vad_model(input_signal=test_batch[0], input_signal_length=test_batch[1])
@@ -342,7 +342,7 @@ def _extract_embeddings(self, manifest_file: str, scale_idx: int, num_scales: in

         all_embs = torch.empty([0])
         for test_batch in tqdm(
-            self._speaker_model.test_dataloader(), desc=f'[{scale_idx}/{num_scales}] extract embeddings', leave=False
+            self._speaker_model.test_dataloader(), desc=f'[{scale_idx+1}/{num_scales}] extract embeddings', leave=True
         ):
             test_batch = [x.to(self._device) for x in test_batch]
             audio_signal, audio_signal_len, labels, slices = test_batch