Adding long-form audio speaker diarization (clustering) class and functions #7737

Merged
merged 40 commits into from
Nov 7, 2023
3923de4
Adding long-form audio clustering for diarization
tango4j Oct 14, 2023
9829949
Adding unit test changes
tango4j Oct 16, 2023
f9c6141
Merge branch 'NVIDIA:main' into long_clus
tango4j Oct 16, 2023
26c61c4
Added tests for torch jit script
tango4j Oct 16, 2023
51904c6
Merge branch 'long_clus' of https://github.com/tango4j/NeMo into long…
tango4j Oct 16, 2023
67883f5
Added variable value checking line
tango4j Oct 17, 2023
15ab8cd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 17, 2023
8f537f8
Added needed params to all yamls
tango4j Oct 17, 2023
dcaf06a
Consolidated long-form and short-form clustering methods
tango4j Oct 18, 2023
c739609
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 18, 2023
7ac2ecc
Merge remote-tracking branch 'origin' into long_clus
tango4j Oct 18, 2023
f62184a
Merged latest main and updated speaker utils
tango4j Oct 18, 2023
e7ce447
Fixed code formatting error in speaker_utils.py
tango4j Oct 18, 2023
f8ac688
Some minor fixes for doc-strings
tango4j Oct 18, 2023
31f57d2
Removed unnecessary comments
tango4j Oct 18, 2023
e810223
Merge branch 'main' into long_clus
stevehuang52 Oct 20, 2023
3a5b4f2
Merge branch 'main' into long_clus
tango4j Oct 20, 2023
e60c16a
Reflected comments and made changes
tango4j Oct 26, 2023
319d2d9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 26, 2023
381f6c1
Merge branch 'main' into long_clus
tango4j Oct 26, 2023
57bec0e
Minor changes on typos and comments
tango4j Oct 26, 2023
72869a6
Minor changes on typos and comments
tango4j Oct 26, 2023
4871767
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 26, 2023
6bbc0a8
Merge branch 'main' into long_clus
tango4j Oct 26, 2023
b1124bb
Fixes for code QL
tango4j Oct 26, 2023
976121b
Merge branch 'main' into long_clus
tango4j Oct 26, 2023
20eb34a
Fixed docstring errors
tango4j Oct 26, 2023
afa7434
Merge branch 'long_clus' of https://github.com/tango4j/NeMo into long…
tango4j Oct 26, 2023
14774b0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 26, 2023
5e5c896
Merge branch 'main' into long_clus
tango4j Oct 27, 2023
fe756d4
Merge branch 'main' into long_clus
tango4j Oct 30, 2023
4657c07
Reflected the second batch of comments
tango4j Nov 1, 2023
696c559
Merge branch 'long_clus' of https://github.com/tango4j/NeMo into long…
tango4j Nov 1, 2023
c90bce8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 1, 2023
e41acee
Updating all yamls for inference
tango4j Nov 1, 2023
bf7fe44
Merge branch 'long_clus' of https://github.com/tango4j/NeMo into long…
tango4j Nov 1, 2023
cd299ec
Added None-checker to forward to prevent type errors
tango4j Nov 1, 2023
2db9779
Merge branch 'main' into long_clus
tango4j Nov 1, 2023
22dee0a
Merge branch 'main' into long_clus
tango4j Nov 2, 2023
1416307
Merge branch 'main' into long_clus
nithinraok Nov 6, 2023
Original file line number Diff line number Diff line change
@@ -52,7 +52,9 @@ diarizer:
max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
sparse_search_volume: 10 # The higher the number, the more values will be examined with more time.
maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.

chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)

msdd_model:
model_path: null # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
parameters:
@@ -88,5 +90,4 @@ diarizer:
arpa_language_model: null # Provide a KenLM language model in .arpa format.
min_number_of_words: 3 # Min number of words for the left context.
max_number_of_words: 10 # Max number of words for the right context.
logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.

logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.
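The two new clustering parameters above, `chunk_cluster_count` and `embeddings_per_chunk`, can be related to audio length with some quick arithmetic. The following is a minimal sketch, not NeMo code: the function name `plan_long_form_chunks` is hypothetical, and the 0.25 s shift between consecutive embeddings is an assumption chosen because it roughly reproduces the "approximately 40 mins" figure in the comment (10000 x 0.25 s is about 41.7 minutes). It also assumes that each chunk contributes at most `chunk_cluster_count` representatives to a later global clustering step, which is one reading of "forced clusters (overclustering) per unit chunk".

```python
import math


def plan_long_form_chunks(
    num_embeddings: int,
    embeddings_per_chunk: int = 10000,
    chunk_cluster_count: int = 50,
    shift_sec: float = 0.25,  # ASSUMPTION: time step between embeddings
) -> dict:
    """Estimate chunk count and sizes for long-form clustering (illustrative only)."""
    if num_embeddings < 0:
        raise ValueError("num_embeddings must be non-negative")
    if embeddings_per_chunk <= 0 or chunk_cluster_count <= 0:
        raise ValueError("chunk parameters must be positive")

    # Number of chunks needed to cover all embeddings; the last one may be partial.
    num_chunks = math.ceil(num_embeddings / embeddings_per_chunk)

    # Approximate audio duration covered by one full chunk, in minutes.
    minutes_per_chunk = embeddings_per_chunk * shift_sec / 60.0

    # ASSUMPTION: each chunk is reduced to at most chunk_cluster_count groups,
    # so this is an upper bound on what a global clustering pass would see.
    max_reduced_count = num_chunks * chunk_cluster_count

    return {
        "num_chunks": num_chunks,
        "minutes_per_chunk": minutes_per_chunk,
        "max_reduced_count": max_reduced_count,
    }


if __name__ == "__main__":
    # 36000 embeddings is about 2.5 hours of audio under the assumed shift.
    print(plan_long_form_chunks(36000))
```

Under these assumptions, lowering `embeddings_per_chunk` trades a smaller per-chunk memory footprint for more chunks, which matches the "Adjust based on GPU memory capacity" note in the YAML comment.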
Original file line number Diff line number Diff line change
@@ -52,6 +52,8 @@ diarizer:
max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
sparse_search_volume: 30 # The higher the number, the more values will be examined with more time.
maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.
chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)

msdd_model:
model_path: null # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
@@ -88,5 +90,4 @@ diarizer:
arpa_language_model: null # Provide a KenLM language model in .arpa format.
min_number_of_words: 3 # Min number of words for the left context.
max_number_of_words: 10 # Max number of words for the right context.
logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.

logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.
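The comments on `max_rp_threshold` and `sparse_search_volume` in the hunk above describe a bounded search over p-values: candidates lie in 0 < p <= `max_rp_threshold`, and a larger `sparse_search_volume` means more candidates are examined. The sketch below shows one plausible way to build such a grid. It is an illustration under assumptions, not the library's implementation: the helper `candidate_p_values` is hypothetical, and even spacing of the candidates is an assumption.

```python
def candidate_p_values(
    max_rp_threshold: float = 0.25,
    sparse_search_volume: int = 30,
) -> list:
    """Return evenly spaced candidates in (0, max_rp_threshold] (illustrative only)."""
    if not 0.0 < max_rp_threshold <= 1.0:
        raise ValueError("max_rp_threshold must be in (0, 1]")
    if sparse_search_volume < 1:
        raise ValueError("sparse_search_volume must be at least 1")

    # ASSUMPTION: uniform spacing; the first candidate is one step above zero
    # so that the lower bound 0 < p is respected.
    step = max_rp_threshold / sparse_search_volume
    return [step * (i + 1) for i in range(sparse_search_volume)]


if __name__ == "__main__":
    coarse = candidate_p_values(0.25, 10)  # fewer candidates, faster search
    fine = candidate_p_values(0.25, 30)    # more candidates, slower search
    print(len(coarse), len(fine))
```

The two configs in this diff differ only in this respect: one uses `sparse_search_volume: 10` and the other `30`, so under the assumption above the second would examine three times as many candidates over the same range.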
Original file line number Diff line number Diff line change
@@ -44,14 +44,16 @@ diarizer:
multiscale_weights: [1,1,1,1,1] # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. ex) [0.33,0.33,0.33]
save_embeddings: True # If True, save speaker embeddings in pickle format. This should be True if clustering result is used for other models, such as `msdd_model`.

clustering:
clustering:
parameters:
oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
max_num_speakers: 8 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold.
sparse_search_volume: 30 # The higher the number, the more values will be examined with more time.
maj_vote_spk_count: False # If True, take a majority vote on multiple p-values to estimate the number of speakers.
chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)

msdd_model:
model_path: diar_msdd_telephonic # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
@@ -88,5 +90,4 @@ diarizer:
arpa_language_model: null # Provide a KenLM language model in .arpa format.
min_number_of_words: 3 # Min number of words for the left context.
max_number_of_words: 10 # Max number of words for the right context.
logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.

logprob_diff_threshold: 1.2 # The threshold for the difference between two log probability values from two hypotheses.
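The `multiscale_weights` comment in this file states a constraint that is easy to get wrong when editing the YAML by hand: the value should be null for a single scale, or a list whose length matches the window/shift scale count. A minimal sketch of that check follows. The function `check_multiscale_weights` is hypothetical, and the window and shift values in the usage example are placeholders rather than values taken from this diff.

```python
from typing import List, Optional, Sequence


def check_multiscale_weights(
    weights: Optional[Sequence[float]],
    window_lengths: Sequence[float],
    shift_lengths: Sequence[float],
) -> List[float]:
    """Validate weights against the number of scales (illustrative only)."""
    num_scales = len(window_lengths)
    if len(shift_lengths) != num_scales:
        raise ValueError("window and shift lists must have the same length")

    if weights is None:
        # Per the YAML comment, null is only meaningful for a single scale.
        if num_scales != 1:
            raise ValueError("null weights require exactly one scale")
        return [1.0]

    if len(weights) != num_scales:
        raise ValueError(
            f"expected {num_scales} weights, got {len(weights)}"
        )
    return [float(w) for w in weights]


if __name__ == "__main__":
    # Placeholder scales (ASSUMPTION): five windows with matching shifts.
    windows = [1.5, 1.25, 1.0, 0.75, 0.5]
    shifts = [0.75, 0.625, 0.5, 0.375, 0.25]
    print(check_multiscale_weights([1, 1, 1, 1, 1], windows, shifts))
```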