# Streaming multispeaker ASR and diarization tutorial

In [None]:
"""
You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install text-unidecode

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

## Install TorchAudio
!pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html

In [2]:
import os
import sys
print(sys.path)
sys.path.insert(0, '/home/taejinp/projects/streaming_mulspk_asr/NeMo/')
import nemo
print("Nemo PATH:", nemo.__path__)
BRANCH = 'streaming_mulspk_asr'

['/home/taejinp/projects/streaming_mulspk_asr/NeMo/tutorials/speaker_tasks', '/home/taejinp/projects/streaming_mulspk_asr/NeMo/tutorials/speaker_tasks', '/usr/local/lib/python3.6/dist-packages/torchtext_mod', '/usr/local/lib/python3.6/dist-packages/torchtext_edit', '/usr/local/lib/python3.6/dist-packages/torchtext_edit/data', '/home/taejinp/anaconda3/lib/python39.zip', '/home/taejinp/anaconda3/lib/python3.9', '/home/taejinp/anaconda3/lib/python3.9/lib-dynload', '', '/home/taejinp/.local/lib/python3.9/site-packages', '/home/taejinp/anaconda3/lib/python3.9/site-packages']
Nemo PATH: ['/home/taejinp/projects/streaming_mulspk_asr/NeMo/nemo']


Set your NeMo path

In [3]:
import sys
import socket
if socket.gethostname() == "aiapps-06052021":
    sys.path.insert(0,'/home/taejinp/projects/streaming_mulspk_asr/NeMo')
else:
    sys.path.insert(0,'/your/path/to/NeMo/')
    
import nemo
print("Using Nemo PATH:", nemo.__path__[0])

# !pip install gradio==2.9.0

Using Nemo PATH: /home/taejinp/projects/streaming_mulspk_asr/NeMo/nemo
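The hostname check above ties the notebook to one machine. A minimal alternative sketch, assuming you are free to choose your own convention: read the checkout location from an environment variable. The variable name `NEMO_PATH` and the helper `add_nemo_to_path` are hypothetical and are not part of NeMo.

```python
import os
import sys


def add_nemo_to_path(env_var="NEMO_PATH", default="/your/path/to/NeMo/"):
    """Prepend a local NeMo checkout to sys.path and return the path used.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    nemo_root = os.environ.get(env_var, default)
    if nemo_root not in sys.path:
        sys.path.insert(0, nemo_root)
    return nemo_root


nemo_root = add_nemo_to_path()
print("NeMo checkout on sys.path:", nemo_root)
```

With this in place, `export NEMO_PATH=/path/to/NeMo` before starting Jupyter selects the checkout without editing the notebook.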


In [4]:
# Introduction to Online Speaker Diarization
"""
As covered in the Speaker Diarization Inference tutorial, speaker diarization is the task of segmenting an audio recording by speaker label; it answers the question "Who Speaks When?".

While offline speaker diarization has access to the entire audio file and returns all speaker labels at once, online speaker diarization is a streaming task that processes audio in small chunks.
Because only a small chunk of audio is available at a time, an online diarization system must maintain a memory buffer that stores the history of past speakers. At the same time, the system must be able to detect new speakers that are not yet in the memory buffer.

This tutorial covers the following:

- How to run online speaker diarization with NeMo
- How online speaker clustering and the memory buffer work together
"""

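To make the memory-buffer idea concrete, here is a deliberately simplified sketch. It is not NeMo's online clustering algorithm. It assumes each chunk has already been turned into a speaker embedding, keeps one running-mean centroid per known speaker, and opens a new speaker whenever the best cosine similarity falls below a threshold. The class name `ToySpeakerMemory`, the threshold value, and the 2-D embeddings are all made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


class ToySpeakerMemory:
    """Keeps one running-mean centroid per speaker seen so far."""

    def __init__(self, new_speaker_threshold=0.5):
        self.threshold = new_speaker_threshold
        self.centroids = []  # one embedding per remembered speaker
        self.counts = []     # number of chunks assigned to each speaker

    def assign(self, emb):
        """Return the speaker index for `emb`, creating a new speaker if needed."""
        scores = [cosine(emb, c) for c in self.centroids]
        if scores and max(scores) >= self.threshold:
            spk = scores.index(max(scores))
            n = self.counts[spk]
            # Fold the new embedding into the running mean of that speaker.
            self.centroids[spk] = [
                (c * n + e) / (n + 1) for c, e in zip(self.centroids[spk], emb)
            ]
            self.counts[spk] = n + 1
        else:
            # Nothing in memory is similar enough: treat this as a new speaker.
            self.centroids.append(list(emb))
            self.counts.append(1)
            spk = len(self.centroids) - 1
        return spk


memory = ToySpeakerMemory()
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05], [0.1, 0.9]]
labels = [memory.assign(e) for e in stream]
print(labels)  # [0, 0, 1, 0, 1]
```

The memory here is just the list of centroids. A real system also has to bound how much history it keeps and to revise earlier decisions, which this sketch does not attempt.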

In [5]:
from nemo.collections.asr.parts.utils.speaker_utils import audio_rttm_map
from nemo.core.config import hydra_runner
import gradio as gr
from scipy.io import wavfile
import numpy as np
import hydra
import os
import torch
from nemo.collections.asr.models import OnlineClusteringDiarizer
# from nemo.collections.asr.parts.utils.diarization_utils import ASR_DIAR_ONLINE
from nemo.collections.asr.parts.utils.diarization_utils import OnlineDiarWithASR


Read the YAML file for online diarization. You have to specify the following items:

- Input manifest file (for simulation)
- VAD model path
- Speaker embedding extractor model path
- Diarization decoder model path (coming soon)
- Punctuation model path (downloaded automatically from NGC)
- Language model path (coming soon)

Download the NeMo models and set their paths in the config struct.
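Before initializing the models, it can help to check that none of the required entries is missing or still set to a placeholder. The sketch below is a hypothetical helper, not part of NeMo. It works on a plain nested dict; an OmegaConf config could first be converted with `OmegaConf.to_container`. The key names follow the `cfg.diarizer.*` fields used in this notebook, and the example paths are invented.

```python
REQUIRED_KEYS = [
    ("diarizer", "manifest_filepath"),
    ("diarizer", "vad", "model_path"),
    ("diarizer", "speaker_embeddings", "model_path"),
    ("diarizer", "asr", "model_path"),
]


def find_unset_entries(cfg, placeholder_prefix="/your/path/to"):
    """Return dotted keys that are missing, empty, or still hold a placeholder path."""
    unset = []
    for key_path in REQUIRED_KEYS:
        node = cfg
        for key in key_path:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if not node or str(node).startswith(placeholder_prefix):
            unset.append(".".join(key_path))
    return unset


# Invented example values, for illustration only.
example_cfg = {
    "diarizer": {
        "manifest_filepath": "/data/online_diar_demo.json",
        "vad": {"model_path": "/your/path/to/vad.nemo"},
        "speaker_embeddings": {"model_path": "/models/titanet.nemo"},
        "asr": {},
    }
}
print(find_unset_entries(example_cfg))
# ['diarizer.vad.model_path', 'diarizer.asr.model_path']
```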

In [6]:
import os
import socket

import omegaconf

YAML_FILE = "/home/taejinp/projects/streaming_mulspk_asr/NeMo/examples/speaker_tasks/diarization/conf/inference/online_diar_infer_general.yaml"
cfg = omegaconf.OmegaConf.load(YAML_FILE)

cfg.diarizer.out_dir = "./streaming_diar_output"

os.makedirs(cfg.diarizer.out_dir, exist_ok=True)
cfg.diarizer.asr.parameters.colored_text = False
print(f"socket.gethostname() {socket.gethostname()}")
if socket.gethostname() == "aiapps-06052021":
    # cfg.diarizer.manifest_filepath = "/home/taejinp/projects/data/diar_manifest_input/ch109.json"
    cfg.diarizer.manifest_filepath = "/home/taejinp/projects/data/diar_manifest_input/online_diar_demo_01.json"
    cfg.diarizer.vad.model_path = "/home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo"
    cfg.diarizer.speaker_embeddings.model_path = "/home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo"
    cfg.diarizer.asr.model_path = "/home/taejinp/gdrive/model/ASR_models/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo"
    cfg.diarizer.asr.parameters.punctuation_model_path = "punctuation_en_distilbert"
else:
    # Please download the following models and run the code. 

    # Download CH109 dataset at: https://drive.google.com/drive/folders/1ksq10H-NZbKRfMjEP_WWyBF_G0iAJt6b?usp=sharing
    cfg.diarizer.manifest_filepath = "/your/path/to/ch109.json"

    # Download streaming VAD model at: https://drive.google.com/file/d/1ab42CaYeTkuJSMsMsMLbSS9m5e1isJzx/view?usp=sharing
    cfg.diarizer.vad.model_path = "/your/path/to/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo"

    # Download titanet-m model at: https://drive.google.com/file/d/1xAgjm0udKogPrlQF6cdHLobEKHLY9azA/view?usp=sharing
    cfg.diarizer.speaker_embeddings.model_path = "/your/path/to/titanet-m.nemo"

    # Download Conformer-CTC ASR model at: https://drive.google.com/file/d/1Xg075IbiwL0szI4_a8gYmCPaG1UsgR6E/view?usp=sharing
    cfg.diarizer.asr.model_path = "/your/path/to/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo"

    cfg.diarizer.asr.parameters.punctuation_model_path = "punctuation_en_distilbert"

socket.gethostname() aiapps-06052021


Initialize the OnlineClusteringDiarizer and OnlineDiarWithASR classes.

In [7]:
# %%html
# <style>
# .output_wrapper, .output {
#     height:auto !important;
#     max-height:500px; 
# }
# .output_scroll {
#     box-shadow:none !important;
#     webkit-box-shadow:none !important;
# }
# </style>

In [8]:
from nemo.collections.asr.models import OnlineClusteringDiarizer
import os

params = {}
params['use_cuda'] = True
AUDIO_RTTM_MAP = audio_rttm_map(cfg.diarizer.manifest_filepath)

diar = OnlineClusteringDiarizer(cfg)
from nemo.collections.asr.parts.utils.diarization_utils import OnlineDiarWithASR, write_txt

cfg.diarizer.simulation_uniq_id = 'citadel_ken'
cfg.diarizer.out_dir = '/home/taejinp/projects/run_time/streaming_diar_output_univ'
cfg.diarizer.asr.parameters.streaming_simulation = True
cfg.diarizer.asr.parameters.enforce_real_time = True
cfg.diarizer.asr.parameters.colored_text = False
 
fn = os.path.join(cfg.diarizer.out_dir, "print_script.sh")
# os.remove(fn) if os.path.exists(fn) else None

diar.uniq_id = cfg.diarizer.simulation_uniq_id 
diar.single_audio_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['audio_filepath']
diar.rttm_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['rttm_filepath']
# diar.rttm_file_path = None # DER calculation slows down online diarization speed
diar._init_segment_variables()


online_diar_asr = OnlineDiarWithASR(cfg=cfg)
diar = online_diar_asr.diar

diar.device = online_diar_asr.device
online_diar_asr.reset()

# cfg.diarizer.asr.parameters.streaming_simulation=True
# cfg.diarizer.asr.parameters.streaming_simulation=False

simulation = True
# simulation = False # Run Gradio server with your microphone.

[NeMo I 2023-10-20 15:18:54 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-20 15:18:54 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-20 15:18:54 features:289] PADDING: 16
[NeMo I 2023-10-20 15:18:55 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-20 15:18:55 clustering_diarizer:120] VAD model loaded locally from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo


[NeMo W 2023-10-20 15:18:55 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/combined_fisher_swbd_voxceleb12_librispeech/train.json
    sample_rate: 16000
    labels: null
    batch_size: 64
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      noise:
        manifest_path: /manifests/noise/rir_noise_manifest.json
        prob: 0.5
        min_snr_db: 0
        max_snr_db: 15
      speed:
        prob: 0.5
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
    num_workers: 15
    pin_memory: true
    
[NeMo W 2023-10-20 15:18:55 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method 

[NeMo I 2023-10-20 15:18:55 features:289] PADDING: 16
[NeMo I 2023-10-20 15:18:56 save_restore_connector:249] Model EncDecSpeakerLabelModel was successfully restored from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo.
[NeMo I 2023-10-20 15:18:56 clustering_diarizer:145] Speaker Model restored locally from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo
[NeMo I 2023-10-20 15:18:56 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-20 15:18:56 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-20 15:18:56 features:289] PADDING: 16
[NeMo I 2023-10-20 15:18:56 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-20 15:18:56 clustering_diarizer:120] VAD model loaded locally from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo


[NeMo W 2023-10-20 15:18:56 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/combined_fisher_swbd_voxceleb12_librispeech/train.json
    sample_rate: 16000
    labels: null
    batch_size: 64
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      noise:
        manifest_path: /manifests/noise/rir_noise_manifest.json
        prob: 0.5
        min_snr_db: 0
        max_snr_db: 15
      speed:
        prob: 0.5
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
    num_workers: 15
    pin_memory: true
    
[NeMo W 2023-10-20 15:18:56 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method 

[NeMo I 2023-10-20 15:18:56 features:289] PADDING: 16
[NeMo I 2023-10-20 15:18:57 save_restore_connector:249] Model EncDecSpeakerLabelModel was successfully restored from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo.
[NeMo I 2023-10-20 15:18:57 clustering_diarizer:145] Speaker Model restored locally from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo
[NeMo I 2023-10-20 15:18:57 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-20 15:18:57 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-20 15:18:57 features:289] PADDING: 16
[NeMo I 2023-10-20 15:18:57 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-20 15:18:59 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2023-10-20 15:18:59 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/asr_datasets_prebuilt/RIVA_ASR_SET_3.0_tarred/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 16
    shuffle: true
    is_tarred: true
    tarred_audio_filepaths: /data/asr_datasets_prebuilt/RIVA_ASR_SET_3.0_tarred/audio__OP_0..4095_CL_.tar
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 1024
    num_workers: 16
    pin_memory: true
    
[NeMo W 2023-10-20 15:18:59 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /data/Jasper_NEM

[NeMo I 2023-10-20 15:18:59 features:289] PADDING: 0
[NeMo I 2023-10-20 15:19:01 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from /home/taejinp/gdrive/model/ASR_models/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo.


[NeMo W 2023-10-20 15:19:01 decoder_timestamps_utils:66] `ctc_decode` was set to True. Note that this is ignored.


[NeMo I 2023-10-20 15:19:01 features:289] PADDING: 0
[NeMo I 2023-10-20 15:19:01 features:289] PADDING: 0
[NeMo I 2023-10-20 15:19:01 features:289] PADDING: 0


Let's run a simulated audio stream to check that the streaming system works properly. While the following cell is running, you can watch the transcription being generated in real time. The transcript is written to print_script.sh in the output directory set by cfg.diarizer.out_dir (for example ./streaming_diar_output/print_script.sh), and it can be viewed with "streaming_diarization_viewer.ipynb".
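One way to picture the simulated stream is that the file is cut into fixed-size chunks that are handed over no faster than a live microphone would deliver them. The generator below is a sketch under that assumption. How NeMo implements `streaming_simulation` and `enforce_real_time` internally is not shown in this notebook, and `stream_chunks` is a hypothetical helper.

```python
import time


def stream_chunks(samples, chunk_size, sample_rate,
                  enforce_real_time=False, sleep=time.sleep):
    """Yield (index, chunk) pairs of `chunk_size` samples; the incomplete tail is dropped."""
    chunk_seconds = chunk_size / sample_rate
    n_chunks = len(samples) // chunk_size
    for index in range(n_chunks):
        started = time.monotonic()
        yield index, samples[index * chunk_size:(index + 1) * chunk_size]
        if enforce_real_time:
            # The consumer's processing time has already elapsed when we resume here.
            # Wait out the rest of the chunk duration so audio is never delivered
            # faster than real time.
            remaining = chunk_seconds - (time.monotonic() - started)
            if remaining > 0:
                sleep(remaining)


# Ten samples in chunks of four: two full chunks, the last two samples are dropped.
for index, chunk in stream_chunks(list(range(10)), chunk_size=4, sample_rate=16000):
    print(index, chunk)
```

Because the function only relies on `len` and slicing, the same code works on a NumPy array returned by `wavfile.read`.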


In [8]:
import ipywidgets
import time
box_layout = ipywidgets.Layout(height="500px", width="90%")
widget = ipywidgets.Textarea(value='', disabled=True, layout=box_layout)
display(widget)  # display widget

Textarea(value='', disabled=True, layout=Layout(height='500px', width='90%'))

In [None]:
diar.uniq_id = cfg.diarizer.simulation_uniq_id
online_diar_asr.get_audio_rttm_map(diar.uniq_id)
diar.single_audio_file_path = diar.AUDIO_RTTM_MAP[diar.uniq_id]['audio_filepath']
online_diar_asr.rttm_file_path = diar.AUDIO_RTTM_MAP[diar.uniq_id]['rttm_filepath']

diar._init_segment_variables()
diar.device = online_diar_asr.device
write_txt(f"{diar._out_dir}/print_script.sh", "")

samplerate, sdata = wavfile.read(diar.single_audio_file_path)
seg_offset = diar.AUDIO_RTTM_MAP[diar.uniq_id]['offset']
seg_duration = diar.AUDIO_RTTM_MAP[diar.uniq_id]['duration']
# Trim to the manifest segment; compare against None so that an offset of 0 is still honored.
if seg_offset is not None and seg_duration is not None:
    offset = samplerate * seg_offset
    duration = samplerate * seg_duration
    stt, end = int(offset), int(offset + duration)
    sdata = sdata[stt:end]

for index in range(int(np.floor(sdata.shape[0]/online_diar_asr.n_frame_len))):
    shift = online_diar_asr.CHUNK_SIZE
    sample_audio = sdata[shift*index:shift*(index+1)]
    online_diar_asr.buffer_counter = index
    online_diar_asr.streaming_step(sample_audio)
    
    # Refresh the text widget with the transcript written so far.
    with open(f'{diar._out_dir}/print_script.sh', 'r') as f:
        widget.value = f.read()
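The manifest's `offset` and `duration` are given in seconds, while the loaded waveform is indexed in samples. A minimal sketch of that conversion follows. The helper name `trim_to_segment` is hypothetical, and treating a missing offset as 0 and a missing duration as "until the end" is an assumption of this sketch.

```python
def trim_to_segment(samples, sample_rate, offset=None, duration=None):
    """Return the part of `samples` that starts at `offset` s and lasts `duration` s."""
    start = int(round((offset or 0.0) * sample_rate))
    if duration is None:
        return samples[start:]
    end = start + int(round(duration * sample_rate))
    return samples[start:end]


# Toy signal: 100 samples at 10 Hz, i.e. 10 seconds of audio.
toy = list(range(100))
segment = trim_to_segment(toy, sample_rate=10, offset=2.0, duration=3.0)
print(len(segment), segment[0], segment[-1])  # 30 20 49
```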
     




Now, go to streaming_diarization_viewer.ipynb and check the real-time output.
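The viewer notebook is not reproduced here. As a rough idea of what such a viewer has to do, the sketch below re-reads the transcript file only when its modification time changes. The helper `read_if_changed` is hypothetical, and the demo writes a made-up line to a temporary file rather than to the real output directory.

```python
import os
import tempfile


def read_if_changed(path, last_mtime=None):
    """Return (text, mtime) if the file changed since `last_mtime`, else (None, last_mtime)."""
    try:
        mtime = os.stat(path).st_mtime_ns
    except FileNotFoundError:
        return None, last_mtime
    if mtime == last_mtime:
        return None, last_mtime
    with open(path, "r", encoding="utf-8") as f:
        return f.read(), mtime


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "print_script.sh")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("hello")  # made-up content, not real diarization output
    first, seen = read_if_changed(demo_path)
    second, seen = read_if_changed(demo_path, seen)

print(first)   # hello
print(second)  # None
```

In a viewer, this would be called in a loop with a short `time.sleep`, updating the display only when the returned text is not `None`.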

In [9]:
cfg.diarizer.asr.parameters.streaming_simulation=False
cfg.diarizer.asr.parameters.enforce_real_time=False
online_diar_asr = OnlineDiarWithASR(cfg=cfg)
diar = online_diar_asr.diar
write_txt(f"{diar._out_dir}/print_script.sh", "")

diar.uniq_id = cfg.diarizer.simulation_uniq_id 
diar.single_audio_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['audio_filepath']
diar.rttm_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['rttm_filepath']
# diar.rttm_file_path = None # DER calculation slows down online diarization speed
diar._init_segment_variables()



[NeMo I 2023-10-20 15:19:01 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-20 15:19:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-20 15:19:02 features:289] PADDING: 16
[NeMo I 2023-10-20 15:19:02 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-20 15:19:02 clustering_diarizer:120] VAD model loaded locally from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo


[NeMo W 2023-10-20 15:19:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/combined_fisher_swbd_voxceleb12_librispeech/train.json
    sample_rate: 16000
    labels: null
    batch_size: 64
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      noise:
        manifest_path: /manifests/noise/rir_noise_manifest.json
        prob: 0.5
        min_snr_db: 0
        max_snr_db: 15
      speed:
        prob: 0.5
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
    num_workers: 15
    pin_memory: true
    
[NeMo W 2023-10-20 15:19:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method 

[NeMo I 2023-10-20 15:19:02 features:289] PADDING: 16
[NeMo I 2023-10-20 15:19:02 save_restore_connector:249] Model EncDecSpeakerLabelModel was successfully restored from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo.
[NeMo I 2023-10-20 15:19:02 clustering_diarizer:145] Speaker Model restored locally from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo
[NeMo I 2023-10-20 15:19:02 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-20 15:19:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-20 15:19:02 features:289] PADDING: 16
[NeMo I 2023-10-20 15:19:02 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-20 15:19:05 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2023-10-20 15:19:05 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/asr_datasets_prebuilt/RIVA_ASR_SET_3.0_tarred/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 16
    shuffle: true
    is_tarred: true
    tarred_audio_filepaths: /data/asr_datasets_prebuilt/RIVA_ASR_SET_3.0_tarred/audio__OP_0..4095_CL_.tar
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 1024
    num_workers: 16
    pin_memory: true
    
[NeMo W 2023-10-20 15:19:05 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /data/Jasper_NEM

[NeMo I 2023-10-20 15:19:05 features:289] PADDING: 0
[NeMo I 2023-10-20 15:19:06 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from /home/taejinp/gdrive/model/ASR_models/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo.


[NeMo W 2023-10-20 15:19:06 decoder_timestamps_utils:66] `ctc_decode` was set to True. Note that this is ignored.


[NeMo I 2023-10-20 15:19:06 features:289] PADDING: 0
[NeMo I 2023-10-20 15:19:06 features:289] PADDING: 0
[NeMo I 2023-10-20 15:19:07 features:289] PADDING: 0


In [10]:
import ipywidgets
import time
box_layout = ipywidgets.Layout(height="500px", width="90%")
widget = ipywidgets.Textarea(value='', disabled=True, layout=box_layout)
display(widget)  # display widget

Textarea(value='', disabled=True, layout=Layout(height='500px', width='90%'))

In [12]:
isTorch = torch.cuda.is_available()
iface = gr.Interface(
    fn=online_diar_asr.audio_queue_launcher,
    inputs=[
        gr.Audio(source="microphone", type="numpy", streaming=True), 
        "state",
    ],
    outputs=[
        "textbox",
        "state",
    ],
    layout="horizontal",
    theme="huggingface",
    title=f"NeMo Streaming Conformer CTC Large - English, CUDA:{isTorch}",
    description="Demo for English speech recognition using Conformer-CTC",
    allow_flagging='never',
    live=True,
)
iface.launch(share=True)

# Keep refreshing the text widget with the latest transcript (interrupt the kernel to stop).
for _ in range(100000000):
    with open(f'{diar._out_dir}/print_script.sh', 'r') as f:
        widget.value = f.read()
    time.sleep(0.01)
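With `"state"` listed in both `inputs` and `outputs`, the function passed as `fn` is called as `fn(audio, state)` and returns `(text, state)`. With `type="numpy"`, the audio argument is a `(sample_rate, samples)` tuple. The toy callback below only illustrates that contract. It is not the implementation of `online_diar_asr.audio_queue_launcher`, and the state layout is invented for this sketch.

```python
def accumulate_audio(audio, state):
    """Toy streaming callback: report how much audio has been received so far."""
    if state is None:
        # First call of a session: create the per-session state.
        state = {"n_samples": 0, "sample_rate": None}
    if audio is not None:
        sample_rate, samples = audio
        state["sample_rate"] = sample_rate
        state["n_samples"] += len(samples)
    if state["sample_rate"]:
        seconds = state["n_samples"] / state["sample_rate"]
    else:
        seconds = 0.0
    return f"received {seconds:.2f} s of audio", state


# Two half-second chunks at 16 kHz, passed the way a streaming session would pass them.
text, state = accumulate_audio((16000, [0] * 8000), None)
text, state = accumulate_audio((16000, [0] * 8000), state)
print(text)  # received 1.00 s of audio
```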


    
    


IMPORTANT: You are using gradio version 3.4.0, however version 3.14.0 is available, please upgrade.
--------
Running on local URL:  http://127.0.0.1:7871
[NeMo I 2023-10-20 15:20:24 diarization_utils:1381] 1930.98ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:20:24 diarization_utils:1381] 32.14ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:20:24 online_diarizer:56] 0.05ms 'diarize_step'
[NeMo I 2023-10-20 15:20:24 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 1 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:20:24 diarization_utils:2020] Total ASR and Diarization ETA: 2.806 comp ETA 2.806
[NeMo I 2023-10-20 15:20:25 diarization_utils:1381] 139.11ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:20:25 diarization_utils:1381] 32.19ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:20:25 online_diarizer:56] 0.05ms 'diarize_step'
[NeMo I 2023-10-20 15:20:25 diarization_utils:899] Creating results for Sessi

[NeMo I 2023-10-20 15:20:31 online_diarizer:56] 7.67ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:31 online_diarizer:56] 8.37ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:31 online_diarizer:56] 17.88ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:20:31 online_diarizer:56] 53.74ms 'diarize_step'
[NeMo I 2023-10-20 15:20:31 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 1 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:20:31 diarization_utils:2020] Total ASR and Diarization ETA: 0.122 comp ETA 0.122
[NeMo I 2023-10-20 15:20:33 diarization_utils:1381] 14.31ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:20:33 diarization_utils:1381] 33.42ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:20:33 online_diarizer:56] 8.90ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:33 online_diarizer:56] 9.94ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:33 online_diarizer:56] 7.71

OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:20:39 diarization_utils:2020] Total ASR and Diarization ETA: 0.121 comp ETA 0.121
[NeMo I 2023-10-20 15:20:39 diarization_utils:1381] 7.58ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:20:40 diarization_utils:1381] 28.48ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 8.53ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 9.71ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 7.26ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 8.26ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 18.17ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 19.12ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 16.83ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:20:40 online_diarizer:56] 61.38ms 'di

[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 9.49ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 7.34ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 8.39ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 7.25ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 8.23ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 61.90ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:20:47 online_diarizer:56] 95.76ms 'diarize_step'
[NeMo I 2023-10-20 15:20:47 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 2 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:20:47 diarization_utils:2020] Total ASR and Diarization ETA: 0.156 comp ETA 0.156
[NeMo I 2023-10-20 15:20:48 diarization_utils:1381] 14.44ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:20:48 diarization_utils:1381]

[NeMo I 2023-10-20 15:20:54 online_diarizer:56] 18.78ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:20:54 online_diarizer:56] 76.93ms 'diarize_step'
[NeMo I 2023-10-20 15:20:54 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 3 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:20:54 diarization_utils:2020] Total ASR and Diarization ETA: 0.144 comp ETA 0.144
[NeMo I 2023-10-20 15:20:55 diarization_utils:1381] 14.35ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:20:55 diarization_utils:1381] 39.31ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:20:55 online_diarizer:56] 17.93ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:55 online_diarizer:56] 19.11ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:55 online_diarizer:56] 12.29ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:20:55 online_diarizer:56] 13.29ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:20:55 online_diarizer:56] 

[NeMo I 2023-10-20 15:21:02 diarization_utils:1381] 10.23ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:21:03 diarization_utils:1381] 36.30ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 8.69ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 9.63ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 7.99ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 8.84ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 17.24ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 18.07ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 53.65ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:21:03 online_diarizer:56] 99.40ms 'diarize_step'
[NeMo I 2023-10-20 15:21:03 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 3 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/prin

[NeMo I 2023-10-20 15:21:10 online_diarizer:56] 8.65ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:10 online_diarizer:56] 19.59ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:10 online_diarizer:56] 20.73ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:10 online_diarizer:56] 16.33ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:21:10 online_diarizer:56] 63.71ms 'diarize_step'
[NeMo I 2023-10-20 15:21:10 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 3 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:21:10 diarization_utils:2020] Total ASR and Diarization ETA: 0.126 comp ETA 0.126
[NeMo I 2023-10-20 15:21:11 diarization_utils:1381] 16.12ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:21:11 diarization_utils:1381] 36.88ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:21:11 online_diarizer:56] 8.74ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:11 online_diarizer:56] 9.

[NeMo I 2023-10-20 15:21:17 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 3 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:21:17 diarization_utils:2020] Total ASR and Diarization ETA: 0.139 comp ETA 0.139
[NeMo I 2023-10-20 15:21:18 diarization_utils:1381] 15.60ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:21:18 diarization_utils:1381] 50.96ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:21:18 online_diarizer:56] 18.35ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:18 online_diarizer:56] 19.73ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:18 online_diarizer:56] 12.96ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:18 online_diarizer:56] 14.39ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:18 online_diarizer:56] 18.28ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:18 online_diarizer:56] 19.57ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:18 online_d

[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 8.99ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 10.05ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 7.61ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 8.60ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 18.40ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 19.35ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 21.68ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:21:26 online_diarizer:56] 67.47ms 'diarize_step'
[NeMo I 2023-10-20 15:21:26 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 3 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:21:26 diarization_utils:2020] Total ASR and Diarization ETA: 0.126 comp ETA 0.126
[NeMo I 2023-10-20 15:21:27 diarization_utils:138

[NeMo I 2023-10-20 15:21:33 online_diarizer:56] 10.54ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:33 online_diarizer:56] 19.36ms '_perform_online_clustering'
[NeMo I 2023-10-20 15:21:33 online_diarizer:56] 63.92ms 'diarize_step'
[NeMo I 2023-10-20 15:21:33 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 3 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-20 15:21:33 diarization_utils:2020] Total ASR and Diarization ETA: 0.125 comp ETA 0.126
[NeMo I 2023-10-20 15:21:33 diarization_utils:1381] 7.45ms 'run_VAD_decoder_step'
[NeMo I 2023-10-20 15:21:34 diarization_utils:1381] 36.43ms 'run_ASR_decoder_step'
[NeMo I 2023-10-20 15:21:34 online_diarizer:56] 8.53ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:34 online_diarizer:56] 9.57ms '_extract_online_embeddings'
[NeMo I 2023-10-20 15:21:34 online_diarizer:56] 14.29ms '_run_embedding_extractor'
[NeMo I 2023-10-20 15:21:34 online_diarizer:56] 15.

AssertionError: 