# Streaming multispeaker ASR and diarization tutorial

In [3]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.

if False: 
    ## Install dependencies
    !pip install wget
    !apt-get install sox libsndfile1 ffmpeg
    !pip install text-unidecode

    ## Install NeMo
    BRANCH = 'main'
    !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

    ## Install TorchAudio
    !pip install torchaudio -f https://download.pytorch.org/whl/torch_stable.html



In [4]:
import os
import sys
print(sys.path)
# Put the local NeMo checkout first on the import path (adjust to your own clone).
sys.path.insert(0, '/home/taejinp/projects/online_diar/NeMo/')
import nemo
print("Nemo PATH:", nemo.__path__)
BRANCH = 'streaming_mulspk_asr'

['/home/taejinp/projects/online_diar/NeMo/tutorials/speaker_tasks', '/home/taejinp/projects/online_diar/NeMo/tutorials/speaker_tasks', '/usr/local/lib/python3.6/dist-packages/torchtext_mod', '/usr/local/lib/python3.6/dist-packages/torchtext_edit', '/usr/local/lib/python3.6/dist-packages/torchtext_edit/data', '/home/taejinp/anaconda3/envs/e2py310/lib/python310.zip', '/home/taejinp/anaconda3/envs/e2py310/lib/python3.10', '/home/taejinp/anaconda3/envs/e2py310/lib/python3.10/lib-dynload', '', '/home/taejinp/anaconda3/envs/e2py310/lib/python3.10/site-packages']
Nemo PATH: ['/home/taejinp/projects/online_diar/NeMo/nemo']


# Online Speaker Diarization

Speaker diarization is the process of determining "who spoke when" in a given audio clip. Depending on the method of processing, speaker diarization can be categorized into two types:

- **Offline Speaker Diarization**: This method assumes access to the entire audio clip. The transcription, indicating which speaker spoke at which time, is provided after processing the audio from start to end.

- **Online Speaker Diarization**: In this approach, the system only gradually gains access to short segments of the audio, typically a few seconds long. The transcription is generated and displayed in real-time as the segmented audio is being processed.
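
The difference between the two access patterns can be illustrated with a minimal sketch. It is not NeMo code: `stream_chunks` and the toy `audio` list are hypothetical names used only to show that an online system receives the signal piece by piece, while an offline system holds the full clip before it starts.

```python
from typing import Iterator, List, Sequence


def stream_chunks(samples: Sequence[float], chunk_len: int) -> Iterator[List[float]]:
    """Yield consecutive, non-overlapping chunks of at most `chunk_len` samples."""
    for start in range(0, len(samples), chunk_len):
        yield list(samples[start:start + chunk_len])


# Toy "audio": 10 samples, streamed 4 samples at a time.
audio = [0.1 * i for i in range(10)]

# Offline: the whole clip is available before any processing starts.
offline_view = list(audio)

# Online: the system sees one chunk at a time and must emit results as it goes.
seen = 0
for chunk in stream_chunks(audio, chunk_len=4):
    seen += len(chunk)
    print(f"received {len(chunk)} samples, {seen}/{len(audio)} seen so far")
```

The last chunk may be shorter than `chunk_len`; a real streaming pipeline has to decide whether to pad it, drop it, or wait for more audio.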


In [3]:
import sys
import socket
if socket.gethostname() == "aiapps-06052021":
    sys.path.insert(0,'/home/taejinp/projects/streaming_mulspk_asr/NeMo')
else:
    sys.path.insert(0,'/your/path/to/NeMo/')
    
import nemo
print("Using Nemo PATH:", nemo.__path__[0])

# !pip install gradio==2.9.0

Using Nemo PATH: /home/taejinp/projects/online_diar/NeMo/nemo


In [4]:
# Introduction to Online Speaker Diarization
"""
As covered in the Speaker Diarization Inference tutorial, speaker diarization is the task of segmenting audio recordings by speaker label, answering the question "Who Speaks When?".

While offline speaker diarization has access to the entire audio file and returns the speaker labels all at once, online speaker diarization is a streaming task that processes audio in small chunks.
Since only a small chunk of audio is available at a time, the online speaker diarization system needs to maintain a memory buffer that stores the history of past speakers. At the same time, the system needs to detect new speakers that are not yet in the memory buffer.

This tutorial covers the following:

- How to run online speaker diarization with NeMo
- How online speaker clustering and the memory buffer work together
"""

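
To make the idea of a speaker memory concrete, here is a toy sketch. It is not the online clustering algorithm used by NeMo; it assumes a much simpler scheme in which each known speaker is summarized by a running-mean centroid, and an incoming embedding opens a new speaker whenever its cosine similarity to every stored centroid falls below a threshold. The class name `ToySpeakerMemory`, the 2-D embeddings, and the threshold value 0.7 are all hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class ToySpeakerMemory:
    """Keeps one running-mean centroid per speaker seen so far."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def assign(self, emb: List[float]) -> int:
        """Return the speaker index for `emb`, creating a new speaker if needed."""
        sims = [cosine(emb, c) for c in self.centroids]
        if sims and max(sims) >= self.threshold:
            spk = sims.index(max(sims))
            n = self.counts[spk]
            # Update the running mean of the matched speaker.
            self.centroids[spk] = [(c * n + e) / (n + 1) for c, e in zip(self.centroids[spk], emb)]
            self.counts[spk] = n + 1
            return spk
        # No stored speaker is similar enough: register a new one.
        self.centroids.append(list(emb))
        self.counts.append(1)
        return len(self.centroids) - 1


memory = ToySpeakerMemory(threshold=0.7)
stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05], [0.1, 0.9]]
labels = [memory.assign(e) for e in stream]
print(labels)  # [0, 0, 1, 0, 1]
```

The memory grows only when a new speaker appears, so its size stays bounded by the number of speakers rather than by the length of the stream.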

In [5]:
from nemo.collections.asr.parts.utils.speaker_utils import audio_rttm_map
from nemo.core.config import hydra_runner
import gradio as gr
from scipy.io import wavfile
import numpy as np
import hydra
import os
import torch
from nemo.collections.asr.models import OnlineClusteringDiarizer
# from nemo.collections.asr.parts.utils.diarization_utils import ASR_DIAR_ONLINE
from nemo.collections.asr.parts.utils.diarization_utils import OnlineDiarWithASR


Read the YAML file for online diarization. You have to specify the following items:

- Input manifest file (if running in simulation mode)
- VAD model path
- Speaker embedding extractor model path
- Diarization decoder model path (coming soon)
- Punctuation model path (automatically downloaded from NGC)
- Language model path (coming soon)

Download the NeMo models and set their paths in the config struct.
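
For reference, a manifest is a text file with one JSON object per line. The sketch below writes a single-entry manifest with the standard library. The cells in this notebook read the `audio_filepath`, `rttm_filepath`, `offset`, and `duration` fields; the remaining fields follow the layout commonly used for NeMo diarization manifests and should be treated as an assumption here. All paths and the file name `online_diar_demo.json` are placeholders.

```python
import json
import os
import tempfile

# One session per line; every path below is a placeholder.
entry = {
    "audio_filepath": "/your/path/to/session_01.wav",
    "offset": 0,
    "duration": None,
    "label": "infer",        # assumed field, not read by this notebook
    "text": "-",             # assumed field, not read by this notebook
    "num_speakers": None,    # assumed field, not read by this notebook
    "rttm_filepath": "/your/path/to/session_01.rttm",
    "uem_filepath": None,    # assumed field, not read by this notebook
}

manifest_path = os.path.join(tempfile.mkdtemp(), "online_diar_demo.json")
with open(manifest_path, "w", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")

# Read it back the same way: one JSON object per non-empty line.
with open(manifest_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]
print(loaded[0]["audio_filepath"])
```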

In [6]:
import omegaconf

YAML_FILE="/home/taejinp/projects/streaming_mulspk_asr/NeMo/examples/speaker_tasks/diarization/conf/inference/online_diar_infer_general.yaml"
cfg = omegaconf.OmegaConf.load(YAML_FILE)
import socket

cfg.diarizer.out_dir = "./streaming_diar_output"

os.makedirs(cfg.diarizer.out_dir, exist_ok=True)
cfg.diarizer.asr.parameters.colored_text = False
print(f"socket.gethostname() {socket.gethostname()}")
if socket.gethostname() == "aiapps-06052021":
    # cfg.diarizer.manifest_filepath = "/home/taejinp/projects/data/diar_manifest_input/ch109.json"
    cfg.diarizer.manifest_filepath = "/home/taejinp/projects/data/diar_manifest_input/online_diar_demo_01.json"
    cfg.diarizer.vad.model_path = "/home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo"
    cfg.diarizer.speaker_embeddings.model_path = "/home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo"
    cfg.diarizer.asr.model_path = "/home/taejinp/gdrive/model/ASR_models/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo"
    cfg.diarizer.asr.parameters.punctuation_model_path = "punctuation_en_distilbert"
else:
    # Please download the following models and run the code. 

    # Download CH109 dataset at: https://drive.google.com/drive/folders/1ksq10H-NZbKRfMjEP_WWyBF_G0iAJt6b?usp=sharing
    cfg.diarizer.manifest_filepath = "/your/path/to/ch109.json"

    # Download streaming VAD model at: https://drive.google.com/file/d/1ab42CaYeTkuJSMsMsMLbSS9m5e1isJzx/view?usp=sharing
    cfg.diarizer.vad.model_path = "/your/path/to/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo"

    # Download titanet-m model at: https://drive.google.com/file/d/1xAgjm0udKogPrlQF6cdHLobEKHLY9azA/view?usp=sharing
    cfg.diarizer.speaker_embeddings.model_path = "/your/path/to/titanet-m.nemo"

    # Download Conformer-CTC ASR model at: https://drive.google.com/file/d/1Xg075IbiwL0szI4_a8gYmCPaG1UsgR6E/view?usp=sharing
    cfg.diarizer.asr.model_path = "/your/path/to/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo"

    cfg.diarizer.asr.parameters.punctuation_model_path = "punctuation_en_distilbert"

socket.gethostname() aiapps-06052021
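
Before loading any model, it can help to confirm that none of the `/your/path/to/...` placeholders from the cell above are still in place. The following sketch uses a plain nested `dict` as a stand-in for the loaded config; `unset_keys`, `REQUIRED_KEYS`, and `toy_cfg` are hypothetical names, and the list of required keys is taken from the assignments in the cell above.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = [
    "diarizer.manifest_filepath",
    "diarizer.vad.model_path",
    "diarizer.speaker_embeddings.model_path",
    "diarizer.asr.model_path",
]


def unset_keys(cfg: Dict[str, Any], dotted_keys: List[str]) -> List[str]:
    """Return the dotted keys that are missing, None, or still a placeholder path."""
    missing = []
    for dotted in dotted_keys:
        node: Any = cfg
        for part in dotted.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None or (isinstance(node, str) and node.startswith("/your/path/to")):
            missing.append(dotted)
    return missing


# Toy config: the manifest is still a placeholder and the speaker model is unset.
toy_cfg = {
    "diarizer": {
        "manifest_filepath": "/your/path/to/ch109.json",
        "vad": {"model_path": "/models/vad.nemo"},
        "speaker_embeddings": {"model_path": None},
        "asr": {"model_path": "/models/asr.nemo"},
    }
}
print(unset_keys(toy_cfg, REQUIRED_KEYS))
```

With the real config, one option would be to convert it to a plain container first and pass that to the same function.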


Initialize the OnlineDiarWithASR and OnlineClusteringDiarizer classes.

In [7]:
# %%html
# <style>
# .output_wrapper, .output {
#     height:auto !important;
#     max-height:400px; 
# }
# .output_scroll {
#     box-shadow:none !important;
#     webkit-box-shadow:none !important;
# }
# </style>

In [8]:
from nemo.collections.asr.models import OnlineClusteringDiarizer
import os

params = {}
params['use_cuda'] = True
AUDIO_RTTM_MAP = audio_rttm_map(cfg.diarizer.manifest_filepath)

diar = OnlineClusteringDiarizer(cfg)
from nemo.collections.asr.parts.utils.diarization_utils import OnlineDiarWithASR, write_txt

cfg.diarizer.simulation_uniq_id='citadel_ken'
cfg.diarizer.out_dir = '/home/taejinp/projects/run_time/streaming_diar_output_univ'
cfg.diarizer.asr.parameters.streaming_simulation=True
cfg.diarizer.asr.parameters.enforce_real_time=True 
cfg.diarizer.asr.parameters.colored_text=False
 
fn = os.path.join(cfg.diarizer.out_dir, "print_script.sh")
# os.remove(fn) if os.path.exists(fn) else None

diar.uniq_id = cfg.diarizer.simulation_uniq_id 
diar.single_audio_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['audio_filepath']
diar.rttm_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['rttm_filepath']
# diar.rttm_file_path = None # DER calculation slows down online diarization speed
diar._init_segment_variables()


online_diar_asr = OnlineDiarWithASR(cfg=cfg)
diar = online_diar_asr.diar

diar.device = online_diar_asr.device
online_diar_asr.reset()

# cfg.diarizer.asr.parameters.streaming_simulation=True
cfg.diarizer.asr.parameters.streaming_simulation=False

if not cfg.diarizer.asr.parameters.streaming_simulation:
    cfg.diarizer.asr.parameters.enforce_real_time=False

[NeMo I 2023-10-26 17:26:06 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-26 17:26:07 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-26 17:26:07 features:289] PADDING: 16
[NeMo I 2023-10-26 17:26:08 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-26 17:26:08 clustering_diarizer:120] VAD model loaded locally from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo


[NeMo W 2023-10-26 17:26:08 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/combined_fisher_swbd_voxceleb12_librispeech/train.json
    sample_rate: 16000
    labels: null
    batch_size: 64
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      noise:
        manifest_path: /manifests/noise/rir_noise_manifest.json
        prob: 0.5
        min_snr_db: 0
        max_snr_db: 15
      speed:
        prob: 0.5
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
    num_workers: 15
    pin_memory: true
    
[NeMo W 2023-10-26 17:26:08 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method 

[NeMo I 2023-10-26 17:26:08 features:289] PADDING: 16
[NeMo I 2023-10-26 17:26:08 save_restore_connector:249] Model EncDecSpeakerLabelModel was successfully restored from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo.
[NeMo I 2023-10-26 17:26:08 clustering_diarizer:145] Speaker Model restored locally from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo
[NeMo I 2023-10-26 17:26:08 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-26 17:26:08 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-26 17:26:08 features:289] PADDING: 16
[NeMo I 2023-10-26 17:26:09 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-26 17:26:09 clustering_diarizer:120] VAD model loaded locally from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo


[NeMo W 2023-10-26 17:26:09 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/combined_fisher_swbd_voxceleb12_librispeech/train.json
    sample_rate: 16000
    labels: null
    batch_size: 64
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      noise:
        manifest_path: /manifests/noise/rir_noise_manifest.json
        prob: 0.5
        min_snr_db: 0
        max_snr_db: 15
      speed:
        prob: 0.5
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
    num_workers: 15
    pin_memory: true
    
[NeMo W 2023-10-26 17:26:09 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method 

[NeMo I 2023-10-26 17:26:09 features:289] PADDING: 16
[NeMo I 2023-10-26 17:26:09 save_restore_connector:249] Model EncDecSpeakerLabelModel was successfully restored from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo.
[NeMo I 2023-10-26 17:26:09 clustering_diarizer:145] Speaker Model restored locally from /home/taejinp/Downloads/titanet_target_fixed/titanet-l.nemo
[NeMo I 2023-10-26 17:26:09 speaker_utils:93] Number of files to diarize: 4


[NeMo W 2023-10-26 17:26:09 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2023-10-26 17:26:09 features:289] PADDING: 16
[NeMo I 2023-10-26 17:26:09 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/taejinp/gdrive/model/VAD_models/mVAD_lin_marblenet-3x2x64-4N-256bs-50e-0.01lr-0.001wd.nemo.
[NeMo I 2023-10-26 17:26:12 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2023-10-26 17:26:12 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/asr_datasets_prebuilt/RIVA_ASR_SET_3.0_tarred/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 16
    shuffle: true
    is_tarred: true
    tarred_audio_filepaths: /data/asr_datasets_prebuilt/RIVA_ASR_SET_3.0_tarred/audio__OP_0..4095_CL_.tar
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 1024
    num_workers: 16
    pin_memory: true
    
[NeMo W 2023-10-26 17:26:12 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /data/Jasper_NEM

[NeMo I 2023-10-26 17:26:12 features:289] PADDING: 0
[NeMo I 2023-10-26 17:26:13 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from /home/taejinp/gdrive/model/ASR_models/Conformer-CTC-BPE_large_Riva_ASR_set_3.0_ep60.nemo.


[NeMo W 2023-10-26 17:26:13 decoder_timestamps_utils:66] `ctc_decode` was set to True. Note that this is ignored.


[NeMo I 2023-10-26 17:26:13 features:289] PADDING: 0
[NeMo I 2023-10-26 17:26:14 features:289] PADDING: 0
[NeMo I 2023-10-26 17:26:14 features:289] PADDING: 0


Let's run a simulated audio stream to check that the streaming system works properly. While the following cell is running, the transcription is generated in real time and written to ./streaming_diar_output/print_script.sh, which can be viewed with "streaming_diarization_viewer.ipynb".
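
A simulated stream only behaves like a live one if chunks are not consumed faster than they would arrive from a microphone. The sketch below shows one generic way to pace a loop to real time. It is an illustration under stated assumptions, not a description of how the `enforce_real_time` option is implemented in NeMo; `run_paced`, `chunk_sec`, and `step` are hypothetical names.

```python
import time
from typing import Callable, Iterable, List


def run_paced(chunks: Iterable[List[float]], chunk_sec: float,
              step: Callable[[List[float]], None]) -> List[float]:
    """Call `step` on each chunk, sleeping so chunks are consumed no faster than real time.

    Returns, per chunk, how many seconds `step` overran the chunk duration (0.0 if none).
    """
    overruns = []
    for chunk in chunks:
        started = time.perf_counter()
        step(chunk)
        remaining = chunk_sec - (time.perf_counter() - started)
        if remaining > 0:
            time.sleep(remaining)  # processing was faster than real time: wait
        overruns.append(max(0.0, -remaining))
    return overruns


# Three dummy chunks, each standing for 20 ms of audio.
processed: List[List[float]] = []
t0 = time.perf_counter()
overruns = run_paced([[0.0] * 4] * 3, chunk_sec=0.02, step=processed.append)
total = time.perf_counter() - t0
print(f"{len(processed)} chunks in {total:.3f}s, overruns: {overruns}")
```

A non-zero overrun means the processing step took longer than the audio it covers, which is the condition under which a live system would start to fall behind.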


In [9]:
import ipywidgets
import time
box_layout = ipywidgets.Layout(height="500px", width="90%")
widget = ipywidgets.Textarea(value='', disabled=True, layout=box_layout)
display(widget)  # display widget


Textarea(value='', disabled=True, layout=Layout(height='500px', width='90%'))

In [10]:
diar.uniq_id = cfg.diarizer.simulation_uniq_id
online_diar_asr.get_audio_rttm_map(diar.uniq_id)
diar.single_audio_file_path = diar.AUDIO_RTTM_MAP[diar.uniq_id]['audio_filepath']
online_diar_asr.rttm_file_path = diar.AUDIO_RTTM_MAP[diar.uniq_id]['rttm_filepath']

diar._init_segment_variables()
diar.device = online_diar_asr.device
write_txt(f"{diar._out_dir}/print_script.sh", "")

if cfg.diarizer.asr.parameters.streaming_simulation:
    # Simulation mode: read the whole file, then feed it to the pipeline chunk by chunk.
    samplerate, sdata = wavfile.read(diar.single_audio_file_path)
    session_meta = diar.AUDIO_RTTM_MAP[diar.uniq_id]
    if session_meta['offset'] and session_meta['duration']:
        # Crop to the [offset, offset + duration) window from the manifest (seconds -> samples).
        offset = samplerate * session_meta['offset']
        duration = samplerate * session_meta['duration']
        stt, end = int(offset), int(offset + duration)
        sdata = sdata[stt:end]

    shift = online_diar_asr.CHUNK_SIZE
    for index in range(int(np.floor(sdata.shape[0] / online_diar_asr.n_frame_len))):
        sample_audio = sdata[shift * index:shift * (index + 1)]
        online_diar_asr.buffer_counter = index
        online_diar_asr.streaming_step(sample_audio)

        # Refresh the widget with the transcript written so far.
        with open(f'{diar._out_dir}/print_script.sh', 'r') as f:
            widget.value = f.read()
else:
    isTorch = torch.cuda.is_available()
    iface = gr.Interface(
    fn=online_diar_asr.audio_queue_launcher,
    inputs=[
        gr.Audio(source="microphone", type="numpy", streaming=True), 
        "state",
    ],
    outputs=[
        "textbox",
        "state",
    ],
    layout="horizontal",
    theme="huggingface",
    title=f"NeMo Streaming Conformer CTC Large - English, CUDA:{isTorch}",
    description="Demo for English speech recognition using Conformer CTC",
    allow_flagging='never',
    live=True,
    )
    iface.launch(share=True)
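
The simulation branch above converts the manifest's offset and duration from seconds to sample indices and then walks the cropped signal in fixed-size chunks. The arithmetic can be checked in isolation with the sketch below. `crop_bounds` and `num_full_chunks` are hypothetical helpers, and the chunk size of 8000 samples (0.5 s at 16 kHz) is an example value, not the one used by `OnlineDiarWithASR`. Unlike the cell above, this version also crops when the offset is 0 and falls back to the end of the file when no duration is given.

```python
from typing import Optional, Tuple


def crop_bounds(sample_rate: int, n_samples: int,
                offset_sec: Optional[float], duration_sec: Optional[float]) -> Tuple[int, int]:
    """Convert an (offset, duration) window in seconds to [start, end) sample indices."""
    start = int(sample_rate * (offset_sec or 0.0))
    if duration_sec is None:
        end = n_samples
    else:
        end = min(n_samples, start + int(sample_rate * duration_sec))
    return start, end


def num_full_chunks(n_samples: int, chunk_size: int) -> int:
    """Number of complete chunks that fit in the signal; a trailing partial chunk is dropped."""
    return n_samples // chunk_size


# A 60 s file at 16 kHz, cropped to the 30 s window that starts at 5 s.
start, end = crop_bounds(sample_rate=16000, n_samples=16000 * 60, offset_sec=5.0, duration_sec=30.0)
print(start, end, num_full_chunks(end - start, chunk_size=8000))  # 80000 560000 60
```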



    
    


IMPORTANT: You are using gradio version 3.4.0, however version 3.14.0 is available, please upgrade.
--------
Running on local URL:  http://127.0.0.1:7860
[NeMo I 2023-10-26 17:26:35 diarization_utils:1381] 2051.99ms 'run_VAD_decoder_step'
[NeMo I 2023-10-26 17:26:35 diarization_utils:1381] 37.81ms 'run_ASR_decoder_step'
[NeMo I 2023-10-26 17:26:35 online_diarizer:56] 0.07ms 'diarize_step'
[NeMo I 2023-10-26 17:26:35 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 1 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-26 17:26:35 diarization_utils:2020] Total ASR and Diarization ETA: 2.974 comp ETA 2.975
[NeMo I 2023-10-26 17:26:36 diarization_utils:1381] 142.35ms 'run_VAD_decoder_step'
[NeMo I 2023-10-26 17:26:36 diarization_utils:1381] 32.81ms 'run_ASR_decoder_step'
[NeMo I 2023-10-26 17:26:36 online_diarizer:56] 0.06ms 'diarize_step'
[NeMo I 2023-10-26 17:26:36 diarization_utils:899] Creating results for Sessi

[NeMo I 2023-10-26 17:26:43 online_diarizer:56] 7.97ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:43 online_diarizer:56] 8.82ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:43 online_diarizer:56] 16.50ms '_perform_online_clustering'
[NeMo I 2023-10-26 17:26:43 online_diarizer:56] 52.07ms 'diarize_step'
[NeMo I 2023-10-26 17:26:43 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 1 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-26 17:26:43 diarization_utils:2020] Total ASR and Diarization ETA: 0.116 comp ETA 0.117
[NeMo I 2023-10-26 17:26:44 diarization_utils:1381] 16.75ms 'run_VAD_decoder_step'
[NeMo I 2023-10-26 17:26:44 diarization_utils:1381] 41.78ms 'run_ASR_decoder_step'
[NeMo I 2023-10-26 17:26:44 online_diarizer:56] 9.24ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:44 online_diarizer:56] 10.38ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:44 online_diarizer:56] 8.4

OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-26 17:26:50 diarization_utils:2020] Total ASR and Diarization ETA: 0.131 comp ETA 0.131
[NeMo I 2023-10-26 17:26:51 diarization_utils:1381] 7.63ms 'run_VAD_decoder_step'
[NeMo I 2023-10-26 17:26:51 diarization_utils:1381] 32.27ms 'run_ASR_decoder_step'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 18.43ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 20.10ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 17.52ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 18.73ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 11.73ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 12.80ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 5.70ms '_perform_online_clustering'
[NeMo I 2023-10-26 17:26:51 online_diarizer:56] 64.57ms 

[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 9.25ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 7.75ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 8.35ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 7.85ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 8.54ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 48.45ms '_perform_online_clustering'
[NeMo I 2023-10-26 17:26:59 online_diarizer:56] 82.20ms 'diarize_step'
[NeMo I 2023-10-26 17:26:59 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 1 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-26 17:26:59 diarization_utils:2020] Total ASR and Diarization ETA: 0.136 comp ETA 0.136
[NeMo I 2023-10-26 17:27:07 diarization_utils:1381] 17.30ms 'run_VAD_decoder_step'
[NeMo I 2023-10-26 17:27:07 diarization_utils:1381]

[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 14.82ms '_perform_online_clustering'
[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 63.61ms 'diarize_step'
[NeMo I 2023-10-26 17:27:14 diarization_utils:899] Creating results for Session: citadel_ken n_spk: 2 
OUTPUT DIR: /home/taejinp/projects/run_time/streaming_diar_output_univ/print_script.sh
[NeMo I 2023-10-26 17:27:14 diarization_utils:2020] Total ASR and Diarization ETA: 0.122 comp ETA 0.123
[NeMo I 2023-10-26 17:27:14 diarization_utils:1381] 16.43ms 'run_VAD_decoder_step'
[NeMo I 2023-10-26 17:27:14 diarization_utils:1381] 34.54ms 'run_ASR_decoder_step'
[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 12.84ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 14.02ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 15.70ms '_run_embedding_extractor'
[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 16.76ms '_extract_online_embeddings'
[NeMo I 2023-10-26 17:27:14 online_diarizer:56] 

*** Failed to connect to ec2.gradio.app:22: [Errno 110] Connection timed out


AssertionError: 

Now open streaming_diarization_viewer.ipynb to check the real-time output.

In [None]:
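while True:
    # Poll the transcript file and mirror its content in the widget (interrupt the kernel to stop).
    with open(f'{diar._out_dir}/print_script.sh', 'r') as f:
        widget.value = f.read()
    time.sleep(0.1)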
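
The loop above never exits on its own and rewrites the widget even when nothing has changed. A bounded variant is sketched below under the assumption that a timeout is acceptable for your use case: it stops after `timeout_sec` and invokes a callback only when the file content differs from the previous read. `follow_file` and its parameters are hypothetical names, and the demo uses a temporary file instead of the real `print_script.sh`.

```python
import os
import tempfile
import time
from typing import Callable, List, Optional


def follow_file(path: str, on_change: Callable[[str], None],
                poll_sec: float = 0.1, timeout_sec: float = 5.0) -> int:
    """Poll `path` until the timeout and call `on_change` whenever its text changes.

    Returns the number of times `on_change` was called.
    """
    deadline = time.monotonic() + timeout_sec
    last_text: Optional[str] = None
    updates = 0
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                text = f.read()
            if text != last_text:
                on_change(text)
                last_text = text
                updates += 1
        time.sleep(poll_sec)
    return updates


# Demo on a temporary file that is written once and then left unchanged.
demo_path = os.path.join(tempfile.mkdtemp(), "print_script.sh")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("speaker_0: hello\n")

collected: List[str] = []
n_updates = follow_file(demo_path, collected.append, poll_sec=0.05, timeout_sec=0.3)
print(n_updates, collected)
```

In the notebook, the callback could set `widget.value`, for example `follow_file(f"{diar._out_dir}/print_script.sh", lambda text: setattr(widget, "value", text))`.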

In [None]:
# cfg.diarizer.asr.parameters.streaming_simulation=False
# cfg.diarizer.asr.parameters.enforce_real_time=False
# online_diar_asr = OnlineDiarWithASR(cfg=cfg)
# diar = online_diar_asr.diar
# write_txt(f"{diar._out_dir}/print_script.sh", "")

# diar.uniq_id = cfg.diarizer.simulation_uniq_id 
# diar.single_audio_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['audio_filepath']
# diar.rttm_file_path = AUDIO_RTTM_MAP[diar.uniq_id]['rttm_filepath']
# # diar.rttm_file_path = None # DER calculation slows down online diarization speed
# diar._init_segment_variables()



In [None]:
# import ipywidgets
# import time
# box_layout = ipywidgets.Layout(height="500px", width="90%")
# widget_gradio = ipywidgets.Textarea(value='', disabled=True, layout=box_layout)
# display(widget_gradio)  # display widget
# write_txt(f"{diar._out_dir}/print_script.sh", "")

In [None]:
# isTorch = torch.cuda.is_available()
# iface = gr.Interface(
#     fn=online_diar_asr.audio_queue_launcher,
#     inputs=[
#         gr.Audio(source="microphone", type="numpy", streaming=True), 
#         "state",
#     ],
#     outputs=[
#         "textbox",
#         "state",
#     ],
#     layout="horizontal",
#     theme="huggingface",
#     title=f"NeMo Streaming Conformer CTC Large - English, CUDA:{isTorch}",
#     description="Demo for English speech recognition using Conformer Transducers",
#     allow_flagging='never',
#     live=True,
# )
# iface.launch(share=True)

# for index in range(100000000):
#     widget_gradio.value += f" update {index}"
#     fp = open(f'{diar._out_dir}/print_script.sh','r').read()
#     widget_gradio.value = fp
#     time.sleep(0.01)
