# Speech-to-Text | Get transcription with speakers (OpenAI Whisper + NeMo Speaker Diarization)

- Author: [Pierre GUILLOU]()
- Full Credit: this notebook is just an update of the notebook [Whisper_Transcription_%2B_NeMo_Diarization.ipynb](https://github.com/MahmoudAshraf97/whisper-diarization/blob/main/Whisper_Transcription_%2B_NeMo_Diarization.ipynb) of [Mahmoud Ashraf](https://github.com/MahmoudAshraf97) (all texts were kept)
- Date: 07/12/2023
- Blog post: [Speech-to-Text | Get transcription WITH SPEAKERS from large audio file in any language (OpenAI Whisper + NeMo Speaker Diarization)](https://medium.com/@pierre_guillou/speech-to-text-get-transcription-with-speakers-from-large-audio-file-in-any-language-openai-8da2312f1617)

## Overview

[source: [Mahmoud Ashraf](https://github.com/MahmoudAshraf97)] This notebook combines Whisper's ASR capabilities with Voice Activity Detection (VAD) and speaker embeddings to identify the speaker of each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase speaker-embedding accuracy. The transcription is then generated with Whisper, and its timestamps are corrected and aligned with WhisperX to minimize diarization errors caused by time shifts. Next, the audio is passed to MarbleNet for VAD and segmentation to exclude silences, and TitaNet extracts speaker embeddings to identify the speaker of each segment. Finally, the result is matched against the WhisperX timestamps to assign a speaker to each word, and the assignment is realigned using punctuation models to compensate for minor time shifts.
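
The word-to-speaker step of this pipeline can be illustrated with a minimal, self-contained sketch. It is not the notebook's implementation (that is `get_words_speaker_mapping` in the Helper Functions section); the function name `assign_speakers` and the toy data are assumptions made for illustration. Each word is labelled with the speaker turn that contains the word's start time.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn contains the word's start time.

    words: list of (start_ms, end_ms, text), sorted by start time
    turns: list of (start_ms, end_ms, speaker_id), sorted by start time
    """
    labelled = []
    turn_idx = 0
    for start, end, text in words:
        # advance while the word starts after the current turn ends,
        # but never step past the last turn
        while start > turns[turn_idx][1] and turn_idx < len(turns) - 1:
            turn_idx += 1
        labelled.append((text, turns[turn_idx][2]))
    return labelled


# toy data (hypothetical): two speaker turns and four words
turns = [(0, 1000, 0), (1000, 2500, 1)]
words = [
    (100, 400, "Hello"),
    (450, 900, "there."),
    (1100, 1500, "Hi,"),
    (1600, 2000, "Bob."),
]
print(assign_speakers(words, turns))
```

Anchoring on the start time is only one choice; the notebook's helper also accepts the middle or the end of the word as the anchor.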

## WARNING

- This notebook runs on (free) Google Colab.
- It was tested on a T4 GPU.

## Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# check there is at least a T4 GPU
!nvidia-smi

Mon Aug 26 02:17:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8              11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Installing Dependencies

After installing the libraries, it is necessary to restart the runtime (session) in Google Colab.

In [None]:
%%capture
!pip install git+https://github.com/m-bain/whisperX.git@a5dca2cc65b1a37f32a347e574b2c56af3a7434a
!pip install --no-build-isolation nemo_toolkit[asr]==1.21.0
!pip install git+https://github.com/facebookresearch/demucs#egg=demucs
!pip install deepmultilingualpunctuation
!pip install wget pydub
# !pip install --force-reinstall torch torchaudio torchvision
# !pip uninstall -y nvidia-cudnn-cu12
!pip install numba==0.58.0
!pip install unidecode

In [None]:
!pip install whisperx

Collecting whisperx
  Downloading whisperx-3.1.5-py3-none-any.whl.metadata (13 kB)
Collecting faster-whisper==1.0.1 (from whisperx)
  Downloading faster_whisper-1.0.1-py3-none-any.whl.metadata (14 kB)
Collecting pyannote.audio==3.1.1 (from whisperx)
  Downloading pyannote.audio-3.1.1-py2.py3-none-any.whl.metadata (9.3 kB)
Collecting av==11.* (from faster-whisper==1.0.1->whisperx)
  Downloading av-11.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.5 kB)
Collecting ctranslate2<5,>=4.0 (from faster-whisper==1.0.1->whisperx)
  Downloading ctranslate2-4.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting onnxruntime<2,>=1.14 (from faster-whisper==1.0.1->whisperx)
  Downloading onnxruntime-1.19.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.3 kB)
Collecting asteroid-filterbanks>=0.4 (from pyannote.audio==3.1.1->whisperx)
  Downloading asteroid_filterbanks-0.4.0-py3-none-any.whl.metadata (3.3 kB)

**RESTART the runtime now!**

### Import libraries

In [None]:
import wget

In [None]:
import os
# import wget
from omegaconf import OmegaConf
import json
import shutil
from faster_whisper import WhisperModel
import whisperx
import torch
from pydub import AudioSegment
from nemo.collections.asr.models.msdd_models import NeuralDiarizer
from deepmultilingualpunctuation import PunctuationModel
import re
import logging
import nltk
from whisperx.alignment import DEFAULT_ALIGN_MODELS_HF, DEFAULT_ALIGN_MODELS_TORCH
from whisperx.utils import LANGUAGES, TO_LANGUAGE_CODE

from unidecode import unidecode
import pathlib
from pathlib import Path

[NeMo W 2024-08-26 02:26:15 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.


### Helper Functions

In [None]:
punct_model_langs = [
    "en",
    "fr",
    "de",
    "es",
    "it",
    "nl",
    "pt",
    "bg",
    "pl",
    "cs",
    "sk",
    "sl",
]
wav2vec2_langs = list(DEFAULT_ALIGN_MODELS_TORCH.keys()) + list(
    DEFAULT_ALIGN_MODELS_HF.keys()
)

whisper_langs = sorted(LANGUAGES.keys()) + sorted(
    [k.title() for k in TO_LANGUAGE_CODE.keys()]
)


def create_config(output_dir, DOMAIN_TYPE = "telephonic"):
    # DOMAIN_TYPE: can be meeting, telephonic, or general based on domain type of the audio file
    CONFIG_FILE_NAME = f"diar_infer_{DOMAIN_TYPE}.yaml"
    CONFIG_URL = f"https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/speaker_tasks/diarization/conf/inference/{CONFIG_FILE_NAME}"
    MODEL_CONFIG = os.path.join(output_dir, CONFIG_FILE_NAME)
    if not os.path.exists(MODEL_CONFIG):
        MODEL_CONFIG = wget.download(CONFIG_URL, output_dir)

    config = OmegaConf.load(MODEL_CONFIG)

    data_dir = os.path.join(output_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    meta = {
        "audio_filepath": os.path.join(output_dir, "mono_file.wav"),
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "rttm_filepath": None,
        "uem_filepath": None,
    }
    with open(os.path.join(data_dir, "input_manifest.json"), "w") as fp:
        json.dump(meta, fp)
        fp.write("\n")

    pretrained_vad = "vad_multilingual_marblenet"
    pretrained_speaker_model = "titanet_large"
    config.num_workers = 0  # Workaround for multiprocessing hanging with ipython issue
    config.diarizer.manifest_filepath = os.path.join(data_dir, "input_manifest.json")
    config.diarizer.out_dir = (
        output_dir  # Directory to store intermediate files and prediction outputs
    )

    config.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
    config.diarizer.oracle_vad = (
        False  # compute VAD provided with model_path to vad config
    )
    config.diarizer.clustering.parameters.oracle_num_speakers = False

    # Here, we use our in-house pretrained NeMo VAD model
    config.diarizer.vad.model_path = pretrained_vad
    config.diarizer.vad.parameters.onset = 0.8
    config.diarizer.vad.parameters.offset = 0.6
    config.diarizer.vad.parameters.pad_offset = -0.05
    config.diarizer.msdd_model.model_path = (
        "diar_msdd_telephonic"  # Telephonic speaker diarization model
    )

    return config


def get_word_ts_anchor(s, e, option="start"):
    if option == "end":
        return e
    elif option == "mid":
        return (s + e) / 2
    return s


def get_words_speaker_mapping(wrd_ts, spk_ts, word_anchor_option="start"):
    s, e, sp = spk_ts[0]
    wrd_pos, turn_idx = 0, 0
    wrd_spk_mapping = []
    for wrd_dict in wrd_ts:
        ws, we, wrd = (
            int(wrd_dict["start"] * 1000),
            int(wrd_dict["end"] * 1000),
            wrd_dict["word"],
        )
        wrd_pos = get_word_ts_anchor(ws, we, word_anchor_option)
        while wrd_pos > float(e):
            turn_idx += 1
            turn_idx = min(turn_idx, len(spk_ts) - 1)
            s, e, sp = spk_ts[turn_idx]
            if turn_idx == len(spk_ts) - 1:
                e = get_word_ts_anchor(ws, we, option="end")
        wrd_spk_mapping.append(
            {"word": wrd, "start_time": ws, "end_time": we, "speaker": sp}
        )
    return wrd_spk_mapping


sentence_ending_punctuations = ".?!"


def get_first_word_idx_of_sentence(word_idx, word_list, speaker_list, max_words):
    is_word_sentence_end = (
        lambda x: x >= 0 and word_list[x][-1] in sentence_ending_punctuations
    )
    left_idx = word_idx
    while (
        left_idx > 0
        and word_idx - left_idx < max_words
        and speaker_list[left_idx - 1] == speaker_list[left_idx]
        and not is_word_sentence_end(left_idx - 1)
    ):
        left_idx -= 1

    return left_idx if left_idx == 0 or is_word_sentence_end(left_idx - 1) else -1


def get_last_word_idx_of_sentence(word_idx, word_list, max_words):
    is_word_sentence_end = (
        lambda x: x >= 0 and word_list[x][-1] in sentence_ending_punctuations
    )
    right_idx = word_idx
    while (
        right_idx < len(word_list) - 1
        and right_idx - word_idx < max_words
        and not is_word_sentence_end(right_idx)
    ):
        right_idx += 1

    return (
        right_idx
        if right_idx == len(word_list) - 1 or is_word_sentence_end(right_idx)
        else -1
    )


def get_realigned_ws_mapping_with_punctuation(
    word_speaker_mapping, max_words_in_sentence=50
):
    is_word_sentence_end = (
        lambda x: x >= 0
        and word_speaker_mapping[x]["word"][-1] in sentence_ending_punctuations
    )
    wsp_len = len(word_speaker_mapping)

    words_list, speaker_list = [], []
    for k, line_dict in enumerate(word_speaker_mapping):
        word, speaker = line_dict["word"], line_dict["speaker"]
        words_list.append(word)
        speaker_list.append(speaker)

    k = 0
    while k < len(word_speaker_mapping):
        line_dict = word_speaker_mapping[k]
        if (
            k < wsp_len - 1
            and speaker_list[k] != speaker_list[k + 1]
            and not is_word_sentence_end(k)
        ):
            left_idx = get_first_word_idx_of_sentence(
                k, words_list, speaker_list, max_words_in_sentence
            )
            right_idx = (
                get_last_word_idx_of_sentence(
                    k, words_list, max_words_in_sentence - k + left_idx - 1
                )
                if left_idx > -1
                else -1
            )
            if min(left_idx, right_idx) == -1:
                k += 1
                continue

            spk_labels = speaker_list[left_idx : right_idx + 1]
            mod_speaker = max(set(spk_labels), key=spk_labels.count)
            if spk_labels.count(mod_speaker) < len(spk_labels) // 2:
                k += 1
                continue

            speaker_list[left_idx : right_idx + 1] = [mod_speaker] * (
                right_idx - left_idx + 1
            )
            k = right_idx

        k += 1

    k, realigned_list = 0, []
    while k < len(word_speaker_mapping):
        line_dict = word_speaker_mapping[k].copy()
        line_dict["speaker"] = speaker_list[k]
        realigned_list.append(line_dict)
        k += 1

    return realigned_list


def get_sentences_speaker_mapping(word_speaker_mapping, spk_ts):
    sentence_checker = nltk.tokenize.PunktSentenceTokenizer().text_contains_sentbreak
    s, e, spk = spk_ts[0]
    prev_spk = spk

    snts = []
    snt = {"speaker": f"Speaker {spk}", "start_time": s, "end_time": e, "text": ""}

    for wrd_dict in word_speaker_mapping:
        wrd, spk = wrd_dict["word"], wrd_dict["speaker"]
        s, e = wrd_dict["start_time"], wrd_dict["end_time"]
        if spk != prev_spk or sentence_checker(snt["text"] + " " + wrd):
            snts.append(snt)
            snt = {
                "speaker": f"Speaker {spk}",
                "start_time": s,
                "end_time": e,
                "text": "",
            }
        else:
            snt["end_time"] = e
        snt["text"] += wrd + " "
        prev_spk = spk

    snts.append(snt)
    return snts


def get_speaker_aware_transcript(sentences_speaker_mapping, f):
    previous_speaker = sentences_speaker_mapping[0]["speaker"]
    f.write(f"{previous_speaker}: ")

    for sentence_dict in sentences_speaker_mapping:
        speaker = sentence_dict["speaker"]
        sentence = sentence_dict["text"]

        # If this speaker doesn't match the previous one, start a new paragraph
        if speaker != previous_speaker:
            f.write(f"\n\n{speaker}: ")
            previous_speaker = speaker

        # No matter what, write the current sentence
        f.write(sentence + " ")


def format_timestamp(
    milliseconds: int, always_include_hours: bool = False, decimal_marker: str = "."
):
    assert milliseconds >= 0, "non-negative timestamp expected"

    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000

    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000

    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000

    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return (
        f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
    )


def write_srt(transcript, file):
    """
    Write a transcript to a file in SRT format.

    """
    for i, segment in enumerate(transcript, start=1):
        # write srt lines
        print(
            f"{i}\n"
            f"{format_timestamp(segment['start_time'], always_include_hours=True, decimal_marker=',')} --> "
            f"{format_timestamp(segment['end_time'], always_include_hours=True, decimal_marker=',')}\n"
            f"{segment['speaker']}: {segment['text'].strip().replace('-->', '->')}\n",
            file=file,
            flush=True,
        )


def find_numeral_symbol_tokens(tokenizer):
    numeral_symbol_tokens = [
        -1,
    ]
    for token, token_id in tokenizer.get_vocab().items():
        has_numeral_symbol = any(c in "0123456789%$£" for c in token)
        if has_numeral_symbol:
            numeral_symbol_tokens.append(token_id)
    return numeral_symbol_tokens


def _get_next_start_timestamp(word_timestamps, current_word_index):
    # if current word is the last word
    if current_word_index == len(word_timestamps) - 1:
        return word_timestamps[current_word_index]["start"]

    next_word_index = current_word_index + 1
    while next_word_index < len(word_timestamps):
        if word_timestamps[next_word_index].get("start") is None:
            # if next word doesn't have a start timestamp
            # merge it with the current word and delete it
            word_timestamps[current_word_index]["word"] += (
                " " + word_timestamps[next_word_index]["word"]
            )

            word_timestamps[next_word_index]["word"] = None
            next_word_index += 1

        else:
            return word_timestamps[next_word_index]["start"]

    # every remaining word was merged into the current one;
    # fall back to the current word's start, as in the last-word case above
    return word_timestamps[current_word_index]["start"]


def filter_missing_timestamps(word_timestamps):
    # handle the first and last word
    if word_timestamps[0].get("start") is None:
        word_timestamps[0]["start"] = 0
        word_timestamps[0]["end"] = _get_next_start_timestamp(word_timestamps, 0)

    result = [
        word_timestamps[0],
    ]

    for i, ws in enumerate(word_timestamps[1:], start=1):
        # if ws doesn't have a start and end
        # use the previous end as start and next start as end
        if ws.get("start") is None and ws.get("word") is not None:
            ws["start"] = word_timestamps[i - 1]["end"]
            ws["end"] = _get_next_start_timestamp(word_timestamps, i)

        if ws["word"] is not None:
            result.append(ws)
    return result


def cleanup(path: str):
    """path could either be relative or absolute."""
    # check if file or directory exists
    if os.path.isfile(path) or os.path.islink(path):
        # remove file
        os.remove(path)
    elif os.path.isdir(path):
        # remove directory and all its content
        shutil.rmtree(path)
    else:
        raise ValueError("Path {} is not a file or dir.".format(path))


def process_language_arg(language: str, model_name: str):
    """
    Process the language argument to make sure it's valid and convert language names to language codes.
    """
    # language=None means "autodetect" and is passed through unchanged
    if language is not None:
        language = language.lower()
        if language not in LANGUAGES:
            if language in TO_LANGUAGE_CODE:
                language = TO_LANGUAGE_CODE[language]
            else:
                raise ValueError(f"Unsupported language: {language}")

    if model_name.endswith(".en") and language != "en":
        if language is not None:
            logging.warning(
                f"{model_name} is an English-only model but received '{language}'; using English instead."
            )
        language = "en"
    return language


def transcribe(
    audio_file: str,
    language: str,
    model_name: str,
    compute_dtype: str,
    suppress_numerals: bool,
    device: str,
):
    from faster_whisper import WhisperModel
    # find_numeral_symbol_tokens and wav2vec2_langs are defined earlier in this cell,
    # so no import from a separate helpers module is needed in the notebook

    # Faster Whisper non-batched
    # Run on GPU with FP16
    whisper_model = WhisperModel(model_name, device=device, compute_type=compute_dtype)

    # or run on GPU with INT8
    # model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
    # or run on CPU with INT8
    # model = WhisperModel(model_size, device="cpu", compute_type="int8")

    if suppress_numerals:
        numeral_symbol_tokens = find_numeral_symbol_tokens(whisper_model.hf_tokenizer)
    else:
        numeral_symbol_tokens = None

    if language is not None and language in wav2vec2_langs:
        word_timestamps = False
    else:
        word_timestamps = True

    segments, info = whisper_model.transcribe(
        audio_file,
        language=language,
        beam_size=5,
        word_timestamps=word_timestamps,  # False when wav2vec2 can align this language (see above)
        suppress_tokens=numeral_symbol_tokens,
        vad_filter=True,
    )
    whisper_results = []
    for segment in segments:
        whisper_results.append(segment._asdict())
    # clear gpu vram
    del whisper_model
    torch.cuda.empty_cache()
    return whisper_results, language


def transcribe_batched(
    audio_file: str,
    language: str,
    batch_size: int,
    model_name: str,
    compute_dtype: str,
    suppress_numerals: bool,
    device: str,
):
    import whisperx

    # Faster Whisper batched
    whisper_model = whisperx.load_model(
        model_name,
        device,
        compute_type=compute_dtype,
        asr_options={"suppress_numerals": suppress_numerals},
    )
    audio = whisperx.load_audio(audio_file)
    result = whisper_model.transcribe(audio, language=language, batch_size=batch_size)
    del whisper_model
    torch.cuda.empty_cache()
    return result["segments"], result["language"]
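
The punctuation-based realignment implemented by `get_realigned_ws_mapping_with_punctuation` can be hard to follow in full, so here is a simplified sketch of the underlying idea: within one sentence, every word takes the speaker label that occurs most often in that sentence. The function name `realign_by_sentence` and the toy data are hypothetical. Unlike the helper above, this sketch relabels every sentence, ignores the maximum sentence length, and does not require the winning speaker to hold a minimum share of the words.

```python
from collections import Counter

SENTENCE_END = ".?!"


def realign_by_sentence(words, speakers):
    """Give every word in a sentence the speaker label most frequent in that sentence.

    words:    list of non-empty word strings with punctuation attached
    speakers: list of speaker ids, one per word
    """
    realigned = list(speakers)
    start = 0
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word[-1] in SENTENCE_END or is_last:
            # the sentence spans words[start : i + 1]; pick its majority speaker
            majority, _ = Counter(speakers[start : i + 1]).most_common(1)[0]
            realigned[start : i + 1] = [majority] * (i + 1 - start)
            start = i + 1
    return realigned


# toy data (hypothetical): the diarizer switched speakers one word too early
words = ["How", "are", "you?", "Fine,", "thanks."]
speakers = [0, 0, 1, 1, 1]
print(realign_by_sentence(words, speakers))
```

In the toy input, the word "you?" is moved back to speaker 0 because the other two words of its sentence belong to speaker 0.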

In [None]:
# strip accents and punctuation, replace spaces with underscores, lower-case
def cleanString(string):
    cleaned = unidecode(string)
    cleaned = re.sub(r'[^\w\s]', '', cleaned)
    cleaned = cleaned.replace(" ", "_")
    return cleaned.lower()

# rename the audio file so its name has no accents or spaces and is lower case
def rename_file(filepath):
    path = Path(filepath)
    # clean only the file name (stem); keep the parent directory and the suffix
    new_path = path.with_name(cleanString(path.stem) + path.suffix)
    os.rename(path, new_path)
    return str(new_path)
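
To see what the file name normalization does without touching any file, the following sketch applies the same steps to a string. It is an illustration under stated assumptions: it uses the standard library's `unicodedata` in place of `unidecode` (so it strips accents but, unlike `unidecode`, does not transliterate non-Latin scripts), and the name `normalize_name` and the sample file name are hypothetical.

```python
import re
import unicodedata
from pathlib import Path


def normalize_name(filename):
    """Return the file name with accents and punctuation removed, spaces as '_', lower-cased."""
    path = Path(filename)
    # decompose accented characters, then drop the combining marks
    stem = unicodedata.normalize("NFKD", path.stem)
    stem = "".join(ch for ch in stem if not unicodedata.combining(ch))
    stem = re.sub(r"[^\w\s]", "", stem)  # remove punctuation
    stem = stem.replace(" ", "_")        # spaces become underscores
    return stem.lower() + path.suffix    # the suffix is kept unchanged


print(normalize_name("R\u00e9union d'\u00e9quipe 2023.WAV"))
```

As in `rename_file`, only the stem is cleaned; the extension keeps its original case.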

### Setup options

In [None]:
# Name of the audio file
audio_path = "00bcae999c09402cb74a20549a188040_20230429t15_03_utc.wav"
# audio_path = "audio1.wav"
# audio_path = "audio2.wav"

# rename audio filename if necessary to get string without accent, space, in lower case
audio_path = rename_file(audio_path)

FileNotFoundError: [Errno 2] No such file or directory: '00bcae999c09402cb74a20549a188040_20230429t15_03_utc.wav' -> '00bcae999c09402cb74a20549a188040_20230429t15_03_utc.wav'

In [None]:
# (Option) Whether to enable music removal from speech; helps increase diarization quality but uses a lot of RAM
enable_stemming = False

# (choose from 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large-v3', 'large')
# whisper_model_name = "large-v2"
whisper_model_name = "large-v3"

# replaces numerical digits with their pronunciation, increases diarization accuracy
suppress_numerals = True

batch_size = 8

language = None  # autodetect language

device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
# check device
device


'cuda'

# Processing

## (Option) Separating music from speech using Demucs

---

By isolating the vocals from the rest of the audio, it becomes easier to identify and track individual speakers based on the spectral and temporal characteristics of their speech signals. Source separation is just one of many techniques that can be used as a preprocessing step to help improve the accuracy and reliability of the overall diarization process.
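
As a hedged sketch of how the separation step can be invoked, the snippet below builds the same Demucs command as the next cell, but as an argument list, which avoids shell-quoting problems with unusual file names. The helper names `demucs_command` and `expected_vocals_path` are hypothetical; the flags and the output layout are the ones used in this notebook. Nothing is executed unless the last, commented line is enabled and Demucs is installed.

```python
import os
import sys


def demucs_command(audio_path, out_dir="temp_outputs", model="htdemucs"):
    """Build the argument list for a two-stem (vocals vs. the rest) Demucs run."""
    return [
        sys.executable, "-m", "demucs.separate",
        "-n", model,
        "--two-stems=vocals",
        audio_path,
        "-o", out_dir,
    ]


def expected_vocals_path(audio_path, out_dir="temp_outputs", model="htdemucs"):
    """Path where this notebook looks for the separated vocals."""
    track = os.path.splitext(os.path.basename(audio_path))[0]
    return os.path.join(out_dir, model, track, "vocals.wav")


cmd = demucs_command("audio1.wav")
print(cmd[1:])
print(expected_vocals_path("audio1.wav"))
# to actually run the separation (requires demucs to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```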

In [None]:
%%time
if enable_stemming:
    # Isolate vocals from the rest of the audio

    return_code = os.system(
        f'python3 -m demucs.separate -n htdemucs --two-stems=vocals "{audio_path}" -o "temp_outputs"'
    )

    if return_code != 0:
        logging.warning("Source splitting failed, using original audio file.")
        vocal_target = audio_path
    else:
        vocal_target = os.path.join(
            "temp_outputs",
            "htdemucs",
            os.path.splitext(os.path.basename(audio_path))[0],
            "vocals.wav",
        )
else:
    vocal_target = audio_path

CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 8.82 µs


## Transcribing audio using Whisper and realigning timestamps using Wav2Vec2
---
This code uses two different open-source models to transcribe speech and perform forced alignment on the resulting transcription.

The first model is OpenAI Whisper, a speech recognition model that transcribes speech with high accuracy. The code loads the Whisper model and uses it to transcribe the `vocal_target` file.

The output of the transcription process is a set of text segments with corresponding timestamps indicating when each segment was spoken.
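
To make the shape of that output concrete, here is a small sketch that prints segments with their timestamps. It assumes each segment is a dictionary with `start` and `end` (in seconds) and `text` keys; the function name `format_segments` and the sample values are hypothetical.

```python
def format_segments(segments):
    """Render segments as '[start -> end] text' lines (times in seconds)."""
    lines = []
    for seg in segments:
        lines.append(
            f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text'].strip()}"
        )
    return lines


# toy segments (hypothetical values) with start/end/text keys
segments = [
    {"start": 0.0, "end": 2.4, "text": " Hello, how are you?"},
    {"start": 2.9, "end": 4.1, "text": " Fine, thanks."},
]
for line in format_segments(segments):
    print(line)
```

At this stage the timestamps are per segment, not per word; word-level timing comes from the alignment step.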


In [None]:
%%time
if device == "cuda":
    compute_type = "float16"
    # or run on GPU with INT8
    # compute_type = "int8_float16"
else:
    # run on CPU with INT8
    compute_type = "int8"

if batch_size != 0:
    whisper_results, language = transcribe_batched(
        vocal_target,
        language,
        batch_size,
        whisper_model_name,
        compute_type,
        suppress_numerals,
        device,
    )
else:
    whisper_results, language = transcribe(
        vocal_target,
        language,
        whisper_model_name,
        compute_type,
        suppress_numerals,
        device,
    )

No language specified, language will be first be detected for each audio file (increases inference time).


100%|█████████████████████████████████████| 16.9M/16.9M [00:01<00:00, 10.3MiB/s]
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Detected language: es (0.99) in first 30s of audio...
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 9012, 9125, 9356, 9413, 9562, 9657, 9714, 9754, 10076,

## Aligning the transcription with the original audio using Wav2Vec2
---
The second model is wav2vec2, a large-scale neural network that learns speech representations useful for a variety of speech processing tasks, including speech recognition and alignment.

The code loads the wav2vec2 alignment model and uses it to align the transcription segments with the original audio signal contained in the vocal_target file. This process involves finding the exact timestamps in the audio signal where each segment was spoken and aligning the text accordingly.

By combining the outputs of the two models, the code produces a fully aligned transcription of the speech contained in the vocal_target file. This aligned transcription can be useful for a variety of speech processing tasks, such as speaker diarization, sentiment analysis, and language identification.

If no Wav2Vec2 model is available for your language, the word timestamps generated by Whisper are used instead.
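
Alignment can leave some words without timestamps, which is why the notebook post-processes the aligned words with `filter_missing_timestamps`. The sketch below shows a simpler variant of that idea and is not the notebook's method: instead of merging untimed words into their neighbour, it gives each untimed word the previous word's end and the next timed word's start. The name `fill_missing_times` and the toy data are hypothetical.

```python
def fill_missing_times(words):
    """Give untimed words the previous word's end and the next timed word's start."""
    filled = []
    for i, w in enumerate(words):
        w = dict(w)  # copy so the input list is left untouched
        if w.get("start") is None:
            prev_end = filled[-1]["end"] if filled else 0.0
            # look ahead for the next word that has a start time;
            # if there is none, the word gets zero duration
            next_start = next(
                (n["start"] for n in words[i + 1:] if n.get("start") is not None),
                prev_end,
            )
            w["start"], w["end"] = prev_end, next_start
        filled.append(w)
    return filled


# toy data (hypothetical): the middle word came back without timestamps
words = [
    {"word": "In", "start": 0.0, "end": 0.2},
    {"word": "2014,"},
    {"word": "we", "start": 0.9, "end": 1.0},
]
print(fill_missing_times(words))
```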

In [None]:
%%time
if language in wav2vec2_langs:
    alignment_model, metadata = whisperx.load_align_model(
        language_code=language, device=device
    )
    result_aligned = whisperx.align(
        whisper_results, alignment_model, metadata, vocal_target, device
    )
    word_timestamps = filter_missing_timestamps(result_aligned["word_segments"])

    # clear gpu vram
    del alignment_model
    torch.cuda.empty_cache()
else:
    assert batch_size == 0, (  # TODO: add a better check for word timestamps existence
        f"Unsupported language: {language}. Set batch_size = 0 in the options cell"
        " to generate word timestamps using Whisper directly and fix this error."
    )
    word_timestamps = []
    for segment in whisper_results:
        for word in segment["words"]:
            word_timestamps.append({"word": word[2], "start": word[0], "end": word[1]})

Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_es.pt" to /root/.cache/torch/hub/checkpoints/wav2vec2_voxpopuli_base_10k_asr_es.pt
100%|██████████| 360M/360M [00:02<00:00, 143MB/s]


CPU times: user 8.28 s, sys: 1.81 s, total: 10.1 s
Wall time: 11.9 s


## Convert audio to mono for NeMo compatibility

In [None]:
%%time
sound = AudioSegment.from_file(vocal_target).set_channels(1)
ROOT = os.getcwd()
temp_path = os.path.join(ROOT, "temp_outputs")
os.makedirs(temp_path, exist_ok=True)
sound.export(os.path.join(temp_path, "mono_file.wav"), format="wav")

CPU times: user 33.8 ms, sys: 15.6 ms, total: 49.4 ms
Wall time: 96.7 ms


<_io.BufferedRandom name='/content/temp_outputs/mono_file.wav'>

## Speaker Diarization using NeMo MSDD Model
---
This code uses NVIDIA NeMo's MSDD (Multi-scale Diarization Decoder) model to perform speaker diarization on the audio signal. Speaker diarization is the process of splitting an audio signal into segments according to who is speaking at any given time.
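
Diarization results are commonly exchanged as RTTM files, in which each `SPEAKER` line carries an onset, a duration, and a speaker label. As a sketch of how such output becomes the `[start, end, speaker]` turns consumed by the helper functions, the snippet below parses RTTM-style lines. It assumes the standard RTTM field order; the function name `parse_rttm` and the sample lines are hypothetical, and where this run writes its predictions is not shown in this section.

```python
def parse_rttm(lines):
    """Turn RTTM 'SPEAKER' lines into [start_ms, end_ms, speaker_label] turns."""
    turns = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        # RTTM fields: type, file id, channel, onset (s), duration (s),
        # two placeholders, then the speaker name at index 7
        start_ms = int(float(fields[3]) * 1000)
        end_ms = start_ms + int(float(fields[4]) * 1000)
        turns.append([start_ms, end_ms, fields[7]])
    return turns


# toy RTTM content (hypothetical values)
rttm = [
    "SPEAKER mono_file 1 0.500 2.250 <NA> <NA> speaker_0 <NA> <NA>",
    "SPEAKER mono_file 1 3.000 1.500 <NA> <NA> speaker_1 <NA> <NA>",
]
print(parse_rttm(rttm))
```

Times are converted to milliseconds because the word-to-speaker helpers in this notebook compare word times in milliseconds.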

In [None]:
%%time
# Initialize NeMo MSDD diarization model
# DOMAIN_TYPE: can be meeting, telephonic, or general based on domain type of the audio file
msdd_model = NeuralDiarizer(cfg=create_config(temp_path, DOMAIN_TYPE="telephonic")).to(device)
msdd_model.diarize()

del msdd_model
torch.cuda.empty_cache()

[NeMo I 2024-08-17 16:33:38 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-08-17 16:33:38 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/diar_msdd_telephonic/versions/1.0.1/files/diar_msdd_telephonic.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-08-17 16:33:39 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-17 16:33:41 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-17 16:33:41 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-17 16:33:41 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-17 16:33:41 features:289] PADDING: 16
[NeMo I 2024-08-17 16:33:41 features:289] PADDING: 16
[NeMo I 2024-08-17 16:33:43 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-17 16:33:43 features:289] PADDING: 16
[NeMo I 2024-08-17 16:33:44 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-17 16:33:44 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/vad_multilingual_marblenet/versions/1.10.0/files/vad_multilingual_marblenet.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-17 16:33:44 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-17 16:33:45 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-17 16:33:45 features:289] PADDING: 16
[NeMo I 2024-08-17 16:33:45 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-17 16:33:45 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-17 16:33:45 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-08-17 16:33:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-17 16:33:45 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:17<00:00, 17.68s/it]

[NeMo I 2024-08-17 16:34:03 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-17 16:34:03 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 16:34:03 collections:302] Dataset loaded with 8 items, total duration of  0.10 hours.
[NeMo I 2024-08-17 16:34:03 collections:304] # 8 files loaded accounting to # 1 labels



vad: 100%|██████████| 8/8 [00:02<00:00,  3.43it/s]

[NeMo I 2024-08-17 16:34:05 clustering_diarizer:250] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-08-17 16:34:09 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.26it/s]

[NeMo I 2024-08-17 16:34:10 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-17 16:34:10 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 16:34:10 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 16:34:10 collections:302] Dataset loaded with 349 items, total duration of  0.14 hours.
[NeMo I 2024-08-17 16:34:10 collections:304] # 349 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 6/6 [00:01<00:00,  5.62it/s]

[NeMo I 2024-08-17 16:34:11 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-17 16:34:11 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-17 16:34:11 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 16:34:11 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 16:34:11 collections:302] Dataset loaded with 421 items, total duration of  0.14 hours.
[NeMo I 2024-08-17 16:34:11 collections:304] # 421 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 7/7 [00:01<00:00,  5.94it/s]

[NeMo I 2024-08-17 16:34:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-17 16:34:12 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-17 16:34:12 clustering_diarizer:343] Extracting embeddings for Diarization





[NeMo I 2024-08-17 16:34:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 16:34:12 collections:302] Dataset loaded with 529 items, total duration of  0.14 hours.
[NeMo I 2024-08-17 16:34:12 collections:304] # 529 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  6.20it/s]

[NeMo I 2024-08-17 16:34:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-17 16:34:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json





[NeMo I 2024-08-17 16:34:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 16:34:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 16:34:14 collections:302] Dataset loaded with 715 items, total duration of  0.15 hours.
[NeMo I 2024-08-17 16:34:14 collections:304] # 715 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 12/12 [00:01<00:00,  6.11it/s]

[NeMo I 2024-08-17 16:34:16 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-17 16:34:16 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json





[NeMo I 2024-08-17 16:34:16 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-17 16:34:16 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-17 16:34:16 collections:302] Dataset loaded with 1087 items, total duration of  0.15 hours.
[NeMo I 2024-08-17 16:34:16 collections:304] # 1087 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 17/17 [00:02<00:00,  8.11it/s]


[NeMo I 2024-08-17 16:34:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:01<00:00,  1.49s/it]

[NeMo I 2024-08-17 16:34:19 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-17 16:34:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-17 16:34:19 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-17 16:34:19 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-17 16:34:19 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-17 16:34:19 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-17 16:34:19 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-17 16:34:19 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 10.85it/s]

[NeMo I 2024-08-17 16:34:20 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-17 16:34:20 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-17 16:34:20 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-17 16:34:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-17 16:34:20 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-17 16:34:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-17 16:34:20 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-17 16:34:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-17 16:34:20 msdd_models:1431]   
    
CPU times: user 36.5 s, sys: 1.48 s, total: 38 s
Wall time: 42.1 s


## Mapping Speakers to Sentences According to Timestamps

In [None]:
%%time
# Reading timestamps <> Speaker Labels mapping

speaker_ts = []
with open(os.path.join(temp_path, "pred_rttms", "mono_file.rttm"), "r") as f:
    lines = f.readlines()
    for line in lines:
        line_list = line.split(" ")
        s = int(float(line_list[5]) * 1000)
        e = s + int(float(line_list[8]) * 1000)
        speaker_ts.append([s, e, int(line_list[11].split("_")[-1])])

wsm = get_words_speaker_mapping(word_timestamps, speaker_ts, "start")

CPU times: user 3.91 ms, sys: 934 µs, total: 4.84 ms
Wall time: 4.71 ms
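
The cell above indexes fields 5, 8, and 11 after `split(" ")`, which relies on the exact spacing of the RTTM file written by the diarizer. As a minimal sketch, assuming the standard RTTM field order (type, file, channel, onset, duration, two placeholders, speaker name), the same values could be read with a whitespace-agnostic split. The function name `parse_rttm_line` and the example line are hypothetical, and `round` is used here instead of the `int` truncation in the notebook.

```python
def parse_rttm_line(line):
    """Return [start_ms, end_ms, speaker_id] for one RTTM SPEAKER line.

    Assumes the standard RTTM column order:
    TYPE FILE CHANNEL ONSET DURATION <NA> <NA> SPEAKER_NAME <NA> <NA>
    """
    fields = line.split()  # split on any run of whitespace
    start_ms = round(float(fields[3]) * 1000)            # onset: seconds -> ms
    end_ms = start_ms + round(float(fields[4]) * 1000)   # onset + duration
    speaker_id = int(fields[7].split("_")[-1])           # "speaker_0" -> 0
    return [start_ms, end_ms, speaker_id]


# Hypothetical line, not taken from the notebook's output
example = "SPEAKER mono_file 1 1.500 0.750 <NA> <NA> speaker_0 <NA> <NA>"
print(parse_rttm_line(example))
```

Because the split ignores repeated spaces, the field positions stay the same however the file is padded.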


In [None]:
print(wsm)

[{'word': 'Buen', 'start_time': 1450, 'end_time': 1590, 'speaker': 0}, {'word': 'día,', 'start_time': 1630, 'end_time': 1730, 'speaker': 0}, {'word': 'le', 'start_time': 1790, 'end_time': 1890, 'speaker': 0}, {'word': 'habla', 'start_time': 1970, 'end_time': 2230, 'speaker': 0}, {'word': 'Vanessa.', 'start_time': 5051, 'end_time': 5191, 'speaker': 1}, {'word': '¿En', 'start_time': 5211, 'end_time': 5291, 'speaker': 1}, {'word': 'qué', 'start_time': 5351, 'end_time': 5471, 'speaker': 1}, {'word': 'le', 'start_time': 5511, 'end_time': 5571, 'speaker': 1}, {'word': 'puedo', 'start_time': 5631, 'end_time': 5851, 'speaker': 1}, {'word': 'ayudar?', 'start_time': 5871, 'end_time': 6191, 'speaker': 1}, {'word': 'Hola,', 'start_time': 6211, 'end_time': 6792, 'speaker': 1}, {'word': 'buenos', 'start_time': 7052, 'end_time': 7232, 'speaker': 1}, {'word': 'días.', 'start_time': 7252, 'end_time': 7552, 'speaker': 1}, {'word': 'Habla', 'start_time': 7572, 'end_time': 7852, 'speaker': 1}, {'word': 'J
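
To illustrate how a list like the one printed above can be produced, here is a simplified sketch of timestamp-based word-to-speaker assignment. It is not the implementation of `get_words_speaker_mapping`; the function `map_words_to_speakers` is hypothetical and assumes that word times are in seconds, that speaker turns are `[start_ms, end_ms, speaker_id]` lists sorted by start time, and that each word is anchored on its start time.

```python
def map_words_to_speakers(words, speaker_turns):
    """Assign each word the speaker of the turn that covers its start time.

    words: list of dicts with 'word', 'start', 'end' (seconds).
    speaker_turns: list of [start_ms, end_ms, speaker_id], sorted by start.
    """
    mapping = []
    turn_idx = 0
    for w in words:
        start_ms = int(w["start"] * 1000)
        end_ms = int(w["end"] * 1000)
        # Move to the next turn while the word starts after the current turn ends
        while turn_idx < len(speaker_turns) - 1 and start_ms > speaker_turns[turn_idx][1]:
            turn_idx += 1
        mapping.append(
            {
                "word": w["word"],
                "start_time": start_ms,
                "end_time": end_ms,
                "speaker": speaker_turns[turn_idx][2],
            }
        )
    return mapping


# Hypothetical inputs for illustration only
demo_words = [
    {"word": "Hola,", "start": 1.0, "end": 1.5},
    {"word": "gracias.", "start": 3.0, "end": 3.5},
]
demo_turns = [[0, 2000, 0], [2500, 4000, 1]]
print(map_words_to_speakers(demo_words, demo_turns))
```

Both lists are walked once in time order, so the cost grows linearly with the number of words and turns.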

## Realigning Speech Segments using Punctuation
---

This code provides a method for disambiguating speaker labels in cases where a sentence is split between two different speakers. It uses punctuation markings to determine the dominant speaker for each sentence in the transcription.

```
Speaker A: It's got to come from somewhere else. Yeah, that one's also fun because you know the lows are
Speaker B: going to suck, right? So it's actually it hits you on both sides.
```

For example, if a sentence is split between two speakers, the code takes the mode of the speaker labels over the words in that sentence and assigns that label to the whole sentence. This can improve the accuracy of speaker diarization, especially where the Whisper model does not transcribe short utterances such as "hmm" and "yeah" but the diarization model (NeMo) does include them, which leads to inconsistent results.

The code also handles cases where one speaker is giving a monologue while other speakers make occasional comments in the background. It ignores the comments and assigns the entire monologue to the speaker who is speaking the majority of the time. Speech segments are thereby realigned to their respective speakers based on the punctuation in the transcription.
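
The majority-vote idea described above can be sketched as follows. This is a simplified illustration, not the implementation of `get_realigned_ws_mapping_with_punctuation`: the function name `realign_by_sentence` is hypothetical, sentences are assumed to end at any word whose last character is `.`, `?`, or `!`, and the handling of long monologues is left out.

```python
from collections import Counter


def realign_by_sentence(wsm, ending_puncts=".?!"):
    """Give every word in a sentence the most frequent speaker label of that sentence."""
    realigned = []
    sentence = []

    def flush():
        # Mode of the speaker labels within the current sentence
        dominant = Counter(w["speaker"] for w in sentence).most_common(1)[0][0]
        realigned.extend({**w, "speaker": dominant} for w in sentence)

    for entry in wsm:
        sentence.append(entry)
        if entry["word"] and entry["word"][-1] in ending_puncts:
            flush()
            sentence = []
    if sentence:  # trailing words without ending punctuation
        flush()
    return realigned


# Hypothetical word-speaker mapping: one sentence split across two speakers
demo_wsm = [
    {"word": "Yeah,", "speaker": 0},
    {"word": "the", "speaker": 0},
    {"word": "lows", "speaker": 0},
    {"word": "are", "speaker": 0},
    {"word": "going", "speaker": 1},
    {"word": "to", "speaker": 1},
    {"word": "suck.", "speaker": 1},
    {"word": "Right?", "speaker": 1},
]
print([w["speaker"] for w in realign_by_sentence(demo_wsm)])
```

In the demo, the first sentence has four words labelled speaker 0 and three labelled speaker 1, so the whole sentence is assigned to speaker 0.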

In [None]:
%%time
if language in punct_model_langs:
    # restoring punctuation in the transcript to help realign the sentences
    punct_model = PunctuationModel(model="kredor/punctuate-all")

    words_list = list(map(lambda x: x["word"], wsm))

    labeled_words = punct_model.predict(words_list)

    ending_puncts = ".?!"
    model_puncts = ".,;:!?"

    # We don't want to punctuate U.S.A. with a period. Right?
    is_acronym = lambda x: re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", x)

    for word_dict, labeled_tuple in zip(wsm, labeled_words):
        word = word_dict["word"]
        if (
            word
            and labeled_tuple[1] in ending_puncts
            and (word[-1] not in model_puncts or is_acronym(word))
        ):
            word += labeled_tuple[1]
            if word.endswith(".."):
                word = word.rstrip(".")
            word_dict["word"] = word

else:
    logging.warning(
        f"Punctuation restoration is not available for {language} language. Using the original punctuation."
    )

wsm = get_realigned_ws_mapping_with_punctuation(wsm)
ssm = get_sentences_speaker_mapping(wsm, speaker_ts)

config.json:   0%|          | 0.00/914 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/447 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

CPU times: user 7.41 s, sys: 4.8 s, total: 12.2 s
Wall time: 20.1 s


In [None]:
print(ssm)

[{'speaker': 'Speaker 0', 'start_time': 0, 'end_time': 5191, 'text': 'Buen día, le habla Vanessa. '}, {'speaker': 'Speaker 1', 'start_time': 5211, 'end_time': 6191, 'text': '¿En qué le puedo ayudar? '}, {'speaker': 'Speaker 1', 'start_time': 6211, 'end_time': 7552, 'text': 'Hola, buenos días. '}, {'speaker': 'Speaker 1', 'start_time': 7572, 'end_time': 8512, 'text': 'Habla Joan Camilo. '}, {'speaker': 'Speaker 1', 'start_time': 9293, 'end_time': 10613, 'text': 'Perdona la bulla. '}, {'speaker': 'Speaker 1', 'start_time': 11934, 'end_time': 16236, 'text': 'Lo que pasa es que a mí me interesaría ingresar a estudiar enfermería. '}, {'speaker': 'Speaker 0', 'start_time': 18256, 'end_time': 20057, 'text': 'Dime, ya terminaste. '}, {'speaker': 'Speaker 0', 'start_time': 20277, 'end_time': 21978, 'text': 'Dime. '}, {'speaker': 'Speaker 1', 'start_time': 21998, 'end_time': 22258, 'text': 'Perdón. '}, {'speaker': 'Speaker 1', 'start_time': 22691, 'end_time': 37058, 'text': 'Este, me gustaría em

## Cleanup and downloading the results

In [None]:
%%time
path_textfile_with_speakers = f"{os.path.splitext(audio_path)[0]}.txt"
path_srtfile_with_speakers = f"{os.path.splitext(audio_path)[0]}.srt"

with open(path_textfile_with_speakers, "w", encoding="utf-8-sig") as f:
    get_speaker_aware_transcript(ssm, f)

with open(path_srtfile_with_speakers, "w", encoding="utf-8-sig") as srt:
    write_srt(ssm, srt)

cleanup(temp_path)

CPU times: user 3.6 ms, sys: 2.66 ms, total: 6.26 ms
Wall time: 9.82 ms
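
For readers unfamiliar with the `.srt` format written above, the sketch below shows how one sentence entry with millisecond times could be rendered as a SubRip block. It is not the `write_srt` helper used in the notebook; the function names `ms_to_srt_timestamp` and `srt_block` are hypothetical, and the entry layout is assumed to match the `ssm` items printed earlier.

```python
def ms_to_srt_timestamp(ms):
    """Format milliseconds as the SubRip timestamp HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_block(index, segment):
    """Render one subtitle block: counter, time range, then 'speaker: text'."""
    start = ms_to_srt_timestamp(segment["start_time"])
    end = ms_to_srt_timestamp(segment["end_time"])
    text = segment["text"].strip()
    return f"{index}\n{start} --> {end}\n{segment['speaker']}: {text}\n"


# Entry shaped like the items of `ssm`
demo_segment = {
    "speaker": "Speaker 1",
    "start_time": 5211,
    "end_time": 6191,
    "text": "¿En qué le puedo ayudar? ",
}
print(srt_block(1, demo_segment))
```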


In [None]:
# cleanup text file with speakers
with open(path_textfile_with_speakers, "r", encoding="utf-8-sig") as f:
    lines = f.readlines()
    lines = [re.sub(' +', ' ', line.strip("\ufeff").strip()) for line in lines if line != "\n"]

with open(path_textfile_with_speakers, "w", encoding="utf-8-sig") as f:
    for i,line in enumerate(lines):
        if i < len(lines) - 1: f.write(f"{line}\n\n")
        else: f.write(f"{line}")

### Download the files

In [None]:
# download files
from google.colab import files
files.download(path_textfile_with_speakers)
files.download(path_srtfile_with_speakers)


### SETUP BUNCH

## LIST FILES

In [None]:
import os
import glob

# # List all files in the root folder
# files = os.listdir('/')
# print(files)

# List all files in a specific folder, for example '/content'
files = os.listdir('/content')
# print(files)

wav_files = [file for file in files if file.endswith('.wav')]
print(wav_files)

# # Use glob to list all files with a specific extension in a folder
# files = glob.glob('/content/*.csv')  # Change the extension as needed
# print(files)


['a32ef11a-6884-41e7-928b-cf63a386514b_20230506T15_14_UTC.wav', 'ec8aa41a-dfd3-4a1c-b2ea-edf54a679155_20230422T16_33_UTC.wav', '1841240f-f1ab-4df6-96c7-6987b97d4068_20230411T16_35_UTC.wav', '8524c81b-ba37-4acc-a3b5-d1303fab469d_20230603T15_45_UTC.wav', '2423c4d4-01d1-424a-b384-3e5ddda79e7a_20230610T16_50_UTC.wav', 'e0913f30-a778-47f8-9585-f9219bb14dd5_20230610T16_29_UTC.wav', 'ae031a33-d308-40e9-aab5-fd43bf33b425_20230513T15_47_UTC.wav', '1c602ea2-8bef-4b22-9990-ef7ae3bdede8_20230415T14_53_UTC.wav', '5439e31b-4b86-4077-b1e1-4e1fd161a670_20230513T13_55_UTC.wav', '77dc4488-e978-4884-82aa-79d5b5197056_20230411T13_49_UTC.wav', 'f7e4b691-c321-4c6f-b91f-83888f96589f_20230422T14_39_UTC.wav', 'd15a221c-5dce-4a34-ac2d-8bda6ed71ebd_20230610T13_57_UTC.wav', 'ece40026-e149-425b-8c37-f0eb1f33deb4_20230513T15_41_UTC.wav', '2dea44d0-a836-446f-adf6-677088ca215c_20230411T14_37_UTC.wav', '00bcae99-9c09-402c-b74a-20549a188040_20230429T15_03_UTC.wav', 'dee55130-9aee-45f7-981f-694edfbcba81_20230506T15_21_U

In [None]:
from pathlib import Path

# def rename_file(filepath):
filepath = "00bcae99-9c09-402c-b74a-20549a188040_20230429T15_03_UTC.wav"

filepath = filepath.replace('-','')
suffix = Path(filepath).suffix

# if str(Path(filepath).parent) != ".":
#     new_filepath = str(Path(filepath).parent) + cleanString(filepath.replace(suffix, "")) + suffix
# else:
#     new_filepath = cleanString(filepath.replace(suffix, "")) + suffix
# os.rename(filepath, new_filepath)
# return new_filepath

# print(rename_file("00bcae99-9c09-402c-b74a-20549a188040_20230429T15_03_UTC.wav"))

.


In [None]:
def isolate_string(audio_path):
    if enable_stemming:
        # Isolate vocals from the rest of the audio
        return_code = os.system(
            f'python3 -m demucs.separate -n htdemucs --two-stems=vocals "{audio_path}" -o "temp_outputs"'
        )

        if return_code != 0:
            logging.warning("Source splitting failed, using original audio file.")
            vocal_target = audio_path
        else:
            vocal_target = os.path.join(
                "temp_outputs",
                "htdemucs",
                os.path.splitext(os.path.basename(audio_path))[0],
                "vocals.wav",
            )
    else:
        vocal_target = audio_path

    # Return on every path, not only when stemming is disabled
    return vocal_target

### SAMPLE RUN

In [None]:
device = "cuda"
vocal_target = audio_path
language = None
whisper_model_name = "large-v3"
# replaces numerical digits with their pronunciation, increases diarization accuracy
suppress_numerals = True
batch_size = 8

if device == "cuda":
    compute_type = "float16"
    # or run on GPU with INT8
    # compute_type = "int8_float16"
else:
    # or run on CPU with INT8
    compute_type = "int8"


whisper_results, language = transcribe_batched(
    vocal_target,
    language,
    batch_size,
    whisper_model_name,
    compute_type,
    suppress_numerals,
    device,
)

model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

No language specified, language will be first be detected for each audio file (increases inference time).


100%|█████████████████████████████████████| 16.9M/16.9M [00:02<00:00, 8.07MiB/s]
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Detected language: es (0.99) in first 30s of audio...
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 9012, 9125, 9356, 9413, 9562, 9657, 9714, 9754, 10076,

### WHISPER RESULT

In [None]:
print(whisper_results)

[{'text': ' Aló. Buenos días, me comunico con Rafael Dávalo. ¿Y con él? Le habla Angie Valente, de la Universidad Javeriana de Cali. ¿Cómo se encuentra? Muy bien, gracias a Dios, usted. Muy bien, gracias a Dios. Nos estamos comunicando, señor Rafael, por su interés en el seminario en Business Analysis con Power BI. Ajá, sí, señora. Bueno, usted se encuentra en la ciudad de Cali.', 'start': 0.623, 'end': 29.599}, {'text': ' Sí, señora. Este seminario se dictará de manera presencial en el campus universitario. Iniciaría el veintidós de junio y finalizaría el treinta y uno de julio. ¿Qué tenés a la pregunta? ¿Este es el curso de Python? No, señor. Seminario en Business y análisis de Big Data con Power BI. Ah, ok, ok. Sí, sí, ya recordé. Este seminario es gratuito, ¿cierto?', 'start': 30.503, 'end': 58.729}, {'text': ' Se estaba, digamos que se estaba ofertando de esa manera, pero pues con un convenio que tenía el ICT con el Ministerio de las TIC, pero esa convocatoria ya finalizó. Ahorita

## AUDIO ALIGNMENT WITH WAV2VEC2

In [None]:
%%time
if language in wav2vec2_langs:
    device = "cuda"
    alignment_model, metadata = whisperx.load_align_model(
        language_code=language, device=device
    )
    result_aligned = whisperx.align(
        whisper_results, alignment_model, metadata, vocal_target, device
    )
    word_timestamps = filter_missing_timestamps(result_aligned["word_segments"])

    # clear gpu vram
    del alignment_model
    torch.cuda.empty_cache()
else:
    assert batch_size == 0, (  # TODO: add a better check for word timestamps existence
        f"Unsupported language: {language}, use --batch_size to 0"
        " to generate word timestamps using whisper directly and fix this error."
    )
    word_timestamps = []
    for segment in whisper_results:
        for word in segment["words"]:
            word_timestamps.append({"word": word[2], "start": word[0], "end": word[1]})

Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_es.pt" to /root/.cache/torch/hub/checkpoints/wav2vec2_voxpopuli_base_10k_asr_es.pt
100%|██████████| 360M/360M [00:05<00:00, 66.4MB/s]


CPU times: user 5.36 s, sys: 1.35 s, total: 6.71 s
Wall time: 12.2 s


In [None]:
print(word_timestamps)

[{'word': 'Aló.', 'start': 0.663, 'end': 0.803, 'score': 0.035}, {'word': 'Buenos', 'start': 0.823, 'end': 3.324, 'score': 0.496}, {'word': 'días,', 'start': 3.385, 'end': 3.725, 'score': 0.698}, {'word': 'me', 'start': 3.745, 'end': 3.825, 'score': 0.544}, {'word': 'comunico', 'start': 3.865, 'end': 4.265, 'score': 0.692}, {'word': 'con', 'start': 4.305, 'end': 4.425, 'score': 0.661}, {'word': 'Rafael', 'start': 4.485, 'end': 4.905, 'score': 0.702}, {'word': 'Dávalo.', 'start': 4.965, 'end': 5.326, 'score': 0.774}, {'word': '¿Y', 'start': 6.486, 'end': 6.546, 'score': 0.568}, {'word': 'con', 'start': 6.586, 'end': 6.686, 'score': 0.436}, {'word': 'él?', 'start': 6.766, 'end': 6.906, 'score': 0.495}, {'word': 'Le', 'start': 7.827, 'end': 7.927, 'score': 0.788}, {'word': 'habla', 'start': 7.987, 'end': 8.147, 'score': 0.447}, {'word': 'Angie', 'start': 8.187, 'end': 8.467, 'score': 0.749}, {'word': 'Valente,', 'start': 8.487, 'end': 8.868, 'score': 0.566}, {'word': 'de', 'start': 8.888,

## CONVERSION TO MONO

In [None]:
%%time
sound = AudioSegment.from_file(vocal_target).set_channels(1)
ROOT = os.getcwd()
temp_path = os.path.join(ROOT, "temp_outputs")
os.makedirs(temp_path, exist_ok=True)
sound.export(os.path.join(temp_path, "mono_file.wav"), format="wav")

CPU times: user 9.98 ms, sys: 5.04 ms, total: 15 ms
Wall time: 15.2 ms


<_io.BufferedRandom name='/content/temp_outputs/mono_file.wav'>

## AUDIO DIARIZATION

In [None]:
%%time
# Initialize NeMo MSDD diarization model
# DOMAIN_TYPE: can be meeting, telephonic, or general based on domain type of the audio file
msdd_model = NeuralDiarizer(cfg=create_config(temp_path, DOMAIN_TYPE="telephonic")).to("cuda")
msdd_model.diarize()

del msdd_model
torch.cuda.empty_cache()

[NeMo I 2024-08-18 02:42:44 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-08-18 02:42:44 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/diar_msdd_telephonic/versions/1.0.1/files/diar_msdd_telephonic.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-08-18 02:42:46 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-18 02:42:47 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-18 02:42:47 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-18 02:42:47 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-18 02:42:47 features:289] PADDING: 16
[NeMo I 2024-08-18 02:42:47 features:289] PADDING: 16
[NeMo I 2024-08-18 02:42:48 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-18 02:42:48 features:289] PADDING: 16
[NeMo I 2024-08-18 02:42:49 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-18 02:42:49 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/vad_multilingual_marblenet/versions/1.10.0/files/vad_multilingual_marblenet.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-18 02:42:49 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-18 02:42:49 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-18 02:42:49 features:289] PADDING: 16
[NeMo I 2024-08-18 02:42:49 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-18 02:42:50 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-18 02:42:50 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-08-18 02:42:50 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-18 02:42:50 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:14<00:00, 14.39s/it]

[NeMo I 2024-08-18 02:43:04 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-18 02:43:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-18 02:43:04 collections:302] Dataset loaded with 4 items, total duration of  0.05 hours.
[NeMo I 2024-08-18 02:43:04 collections:304] # 4 files loaded accounting to # 1 labels



vad: 100%|██████████| 4/4 [00:00<00:00,  4.24it/s]

[NeMo I 2024-08-18 02:43:05 clustering_diarizer:250] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-08-18 02:43:06 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.47it/s]

[NeMo I 2024-08-18 02:43:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-18 02:43:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-18 02:43:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-18 02:43:07 collections:302] Dataset loaded with 145 items, total duration of  0.05 hours.
[NeMo I 2024-08-18 02:43:07 collections:304] # 145 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.56it/s]

[NeMo I 2024-08-18 02:43:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-18 02:43:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-18 02:43:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-18 02:43:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-18 02:43:07 collections:302] Dataset loaded with 174 items, total duration of  0.06 hours.
[NeMo I 2024-08-18 02:43:07 collections:304] # 174 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.89it/s]

[NeMo I 2024-08-18 02:43:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-18 02:43:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-18 02:43:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-18 02:43:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-18 02:43:08 collections:302] Dataset loaded with 222 items, total duration of  0.06 hours.
[NeMo I 2024-08-18 02:43:08 collections:304] # 222 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  6.64it/s]

[NeMo I 2024-08-18 02:43:09 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-18 02:43:09 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-18 02:43:09 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-18 02:43:09 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-18 02:43:09 collections:302] Dataset loaded with 297 items, total duration of  0.06 hours.
[NeMo I 2024-08-18 02:43:09 collections:304] # 297 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.43it/s]

[NeMo I 2024-08-18 02:43:09 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-18 02:43:09 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-18 02:43:09 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-18 02:43:09 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-18 02:43:09 collections:302] Dataset loaded with 456 items, total duration of  0.06 hours.
[NeMo I 2024-08-18 02:43:09 collections:304] # 456 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 8/8 [00:01<00:00,  4.63it/s]

[NeMo I 2024-08-18 02:43:11 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:01<00:00,  1.53s/it]

[NeMo I 2024-08-18 02:43:13 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-18 02:43:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-18 02:43:13 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-18 02:43:13 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-18 02:43:13 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-18 02:43:13 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-18 02:43:13 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-18 02:43:13 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 22.31it/s]

[NeMo I 2024-08-18 02:43:13 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-18 02:43:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-18 02:43:13 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-18 02:43:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-18 02:43:13 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-18 02:43:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-18 02:43:13 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-18 02:43:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-18 02:43:13 msdd_models:1431]   
    
CPU times: user 22.8 s, sys: 930 ms, total: 23.8 s
Wall time: 28.9 s


## SPEAKER MAPPING BY TIME

In [None]:
%%time
# Reading timestamps <> Speaker Labels mapping

speaker_ts = []
with open(os.path.join(temp_path, "pred_rttms", "mono_file.rttm"), "r") as f:
    lines = f.readlines()
    for line in lines:
        line_list = line.split(" ")
        s = int(float(line_list[5]) * 1000)
        e = s + int(float(line_list[8]) * 1000)
        speaker_ts.append([s, e, int(line_list[11].split("_")[-1])])

wsm = get_words_speaker_mapping(word_timestamps, speaker_ts, "start")

CPU times: user 2.47 ms, sys: 0 ns, total: 2.47 ms
Wall time: 2.29 ms
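The indices 5, 8, and 11 used in the cell above only line up when the RTTM line contains runs of several spaces, because `split(" ")` yields an empty string between adjacent spaces. A minimal sketch of a whitespace-agnostic parse follows, assuming the usual RTTM field order (type, file ID, channel, onset, duration, two placeholders, speaker name). `parse_rttm_line` and both sample lines are hypothetical and not part of the notebook's helpers, and the sketch rounds to the nearest millisecond where the cell truncates with `int()`.

```python
def parse_rttm_line(line):
    """Return [start_ms, end_ms, speaker_id] from one RTTM SPEAKER line."""
    fields = line.split()  # splits on any run of whitespace
    start_ms = round(float(fields[3]) * 1000)           # onset in seconds -> ms
    end_ms = start_ms + round(float(fields[4]) * 1000)  # onset + duration
    speaker_id = int(fields[7].split("_")[-1])          # "speaker_0" -> 0
    return [start_ms, end_ms, speaker_id]


# Hypothetical sample lines: one with padded spacing, one with single spaces
padded = "SPEAKER mono_file 1   0.663   0.140 <NA> <NA> speaker_0 <NA> <NA>"
single = "SPEAKER mono_file 1 0.663 0.140 <NA> <NA> speaker_0 <NA> <NA>"
print(parse_rttm_line(padded))
print(parse_rttm_line(single))
```

Both calls return the same triple, so the parse no longer depends on how the file was padded.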


In [None]:
print(wsm)

[{'word': 'Aló.', 'start_time': 663, 'end_time': 803, 'speaker': 0}, {'word': 'Buenos', 'start_time': 823, 'end_time': 3324, 'speaker': 0}, {'word': 'días,', 'start_time': 3385, 'end_time': 3725, 'speaker': 1}, {'word': 'me', 'start_time': 3745, 'end_time': 3825, 'speaker': 1}, {'word': 'comunico', 'start_time': 3865, 'end_time': 4265, 'speaker': 1}, {'word': 'con', 'start_time': 4305, 'end_time': 4425, 'speaker': 1}, {'word': 'Rafael', 'start_time': 4485, 'end_time': 4905, 'speaker': 1}, {'word': 'Dávalo.', 'start_time': 4965, 'end_time': 5326, 'speaker': 1}, {'word': '¿Y', 'start_time': 6486, 'end_time': 6546, 'speaker': 0}, {'word': 'con', 'start_time': 6586, 'end_time': 6686, 'speaker': 0}, {'word': 'él?', 'start_time': 6766, 'end_time': 6906, 'speaker': 0}, {'word': 'Le', 'start_time': 7827, 'end_time': 7927, 'speaker': 1}, {'word': 'habla', 'start_time': 7987, 'end_time': 8147, 'speaker': 1}, {'word': 'Angie', 'start_time': 8186, 'end_time': 8467, 'speaker': 1}, {'word': 'Valente
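The printed list pairs every word with a speaker label. One plausible way to build such a mapping is to walk the words and the speaker turns in parallel and give each word the turn that contains its start time. The sketch below illustrates that idea only: `assign_speakers` is a hypothetical name, the input data is made up, and the notebook's `get_words_speaker_mapping` helper may handle anchors and edge cases differently.

```python
def assign_speakers(words, turns):
    """Attach a speaker to each word using the word's start time (ms).

    words: list of dicts with "word", "start_time", "end_time"
    turns: list of [start_ms, end_ms, speaker_id], sorted by start time
    """
    labeled = []
    turn_idx = 0
    for word in words:
        anchor = word["start_time"]
        # advance to the first turn that has not ended before the anchor
        while turn_idx < len(turns) - 1 and anchor > turns[turn_idx][1]:
            turn_idx += 1
        labeled.append({**word, "speaker": turns[turn_idx][2]})
    return labeled


# Hypothetical input shaped like the notebook's data
turns = [[0, 3300, 0], [3350, 5400, 1]]
words = [
    {"word": "Aló.", "start_time": 663, "end_time": 803},
    {"word": "días,", "start_time": 3385, "end_time": 3725},
]
for item in assign_speakers(words, turns):
    print(item["word"], "-> speaker", item["speaker"])
```

Because both lists are sorted by time, a single forward pass is enough and no word needs to be compared against every turn.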

## SPEAKER STRUCTURE

In [None]:
%%time
if language in punct_model_langs:
    # restoring punctuation in the transcript to help realign the sentences
    punct_model = PunctuationModel(model="kredor/punctuate-all")

    words_list = list(map(lambda x: x["word"], wsm))

    labled_words = punct_model.predict(words_list)

    ending_puncts = ".?!"
    model_puncts = ".,;:!?"

    # We don't want to punctuate U.S.A. with a period. Right?
    is_acronym = lambda x: re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", x)

    for word_dict, labeled_tuple in zip(wsm, labled_words):
        word = word_dict["word"]
        if (
            word
            and labeled_tuple[1] in ending_puncts
            and (word[-1] not in model_puncts or is_acronym(word))
        ):
            word += labeled_tuple[1]
            if word.endswith(".."):
                word = word.rstrip(".")
            word_dict["word"] = word

else:
    logging.warning(
        f"Punctuation restoration is not available for {language} language. Using the original punctuation."
    )

wsm = get_realigned_ws_mapping_with_punctuation(wsm)
ssm = get_sentences_speaker_mapping(wsm, speaker_ts)

config.json:   0%|          | 0.00/914 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/447 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

CPU times: user 5.44 s, sys: 3.8 s, total: 9.25 s
Wall time: 32.3 s
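The condition inside the loop above can be tried without downloading the punctuation model. The sketch below copies that condition into a small function and feeds it hand-written pairs. `restore_ending` is a hypothetical name, and the label `"0"` for "no punctuation" is an assumption about the model's output format.

```python
import re

ENDING_PUNCTS = ".?!"
MODEL_PUNCTS = ".,;:!?"


def is_acronym(word):
    # two or more "letter + period" groups, e.g. "U.S.A."
    return re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", word) is not None


def restore_ending(word, predicted):
    """Append `predicted` to `word` when the cell's condition holds."""
    if (
        word
        and predicted in ENDING_PUNCTS
        and (word[-1] not in MODEL_PUNCTS or is_acronym(word))
    ):
        word += predicted
        if word.endswith(".."):
            word = word.rstrip(".")
    return word


# Hypothetical (word, predicted label) pairs
pairs = [("gracias", "."), ("días,", "."), ("U.S.A.", "."), ("bien", "0")]
for word, label in pairs:
    print(repr(word), label, "->", repr(restore_ending(word, label)))
```

With this logic a word that already ends in punctuation is left alone, while an acronym followed by a predicted period ends up with no trailing period at all, since `rstrip(".")` removes every trailing dot.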


In [None]:
print(ssm)

[{'speaker': 'Speaker 0', 'start_time': 0, 'end_time': 803, 'text': 'Aló. '}, {'speaker': 'Speaker 1', 'start_time': 823, 'end_time': 5326, 'text': 'Buenos días, me comunico con Rafael Dávalo. '}, {'speaker': 'Speaker 0', 'start_time': 6486, 'end_time': 6906, 'text': '¿Y con él? '}, {'speaker': 'Speaker 1', 'start_time': 7827, 'end_time': 10508, 'text': 'Le habla Angie Valente, de la Universidad Javeriana de Cali. '}, {'speaker': 'Speaker 1', 'start_time': 10548, 'end_time': 11369, 'text': '¿Cómo se encuentra? '}, {'speaker': 'Speaker 0', 'start_time': 12410, 'end_time': 13670, 'text': 'Muy bien, gracias a Dios, usted. '}, {'speaker': 'Speaker 1', 'start_time': 14911, 'end_time': 16052, 'text': 'Muy bien, gracias a Dios. '}, {'speaker': 'Speaker 1', 'start_time': 16932, 'end_time': 23336, 'text': 'Nos estamos comunicando, señor Rafael, por su interés en el seminario en Business Analysis con Power BI. '}, {'speaker': 'Speaker 0', 'start_time': 24796, 'end_time': 25657, 'text': 'Ajá, sí,
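In the output above, consecutive sentences can belong to the same speaker (for example the two `Speaker 1` entries starting at 7827 ms and 10548 ms). A minimal sketch of collapsing such runs into single turns is shown below. `merge_turns` is a hypothetical helper, not something the notebook defines, and the sample entries are copied from the printed output.

```python
def merge_turns(sentences):
    """Merge consecutive sentences spoken by the same speaker."""
    merged = []
    for sent in sentences:
        if merged and merged[-1]["speaker"] == sent["speaker"]:
            merged[-1]["end_time"] = sent["end_time"]  # extend the turn
            merged[-1]["text"] += sent["text"]
        else:
            merged.append(dict(sent))  # copy so the input stays untouched
    return merged


sample = [
    {"speaker": "Speaker 0", "start_time": 6486, "end_time": 6906,
     "text": "¿Y con él? "},
    {"speaker": "Speaker 1", "start_time": 7827, "end_time": 10508,
     "text": "Le habla Angie Valente, de la Universidad Javeriana de Cali. "},
    {"speaker": "Speaker 1", "start_time": 10548, "end_time": 11369,
     "text": "¿Cómo se encuentra? "},
]
for turn in merge_turns(sample):
    print(turn["speaker"], turn["start_time"], turn["end_time"], turn["text"])
```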

## EXPORTING RESULTS

In [None]:
%%time
path_textfile_with_speakers = f"{os.path.splitext(audio_path)[0]}.txt"
path_srtfile_with_speakers = f"{os.path.splitext(audio_path)[0]}.srt"

with open(path_textfile_with_speakers, "w", encoding="utf-8-sig") as f:
    get_speaker_aware_transcript(ssm, f)

with open(path_srtfile_with_speakers, "w", encoding="utf-8-sig") as srt:
    write_srt(ssm, srt)

cleanup(temp_path)

CPU times: user 3.91 ms, sys: 1.03 ms, total: 4.94 ms
Wall time: 10.8 ms
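The `.srt` file needs timestamps in the `HH:MM:SS,mmm` form, whereas `ssm` stores plain milliseconds. The sketch below shows one way to do that conversion and to lay out a single subtitle block. `format_srt_timestamp` and `srt_block` are hypothetical names, and the notebook's `write_srt` helper may format speaker labels differently.

```python
def format_srt_timestamp(ms):
    """Convert integer milliseconds to the SRT form HH:MM:SS,mmm."""
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


def srt_block(index, sentence):
    """Build one SRT entry: counter, time range, then the text line."""
    start = format_srt_timestamp(sentence["start_time"])
    end = format_srt_timestamp(sentence["end_time"])
    text = f"{sentence['speaker']}: {sentence['text'].strip()}"
    return f"{index}\n{start} --> {end}\n{text}\n"


example = {"speaker": "Speaker 1", "start_time": 823, "end_time": 5326,
           "text": "Buenos días, me comunico con Rafael Dávalo. "}
print(srt_block(1, example))
```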


In [None]:
def create_text(audio_path):
    # rename the audio file if necessary so its name is lower case, with no accents or spaces
    audio_path = rename_file(audio_path)
    # vocal_target = isolate_string(audio_path)
    # whisper_results, language = transcribe(vocal_target)
    return audio_path

# FOLDER PROCESSING

### FILEPATH STANDARDIZATION

In [None]:
# List every file in a given folder, for example '/content'
files = os.listdir('/content')
# print(files)

wav_files = [file for file in files if file.endswith('.wav')]
# print(wav_files)

for filepath in wav_files:
  suffix = Path(filepath).suffix
  # clean only the stem so the extension is kept as-is
  new_name = cleanString(Path(filepath).stem) + suffix
  # os.listdir returns bare names, so build the full paths explicitly
  os.rename(os.path.join('/content', filepath), os.path.join('/content', new_name))
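The renaming step relies on `cleanString`, which is defined elsewhere in the notebook. For readers who want to see what such a normalization could look like, here is a minimal sketch that lower-cases a name, strips accents, and replaces every other character run with an underscore. `normalize_stem` and `normalized_name` are hypothetical and are not claimed to match `cleanString` exactly.

```python
import re
import unicodedata
from pathlib import Path


def normalize_stem(stem):
    """Lower-case, drop accents, and turn other character runs into '_'."""
    decomposed = unicodedata.normalize("NFKD", stem)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


def normalized_name(filename):
    """Normalize the stem and lower-case the extension."""
    path = Path(filename)
    return normalize_stem(path.stem) + path.suffix.lower()


print(normalized_name("Llamada Señor Dávalo 2023-06-10T14_15 UTC.WAV"))
```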


In [None]:
files = os.listdir('/content')
wav_files = [fl for fl in files if Path(fl).suffix == ".wav"]
print(wav_files)

['b3067024c98b48888d1b93992535ae9a_20230610t14_15_utc.wav', '18b2b93c2953487c97c8b73ce36135fb_20230610t15_48_utc.wav', '033864c87c5f4a4fab44a201563fd968_20230610t15_05_utc.wav', '3030567b3ee042bb8a838d1c76b6b916_20230610t14_01_utc.wav', '7561a59a1cc24d3f9dcb726bda6b777e_20230610t14_18_utc.wav', '2423c4d401d1424ab3843e5ddda79e7a_20230610t16_50_utc.wav', '975df3115269486098f522a3eb856dfd_20230610t15_05_utc.wav', '34958b78223b449782e44c3afd0d0fd7_20230610t14_27_utc.wav', 'd32ed68f007043a4a4eb582c82015c33_20230610t14_59_utc.wav', '3fde7eca0a2f4b188a3fd81fce9a9d09_20230610t14_48_utc.wav', 'ca04630280544cbf8b9bee6854a2f681_20230610t14_13_utc.wav', '5fab44f212e3402d99638a5dc9d97701_20230610t14_26_utc.wav', 'e091c45596054f6ea5fac6b7388e115c_20230610t15_13_utc.wav', 'd9553adbba9c45bead6e1a2bded61ad8_20230610t16_33_utc.wav', '934de5177dd94afd8f8a198b2226c2e5_20230610t15_00_utc.wav', '45adf080f3e24793900699ada4592d63_20230610t15_47_utc.wav', 'eda880c7ca6f4e40835aa372d01627c0_20230610t14_59_utc.wa

In [None]:
len(wav_files)

59

In [None]:
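# Storage cleanup: deletes every .txt and .wav file in /content,
# so run this cell only once those files are no longer needed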
files = os.listdir('/content')
not_txt = [fl for fl in files if Path(fl).suffix != ".txt"]
elm = [fl for fl in files if Path(fl).suffix in (".txt", ".wav")]
# len(elm)
for nt in elm:
  mono_file_path = os.path.join('/content', nt)
  # print(mono_file_path)
  cleanup(mono_file_path)
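Since the cleanup cell deletes files by extension, a preview step can help avoid removing audio that has not been transcribed yet. The sketch below is one way to do that with `pathlib`. `remove_by_suffix` and its `dry_run` flag are hypothetical, and the demonstration runs in a temporary directory rather than `/content`.

```python
import tempfile
from pathlib import Path


def remove_by_suffix(folder, suffixes, dry_run=True):
    """List (and optionally delete) files in `folder` whose suffix matches."""
    matched = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix in suffixes:
            if not dry_run:
                path.unlink()
            matched.append(path.name)
    return matched


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.wav", "b.txt", "keep.json"):
        (Path(tmp) / name).touch()
    print(remove_by_suffix(tmp, {".txt", ".wav"}))                 # preview only
    print(remove_by_suffix(tmp, {".txt", ".wav"}, dry_run=False))  # deletes
    print(sorted(p.name for p in Path(tmp).iterdir()))             # what is left
```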

### TRANSCRIPTION PROCESS

In [None]:
# wav_files = ['0c1dacc20d21474f9f300b2bc3069d1a_20230401t16_11_utc.wav']

# vocal_target = audio_path
language = None
batch_size = 8
whisper_model_name = "large-v3"
compute_type = "float16"
suppress_numerals = True
device = "cuda"

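# Note: `language` is reassigned inside the loop, so the language detected
# for the first file is passed to transcribe_batched for every later file
for vocal_target in wav_files: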

  whisper_results, language = transcribe_batched(
      vocal_target,
      language,
      batch_size,
      whisper_model_name,
      compute_type,
      suppress_numerals,
      device,
  )

######################################
#########      WAV2VEC2      #########
######################################

  alignment_model, metadata = whisperx.load_align_model(
      language_code=language, device=device
  )
  result_aligned = whisperx.align(
      whisper_results, alignment_model, metadata, vocal_target, device
  )
  word_timestamps = filter_missing_timestamps(result_aligned["word_segments"])

  # clear gpu vram
  del alignment_model
  torch.cuda.empty_cache()

######################################
########   CONVERT TO MONO   #########
######################################

  sound = AudioSegment.from_file(vocal_target).set_channels(1)
  ROOT = os.getcwd()
  temp_path = os.path.join(ROOT, "temp_outputs")
  os.makedirs(temp_path, exist_ok=True)
  sound.export(os.path.join(temp_path, "mono_file.wav"), format="wav")

#######################################
########  AUDIO DIARIZATION   #########
#######################################

  msdd_model = NeuralDiarizer(cfg=create_config(temp_path, DOMAIN_TYPE="telephonic")).to("cuda")
  msdd_model.diarize()

  del msdd_model
  torch.cuda.empty_cache()

#######################################
#####  SPEAKER MAPPING BY TIME   ######
#######################################


  speaker_ts = []
  with open(os.path.join(temp_path, "pred_rttms", "mono_file.rttm"), "r") as f:
      lines = f.readlines()
      for line in lines:
          line_list = line.split(" ")
          s = int(float(line_list[5]) * 1000)
          e = s + int(float(line_list[8]) * 1000)
          speaker_ts.append([s, e, int(line_list[11].split("_")[-1])])

  wsm = get_words_speaker_mapping(word_timestamps, speaker_ts, "start")

#######################################
#########  SPEAKER STRUCTURE  #########
#######################################

  punct_model = PunctuationModel(model="kredor/punctuate-all")

  words_list = list(map(lambda x: x["word"], wsm))

  labled_words = punct_model.predict(words_list)

  ending_puncts = ".?!"
  model_puncts = ".,;:!?"

  # We don't want to punctuate U.S.A. with a period. Right?
  is_acronym = lambda x: re.fullmatch(r"\b(?:[a-zA-Z]\.){2,}", x)

  for word_dict, labeled_tuple in zip(wsm, labled_words):
      word = word_dict["word"]
      if (
          word
          and labeled_tuple[1] in ending_puncts
          and (word[-1] not in model_puncts or is_acronym(word))
      ):
          word += labeled_tuple[1]
          if word.endswith(".."):
              word = word.rstrip(".")
          word_dict["word"] = word

  wsm = get_realigned_ws_mapping_with_punctuation(wsm)
  ssm = get_sentences_speaker_mapping(wsm, speaker_ts)


#######################################
##########  EXPORT RESULTS   ##########
#######################################

  path_textfile_with_speakers = f"/content/drive/MyDrive/audios/2023-06-10a/{os.path.splitext(vocal_target)[0]}.txt"  #JSON #Pandas(Speaker 1 - Speaker 2))
  # path_srtfile_with_speakers = f"/content/drive/MyDrive/audios/{os.path.splitext(vocal_target)[0]}.srt"

  with open(path_textfile_with_speakers, "w", encoding="utf-8-sig") as f:
      get_speaker_aware_transcript(ssm, f)

  # with open(path_srtfile_with_speakers, "w", encoding="utf-8-sig") as srt:
  #     write_srt(ssm, srt)

  mono_file_path = os.path.join(temp_path, "mono_file.wav")
  cleanup(mono_file_path)

  print(f'archivo {vocal_target} listo')
# print(whisper_results)
# print(word_timestamps)
# print(wsm)
# print(ssm)
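The loop above has no error handling, so it stops at the first file that raises an exception, and rerunning the cell processes every file again from the start. A minimal sketch of a more forgiving batch driver follows: it skips files whose transcript already exists and records failures instead of aborting. `run_batch`, `process_one`, and `fake_process` are hypothetical. In the notebook, `process_one` would wrap the transcription, alignment, diarization, and export steps for a single file.

```python
import tempfile
from pathlib import Path


def run_batch(filenames, process_one, out_dir):
    """Call process_one(name, out_path) per file and collect the outcome."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    done, skipped, failed = [], [], {}
    for name in filenames:
        out_path = out_dir / (Path(name).stem + ".txt")
        if out_path.exists():  # transcript already written in an earlier run
            skipped.append(name)
            continue
        try:
            process_one(name, out_path)
            done.append(name)
        except Exception as exc:  # keep going and report at the end
            failed[name] = repr(exc)
    return done, skipped, failed


def fake_process(name, out_path):
    """Stand-in for the real pipeline; fails on names containing 'bad'."""
    if "bad" in name:
        raise ValueError("simulated failure")
    out_path.write_text(f"transcript of {name}\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    print(run_batch(["a.wav", "bad.wav"], fake_process, tmp))
    print(run_batch(["a.wav", "bad.wav"], fake_process, tmp))  # a.wav is skipped
```

Checking for the output file rather than keeping state in memory means the skip logic still works after a Colab runtime restart, provided the output folder persists (for example on Drive).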

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Detected language: es (0.98) in first 30s of audio...
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 82

[NeMo W 2024-08-26 03:36:11 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:36:11 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:36:11 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:36:11 features:289] PADDING: 16
[NeMo I 2024-08-26 03:36:12 features:289] PADDING: 16
[NeMo I 2024-08-26 03:36:12 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:36:13 features:289] PADDING: 16
[NeMo I 2024-08-26 03:36:14 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:36:14 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:36:14 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:36:14 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:36:14 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:36:14 features:289] PADDING: 16
[NeMo I 2024-08-26 03:36:14 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:36:14 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:36:14 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:36:14 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:36:14 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:36:14 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 30.47it/s]

[NeMo I 2024-08-26 03:36:14 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:36:14 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:36:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:36:14 collections:302] Dataset loaded with 4 items, total duration of  0.05 hours.
[NeMo I 2024-08-26 03:36:14 collections:304] # 4 files loaded accounting to # 1 labels



vad: 100%|██████████| 4/4 [00:00<00:00,  4.22it/s]

[NeMo I 2024-08-26 03:36:15 clustering_diarizer:250] Generating predictions with overlapping input segments



                                                               

[NeMo I 2024-08-26 03:36:16 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  6.96it/s]

[NeMo I 2024-08-26 03:36:16 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:36:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:36:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:36:17 collections:302] Dataset loaded with 201 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:36:17 collections:304] # 201 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  7.42it/s]

[NeMo I 2024-08-26 03:36:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings





[NeMo I 2024-08-26 03:36:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:36:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:36:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:36:17 collections:302] Dataset loaded with 244 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:36:17 collections:304] # 244 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  7.57it/s]

[NeMo I 2024-08-26 03:36:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:36:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:36:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:36:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:36:18 collections:302] Dataset loaded with 306 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:36:18 collections:304] # 306 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.26it/s]

[NeMo I 2024-08-26 03:36:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:36:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:36:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:36:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:36:18 collections:302] Dataset loaded with 418 items, total duration of  0.09 hours.
[NeMo I 2024-08-26 03:36:18 collections:304] # 418 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00, 10.12it/s]

[NeMo I 2024-08-26 03:36:19 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:36:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:36:19 clustering_diarizer:343] Extracting embeddings for Diarization





[NeMo I 2024-08-26 03:36:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:36:19 collections:302] Dataset loaded with 630 items, total duration of  0.09 hours.
[NeMo I 2024-08-26 03:36:19 collections:304] # 630 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 10.72it/s]


[NeMo I 2024-08-26 03:36:20 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.34it/s]

[NeMo I 2024-08-26 03:36:20 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:36:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:36:20 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:36:20 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:36:20 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:36:20 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:36:20 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:36:20 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 35.05it/s]

[NeMo I 2024-08-26 03:36:21 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:36:21 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:36:21 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:36:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:36:21 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:36:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:36:21 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:36:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:36:21 msdd_models:1431]   
    
archivo b3067024c98b48888d1b93992535ae9a_20230610t14_15_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:37:03 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:37:03 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:37:03 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:37:03 features:289] PADDING: 16
[NeMo I 2024-08-26 03:37:04 features:289] PADDING: 16
[NeMo I 2024-08-26 03:37:05 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:37:05 features:289] PADDING: 16
[NeMo I 2024-08-26 03:37:06 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:37:06 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:37:06 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:37:06 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:37:06 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:37:06 features:289] PADDING: 16
[NeMo I 2024-08-26 03:37:06 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:37:06 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:37:06 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:37:06 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:37:06 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:37:06 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 58.76it/s]

[NeMo I 2024-08-26 03:37:06 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:37:06 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:37:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:37:06 collections:302] Dataset loaded with 2 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:37:06 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.86it/s]

[NeMo I 2024-08-26 03:37:07 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:37:07 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 21.95it/s]

[NeMo I 2024-08-26 03:37:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:37:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:37:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:37:07 collections:302] Dataset loaded with 42 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:37:07 collections:304] # 42 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.42it/s]

[NeMo I 2024-08-26 03:37:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:37:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:37:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:37:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:37:07 collections:302] Dataset loaded with 51 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:37:07 collections:304] # 51 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.50it/s]

[NeMo I 2024-08-26 03:37:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:37:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:37:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:37:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:37:07 collections:302] Dataset loaded with 63 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:37:07 collections:304] # 63 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.70it/s]

[NeMo I 2024-08-26 03:37:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:37:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:37:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:37:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:37:08 collections:302] Dataset loaded with 88 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:37:08 collections:304] # 88 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 12.38it/s]

[NeMo I 2024-08-26 03:37:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings

[NeMo I 2024-08-26 03:37:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:37:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:37:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:37:08 collections:302] Dataset loaded with 134 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:37:08 collections:304] # 134 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 12.51it/s]

[NeMo I 2024-08-26 03:37:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.98it/s]

[NeMo I 2024-08-26 03:37:08 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:37:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:37:08 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:37:08 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:37:08 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:37:08 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:37:08 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:37:08 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 53.37it/s]

[NeMo I 2024-08-26 03:37:08 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:37:08 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:37:08 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:37:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:37:08 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:37:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:37:08 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:37:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:37:08 msdd_models:1431]   
    
archivo 18b2b93c2953487c97c8b73ce36135fb_20230610t15_48_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:38:11 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:38:11 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:38:11 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:38:11 features:289] PADDING: 16
[NeMo I 2024-08-26 03:38:11 features:289] PADDING: 16
[NeMo I 2024-08-26 03:38:12 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:38:12 features:289] PADDING: 16
[NeMo I 2024-08-26 03:38:13 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:38:13 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:38:13 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:38:13 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:38:13 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:38:13 features:289] PADDING: 16
[NeMo I 2024-08-26 03:38:13 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:38:13 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:38:13 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:38:13 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:38:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:38:13 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  8.47it/s]

[NeMo I 2024-08-26 03:38:13 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:38:13 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:38:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:38:13 collections:302] Dataset loaded with 9 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 03:38:13 collections:304] # 9 files loaded accounting to # 1 labels



vad: 100%|██████████| 9/9 [00:02<00:00,  3.41it/s]

[NeMo I 2024-08-26 03:38:16 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:38:20 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.34it/s]

[NeMo I 2024-08-26 03:38:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:38:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:38:21 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:38:21 collections:302] Dataset loaded with 378 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 03:38:21 collections:304] # 378 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  6.31it/s]

[NeMo I 2024-08-26 03:38:22 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:38:22 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:38:22 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:38:22 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:38:22 collections:302] Dataset loaded with 462 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 03:38:22 collections:304] # 462 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  8.16it/s]


[NeMo I 2024-08-26 03:38:23 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:38:23 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:38:23 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:38:23 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:38:23 collections:302] Dataset loaded with 573 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 03:38:23 collections:304] # 573 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  8.40it/s]

[NeMo I 2024-08-26 03:38:24 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:38:24 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:38:24 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:38:24 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:38:24 collections:302] Dataset loaded with 780 items, total duration of  0.16 hours.
[NeMo I 2024-08-26 03:38:24 collections:304] # 780 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00, 10.32it/s]

[NeMo I 2024-08-26 03:38:25 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:38:25 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json

[NeMo I 2024-08-26 03:38:25 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:38:25 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:38:25 collections:302] Dataset loaded with 1192 items, total duration of  0.16 hours.
[NeMo I 2024-08-26 03:38:25 collections:304] # 1192 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00, 11.29it/s]

[NeMo I 2024-08-26 03:38:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.40it/s]

[NeMo I 2024-08-26 03:38:27 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:38:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:38:28 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:38:28 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:38:28 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:38:28 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:38:28 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:38:28 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 11.42it/s]

[NeMo I 2024-08-26 03:38:28 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:38:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:38:28 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:38:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:38:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:38:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:38:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:38:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:38:28 msdd_models:1431]   
    
archivo 033864c87c5f4a4fab44a201563fd968_20230610t15_05_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:39:13 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:39:13 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:39:13 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:39:13 features:289] PADDING: 16
[NeMo I 2024-08-26 03:39:13 features:289] PADDING: 16
[NeMo I 2024-08-26 03:39:14 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:39:14 features:289] PADDING: 16
[NeMo I 2024-08-26 03:39:15 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:39:15 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:39:15 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:39:15 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:39:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:39:15 features:289] PADDING: 16
[NeMo I 2024-08-26 03:39:15 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:39:15 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:39:15 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:39:15 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:39:15 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:39:15 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 46.24it/s]

[NeMo I 2024-08-26 03:39:15 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:39:15 classification_models:273] Perform streaming frame-level VAD

[NeMo I 2024-08-26 03:39:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:39:15 collections:302] Dataset loaded with 2 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:39:15 collections:304] # 2 files loaded accounting to # 1 labels


vad: 100%|██████████| 2/2 [00:00<00:00,  3.90it/s]

[NeMo I 2024-08-26 03:39:15 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:39:16 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 14.22it/s]

[NeMo I 2024-08-26 03:39:16 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:39:16 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:39:16 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:39:16 collections:302] Dataset loaded with 90 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:39:16 collections:304] # 90 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.21it/s]

[NeMo I 2024-08-26 03:39:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:39:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:39:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:39:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:39:17 collections:302] Dataset loaded with 114 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:39:17 collections:304] # 114 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.05it/s]

[NeMo I 2024-08-26 03:39:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:39:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:39:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:39:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:39:17 collections:302] Dataset loaded with 142 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:39:17 collections:304] # 142 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.95it/s]

[NeMo I 2024-08-26 03:39:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:39:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:39:17 clustering_diarizer:343] Extracting embeddings for Diarization

[NeMo I 2024-08-26 03:39:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:39:17 collections:302] Dataset loaded with 190 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:39:17 collections:304] # 190 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.28it/s]

[NeMo I 2024-08-26 03:39:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:39:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:39:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:39:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:39:18 collections:302] Dataset loaded with 296 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:39:18 collections:304] # 296 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 11.79it/s]

[NeMo I 2024-08-26 03:39:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.33it/s]

[NeMo I 2024-08-26 03:39:18 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:39:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:39:18 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:39:18 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:39:18 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:39:18 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:39:18 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:39:18 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 49.31it/s]

[NeMo I 2024-08-26 03:39:18 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:39:18 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:39:18 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:39:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:39:19 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:39:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:39:19 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:39:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:39:19 msdd_models:1431]   
    
archivo 3030567b3ee042bb8a838d1c76b6b916_20230610t14_01_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:40:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:40:00 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:40:00 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:40:00 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:01 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:40:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:02 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:40:02 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:40:02 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:40:02 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:40:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:40:02 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:02 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:40:02 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:40:02 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:40:02 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:40:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:40:02 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 38.76it/s]

[NeMo I 2024-08-26 03:40:02 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:40:02 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:40:02 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:02 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:40:02 collections:304] # 2 files loaded accounting to # 1 labels


vad: 100%|██████████| 2/2 [00:00<00:00,  4.18it/s]

[NeMo I 2024-08-26 03:40:03 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:40:04 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  8.15it/s]

[NeMo I 2024-08-26 03:40:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:40:04 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:04 collections:302] Dataset loaded with 35 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:40:04 collections:304] # 35 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.24it/s]

[NeMo I 2024-08-26 03:40:04 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:40:04 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:04 collections:302] Dataset loaded with 40 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:40:04 collections:304] # 40 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.04it/s]

[NeMo I 2024-08-26 03:40:04 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:40:04 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:04 collections:302] Dataset loaded with 51 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:40:04 collections:304] # 51 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.37it/s]

[NeMo I 2024-08-26 03:40:05 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:05 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:40:05 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:05 collections:302] Dataset loaded with 70 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:40:05 collections:304] # 70 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  4.99it/s]

[NeMo I 2024-08-26 03:40:05 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:05 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:40:05 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:05 collections:302] Dataset loaded with 102 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:40:05 collections:304] # 102 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  5.20it/s]

[NeMo I 2024-08-26 03:40:05 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.92it/s]

[NeMo I 2024-08-26 03:40:06 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:40:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:06 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:40:06 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:40:06 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:40:06 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:40:06 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:40:06 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 12.95it/s]

[NeMo I 2024-08-26 03:40:06 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:40:06 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:40:06 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:40:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:06 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:40:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:06 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:40:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:06 msdd_models:1431]   
    
archivo 7561a59a1cc24d3f9dcb726bda6b777e_20230610t14_18_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:40:49 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:40:49 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:40:49 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:40:49 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:49 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:50 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:40:51 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:51 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:40:51 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:40:51 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:40:51 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:40:52 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:40:52 features:289] PADDING: 16
[NeMo I 2024-08-26 03:40:52 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:40:52 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:40:52 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:40:52 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:40:52 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:40:52 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 71.37it/s]

[NeMo I 2024-08-26 03:40:52 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:40:52 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:40:52 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:52 collections:302] Dataset loaded with 2 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:40:52 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  6.15it/s]

[NeMo I 2024-08-26 03:40:52 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:40:53 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 20.56it/s]

[NeMo I 2024-08-26 03:40:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:53 collections:302] Dataset loaded with 52 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:40:53 collections:304] # 52 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.69it/s]

[NeMo I 2024-08-26 03:40:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:53 collections:302] Dataset loaded with 63 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:40:53 collections:304] # 63 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.12it/s]


[NeMo I 2024-08-26 03:40:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:53 collections:302] Dataset loaded with 78 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:40:53 collections:304] # 78 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.64it/s]

[NeMo I 2024-08-26 03:40:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:53 collections:302] Dataset loaded with 106 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:40:53 collections:304] # 106 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.88it/s]

[NeMo I 2024-08-26 03:40:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:40:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:40:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:40:53 collections:302] Dataset loaded with 162 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:40:53 collections:304] # 162 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 11.25it/s]

[NeMo I 2024-08-26 03:40:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.78it/s]

[NeMo I 2024-08-26 03:40:54 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:40:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:54 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:40:54 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:40:54 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:40:54 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:40:54 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:40:54 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 59.25it/s]

[NeMo I 2024-08-26 03:40:54 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:40:54 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:40:54 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:40:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:54 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:40:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:54 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:40:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:40:54 msdd_models:1431]   
    
archivo 2423c4d401d1424ab3843e5ddda79e7a_20230610t16_50_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:41:39 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:41:39 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:41:39 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:41:39 features:289] PADDING: 16
[NeMo I 2024-08-26 03:41:40 features:289] PADDING: 16
[NeMo I 2024-08-26 03:41:40 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:41:40 features:289] PADDING: 16
[NeMo I 2024-08-26 03:41:41 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:41:41 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:41:41 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:41:41 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:41:41 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:41:41 features:289] PADDING: 16
[NeMo I 2024-08-26 03:41:41 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:41:41 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:41:41 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:41:41 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:41:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:41:41 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 38.32it/s]

[NeMo I 2024-08-26 03:41:41 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:41:41 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:41:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:41:41 collections:302] Dataset loaded with 2 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:41:41 collections:304] # 2 files loaded accounting to # 1 labels


vad: 100%|██████████| 2/2 [00:00<00:00,  3.66it/s]

[NeMo I 2024-08-26 03:41:42 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:41:43 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 13.18it/s]

[NeMo I 2024-08-26 03:41:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:41:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:41:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:41:43 collections:302] Dataset loaded with 91 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:41:43 collections:304] # 91 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.18it/s]

[NeMo I 2024-08-26 03:41:43 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:41:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:41:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:41:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:41:43 collections:302] Dataset loaded with 109 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:41:43 collections:304] # 109 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.35it/s]

[NeMo I 2024-08-26 03:41:43 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:41:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:41:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:41:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:41:43 collections:302] Dataset loaded with 140 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:41:43 collections:304] # 140 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 10.06it/s]

[NeMo I 2024-08-26 03:41:44 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:41:44 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:41:44 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:41:44 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:41:44 collections:302] Dataset loaded with 191 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:41:44 collections:304] # 191 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.49it/s]

[NeMo I 2024-08-26 03:41:44 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:41:44 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:41:44 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:41:44 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:41:44 collections:302] Dataset loaded with 290 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:41:44 collections:304] # 290 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.78it/s]


[NeMo I 2024-08-26 03:41:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.61it/s]

[NeMo I 2024-08-26 03:41:45 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:41:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:41:45 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:41:45 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:41:45 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:41:45 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:41:45 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:41:45 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 26.94it/s]

[NeMo I 2024-08-26 03:41:45 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:41:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:41:45 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:41:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:41:45 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:41:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:41:45 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:41:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:41:45 msdd_models:1431]   
    
file 975df3115269486098f522a3eb856dfd_20230610t15_05_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:42:29 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:42:29 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:42:29 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:42:29 features:289] PADDING: 16
[NeMo I 2024-08-26 03:42:29 features:289] PADDING: 16
[NeMo I 2024-08-26 03:42:30 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:42:30 features:289] PADDING: 16
[NeMo I 2024-08-26 03:42:30 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:42:30 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:42:30 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:42:30 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:42:31 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:42:31 features:289] PADDING: 16
[NeMo I 2024-08-26 03:42:31 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:42:31 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:42:31 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:42:31 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:42:31 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:42:31 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 47.25it/s]

[NeMo I 2024-08-26 03:42:31 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:42:31 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:42:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:42:31 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:42:31 collections:304] # 2 files loaded accounting to # 1 labels


vad: 100%|██████████| 2/2 [00:00<00:00,  5.28it/s]

[NeMo I 2024-08-26 03:42:31 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:42:32 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 16.87it/s]

[NeMo I 2024-08-26 03:42:32 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:42:32 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:42:32 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:42:32 collections:302] Dataset loaded with 67 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:42:32 collections:304] # 67 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.73it/s]

[NeMo I 2024-08-26 03:42:32 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:42:32 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:42:32 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:42:32 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:42:32 collections:302] Dataset loaded with 80 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:42:32 collections:304] # 80 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.53it/s]

[NeMo I 2024-08-26 03:42:32 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:42:32 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:42:32 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:42:32 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:42:32 collections:302] Dataset loaded with 101 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:42:32 collections:304] # 101 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.10it/s]

[NeMo I 2024-08-26 03:42:33 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:42:33 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:42:33 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:42:33 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:42:33 collections:302] Dataset loaded with 136 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:42:33 collections:304] # 136 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 10.94it/s]

[NeMo I 2024-08-26 03:42:33 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:42:33 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:42:33 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:42:33 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:42:33 collections:302] Dataset loaded with 209 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:42:33 collections:304] # 209 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00, 10.41it/s]

[NeMo I 2024-08-26 03:42:33 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.96it/s]

[NeMo I 2024-08-26 03:42:34 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:42:34 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:42:34 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:42:34 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:42:34 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:42:34 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:42:34 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:42:34 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 52.79it/s]

[NeMo I 2024-08-26 03:42:34 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:42:34 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:42:34 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:42:34 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:42:34 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:42:34 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:42:34 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:42:34 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:42:34 msdd_models:1431]   
    
file 34958b78223b449782e44c3afd0d0fd7_20230610t14_27_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:43:32 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:43:32 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:43:32 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:43:32 features:289] PADDING: 16
[NeMo I 2024-08-26 03:43:32 features:289] PADDING: 16
[NeMo I 2024-08-26 03:43:33 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:43:33 features:289] PADDING: 16
[NeMo I 2024-08-26 03:43:34 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:43:34 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:43:34 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:43:34 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:43:34 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:43:34 features:289] PADDING: 16
[NeMo I 2024-08-26 03:43:34 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:43:34 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:43:34 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:43:34 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:43:34 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:43:34 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 17.71it/s]

[NeMo I 2024-08-26 03:43:34 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:43:34 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:43:34 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:43:34 collections:302] Dataset loaded with 6 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:43:34 collections:304] # 6 files loaded accounting to # 1 labels


vad: 100%|██████████| 6/6 [00:01<00:00,  4.40it/s]

[NeMo I 2024-08-26 03:43:35 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:43:38 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  4.86it/s]

[NeMo I 2024-08-26 03:43:38 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:43:38 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:43:38 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:43:38 collections:302] Dataset loaded with 264 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:43:38 collections:304] # 264 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.52it/s]

[NeMo I 2024-08-26 03:43:39 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:43:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:43:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:43:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:43:39 collections:302] Dataset loaded with 319 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:43:39 collections:304] # 319 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  5.73it/s]

[NeMo I 2024-08-26 03:43:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:43:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:43:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:43:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:43:40 collections:302] Dataset loaded with 402 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:43:40 collections:304] # 402 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00,  7.52it/s]

[NeMo I 2024-08-26 03:43:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:43:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:43:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:43:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:43:42 collections:302] Dataset loaded with 541 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:43:42 collections:304] # 541 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  7.28it/s]

[NeMo I 2024-08-26 03:43:43 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:43:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:43:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:43:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:43:43 collections:302] Dataset loaded with 829 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:43:43 collections:304] # 829 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00,  8.02it/s]

[NeMo I 2024-08-26 03:43:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.60it/s]

[NeMo I 2024-08-26 03:43:45 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:43:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:43:45 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:43:45 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:43:45 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:43:45 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:43:45 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:43:45 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 24.18it/s]

[NeMo I 2024-08-26 03:43:45 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:43:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:43:45 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:43:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:43:46 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:43:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:43:46 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:43:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:43:46 msdd_models:1431]   
    
file d32ed68f007043a4a4eb582c82015c33_20230610t14_59_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:44:36 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:44:36 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:44:36 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:44:36 features:289] PADDING: 16
[NeMo I 2024-08-26 03:44:36 features:289] PADDING: 16
[NeMo I 2024-08-26 03:44:36 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:44:36 features:289] PADDING: 16
[NeMo I 2024-08-26 03:44:37 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:44:37 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:44:37 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:44:37 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:44:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:44:37 features:289] PADDING: 16
[NeMo I 2024-08-26 03:44:37 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:44:37 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:44:37 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:44:37 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:44:37 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:44:37 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 23.74it/s]

[NeMo I 2024-08-26 03:44:37 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:44:37 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:44:37 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:44:37 collections:302] Dataset loaded with 4 items, total duration of  0.05 hours.
[NeMo I 2024-08-26 03:44:37 collections:304] # 4 files loaded accounting to # 1 labels


vad: 100%|██████████| 4/4 [00:01<00:00,  3.90it/s]

[NeMo I 2024-08-26 03:44:38 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:44:41 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  4.33it/s]

[NeMo I 2024-08-26 03:44:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:44:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:44:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:44:41 collections:302] Dataset loaded with 176 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:44:41 collections:304] # 176 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.39it/s]

[NeMo I 2024-08-26 03:44:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:44:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:44:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:44:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:44:42 collections:302] Dataset loaded with 211 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:44:42 collections:304] # 211 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  6.37it/s]

[NeMo I 2024-08-26 03:44:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:44:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:44:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:44:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:44:42 collections:302] Dataset loaded with 266 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:44:42 collections:304] # 266 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.24it/s]

[NeMo I 2024-08-26 03:44:43 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:44:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:44:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:44:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:44:43 collections:302] Dataset loaded with 358 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:44:43 collections:304] # 358 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  7.13it/s]

[NeMo I 2024-08-26 03:44:44 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:44:44 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:44:44 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:44:44 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:44:44 collections:302] Dataset loaded with 547 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:44:44 collections:304] # 547 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 10.45it/s]

[NeMo I 2024-08-26 03:44:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.42it/s]

[NeMo I 2024-08-26 03:44:45 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:44:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:44:45 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:44:45 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:44:45 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:44:45 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:44:45 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:44:45 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 29.90it/s]

[NeMo I 2024-08-26 03:44:45 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:44:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:44:45 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:44:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:44:45 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:44:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:44:45 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:44:45 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:44:45 msdd_models:1431]   
    
archivo 3fde7eca0a2f4b188a3fd81fce9a9d09_20230610t14_48_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:45:43 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:45:43 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:45:43 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:45:43 features:289] PADDING: 16
[NeMo I 2024-08-26 03:45:43 features:289] PADDING: 16
[NeMo I 2024-08-26 03:45:44 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:45:44 features:289] PADDING: 16
[NeMo I 2024-08-26 03:45:44 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:45:44 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:45:44 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:45:44 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:45:45 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:45:45 features:289] PADDING: 16
[NeMo I 2024-08-26 03:45:45 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:45:45 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:45:45 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:45:45 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:45:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:45:45 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 15.27it/s]

[NeMo I 2024-08-26 03:45:45 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:45:45 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:45:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:45:45 collections:302] Dataset loaded with 7 items, total duration of  0.09 hours.
[NeMo I 2024-08-26 03:45:45 collections:304] # 7 files loaded accounting to # 1 labels



vad: 100%|██████████| 7/7 [00:01<00:00,  4.29it/s]

[NeMo I 2024-08-26 03:45:46 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:45:49 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  4.03it/s]

[NeMo I 2024-08-26 03:45:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:45:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:45:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:45:50 collections:302] Dataset loaded with 289 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:45:50 collections:304] # 289 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.44it/s]

[NeMo I 2024-08-26 03:45:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:45:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:45:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:45:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:45:50 collections:302] Dataset loaded with 354 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:45:50 collections:304] # 354 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  7.72it/s]

[NeMo I 2024-08-26 03:45:51 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:45:51 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:45:51 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:45:51 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:45:51 collections:302] Dataset loaded with 443 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:45:51 collections:304] # 443 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00,  8.13it/s]

[NeMo I 2024-08-26 03:45:52 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:45:52 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:45:52 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:45:52 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:45:52 collections:302] Dataset loaded with 596 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 03:45:52 collections:304] # 596 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 10/10 [00:01<00:00,  8.65it/s]

[NeMo I 2024-08-26 03:45:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:45:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:45:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:45:54 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:45:54 collections:302] Dataset loaded with 911 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 03:45:54 collections:304] # 911 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00,  8.72it/s]


[NeMo I 2024-08-26 03:45:55 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.19it/s]

[NeMo I 2024-08-26 03:45:56 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:45:56 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:45:56 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:45:56 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:45:56 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:45:56 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:45:56 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:45:56 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 16.29it/s]

[NeMo I 2024-08-26 03:45:57 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:45:57 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:45:57 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:45:57 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:45:57 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:45:57 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:45:57 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:45:57 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:45:57 msdd_models:1431]   
    
archivo ca04630280544cbf8b9bee6854a2f681_20230610t14_13_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:46:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:46:38 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:46:38 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:46:38 features:289] PADDING: 16
[NeMo I 2024-08-26 03:46:38 features:289] PADDING: 16
[NeMo I 2024-08-26 03:46:39 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:46:39 features:289] PADDING: 16
[NeMo I 2024-08-26 03:46:40 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:46:40 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:46:40 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:46:40 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:46:40 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:46:40 features:289] PADDING: 16
[NeMo I 2024-08-26 03:46:40 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:46:40 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:46:40 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:46:40 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:46:40 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:46:40 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 48.16it/s]

[NeMo I 2024-08-26 03:46:40 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:46:40 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:46:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:46:40 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:46:40 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.73it/s]

[NeMo I 2024-08-26 03:46:41 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:46:41 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 19.81it/s]

[NeMo I 2024-08-26 03:46:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:46:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:46:41 collections:302] Dataset loaded with 24 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:46:41 collections:304] # 24 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 12.36it/s]

[NeMo I 2024-08-26 03:46:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:46:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:46:41 collections:302] Dataset loaded with 31 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:46:41 collections:304] # 31 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.39it/s]

[NeMo I 2024-08-26 03:46:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:46:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:46:41 collections:302] Dataset loaded with 38 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:46:41 collections:304] # 38 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.70it/s]

[NeMo I 2024-08-26 03:46:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:46:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:46:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:46:41 collections:302] Dataset loaded with 50 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:46:41 collections:304] # 50 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.88it/s]

[NeMo I 2024-08-26 03:46:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:46:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:46:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:46:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:46:42 collections:302] Dataset loaded with 81 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:46:42 collections:304] # 81 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 11.99it/s]


[NeMo I 2024-08-26 03:46:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  9.14it/s]

[NeMo I 2024-08-26 03:46:42 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:46:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:46:42 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:46:42 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:46:42 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:46:42 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:46:42 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:46:42 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 25.34it/s]

[NeMo I 2024-08-26 03:46:42 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:46:42 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:46:42 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:46:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:46:42 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:46:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:46:42 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:46:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:46:42 msdd_models:1431]   
    
archivo 5fab44f212e3402d99638a5dc9d97701_20230610t14_26_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:47:25 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:47:25 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:47:25 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:47:25 features:289] PADDING: 16
[NeMo I 2024-08-26 03:47:25 features:289] PADDING: 16
[NeMo I 2024-08-26 03:47:26 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:47:26 features:289] PADDING: 16
[NeMo I 2024-08-26 03:47:26 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:47:26 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:47:26 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:47:26 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:47:27 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:47:27 features:289] PADDING: 16
[NeMo I 2024-08-26 03:47:27 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:47:27 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:47:27 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:47:27 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:47:27 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:47:27 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 55.81it/s]

[NeMo I 2024-08-26 03:47:27 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:47:27 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:47:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:47:27 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:47:27 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.41it/s]

[NeMo I 2024-08-26 03:47:27 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:47:28 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 17.75it/s]

[NeMo I 2024-08-26 03:47:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:47:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:47:28 collections:302] Dataset loaded with 48 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:47:28 collections:304] # 48 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.72it/s]

[NeMo I 2024-08-26 03:47:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:47:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:47:28 collections:302] Dataset loaded with 61 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:47:28 collections:304] # 61 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.57it/s]

[NeMo I 2024-08-26 03:47:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:47:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:47:28 collections:302] Dataset loaded with 74 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:47:28 collections:304] # 74 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.41it/s]

[NeMo I 2024-08-26 03:47:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:47:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:47:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:47:28 collections:302] Dataset loaded with 101 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:47:28 collections:304] # 101 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.71it/s]

[NeMo I 2024-08-26 03:47:29 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:47:29 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:47:29 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:47:29 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:47:29 collections:302] Dataset loaded with 154 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:47:29 collections:304] # 154 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 10.73it/s]

[NeMo I 2024-08-26 03:47:29 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  5.03it/s]

[NeMo I 2024-08-26 03:47:29 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:47:29 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:47:29 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:47:29 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:47:29 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:47:29 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:47:29 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:47:29 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 49.75it/s]

[NeMo I 2024-08-26 03:47:29 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:47:29 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:47:29 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:47:29 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:47:29 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:47:29 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:47:29 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:47:29 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:47:29 msdd_models:1431]   
    
archivo e091c45596054f6ea5fac6b7388e115c_20230610t15_13_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:48:12 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:48:12 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:48:12 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:48:12 features:289] PADDING: 16
[NeMo I 2024-08-26 03:48:12 features:289] PADDING: 16
[NeMo I 2024-08-26 03:48:13 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:48:13 features:289] PADDING: 16
[NeMo I 2024-08-26 03:48:14 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:48:14 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:48:14 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:48:14 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:48:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:48:15 features:289] PADDING: 16
[NeMo I 2024-08-26 03:48:15 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:48:15 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:48:15 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:48:15 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:48:15 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:48:15 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 55.21it/s]

[NeMo I 2024-08-26 03:48:15 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:48:15 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:48:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:48:15 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:48:15 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.24it/s]

[NeMo I 2024-08-26 03:48:15 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:48:16 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 12.14it/s]

[NeMo I 2024-08-26 03:48:16 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:48:16 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:48:16 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:48:16 collections:302] Dataset loaded with 50 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:48:16 collections:304] # 50 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.83it/s]

[NeMo I 2024-08-26 03:48:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:48:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:48:17 collections:302] Dataset loaded with 59 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:48:17 collections:304] # 59 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.08it/s]

[NeMo I 2024-08-26 03:48:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:48:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:48:17 collections:302] Dataset loaded with 74 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:48:17 collections:304] # 74 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.43it/s]

[NeMo I 2024-08-26 03:48:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:48:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:48:17 collections:302] Dataset loaded with 99 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:48:17 collections:304] # 99 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.03it/s]

[NeMo I 2024-08-26 03:48:17 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:48:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:48:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:48:17 collections:302] Dataset loaded with 152 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:48:17 collections:304] # 152 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 10.48it/s]

[NeMo I 2024-08-26 03:48:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  5.05it/s]

[NeMo I 2024-08-26 03:48:18 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:48:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:48:18 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:48:18 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:48:18 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:48:18 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:48:18 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:48:18 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 32.59it/s]

[NeMo I 2024-08-26 03:48:18 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:48:18 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:48:18 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:48:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:48:18 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:48:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:48:18 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:48:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:48:18 msdd_models:1431]   
    
archivo d9553adbba9c45bead6e1a2bded61ad8_20230610t16_33_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:49:10 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:49:10 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:49:10 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:49:10 features:289] PADDING: 16
[NeMo I 2024-08-26 03:49:10 features:289] PADDING: 16
[NeMo I 2024-08-26 03:49:11 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:49:12 features:289] PADDING: 16
[NeMo I 2024-08-26 03:49:13 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:49:13 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:49:13 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:49:13 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:49:13 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:49:13 features:289] PADDING: 16
[NeMo I 2024-08-26 03:49:13 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:49:13 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:49:13 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:49:13 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:49:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:49:13 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 16.98it/s]

[NeMo I 2024-08-26 03:49:13 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:49:13 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:49:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:49:13 collections:302] Dataset loaded with 5 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 03:49:13 collections:304] # 5 files loaded accounting to # 1 labels



vad: 100%|██████████| 5/5 [00:01<00:00,  3.42it/s]

[NeMo I 2024-08-26 03:49:15 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:49:17 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  5.76it/s]


[NeMo I 2024-08-26 03:49:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:49:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:49:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:49:17 collections:302] Dataset loaded with 240 items, total duration of  0.09 hours.
[NeMo I 2024-08-26 03:49:17 collections:304] # 240 files loaded accounting to # 1 labels


[1/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  6.21it/s]

[NeMo I 2024-08-26 03:49:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:49:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:49:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:49:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:49:18 collections:302] Dataset loaded with 282 items, total duration of  0.09 hours.
[NeMo I 2024-08-26 03:49:18 collections:304] # 282 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.77it/s]


[NeMo I 2024-08-26 03:49:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:49:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:49:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:49:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:49:18 collections:302] Dataset loaded with 359 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:49:18 collections:304] # 359 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  7.94it/s]

[NeMo I 2024-08-26 03:49:19 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:49:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:49:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:49:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:49:19 collections:302] Dataset loaded with 485 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:49:19 collections:304] # 485 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  8.06it/s]

[NeMo I 2024-08-26 03:49:20 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:49:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:49:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:49:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:49:20 collections:302] Dataset loaded with 738 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:49:20 collections:304] # 738 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 12/12 [00:01<00:00, 10.81it/s]

[NeMo I 2024-08-26 03:49:22 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.83it/s]

[NeMo I 2024-08-26 03:49:22 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:49:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:49:22 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:49:22 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:49:22 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:49:22 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:49:22 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:49:22 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 23.31it/s]

[NeMo I 2024-08-26 03:49:22 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:49:22 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:49:22 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:49:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:49:22 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:49:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:49:22 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:49:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:49:23 msdd_models:1431]   
    
file 934de5177dd94afd8f8a198b2226c2e5_20230610t15_00_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:50:05 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:50:05 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:50:05 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:50:05 features:289] PADDING: 16
[NeMo I 2024-08-26 03:50:05 features:289] PADDING: 16
[NeMo I 2024-08-26 03:50:05 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:50:06 features:289] PADDING: 16
[NeMo I 2024-08-26 03:50:06 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:50:06 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:50:06 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:50:06 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:50:06 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:50:06 features:289] PADDING: 16
[NeMo I 2024-08-26 03:50:06 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:50:06 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:50:06 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:50:06 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:50:06 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:50:06 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 66.30it/s]

[NeMo I 2024-08-26 03:50:06 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:50:06 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:50:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:50:06 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:50:06 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.76it/s]

[NeMo I 2024-08-26 03:50:07 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:50:07 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 16.53it/s]

[NeMo I 2024-08-26 03:50:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:50:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:50:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:50:07 collections:302] Dataset loaded with 47 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:50:07 collections:304] # 47 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.49it/s]

[NeMo I 2024-08-26 03:50:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:50:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:50:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:50:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:50:07 collections:302] Dataset loaded with 58 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:50:07 collections:304] # 58 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.53it/s]

[NeMo I 2024-08-26 03:50:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:50:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:50:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:50:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:50:08 collections:302] Dataset loaded with 72 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:50:08 collections:304] # 72 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 10.00it/s]

[NeMo I 2024-08-26 03:50:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:50:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:50:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:50:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:50:08 collections:302] Dataset loaded with 97 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:50:08 collections:304] # 97 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.02it/s]

[NeMo I 2024-08-26 03:50:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:50:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:50:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:50:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:50:08 collections:302] Dataset loaded with 148 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:50:08 collections:304] # 148 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.99it/s]

[NeMo I 2024-08-26 03:50:09 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.95it/s]

[NeMo I 2024-08-26 03:50:09 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:50:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:50:09 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:50:09 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:50:09 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:50:09 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:50:09 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:50:09 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 53.86it/s]

[NeMo I 2024-08-26 03:50:09 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:50:09 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:50:09 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:50:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:50:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:50:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:50:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:50:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:50:09 msdd_models:1431]   
    
file 45adf080f3e24793900699ada4592d63_20230610t15_47_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:51:03 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:51:03 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:51:03 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:51:03 features:289] PADDING: 16
[NeMo I 2024-08-26 03:51:04 features:289] PADDING: 16
[NeMo I 2024-08-26 03:51:04 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:51:04 features:289] PADDING: 16
[NeMo I 2024-08-26 03:51:05 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:51:05 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:51:05 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:51:05 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:51:06 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:51:06 features:289] PADDING: 16
[NeMo I 2024-08-26 03:51:06 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:51:06 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:51:06 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:51:06 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:51:06 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:51:06 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 14.05it/s]

[NeMo I 2024-08-26 03:51:06 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:51:06 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:51:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:51:06 collections:302] Dataset loaded with 6 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:51:06 collections:304] # 6 files loaded accounting to # 1 labels



vad: 100%|██████████| 6/6 [00:01<00:00,  3.51it/s]

[NeMo I 2024-08-26 03:51:08 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:51:11 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  4.80it/s]

[NeMo I 2024-08-26 03:51:11 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:51:11 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:51:11 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:51:11 collections:302] Dataset loaded with 264 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:51:11 collections:304] # 264 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  5.99it/s]


[NeMo I 2024-08-26 03:51:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:51:12 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:51:12 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:51:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:51:12 collections:302] Dataset loaded with 317 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 03:51:12 collections:304] # 317 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.01it/s]

[NeMo I 2024-08-26 03:51:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:51:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:51:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:51:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:51:13 collections:302] Dataset loaded with 402 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:51:13 collections:304] # 402 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00,  8.82it/s]

[NeMo I 2024-08-26 03:51:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:51:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:51:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:51:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:51:14 collections:302] Dataset loaded with 539 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:51:14 collections:304] # 539 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00,  9.71it/s]

[NeMo I 2024-08-26 03:51:15 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:51:15 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:51:15 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:51:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:51:15 collections:302] Dataset loaded with 825 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 03:51:15 collections:304] # 825 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 13/13 [00:01<00:00, 10.60it/s]

[NeMo I 2024-08-26 03:51:16 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.55it/s]

[NeMo I 2024-08-26 03:51:17 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:51:17 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:51:17 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:51:17 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:51:17 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:51:17 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:51:17 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:51:17 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 25.57it/s]

[NeMo I 2024-08-26 03:51:17 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:51:17 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:51:17 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:51:17 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:51:17 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:51:17 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:51:17 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:51:17 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:51:17 msdd_models:1431]   
    
file eda880c7ca6f4e40835aa372d01627c0_20230610t14_59_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:52:01 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:52:01 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:52:01 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:52:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:52:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:52:03 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:52:03 features:289] PADDING: 16
[NeMo I 2024-08-26 03:52:04 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:52:04 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:52:04 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:52:04 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:52:04 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:52:04 features:289] PADDING: 16
[NeMo I 2024-08-26 03:52:04 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:52:04 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:52:04 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:52:04 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:52:04 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:52:04 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 30.52it/s]

[NeMo I 2024-08-26 03:52:04 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:52:04 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:52:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:52:04 collections:302] Dataset loaded with 2 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:52:04 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  3.28it/s]

[NeMo I 2024-08-26 03:52:05 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:52:06 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  7.96it/s]

[NeMo I 2024-08-26 03:52:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:52:06 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:52:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:52:06 collections:302] Dataset loaded with 87 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:52:06 collections:304] # 87 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.45it/s]

[NeMo I 2024-08-26 03:52:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:52:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:52:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:52:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:52:07 collections:302] Dataset loaded with 106 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:52:07 collections:304] # 106 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.37it/s]

[NeMo I 2024-08-26 03:52:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:52:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:52:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:52:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:52:07 collections:302] Dataset loaded with 132 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:52:07 collections:304] # 132 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.34it/s]

[NeMo I 2024-08-26 03:52:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:52:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:52:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:52:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:52:07 collections:302] Dataset loaded with 179 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:52:07 collections:304] # 179 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  6.65it/s]

[NeMo I 2024-08-26 03:52:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:52:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:52:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:52:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:52:08 collections:302] Dataset loaded with 273 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:52:08 collections:304] # 273 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.15it/s]


[NeMo I 2024-08-26 03:52:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  3.00it/s]

[NeMo I 2024-08-26 03:52:09 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:52:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:52:09 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:52:09 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:52:09 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:52:09 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:52:09 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:52:09 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 47.74it/s]

[NeMo I 2024-08-26 03:52:09 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:52:09 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:52:09 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:52:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:52:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:52:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:52:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:52:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:52:09 msdd_models:1431]   
    
file c7e11bdfad064aae944c4ce1a11dc2d7_20230610t14_02_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:53:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:53:00 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:53:00 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:53:00 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:01 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:53:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:02 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:53:02 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:53:02 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:53:02 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:53:03 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:53:03 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:03 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:53:03 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:53:03 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:53:03 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:53:03 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:53:03 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 13.29it/s]

[NeMo I 2024-08-26 03:53:03 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:53:03 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:53:03 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:03 collections:302] Dataset loaded with 6 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:53:03 collections:304] # 6 files loaded accounting to # 1 labels



vad: 100%|██████████| 6/6 [00:01<00:00,  3.46it/s]

[NeMo I 2024-08-26 03:53:05 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:53:08 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  4.53it/s]

[NeMo I 2024-08-26 03:53:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:53:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:53:08 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:08 collections:302] Dataset loaded with 218 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:53:08 collections:304] # 218 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  6.77it/s]

[NeMo I 2024-08-26 03:53:09 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:53:09 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:53:09 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:53:09 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:09 collections:302] Dataset loaded with 259 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:53:09 collections:304] # 259 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.83it/s]

[NeMo I 2024-08-26 03:53:10 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:53:10 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:53:10 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:53:10 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:10 collections:302] Dataset loaded with 316 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:53:10 collections:304] # 316 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.01it/s]

[NeMo I 2024-08-26 03:53:10 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:53:10 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:53:10 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:53:10 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:10 collections:302] Dataset loaded with 419 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:53:10 collections:304] # 419 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00,  9.60it/s]

[NeMo I 2024-08-26 03:53:11 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:53:11 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:53:11 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:53:11 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:11 collections:302] Dataset loaded with 627 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:53:11 collections:304] # 627 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 10.42it/s]

[NeMo I 2024-08-26 03:53:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.41it/s]

[NeMo I 2024-08-26 03:53:13 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:53:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:53:13 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:53:13 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:53:13 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:53:13 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:53:13 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:53:13 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 12.74it/s]

[NeMo I 2024-08-26 03:53:13 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:53:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:53:13 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:53:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:53:13 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:53:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:53:13 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:53:13 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:53:13 msdd_models:1431]   
    
file 6eb352d2758049f09c9619ac9df008e5_20230610t14_29_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:53:57 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:53:57 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:53:57 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:53:57 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:57 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:58 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:53:58 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:58 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:53:58 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:53:58 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:53:58 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:53:58 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:53:58 features:289] PADDING: 16
[NeMo I 2024-08-26 03:53:59 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:53:59 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:53:59 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:53:59 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:53:59 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:53:59 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 29.49it/s]

[NeMo I 2024-08-26 03:53:59 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:53:59 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:53:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:53:59 collections:302] Dataset loaded with 3 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:53:59 collections:304] # 3 files loaded accounting to # 1 labels


vad: 100%|██████████| 3/3 [00:00<00:00,  4.14it/s]

[NeMo I 2024-08-26 03:53:59 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:54:01 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  6.91it/s]

[NeMo I 2024-08-26 03:54:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:54:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:01 collections:302] Dataset loaded with 84 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:54:01 collections:304] # 84 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.11it/s]

[NeMo I 2024-08-26 03:54:02 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:02 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:54:02 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:02 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:02 collections:302] Dataset loaded with 97 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:54:02 collections:304] # 97 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.16it/s]

[NeMo I 2024-08-26 03:54:02 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:02 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:54:02 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:02 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:02 collections:302] Dataset loaded with 127 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:54:02 collections:304] # 127 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.05it/s]

[NeMo I 2024-08-26 03:54:02 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:02 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:54:02 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:02 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:02 collections:302] Dataset loaded with 171 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:54:02 collections:304] # 171 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.02it/s]

[NeMo I 2024-08-26 03:54:03 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:03 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:54:03 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:03 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:03 collections:302] Dataset loaded with 261 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:54:03 collections:304] # 261 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.66it/s]


[NeMo I 2024-08-26 03:54:03 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.65it/s]

[NeMo I 2024-08-26 03:54:04 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:54:04 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:04 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:54:04 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:54:04 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:54:04 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:54:04 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:54:04 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 36.77it/s]

[NeMo I 2024-08-26 03:54:04 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:54:04 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:54:04 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:54:04 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:04 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:54:04 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:04 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:54:04 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:04 msdd_models:1431]   
    
archivo 6eb42d94dfb3406faf290fe85feffbfe_20230610t16_55_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:54:46 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:54:46 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:54:46 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:54:46 features:289] PADDING: 16
[NeMo I 2024-08-26 03:54:46 features:289] PADDING: 16
[NeMo I 2024-08-26 03:54:47 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:54:47 features:289] PADDING: 16
[NeMo I 2024-08-26 03:54:48 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:54:48 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:54:48 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:54:48 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:54:48 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:54:48 features:289] PADDING: 16
[NeMo I 2024-08-26 03:54:48 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:54:48 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:54:48 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:54:48 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:54:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:54:48 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 67.44it/s]

[NeMo I 2024-08-26 03:54:48 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:54:48 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:54:48 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:48 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:54:48 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.80it/s]

[NeMo I 2024-08-26 03:54:48 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:54:49 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 16.29it/s]

[NeMo I 2024-08-26 03:54:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:49 collections:302] Dataset loaded with 44 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:54:49 collections:304] # 44 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.58it/s]

[NeMo I 2024-08-26 03:54:49 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:49 collections:302] Dataset loaded with 56 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:54:49 collections:304] # 56 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.22it/s]

[NeMo I 2024-08-26 03:54:49 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:49 collections:302] Dataset loaded with 67 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:54:49 collections:304] # 67 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.85it/s]

[NeMo I 2024-08-26 03:54:49 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:54:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:49 collections:302] Dataset loaded with 92 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:54:49 collections:304] # 92 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.85it/s]

[NeMo I 2024-08-26 03:54:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:54:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:54:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:54:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:54:50 collections:302] Dataset loaded with 137 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:54:50 collections:304] # 137 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 11.00it/s]

[NeMo I 2024-08-26 03:54:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.20it/s]

[NeMo I 2024-08-26 03:54:50 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:54:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:50 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:54:50 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:54:50 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:54:50 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:54:50 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:54:50 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 25.07it/s]

[NeMo I 2024-08-26 03:54:50 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:54:50 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:54:50 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:54:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:50 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:54:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:50 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:54:50 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:54:50 msdd_models:1431]   
    
archivo d15a221c5dce4a34ac2d8bda6ed71ebd_20230610t13_57_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:55:35 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:55:35 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:55:35 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:55:35 features:289] PADDING: 16
[NeMo I 2024-08-26 03:55:35 features:289] PADDING: 16
[NeMo I 2024-08-26 03:55:36 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:55:36 features:289] PADDING: 16
[NeMo I 2024-08-26 03:55:37 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:55:37 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:55:37 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:55:37 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:55:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:55:37 features:289] PADDING: 16
[NeMo I 2024-08-26 03:55:37 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:55:37 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:55:37 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:55:37 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:55:37 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:55:37 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 40.13it/s]

[NeMo I 2024-08-26 03:55:37 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:55:37 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:55:37 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:55:37 collections:302] Dataset loaded with 2 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:55:37 collections:304] # 2 files loaded accounting to # 1 labels


vad: 100%|██████████| 2/2 [00:01<00:00,  1.74it/s]

[NeMo I 2024-08-26 03:55:38 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:55:39 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  8.73it/s]

[NeMo I 2024-08-26 03:55:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:55:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:55:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:55:39 collections:302] Dataset loaded with 76 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:55:39 collections:304] # 76 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.15it/s]

[NeMo I 2024-08-26 03:55:39 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:55:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:55:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:55:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:55:39 collections:302] Dataset loaded with 94 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:55:39 collections:304] # 94 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.55it/s]

[NeMo I 2024-08-26 03:55:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:55:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:55:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:55:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:55:40 collections:302] Dataset loaded with 117 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:55:40 collections:304] # 117 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.10it/s]

[NeMo I 2024-08-26 03:55:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:55:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:55:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:55:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:55:40 collections:302] Dataset loaded with 157 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:55:40 collections:304] # 157 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.69it/s]

[NeMo I 2024-08-26 03:55:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:55:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:55:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:55:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:55:41 collections:302] Dataset loaded with 246 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:55:41 collections:304] # 246 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  7.85it/s]

[NeMo I 2024-08-26 03:55:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.70it/s]

[NeMo I 2024-08-26 03:55:42 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:55:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:55:42 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:55:42 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:55:42 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:55:42 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:55:42 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:55:42 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 40.49it/s]

[NeMo I 2024-08-26 03:55:42 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:55:42 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:55:42 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:55:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:55:42 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:55:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:55:42 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:55:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:55:42 msdd_models:1431]   
    
File d554a73b18c84f35866c98565d703d21_20230610t16_41_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:56:26 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:56:26 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:56:26 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:56:26 features:289] PADDING: 16
[NeMo I 2024-08-26 03:56:27 features:289] PADDING: 16
[NeMo I 2024-08-26 03:56:27 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:56:27 features:289] PADDING: 16
[NeMo I 2024-08-26 03:56:28 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:56:28 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:56:28 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:56:28 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:56:28 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:56:28 features:289] PADDING: 16
[NeMo I 2024-08-26 03:56:28 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:56:28 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:56:28 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:56:28 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:56:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:56:28 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 31.11it/s]

[NeMo I 2024-08-26 03:56:28 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:56:28 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:56:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:56:28 collections:302] Dataset loaded with 3 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:56:28 collections:304] # 3 files loaded accounting to # 1 labels


vad: 100%|██████████| 3/3 [00:00<00:00,  4.68it/s]

[NeMo I 2024-08-26 03:56:29 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:56:30 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  8.16it/s]

[NeMo I 2024-08-26 03:56:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:56:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:56:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:56:30 collections:302] Dataset loaded with 92 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:56:30 collections:304] # 92 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.34it/s]

[NeMo I 2024-08-26 03:56:30 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:56:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:56:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:56:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:56:30 collections:302] Dataset loaded with 111 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:56:30 collections:304] # 111 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.72it/s]

[NeMo I 2024-08-26 03:56:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:56:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:56:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:56:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:56:31 collections:302] Dataset loaded with 140 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:56:31 collections:304] # 140 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.95it/s]

[NeMo I 2024-08-26 03:56:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:56:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:56:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:56:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:56:31 collections:302] Dataset loaded with 186 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:56:31 collections:304] # 186 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.40it/s]

[NeMo I 2024-08-26 03:56:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:56:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:56:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:56:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:56:31 collections:302] Dataset loaded with 287 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 03:56:31 collections:304] # 287 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 11.40it/s]

[NeMo I 2024-08-26 03:56:32 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.70it/s]

[NeMo I 2024-08-26 03:56:32 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:56:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:56:32 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:56:32 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:56:32 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:56:32 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:56:32 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:56:32 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 44.70it/s]

[NeMo I 2024-08-26 03:56:32 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:56:32 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:56:32 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:56:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:56:32 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:56:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:56:32 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:56:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:56:32 msdd_models:1431]   
    
File 463aea85483f442fa152796bbe6412f0_20230610t13_52_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:57:14 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:57:14 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:57:14 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:57:14 features:289] PADDING: 16
[NeMo I 2024-08-26 03:57:14 features:289] PADDING: 16
[NeMo I 2024-08-26 03:57:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:57:15 features:289] PADDING: 16
[NeMo I 2024-08-26 03:57:16 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:57:16 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:57:16 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:57:16 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:57:17 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:57:17 features:289] PADDING: 16
[NeMo I 2024-08-26 03:57:17 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:57:17 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:57:17 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:57:17 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:57:17 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:57:17 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 51.96it/s]

[NeMo I 2024-08-26 03:57:17 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:57:17 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:57:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:57:17 collections:302] Dataset loaded with 2 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 03:57:17 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.43it/s]

[NeMo I 2024-08-26 03:57:17 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:57:18 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 10.85it/s]

[NeMo I 2024-08-26 03:57:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:57:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:57:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:57:18 collections:302] Dataset loaded with 47 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:57:18 collections:304] # 47 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.25it/s]

[NeMo I 2024-08-26 03:57:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:57:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:57:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:57:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:57:18 collections:302] Dataset loaded with 58 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:57:18 collections:304] # 58 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.55it/s]

[NeMo I 2024-08-26 03:57:19 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:57:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:57:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:57:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:57:19 collections:302] Dataset loaded with 72 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:57:19 collections:304] # 72 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.46it/s]

[NeMo I 2024-08-26 03:57:19 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:57:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:57:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:57:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:57:19 collections:302] Dataset loaded with 97 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:57:19 collections:304] # 97 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.83it/s]

[NeMo I 2024-08-26 03:57:19 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:57:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:57:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:57:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:57:19 collections:302] Dataset loaded with 148 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:57:19 collections:304] # 148 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.58it/s]

[NeMo I 2024-08-26 03:57:20 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.71it/s]

[NeMo I 2024-08-26 03:57:20 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:57:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:57:20 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:57:20 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:57:20 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:57:20 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:57:20 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:57:20 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 41.08it/s]

[NeMo I 2024-08-26 03:57:20 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:57:20 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:57:20 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:57:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:57:20 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:57:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:57:20 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:57:20 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:57:20 msdd_models:1431]   
    
File ed011e99f19e4257b0cc08f003decd8c_20230610t16_21_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:58:03 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:58:03 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:58:03 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:58:03 features:289] PADDING: 16
[NeMo I 2024-08-26 03:58:03 features:289] PADDING: 16
[NeMo I 2024-08-26 03:58:04 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:58:04 features:289] PADDING: 16
[NeMo I 2024-08-26 03:58:04 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:58:04 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:58:04 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:58:04 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:58:04 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:58:04 features:289] PADDING: 16
[NeMo I 2024-08-26 03:58:04 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:58:04 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:58:04 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:58:04 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:58:04 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:58:04 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 37.53it/s]

[NeMo I 2024-08-26 03:58:05 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:58:05 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:58:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:58:05 collections:302] Dataset loaded with 2 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:58:05 collections:304] # 2 files loaded accounting to # 1 labels


vad: 100%|██████████| 2/2 [00:00<00:00,  3.76it/s]

[NeMo I 2024-08-26 03:58:05 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:58:06 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 10.70it/s]

[NeMo I 2024-08-26 03:58:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:58:06 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:58:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:58:06 collections:302] Dataset loaded with 59 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:58:06 collections:304] # 59 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.94it/s]

[NeMo I 2024-08-26 03:58:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:58:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:58:06 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:58:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:58:06 collections:302] Dataset loaded with 73 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:58:06 collections:304] # 73 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 10.09it/s]

[NeMo I 2024-08-26 03:58:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:58:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:58:06 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:58:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:58:07 collections:302] Dataset loaded with 89 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:58:07 collections:304] # 89 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.39it/s]

[NeMo I 2024-08-26 03:58:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:58:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:58:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:58:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:58:07 collections:302] Dataset loaded with 120 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 03:58:07 collections:304] # 120 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.72it/s]

[NeMo I 2024-08-26 03:58:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:58:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:58:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:58:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:58:07 collections:302] Dataset loaded with 184 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:58:07 collections:304] # 184 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.92it/s]

[NeMo I 2024-08-26 03:58:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  5.07it/s]

[NeMo I 2024-08-26 03:58:08 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:58:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:58:08 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:58:08 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:58:08 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:58:08 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:58:08 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:58:08 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 40.21it/s]

[NeMo I 2024-08-26 03:58:08 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:58:08 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:58:08 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:58:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:58:08 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:58:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:58:08 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:58:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:58:08 msdd_models:1431]   
    
file 29a70da7b6b4483d85fd483e814378b1_20230610t14_41_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:59:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:59:00 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:59:00 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:59:00 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:00 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:01 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:59:01 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:01 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:59:01 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:59:01 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:59:01 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:59:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:59:02 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:02 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:59:02 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:59:02 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:59:02 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:59:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:59:02 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 20.50it/s]

[NeMo I 2024-08-26 03:59:02 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 03:59:02 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:59:02 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:02 collections:302] Dataset loaded with 4 items, total duration of  0.05 hours.
[NeMo I 2024-08-26 03:59:02 collections:304] # 4 files loaded accounting to # 1 labels



vad: 100%|██████████| 4/4 [00:00<00:00,  4.07it/s]

[NeMo I 2024-08-26 03:59:03 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:59:04 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  6.14it/s]

[NeMo I 2024-08-26 03:59:05 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:59:05 clustering_diarizer:343] Extracting embeddings for Diarization

[NeMo I 2024-08-26 03:59:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:05 collections:302] Dataset loaded with 178 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:59:05 collections:304] # 178 files loaded accounting to # 1 labels


[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.87it/s]

[NeMo I 2024-08-26 03:59:05 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:05 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:59:05 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:05 collections:302] Dataset loaded with 219 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:59:05 collections:304] # 219 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  7.92it/s]

[NeMo I 2024-08-26 03:59:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:59:06 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:06 collections:302] Dataset loaded with 272 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:59:06 collections:304] # 272 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.77it/s]

[NeMo I 2024-08-26 03:59:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:06 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json

[NeMo I 2024-08-26 03:59:06 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:06 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:06 collections:302] Dataset loaded with 367 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 03:59:06 collections:304] # 367 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  9.26it/s]

[NeMo I 2024-08-26 03:59:07 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:07 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 03:59:07 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:07 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:07 collections:302] Dataset loaded with 559 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 03:59:07 collections:304] # 559 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 9/9 [00:00<00:00, 10.34it/s]


[NeMo I 2024-08-26 03:59:08 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.80it/s]

[NeMo I 2024-08-26 03:59:08 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 03:59:08 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:59:08 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 03:59:08 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 03:59:08 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 03:59:08 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 03:59:08 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 03:59:08 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 27.24it/s]

[NeMo I 2024-08-26 03:59:08 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 03:59:08 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:59:08 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 03:59:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:59:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:59:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:59:09 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 03:59:09 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 03:59:09 msdd_models:1431]   
    
file 7f77a99fac1b4dabbcce2f408995b9bf_20230610t13_04_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 03:59:53 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 03:59:53 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 03:59:53 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 03:59:54 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:54 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:55 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 03:59:55 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:56 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 03:59:56 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:59:56 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 03:59:56 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 03:59:56 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 03:59:56 features:289] PADDING: 16
[NeMo I 2024-08-26 03:59:56 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 03:59:56 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 03:59:56 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 03:59:56 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 03:59:56 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 03:59:56 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 29.06it/s]

[NeMo I 2024-08-26 03:59:56 vad_utils:107] The prepared manifest file exists. Overwriting!

[NeMo I 2024-08-26 03:59:56 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 03:59:56 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:56 collections:302] Dataset loaded with 3 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:59:56 collections:304] # 3 files loaded accounting to # 1 labels


vad: 100%|██████████| 3/3 [00:00<00:00,  4.11it/s]

[NeMo I 2024-08-26 03:59:57 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 03:59:58 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  8.04it/s]

[NeMo I 2024-08-26 03:59:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 03:59:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:58 collections:302] Dataset loaded with 85 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:59:58 collections:304] # 85 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.56it/s]

[NeMo I 2024-08-26 03:59:59 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:59 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 03:59:59 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:59 collections:302] Dataset loaded with 98 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:59:59 collections:304] # 98 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.54it/s]

[NeMo I 2024-08-26 03:59:59 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:59 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 03:59:59 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:59 collections:302] Dataset loaded with 122 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:59:59 collections:304] # 122 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.76it/s]

[NeMo I 2024-08-26 03:59:59 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 03:59:59 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 03:59:59 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 03:59:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 03:59:59 collections:302] Dataset loaded with 167 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 03:59:59 collections:304] # 167 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.83it/s]

[NeMo I 2024-08-26 04:00:00 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:00:00 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:00:00 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:00:00 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:00:00 collections:302] Dataset loaded with 252 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:00:00 collections:304] # 252 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  9.85it/s]

[NeMo I 2024-08-26 04:00:00 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.63it/s]

[NeMo I 2024-08-26 04:00:00 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:00:00 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:00:00 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:00:00 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:00:00 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:00:00 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:00:00 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:00:00 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 39.26it/s]

[NeMo I 2024-08-26 04:00:00 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:00:00 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:00:00 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:00:00 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:00:00 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:00:01 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:00:01 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:00:01 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:00:01 msdd_models:1431]   
    
file 3790e2a16369420295cd1456849101ac_20230610t16_51_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:01:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:01:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:01:02 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:01:02 features:289] PADDING: 16
[NeMo I 2024-08-26 04:01:02 features:289] PADDING: 16
[NeMo I 2024-08-26 04:01:03 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:01:03 features:289] PADDING: 16
[NeMo I 2024-08-26 04:01:04 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:01:04 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:01:04 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:01:04 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:01:05 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:01:05 features:289] PADDING: 16
[NeMo I 2024-08-26 04:01:05 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:01:05 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:01:05 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:01:05 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:01:05 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:01:05 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  9.41it/s]

[NeMo I 2024-08-26 04:01:05 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:01:05 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:01:05 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:01:05 collections:302] Dataset loaded with 8 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 04:01:05 collections:304] # 8 files loaded accounting to # 1 labels



vad: 100%|██████████| 8/8 [00:02<00:00,  3.32it/s]

[NeMo I 2024-08-26 04:01:07 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:01:10 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.43it/s]

[NeMo I 2024-08-26 04:01:11 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:01:11 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:01:11 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:01:11 collections:302] Dataset loaded with 365 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 04:01:11 collections:304] # 365 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 6/6 [00:01<00:00,  5.97it/s]

[NeMo I 2024-08-26 04:01:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:01:12 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:01:12 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:01:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:01:12 collections:302] Dataset loaded with 449 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 04:01:12 collections:304] # 449 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 8/8 [00:01<00:00,  7.92it/s]

[NeMo I 2024-08-26 04:01:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:01:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:01:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:01:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:01:13 collections:302] Dataset loaded with 565 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 04:01:13 collections:304] # 565 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  7.91it/s]

[NeMo I 2024-08-26 04:01:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:01:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:01:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:01:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:01:14 collections:302] Dataset loaded with 761 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 04:01:14 collections:304] # 761 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 12/12 [00:01<00:00,  8.92it/s]

[NeMo I 2024-08-26 04:01:16 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:01:16 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:01:16 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:01:16 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:01:16 collections:302] Dataset loaded with 1161 items, total duration of  0.16 hours.
[NeMo I 2024-08-26 04:01:16 collections:304] # 1161 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 19/19 [00:01<00:00, 10.57it/s]

[NeMo I 2024-08-26 04:01:18 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.90it/s]

[NeMo I 2024-08-26 04:01:18 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:01:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:01:18 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:01:18 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:01:18 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:01:18 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:01:18 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:01:18 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 14.97it/s]

[NeMo I 2024-08-26 04:01:18 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:01:18 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:01:18 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:01:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:01:19 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:01:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:01:19 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:01:19 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:01:19 msdd_models:1431]   
    
file 2ee89ca6b151452d80fafc0031248b5c_20230610t15_18_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:02:14 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:02:14 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:02:14 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:02:14 features:289] PADDING: 16
[NeMo I 2024-08-26 04:02:14 features:289] PADDING: 16
[NeMo I 2024-08-26 04:02:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:02:15 features:289] PADDING: 16
[NeMo I 2024-08-26 04:02:16 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:02:16 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:02:16 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:02:16 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:02:17 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:02:17 features:289] PADDING: 16
[NeMo I 2024-08-26 04:02:17 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:02:17 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:02:17 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:02:17 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:02:17 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:02:17 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 12.65it/s]

[NeMo I 2024-08-26 04:02:17 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:02:17 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:02:17 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:02:17 collections:302] Dataset loaded with 7 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:02:17 collections:304] # 7 files loaded accounting to # 1 labels



vad: 100%|██████████| 7/7 [00:02<00:00,  3.42it/s]

[NeMo I 2024-08-26 04:02:19 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:02:22 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.53it/s]

[NeMo I 2024-08-26 04:02:22 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:02:22 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:02:22 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:02:22 collections:302] Dataset loaded with 286 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 04:02:22 collections:304] # 286 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  5.61it/s]

[NeMo I 2024-08-26 04:02:23 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:02:23 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:02:23 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:02:23 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:02:23 collections:302] Dataset loaded with 346 items, total duration of  0.11 hours.
[NeMo I 2024-08-26 04:02:23 collections:304] # 346 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  7.70it/s]

[NeMo I 2024-08-26 04:02:24 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:02:24 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:02:24 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:02:24 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:02:24 collections:302] Dataset loaded with 435 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 04:02:24 collections:304] # 435 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 7/7 [00:00<00:00,  7.95it/s]

[NeMo I 2024-08-26 04:02:25 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:02:25 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:02:25 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:02:25 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:02:25 collections:302] Dataset loaded with 595 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 04:02:25 collections:304] # 595 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 10/10 [00:01<00:00,  9.79it/s]

[NeMo I 2024-08-26 04:02:26 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:02:26 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:02:26 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:02:26 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:02:26 collections:302] Dataset loaded with 910 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 04:02:26 collections:304] # 910 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00, 11.10it/s]

[NeMo I 2024-08-26 04:02:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.48it/s]

[NeMo I 2024-08-26 04:02:28 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:02:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:02:28 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:02:28 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:02:28 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:02:28 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:02:28 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:02:28 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 27.11it/s]

[NeMo I 2024-08-26 04:02:28 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:02:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:02:28 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:02:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:02:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:02:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:02:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:02:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:02:28 msdd_models:1431]   
    
file f189dbfe6981414087b0500db982a3e7_20230610t13_45_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:03:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:03:37 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:03:37 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:03:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:03:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:03:38 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:03:38 features:289] PADDING: 16
[NeMo I 2024-08-26 04:03:39 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:03:39 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:03:39 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:03:39 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:03:40 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:03:40 features:289] PADDING: 16
[NeMo I 2024-08-26 04:03:41 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:03:41 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:03:41 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:03:41 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:03:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:03:41 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  6.25it/s]

[NeMo I 2024-08-26 04:03:41 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:03:41 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:03:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:03:41 collections:302] Dataset loaded with 10 items, total duration of  0.13 hours.
[NeMo I 2024-08-26 04:03:41 collections:304] # 10 files loaded accounting to # 1 labels


vad: 100%|██████████| 10/10 [00:02<00:00,  3.51it/s]

[NeMo I 2024-08-26 04:03:44 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:03:48 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.55it/s]

[NeMo I 2024-08-26 04:03:48 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:03:48 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:03:48 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:03:48 collections:302] Dataset loaded with 451 items, total duration of  0.16 hours.
[NeMo I 2024-08-26 04:03:48 collections:304] # 451 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 8/8 [00:01<00:00,  6.50it/s]

[NeMo I 2024-08-26 04:03:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:03:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:03:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:03:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:03:50 collections:302] Dataset loaded with 539 items, total duration of  0.17 hours.
[NeMo I 2024-08-26 04:03:50 collections:304] # 539 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  7.56it/s]

[NeMo I 2024-08-26 04:03:51 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:03:51 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:03:51 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:03:51 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:03:51 collections:302] Dataset loaded with 680 items, total duration of  0.18 hours.
[NeMo I 2024-08-26 04:03:51 collections:304] # 680 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  8.09it/s]

[NeMo I 2024-08-26 04:03:52 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:03:52 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:03:52 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:03:52 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:03:52 collections:302] Dataset loaded with 924 items, total duration of  0.18 hours.
[NeMo I 2024-08-26 04:03:52 collections:304] # 924 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 15/15 [00:01<00:00,  8.70it/s]

[NeMo I 2024-08-26 04:03:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:03:54 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:03:54 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:03:54 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:03:54 collections:302] Dataset loaded with 1415 items, total duration of  0.19 hours.
[NeMo I 2024-08-26 04:03:54 collections:304] # 1415 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 23/23 [00:02<00:00,  8.02it/s]


[NeMo I 2024-08-26 04:03:57 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.21it/s]

[NeMo I 2024-08-26 04:03:58 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:03:58 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:03:58 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:03:58 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:03:58 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:03:58 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:03:58 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:03:58 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00,  5.00it/s]

[NeMo I 2024-08-26 04:03:59 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:03:59 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:03:59 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:03:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:03:59 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:03:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:03:59 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:03:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:03:59 msdd_models:1431]   
    
archivo f1e80f376a024fac8912df9b9df04df0_20230610t13_44_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:04:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:04:38 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:04:38 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:04:38 features:289] PADDING: 16
[NeMo I 2024-08-26 04:04:39 features:289] PADDING: 16
[NeMo I 2024-08-26 04:04:40 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:04:40 features:289] PADDING: 16
[NeMo I 2024-08-26 04:04:41 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:04:41 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:04:41 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:04:41 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:04:41 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:04:41 features:289] PADDING: 16
[NeMo I 2024-08-26 04:04:41 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:04:41 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:04:41 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:04:41 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:04:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:04:41 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 75.06it/s]

[NeMo I 2024-08-26 04:04:41 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:04:41 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:04:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:04:41 collections:302] Dataset loaded with 1 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:04:41 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  6.64it/s]


[NeMo I 2024-08-26 04:04:41 clustering_diarizer:250] Generating predictions with overlapping input segments


[NeMo I 2024-08-26 04:04:42 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 55.49it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:04:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:302] Dataset loaded with 7 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:304] # 7 files loaded accounting to # 1 labels


[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 19.04it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:04:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:302] Dataset loaded with 8 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:304] # 8 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 27.50it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:04:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:302] Dataset loaded with 10 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:304] # 10 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 25.21it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:04:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:302] Dataset loaded with 14 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:304] # 14 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 19.23it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:04:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:04:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:302] Dataset loaded with 22 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:04:42 collections:304] # 22 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 14.68it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  5.26it/s]

[NeMo I 2024-08-26 04:04:42 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:04:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:04:42 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:04:42 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:04:42 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:04:42 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:04:42 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:04:42 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 54.57it/s]

[NeMo I 2024-08-26 04:04:42 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:04:42 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:04:42 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:04:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:04:42 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:04:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:04:42 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:04:42 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:04:42 msdd_models:1431]   
    
archivo 5d6b0bcd9801457bb62733de21e565f2_20230610t15_44_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:05:24 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:05:24 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:05:24 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:05:24 features:289] PADDING: 16
[NeMo I 2024-08-26 04:05:24 features:289] PADDING: 16
[NeMo I 2024-08-26 04:05:25 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:05:25 features:289] PADDING: 16
[NeMo I 2024-08-26 04:05:25 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:05:25 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:05:25 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:05:25 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:05:25 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:05:25 features:289] PADDING: 16
[NeMo I 2024-08-26 04:05:25 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:05:26 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:05:26 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:05:26 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:05:26 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:05:26 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 47.42it/s]


[NeMo I 2024-08-26 04:05:26 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:05:26 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:05:26 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:05:26 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:05:26 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.98it/s]

[NeMo I 2024-08-26 04:05:26 clustering_diarizer:250] Generating predictions with overlapping input segments



[NeMo I 2024-08-26 04:05:26 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 17.10it/s]

[NeMo I 2024-08-26 04:05:26 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:05:26 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:05:26 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:05:26 collections:302] Dataset loaded with 41 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:05:26 collections:304] # 41 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.04it/s]

[NeMo I 2024-08-26 04:05:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:05:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:05:27 collections:302] Dataset loaded with 50 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:05:27 collections:304] # 50 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.86it/s]

[NeMo I 2024-08-26 04:05:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:05:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:05:27 collections:302] Dataset loaded with 63 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:05:27 collections:304] # 63 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.43it/s]

[NeMo I 2024-08-26 04:05:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:05:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:05:27 collections:302] Dataset loaded with 86 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:05:27 collections:304] # 86 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.42it/s]

[NeMo I 2024-08-26 04:05:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:05:27 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:05:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:05:27 collections:302] Dataset loaded with 131 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:05:27 collections:304] # 131 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.59it/s]

[NeMo I 2024-08-26 04:05:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.36it/s]

[NeMo I 2024-08-26 04:05:28 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:05:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:05:28 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:05:28 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:05:28 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:05:28 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:05:28 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:05:28 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 48.40it/s]

[NeMo I 2024-08-26 04:05:28 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:05:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:05:28 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:05:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:05:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:05:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:05:28 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:05:28 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:05:28 msdd_models:1431]   
    
archivo 9adc604104c341d9adc92db9665fb992_20230610t15_10_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:06:11 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:06:11 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:06:11 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:06:11 features:289] PADDING: 16
[NeMo I 2024-08-26 04:06:12 features:289] PADDING: 16
[NeMo I 2024-08-26 04:06:12 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:06:12 features:289] PADDING: 16
[NeMo I 2024-08-26 04:06:13 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:06:13 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:06:13 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:06:13 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:06:13 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:06:13 features:289] PADDING: 16
[NeMo I 2024-08-26 04:06:13 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:06:13 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:06:13 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:06:13 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:06:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:06:13 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 40.14it/s]

[NeMo I 2024-08-26 04:06:13 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:06:13 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:06:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:06:13 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:06:13 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  5.30it/s]

[NeMo I 2024-08-26 04:06:14 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:06:14 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 13.41it/s]

[NeMo I 2024-08-26 04:06:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:06:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:06:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:06:14 collections:302] Dataset loaded with 53 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:06:14 collections:304] # 53 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.87it/s]


[NeMo I 2024-08-26 04:06:15 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:06:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:06:15 collections:302] Dataset loaded with 67 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:06:15 collections:304] # 67 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.86it/s]

[NeMo I 2024-08-26 04:06:15 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:06:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:06:15 collections:302] Dataset loaded with 81 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:06:15 collections:304] # 81 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.02it/s]

[NeMo I 2024-08-26 04:06:15 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:06:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:06:15 collections:302] Dataset loaded with 111 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:06:15 collections:304] # 111 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.43it/s]

[NeMo I 2024-08-26 04:06:15 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:06:15 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:06:15 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:06:15 collections:302] Dataset loaded with 168 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:06:15 collections:304] # 168 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.19it/s]


[NeMo I 2024-08-26 04:06:16 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  4.23it/s]

[NeMo I 2024-08-26 04:06:16 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:06:16 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:06:16 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:06:16 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:06:16 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:06:16 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:06:16 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:06:16 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 46.55it/s]

[NeMo I 2024-08-26 04:06:16 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:06:16 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:06:16 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:06:16 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:06:16 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:06:16 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:06:16 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:06:16 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:06:16 msdd_models:1431]   
    
archivo ad0946be1ed543acbf3ca01df5fc6b25_20230610t14_38_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:06:59 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:06:59 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:06:59 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:06:59 features:289] PADDING: 16
[NeMo I 2024-08-26 04:06:59 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:01 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:07:01 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:02 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:07:02 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:07:02 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:07:02 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:07:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:07:02 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:02 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:07:02 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:07:02 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:07:02 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:07:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:07:02 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 33.88it/s]

[NeMo I 2024-08-26 04:07:02 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:07:02 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:07:02 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:02 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:02 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.12it/s]

[NeMo I 2024-08-26 04:07:03 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:07:03 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 11.10it/s]

[NeMo I 2024-08-26 04:07:03 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:07:03 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:03 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:03 collections:302] Dataset loaded with 55 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:03 collections:304] # 55 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.39it/s]

[NeMo I 2024-08-26 04:07:03 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:03 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:07:03 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:03 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:04 collections:302] Dataset loaded with 68 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:04 collections:304] # 68 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.46it/s]

[NeMo I 2024-08-26 04:07:04 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:07:04 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:04 collections:302] Dataset loaded with 84 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:04 collections:304] # 84 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.28it/s]

[NeMo I 2024-08-26 04:07:04 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:07:04 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:04 collections:302] Dataset loaded with 113 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:04 collections:304] # 113 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.68it/s]

[NeMo I 2024-08-26 04:07:04 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:04 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:07:04 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:04 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:04 collections:302] Dataset loaded with 171 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:04 collections:304] # 171 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.98it/s]

[NeMo I 2024-08-26 04:07:05 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.95it/s]

[NeMo I 2024-08-26 04:07:05 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:07:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:05 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:07:05 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:07:05 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:07:05 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:07:05 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:07:05 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 29.67it/s]

[NeMo I 2024-08-26 04:07:05 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:07:05 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:07:05 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:07:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:05 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:07:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:05 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:07:05 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:05 msdd_models:1431]   
    
archivo de226de0f8c94183b0fc1b26c2a7a637_20230610t16_49_utc.wav listo
No language specified, language will be first be detected for each audio file (increases inference time).


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 9012, 9125, 9356, 9413, 9562, 9657, 9714, 9754, 10076, 10154, 10191, 10353, 10389, 10411, 10607, 10858, 1088

[NeMo W 2024-08-26 04:07:50 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:07:50 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:07:50 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:07:50 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:50 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:51 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:07:51 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:52 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:07:52 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:07:52 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:07:52 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:07:52 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:07:52 features:289] PADDING: 16
[NeMo I 2024-08-26 04:07:52 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:07:52 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:07:52 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:07:52 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:07:52 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:07:52 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 33.45it/s]

[NeMo I 2024-08-26 04:07:52 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:07:52 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:07:52 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:52 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:07:52 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.26it/s]

[NeMo I 2024-08-26 04:07:53 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:07:53 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 10.15it/s]

[NeMo I 2024-08-26 04:07:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:07:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:53 collections:302] Dataset loaded with 86 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:07:53 collections:304] # 86 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.01it/s]

[NeMo I 2024-08-26 04:07:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:54 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:07:54 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:54 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:54 collections:302] Dataset loaded with 102 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:07:54 collections:304] # 102 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.88it/s]

[NeMo I 2024-08-26 04:07:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:54 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:07:54 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:54 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:54 collections:302] Dataset loaded with 129 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:07:54 collections:304] # 129 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.01it/s]

[NeMo I 2024-08-26 04:07:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:54 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:07:54 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:54 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:54 collections:302] Dataset loaded with 175 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:07:54 collections:304] # 175 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.92it/s]

[NeMo I 2024-08-26 04:07:55 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:07:55 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:07:55 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:07:55 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:07:55 collections:302] Dataset loaded with 267 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:07:55 collections:304] # 267 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.52it/s]


[NeMo I 2024-08-26 04:07:55 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.51it/s]

[NeMo I 2024-08-26 04:07:56 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:07:56 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:56 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:07:56 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:07:56 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:07:56 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:07:56 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:07:56 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 39.02it/s]

[NeMo I 2024-08-26 04:07:56 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:07:56 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:07:56 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:07:56 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:56 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:07:56 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:56 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:07:56 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:07:56 msdd_models:1431]   
    
file 42df394a0d744909b694f4139d4b8c09_20230610t15_37_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:08:41 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:08:41 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:08:41 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:08:41 features:289] PADDING: 16
[NeMo I 2024-08-26 04:08:41 features:289] PADDING: 16
[NeMo I 2024-08-26 04:08:42 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:08:42 features:289] PADDING: 16
[NeMo I 2024-08-26 04:08:43 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:08:43 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:08:43 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:08:43 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:08:43 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:08:43 features:289] PADDING: 16
[NeMo I 2024-08-26 04:08:43 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:08:43 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:08:43 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:08:43 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:08:43 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:08:43 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 31.66it/s]

[NeMo I 2024-08-26 04:08:43 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:08:43 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:08:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:08:43 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:08:43 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  3.86it/s]

[NeMo I 2024-08-26 04:08:44 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:08:44 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  9.06it/s]

[NeMo I 2024-08-26 04:08:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:08:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:08:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:08:45 collections:302] Dataset loaded with 79 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:08:45 collections:304] # 79 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.06it/s]

[NeMo I 2024-08-26 04:08:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:08:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:08:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:08:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:08:45 collections:302] Dataset loaded with 101 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:08:45 collections:304] # 101 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.28it/s]

[NeMo I 2024-08-26 04:08:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:08:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:08:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:08:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:08:45 collections:302] Dataset loaded with 126 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:08:45 collections:304] # 126 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.34it/s]

[NeMo I 2024-08-26 04:08:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:08:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:08:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:08:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:08:46 collections:302] Dataset loaded with 171 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:08:46 collections:304] # 171 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.64it/s]

[NeMo I 2024-08-26 04:08:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:08:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:08:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:08:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:08:46 collections:302] Dataset loaded with 262 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:08:46 collections:304] # 262 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00, 10.57it/s]

[NeMo I 2024-08-26 04:08:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.12it/s]

[NeMo I 2024-08-26 04:08:47 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:08:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:08:47 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:08:47 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:08:47 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:08:47 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:08:47 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:08:47 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 28.33it/s]

[NeMo I 2024-08-26 04:08:47 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:08:47 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:08:47 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:08:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:08:47 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:08:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:08:47 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:08:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:08:47 msdd_models:1431]   
    
file 8273dce465a54e99a47d92da2d1af7ad_20230610t14_54_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:09:31 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:09:31 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:09:31 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:09:31 features:289] PADDING: 16
[NeMo I 2024-08-26 04:09:32 features:289] PADDING: 16
[NeMo I 2024-08-26 04:09:33 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:09:33 features:289] PADDING: 16
[NeMo I 2024-08-26 04:09:34 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:09:34 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:09:34 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:09:34 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:09:34 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:09:34 features:289] PADDING: 16
[NeMo I 2024-08-26 04:09:34 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:09:34 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:09:34 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:09:34 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:09:34 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:09:34 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 26.45it/s]

[NeMo I 2024-08-26 04:09:34 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:09:34 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:09:34 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:09:34 collections:302] Dataset loaded with 2 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:09:34 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  2.82it/s]

[NeMo I 2024-08-26 04:09:35 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:09:36 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  8.07it/s]

[NeMo I 2024-08-26 04:09:36 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:09:36 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:09:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:09:36 collections:302] Dataset loaded with 90 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:09:36 collections:304] # 90 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.69it/s]

[NeMo I 2024-08-26 04:09:37 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:09:37 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:09:37 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:09:37 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:09:37 collections:302] Dataset loaded with 107 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:09:37 collections:304] # 107 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.66it/s]

[NeMo I 2024-08-26 04:09:37 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:09:37 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:09:37 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:09:37 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:09:37 collections:302] Dataset loaded with 137 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:09:37 collections:304] # 137 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.63it/s]

[NeMo I 2024-08-26 04:09:38 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:09:38 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:09:38 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:09:38 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:09:38 collections:302] Dataset loaded with 183 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:09:38 collections:304] # 183 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  6.55it/s]

[NeMo I 2024-08-26 04:09:38 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:09:38 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:09:38 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:09:38 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:09:38 collections:302] Dataset loaded with 278 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:09:38 collections:304] # 278 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.72it/s]

[NeMo I 2024-08-26 04:09:39 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.11it/s]

[NeMo I 2024-08-26 04:09:39 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:09:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:09:39 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:09:39 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:09:39 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:09:39 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:09:39 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:09:39 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 36.17it/s]

[NeMo I 2024-08-26 04:09:39 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:09:39 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:09:39 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:09:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:09:39 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:09:39 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:09:39 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:09:40 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:09:40 msdd_models:1431]   
    
file a7b3a1c2f26c464d97dbec49d46b75f1_20230610t14_25_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:10:42 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:10:42 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:10:42 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:10:42 features:289] PADDING: 16
[NeMo I 2024-08-26 04:10:42 features:289] PADDING: 16
[NeMo I 2024-08-26 04:10:43 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:10:43 features:289] PADDING: 16
[NeMo I 2024-08-26 04:10:44 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:10:44 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:10:44 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:10:44 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:10:44 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:10:44 features:289] PADDING: 16
[NeMo I 2024-08-26 04:10:44 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:10:44 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:10:44 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:10:44 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:10:44 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:10:44 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  8.88it/s]

[NeMo I 2024-08-26 04:10:44 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:10:44 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:10:44 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:10:44 collections:302] Dataset loaded with 8 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 04:10:44 collections:304] # 8 files loaded accounting to # 1 labels



vad: 100%|██████████| 8/8 [00:01<00:00,  4.16it/s]

[NeMo I 2024-08-26 04:10:46 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:10:50 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.30it/s]

[NeMo I 2024-08-26 04:10:51 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:10:51 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:10:51 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:10:51 collections:302] Dataset loaded with 356 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 04:10:51 collections:304] # 356 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 6/6 [00:01<00:00,  5.34it/s]

[NeMo I 2024-08-26 04:10:52 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:10:52 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:10:52 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:10:52 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:10:52 collections:302] Dataset loaded with 425 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 04:10:52 collections:304] # 425 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 7/7 [00:01<00:00,  6.33it/s]

[NeMo I 2024-08-26 04:10:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:10:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:10:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:10:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:10:53 collections:302] Dataset loaded with 543 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 04:10:53 collections:304] # 543 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  7.92it/s]

[NeMo I 2024-08-26 04:10:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:10:54 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:10:54 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:10:54 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:10:54 collections:302] Dataset loaded with 734 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 04:10:54 collections:304] # 734 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 12/12 [00:01<00:00,  9.26it/s]


[NeMo I 2024-08-26 04:10:56 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:10:56 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:10:56 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:10:56 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:10:56 collections:302] Dataset loaded with 1123 items, total duration of  0.15 hours.
[NeMo I 2024-08-26 04:10:56 collections:304] # 1123 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 18/18 [00:01<00:00, 10.20it/s]

[NeMo I 2024-08-26 04:10:58 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.00it/s]

[NeMo I 2024-08-26 04:10:58 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:10:58 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:10:58 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:10:58 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:10:58 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:10:58 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:10:58 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:10:58 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 14.75it/s]

[NeMo I 2024-08-26 04:10:59 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:10:59 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:10:59 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:10:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:10:59 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:10:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:10:59 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:10:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:10:59 msdd_models:1431]   
    
file 26746b7e69e140b3b3a11b973330ce96_20230610t14_35_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:11:41 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:11:41 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:11:41 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:11:41 features:289] PADDING: 16
[NeMo I 2024-08-26 04:11:41 features:289] PADDING: 16
[NeMo I 2024-08-26 04:11:42 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:11:42 features:289] PADDING: 16
[NeMo I 2024-08-26 04:11:43 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:11:43 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:11:43 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:11:43 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:11:43 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:11:43 features:289] PADDING: 16
[NeMo I 2024-08-26 04:11:43 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:11:43 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:11:43 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:11:43 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:11:43 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:11:43 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 47.85it/s]

[NeMo I 2024-08-26 04:11:43 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:11:43 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:11:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:11:43 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:11:43 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.57it/s]

[NeMo I 2024-08-26 04:11:44 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:11:44 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 17.12it/s]

[NeMo I 2024-08-26 04:11:44 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:11:44 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:11:44 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:11:44 collections:302] Dataset loaded with 39 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:11:44 collections:304] # 39 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.10it/s]

[NeMo I 2024-08-26 04:11:44 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:11:44 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:11:44 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:11:44 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:11:44 collections:302] Dataset loaded with 46 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:11:44 collections:304] # 46 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.55it/s]

[NeMo I 2024-08-26 04:11:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:11:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:11:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:11:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:11:45 collections:302] Dataset loaded with 58 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:11:45 collections:304] # 58 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.69it/s]

[NeMo I 2024-08-26 04:11:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:11:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:11:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:11:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:11:45 collections:302] Dataset loaded with 76 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:11:45 collections:304] # 76 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.77it/s]

[NeMo I 2024-08-26 04:11:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:11:45 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:11:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:11:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:11:45 collections:302] Dataset loaded with 115 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:11:45 collections:304] # 115 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.93it/s]

[NeMo I 2024-08-26 04:11:45 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.45it/s]

[NeMo I 2024-08-26 04:11:46 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:11:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:11:46 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:11:46 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:11:46 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:11:46 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:11:46 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:11:46 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 42.62it/s]

[NeMo I 2024-08-26 04:11:46 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:11:46 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:11:46 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:11:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:11:46 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:11:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:11:46 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:11:46 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:11:46 msdd_models:1431]   
    
file 269d7a90316042169c14e767e8b4ec18_20230610t15_41_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:12:26 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:12:26 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:12:26 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:12:26 features:289] PADDING: 16
[NeMo I 2024-08-26 04:12:27 features:289] PADDING: 16
[NeMo I 2024-08-26 04:12:28 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:12:28 features:289] PADDING: 16
[NeMo I 2024-08-26 04:12:29 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:12:29 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:12:29 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:12:29 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:12:29 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:12:29 features:289] PADDING: 16
[NeMo I 2024-08-26 04:12:29 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:12:29 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:12:29 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:12:29 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:12:29 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:12:29 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 56.98it/s]

[NeMo I 2024-08-26 04:12:29 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:12:29 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:12:29 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:12:29 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:12:29 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  4.13it/s]

[NeMo I 2024-08-26 04:12:30 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:12:30 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 17.29it/s]

[NeMo I 2024-08-26 04:12:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:12:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:12:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:12:30 collections:302] Dataset loaded with 20 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:12:30 collections:304] # 20 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 11.95it/s]

[NeMo I 2024-08-26 04:12:30 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:12:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:12:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:12:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:12:30 collections:302] Dataset loaded with 25 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:12:30 collections:304] # 25 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  9.21it/s]

[NeMo I 2024-08-26 04:12:30 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:12:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:12:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:12:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:12:30 collections:302] Dataset loaded with 31 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:12:30 collections:304] # 31 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.89it/s]

[NeMo I 2024-08-26 04:12:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:12:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:12:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:12:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:12:31 collections:302] Dataset loaded with 43 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:12:31 collections:304] # 43 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.29it/s]

[NeMo I 2024-08-26 04:12:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:12:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:12:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:12:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:12:31 collections:302] Dataset loaded with 66 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:12:31 collections:304] # 66 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 11.10it/s]

[NeMo I 2024-08-26 04:12:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  8.69it/s]

[NeMo I 2024-08-26 04:12:31 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:12:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:12:31 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:12:31 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:12:31 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:12:31 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:12:31 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:12:31 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 39.93it/s]


[NeMo I 2024-08-26 04:12:31 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:12:31 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:12:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:12:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:12:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:12:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:12:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:12:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:12:31 msdd_models:1431]   
    
archivo 640807edf3f74283a7295a888a93aaf7_20230610t16_54_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:13:09 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:13:09 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:13:09 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:13:09 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:10 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:11 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:13:11 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:12 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:13:12 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:13:12 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:13:12 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:13:12 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:13:12 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:12 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:13:12 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:13:12 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:13:12 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:13:12 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:13:12 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 57.33it/s]

[NeMo I 2024-08-26 04:13:12 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:13:12 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:13:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:12 collections:302] Dataset loaded with 1 items, total duration of  0.00 hours.
[NeMo I 2024-08-26 04:13:12 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  5.13it/s]

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 35.93it/s]

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:13 collections:302] Dataset loaded with 20 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:13 collections:304] # 20 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 11.98it/s]

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:13 collections:302] Dataset loaded with 24 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:13 collections:304] # 24 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.87it/s]

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:13 collections:302] Dataset loaded with 31 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:13 collections:304] # 31 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.60it/s]

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:13 collections:302] Dataset loaded with 41 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:13 collections:304] # 41 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.12it/s]

[NeMo I 2024-08-26 04:13:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:13:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:13 collections:302] Dataset loaded with 63 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:13 collections:304] # 63 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.99it/s]

[NeMo I 2024-08-26 04:13:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  8.93it/s]

[NeMo I 2024-08-26 04:13:14 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:13:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:14 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:13:14 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:13:14 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:13:14 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:13:14 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:13:14 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 10.29it/s]

[NeMo I 2024-08-26 04:13:14 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:13:14 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:13:14 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:13:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:13:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:13:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:14 msdd_models:1431]   
    
archivo 52217898a6d84b0f9022c9d33b2f273f_20230610t13_28_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:13:55 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:13:55 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:13:55 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:13:55 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:55 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:56 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:13:56 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:57 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:13:57 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:13:57 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:13:57 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:13:57 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:13:57 features:289] PADDING: 16
[NeMo I 2024-08-26 04:13:57 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:13:57 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:13:57 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:13:57 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:13:57 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:13:57 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 47.99it/s]

[NeMo I 2024-08-26 04:13:57 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:13:57 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:13:57 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:57 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:57 collections:304] # 1 files loaded accounting to # 1 labels


vad: 100%|██████████| 1/1 [00:00<00:00,  3.69it/s]

[NeMo I 2024-08-26 04:13:57 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:13:58 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 20.27it/s]

[NeMo I 2024-08-26 04:13:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:58 collections:302] Dataset loaded with 24 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:58 collections:304] # 24 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 11.53it/s]

[NeMo I 2024-08-26 04:13:58 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:58 collections:302] Dataset loaded with 29 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:58 collections:304] # 29 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.60it/s]

[NeMo I 2024-08-26 04:13:58 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:58 collections:302] Dataset loaded with 37 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:58 collections:304] # 37 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.45it/s]

[NeMo I 2024-08-26 04:13:58 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:58 collections:302] Dataset loaded with 50 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:58 collections:304] # 50 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.14it/s]

[NeMo I 2024-08-26 04:13:58 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:13:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:13:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:13:58 collections:302] Dataset loaded with 78 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:13:58 collections:304] # 78 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 11.61it/s]


[NeMo I 2024-08-26 04:13:59 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  8.10it/s]

[NeMo I 2024-08-26 04:13:59 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:13:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:59 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:13:59 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:13:59 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:13:59 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:13:59 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:13:59 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 52.29it/s]

[NeMo I 2024-08-26 04:13:59 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:13:59 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:13:59 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:13:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:59 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:13:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:59 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:13:59 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:13:59 msdd_models:1431]   
    
archivo f189e55691f04a0cbca711f6ee0bac4a_20230610t13_18_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:14:43 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:14:43 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:14:43 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:14:43 features:289] PADDING: 16
[NeMo I 2024-08-26 04:14:43 features:289] PADDING: 16
[NeMo I 2024-08-26 04:14:44 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:14:44 features:289] PADDING: 16
[NeMo I 2024-08-26 04:14:44 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:14:44 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:14:44 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:14:44 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:14:44 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:14:44 features:289] PADDING: 16
[NeMo I 2024-08-26 04:14:45 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:14:45 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:14:45 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:14:45 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:14:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:14:45 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 42.34it/s]

[NeMo I 2024-08-26 04:14:45 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:14:45 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:14:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:14:45 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:14:45 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  3.56it/s]

[NeMo I 2024-08-26 04:14:45 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:14:45 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 17.25it/s]

[NeMo I 2024-08-26 04:14:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:14:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:14:46 collections:302] Dataset loaded with 39 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:14:46 collections:304] # 39 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.64it/s]

[NeMo I 2024-08-26 04:14:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:14:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:14:46 collections:302] Dataset loaded with 47 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:14:46 collections:304] # 47 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.17it/s]

[NeMo I 2024-08-26 04:14:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:14:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:14:46 collections:302] Dataset loaded with 61 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:14:46 collections:304] # 61 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.29it/s]

[NeMo I 2024-08-26 04:14:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json

[NeMo I 2024-08-26 04:14:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:14:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:14:46 collections:302] Dataset loaded with 81 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:14:46 collections:304] # 81 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00, 10.46it/s]


[NeMo I 2024-08-26 04:14:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:14:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:14:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:14:46 collections:302] Dataset loaded with 126 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:14:46 collections:304] # 126 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.00it/s]

[NeMo I 2024-08-26 04:14:47 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  5.40it/s]


[NeMo I 2024-08-26 04:14:47 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory


[NeMo W 2024-08-26 04:14:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:14:47 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:14:47 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:14:47 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:14:47 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:14:47 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:14:47 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 28.12it/s]

[NeMo I 2024-08-26 04:14:47 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:14:47 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:14:47 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:14:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:14:47 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:14:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:14:47 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:14:47 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:14:47 msdd_models:1431]   
    
file f399f79640f94bfebba1323714013e12_20230610t16_49_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:15:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:15:37 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:15:37 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:15:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:15:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:15:38 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:15:38 features:289] PADDING: 16
[NeMo I 2024-08-26 04:15:38 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:15:38 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:15:38 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:15:38 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:15:39 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:15:39 features:289] PADDING: 16
[NeMo I 2024-08-26 04:15:39 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:15:39 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:15:39 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:15:39 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:15:39 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:15:39 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 15.78it/s]

[NeMo I 2024-08-26 04:15:39 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:15:39 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:15:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:15:39 collections:302] Dataset loaded with 5 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 04:15:39 collections:304] # 5 files loaded accounting to # 1 labels



vad: 100%|██████████| 5/5 [00:01<00:00,  4.31it/s]

[NeMo I 2024-08-26 04:15:40 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:15:42 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.79it/s]

[NeMo I 2024-08-26 04:15:42 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:15:42 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:15:42 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:15:42 collections:302] Dataset loaded with 189 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:15:42 collections:304] # 189 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  4.81it/s]

[NeMo I 2024-08-26 04:15:43 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:15:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:15:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:15:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:15:43 collections:302] Dataset loaded with 226 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:15:43 collections:304] # 226 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  5.87it/s]

[NeMo I 2024-08-26 04:15:43 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:15:43 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:15:43 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:15:43 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:15:43 collections:302] Dataset loaded with 289 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:15:43 collections:304] # 289 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.19it/s]

[NeMo I 2024-08-26 04:15:44 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:15:44 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json

[NeMo I 2024-08-26 04:15:45 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:15:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:15:45 collections:302] Dataset loaded with 389 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:15:45 collections:304] # 389 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 7/7 [00:01<00:00,  6.81it/s]

[NeMo I 2024-08-26 04:15:46 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:15:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:15:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:15:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:15:46 collections:302] Dataset loaded with 591 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:15:46 collections:304] # 591 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 10/10 [00:01<00:00,  8.15it/s]


[NeMo I 2024-08-26 04:15:48 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.78it/s]

[NeMo I 2024-08-26 04:15:48 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:15:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:15:48 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:15:48 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:15:48 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:15:48 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:15:48 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:15:48 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 26.65it/s]


[NeMo I 2024-08-26 04:15:48 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:15:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:15:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:15:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:15:48 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:15:48 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:15:49 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:15:49 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:15:49 msdd_models:1431]   
    
file 1f948c343dac4d099d6cbb9a648403ff_20230610t16_57_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:16:42 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:16:42 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:16:42 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:16:42 features:289] PADDING: 16
[NeMo I 2024-08-26 04:16:42 features:289] PADDING: 16
[NeMo I 2024-08-26 04:16:43 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:16:43 features:289] PADDING: 16
[NeMo I 2024-08-26 04:16:44 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:16:44 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:16:44 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:16:44 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:16:44 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:16:44 features:289] PADDING: 16
[NeMo I 2024-08-26 04:16:45 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:16:45 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:16:45 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:16:45 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:16:45 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:16:45 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 10.32it/s]

[NeMo I 2024-08-26 04:16:45 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:16:45 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:16:45 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:16:45 collections:302] Dataset loaded with 7 items, total duration of  0.09 hours.
[NeMo I 2024-08-26 04:16:45 collections:304] # 7 files loaded accounting to # 1 labels



vad: 100%|██████████| 7/7 [00:01<00:00,  3.57it/s]

[NeMo I 2024-08-26 04:16:47 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:16:50 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.71it/s]

[NeMo I 2024-08-26 04:16:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:16:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:16:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:16:50 collections:302] Dataset loaded with 192 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 04:16:50 collections:304] # 192 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.40it/s]

[NeMo I 2024-08-26 04:16:51 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings

[NeMo I 2024-08-26 04:16:51 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:16:51 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:16:51 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:16:51 collections:302] Dataset loaded with 234 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:16:51 collections:304] # 234 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  7.59it/s]

[NeMo I 2024-08-26 04:16:51 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:16:51 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:16:51 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:16:51 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:16:51 collections:302] Dataset loaded with 292 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:16:51 collections:304] # 292 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.60it/s]

[NeMo I 2024-08-26 04:16:52 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings

[NeMo I 2024-08-26 04:16:52 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:16:52 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:16:52 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:16:52 collections:302] Dataset loaded with 383 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:16:52 collections:304] # 383 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  8.42it/s]

[NeMo I 2024-08-26 04:16:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:16:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:16:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:16:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:16:53 collections:302] Dataset loaded with 586 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:16:53 collections:304] # 586 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 10/10 [00:00<00:00, 10.65it/s]

[NeMo I 2024-08-26 04:16:54 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  2.59it/s]

[NeMo I 2024-08-26 04:16:54 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:16:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:16:54 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:16:54 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:16:54 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:16:54 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:16:54 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:16:54 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 29.01it/s]

[NeMo I 2024-08-26 04:16:54 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:16:54 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:16:54 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:16:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:16:54 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:16:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:16:54 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:16:55 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:16:55 msdd_models:1431]   
    
file 18eff36f759c4d73965c2a229beecfee_20230610t16_24_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:17:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:17:37 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:17:37 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:17:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:17:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:17:37 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:17:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:17:38 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:17:38 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:17:38 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:17:38 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:17:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:17:38 features:289] PADDING: 16
[NeMo I 2024-08-26 04:17:38 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:17:38 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:17:38 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:17:39 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:17:39 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:17:39 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 48.08it/s]

[NeMo I 2024-08-26 04:17:39 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:17:39 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:17:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:17:39 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:17:39 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  3.43it/s]

[NeMo I 2024-08-26 04:17:39 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:17:39 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 20.07it/s]

[NeMo I 2024-08-26 04:17:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:17:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:17:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:17:39 collections:302] Dataset loaded with 31 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:17:39 collections:304] # 31 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.40it/s]

[NeMo I 2024-08-26 04:17:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:17:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:17:40 collections:302] Dataset loaded with 39 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:17:40 collections:304] # 39 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.38it/s]

[NeMo I 2024-08-26 04:17:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:17:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:17:40 collections:302] Dataset loaded with 50 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:17:40 collections:304] # 50 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.52it/s]

[NeMo I 2024-08-26 04:17:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:17:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:17:40 collections:302] Dataset loaded with 67 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:17:40 collections:304] # 67 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.19it/s]

[NeMo I 2024-08-26 04:17:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:17:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:17:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:17:40 collections:302] Dataset loaded with 104 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:17:40 collections:304] # 104 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.26it/s]

[NeMo I 2024-08-26 04:17:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  5.41it/s]

[NeMo I 2024-08-26 04:17:41 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:17:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:17:41 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:17:41 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:17:41 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:17:41 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:17:41 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:17:41 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 44.98it/s]

[NeMo I 2024-08-26 04:17:41 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:17:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:17:41 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:17:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:17:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:17:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:17:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:17:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:17:41 msdd_models:1431]   
    
file 0c027942655e4f1183614bb47a282aea_20230610t13_32_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:18:26 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:18:26 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:18:26 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:18:26 features:289] PADDING: 16
[NeMo I 2024-08-26 04:18:26 features:289] PADDING: 16
[NeMo I 2024-08-26 04:18:27 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:18:27 features:289] PADDING: 16
[NeMo I 2024-08-26 04:18:27 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:18:27 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:18:28 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:18:28 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:18:28 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:18:28 features:289] PADDING: 16
[NeMo I 2024-08-26 04:18:28 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:18:28 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:18:28 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:18:28 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:18:28 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:18:28 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 26.23it/s]

[NeMo I 2024-08-26 04:18:28 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:18:28 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:18:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:18:28 collections:302] Dataset loaded with 3 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:18:28 collections:304] # 3 files loaded accounting to # 1 labels



vad: 100%|██████████| 3/3 [00:00<00:00,  4.35it/s]

[NeMo I 2024-08-26 04:18:29 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:18:30 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  7.42it/s]

[NeMo I 2024-08-26 04:18:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:18:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:18:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:18:30 collections:302] Dataset loaded with 84 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:18:30 collections:304] # 84 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.13it/s]

[NeMo I 2024-08-26 04:18:30 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:18:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:18:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:18:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:18:30 collections:302] Dataset loaded with 105 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:18:30 collections:304] # 105 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.91it/s]

[NeMo I 2024-08-26 04:18:30 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:18:30 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:18:30 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:18:30 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:18:30 collections:302] Dataset loaded with 131 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:18:30 collections:304] # 131 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.40it/s]

[NeMo I 2024-08-26 04:18:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:18:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:18:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:18:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:18:31 collections:302] Dataset loaded with 178 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:18:31 collections:304] # 178 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.88it/s]

[NeMo I 2024-08-26 04:18:31 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:18:31 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:18:31 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:18:31 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:18:31 collections:302] Dataset loaded with 271 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:18:31 collections:304] # 271 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  9.37it/s]

[NeMo I 2024-08-26 04:18:32 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.34it/s]

[NeMo I 2024-08-26 04:18:32 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:18:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:18:32 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:18:32 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:18:32 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:18:32 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:18:32 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:18:32 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 42.54it/s]

[NeMo I 2024-08-26 04:18:32 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:18:32 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:18:32 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:18:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:18:32 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:18:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:18:32 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:18:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:18:32 msdd_models:1431]   
    
file 822671d3fd8d4b468ca3fe67162b0e18_20230610t15_03_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:19:22 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:19:22 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:19:22 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:19:22 features:289] PADDING: 16
[NeMo I 2024-08-26 04:19:23 features:289] PADDING: 16
[NeMo I 2024-08-26 04:19:23 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:19:23 features:289] PADDING: 16
[NeMo I 2024-08-26 04:19:24 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:19:24 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:19:24 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:19:24 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:19:24 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:19:24 features:289] PADDING: 16
[NeMo I 2024-08-26 04:19:24 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:19:24 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:19:24 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:19:24 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:19:24 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:19:24 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 18.95it/s]

[NeMo I 2024-08-26 04:19:24 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:19:24 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:19:24 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:19:24 collections:302] Dataset loaded with 4 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:19:24 collections:304] # 4 files loaded accounting to # 1 labels



vad: 100%|██████████| 4/4 [00:00<00:00,  4.40it/s]

[NeMo I 2024-08-26 04:19:25 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:19:27 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  5.96it/s]

[NeMo I 2024-08-26 04:19:27 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:19:27 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:19:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:19:27 collections:302] Dataset loaded with 153 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 04:19:27 collections:304] # 153 files loaded accounting to # 1 labels


[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  6.25it/s]

[NeMo I 2024-08-26 04:19:27 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:19:27 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:19:27 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:19:27 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:19:27 collections:302] Dataset loaded with 190 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 04:19:27 collections:304] # 190 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  6.46it/s]

[NeMo I 2024-08-26 04:19:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:19:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:19:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:19:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:19:28 collections:302] Dataset loaded with 235 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 04:19:28 collections:304] # 235 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  7.66it/s]

[NeMo I 2024-08-26 04:19:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:19:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:19:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:19:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:19:28 collections:302] Dataset loaded with 317 items, total duration of  0.06 hours.
[NeMo I 2024-08-26 04:19:28 collections:304] # 317 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  8.15it/s]

[NeMo I 2024-08-26 04:19:29 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:19:29 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:19:29 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:19:29 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:19:29 collections:302] Dataset loaded with 485 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:19:29 collections:304] # 485 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 8/8 [00:00<00:00,  9.71it/s]


[NeMo I 2024-08-26 04:19:30 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.72it/s]

[NeMo I 2024-08-26 04:19:30 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:19:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:19:31 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:19:31 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:19:31 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:19:31 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:19:31 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:19:31 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 25.93it/s]

[NeMo I 2024-08-26 04:19:31 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:19:31 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:19:31 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:19:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:19:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:19:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:19:31 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:19:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:19:31 msdd_models:1431]   
    
file 18710ae2ed9842aabdb637eae2a858c1_20230610t16_04_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:20:15 features:289] PADDING: 16
[NeMo I 2024-08-26 04:20:16 features:289] PADDING: 16
[NeMo I 2024-08-26 04:20:17 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:20:17 features:289] PADDING: 16
[NeMo I 2024-08-26 04:20:18 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:20:18 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:20:18 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:20:18 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:20:18 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:20:18 features:289] PADDING: 16
[NeMo I 2024-08-26 04:20:18 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:20:18 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:20:18 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:20:18 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:20:18 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:20:18 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 33.53it/s]

[NeMo I 2024-08-26 04:20:18 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:20:18 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:20:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:20:18 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:20:18 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.24it/s]

[NeMo I 2024-08-26 04:20:19 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:20:19 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 10.94it/s]

[NeMo I 2024-08-26 04:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:20:19 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:20:19 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:20:19 collections:302] Dataset loaded with 69 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:20:19 collections:304] # 69 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.59it/s]

[NeMo I 2024-08-26 04:20:20 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:20:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:20:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:20:20 collections:302] Dataset loaded with 86 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:20:20 collections:304] # 86 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.77it/s]

[NeMo I 2024-08-26 04:20:20 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:20:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:20:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:20:20 collections:302] Dataset loaded with 107 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:20:20 collections:304] # 107 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.46it/s]

[NeMo I 2024-08-26 04:20:20 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:20:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:20:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:20:20 collections:302] Dataset loaded with 145 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:20:20 collections:304] # 145 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.09it/s]


[NeMo I 2024-08-26 04:20:21 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:20:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:20:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:20:21 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:20:21 collections:302] Dataset loaded with 223 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:20:21 collections:304] # 223 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  9.28it/s]

[NeMo I 2024-08-26 04:20:21 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.41it/s]

[NeMo I 2024-08-26 04:20:22 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:20:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:20:22 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:20:22 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:20:22 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:20:22 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:20:22 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:20:22 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 44.38it/s]

[NeMo I 2024-08-26 04:20:22 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:20:22 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:20:22 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:20:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:20:22 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:20:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:20:22 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:20:22 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:20:22 msdd_models:1431]   
    
file e0913f30a77847f89585f9219bb14dd5_20230610t16_29_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:21:07 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:21:07 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:21:07 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:21:07 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:07 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:07 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:21:08 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:08 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:21:08 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:21:08 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:21:08 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:21:08 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:21:08 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:09 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:21:09 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:21:09 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:21:09 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:21:09 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:21:09 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 24.05it/s]

[NeMo I 2024-08-26 04:21:09 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:21:09 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:21:09 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:09 collections:302] Dataset loaded with 3 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:21:09 collections:304] # 3 files loaded accounting to # 1 labels



vad: 100%|██████████| 3/3 [00:00<00:00,  3.72it/s]

[NeMo I 2024-08-26 04:21:10 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:21:11 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  5.75it/s]

[NeMo I 2024-08-26 04:21:11 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:21:11 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:21:11 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:11 collections:302] Dataset loaded with 95 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:21:11 collections:304] # 95 files loaded accounting to # 1 labels


[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.17it/s]

[NeMo I 2024-08-26 04:21:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:21:12 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:21:12 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:21:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:12 collections:302] Dataset loaded with 115 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:21:12 collections:304] # 115 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  5.80it/s]

[NeMo I 2024-08-26 04:21:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:21:12 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:21:12 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:21:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:12 collections:302] Dataset loaded with 142 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:21:12 collections:304] # 142 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  7.22it/s]

[NeMo I 2024-08-26 04:21:12 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:21:12 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:21:12 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:21:12 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:12 collections:302] Dataset loaded with 189 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:21:12 collections:304] # 189 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  6.26it/s]

[NeMo I 2024-08-26 04:21:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:21:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:21:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:21:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:13 collections:302] Dataset loaded with 289 items, total duration of  0.04 hours.
[NeMo I 2024-08-26 04:21:13 collections:304] # 289 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  7.74it/s]


[NeMo I 2024-08-26 04:21:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  2.40it/s]

[NeMo I 2024-08-26 04:21:14 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:21:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:21:14 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:21:14 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:21:14 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:21:14 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:21:14 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:21:14 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 36.18it/s]

[NeMo I 2024-08-26 04:21:14 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:21:14 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:21:14 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:21:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:21:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:21:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:21:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:21:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:21:14 msdd_models:1431]   
    
file d8363b1bbb1f46579cb66073a1423ad3_20230610t15_06_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:21:57 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:21:57 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:21:57 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:21:57 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:58 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:58 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:21:58 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:59 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:21:59 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:21:59 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:21:59 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:21:59 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:21:59 features:289] PADDING: 16
[NeMo I 2024-08-26 04:21:59 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:21:59 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:21:59 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:21:59 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:21:59 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:21:59 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 35.36it/s]

[NeMo I 2024-08-26 04:21:59 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:21:59 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:21:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:21:59 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:21:59 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  5.25it/s]

[NeMo I 2024-08-26 04:22:00 clustering_diarizer:250] Generating predictions with overlapping input segments




[NeMo I 2024-08-26 04:22:00 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 13.56it/s]

[NeMo I 2024-08-26 04:22:00 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:22:00 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:00 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:00 collections:302] Dataset loaded with 53 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:00 collections:304] # 53 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.29it/s]

[NeMo I 2024-08-26 04:22:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings





[NeMo I 2024-08-26 04:22:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:22:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:01 collections:302] Dataset loaded with 63 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:01 collections:304] # 63 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.41it/s]

[NeMo I 2024-08-26 04:22:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json





[NeMo I 2024-08-26 04:22:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:01 collections:302] Dataset loaded with 78 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:01 collections:304] # 78 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.49it/s]

[NeMo I 2024-08-26 04:22:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:22:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:01 collections:302] Dataset loaded with 103 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:01 collections:304] # 103 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.52it/s]

[NeMo I 2024-08-26 04:22:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:22:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:01 collections:302] Dataset loaded with 156 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:01 collections:304] # 156 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  9.19it/s]


[NeMo I 2024-08-26 04:22:02 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  4.23it/s]

[NeMo I 2024-08-26 04:22:02 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:22:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:02 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:22:02 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:22:02 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:22:02 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:22:02 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:22:02 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 49.05it/s]

[NeMo I 2024-08-26 04:22:02 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:22:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:22:02 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:22:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:22:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:22:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:02 msdd_models:1431]   
    
file a8de54e7217b43bf9222236fe3a35143_20230610t15_59_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:22:46 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:22:46 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:22:46 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:22:46 features:289] PADDING: 16
[NeMo I 2024-08-26 04:22:46 features:289] PADDING: 16
[NeMo I 2024-08-26 04:22:47 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:22:47 features:289] PADDING: 16
[NeMo I 2024-08-26 04:22:47 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:22:47 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:22:47 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:22:47 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:22:48 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:22:48 features:289] PADDING: 16
[NeMo I 2024-08-26 04:22:48 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:22:48 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:22:48 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:22:48 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:22:48 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:22:48 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 38.22it/s]

[NeMo I 2024-08-26 04:22:48 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:22:48 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:22:48 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:48 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:48 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.87it/s]

[NeMo I 2024-08-26 04:22:48 clustering_diarizer:250] Generating predictions with overlapping input segments




[NeMo I 2024-08-26 04:22:49 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 12.42it/s]

[NeMo I 2024-08-26 04:22:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:22:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:49 collections:302] Dataset loaded with 51 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:49 collections:304] # 51 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  4.85it/s]

[NeMo I 2024-08-26 04:22:49 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:22:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:49 collections:302] Dataset loaded with 64 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:49 collections:304] # 64 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.14it/s]


[NeMo I 2024-08-26 04:22:49 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:22:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:49 collections:302] Dataset loaded with 79 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:49 collections:304] # 79 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.13it/s]

[NeMo I 2024-08-26 04:22:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:22:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:50 collections:302] Dataset loaded with 107 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:50 collections:304] # 107 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.04it/s]

[NeMo I 2024-08-26 04:22:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:22:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:22:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:22:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:22:50 collections:302] Dataset loaded with 166 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:22:50 collections:304] # 166 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.52it/s]

[NeMo I 2024-08-26 04:22:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  4.03it/s]

[NeMo I 2024-08-26 04:22:51 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:51 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:22:51 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:22:51 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:22:51 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:22:51 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:22:51 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 42.55it/s]

[NeMo I 2024-08-26 04:22:51 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:22:51 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:22:51 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:51 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:51 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:22:51 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:22:51 msdd_models:1431]   
    
file bab93478e7374a7da7ca383c9090a498_20230610t16_15_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:23:32 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:23:32 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:23:32 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:23:32 features:289] PADDING: 16
[NeMo I 2024-08-26 04:23:32 features:289] PADDING: 16
[NeMo I 2024-08-26 04:23:33 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:23:33 features:289] PADDING: 16
[NeMo I 2024-08-26 04:23:34 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:23:34 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:23:34 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:23:34 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:23:34 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:23:34 features:289] PADDING: 16
[NeMo I 2024-08-26 04:23:35 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:23:35 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:23:35 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:23:35 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:23:35 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:23:35 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 39.56it/s]

[NeMo I 2024-08-26 04:23:35 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:23:35 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:23:35 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:23:35 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:23:35 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  3.22it/s]

[NeMo I 2024-08-26 04:23:35 clustering_diarizer:250] Generating predictions with overlapping input segments




[NeMo I 2024-08-26 04:23:36 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 16.91it/s]

[NeMo I 2024-08-26 04:23:36 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:23:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:23:36 collections:302] Dataset loaded with 28 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:23:36 collections:304] # 28 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.09it/s]


[NeMo I 2024-08-26 04:23:36 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:23:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:23:36 collections:302] Dataset loaded with 34 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:23:36 collections:304] # 34 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.12it/s]

[NeMo I 2024-08-26 04:23:36 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:23:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:23:36 collections:302] Dataset loaded with 42 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:23:36 collections:304] # 42 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.97it/s]

[NeMo I 2024-08-26 04:23:36 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:23:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:23:36 collections:302] Dataset loaded with 57 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:23:36 collections:304] # 57 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.97it/s]

[NeMo I 2024-08-26 04:23:36 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:23:36 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:23:36 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:23:36 collections:302] Dataset loaded with 87 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:23:36 collections:304] # 87 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.90it/s]

[NeMo I 2024-08-26 04:23:37 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  6.62it/s]

[NeMo I 2024-08-26 04:23:37 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:23:37 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:23:37 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:23:37 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:23:37 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:23:37 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:23:37 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:23:37 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 42.93it/s]

[NeMo I 2024-08-26 04:23:37 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:23:37 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:23:37 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:23:37 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:23:37 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:23:37 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:23:37 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:23:37 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:23:37 msdd_models:1431]   
    
archivo 3e389aa8839748e78e57f6eb8ce0812a_20230610t13_55_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:24:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:24:37 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:24:37 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:24:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:24:38 features:289] PADDING: 16
[NeMo I 2024-08-26 04:24:38 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:24:39 features:289] PADDING: 16
[NeMo I 2024-08-26 04:24:40 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:24:40 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:24:40 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:24:40 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:24:40 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:24:40 features:289] PADDING: 16
[NeMo I 2024-08-26 04:24:40 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:24:40 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:24:40 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:24:40 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:24:40 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:24:40 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  8.04it/s]

[NeMo I 2024-08-26 04:24:40 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:24:40 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:24:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:24:40 collections:302] Dataset loaded with 8 items, total duration of  0.10 hours.
[NeMo I 2024-08-26 04:24:40 collections:304] # 8 files loaded accounting to # 1 labels



vad: 100%|██████████| 8/8 [00:02<00:00,  3.98it/s]

[NeMo I 2024-08-26 04:24:42 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:24:45 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  3.16it/s]

[NeMo I 2024-08-26 04:24:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:24:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:24:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:24:46 collections:302] Dataset loaded with 332 items, total duration of  0.12 hours.
[NeMo I 2024-08-26 04:24:46 collections:304] # 332 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  6.04it/s]

[NeMo I 2024-08-26 04:24:47 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:24:47 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:24:47 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:24:47 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:24:47 collections:302] Dataset loaded with 407 items, total duration of  0.13 hours.
[NeMo I 2024-08-26 04:24:47 collections:304] # 407 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 7/7 [00:01<00:00,  6.13it/s]

[NeMo I 2024-08-26 04:24:48 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:24:48 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:24:48 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:24:48 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:24:48 collections:302] Dataset loaded with 506 items, total duration of  0.13 hours.
[NeMo I 2024-08-26 04:24:48 collections:304] # 506 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 8/8 [00:01<00:00,  6.72it/s]

[NeMo I 2024-08-26 04:24:49 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:24:49 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json





[NeMo I 2024-08-26 04:24:49 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:24:49 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:24:49 collections:302] Dataset loaded with 687 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 04:24:49 collections:304] # 687 files loaded accounting to # 1 labels


[4/5] extract embeddings: 100%|██████████| 11/11 [00:01<00:00,  7.92it/s]

[NeMo I 2024-08-26 04:24:51 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:24:51 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json





[NeMo I 2024-08-26 04:24:51 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:24:51 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:24:51 collections:302] Dataset loaded with 1047 items, total duration of  0.14 hours.
[NeMo I 2024-08-26 04:24:51 collections:304] # 1047 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 17/17 [00:02<00:00,  8.08it/s]


[NeMo I 2024-08-26 04:24:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  1.74it/s]

[NeMo I 2024-08-26 04:24:54 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:24:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:24:54 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:24:54 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:24:54 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:24:54 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:24:54 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:24:54 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 15.59it/s]

[NeMo I 2024-08-26 04:24:54 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:24:54 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:24:54 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:24:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:24:54 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:24:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:24:54 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:24:54 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:24:54 msdd_models:1431]   
    
archivo f332b3326cb34bddb0989e865a5831e4_20230610t14_13_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:25:37 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:25:37 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:25:37 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:25:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:25:37 features:289] PADDING: 16
[NeMo I 2024-08-26 04:25:38 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:25:38 features:289] PADDING: 16
[NeMo I 2024-08-26 04:25:38 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:25:38 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:25:38 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:25:38 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:25:39 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:25:39 features:289] PADDING: 16
[NeMo I 2024-08-26 04:25:39 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:25:39 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:25:39 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:25:39 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:25:39 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:25:39 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 44.11it/s]

[NeMo I 2024-08-26 04:25:39 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:25:39 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:25:39 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:25:39 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:25:39 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  3.53it/s]

[NeMo I 2024-08-26 04:25:39 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:25:40 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 18.34it/s]

[NeMo I 2024-08-26 04:25:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:25:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:25:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:25:40 collections:302] Dataset loaded with 43 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:25:40 collections:304] # 43 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.83it/s]

[NeMo I 2024-08-26 04:25:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:25:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:25:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:25:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:25:40 collections:302] Dataset loaded with 55 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:25:40 collections:304] # 55 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.21it/s]

[NeMo I 2024-08-26 04:25:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings





[NeMo I 2024-08-26 04:25:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:25:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:25:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:25:40 collections:302] Dataset loaded with 69 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:25:40 collections:304] # 69 files loaded accounting to # 1 labels


[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.23it/s]

[NeMo I 2024-08-26 04:25:40 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:25:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:25:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:25:40 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:25:40 collections:302] Dataset loaded with 93 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:25:40 collections:304] # 93 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.63it/s]

[NeMo I 2024-08-26 04:25:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:25:41 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:25:41 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:25:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:25:41 collections:302] Dataset loaded with 141 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:25:41 collections:304] # 141 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00, 10.21it/s]

[NeMo I 2024-08-26 04:25:41 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.95it/s]

[NeMo I 2024-08-26 04:25:41 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:25:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:25:41 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:25:41 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:25:41 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:25:41 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:25:41 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:25:41 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 38.81it/s]

[NeMo I 2024-08-26 04:25:41 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:25:41 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:25:41 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:25:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:25:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:25:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:25:41 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:25:41 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:25:41 msdd_models:1431]   
    
archivo e56abe4fb3ae4c0cbc1b1cdfc8d80cb3_20230610t15_15_utc.wav listo


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:26:24 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:26:24 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:26:24 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:26:24 features:289] PADDING: 16
[NeMo I 2024-08-26 04:26:25 features:289] PADDING: 16
[NeMo I 2024-08-26 04:26:25 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:26:25 features:289] PADDING: 16
[NeMo I 2024-08-26 04:26:26 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:26:26 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:26:26 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:26:26 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:26:26 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:26:26 features:289] PADDING: 16
[NeMo I 2024-08-26 04:26:26 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:26:26 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:26:26 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:26:26 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:26:26 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:26:26 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 36.07it/s]

[NeMo I 2024-08-26 04:26:26 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:26:26 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:26:26 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:26:26 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:26:26 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.78it/s]

[NeMo I 2024-08-26 04:26:27 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:26:27 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 12.69it/s]

[NeMo I 2024-08-26 04:26:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:26:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:26:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:26:28 collections:302] Dataset loaded with 67 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:26:28 collections:304] # 67 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  7.73it/s]

[NeMo I 2024-08-26 04:26:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:26:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:26:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:26:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:26:28 collections:302] Dataset loaded with 82 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:26:28 collections:304] # 82 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.84it/s]

[NeMo I 2024-08-26 04:26:28 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:26:28 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:26:28 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:26:28 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:26:28 collections:302] Dataset loaded with 101 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:26:28 collections:304] # 101 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  5.95it/s]

[NeMo I 2024-08-26 04:26:29 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:26:29 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:26:29 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:26:29 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:26:29 collections:302] Dataset loaded with 138 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:26:29 collections:304] # 138 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.91it/s]

[NeMo I 2024-08-26 04:26:29 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:26:29 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:26:29 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:26:29 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:26:29 collections:302] Dataset loaded with 206 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:26:29 collections:304] # 206 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  8.50it/s]

[NeMo I 2024-08-26 04:26:29 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.15it/s]

[NeMo I 2024-08-26 04:26:30 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:26:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:26:30 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:26:30 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:26:30 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:26:30 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:26:30 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:26:30 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 46.45it/s]

[NeMo I 2024-08-26 04:26:30 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:26:30 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:26:30 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:26:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:26:30 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:26:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:26:30 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:26:30 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:26:30 msdd_models:1431]   
    
file bdc1679b294f4d60844c566fef78c4c8_20230610t16_16_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:27:10 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:27:10 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:27:10 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:27:10 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:10 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:11 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:27:11 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:12 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:27:12 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:27:12 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:27:12 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:27:12 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:27:12 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:12 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:27:13 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:27:13 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:27:13 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:27:13 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:27:13 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 51.88it/s]

[NeMo I 2024-08-26 04:27:13 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:27:13 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:27:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:13 collections:302] Dataset loaded with 1 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:27:13 collections:304] # 1 files loaded accounting to # 1 labels



vad: 100%|██████████| 1/1 [00:00<00:00,  4.43it/s]

[NeMo I 2024-08-26 04:27:13 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:27:13 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 25.98it/s]

[NeMo I 2024-08-26 04:27:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:27:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:27:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:13 collections:302] Dataset loaded with 20 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:27:13 collections:304] # 20 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.75it/s]

[NeMo I 2024-08-26 04:27:13 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:27:13 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:27:13 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:27:13 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:13 collections:302] Dataset loaded with 24 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:27:13 collections:304] # 24 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.91it/s]

[NeMo I 2024-08-26 04:27:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:27:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:27:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:27:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:14 collections:302] Dataset loaded with 31 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:27:14 collections:304] # 31 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.89it/s]

[NeMo I 2024-08-26 04:27:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:27:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:27:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:27:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:14 collections:302] Dataset loaded with 39 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:27:14 collections:304] # 39 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.23it/s]

[NeMo I 2024-08-26 04:27:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:27:14 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:27:14 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:27:14 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:14 collections:302] Dataset loaded with 62 items, total duration of  0.01 hours.
[NeMo I 2024-08-26 04:27:14 collections:304] # 62 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  6.46it/s]


[NeMo I 2024-08-26 04:27:14 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings


clustering: 100%|██████████| 1/1 [00:00<00:00,  7.22it/s]

[NeMo I 2024-08-26 04:27:14 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:27:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:27:14 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:27:14 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:27:14 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:27:14 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:27:14 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:27:14 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 2/2 [00:00<00:00, 15.79it/s]

[NeMo I 2024-08-26 04:27:14 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:27:14 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:27:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:27:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:27:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:27:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:27:14 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:27:14 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:27:14 msdd_models:1431]   
    
file 2bf1748f6df24d4494ad8a137f33ed77_20230610t15_22_utc.wav ready


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:27:57 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:27:57 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:27:57 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:27:57 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:57 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:58 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:27:58 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:59 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:27:59 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:27:59 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:27:59 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:27:59 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:27:59 features:289] PADDING: 16
[NeMo I 2024-08-26 04:27:59 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:27:59 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:27:59 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:27:59 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:27:59 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:27:59 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 32.46it/s]

[NeMo I 2024-08-26 04:27:59 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:27:59 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:27:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:27:59 collections:302] Dataset loaded with 2 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:27:59 collections:304] # 2 files loaded accounting to # 1 labels



vad: 100%|██████████| 2/2 [00:00<00:00,  4.68it/s]

[NeMo I 2024-08-26 04:28:00 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:28:00 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00, 12.18it/s]

[NeMo I 2024-08-26 04:28:00 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:28:00 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:00 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:00 collections:302] Dataset loaded with 56 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:28:00 collections:304] # 56 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 1/1 [00:00<00:00,  5.51it/s]

[NeMo I 2024-08-26 04:28:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:01 collections:302] Dataset loaded with 70 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:28:01 collections:304] # 70 files loaded accounting to # 1 labels


[2/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  8.79it/s]

[NeMo I 2024-08-26 04:28:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:01 collections:302] Dataset loaded with 88 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:28:01 collections:304] # 88 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  9.18it/s]

[NeMo I 2024-08-26 04:28:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:01 collections:302] Dataset loaded with 121 items, total duration of  0.02 hours.
[NeMo I 2024-08-26 04:28:01 collections:304] # 121 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 2/2 [00:00<00:00,  6.92it/s]

[NeMo I 2024-08-26 04:28:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:28:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:01 collections:302] Dataset loaded with 185 items, total duration of  0.03 hours.
[NeMo I 2024-08-26 04:28:01 collections:304] # 185 files loaded accounting to # 1 labels



[5/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  8.44it/s]

[NeMo I 2024-08-26 04:28:02 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  3.82it/s]

[NeMo I 2024-08-26 04:28:02 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:28:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:28:02 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:28:02 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:28:02 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:28:02 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:28:02 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:28:02 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 30.87it/s]

[NeMo I 2024-08-26 04:28:02 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:28:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:28:02 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:28:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:28:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:28:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:28:02 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:28:02 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:28:02 msdd_models:1431]   
    
file 60e1b8f07eb844c8b7014251b86ecea1_20230610t13_35_utc.wav done


INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../root/.cache/torch/whisperx-vad-segmentation.bin`


No language specified, language will be first be detected for each audio file (increases inference time).
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
Suppressing numeral and symbol tokens: [3, 4, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 502, 568, 628, 805, 945, 1017, 1025, 1266, 1294, 1360, 1386, 1525, 1614, 1649, 1722, 1848, 1958, 2009, 2119, 2217, 2272, 2319, 2331, 2443, 2625, 2803, 2975, 3165, 3279, 3282, 3356, 3405, 3446, 3499, 3552, 3705, 4022, 4060, 4289, 4303, 4436, 4550, 4688, 4702, 4762, 4808, 5080, 5211, 5254, 5285, 5348, 5853, 5867, 5923, 6071, 6074, 6096, 6375, 6494, 6549, 6591, 6641, 6673, 6856, 6866, 6879, 6905, 6976, 7143, 7201, 7271, 7490, 7526, 7546, 7551, 7560, 7562, 7629, 7634, 7668, 7771, 7773, 7911, 7998, 8132, 8227, 8423, 8451, 8465, 8494, 8652, 8794, 8858, 8923, 90

[NeMo W 2024-08-26 04:28:53 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true
    
[NeMo W 2024-08-26 04:28:53 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    
[NeMo W 2024-08-26 04:28:53 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple

[NeMo I 2024-08-26 04:28:53 features:289] PADDING: 16
[NeMo I 2024-08-26 04:28:53 features:289] PADDING: 16
[NeMo I 2024-08-26 04:28:54 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-08-26 04:28:54 features:289] PADDING: 16
[NeMo I 2024-08-26 04:28:54 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-08-26 04:28:54 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:28:54 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-08-26 04:28:54 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2024-08-26 04:28:54 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: sca

[NeMo I 2024-08-26 04:28:54 features:289] PADDING: 16
[NeMo I 2024-08-26 04:28:55 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-08-26 04:28:55 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-08-26 04:28:55 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }


[NeMo W 2024-08-26 04:28:55 clustering_diarizer:411] Deleting previous clustering diarizer outputs.


[NeMo I 2024-08-26 04:28:55 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:28:55 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue


splitting manifest: 100%|██████████| 1/1 [00:00<00:00, 16.98it/s]

[NeMo I 2024-08-26 04:28:55 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2024-08-26 04:28:55 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-08-26 04:28:55 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:55 collections:302] Dataset loaded with 4 items, total duration of  0.05 hours.
[NeMo I 2024-08-26 04:28:55 collections:304] # 4 files loaded accounting to # 1 labels



vad: 100%|██████████| 4/4 [00:01<00:00,  3.66it/s]

[NeMo I 2024-08-26 04:28:56 clustering_diarizer:250] Generating predictions with overlapping input segments

[NeMo I 2024-08-26 04:28:57 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.


creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  4.97it/s]

[NeMo I 2024-08-26 04:28:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-08-26 04:28:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:58 collections:302] Dataset loaded with 185 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:28:58 collections:304] # 185 files loaded accounting to # 1 labels



[1/5] extract embeddings: 100%|██████████| 3/3 [00:00<00:00,  4.93it/s]

[NeMo I 2024-08-26 04:28:58 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:28:58 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-08-26 04:28:58 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:58 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:58 collections:302] Dataset loaded with 222 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:28:58 collections:304] # 222 files loaded accounting to # 1 labels



[2/5] extract embeddings: 100%|██████████| 4/4 [00:00<00:00,  6.40it/s]

[NeMo I 2024-08-26 04:28:59 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:28:59 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-08-26 04:28:59 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:28:59 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:28:59 collections:302] Dataset loaded with 277 items, total duration of  0.07 hours.
[NeMo I 2024-08-26 04:28:59 collections:304] # 277 files loaded accounting to # 1 labels



[3/5] extract embeddings: 100%|██████████| 5/5 [00:00<00:00,  6.39it/s]

[NeMo I 2024-08-26 04:29:00 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:29:00 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-08-26 04:29:00 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:29:00 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:29:00 collections:302] Dataset loaded with 379 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:29:00 collections:304] # 379 files loaded accounting to # 1 labels



[4/5] extract embeddings: 100%|██████████| 6/6 [00:00<00:00,  6.94it/s]

[NeMo I 2024-08-26 04:29:01 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-08-26 04:29:01 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-08-26 04:29:01 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-08-26 04:29:01 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-08-26 04:29:01 collections:302] Dataset loaded with 574 items, total duration of  0.08 hours.
[NeMo I 2024-08-26 04:29:01 collections:304] # 574 files loaded accounting to # 1 labels


[5/5] extract embeddings: 100%|██████████| 9/9 [00:01<00:00,  7.30it/s]

[NeMo I 2024-08-26 04:29:02 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings



clustering: 100%|██████████| 1/1 [00:00<00:00,  1.97it/s]

[NeMo I 2024-08-26 04:29:03 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory



[NeMo W 2024-08-26 04:29:03 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:29:03 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-08-26 04:29:03 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-08-26 04:29:03 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-08-26 04:29:03 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-08-26 04:29:03 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-08-26 04:29:03 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale4_cluste

100%|██████████| 1/1 [00:00<00:00, 10.34it/s]

[NeMo I 2024-08-26 04:29:03 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-08-26 04:29:03 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-08-26 04:29:03 speaker_utils:93] Number of files to diarize: 1



[NeMo W 2024-08-26 04:29:03 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:29:03 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:29:03 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:29:03 speaker_utils:93] Number of files to diarize: 1


[NeMo W 2024-08-26 04:29:03 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate


[NeMo I 2024-08-26 04:29:03 msdd_models:1431]   
    
file 2debb7328d994ecdb89a0a7afd03cb2e_20230610t16_57_utc.wav done


In [None]:
# Clear cache: force a garbage-collection pass to free unreachable Python objects
import gc
gc.collect()


30686
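
The number printed above is the return value of `gc.collect()`, i.e. the count of unreachable objects the collector found. Note that `gc.collect()` only frees Python objects; it does not by itself return cached GPU memory to the driver. A minimal sketch of a combined cleanup step follows, assuming the models in this notebook run on PyTorch (as the logs above suggest). The helper name `free_memory` is hypothetical and not part of the original notebook.

```python
import gc


def free_memory():
    """Hypothetical helper: collect garbage, then release cached GPU memory if possible.

    Returns the number of unreachable objects found by the garbage collector.
    """
    # First drop unreachable Python objects (this may release tensors they hold).
    collected = gc.collect()

    # Assumption: PyTorch may or may not be installed; skip the GPU step if it is not.
    try:
        import torch
    except ImportError:
        return collected

    # Release unused cached blocks held by PyTorch's CUDA caching allocator.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return collected


print(free_memory())
```

Tensors that are still referenced (for example by a model variable) are not freed by this step; delete those references with `del` before calling the helper if the goal is to reclaim GPU memory between files.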