Removing the dependency on Pyannote for Diarization and VAD #15632

Open
tango4j wants to merge 17 commits into main from add_py_md_eval

Collaborator

@tango4j tango4j commented Apr 21, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Removes the pyannote.core and pyannote.metrics dependencies from NeMo's
speaker-diarization stack and replaces them with an in-tree, NIST
md-eval-22.pl-faithful Python engine plus lhotse.SupervisionSegment-based
annotation objects. The public API of nemo.collections.asr.metrics.der is
preserved, including byte-for-byte numerical parity with historical NeMo
diarization results (no shift in published DER numbers).

Where possible, Pyannote classes were replaced with Lhotse's classes, which
keeps the amount of code added to the repo small once the Pyannote imports are
removed. Almost everything was replaceable; the RTTM writing functions are the
exception.

Collection: ASR (speaker tasks / diarization, VAD)

Changelog

New: in-tree DER engine (nemo/collections/asr/metrics/md_eval.py)

  • New module: a Python port of NIST md-eval-22.pl, written in NeMo style
    (Apache header, type hints, Google-style docstrings, __all__,
    nemo.utils.logging, no CLI). Drives all DER computation.
  • New DiarizationErrorResult result object exposing the dict-like interface
    used throughout NeMo (abs(result), result['total' | 'confusion' | 'false alarm' | 'missed detection'], result.results_,
    result.optimal_mapping(...), result.report()).
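
As a rough illustration of the dict-like interface listed above, the sketch below shows how abs(result) and the string keys could be served by one small object. SketchDerResult and its fields are hypothetical names for illustration only; this is not the DiarizationErrorResult implementation in md_eval.py, and it omits results_, optimal_mapping(...) and report().

```python
from dataclasses import dataclass


@dataclass
class SketchDerResult:
    """Hypothetical stand-in for a DER result object (not NeMo's class)."""

    confusion: float = 0.0         # seconds of speaker confusion
    false_alarm: float = 0.0       # seconds of false-alarm speech
    missed_detection: float = 0.0  # seconds of missed speech
    total: float = 0.0             # seconds of scored reference speech

    def __getitem__(self, key: str) -> float:
        # Serve the string keys named in the changelog.
        return {
            "total": self.total,
            "confusion": self.confusion,
            "false alarm": self.false_alarm,
            "missed detection": self.missed_detection,
        }[key]

    def __abs__(self) -> float:
        # abs(result) is the aggregate error rate over the scored time.
        if self.total == 0.0:
            return 0.0
        errors = self.confusion + self.false_alarm + self.missed_detection
        return errors / self.total


result = SketchDerResult(confusion=1.0, false_alarm=0.5, missed_detection=0.5, total=10.0)
print(f"DER = {abs(result):.4f}")
```

Keeping the raw durations and deriving the rate in __abs__ means results from several files can be summed before the rate is computed.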

nemo/collections/asr/metrics/der.py (DER public API)

  • score_labels, evaluate_der, score_labels_from_rttm_labels,
    get_partial_ref_labels, get_online_DER_stats, calculate_session_cpWER,
    calculate_session_cpWER_bruteforce, concat_perm_word_error_rate are
    all preserved with their original names, signatures, and return shapes.
    No breaking changes for downstream callers.
  • New lhotse-backed annotation helpers (replacements for the previous
    pyannote.core types):
    • make_diar_segment(start, end, speaker, ...) -> SupervisionSegment
    • make_diar_annotation(labels, uniq_name=...) -> list[SupervisionSegment]
    • make_uem_timeline(uem_lines, uniq_id=...) -> list[SupervisionSegment]
      (UEM regions carried as supervisions with speaker="UEM")
    • unique_speakers(annotation) -> list[str]
    • write_supervisions_to_rttm(annotation, file_handle, ...)
  • New score_labels_from_rttm_labels(...) convenience entry point that takes
    raw "start end speaker" label strings (no annotation object construction
    required by the caller).
  • New _default_uem_from_ref_sys(ref_data, sys_data) helper. When a caller
    does not supply a UEM, the high-level wrappers now auto-derive
    [min(ref ∪ sys TBEG), max(ref ∪ sys TEND)] per (file_id, channel) and
    pass it to evaluate(). This matches the historical no-UEM scoring map
    used by the previous external engine and prevents any over-shoot of the
    hypothesis past the last reference segment from being silently dropped.
    md_eval.evaluate() itself remains a faithful NIST port (ref-extent only)
    for power users that call it directly.
  • Docstring on collar argument in both score_labels and
    score_labels_from_rttm_labels clarifies the NIST half-width semantics
    (total no-score zone = 2 * collar) and gives the cross-engine conversion
    rule (NeMo collar=X <==> external libs that define collar as total width
    collar=2X).
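
The auto-derived UEM described above can be sketched as follows. This is a minimal illustration under assumptions: the tuple layout (file_id, channel, tbeg, tend) and the name sketch_default_uem are hypothetical, and the real _default_uem_from_ref_sys helper may use different data structures.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (file_id, channel, tbeg, tend): an assumed, simplified segment record.
Segment = Tuple[str, int, float, float]
Key = Tuple[str, int]


def sketch_default_uem(ref: List[Segment], hyp: List[Segment]) -> Dict[Key, Tuple[float, float]]:
    """Derive one scoring region per (file_id, channel) from the ref/hyp union."""
    begins: Dict[Key, List[float]] = defaultdict(list)
    ends: Dict[Key, List[float]] = defaultdict(list)
    for file_id, channel, tbeg, tend in list(ref) + list(hyp):
        begins[(file_id, channel)].append(tbeg)
        ends[(file_id, channel)].append(tend)
    # [min over all TBEG, max over all TEND] for each key.
    return {key: (min(begins[key]), max(ends[key])) for key in begins}


ref = [("session_001", 1, 0.5, 4.0)]
hyp = [("session_001", 1, 0.0, 5.5)]  # hypothesis runs past the reference
uem = sketch_default_uem(ref, hyp)
print(uem)
```

With a reference-only extent, the region for this example would be (0.5, 4.0), so the 1.5 s of hypothesis speech after the last reference segment would fall outside the scored region. The union extent (0.0, 5.5) keeps it in scope.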

Source code rename / scrub (no behaviour change)

  • nemo/collections/asr/parts/utils/speaker_utils.py:
    • labels_to_pyannote_object -> labels_to_supervisions
    • timestamps_to_pyannote_object -> timestamps_to_supervisions
    • now returns list[SupervisionSegment]
  • nemo/collections/asr/parts/utils/vad_utils.py:
    • vad_construct_pyannote_object_per_file -> vad_construct_supervisions_per_file
    • frame_vad_construct_pyannote_object_per_file -> frame_vad_construct_supervisions_per_file
    • read_rttm_as_pyannote_object -> read_rttm_as_supervisions
    • new internal _DetectionErrorRateAccumulator class replaces
      pyannote.metrics.detection.DetectionErrorRate, backed by md_eval. It
      preserves the metric(reference, hypothesis) accumulation +
      metric.report(display=False) API and returns a pandas DataFrame with
      the same ('detection error rate', '%'), ('false alarm', '%'),
      ('miss', '%') columns that downstream code consumes.
  • scripts/speaker_tasks/eval_diar_with_asr.py:
    • get_pyannote_objs_from_rttms -> get_supervisions_from_rttms
  • examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py:
    • call sites updated to the new timestamps_to_supervisions name
  • All docstrings, comments, and reference URLs that mentioned the third-party
    package by name have been rewritten (or replaced with neutral wording such
    as "External Annotation Library") so a git grep -i pyannote over the
    branch returns zero matches.
  • Two tutorial notebooks (tutorials/speaker_tasks/End_to_End_Diarization_*.ipynb,
    tutorials/tools/Multispeaker_Simulator.ipynb) and the inference notebook
    updated to use the new names and score_labels_from_rttm_labels.
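
To make the accumulate-then-report pattern of the detection error accumulator concrete, here is a small pure-Python sketch. It is not the _DetectionErrorRateAccumulator from this PR: the class name is hypothetical, it is not backed by md_eval, it assumes the intervals within each list do not overlap, and report() returns a plain dict rather than a pandas DataFrame.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def shared_duration(a: List[Interval], b: List[Interval]) -> float:
    """Seconds covered by both lists; each list is assumed non-overlapping."""
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


class SketchDetectionAccumulator:
    """Hypothetical accumulator with a metric(ref, hyp) + report() shape."""

    def __init__(self) -> None:
        self.false_alarm = 0.0
        self.miss = 0.0
        self.total = 0.0

    def __call__(self, reference: List[Interval], hypothesis: List[Interval]) -> float:
        ref_dur = sum(end - start for start, end in reference)
        hyp_dur = sum(end - start for start, end in hypothesis)
        both = shared_duration(reference, hypothesis)
        false_alarm, miss = hyp_dur - both, ref_dur - both
        # Accumulate raw durations so the report is duration-weighted.
        self.false_alarm += false_alarm
        self.miss += miss
        self.total += ref_dur
        return (false_alarm + miss) / ref_dur if ref_dur else 0.0

    def report(self) -> Dict[str, float]:
        scale = 100.0 / self.total if self.total else 0.0
        return {
            "detection error rate %": (self.false_alarm + self.miss) * scale,
            "false alarm %": self.false_alarm * scale,
            "miss %": self.miss * scale,
        }


metric = SketchDetectionAccumulator()
metric([(0.0, 10.0)], [(2.0, 12.0)])  # 2 s missed, 2 s false alarm
metric([(0.0, 10.0)], [(0.0, 10.0)])  # perfect file
summary = metric.report()
print(summary)
```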

Dependencies removed

  • requirements/requirements_asr.txt: removed pyannote.core and pyannote.metrics.
  • examples/voice_agent/environment.yaml: removed pyannote-core==5.0.0,
    pyannote-database==5.1.3, pyannote-metrics==3.2.1.
  • uv.lock: removed the three corresponding [[package]] blocks and every
    transitive { name = "pyannote-..." } entry. TOML structure validated
    after edit.

Tests

  • New tests/collections/speaker_tasks/utils/test_der.py (119 unit tests)
    covering:
    • md-eval engine: basic, collar, overlap, speaker count, UEM
    • score_labels_from_rttm_labels (string-label public API)
    • Multi-file aggregation
    • 21 hardcoded values verified independently against the previous external
      engine implementation (class TestExternalEngineVerifiedValues)
    • Lhotse-backed annotation pipeline end-to-end + bit-exact equivalence
      with the string-label path
    • 7-test TestNoUemAutoUnion regression class pinning the auto-UEM
      behaviour and the NIST collar semantics with hand-derived expected
      values from the diarization tutorial sample
    • Negative test asserting pyannote.core / pyannote.metrics submodules
      are never imported when der / md_eval are imported
  • tests/collections/{asr,speaker_tasks}/utils/test_vad_utils_*.py updated
    to use lhotse-based assertions via a new _annotation_equals(annotation, expected_segments) helper.
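
The negative import test in the list above can be approximated with a sys.modules check. The helper below is a hypothetical sketch, not the test in test_der.py; it uses the stdlib json module as a stand-in for the NeMo modules so that it runs without NeMo installed.

```python
import importlib
import sys
from typing import List


def imported_with_prefix(module_name: str, prefix: str) -> List[str]:
    """Import module_name, then list loaded modules under the given package prefix."""
    importlib.import_module(module_name)
    return sorted(
        name for name in sys.modules
        if name == prefix or name.startswith(prefix + ".")
    )


# Stand-in target: in the PR the imported modules would be der / md_eval.
leaked = imported_with_prefix("json", "pyannote")
print("pyannote modules loaded:", leaked)
```

One caveat with this approach: sys.modules is process-wide, so a module imported earlier by unrelated code would also show up. Running the check in a fresh interpreter (for example via a subprocess) avoids that kind of false positive.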

May 11, 2026: Added further changes that fix the remaining issues.

  • Fixed the problem of the hypothesis overshooting the manifest range.
  • RTTM output is now sorted by onset time.
  • Split cpWER metrics out of der.py into new metrics/cpwer.py, and updated internal callers, tests, and tutorial imports to use the new module.
  • Verified with focused cpWER/DER tests: 153 passed.
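
For the onset-time sorting of RTTM output mentioned above, a minimal sketch could look like this. The function name, the (start, end, speaker) tuple layout, the fixed channel 1 and the speaker tie-break are assumptions for illustration; write_supervisions_to_rttm in this PR works on SupervisionSegment objects and may differ in detail.

```python
import io
from typing import Iterable, TextIO, Tuple

# (start_sec, end_sec, speaker): an assumed, simplified segment record.
Seg = Tuple[float, float, str]


def write_sorted_rttm(segments: Iterable[Seg], file_id: str, handle: TextIO) -> None:
    """Write SPEAKER lines ordered by onset, then by speaker for stable ties."""
    for start, end, speaker in sorted(segments, key=lambda s: (s[0], s[2])):
        duration = end - start
        # RTTM fields: type file chnl tbeg tdur ortho stype name conf slat
        handle.write(
            f"SPEAKER {file_id} 1 {start:.3f} {duration:.3f} <NA> <NA> {speaker} <NA> <NA>\n"
        )


buffer = io.StringIO()
write_sorted_rttm([(3.0, 4.5, "spk1"), (0.5, 2.0, "spk0")], "session_001", buffer)
rttm_lines = buffer.getvalue().splitlines()
print("\n".join(rttm_lines))
```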

Usage

The public API is unchanged, so existing user code continues to work. New
shorthand for users that already have RTTM-style label strings:

from nemo.collections.asr.metrics.der import score_labels_from_rttm_labels
from nemo.collections.asr.parts.utils.speaker_utils import rttm_to_labels
ref_labels = rttm_to_labels("ground_truth.rttm")
hyp_labels = rttm_to_labels("system.rttm")
der_metric, mapping, (DER, CER, FA, MISS) = score_labels_from_rttm_labels(
    ref_labels_list=[("session_001", ref_labels)],
    hyp_labels_list=[("session_001", hyp_labels)],
    collar=0.25,           # NIST half-width: total no-score zone = 0.50s
    ignore_overlap=False,
    verbose=False,
)
print(f"DER = {abs(der_metric):.4f}")

The lhotse-based path (drop-in for previous external-library annotations):

from nemo.collections.asr.metrics.der import score_labels, make_diar_annotation
ref = make_diar_annotation(ref_labels, uniq_name="session_001")
hyp = make_diar_annotation(hyp_labels, uniq_name="session_001")
metric, mapping, errs = score_labels(
    AUDIO_RTTM_MAP={"session_001": {}},
    all_reference=[("session_001", ref)],
    all_hypothesis=[("session_001", hyp)],
    collar=0.25,
    ignore_overlap=False,
    verbose=False,
)
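
The collar=0.25 argument in both examples is a half-width. The sketch below illustrates that reading by building the no-score zones around each reference boundary, plus the conversion to a total-width convention. Both function names are hypothetical, and clamping the zone start at 0.0 is an assumption made for this illustration rather than something stated in the PR.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def no_score_zones(ref_segments: List[Interval], collar: float) -> List[Interval]:
    """Merged no-score intervals of +/- collar around every reference boundary."""
    zones = []
    for start, end in ref_segments:
        for boundary in (start, end):
            zones.append((max(0.0, boundary - collar), boundary + collar))
    zones.sort()
    merged: List[Interval] = []
    for lo, hi in zones:
        if merged and lo <= merged[-1][1]:
            # Adjacent boundaries closer than 2 * collar share one zone.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def to_total_width_collar(half_width_collar: float) -> float:
    """Convert a half-width collar to the total-width convention (collar = 2X)."""
    return 2.0 * half_width_collar


zones = no_score_zones([(1.0, 3.0)], collar=0.25)
print(zones)                        # each zone is 0.50 s wide
print(to_total_width_collar(0.25))  # value to pass to a total-width library
```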

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Removes a maintenance liability: the previous external diarization metric packages are updated infrequently on pip and pull in a large transitive dependency closure (pyannote-database, pyannote-pipeline, ...). After this PR, NeMo's DER pipeline depends only on numpy, scipy, lhotse, and editdistance, all of which are already required.
Backward-compatibility audit: git grep -i pyannote over the branch returns zero matches across Python sources, notebooks, configs, lockfile, docs, and shell scripts. import nemo followed by inspecting sys.modules shows no pyannote.* entries.
Numerical-parity audit: 21 verified-against-the-previous-engine DER values hardcoded in TestExternalEngineVerifiedValues, plus 7 regression tests pinning the auto-UEM and collar semantics with hand-derived expected values from the diarization tutorial sample.

Signed-off-by: taejinp <tango4j@gmail.com>

copy-pr-bot Bot commented Apr 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: tango4j <tango4j@users.noreply.github.com>
Comment thread nemo/collections/asr/metrics/der.py Fixed
tango4j and others added 3 commits April 21, 2026 15:33
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
Collaborator Author

tango4j commented Apr 21, 2026

@pzelasko
Can you scan uv.lock and requirements.txt to check that there are no issues?

Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Collaborator Author

tango4j commented May 12, 2026

@chtruong814
Can you review this? We are deprecating the "pyannote"-related packages, which inevitably touches uv.lock.

Comment thread nemo/collections/common/tokenizers/text_to_speech/tokenizer_utils.py Outdated
Signed-off-by: taejinp <tango4j@gmail.com>
Collaborator Author

tango4j commented May 13, 2026

/ok to test dd3055b

Collaborator Author

tango4j commented May 13, 2026

/ok to test 69301bf

@pzelasko
Collaborator

/ok to test 0d82d77

7 participants