Removing the dependency on Pyannote for Diarization and VAD by tango4j · Pull Request #15632 · NVIDIA-NeMo/NeMo

tango4j · 2026-04-21T22:14:41Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

Removes the pyannote.core and pyannote.metrics dependencies from NeMo's
speaker-diarization stack and replaces them with an in-tree, NIST
md-eval-22.pl-faithful Python engine plus lhotse.SupervisionSegment-based
annotation objects. The public API of nemo.collections.asr.metrics.der is
preserved, including byte-for-byte numerical parity with historical NeMo
diarization results (no shift in published DER numbers).

The Pyannote classes were replaced with Lhotse classes wherever possible, so
that removing the Pyannote imports adds as little new code to the repo as
possible. Everything except the RTTM writing functions was largely replaceable
this way.

Collection: ASR (speaker tasks / diarization, VAD)

Changelog

New: in-tree DER engine (nemo/collections/asr/metrics/md_eval.py)

New module: a Python port of NIST md-eval-22.pl, written in NeMo style
(Apache header, type hints, Google-style docstrings, __all__,
nemo.utils.logging, no CLI). Drives all DER computation.
New DiarizationErrorResult result object exposing the dict-like interface
used throughout NeMo (abs(result), result['total' | 'confusion' | 'false alarm' | 'missed detection'], result.results_,
result.optimal_mapping(...), result.report()).
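
As a rough illustration of the dict-like interface listed above, the sketch below shows how abs(result) and result[...] could be backed by accumulated durations. DiarResultSketch is a hypothetical stand-in, not the NeMo DiarizationErrorResult class, and the assumption that abs() returns (confusion + false alarm + missed detection) / total is mine rather than something stated in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class DiarResultSketch:
    """Hypothetical stand-in for a dict-like DER result (not the NeMo class)."""

    # Accumulated durations in seconds, keyed by the component names from the PR.
    components: dict = field(default_factory=dict)

    def __getitem__(self, key: str) -> float:
        # Enables result['total'], result['confusion'], and so on.
        return self.components[key]

    def __abs__(self) -> float:
        # Assumed definition: DER = (confusion + false alarm + miss) / total.
        total = self.components["total"]
        if total == 0:
            return 0.0
        errors = (
            self.components["confusion"]
            + self.components["false alarm"]
            + self.components["missed detection"]
        )
        return errors / total


result = DiarResultSketch(
    {"total": 100.0, "confusion": 3.0, "false alarm": 2.0, "missed detection": 5.0}
)
der = abs(result)  # 10 seconds of error over 100 seconds of reference speech
```

Keeping the raw components on the object lets callers read either the ratio or the individual durations from one result.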

nemo/collections/asr/metrics/der.py (DER public API)

score_labels, evaluate_der, score_labels_from_rttm_labels,
get_partial_ref_labels, get_online_DER_stats, calculate_session_cpWER,
calculate_session_cpWER_bruteforce, concat_perm_word_error_rate are
all preserved with their original names, signatures, and return shapes.
No breaking changes for downstream callers.
New lhotse-backed annotation helpers (replacements for the previous
pyannote.core types):
- make_diar_segment(start, end, speaker, ...) -> SupervisionSegment
- make_diar_annotation(labels, uniq_name=...) -> list[SupervisionSegment]
- make_uem_timeline(uem_lines, uniq_id=...) -> list[SupervisionSegment]
  (UEM regions carried as supervisions with speaker="UEM")
- unique_speakers(annotation) -> list[str]
- write_supervisions_to_rttm(annotation, file_handle, ...)
New score_labels_from_rttm_labels(...) convenience entry point that takes
raw "start end speaker" label strings (no annotation object construction
required by the caller).
New _default_uem_from_ref_sys(ref_data, sys_data) helper. When a caller
does not supply a UEM, the high-level wrappers now auto-derive
[min(ref ∪ sys TBEG), max(ref ∪ sys TEND)] per (file_id, channel) and
pass it to evaluate(). This matches the historical no-UEM scoring map
used by the previous external engine and prevents any over-shoot of the
hypothesis past the last reference segment from being silently dropped.
md_eval.evaluate() itself remains a faithful NIST port (ref-extent only)
for power users that call it directly.
The docstring on the collar argument in both score_labels and
score_labels_from_rttm_labels clarifies the NIST half-width semantics
(total no-score zone = 2 * collar) and gives the cross-engine conversion
rule: NeMo collar=X corresponds to collar=2X in external libraries that
define the collar as the total width.
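
The auto-derived UEM described above for _default_uem_from_ref_sys can be sketched in a few lines. This is a minimal illustration, not the in-tree helper: the function name default_uem_from_ref_sys and the (file_id, channel, tbeg, tend, speaker) tuple layout are assumptions made for the example.

```python
def default_uem_from_ref_sys(ref_segments, sys_segments):
    """Return {(file_id, channel): (min TBEG, max TEND)} over ref and sys.

    Each segment is a hypothetical (file_id, channel, tbeg, tend, speaker) tuple.
    """
    bounds = {}
    for file_id, channel, tbeg, tend, _speaker in list(ref_segments) + list(sys_segments):
        key = (file_id, channel)
        if key not in bounds:
            bounds[key] = (tbeg, tend)
        else:
            lo, hi = bounds[key]
            # Widen the scored region to cover this segment as well.
            bounds[key] = (min(lo, tbeg), max(hi, tend))
    return bounds


ref = [("session_001", "1", 0.5, 4.0, "A"), ("session_001", "1", 4.0, 9.0, "B")]
hyp = [("session_001", "1", 0.0, 4.2, "spk0"), ("session_001", "1", 4.2, 10.5, "spk1")]

uem = default_uem_from_ref_sys(ref, hyp)
```

In this toy input the reference ends at 9.0 s while the hypothesis runs to 10.5 s. A ref-extent-only scoring map would stop at 9.0 s, whereas the union keeps the 9.0 to 10.5 s tail inside the scored region.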

Source code rename / scrub (no behaviour change)

nemo/collections/asr/parts/utils/speaker_utils.py:
- labels_to_pyannote_object -> labels_to_supervisions
- timestamps_to_pyannote_object -> timestamps_to_supervisions
- now returns list[SupervisionSegment]
nemo/collections/asr/parts/utils/vad_utils.py:
- vad_construct_pyannote_object_per_file -> vad_construct_supervisions_per_file
- frame_vad_construct_pyannote_object_per_file -> frame_vad_construct_supervisions_per_file
- read_rttm_as_pyannote_object -> read_rttm_as_supervisions
- new internal _DetectionErrorRateAccumulator class replaces
  pyannote.metrics.detection.DetectionErrorRate, backed by md_eval. It
  preserves the metric(reference, hypothesis) accumulation +
  metric.report(display=False) API and returns a pandas DataFrame with
  the same ('detection error rate', '%'), ('false alarm', '%'),
  ('miss', '%') columns that downstream code consumes.
scripts/speaker_tasks/eval_diar_with_asr.py:
- get_pyannote_objs_from_rttms -> get_supervisions_from_rttms
examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py:
- call sites updated to the new timestamps_to_supervisions name
All docstrings, comments, and reference URLs that mentioned the third-party
package by name have been rewritten (or replaced with neutral wording such
as "External Annotation Library") so a git grep -i pyannote over the
branch returns zero matches.
Two tutorial notebooks (tutorials/speaker_tasks/End_to_End_Diarization_*.ipynb,
tutorials/tools/Multispeaker_Simulator.ipynb) and the inference notebook
were updated to use the new names and score_labels_from_rttm_labels.
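
The call-then-report pattern of the _DetectionErrorRateAccumulator described above for vad_utils.py can be illustrated with a small pure-Python sketch. It is not the in-tree class: the name DetectionErrorAccumulatorSketch is hypothetical, it applies no collar, it works on plain (start, end) speech intervals, and report() returns a dict of percentages rather than the pandas DataFrame the real class produces.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals and return them sorted."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duration(intervals):
    """Total length of a merged interval list."""
    return sum(end - start for start, end in intervals)


def overlap(a, b):
    """Total duration shared by two merged interval lists."""
    shared = 0.0
    for a_start, a_end in a:
        for b_start, b_end in b:
            shared += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return shared


class DetectionErrorAccumulatorSketch:
    """Hypothetical accumulator: call it per file, then ask for a report."""

    def __init__(self):
        self.total = 0.0
        self.false_alarm = 0.0
        self.miss = 0.0

    def __call__(self, reference, hypothesis):
        ref, hyp = merge(reference), merge(hypothesis)
        shared = overlap(ref, hyp)
        self.total += duration(ref)
        self.miss += duration(ref) - shared  # reference speech not detected
        self.false_alarm += duration(hyp) - shared  # detected speech not in reference

    def report(self):
        scale = 100.0 / self.total if self.total else 0.0
        return {
            "detection error rate": (self.false_alarm + self.miss) * scale,
            "false alarm": self.false_alarm * scale,
            "miss": self.miss * scale,
        }


metric = DetectionErrorAccumulatorSketch()
metric([(0.0, 4.0), (6.0, 10.0)], [(1.0, 7.0)])
summary = metric.report()
```

Here 8 s of reference speech overlaps the hypothesis for 4 s, so 4 s are missed and 2 s are false alarm, giving a detection error rate of 75%.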

Dependencies removed

requirements/requirements_asr.txt: removed pyannote.core and pyannote.metrics.
examples/voice_agent/environment.yaml: removed pyannote-core==5.0.0,
pyannote-database==5.1.3, pyannote-metrics==3.2.1.
uv.lock: removed the three corresponding [[package]] blocks and every
transitive { name = "pyannote-..." } entry. TOML structure validated
after edit.

Tests

New tests/collections/speaker_tasks/utils/test_der.py (119 unit tests)
covering:
- md-eval engine: basic, collar, overlap, speaker count, UEM
- score_labels_from_rttm_labels (string-label public API)
- Multi-file aggregation
- 21 hardcoded values verified independently against the previous external
  engine implementation (class TestExternalEngineVerifiedValues)
- Lhotse-backed annotation pipeline end-to-end + bit-exact equivalence
  with the string-label path
- 7-test TestNoUemAutoUnion regression class pinning the auto-UEM
  behaviour and the NIST collar semantics with hand-derived expected
  values from the diarization tutorial sample
- Negative test asserting pyannote.core / pyannote.metrics submodules
  are never imported when der / md_eval are imported
tests/collections/{asr,speaker_tasks}/utils/test_vad_utils_*.py updated
to use lhotse-based assertions via a new _annotation_equals(annotation, expected_segments) helper.
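
The negative import test listed above boils down to inspecting sys.modules after the modules under test have been imported. A minimal sketch of that check follows; imported_under is a hypothetical helper, not a function from the test suite, and the example uses a stand-in module table so that it runs without NeMo installed.

```python
import sys


def imported_under(prefix, modules=None):
    """Return sorted names of loaded modules equal to or nested under `prefix`."""
    modules = sys.modules if modules is None else modules
    return sorted(
        name for name in modules if name == prefix or name.startswith(prefix + ".")
    )


# Stand-in module tables; a real test would inspect sys.modules directly.
fake_modules = {"json": None, "pyannote": None, "pyannote.core": None, "numpy": None}
leaked = imported_under("pyannote", fake_modules)
clean = imported_under("pyannote", {"json": None, "numpy": None})
```

In an actual test one would first import nemo.collections.asr.metrics.der and md_eval, ideally in a fresh interpreter so earlier imports cannot mask a leak, and then assert that imported_under("pyannote") is empty.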

May/11/2026: Added more changes that fix remaining issues.

Fixed the problem of hypothesis overshooting the manifest range.
RTTM output is now sorted by onset time.
Split cpWER metrics out of der.py into new metrics/cpwer.py, and updated internal callers, tests, and tutorial imports to use the new module.
Verified with focused cpWER/DER tests: 153 passed.
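
The onset-time sorting of RTTM output mentioned above can be illustrated with a small writer for "start end speaker" label strings. This is a sketch under assumptions, not write_supervisions_to_rttm: the helper name write_labels_to_rttm, the fixed channel "1", and the three-decimal formatting are choices made for the example.

```python
import io


def write_labels_to_rttm(labels, session_id, file_handle):
    """Write "start end speaker" labels as RTTM SPEAKER lines, sorted by onset."""
    parsed = []
    for label in labels:
        start, end, speaker = label.split()
        parsed.append((float(start), float(end), speaker))
    # Tuples sort by start time first, so lines come out in onset order.
    for start, end, speaker in sorted(parsed):
        file_handle.write(
            f"SPEAKER {session_id} 1 {start:.3f} {end - start:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>\n"
        )


buffer = io.StringIO()
write_labels_to_rttm(["2.5 4.0 spk1", "0.0 2.5 spk0"], "session_001", buffer)
rttm_text = buffer.getvalue()
```

RTTM stores onset and duration rather than onset and end time, which is why the fifth field is end - start.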

Usage

The public API is unchanged, so existing user code continues to work. New
shorthand for users that already have RTTM-style label strings:

from nemo.collections.asr.metrics.der import score_labels_from_rttm_labels
from nemo.collections.asr.parts.utils.speaker_utils import rttm_to_labels
ref_labels = rttm_to_labels("ground_truth.rttm")
hyp_labels = rttm_to_labels("system.rttm")
der_metric, mapping, (DER, CER, FA, MISS) = score_labels_from_rttm_labels(
    ref_labels_list=[("session_001", ref_labels)],
    hyp_labels_list=[("session_001", hyp_labels)],
    collar=0.25,           # NIST half-width: total no-score zone = 0.50s
    ignore_overlap=False,
    verbose=False,
)
print(f"DER = {abs(der_metric):.4f}")

The lhotse-based path (drop-in for previous external-library annotations):

from nemo.collections.asr.metrics.der import score_labels, make_diar_annotation
ref = make_diar_annotation(ref_labels, uniq_name="session_001")
hyp = make_diar_annotation(hyp_labels, uniq_name="session_001")
metric, mapping, errs = score_labels(
    AUDIO_RTTM_MAP={"session_001": {}},
    all_reference=[("session_001", ref)],
    all_hypothesis=[("session_001", hyp)],
    collar=0.25,
    ignore_overlap=False,
    verbose=False,
)
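
Both calls above pass collar=0.25, which under the NIST half-width convention excludes 0.25 s on each side of every reference boundary. The sketch below makes that arithmetic concrete. The helpers no_score_zones and to_total_width_collar are hypothetical names for illustration and are not part of the NeMo API.

```python
def no_score_zones(ref_segments, collar):
    """Return merged (start, end) no-score zones around reference boundaries.

    `collar` is the half-width: each boundary t excludes [t - collar, t + collar].
    """
    zones = []
    for start, end in ref_segments:
        for boundary in (start, end):
            zones.append((boundary - collar, boundary + collar))
    merged = []
    for lo, hi in sorted(zones):
        if merged and lo <= merged[-1][1]:
            # Adjacent zones that touch or overlap collapse into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def to_total_width_collar(nemo_collar):
    """Convert a half-width collar to a total-width convention (collar=2X)."""
    return 2.0 * nemo_collar


zones = no_score_zones([(1.0, 3.0), (5.0, 8.0)], collar=0.25)
total_width = to_total_width_collar(0.25)
```

Each of the four boundaries yields a 0.50 s zone, which is the value to pass to a library that defines the collar as the total width.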

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Removes a maintenance liability: the previous external diarization metric packages are updated infrequently on pip and pull in a large transitive closure (pyannote-database, pyannote-pipeline, ...). After this PR, NeMo's DER pipeline depends only on numpy, scipy, lhotse, and editdistance -- all already required.
Backward-compatibility audit: git grep -i pyannote over the branch returns zero matches across Python sources, notebooks, configs, lockfile, docs, and shell scripts. import nemo followed by inspecting sys.modules shows no pyannote.* entries.
Numerical-parity audit: 21 DER values verified against the previous engine are hardcoded in TestExternalEngineVerifiedValues, and 7 regression tests pin the auto-UEM and collar semantics with hand-derived expected values from the diarization tutorial sample.