Add generic audio processing pipeline for ASR and TTS data preparation by sushmitha-deva-09 · Pull Request #1679 · NVIDIA-NeMo/Curator

sushmitha-deva-09 · 2026-03-30T15:41:51Z

Description

Adds a complete audio tagging pipeline that processes raw audio into labeled segments suitable for ASR and TTS training. The pipeline includes audio resampling, speaker diarization (PyAnnote), audio splitting with ASR alignment (NeMo), and segment merging.
Introduces new inference stages for speaker diarization (PyAnnoteDiarizationStage) and voice activity detection (WhisperXVADStage).
Adds comprehensive unit tests for all pipeline stages, end-to-end tests for the TTS pipeline, a benchmarking script, and a tutorial with YAML configs.

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

…_generic

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

copy-pr-bot · 2026-03-30T15:41:56Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-03-30T15:50:00Z

Greptile Summary

This PR adds a complete audio tagging pipeline for ASR and TTS data preparation, including resampling, speaker diarization (PyAnnote), ASR alignment (NeMo), and segment merging stages. Several issues flagged in previous review rounds have been addressed (hash-based filename deduplication, soundfile.info() for duration reads, env-var restore pattern for TORCH weights loading, count moved outside the manifest loop, split data-loss fix), but a few P1 regressions remain open from prior threads.

InferenceAsrNemoStage.process() (asr_nemo.py:115) still raises NotImplementedError unconditionally, breaking any single-task dispatch path.
PreserveByValueStage.process() (common.py:111) still raises NotImplementedError unconditionally, same issue.
overlaps in pyannote.py is still computed before diarization.crop(), so spurious PyAnnote tracks beyond the audio boundary can cause legitimate end-of-file speaker turns to be silently excluded.

Confidence Score: 3/5

Not safe to merge — three P1 issues from previous review rounds remain unaddressed and will cause runtime failures on single-task executor paths.

Good progress resolving prior P1s (env-var scoping, filename collisions, data-loss in last chunk, task_id uniqueness, soundfile.info, batch_size passthrough). However, InferenceAsrNemoStage.process() and PreserveByValueStage.process() still raise NotImplementedError unconditionally, and the stale overlaps-before-crop ordering in pyannote.py can silently exclude valid speaker turns. Three confirmed P1s remain.

nemo_curator/stages/audio/inference/asr/asr_nemo.py (line 115), nemo_curator/stages/audio/common.py (line 111), nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py (overlaps before crop)
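
The env-var scoping fix credited above (save the old value, set the new one, restore in a try/finally block) can be sketched as a small context manager. This is an illustration only, not the PR's code: the helper name `temporary_env` and the variable name `EXAMPLE_TORCH_FLAG` used in the usage note are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name: str, value: str):
    """Set an environment variable for the duration of a block, then restore it."""
    previous = os.environ.get(name)  # None means the variable was unset before
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state even if the body raised.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Wrapping a model load in `with temporary_env("EXAMPLE_TORCH_FLAG", "1"):` keeps the override from leaking into later stages that run in the same process.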

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/audio/common.py | Many improvements merged (get_audio_duration now uses soundfile.info, ManifestReaderStage task_id uniqueness fixed, ManifestWriterStage setup_on_node no longer truncates). PreserveByValueStage.process() still raises NotImplementedError unconditionally. |
| nemo_curator/stages/audio/inference/asr/asr_nemo.py | InferenceAsrNemoStage.process() at line 115 still raises NotImplementedError unconditionally, so single-task dispatch paths will always fail. |
| nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py | has_overlap boundary conditions now use strict comparisons (fixed). overlaps list is still computed before diarization.crop(), meaning spurious out-of-bounds tracks can incorrectly exclude legitimate end-of-file turns. |
| nemo_curator/stages/audio/inference/vad/whisperx_vad.py | All previously flagged issues resolved: env-var now saved/restored in a try/finally block, setup_on_node uses device=cpu, segments_key is always written (empty list for short audio). |
| nemo_curator/stages/audio/tagging/resample_audio.py | Filename collision fixed (sha256 hash suffix). sox/ffmpeg issue fixed (only checks ffmpeg now). os.path.join for cloud resampled_audio_dir still present but constrained to local paths by documentation. |
| nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py | process() now correctly delegates to process_batch (fixed). _asr_model and _override_cfg now use field() consistently (fixed). No new issues found. |
| nemo_curator/stages/audio/tagging/split.py | Last-chunk data loss fixed (remaining_frames > min_len * sr). min_pause_len removed. get_split_points still lacks defensive sort on segments (previous thread). JoinSplitAudioMetadataStage missing default values for text/alignment when split_metadata is empty (previous thread). |
| nemo_curator/config/run.py | batch_size is no longer popped before instantiation (only resources is removed), fixing the silent YAML value discard for NeMoASRAlignerStage and SplitASRAlignJoinStage. |
| nemo_curator/stages/audio/tagging/merge_alignment_diarization.py | Words falling in diarization gaps are now logged at debug level. Behavior (word still discarded) is unchanged from prior thread, but the silent data loss is now observable via logging. |
| nemo_curator/stages/audio/tagging/utils.py | add_non_speaker_segments now sorts input segments before iterating (fixes the unsorted assumption from prior thread). Looks correct. |
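
The filename-collision fix noted for resample_audio.py (a sha256 hash suffix) can be illustrated with a short sketch. The function name `resampled_name`, the 8-character suffix length, and the `.wav` extension are assumptions for illustration; the PR's actual naming scheme may differ.

```python
import hashlib
from pathlib import Path


def resampled_name(src_path: str, suffix_len: int = 8, ext: str = ".wav") -> str:
    """Build an output filename that stays distinct across source directories."""
    # Hash the full source path so two files with the same stem do not collide.
    digest = hashlib.sha256(src_path.encode("utf-8")).hexdigest()[:suffix_len]
    return f"{Path(src_path).stem}_{digest}{ext}"
```

Two inputs such as `/data/a/clip.flac` and `/data/b/clip.flac` share the stem `clip` but hash to different suffixes, so writing both into one output directory does not overwrite either file.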

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ManifestReader\nFilePartitioningStage + ManifestReaderStage] --> B[ResampleAudioStage\nffmpeg resampling with hash-based dedup]
    B --> C[GetAudioDurationStage\nsoundfile.info header-only read]
    C --> D[PyAnnoteDiarizationStage\nPyAnnote speaker diarization + WhisperX VAD]
    C --> E[WhisperXVADStage\nVAD-only path]
    D --> F[SplitASRAlignJoinStage]
    E --> F
    F --> F1[SplitLongAudioStage\nsplit at natural pauses]
    F1 --> F2[NeMoASRAlignerStage\nNeMo forced alignment / transcription]
    F2 --> F3[JoinSplitAudioMetadataStage\nadjust timestamps + concatenate transcripts]
    F3 --> G[MergeAlignmentDiarizationStage\nalign words to speaker segments]
    G --> H[PreserveByValueStage\nfilter by value predicate]
    H --> I[ManifestWriterStage\nappend JSONL output]

Reviews (54): Last reviewed commit: "Merge branch 'main' into audio_core"

sushmitha-deva-09 · 2026-04-22T09:09:41Z

/ok to test 57822fc

sushmitha-deva-09 · 2026-04-22T11:39:23Z

/ok to test d3ea060

sushmitha-deva-09 · 2026-04-22T17:53:22Z

/ok to test 449fcb2

sushmitha-deva-09 · 2026-04-22T18:17:45Z

/ok to test c1bb7b3

greptile-apps · 2026-04-22T18:20:35Z

+        splits_joined = 0
+        words_aligned = 0
+
+        # Check if this is a meta-entry with split information
+        if "split_filepaths" in data_entry:
+            if data_entry["split_filepaths"] is None:
+                del data_entry["split_filepaths"]
+            else:


outputs() contract broken when split_metadata is empty

outputs() declares ["text", "alignment"] as guaranteed output keys, but _join_split_metadata returns early without setting either key when split_metadata is empty:

    if not split_metadata:
        del meta_entry["split_filepaths"]
        return  # text and alignment are never set

Any downstream stage (e.g. MergeAlignmentDiarizationStage) that reads alignment from the task will receive a KeyError for these entries. At minimum, set empty defaults before returning:

    if not split_metadata:
        del meta_entry["split_filepaths"]
        meta_entry.setdefault("text", "")
        meta_entry.setdefault("alignment", [])
        return
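
A self-contained sketch of the same idea follows, so the contract can be checked in isolation. The function `join_split_metadata` and the shape of each split entry (a dict with `text` and `alignment`) are assumptions for illustration; the real `_join_split_metadata` also adjusts timestamps, which is omitted here.

```python
def join_split_metadata(meta_entry: dict, split_metadata: list[dict]) -> dict:
    """Merge per-split results, always leaving 'text' and 'alignment' set."""
    meta_entry.pop("split_filepaths", None)
    if not split_metadata:
        # Honor the outputs() contract even when there is nothing to join.
        meta_entry.setdefault("text", "")
        meta_entry.setdefault("alignment", [])
        return meta_entry
    # Concatenate non-empty transcripts and flatten the word alignments.
    meta_entry["text"] = " ".join(s["text"] for s in split_metadata if s["text"])
    meta_entry["alignment"] = [w for s in split_metadata for w in s["alignment"]]
    return meta_entry
```

With the defaults in place, a downstream consumer can read `entry["alignment"]` unconditionally and simply iterate over an empty list for entries that had no splits.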

sushmitha-deva-09 · 2026-04-23T02:22:27Z

/ok to test 9a1e3e4

greptile-apps · 2026-04-23T02:43:03Z

+        t0 = time.perf_counter()
        for task in tasks:
            if not self.validate_input(task):


process() raises unconditionally, breaking single-task dispatch

InferenceAsrNemoStage.process() raises NotImplementedError for every single-task invocation. Any executor path that dispatches tasks one-by-one — testing without batching, fallback paths, or non-batch executors — will fail for every task. NeMoASRAlignerStage (in the same PR) now correctly delegates to process_batch, and the same fix applies here:

Suggested change

-        t0 = time.perf_counter()
-        for task in tasks:
-            if not self.validate_input(task):
+    def process(self, task: AudioTask) -> AudioTask:
+        results = self.process_batch([task])
+        return results[0] if results else task
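
The delegation pattern in the suggestion can be exercised on its own with stand-in types. Both `AudioTask` (reduced here to a bare data holder) and `ExampleBatchStage` are hypothetical simplifications, not the classes from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class AudioTask:
    """Stand-in for the PR's AudioTask; it only carries a data dict here."""
    data: dict = field(default_factory=dict)


class ExampleBatchStage:
    """Hypothetical stage whose real work lives in process_batch."""

    def process_batch(self, tasks: list[AudioTask]) -> list[AudioTask]:
        for task in tasks:
            task.data["processed"] = True  # placeholder for batched inference
        return tasks

    def process(self, task: AudioTask) -> AudioTask:
        # Delegate so single-task executors share the batch code path.
        results = self.process_batch([task])
        return results[0] if results else task
```

Routing `process` through `process_batch` keeps one implementation of the inference logic, so executors that dispatch one task at a time behave the same as batched ones.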

greptile-apps · 2026-04-23T02:43:08Z

+            elif "speaker_id" in data_entry:
+                speaker_id = data_entry["speaker_id"] + "_" + speaker
+            elif self.audio_filepath_key in data_entry:
+                speaker_id = Path(data_entry[self.audio_filepath_key]).stem + "_" + speaker


overlaps computed before diarization.crop() — stale entries can exclude legitimate end-of-file turns

overlaps is extracted from the uncropped annotation. PyAnnote may emit spurious tracks beyond the audio boundary (the exact bug being fixed by diarization.crop()), and those tracks can produce spurious overlap segments. A valid speaker turn near the end of the file can then be matched by has_overlap check 4 (overlap.start < turn.start and overlap.end > turn.end), silently placing it in overlap_segments instead of segments.

Move the overlap extraction to after the crop:

    # Crop to audio length first
    diarization = diarization.crop(Segment(0, len(s[0]) / fs))
    # Recompute overlaps on the cropped annotation
    overlaps = diarization.get_overlap().segments_list_
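
Why the ordering matters can be shown with a toy model that uses plain `(start, end)` tuples instead of PyAnnote objects. Everything here is an assumption for illustration: `crop`, `overlaps`, and `strictly_contains` are hypothetical helpers, overlap extraction is simplified to pairwise intersections, and only the quoted check 4 of has_overlap is modeled.

```python
from itertools import combinations

Span = tuple[float, float]


def crop(tracks: list[Span], duration: float) -> list[Span]:
    """Clip each track to [0, duration] and drop tracks that end up empty."""
    clipped = [(max(0.0, s), min(duration, e)) for s, e in tracks]
    return [(s, e) for s, e in clipped if e > s]


def overlaps(tracks: list[Span]) -> list[Span]:
    """Pairwise intersections of tracks (a simplified overlap extraction)."""
    out = []
    for (s1, e1), (s2, e2) in combinations(tracks, 2):
        start, end = max(s1, s2), min(e1, e2)
        if end > start:
            out.append((start, end))
    return out


def strictly_contains(overlap: Span, turn: Span) -> bool:
    """The 'check 4' condition: the overlap strictly surrounds the turn."""
    return overlap[0] < turn[0] and overlap[1] > turn[1]
```

For a 10-second file with two tracks `(7.0, 12.0)` and `(7.5, 11.0)` that run past the end, the overlap computed before cropping is `(7.5, 11.0)`, which strictly contains a turn at `(8.0, 10.0)`. Computed after cropping, the overlap becomes `(7.5, 10.0)` and check 4 no longer fires for that turn.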

sushmitha-deva-09 · 2026-04-24T06:42:08Z

/ok to test f8a3c7c

sushmitha-deva-09 · 2026-04-24T21:16:24Z

/ok to test 00409b3

sushmitha-deva-09 · 2026-04-24T22:15:52Z

/ok to test dcd291c

sarahyurick · 2026-04-24T22:49:13Z

/ok to test d8951b8

sushmitha-deva-09 added 19 commits February 25, 2026 15:31

Update pyptoject.toml

dc9c4e0

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge branch 'main' of https://github.com/NVIDIA-NeMo/Curator into yt…

34db504

…_generic

Add generic audio tagging pipeline

3c4e97e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update configs and benchmarking scripts

e90ec75

Rename files and use common get duration method

6f0276f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix formatting

e4fa6de

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix minor bugs

cf93c3d

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update random usage in pyannote.py

db22329

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update get duration method

6f88859

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix minor issues

0dad1e1

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

44cf69d

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Add inputs and outputs methods to all stages

79f9ccc

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix ruff check

b90875d

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

f47192e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix bug prepare segments

47baa5e

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

498e740

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

04a4b99

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

AudioBatch to AudioTask migration

2debd02

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Remove unwanted stages

c5152c5

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sushmitha-deva-09 requested review from a team as code owners March 30, 2026 15:41

sushmitha-deva-09 requested review from weijiac0619 and removed request for a team March 30, 2026 15:41

sushmitha-deva-09 changed the title ~~Audio core~~ Add generic audio processing pipeline for ASR and TTS data preparation Mar 30, 2026

greptile-apps Bot reviewed Mar 30, 2026

sushmitha-deva-09 added 2 commits March 30, 2026 21:57

Remove metric stages

ce1f3ba

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Update scripts

65f49ab

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sushmitha-deva-09 mentioned this pull request Mar 30, 2026

Add generic audio tagging pipeline for ASR and TTS data preparation #1602

Open

3 tasks

Add torchcodec constraints

57822fc

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Keep upper bounds on torch overrides

d3ea060

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

449fcb2

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

greptile-apps Bot reviewed Apr 22, 2026

Comment thread nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py

Revert deleted files

c1bb7b3

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

greptile-apps Bot reviewed Apr 22, 2026

Merge with main

9a1e3e4

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

greptile-apps Bot reviewed Apr 23, 2026

mohammadaaftabv approved these changes Apr 23, 2026

ayushdg previously requested changes Apr 23, 2026

sushmitha-deva-09 added 3 commits April 24, 2026 11:44

Remove sox dependency

08bf777

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

091668f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Fix import

f8a3c7c

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sarahyurick reviewed Apr 24, 2026

Comment thread tests/stages/audio/test_common.py Outdated

Comment thread pyproject.toml

Comment thread pyproject.toml

Comment thread pyproject.toml Outdated

sushmitha-deva-09 added 3 commits April 25, 2026 02:37

Update lock file

f947e5f

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Make top level import

4502c01

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge with main

00409b3

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

sarahyurick approved these changes Apr 24, 2026

Remove torch upper bound

dcd291c

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>

Merge branch 'main' into audio_core

d8951b8

ayushdg mentioned this pull request Apr 27, 2026

Regenerate uv.lock from prev version from main #1875

Merged

3 tasks


6 participants
