
Add generic audio processing pipeline for ASR and TTS data preparation #1679

Merged
sarahyurick merged 81 commits into NVIDIA-NeMo:main from sushmitha-deva-09:audio_core
Apr 24, 2026

Conversation

@sushmitha-deva-09
Contributor

@sushmitha-deva-09 sushmitha-deva-09 commented Mar 30, 2026

Description

  • Adds a complete audio tagging pipeline that processes raw audio into labeled segments suitable for ASR and TTS training. The pipeline includes audio resampling, speaker diarization (PyAnnote), audio splitting with ASR alignment (NeMo), and segment merging.
  • Introduces new inference stages for speaker diarization (PyAnnoteDiarizationStage) and voice activity detection (WhisperXVADStage).
  • Adds comprehensive unit tests for all pipeline stages, end-to-end tests for the TTS pipeline, a benchmarking script, and a tutorial with YAML configs.

Checklist

  • I am familiar with the Contributing Guide.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
@sushmitha-deva-09 sushmitha-deva-09 requested review from a team as code owners March 30, 2026 15:41
@sushmitha-deva-09 sushmitha-deva-09 requested review from weijiac0619 and removed request for a team March 30, 2026 15:41
@copy-pr-bot

copy-pr-bot Bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sushmitha-deva-09 sushmitha-deva-09 changed the title from "Audio core" to "Add generic audio processing pipeline for ASR and TTS data preparation" Mar 30, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 30, 2026

Greptile Summary

This PR adds a complete audio tagging pipeline for ASR and TTS data preparation, including resampling, speaker diarization (PyAnnote), ASR alignment (NeMo), and segment merging stages. Several issues flagged in previous review rounds have been addressed (hash-based filename deduplication, soundfile.info() for duration reads, env-var restore pattern for TORCH weights loading, count moved outside the manifest loop, split data-loss fix), but a few P1 regressions remain open from prior threads.

  • InferenceAsrNemoStage.process() (asr_nemo.py:115) still raises NotImplementedError unconditionally, breaking any single-task dispatch path.
  • PreserveByValueStage.process() (common.py:111) still raises NotImplementedError unconditionally, same issue.
  • overlaps in pyannote.py is still computed before diarization.crop(), so spurious PyAnnote tracks beyond the audio boundary can cause legitimate end-of-file speaker turns to be silently excluded.

Confidence Score: 3/5

Not safe to merge — three P1 issues from previous review rounds remain unaddressed and will cause runtime failures on single-task executor paths.

Good progress resolving prior P1s (env-var scoping, filename collisions, data-loss in last chunk, task_id uniqueness, soundfile.info, batch_size passthrough). However, InferenceAsrNemoStage.process() and PreserveByValueStage.process() still raise NotImplementedError unconditionally, and the stale overlaps-before-crop ordering in pyannote.py can silently exclude valid speaker turns. Three confirmed P1s remain.

nemo_curator/stages/audio/inference/asr/asr_nemo.py (line 115), nemo_curator/stages/audio/common.py (line 111), nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py (overlaps before crop)
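
The env-var save/restore pattern credited above as resolved can be sketched as a small context manager. This is a minimal illustration, not the code in the PR: the helper name `temporary_env` and the variable name `EXAMPLE_WEIGHTS_FLAG` are hypothetical, since the summary does not name the actual variable.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for one block, then restore the prior state."""
    previous = os.environ.get(name)  # None means the variable was unset before
    os.environ[name] = value
    try:
        yield
    finally:
        # Runs even if the body raises, so the override never leaks.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous


# Hypothetical variable name, used only for illustration.
os.environ.pop("EXAMPLE_WEIGHTS_FLAG", None)
with temporary_env("EXAMPLE_WEIGHTS_FLAG", "1"):
    inside = os.environ["EXAMPLE_WEIGHTS_FLAG"]
after = os.environ.get("EXAMPLE_WEIGHTS_FLAG")
```

Putting the restore in `finally` is what keeps the override scoped to the model-loading call; a plain set-then-unset would leave the variable changed for the rest of the process if loading failed.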

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/audio/common.py | Many improvements merged (get_audio_duration now uses soundfile.info, ManifestReaderStage task_id uniqueness fixed, ManifestWriterStage setup_on_node no longer truncates). PreserveByValueStage.process() still raises NotImplementedError unconditionally. |
| nemo_curator/stages/audio/inference/asr/asr_nemo.py | InferenceAsrNemoStage.process() at line 115 still raises NotImplementedError unconditionally, so single-task dispatch paths will always fail. |
| nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py | has_overlap boundary conditions now use strict comparisons (fixed). overlaps list is still computed before diarization.crop(), meaning spurious out-of-bounds tracks can incorrectly exclude legitimate end-of-file turns. |
| nemo_curator/stages/audio/inference/vad/whisperx_vad.py | All previously flagged issues resolved: env-var now saved/restored in a try/finally block, setup_on_node uses device=cpu, segments_key is always written (empty list for short audio). |
| nemo_curator/stages/audio/tagging/resample_audio.py | Filename collision fixed (sha256 hash suffix). sox/ffmpeg issue fixed (only checks ffmpeg now). os.path.join for cloud resampled_audio_dir still present but constrained to local paths by documentation. |
| nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py | process() now correctly delegates to process_batch (fixed). _asr_model and _override_cfg now use field() consistently (fixed). No new issues found. |
| nemo_curator/stages/audio/tagging/split.py | Last-chunk data loss fixed (remaining_frames > min_len * sr). min_pause_len removed. get_split_points still lacks defensive sort on segments (previous thread). JoinSplitAudioMetadataStage missing default values for text/alignment when split_metadata is empty (previous thread). |
| nemo_curator/config/run.py | batch_size is no longer popped before instantiation (only resources is removed), fixing the silent YAML value discard for NeMoASRAlignerStage and SplitASRAlignJoinStage. |
| nemo_curator/stages/audio/tagging/merge_alignment_diarization.py | Words falling in diarization gaps are now logged at debug level. Behavior (word still discarded) is unchanged from prior thread, but the silent data loss is now observable via logging. |
| nemo_curator/stages/audio/tagging/utils.py | add_non_speaker_segments now sorts input segments before iterating (fixes the unsorted assumption from prior thread). Looks correct. |
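
The filename-collision fix noted for resample_audio.py (a sha256 hash suffix) can be illustrated roughly as follows. This is a sketch under assumptions: the function name `resampled_name`, the choice to hash the full source path, and the 8-character suffix length are all hypothetical, because the review does not show what the stage actually hashes.

```python
import hashlib
from pathlib import Path


def resampled_name(source_path: str, suffix_len: int = 8, ext: str = ".wav") -> str:
    """Build an output filename that stays distinct for same-named inputs.

    Assumption: the digest is taken over the full source path, so two files
    called clip.wav in different directories map to different outputs.
    """
    digest = hashlib.sha256(source_path.encode("utf-8")).hexdigest()[:suffix_len]
    return f"{Path(source_path).stem}_{digest}{ext}"


# Two inputs that share a stem but live in different directories.
name_a = resampled_name("/data/speaker_a/clip.wav")
name_b = resampled_name("/data/speaker_b/clip.wav")
```

Because the suffix is derived from the input rather than from a counter or timestamp, re-running the stage on the same file produces the same output name, which keeps reruns idempotent.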

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ManifestReader\nFilePartitioningStage + ManifestReaderStage] --> B[ResampleAudioStage\nffmpeg resampling with hash-based dedup]
    B --> C[GetAudioDurationStage\nsoundfile.info header-only read]
    C --> D[PyAnnoteDiarizationStage\nPyAnnote speaker diarization + WhisperX VAD]
    C --> E[WhisperXVADStage\nVAD-only path]
    D --> F[SplitASRAlignJoinStage]
    E --> F
    F --> F1[SplitLongAudioStage\nsplit at natural pauses]
    F1 --> F2[NeMoASRAlignerStage\nNeMo forced alignment / transcription]
    F2 --> F3[JoinSplitAudioMetadataStage\nadjust timestamps + concatenate transcripts]
    F3 --> G[MergeAlignmentDiarizationStage\nalign words to speaker segments]
    G --> H[PreserveByValueStage\nfilter by value predicate]
    H --> I[ManifestWriterStage\nappend JSONL output]

Reviews (54): Last reviewed commit: "Merge branch 'main' into audio_core"

Comment thread nemo_curator/stages/audio/common.py
Comment thread nemo_curator/stages/audio/common.py
Comment thread nemo_curator/stages/audio/tagging/utils.py
Comment thread nemo_curator/stages/audio/tagging/split.py
Comment thread nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py
Comment thread nemo_curator/stages/audio/tagging/inference/nemo_asr_align.py
Comment thread nemo_curator/stages/audio/tagging/resample_audio.py Outdated
@sushmitha-deva-09
Contributor Author

/ok to test 57822fc

@sushmitha-deva-09
Contributor Author

/ok to test d3ea060

@sushmitha-deva-09
Contributor Author

/ok to test 449fcb2

Comment thread nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py
@sushmitha-deva-09
Contributor Author

/ok to test c1bb7b3

Comment on lines +229 to +236
splits_joined = 0
words_aligned = 0

# Check if this is a meta-entry with split information
if "split_filepaths" in data_entry:
    if data_entry["split_filepaths"] is None:
        del data_entry["split_filepaths"]
    else:
Contributor

P1 outputs() contract broken when split_metadata is empty

outputs() declares ["text", "alignment"] as guaranteed output keys, but _join_split_metadata returns early without setting either key when split_metadata is empty:

if not split_metadata:
    del meta_entry["split_filepaths"]
    return  # text and alignment are never set

Any downstream stage (e.g. MergeAlignmentDiarizationStage) that reads alignment from the task will receive a KeyError for these entries. At minimum, set empty defaults before returning:

if not split_metadata:
    del meta_entry["split_filepaths"]
    meta_entry.setdefault("text", "")
    meta_entry.setdefault("alignment", [])
    return
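
A self-contained sketch of the contract described in this comment, assuming a simplified stand-in for `_join_split_metadata`. The function below is hypothetical: it only concatenates text and alignment lists and omits the timestamp adjustment the real stage performs.

```python
from typing import Any

DECLARED_OUTPUTS = ["text", "alignment"]  # keys the stage promises downstream


def join_split_metadata(meta_entry: dict[str, Any], split_metadata: list[dict[str, Any]]) -> None:
    """Hypothetical stand-in for _join_split_metadata; mutates meta_entry in place."""
    meta_entry.pop("split_filepaths", None)
    if not split_metadata:
        # Honor the declared outputs even when there is nothing to join.
        meta_entry.setdefault("text", "")
        meta_entry.setdefault("alignment", [])
        return
    meta_entry["text"] = " ".join(m["text"] for m in split_metadata)
    meta_entry["alignment"] = [word for m in split_metadata for word in m["alignment"]]


def missing_outputs(entry: dict[str, Any]) -> list[str]:
    """Return the declared output keys that the entry does not carry."""
    return [key for key in DECLARED_OUTPUTS if key not in entry]


empty_entry: dict[str, Any] = {"split_filepaths": []}
join_split_metadata(empty_entry, [])
still_missing = missing_outputs(empty_entry)
```

A check like `missing_outputs` is also a convenient unit-test assertion: run it against the stage output for the empty-split case and require an empty list.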

Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
@sushmitha-deva-09
Contributor Author

/ok to test 9a1e3e4

Comment on lines +115 to 117
t0 = time.perf_counter()
for task in tasks:
    if not self.validate_input(task):
Contributor

P1 process() raises unconditionally, breaking single-task dispatch

InferenceAsrNemoStage.process() raises NotImplementedError for every single-task invocation. Any executor path that dispatches tasks one-by-one — testing without batching, fallback paths, or non-batch executors — will fail for every task. NeMoASRAlignerStage (in the same PR) now correctly delegates to process_batch, and the same fix applies here:

Suggested change
- t0 = time.perf_counter()
- for task in tasks:
-     if not self.validate_input(task):
+ def process(self, task: AudioTask) -> AudioTask:
+     results = self.process_batch([task])
+     return results[0] if results else task
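
For readers who want to try the delegation pattern in isolation, here is a minimal sketch. The `AudioTask` dataclass, the `BatchOnlyStage` class, and the `pred_text` key are hypothetical stand-ins, not the NeMo Curator types.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class AudioTask:
    """Hypothetical stand-in for the real task type."""
    data: dict[str, Any] = field(default_factory=dict)


class BatchOnlyStage:
    """Toy stage whose real work lives in process_batch."""

    def process_batch(self, tasks: list[AudioTask]) -> list[AudioTask]:
        for task in tasks:
            # Placeholder for batched inference.
            task.data["pred_text"] = f"transcript for {task.data['audio_filepath']}"
        return tasks

    def process(self, task: AudioTask) -> AudioTask:
        # Wrap the single task in a one-element batch instead of raising.
        results = self.process_batch([task])
        return results[0] if results else task


stage = BatchOnlyStage()
single = stage.process(AudioTask({"audio_filepath": "a.wav"}))
```

With this shape, an executor that dispatches one task at a time and one that dispatches batches both run the same code path, so there is only one implementation to test.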

Comment on lines +258 to +261
elif "speaker_id" in data_entry:
    speaker_id = data_entry["speaker_id"] + "_" + speaker
elif self.audio_filepath_key in data_entry:
    speaker_id = Path(data_entry[self.audio_filepath_key]).stem + "_" + speaker
Contributor

P1 overlaps computed before diarization.crop() — stale entries can exclude legitimate end-of-file turns

overlaps is extracted from the uncropped annotation. PyAnnote may emit spurious tracks beyond the audio boundary (the exact bug being fixed by diarization.crop()), and those tracks can produce spurious overlap segments. A valid speaker turn near the end of the file can then be matched by has_overlap check 4 (overlap.start < turn.start and overlap.end > turn.end), silently placing it in overlap_segments instead of segments.

Move the overlap extraction to after the crop:

# Crop to audio length first
diarization = diarization.crop(Segment(0, len(s[0]) / fs))

# Recompute overlaps on the cropped annotation
overlaps = diarization.get_overlap().segments_list_
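
The ordering issue can be reproduced without PyAnnote using plain interval tuples. The sketch below is a toy model under stated assumptions: `crop`, `overlap_regions`, and `strictly_contains` are hypothetical helpers, the turn values are invented, and only the single containment check quoted in this comment is modeled, not the full has_overlap logic.

```python
Turn = tuple[float, float, str]  # (start, end, speaker)


def crop(turns: list[Turn], duration: float) -> list[Turn]:
    """Clip every turn to [0, duration] and drop turns that become empty."""
    cropped = []
    for start, end, speaker in turns:
        start, end = max(start, 0.0), min(end, duration)
        if start < end:
            cropped.append((start, end, speaker))
    return cropped


def overlap_regions(turns: list[Turn]) -> list[tuple[float, float]]:
    """Pairwise intersections between turns of different speakers."""
    regions = []
    for i, (s1, e1, spk1) in enumerate(turns):
        for s2, e2, spk2 in turns[i + 1:]:
            start, end = max(s1, s2), min(e1, e2)
            if spk1 != spk2 and start < end:
                regions.append((start, end))
    return regions


def strictly_contains(region: tuple[float, float], turn: Turn) -> bool:
    """The check quoted above: overlap.start < turn.start and overlap.end > turn.end."""
    return region[0] < turn[0] and region[1] > turn[1]


duration = 10.0
end_turn: Turn = (8.0, 10.0, "B")  # turn that ends exactly at the file boundary
turns = [(0.0, 4.0, "A"), end_turn, (7.5, 11.0, "C"), (7.8, 11.5, "D")]  # C and D overrun

stale_hit = any(strictly_contains(r, end_turn) for r in overlap_regions(turns))
fresh_hit = any(strictly_contains(r, end_turn) for r in overlap_regions(crop(turns, duration)))
```

In this toy data the uncropped C/D intersection runs to 11.0 s and strictly contains the final turn, while after cropping it stops at 10.0 s and the strict comparison no longer matches.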

@sushmitha-deva-09
Contributor Author

/ok to test f8a3c7c

Comment thread tests/stages/audio/test_common.py Outdated
Comment thread pyproject.toml
Comment thread pyproject.toml
Comment thread pyproject.toml Outdated
@sushmitha-deva-09
Contributor Author

/ok to test 00409b3

@sushmitha-deva-09
Contributor Author

/ok to test dcd291c

@sarahyurick
Contributor

/ok to test d8951b8


6 participants