Add generic audio processing pipeline for ASR and TTS data preparation #1679
sarahyurick merged 81 commits into NVIDIA-NeMo:main
Signed-off-by: Sushmitha Deva <sdeva@nvidia.com>
Greptile Summary
This PR adds a complete audio tagging pipeline for ASR and TTS data preparation, including resampling, speaker diarization (PyAnnote), ASR alignment (NeMo), and segment merging stages. Several issues flagged in previous review rounds have been addressed (hash-based filename deduplication,
Confidence Score: 3/5
Not safe to merge: three P1 issues from previous review rounds remain unaddressed and will cause runtime failures on single-task executor paths.
Good progress resolving prior P1s (env-var scoping, filename collisions, data-loss in last chunk, task_id uniqueness, soundfile.info, batch_size passthrough). However, InferenceAsrNemoStage.process() and PreserveByValueStage.process() still raise NotImplementedError unconditionally, and the stale overlaps-before-crop ordering in pyannote.py can silently exclude valid speaker turns. Three confirmed P1s remain.
nemo_curator/stages/audio/inference/asr/asr_nemo.py (line 115), nemo_curator/stages/audio/common.py (line 111), nemo_curator/stages/audio/inference/speaker_diarization/pyannote.py (overlaps before crop)
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[ManifestReader\nFilePartitioningStage + ManifestReaderStage] --> B[ResampleAudioStage\nffmpeg resampling with hash-based dedup]
B --> C[GetAudioDurationStage\nsoundfile.info header-only read]
C --> D[PyAnnoteDiarizationStage\nPyAnnote speaker diarization + WhisperX VAD]
C --> E[WhisperXVADStage\nVAD-only path]
D --> F[SplitASRAlignJoinStage]
E --> F
F --> F1[SplitLongAudioStage\nsplit at natural pauses]
F1 --> F2[NeMoASRAlignerStage\nNeMo forced alignment / transcription]
F2 --> F3[JoinSplitAudioMetadataStage\nadjust timestamps + concatenate transcripts]
F3 --> G[MergeAlignmentDiarizationStage\nalign words to speaker segments]
G --> H[PreserveByValueStage\nfilter by value predicate]
H --> I[ManifestWriterStage\nappend JSONL output]
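The stage dependencies in the flowchart above can be sketched as a small dependency graph. This is only an illustration using Python's standard `graphlib` module; the `PREDECESSORS` mapping and `stage_order` helper are hypothetical names introduced here, and the actual pipeline wiring in the PR may be expressed differently.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on, copied from the
# edges of the flowchart. The mapping itself is an illustrative assumption,
# not the PR's real pipeline definition.
PREDECESSORS = {
    "ResampleAudioStage": {"ManifestReader"},
    "GetAudioDurationStage": {"ResampleAudioStage"},
    "PyAnnoteDiarizationStage": {"GetAudioDurationStage"},
    "WhisperXVADStage": {"GetAudioDurationStage"},
    "SplitASRAlignJoinStage": {"PyAnnoteDiarizationStage", "WhisperXVADStage"},
    "SplitLongAudioStage": {"SplitASRAlignJoinStage"},
    "NeMoASRAlignerStage": {"SplitLongAudioStage"},
    "JoinSplitAudioMetadataStage": {"NeMoASRAlignerStage"},
    "MergeAlignmentDiarizationStage": {"JoinSplitAudioMetadataStage"},
    "PreserveByValueStage": {"MergeAlignmentDiarizationStage"},
    "ManifestWriterStage": {"PreserveByValueStage"},
}


def stage_order() -> list[str]:
    """Return one valid execution order that respects every edge."""
    return list(TopologicalSorter(PREDECESSORS).static_order())


if __name__ == "__main__":
    for name in stage_order():
        print(name)
```

The diarization and VAD-only branches have no ordering constraint between them, so either may appear first in the result; every other stage has a fixed relative position.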
Reviews (54): Last reviewed commit: "Merge branch 'main' into audio_core"
/ok to test 57822fc
/ok to test d3ea060
/ok to test 449fcb2
/ok to test c1bb7b3
    splits_joined = 0
    words_aligned = 0

    # Check if this is a meta-entry with split information
    if "split_filepaths" in data_entry:
        if data_entry["split_filepaths"] is None:
            del data_entry["split_filepaths"]
        else:
outputs() contract broken when split_metadata is empty
outputs() declares ["text", "alignment"] as guaranteed output keys, but _join_split_metadata returns early without setting either key when split_metadata is empty:
    if not split_metadata:
        del meta_entry["split_filepaths"]
        return  # text and alignment are never set
Any downstream stage (e.g. MergeAlignmentDiarizationStage) that reads alignment from the task will hit a KeyError for these entries. At minimum, set empty defaults before returning:
    if not split_metadata:
        del meta_entry["split_filepaths"]
        meta_entry.setdefault("text", "")
        meta_entry.setdefault("alignment", [])
        return
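As a rough, self-contained illustration of the contract described in this comment, the sketch below uses a hypothetical `join_split_metadata` function that stands in for `_join_split_metadata`. The structure of `split_metadata` (a list of dicts with `text` and `alignment`) and the return value are assumptions made for the example, not the PR's actual implementation.

```python
def join_split_metadata(meta_entry: dict, split_metadata: list) -> dict:
    """Hypothetical stand-in: always leave "text" and "alignment" set."""
    # Remove the bookkeeping key regardless of which path is taken.
    meta_entry.pop("split_filepaths", None)

    if not split_metadata:
        # Early-return path: honour the declared outputs with empty defaults.
        meta_entry.setdefault("text", "")
        meta_entry.setdefault("alignment", [])
        return meta_entry

    # Normal path (assumed shape): concatenate per-split results.
    meta_entry["text"] = " ".join(part["text"] for part in split_metadata)
    meta_entry["alignment"] = [
        word for part in split_metadata for word in part["alignment"]
    ]
    return meta_entry


if __name__ == "__main__":
    entry = join_split_metadata({"split_filepaths": []}, [])
    # A downstream reader can now index both keys without a KeyError.
    print(entry["text"], entry["alignment"])
```

Using `setdefault` rather than plain assignment keeps any `text` or `alignment` value that an earlier stage may already have written.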
/ok to test 9a1e3e4
    t0 = time.perf_counter()
    for task in tasks:
        if not self.validate_input(task):
process() raises unconditionally, breaking single-task dispatch
InferenceAsrNemoStage.process() raises NotImplementedError for every single-task invocation. Any executor path that dispatches tasks one-by-one — testing without batching, fallback paths, or non-batch executors — will fail for every task. NeMoASRAlignerStage (in the same PR) now correctly delegates to process_batch, and the same fix applies here:
    - t0 = time.perf_counter()
    - for task in tasks:
    -     if not self.validate_input(task):
    + def process(self, task: AudioTask) -> AudioTask:
    +     results = self.process_batch([task])
    +     return results[0] if results else task
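The delegation pattern suggested here can be tried out in isolation. In the sketch below, `AudioTask` and `BatchOnlyStage` are hypothetical stand-ins written for this example (they are not the NeMo Curator classes), and the assumption that invalid tasks are dropped from the batch result is mine, not something the PR confirms.

```python
from dataclasses import dataclass, field


@dataclass
class AudioTask:
    """Hypothetical minimal task: just a dict of manifest fields."""
    data: dict = field(default_factory=dict)


class BatchOnlyStage:
    """Stand-in for a stage whose real work lives in process_batch()."""

    def validate_input(self, task: AudioTask) -> bool:
        return "audio_filepath" in task.data

    def process_batch(self, tasks: list) -> list:
        results = []
        for task in tasks:
            if not self.validate_input(task):
                continue  # assumption: invalid tasks are skipped
            task.data["pred_text"] = "<transcript placeholder>"
            results.append(task)
        return results

    def process(self, task: AudioTask) -> AudioTask:
        # Single-task path: wrap in a list and reuse the batch logic
        # instead of raising NotImplementedError.
        results = self.process_batch([task])
        return results[0] if results else task


if __name__ == "__main__":
    stage = BatchOnlyStage()
    print(stage.process(AudioTask({"audio_filepath": "a.wav"})).data)
```

Falling back to the original task when the batch result is empty keeps the single-task return type stable, at the cost of returning an unprocessed task for invalid input.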
    elif "speaker_id" in data_entry:
        speaker_id = data_entry["speaker_id"] + "_" + speaker
    elif self.audio_filepath_key in data_entry:
        speaker_id = Path(data_entry[self.audio_filepath_key]).stem + "_" + speaker
overlaps computed before diarization.crop() — stale entries can exclude legitimate end-of-file turns
overlaps is extracted from the uncropped annotation. PyAnnote may emit spurious tracks beyond the audio boundary (the exact bug being fixed by diarization.crop()), and those tracks can produce spurious overlap segments. A valid speaker turn near the end of the file can then be matched by has_overlap check 4 (overlap.start < turn.start and overlap.end > turn.end), silently placing it in overlap_segments instead of segments.
Move the overlap extraction to after the crop:
    # Crop to audio length first
    diarization = diarization.crop(Segment(0, len(s[0]) / fs))
    # Recompute overlaps on the cropped annotation
    overlaps = diarization.get_overlap().segments_list_
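To see why the ordering matters, here is a toy illustration that uses plain `(start, end)` tuples instead of PyAnnote objects. The helpers `crop_intervals` and `strictly_contains` are hypothetical; `strictly_contains` models only the "check 4" condition quoted above, and clipping the intervals directly is a simplification of recomputing overlaps from the cropped annotation.

```python
def crop_intervals(intervals, duration):
    """Clip (start, end) intervals to [0, duration]; drop empty ones."""
    cropped = []
    for start, end in intervals:
        start, end = max(start, 0.0), min(end, duration)
        if start < end:
            cropped.append((start, end))
    return cropped


def strictly_contains(overlap, turn):
    """Check 4 as quoted: overlap.start < turn.start and overlap.end > turn.end."""
    return overlap[0] < turn[0] and overlap[1] > turn[1]


duration = 10.0                  # audio length in seconds (made-up value)
turn = (9.5, 10.0)               # valid speaker turn ending at the file boundary
stale_overlaps = [(9.0, 12.0)]   # overlap region running past the end of the audio

# Using the stale overlap, the turn is classified as overlapped.
flagged_before = any(strictly_contains(o, turn) for o in stale_overlaps)

# After clipping to the audio length, check 4 no longer matches.
flagged_after = any(
    strictly_contains(o, turn) for o in crop_intervals(stale_overlaps, duration)
)

if __name__ == "__main__":
    print(flagged_before, flagged_after)
```

The numbers are invented for the example; the other `has_overlap` checks are not modeled and could still match in a real run.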
/ok to test f8a3c7c
/ok to test 00409b3
/ok to test dcd291c
/ok to test d8951b8
Description
Checklist