feat(tutorials/readspeech): add interactive Jupyter notebook tutorial #1870

Merged
sarahyurick merged 2 commits into NVIDIA-NeMo:main from shubhamNvidia:pr/notebook
Apr 24, 2026
Conversation

@shubhamNvidia (Contributor)

Summary

  • Add interactive Jupyter notebook walkthrough for DNS Challenge Read Speech audio curation pipeline
  • Provides step-by-step execution with visualization of quality score distributions
  • Demonstrates threshold tuning to control quality vs. data retention tradeoffs

What's Included

  • tutorials/audio/readspeech/readspeech_tutorial.ipynb - Interactive notebook covering:
    1. Dataset download and manifest creation
    2. Pipeline stages: Mono → VAD → Band Filter → UTMOS → SIGMOS → Speaker Separation
    3. Quality filtering with configurable thresholds
    4. Visualization of intermediate outputs and score distributions
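The score-threshold stages in the list above can be sketched as a chain of predicates over per-sample score dicts. This is a toy illustration, not the NeMo Curator API; the stage names and sample fields are assumptions, and only the numeric thresholds (VAD duration, UTMOS, SIGMOS) from the PR's flowchart are modeled, with the Mono, Band Filter, and Speaker Separation stages omitted.

```python
# Toy sketch of the tutorial's staged filtering idea (NOT the NeMo Curator
# API): each stage is a predicate over a sample dict, and a sample survives
# only if it passes every stage in order. Thresholds mirror the PR flowchart.

def make_threshold_stage(key, minimum):
    """Return a stage that keeps samples whose `key` score is >= `minimum`."""
    return lambda sample: sample.get(key, 0.0) >= minimum

STAGES = [
    make_threshold_stage("duration_s", 2.0),    # VAD keeps clips of 2-60 s
    make_threshold_stage("utmos", 3.4),         # UTMOS >= 3.4
    make_threshold_stage("sigmos_ovrl", 3.5),   # SIGMOS OVRL >= 3.5
    make_threshold_stage("sigmos_noise", 4.0),  # SIGMOS NOISE >= 4.0
]

def run_pipeline(samples):
    """Apply every stage in sequence; return the surviving samples."""
    return [s for s in samples if all(stage(s) for stage in STAGES)]

samples = [
    {"duration_s": 5.0, "utmos": 3.9, "sigmos_ovrl": 3.7, "sigmos_noise": 4.2},
    {"duration_s": 5.0, "utmos": 3.1, "sigmos_ovrl": 3.7, "sigmos_noise": 4.2},
]
kept = run_pipeline(samples)
print(len(kept))  # 1: the second sample fails the UTMOS >= 3.4 stage
```

Modeling stages as composable predicates makes the threshold-tuning experiments in the later cells a matter of rebuilding the stage list with different minimums.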

Test Plan

  • Open notebook in Jupyter and run all cells
  • Verify plots render correctly
  • Confirm pipeline stages execute without errors
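The last check in the Test Plan ends with JSONL output in the result directory. The notebook's own load_jsonl_results helper is not shown in this PR, so as an assumption about the format (one JSON object per line), a minimal stdlib loader looks like:

```python
# Minimal stdlib sketch of reading JSONL pipeline output. The notebook's
# load_jsonl_results helper is not shown in the PR, so the one-object-per-line
# format here is an assumption for illustration.
import io
import json

def load_jsonl(fp):
    """Parse one JSON object per non-empty line from a file-like object."""
    return [json.loads(line) for line in fp if line.strip()]

sample = io.StringIO('{"utmos": 3.8}\n{"utmos": 3.2}\n')
rows = load_jsonl(sample)
print(len(rows))  # 2 records parsed
```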

@shubhamNvidia shubhamNvidia requested review from a team as code owners April 24, 2026 17:58
@shubhamNvidia shubhamNvidia requested review from meatybobby and removed request for a team April 24, 2026 17:58
copy-pr-bot Bot commented Apr 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shubhamNvidia (Contributor, Author)

/ok to test 972b1c1

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR adds an interactive Jupyter notebook tutorial (tutorials/audio/readspeech/readspeech_tutorial.ipynb) that walks through the DNS Challenge Read Speech audio curation pipeline, covering dataset download, all filter stages, quality-score visualization, and threshold tuning. The .secrets.baseline is regenerated to track the new notebook's embedded plot images in place of stale multimodal notebook entries.

Confidence Score: 5/5

Safe to merge; all findings are P2 style suggestions that do not affect correctness.

No P0 or P1 issues found. The two P2 comments cover a potentially misleading threshold-sensitivity chart scope and an unguarded Ray cleanup path — neither blocks functionality. The secrets baseline update is a clean regeneration.

No files require special attention.

Important Files Changed

  • tutorials/audio/readspeech/readspeech_tutorial.ipynb: New interactive tutorial notebook for the DNS Challenge Read Speech pipeline; two minor P2 issues: the threshold-sensitivity chart misleads by covering only passing samples, and the Ray cluster lifecycle has no cleanup on failure.
  • .github/workflows/config/.secrets.baseline: Baseline regenerated via detect-secrets scan; swaps stale multimodal notebook entries (base64 plot images) for the new audio tutorial notebook entries; timestamp updated to the current run.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Cell 2: Config & Imports] --> B[Cell 4: CreateInitialManifestReadSpeechStage\nstandalone preview]
    A --> C[Cell 6: Full Pipeline]
    C --> D[CreateInitialManifestReadSpeechStage\nauto_download=True]
    D --> E[AudioDataFilterStage]
    E --> E1[Mono Conversion 48 kHz]
    E1 --> E2[VAD 2-60 s]
    E2 --> E3[Band Filter full_band]
    E3 --> E4[UTMOS >= 3.4]
    E4 --> E5[SIGMOS OVRL >= 3.5 / NOISE >= 4.0]
    E5 --> E6[Speaker Separation]
    E6 --> F[AudioToDocumentStage]
    F --> G[JsonlWriter -> RESULT_DIR]
    G --> H[Cell 9: load_jsonl_results]
    H --> I[Cell 11: Score Distributions]
    H --> J[Cell 13: Band Classification]
    H --> K[Cell 15: Speaker Distribution]
    H --> L[Cell 17: Threshold Sensitivity]
    G --> M[Cell 19: ray_client.stop]
```
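The "Cell 17: Threshold Sensitivity" node in the flowchart sweeps a quality threshold over the scored results. A hedged sketch of that idea, with made-up UTMOS scores and candidate thresholds (the notebook's actual cell is not reproduced here):

```python
# Sketch of a threshold-sensitivity sweep: for each candidate UTMOS threshold,
# report what fraction of scored samples would be retained. Scores and
# thresholds below are illustrative, not from the notebook.

def retention_curve(scores, thresholds):
    """For each threshold, return the fraction of scores at or above it."""
    n = len(scores)
    return {t: sum(s >= t for s in scores) / n for t in thresholds}

scores = [2.9, 3.2, 3.5, 3.8, 4.1, 4.4]
curve = retention_curve(scores, [3.0, 3.4, 4.0])
for t, kept in sorted(curve.items()):
    print(f"UTMOS >= {t}: {kept:.0%} retained")
```

Plotting retained fraction against threshold is what exposes the quality vs. data-retention tradeoff the Summary describes: raising the threshold improves average quality but monotonically shrinks the kept set.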

Reviews (2): Last reviewed commit: "Merge branch 'main' into pr/notebook"

"**What you'll learn:**\n",
"1. Download and inspect the dataset\n",
"2. Run each filter stage and examine intermediate outputs\n",
"3. Visualize quality score distributions\n",

P2 Internal codename leaked into public tutorial

The comments reference "Xenna" — which appears to be an internal project codename — in two places. This will be confusing to external contributors and users who won't know what "Xenna" refers to. These should be replaced with a neutral description.

The second occurrence (# Use the default executor (Xenna), matching pipeline.py CLI defaults.) should similarly be reworded to something like # Use the default executor, matching pipeline.py CLI defaults.

"cell_type": "markdown",
"metadata": {},
"source": [
"# DNS Challenge Read Speech \u2014 Interactive Tutorial\n",

P2 Pass rate denominator may be misleading

The pass rate is calculated as len(results) / MAX_SAMPLES, but MAX_SAMPLES is an upper bound on how many files to download — the pipeline may have actually processed fewer than MAX_SAMPLES inputs (e.g., if the dataset has fewer matching files, or some are skipped). Using MAX_SAMPLES as the denominator can silently understate the true pass rate.

Consider tracking the actual number of inputs from the manifest stage output and using that as the denominator, or at least adding a note that the denominator is the requested cap, not the actual input count.
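The reviewer's suggestion can be sketched in a few lines: derive the denominator from the manifest's actual input count rather than the download cap. The variable names (manifest_entries, results, MAX_SAMPLES) stand in for the notebook's real ones and the counts are invented for illustration.

```python
# Sketch of the suggested fix: use the manifest's actual input count as the
# pass-rate denominator, not the MAX_SAMPLES download cap. All values here
# are hypothetical stand-ins for the notebook's variables.

MAX_SAMPLES = 100                   # requested download cap (upper bound only)
manifest_entries = list(range(87))  # pretend the manifest stage yielded 87 inputs
results = list(range(52))           # pretend 52 samples survived filtering

actual_inputs = len(manifest_entries)
pass_rate = len(results) / actual_inputs   # 52/87: the true pass rate
capped_rate = len(results) / MAX_SAMPLES   # 52/100: understates it

print(f"true pass rate:   {pass_rate:.1%}")
print(f"cap-based 'rate': {capped_rate:.1%}")
```

Whenever fewer files than MAX_SAMPLES are actually downloaded, the cap-based figure is strictly smaller than the true rate, which is exactly the silent understatement the review flags.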

Add Jupyter notebook walkthrough for DNS Challenge Read Speech audio
curation pipeline with step-by-step execution and visualization.

Also update secrets baseline for notebook image false positives.
@sarahyurick (Contributor)

/ok to test e26ad56

@sarahyurick sarahyurick merged commit 6cdd923 into NVIDIA-NeMo:main Apr 24, 2026
23 checks passed
Jorjeous added a commit that referenced this pull request Apr 27, 2026
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):
- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:
- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry,
  added main's new ManifestReader and ManifestWriterStage to both __all__ and
  _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):
- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:
- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>