Qwen omni filter pipeline #1853

nune-tadevosyan wants to merge 16 commits into NVIDIA-NeMo:mmkrtchyan/qwen-omni-inprocess
Conversation
Greptile Summary

This PR adds three new post-processing stages to the Granary v2 pipeline.

Confidence Score: 4/5

Safe to merge with caveats — open issues from prior review rounds remain in the code, but the pipeline wiring is now correct. No new P0/P1 issues were found; the previous-thread P1 issues (contraction trimming producing broken output, `batch_size` not applied as a chunking knob, a Unicode punctuation gap in the content guard) are still present in the code. Two new P2 findings: a redundant `del self._llm` in teardown, and `rstrip`-based character stripping on abbreviation metadata. The score is capped at 4 due to the unresolved P1 items carried over from prior rounds.

Important Files Changed
- nemo_curator/stages/audio/text_filtering/abbreviation_concat.py (contraction trimming + rstrip metadata bug)
- nemo_curator/stages/audio/text_filtering/pnc_content_guard.py (Unicode punctuation gap)
- nemo_curator/stages/audio/text_filtering/pnc_restoration.py (batch_size not used for chunking)
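For context on that `rstrip` finding: `str.rstrip` treats its argument as a *set of characters* to strip, not as a literal suffix, so using it on abbreviation metadata can eat legitimate trailing letters. A quick illustration (hypothetical strings, not taken from the stage):

```python
# str.rstrip strips any run of trailing characters drawn from the given
# set, not the literal suffix -- a classic Python pitfall.
meta = "N.A.S.A."
print(meta.rstrip("."))        # N.A.S.A   -- looks fine here...
print(meta.rstrip(".A"))       # N.A.S     -- a multi-char arg eats letters too
print(meta.removesuffix("."))  # N.A.S.A   -- suffix-safe alternative (Python 3.9+)
```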
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NemoTarredAudioReader\nCPU – streams NeMo-tarred shards] --> B[InitializeFieldsStage\nrenames text → granary_v1_prediction\nsets _skip_me = '']
    B --> C[InferenceQwenOmniStage\nGPU – vLLM → qwen3_prediction_s1]
    C --> D{followup_prompt?}
    D -- yes --> E[Turn-2 Inference\n→ qwen3_prediction_s2]
    E --> F[DisfluencyWerGuardStage NEW\nWER guard: reverts s2 → s1 if WER > 50%]
    F --> G
    D -- no --> G[WhisperHallucinationStage\nflags hallucinations → _skip_me]
    G --> H[FastTextLIDStage\nflags wrong language → _skip_me]
    H --> I[RegexSubstitutionStage\n→ cleaned_text]
    I --> J[AbbreviationConcatStage NEW\njoins spaced letters → abbreviated_text]
    J --> K{skip_pnc?}
    K -- no --> L[PnCRestorationStage NEW\nGPU text LLM – two-step restore\n→ pnc_text]
    L --> M[PnCContentGuardStage NEW\nreverts pnc_text if words changed]
    M --> N[ShardedManifestWriterStage\n→ JSONL output]
    K -- yes --> N
```
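For readers mapping the flowchart to code, the new DisfluencyWerGuardStage boils down to a single revert rule. A minimal sketch of that decision, assuming `jiwer` for the WER computation (field names follow the flowchart and the 50% threshold is the one shown above; this is not the stage's actual code):

```python
import jiwer

def apply_wer_guard(sample: dict, max_wer: float = 0.5) -> dict:
    """Revert the turn-2 prediction to turn-1 if the disfluency pass
    rewrote too much of the text (word error rate above the threshold)."""
    s1 = sample["qwen3_prediction_s1"]
    s2 = sample["qwen3_prediction_s2"]
    if s1.strip() and jiwer.wer(s1, s2) > max_wer:
        sample["qwen3_prediction_s2"] = s1
    return sample
```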
Reviews (12). Last reviewed commit: "FastText on short utterances"
Force-pushed 20c605c to 8e10503
```python
    self,
    model_id: str = _QWEN_TEXT_MODEL_ID,
    completeness_prompt: str = (
        "Is the following text a complete sentence? Answer only 'yes' or 'no'.\n\nText: {text}"
    ),
    pnc_prompt: str = (
```
Do we give additional context here, like:

> The following text is a transcript segment from an audio recording.
> It may be a complete, self-contained utterance or thought, or it may be cut off mid-sentence or mid-idea.
> Determine if the text is complete and self-contained (i.e., not cut off). Answer only "yes" or "no".
>
> Text: {text}
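Either way, the default is a plain `{text}` template, presumably filled with `str.format`, so the richer wording would drop in without code changes. A quick check of the suggested prompt (hypothetical snippet, not the stage's code):

```python
completeness_prompt = (
    "The following text is a transcript segment from an audio recording.\n"
    "It may be a complete, self-contained utterance or thought, or it may be "
    "cut off mid-sentence or mid-idea.\n"
    "Determine if the text is complete and self-contained (i.e., not cut off). "
    'Answer only "yes" or "no".\n\nText: {text}'
)
print(completeness_prompt.format(text="So what I was trying to"))
```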
```python
    system_prompt: str | None = None,
    max_model_len: int = 4096,
    max_num_seqs: int = 16,
    gpu_memory_utilization: float = 0.8,
```
We can go higher, to 0.95, in all inferences IMO.
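For reference, this maps straight onto the vLLM constructor, so the bump is a one-line change. A minimal sketch with the values from this diff (whether 0.95 is safe depends on nothing else sharing the GPU):

```python
from vllm import LLM

# gpu_memory_utilization is the fraction of GPU memory vLLM may claim
# for weights + KV cache; 0.95 maximizes cache headroom but assumes the
# card is dedicated to this engine.
llm = LLM(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    max_model_len=4096,
    max_num_seqs=16,
    gpu_memory_utilization=0.95,
)
```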
```diff
-    text_key: str = "text"
+    pnc_text_key: str = "text"
```
We set it through the pipeline, but updated it here for consistency.
```python
    def teardown(self) -> None:
        if self._prep_pool is not None:
            self._prep_pool.shutdown(wait=False)
```
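Relatedly, the summary above flags a redundant `del self._llm` in this teardown. A hedged sketch of what a minimal release might look like (attribute names follow the snippet; the gc/CUDA-cache calls are assumptions about intent, not the stage's actual code):

```python
import gc
import torch

class _StageSketch:
    def teardown(self) -> None:
        # Stop the prep pool without blocking on in-flight work.
        if self._prep_pool is not None:
            self._prep_pool.shutdown(wait=False)
            self._prep_pool = None
        # Dropping the reference is enough to let the engine be collected;
        # an explicit `del self._llm` after this would be redundant.
        self._llm = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```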
```python
    if len(sample_answers) < _MAX_SAMPLE_LOG:
        sample_answers.append(repr(answer))
```
It was for logging; removed.
```python
    pass


_QWEN_TEXT_MODEL_ID = "Qwen/Qwen3.5-35B-A3B-FP8"
```
Any reason why this is not added here: https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/models/qwen_lm.py?
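If it does move there, the change is presumably just a shared constant, e.g. (hypothetical layout, not the file's actual contents):

```python
# nemo_curator/models/qwen_lm.py (hypothetical addition)
QWEN_TEXT_MODEL_ID = "Qwen/Qwen3.5-35B-A3B-FP8"

# ...and in the stage module, instead of a private copy:
# from nemo_curator.models.qwen_lm import QWEN_TEXT_MODEL_ID
```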
oyilmaz-nvidia left a comment:

Also, will this qwen-omni-inprocess branch be merged into main?
…ext_filtering module)

Adds Granary v2 post-processing pipeline stages for ASR text refinement:
- Abbreviation concatenation (deterministic rules, English-focused)
- PnC restoration via Qwen 3.5 text model with content-change filtering
- Disfluency / WER guard, FastText LID, regex substitution
- Whisper hallucination detection, initialize/finalize fields helpers
- New nemo_curator.models.qwen_text_llm wrapper

Squash cherry-pick of #1853 (qwen-omni-filter-pipeline branch).

Conflict resolution:
- nemo_curator/models/qwen_omni.py: kept PR's turn-2 disfluency method, appended dev's generate_from_messages + _inject_waveform helper.
- nemo_curator/stages/audio/__init__.py: took PR's lazy __getattr__ registry (includes new text_filtering stages); dev's _try_import scheme replaced.
- scripts/measure_realtime.py: skipped (file absent in dev).

#NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
…st, validated improvements on top of the 4 PRs

Squash cherry-pick of integration-test's unique commits on top of #1853 + #1 + #3 + #1839:

- 633acc7 FastText and Hallucination update
  → SelectBestPredictionStage: cross-model WER agreement. If both omni and ASR are flagged hallucinated but agree (WER ≤ 100 - min_agreement_pct, default 80%), keep omni and mark recovered — two independent models producing near-identical text is strong evidence the text is correct.
  → FastTextLIDStage: HuggingFace-format model loader, proper _predict() abstraction, source-tracked _skip_me ("Wrong language:{name}").
- 5fdfa0a additional notes key + skip writing keys after skip_me + pnc prompt + prefill caching
  → Models (qwen_omni, qwen_asr, qwen_text_llm): notes_key field for diagnostic info, vLLM enable_prefix_caching=True with xxhash.
  → text_filtering stages: skip writing output keys when skip_me is set.
  → New file: prompts/pnc_prompt.md.
- 15424e3 updated prompt for ITN
  → Sharper ITN prompt (handles more conversion edge cases).
- 0cf8e6c match max model len for ITN and PnC
  → Aligned ITN/PnC max_model_len (4096), max_num_seqs (16), gpu_memory_utilization (0.95). Wired ITN args through run_pipeline.
- 7e32df1 add Qwen3ASR for all
  → Apply QwenASR recovery to all hallucination flags, not just specific patterns. WhisperHallucinationStage tweaks.
- caccd37 Add min word count for FastText
  → Re-adds min_word_count=2 (FastText is unreliable on single-word inputs).

Conflict resolution:
- run_pipeline.py: kept multi-line argparse style (ours), kept --source_lang_key, adopted theirs' ITN stage construction (with new max_model_len/num_seqs/gpu_mem args).
- fasttext_lid.py: took theirs' richer process logic (min_word_count check, per-sample expected language via source_lang_key, source-tracked _skip_me values).

#NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
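The cross-model agreement rule in SelectBestPredictionStage is easy to state directly. A minimal sketch, again assuming `jiwer` and the default threshold from the commit message (function and argument names are illustrative, not the stage's actual code):

```python
import jiwer

def recovered_by_agreement(omni_text: str, asr_text: str,
                           min_agreement_pct: float = 80.0) -> bool:
    """Both models were flagged as hallucinating, but if they agree
    closely (WER <= 100 - min_agreement_pct), keep the omni text."""
    if not omni_text.strip() or not asr_text.strip():
        return False
    wer_pct = jiwer.wer(omni_text, asr_text) * 100.0
    return wer_pct <= 100.0 - min_agreement_pct
```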
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):
- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:
- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry, added main's new ManifestReader and ManifestWriterStage to both __all__ and _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):
- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:
- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Description

Adds two more stages to the Granary v2 pipeline.

Usage

PnC restoration is done with two stages: the restoration itself, followed by a content guard that reverts pnc_text if the words changed.
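A hedged sketch of what the guard half of that pair checks, i.e. whether restoration changed actual words rather than just punctuation and casing (helper name and normalization are illustrative, not the stage's implementation; note that `string.punctuation` is ASCII-only, which is exactly the Unicode gap flagged in review):

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def words_changed(original: str, restored: str) -> bool:
    """True if PnC restoration altered the words themselves, in which
    case the guard stage reverts pnc_text to the pre-restoration text."""
    def norm(s: str) -> list[str]:
        # ASCII-only strip: Unicode punctuation such as curly quotes
        # would survive and be counted as a word change (the review finding).
        return s.lower().translate(_PUNCT_TABLE).split()
    return norm(original) != norm(restored)
```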