
Qwen omni filter pipeline#1853

Open
nune-tadevosyan wants to merge 16 commits into NVIDIA-NeMo:mmkrtchyan/qwen-omni-inprocess from nune-tadevosyan:qwen-omni-filter-pipeline

Conversation

@nune-tadevosyan

Description

Adds three more stages to the Granary v2 pipeline:

  1. Abbreviation concatenation based on deterministic rules per language (currently specialised for English)
  2. PnC restoration with the Qwen 3.5 text model
  3. PnC filtering if the content was changed
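As an illustration of the deterministic English rule, here is a minimal sketch of spaced-letter joining; `concat_spaced_abbreviations` is a hypothetical helper for illustration, not the actual AbbreviationConcatStage API:

```python
import re

# Runs of two or more single letters separated by single spaces,
# e.g. "u s a" or "n a s a", bounded by word boundaries.
_SPACED_LETTERS = re.compile(r"\b(?:[A-Za-z] ){1,}[A-Za-z]\b")

def concat_spaced_abbreviations(text: str) -> str:
    """Join spaced-out single letters into one token: 'u s a' -> 'usa'."""
    return _SPACED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)
```

The real stage also has to handle possessives and metadata tracking (the review notes an open contraction-trimming issue there), which this sketch deliberately omits.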

Usage

PnC restoration is performed in two steps:

  1. Check if the sentence is complete
  2. Ask the model to restore PnC
The PnC prompt is supplied via `--pnc_prompt_file ${PNC_PROMPT_FILE}`.
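The two steps above can be sketched as follows; the completeness prompt mirrors the default quoted in the review thread, while `llm_generate` and `restore_pnc` are hypothetical placeholders, not the actual QwenTextLLM API:

```python
# Hypothetical sketch of the two-step PnC restoration flow.
COMPLETENESS_PROMPT = (
    "Is the following text a complete sentence? Answer only 'yes' or 'no'.\n\nText: {text}"
)
# Placeholder restoration prompt; the real one is loaded from --pnc_prompt_file.
PNC_PROMPT = "Restore punctuation and capitalisation. Output only the text.\n\nText: {text}"

def restore_pnc(llm_generate, text: str) -> str:
    # Step 1: ask whether the sentence is complete.
    answer = llm_generate(COMPLETENESS_PROMPT.format(text=text)).strip().lower()
    if not answer.startswith("yes"):
        return text  # leave incomplete fragments untouched
    # Step 2: ask the model to restore punctuation and capitalisation.
    return llm_generate(PNC_PROMPT.format(text=text)).strip()
```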

Nune Tadevosyan added 3 commits April 21, 2026 04:48
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
@nune-tadevosyan nune-tadevosyan requested a review from a team as a code owner April 22, 2026 08:11
@nune-tadevosyan nune-tadevosyan requested review from meatybobby and removed request for a team April 22, 2026 08:11
@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR adds three new post-processing stages to the Granary v2 pipeline: AbbreviationConcatStage (deterministic re-joining of spaced-out single letters), PnCRestorationStage (two-step LLM-based punctuation/capitalisation restoration via QwenTextLLM), and PnCContentGuardStage (a word-content guard that reverts LLM output when words were added or removed). Existing stages are also updated to rename skip_me to _skip_me and gain process_batch support. Several issues flagged in previous review rounds remain open in the code (the contraction-trimming bug in abbreviation_concat.py, batch_size not used for chunking, and the Unicode punctuation gap in the content guard).

Confidence Score: 4/5

Safe to merge with caveats — open issues from prior review rounds remain in the code but the pipeline wiring is now correct.

No new P0/P1 issues found; the previous-thread P1 issues (contraction trimming producing broken output, batch_size not applied as a chunking knob, Unicode punctuation gap in the content guard) are still present in the code. Two new P2 findings: redundant del self._llm in teardown and rstrip-based character stripping on abbreviation metadata. Score is capped at 4 due to unresolved P1 items carried over from prior rounds.

nemo_curator/stages/audio/text_filtering/abbreviation_concat.py (contraction trimming + rstrip metadata bug), nemo_curator/stages/audio/text_filtering/pnc_content_guard.py (Unicode punctuation gap), nemo_curator/stages/audio/text_filtering/pnc_restoration.py (batch_size not used for chunking).

Important Files Changed

Filename Overview
nemo_curator/models/qwen_text_llm.py New text-only LLM wrapper for two-step PnC restoration; well-structured with thread-pooled prompt preparation, but has a redundant del self._llm in teardown and a silent bare-except fallback in _format_prompt that swallows tokenizer errors.
nemo_curator/stages/audio/text_filtering/pnc_restoration.py New PnC restoration stage; batch_size field is declared and accepted via CLI but never used to chunk eligible_texts in process_batch, so the tuning knob silently has no effect on actual GPU batch size.
nemo_curator/stages/audio/text_filtering/pnc_content_guard.py New content guard stage that correctly defaults pnc_text_key to "pnc_text". The _PUNCT_TABLE only strips ASCII punctuation, so Unicode typographic characters emitted by the LLM can still cause spurious reversions.
nemo_curator/stages/audio/text_filtering/abbreviation_concat.py New abbreviation normalization stage; has unresolved contraction-trimming issue from previous threads plus a new P2 where rstrip("'s") misstrips non-possessive abbreviations ending in lowercase s in the metadata field.
nemo_curator/stages/audio/text_filtering/disfluency_wer_guard.py New WER-based guard stage; clean implementation that correctly falls back to the Turn-1 prediction when WER exceeds threshold.
examples/audio/qwen_omni_inprocess/run_pipeline.py Pipeline driver updated to wire in all four new stages; batch_size is now correctly forwarded to a declared dataclass field; arg parser refactored into _build_arg_parser.
nemo_curator/stages/audio/text_filtering/whisper_hallucination.py Removes the low-char-rate check, renames skip_me to _skip_me, and adds process_batch support; straightforward and correct.
scripts/measure_realtime.py Minor cleanup: makes script executable, fixes f-string-in-constant warnings, and correctly attaches timezone.utc to parsed log timestamps.
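The word-content guard idea, including the Unicode punctuation gap flagged above, can be sketched as follows; `guard_pnc` and `_words` are hypothetical helpers rather than the actual PnCContentGuardStage code, and this version strips all Unicode punctuation categories instead of only ASCII:

```python
import unicodedata

def _words(text: str) -> list[str]:
    # Strip every Unicode punctuation character (categories starting with 'P'),
    # so typographic quotes and dashes from the LLM don't cause false mismatches.
    cleaned = "".join(c for c in text if not unicodedata.category(c).startswith("P"))
    return cleaned.lower().split()

def guard_pnc(original: str, pnc_text: str) -> str:
    """Keep the LLM output only if it did not add or remove any words."""
    return pnc_text if _words(original) == _words(pnc_text) else original
```

Comparing lowercased, punctuation-stripped word sequences lets the guard accept pure punctuation/capitalisation changes while reverting anything that alters content.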

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NemoTarredAudioReader\nCPU – streams NeMo-tarred shards] --> B[InitializeFieldsStage\nrenames text → granary_v1_prediction\nsets _skip_me = '']
    B --> C[InferenceQwenOmniStage\nGPU – vLLM → qwen3_prediction_s1]
    C --> D{followup_prompt?}
    D -- yes --> E[Turn-2 Inference\n→ qwen3_prediction_s2]
    E --> F[DisfluencyWerGuardStage NEW\nWER guard: reverts s2 → s1 if WER > 50%]
    F --> G
    D -- no --> G[WhisperHallucinationStage\nflags hallucinations → _skip_me]
    G --> H[FastTextLIDStage\nflags wrong language → _skip_me]
    H --> I[RegexSubstitutionStage\n→ cleaned_text]
    I --> J[AbbreviationConcatStage NEW\njoins spaced letters → abbreviated_text]
    J --> K{skip_pnc?}
    K -- no --> L[PnCRestorationStage NEW\nGPU text LLM – two-step restore\n→ pnc_text]
    L --> M[PnCContentGuardStage NEW\nreverts pnc_text if words changed]
    M --> N[ShardedManifestWriterStage\n→ JSONL output]
    K -- yes --> N

Reviews (12): Last reviewed commit: "FastText on short utterances"

Comment thread nemo_curator/stages/audio/text_filtering/pnc_content_guard.py Outdated
Comment thread nemo_curator/stages/audio/text_filtering/pnc_restoration.py
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
@nune-tadevosyan nune-tadevosyan force-pushed the qwen-omni-filter-pipeline branch from 20c605c to 8e10503 Compare April 23, 2026 11:09
Nune Tadevosyan added 3 commits April 23, 2026 04:34
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Comment thread nemo_curator/stages/audio/text_filtering/pnc_content_guard.py Outdated
Nune Tadevosyan added 2 commits April 23, 2026 06:08
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Comment thread examples/audio/qwen_omni_inprocess/run_pipeline.py
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

Nune Tadevosyan added 3 commits April 23, 2026 06:45
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
fix
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Comment thread nemo_curator/stages/audio/text_filtering/abbreviation_concat.py
completeness_prompt: str = (
"Is the following text a complete sentence? Answer only 'yes' or 'no'.\n\nText: {text}"
)
pnc_prompt: str = (
Member


is this prompt good enough?

Comment thread nemo_curator/models/qwen_text_llm.py Outdated
self,
model_id: str = _QWEN_TEXT_MODEL_ID,
completeness_prompt: str = (
"Is the following text a complete sentence? Answer only 'yes' or 'no'.\n\nText: {text}"
Member


Do we give additional context here, like:

The following text is a transcript segment from an audio recording. 
It may be a complete, self-contained utterance or thought, or it may be cut off mid-sentence or mid-idea.

Determine if the text is complete and self-contained (i.e., not cut off). Answer only "yes" or "no".

Text: {text}

Author


Updated the prompt

Comment thread nemo_curator/models/qwen_text_llm.py Outdated
system_prompt: str | None = None,
max_model_len: int = 4096,
max_num_seqs: int = 16,
gpu_memory_utilization: float = 0.8,
Member


we can go higher to 0.95 in all inferences IMO

Author


Done

Comment on lines +49 to +50
text_key: str = "text"
pnc_text_key: str = "text"
Member


should these be same keys?

Author


We set it through the pipeline, but updated for consistency.

Comment thread nemo_curator/models/qwen_text_llm.py Outdated

def teardown(self) -> None:
if self._prep_pool is not None:
self._prep_pool.shutdown(wait=False)
Member


should this be wait=True?

Author


Yes! Updated.

Comment thread nemo_curator/models/qwen_text_llm.py Outdated
Comment on lines +288 to +289
if len(sample_answers) < _MAX_SAMPLE_LOG:
sample_answers.append(repr(answer))
Member


why we need 5 answers?

Author


It was for logging; removed.

Nune Tadevosyan added 2 commits April 23, 2026 08:21
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Member

@nithinraok nithinraok left a comment


LGTM.

pass


_QWEN_TEXT_MODEL_ID = "Qwen/Qwen3.5-35B-A3B-FP8"

Contributor

@oyilmaz-nvidia oyilmaz-nvidia left a comment


Also, will this qwen-omni-inprocess branch be merged into main?

Signed-off-by: Nune Tadevosyan <ntadevosyan@cw-dfw-cs-001-login-01.cm.cluster>
Jorjeous added a commit that referenced this pull request Apr 27, 2026
…ext_filtering module)

Adds Granary v2 post-processing pipeline stages for ASR text refinement:
- Abbreviation concatenation (deterministic rules, English-focused)
- PnC restoration via Qwen 3.5 text model with content-change filtering
- Disfluency / WER guard, FastText LID, regex substitution
- Whisper hallucination detection, initialize/finalize fields helpers
- New nemo_curator.models.qwen_text_llm wrapper

Squash cherry-pick of #1853 (qwen-omni-filter-pipeline branch).
Conflict resolution:
- nemo_curator/models/qwen_omni.py: kept PR's turn-2 disfluency method,
  appended dev's generate_from_messages + _inject_waveform helper.
- nemo_curator/stages/audio/__init__.py: took PR's lazy __getattr__ registry
  (includes new text_filtering stages); dev's _try_import scheme replaced.
- scripts/measure_realtime.py: skipped (file absent in dev). #NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Jorjeous added a commit that referenced this pull request Apr 27, 2026
…st, validated improvements on top of the 4 PRs

Squash cherry-pick of integration-test's unique commits on top of #1853 + #1 + #3 + #1839:

- 633acc7 FastText and Hallucination update
  → SelectBestPredictionStage: cross-model WER agreement. If both omni and
    ASR are flagged hallucinated but agree (WER ≤ 100 - min_agreement_pct,
    default 80%), keep omni and mark recovered — two independent models
    producing near-identical text is strong evidence the text is correct.
  → FastTextLIDStage: HuggingFace-format model loader, proper _predict()
    abstraction, source-tracked _skip_me ("Wrong language:{name}").

- 5fdfa0a additional notes key + skip writing keys after skip_me + pnc prompt + prefill caching
  → Models (qwen_omni, qwen_asr, qwen_text_llm): notes_key field for
    diagnostic info, vLLM enable_prefix_caching=True with xxhash.
  → text_filtering stages: skip writing output keys when skip_me is set.
  → New file: prompts/pnc_prompt.md.

- 15424e3 updated prompt for ITN
  → Sharper ITN prompt (handles more conversion edge cases).

- 0cf8e6c match max model len for ITN and PnC
  → Aligned ITN/PnC max_model_len (4096), max_num_seqs (16),
    gpu_memory_utilization (0.95). Wired ITN args through run_pipeline.

- 7e32df1 add Qwen3ASR for all
  → Apply QwenASR recovery to all hallucination flags, not just specific
    patterns. WhisperHallucinationStage tweaks.

- caccd37 Add min word count for FastText
  → Re-adds min_word_count=2 (FastText is unreliable on single-word inputs).

Conflict resolution:
- run_pipeline.py: kept multi-line argparse style (ours), kept --source_lang_key,
  adopted theirs' ITN stage construction (with new max_model_len/num_seqs/gpu_mem args).
- fasttext_lid.py: took theirs' richer process logic (min_word_count check,
  per-sample expected language via source_lang_key, source-tracked _skip_me values). #NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
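The cross-model WER-agreement recovery rule described in the first bullet above (keep the omni prediction when both models were flagged but their outputs nearly agree, i.e. WER <= 100 - min_agreement_pct) can be sketched as follows; `word_error_rate` and `models_agree` are hypothetical helpers, not the actual SelectBestPredictionStage implementation:

```python
def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Word-level Levenshtein distance as a percentage of the reference length."""
    d = list(range(len(hyp) + 1))  # one rolling row of the edit-distance table
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal (previous row, previous column)
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,            # deletion
                d[j - 1] + 1,        # insertion
                prev + (r != h),     # substitution (free if words match)
            )
    return 100.0 * d[-1] / max(len(ref), 1)

def models_agree(omni: str, asr: str, min_agreement_pct: float = 80.0) -> bool:
    # Two independently flagged models producing near-identical text is
    # treated as strong evidence the text is correct.
    return word_error_rate(omni.split(), asr.split()) <= 100.0 - min_agreement_pct
```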
Jorjeous added a commit that referenced this pull request Apr 27, 2026
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):
- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:
- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry,
  added main's new ManifestReader and ManifestWriterStage to both __all__ and
  _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):
- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:
- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
