Qwen omni filter pipeline #1853

nune-tadevosyan wants to merge 16 commits into NVIDIA-NeMo:mmkrtchyan/qwen-omni-inprocess
Conversation
Greptile Summary

This PR adds three new post-processing stages to the Granary v2 pipeline.

Confidence Score: 4/5

Safe to merge with caveats — open issues from prior review rounds remain in the code, but the pipeline wiring is now correct. No new P0/P1 issues were found; the previous-thread P1 issues (contraction trimming producing broken output, `batch_size` not applied as a chunking knob, a Unicode punctuation gap in the content guard) are still present in the code. Two new P2 findings: a redundant `del self._llm` in teardown, and `rstrip`-based character stripping on abbreviation metadata. The score is capped at 4 due to the unresolved P1 items carried over from prior rounds.

Important Files Changed
- nemo_curator/stages/audio/text_filtering/abbreviation_concat.py (contraction trimming + rstrip metadata bug)
- nemo_curator/stages/audio/text_filtering/pnc_content_guard.py (Unicode punctuation gap)
- nemo_curator/stages/audio/text_filtering/pnc_restoration.py (batch_size not used for chunking)
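For context on that `rstrip` finding: `str.rstrip` treats its argument as a *set of characters* to strip, not as a literal suffix, so using it on abbreviation metadata can eat legitimate trailing letters. A quick illustration (hypothetical strings, not taken from the stage):

```python
# str.rstrip strips any run of trailing characters drawn from the given
# set, not the literal suffix -- a classic Python pitfall.
meta = "N.A.S.A."
print(meta.rstrip("."))        # N.A.S.A   -- looks fine here...
print(meta.rstrip(".A"))       # N.A.S     -- a multi-char arg eats letters too
print(meta.removesuffix("."))  # N.A.S.A   -- suffix-safe alternative (Python 3.9+)
```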
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NemoTarredAudioReader\nCPU – streams NeMo-tarred shards] --> B[InitializeFieldsStage\nrenames text → granary_v1_prediction\nsets _skip_me = '']
    B --> C[InferenceQwenOmniStage\nGPU – vLLM → qwen3_prediction_s1]
    C --> D{followup_prompt?}
    D -- yes --> E[Turn-2 Inference\n→ qwen3_prediction_s2]
    E --> F[DisfluencyWerGuardStage NEW\nWER guard: reverts s2 → s1 if WER > 50%]
    F --> G
    D -- no --> G[WhisperHallucinationStage\nflags hallucinations → _skip_me]
    G --> H[FastTextLIDStage\nflags wrong language → _skip_me]
    H --> I[RegexSubstitutionStage\n→ cleaned_text]
    I --> J[AbbreviationConcatStage NEW\njoins spaced letters → abbreviated_text]
    J --> K{skip_pnc?}
    K -- no --> L[PnCRestorationStage NEW\nGPU text LLM – two-step restore\n→ pnc_text]
    L --> M[PnCContentGuardStage NEW\nreverts pnc_text if words changed]
    M --> N[ShardedManifestWriterStage\n→ JSONL output]
    K -- yes --> N
```
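For readers mapping the flowchart to code, the new DisfluencyWerGuardStage boils down to a single revert rule. A minimal sketch of that decision, assuming `jiwer` for the WER computation (field names follow the flowchart and the 50% threshold is the one shown above; this is not the stage's actual code):

```python
import jiwer

def apply_wer_guard(sample: dict, max_wer: float = 0.5) -> dict:
    """Revert the turn-2 prediction to turn-1 if the disfluency pass
    rewrote too much of the text (word error rate above the threshold)."""
    s1 = sample["qwen3_prediction_s1"]
    s2 = sample["qwen3_prediction_s2"]
    if s1.strip() and jiwer.wer(s1, s2) > max_wer:
        sample["qwen3_prediction_s2"] = s1
    return sample
```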
Reviews (12). Last reviewed commit: "FastText on short utterances"
Force-pushed 20c605c to 8e10503
```python
    self,
    model_id: str = _QWEN_TEXT_MODEL_ID,
    completeness_prompt: str = (
        "Is the following text a complete sentence? Answer only 'yes' or 'no'.\n\nText: {text}"
    ),
    pnc_prompt: str = (
```
Do we give additional context here, like:

> The following text is a transcript segment from an audio recording.
> It may be a complete, self-contained utterance or thought, or it may be cut off mid-sentence or mid-idea.
> Determine if the text is complete and self-contained (i.e., not cut off). Answer only "yes" or "no".
>
> Text: {text}
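Either way, the default is a plain `{text}` template, presumably filled with `str.format`, so the richer wording would drop in without code changes. A quick check of the suggested prompt (hypothetical snippet, not the stage's code):

```python
completeness_prompt = (
    "The following text is a transcript segment from an audio recording.\n"
    "It may be a complete, self-contained utterance or thought, or it may be "
    "cut off mid-sentence or mid-idea.\n"
    "Determine if the text is complete and self-contained (i.e., not cut off). "
    'Answer only "yes" or "no".\n\nText: {text}'
)
print(completeness_prompt.format(text="So what I was trying to"))
```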
```python
    system_prompt: str | None = None,
    max_model_len: int = 4096,
    max_num_seqs: int = 16,
    gpu_memory_utilization: float = 0.8,
```
We can go higher, to 0.95, in all inferences IMO.
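For reference, this maps straight onto the vLLM constructor, so the bump is a one-line change. A minimal sketch with the values from this diff (whether 0.95 is safe depends on nothing else sharing the GPU):

```python
from vllm import LLM

# gpu_memory_utilization is the fraction of GPU memory vLLM may claim
# for weights + KV cache; 0.95 maximizes cache headroom but assumes the
# card is dedicated to this engine.
llm = LLM(
    model="Qwen/Qwen3.5-35B-A3B-FP8",
    max_model_len=4096,
    max_num_seqs=16,
    gpu_memory_utilization=0.95,
)
```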
```diff
-    text_key: str = "text"
+    pnc_text_key: str = "text"
```
We set it through the pipeline, but updated it here for consistency.
```python
    def teardown(self) -> None:
        if self._prep_pool is not None:
            self._prep_pool.shutdown(wait=False)
```
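Relatedly, the summary above flags a redundant `del self._llm` in this teardown. A hedged sketch of what a minimal release might look like (attribute names follow the snippet; the gc/CUDA-cache calls are assumptions about intent, not the stage's actual code):

```python
import gc
import torch

class _StageSketch:
    def teardown(self) -> None:
        # Stop the prep pool without blocking on in-flight work.
        if self._prep_pool is not None:
            self._prep_pool.shutdown(wait=False)
            self._prep_pool = None
        # Dropping the reference is enough to let the engine be collected;
        # an explicit `del self._llm` after this would be redundant.
        self._llm = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```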
```python
    if len(sample_answers) < _MAX_SAMPLE_LOG:
        sample_answers.append(repr(answer))
```
It was for logging; removed.
```python
    pass


_QWEN_TEXT_MODEL_ID = "Qwen/Qwen3.5-35B-A3B-FP8"
```
Any reason why this is not added here: https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/models/qwen_lm.py?
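If it does move there, the change is presumably just a shared constant, e.g. (hypothetical layout, not the file's actual contents):

```python
# nemo_curator/models/qwen_lm.py (hypothetical addition)
QWEN_TEXT_MODEL_ID = "Qwen/Qwen3.5-35B-A3B-FP8"

# ...and in the stage module, instead of a private copy:
# from nemo_curator.models.qwen_lm import QWEN_TEXT_MODEL_ID
```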
oyilmaz-nvidia left a comment:

Also, will this qwen-omni-inprocess branch be merged into main?
…ext_filtering module)

Adds Granary v2 post-processing pipeline stages for ASR text refinement:
- Abbreviation concatenation (deterministic rules, English-focused)
- PnC restoration via Qwen 3.5 text model with content-change filtering
- Disfluency / WER guard, FastText LID, regex substitution
- Whisper hallucination detection, initialize/finalize fields helpers
- New nemo_curator.models.qwen_text_llm wrapper

Squash cherry-pick of #1853 (qwen-omni-filter-pipeline branch).

Conflict resolution:
- nemo_curator/models/qwen_omni.py: kept PR's turn-2 disfluency method, appended dev's generate_from_messages + _inject_waveform helper.
- nemo_curator/stages/audio/__init__.py: took PR's lazy __getattr__ registry (includes new text_filtering stages); dev's _try_import scheme replaced.
- scripts/measure_realtime.py: skipped (file absent in dev).

#NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
…st, validated improvements on top of the 4 PRs

Squash cherry-pick of integration-test's unique commits on top of #1853 + #1 + #3 + #1839:

- 633acc7 FastText and Hallucination update
  → SelectBestPredictionStage: cross-model WER agreement. If both omni and ASR are flagged hallucinated but agree (WER ≤ 100 - min_agreement_pct, default 80%), keep omni and mark recovered — two independent models producing near-identical text is strong evidence the text is correct.
  → FastTextLIDStage: HuggingFace-format model loader, proper _predict() abstraction, source-tracked _skip_me ("Wrong language:{name}").
- 5fdfa0a additional notes key + skip writing keys after skip_me + pnc prompt + prefill caching
  → Models (qwen_omni, qwen_asr, qwen_text_llm): notes_key field for diagnostic info, vLLM enable_prefix_caching=True with xxhash.
  → text_filtering stages: skip writing output keys when skip_me is set.
  → New file: prompts/pnc_prompt.md.
- 15424e3 updated prompt for ITN
  → Sharper ITN prompt (handles more conversion edge cases).
- 0cf8e6c match max model len for ITN and PnC
  → Aligned ITN/PnC max_model_len (4096), max_num_seqs (16), gpu_memory_utilization (0.95). Wired ITN args through run_pipeline.
- 7e32df1 add Qwen3ASR for all
  → Apply QwenASR recovery to all hallucination flags, not just specific patterns. WhisperHallucinationStage tweaks.
- caccd37 Add min word count for FastText
  → Re-adds min_word_count=2 (FastText is unreliable on single-word inputs).

Conflict resolution:
- run_pipeline.py: kept multi-line argparse style (ours), kept --source_lang_key, adopted theirs' ITN stage construction (with new max_model_len/num_seqs/gpu_mem args).
- fasttext_lid.py: took theirs' richer process logic (min_word_count check, per-sample expected language via source_lang_key, source-tracked _skip_me values).

#NO_PR

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
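The cross-model agreement rule in SelectBestPredictionStage is easy to state directly. A minimal sketch, again assuming `jiwer` and the default threshold from the commit message (function and argument names are illustrative, not the stage's actual code):

```python
import jiwer

def recovered_by_agreement(omni_text: str, asr_text: str,
                           min_agreement_pct: float = 80.0) -> bool:
    """Both models were flagged as hallucinating, but if they agree
    closely (WER <= 100 - min_agreement_pct), keep the omni text."""
    if not omni_text.strip() or not asr_text.strip():
        return False
    wer_pct = jiwer.wer(omni_text, asr_text) * 100.0
    return wer_pct <= 100.0 - min_agreement_pct
```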
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):
- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:
- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry, added main's new ManifestReader and ManifestWriterStage to both __all__ and _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):
- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:
- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Description

Adds two more stages to the Granary v2 pipeline.

Usage

PnC restoration is done with two stages: the restoration itself, followed by a content guard that reverts pnc_text if the words changed.
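A hedged sketch of what the guard half of that pair checks, i.e. whether restoration changed actual words rather than just punctuation and casing (helper name and normalization are illustrative, not the stage's implementation; note that `string.punctuation` is ASCII-only, which is exactly the Unicode gap flagged in review):

```python
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def words_changed(original: str, restored: str) -> bool:
    """True if PnC restoration altered the words themselves, in which
    case the guard stage reverts pnc_text to the pre-restoration text."""
    def norm(s: str) -> list[str]:
        # ASCII-only strip: Unicode punctuation such as curly quotes
        # would survive and be counted as a word change (the review finding).
        return s.lower().translate(_PUNCT_TABLE).split()
    return norm(original) != norm(restored)
```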