Fix Sortformer tutorial issues and add InferenceSortformerStage benchmark #1764
sarahyurick merged 5 commits into main
Conversation
…mark Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Greptile Summary

This PR fixes two runtime bugs in the Sortformer tutorial pipeline (sox → ffmpeg for mono conversion, and a model-path resolution fix) and adds an InferenceSortformerStage benchmark.

Confidence Score: 4/5. Safe to merge after addressing the review findings. All new findings in this review are P2; however, a P1 from the previous review round remains in tutorials/audio/callhome_diar/run.py (Ray cluster cleanup).

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["CallHomeReaderStage\n(num_workers_per_node=1)\nDiscovers WAV+CHA pairs"] --> B["EnsureMonoStage\n(ffmpeg -ac 1 -ar 16000)\nDownmix stereo → mono"]
    B --> C["InferenceSortformerStage\nsetup_on_node: snapshot_download\n_resolve_model_path → .nemo\n_configure_streaming + _extend_pos_enc"]
    C --> D["DERComputationStage\nCHA ground-truth scoring\ncollar tolerance"]
    D --> E["Results JSON + RTTM files"]
    subgraph "sortformer.py setup flow"
        F["setup_on_node()\nsnapshot_download only"] --> G["_resolve_model_path()\nmodel_path override\nOR sorted .nemo from cache"]
        G --> H["SortformerEncLabelModel\n.restore_from()"]
        H --> I["_configure_streaming()\nchunk_len / chunk_left_context\nchunk_right_context / fifo_len"]
        I --> J["_extend_pos_enc_for_long_audio()\nextend_pe(max_len=30000)"]
    end
```
Reviews (6): Last reviewed commit: "Merge branch 'main' into sortformer-fixe..."
sarahyurick
left a comment
LGTM. We can merge after approval from @mohammadaaftabv and Satish.
@mohammadaaftabv Could you please take another look? I've made the requested changes.
```python
repo_dir = getattr(self, "_cached_repo_dir", None) or snapshot_download(
    repo_id=self.model_name, cache_dir=self.cache_dir
)
```
This shouldn't need `snapshot_download`, right? Since `setup_on_node` is always called.
Suggested change:

```python
repo_dir = getattr(self, "_cached_repo_dir", None)
```
Workers don't see `_cached_repo_dir` due to serialization, so the second `snapshot_download()` call is needed; it just reads from the cache, with no re-download.
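The driver/worker split can be illustrated without Ray or huggingface_hub. The class below is a hypothetical stand-in for the stage (not the real `InferenceSortformerStage`), with `__getstate__` simulating the serialization boundary that drops driver-side state:

```python
import pickle

class StageSketch:
    """Hypothetical stand-in for the stage; illustrates the getattr fallback."""

    def __init__(self):
        self._cached_repo_dir = None

    def setup_on_node(self):
        # Driver-side: download once and remember the local path.
        self._cached_repo_dir = "/cache/models/sortformer"

    def __getstate__(self):
        # Simulate the serialization boundary: workers are rebuilt
        # without the attribute the driver populated.
        return {"_cached_repo_dir": None}

    def setup(self):
        # Worker-side: fall back to re-resolution, standing in for the
        # second snapshot_download() call that hits the local cache.
        return getattr(self, "_cached_repo_dir", None) or "resolved-from-cache"

driver = StageSketch()
driver.setup_on_node()
worker = pickle.loads(pickle.dumps(driver))
print(driver.setup())  # "/cache/models/sortformer"
print(worker.setup())  # "resolved-from-cache"
```

On the driver the cached path short-circuits the `or`; on a deserialized worker the attribute is gone, so the fallback runs, which is why dropping the second call would break workers.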
```python
spkcache_update_period: int = 300
spkcache_len: int = 188
inference_batch_size: int = 1
batch_duration: int = 100000
```
batch_duration declared but never passed to diarize()
batch_duration is documented as "Maximum total audio duration (seconds) per lhotse batch" but is never forwarded to the underlying diarize() call (lines 210-213). Any user who sets this field expecting it to control lhotse batching will have it silently ignored; the default 100,000 s limit is effectively never enforced.
If SortformerEncLabelModel.diarize() accepts a batch_duration kwarg, it needs to be threaded through:

```python
predicted_segments = self.diar_model.diarize(
    audio=audio_paths,
    batch_size=self.inference_batch_size,
    batch_duration=self.batch_duration,
)
```

If the NeMo API does not expose this parameter, the field and its docstring should be removed to avoid confusion.
+1 it doesn't look like batch_duration is being used anywhere atm.
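If the field is kept, one version-tolerant option is to forward only the kwargs the target signature actually accepts. This is a sketch: `call_with_supported_kwargs` and the stand-in `diarize` are hypothetical, not NeMo API:

```python
import inspect

def call_with_supported_kwargs(fn, /, **kwargs):
    """Forward only the kwargs that fn accepts, so optional parameters like
    batch_duration can be threaded through across library versions."""
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(**kwargs)  # fn takes **kwargs: pass everything
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return fn(**accepted)

def diarize(audio, batch_size=1):  # stand-in for the real API, no batch_duration
    return {"audio": audio, "batch_size": batch_size}

print(call_with_supported_kwargs(
    diarize, audio=["a.wav"], batch_size=4, batch_duration=100_000
))  # batch_duration is dropped: {'audio': ['a.wav'], 'batch_size': 4}
```

The trade-off is that a typo'd kwarg is also silently dropped, so a hard requirement (raise if `batch_duration` is set but unsupported) may be the better design here.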
Force-pushed eafdd10 to 5c3be57
```python
def run_audio_sortformer_benchmark(
    benchmark_results_path: str,
    manifest_path: str,
    model_name: str,
    rttm_out_dir: str | None = None,
    executor: str = "xenna",
    **kwargs,  # noqa: ARG001
) -> dict[str, Any]:
    """Run the audio Sortformer diarization benchmark and collect metrics."""
    benchmark_results_path = Path(benchmark_results_path)
```
This is minor, but it looks like `benchmark_results_path` isn't used by this function, only by `main`, so it can be removed?
```python
    return
device = next(self.diar_model.parameters()).device
try:
    pos_enc.extend_pe(max_len, device, torch.float32)
```
Following up on some Greptile comments, should it always be `float32`?
```python
sm.spkcache_update_period = self.spkcache_update_period
sm.chunk_left_context = self.chunk_left_context
if hasattr(sm, "spkcache_update_period"):
    sm.spkcache_update_period = self.spkcache_update_period
sm.spkcache_len = self.spkcache_len
```
Following up on some Greptile comments, why is there a `hasattr` guard for `spkcache_update_period`? Also, should `spkcache_len` have a guard too?
Older NeMo versions don't have spkcache_update_period on SortformerModules, without the guard it crashes.
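The guard pattern can be sketched with stand-in classes; `OldModules`, `NewModules`, and `configure` below are hypothetical, not NeMo's actual `SortformerModules`:

```python
class OldModules:
    """Stand-in for an older NeMo modules class without the new field."""
    chunk_left_context = 0

class NewModules(OldModules):
    """Stand-in for a newer version that does expose the field."""
    spkcache_update_period = 0

def configure(sm, spkcache_update_period=300, chunk_left_context=1):
    # Guard attributes that only exist on newer versions; setting an
    # unguarded attribute would work in Python but silently create a
    # field the model never reads, so we only set it when it exists.
    if hasattr(sm, "spkcache_update_period"):
        sm.spkcache_update_period = spkcache_update_period
    sm.chunk_left_context = chunk_left_context
    return sm

old = configure(OldModules())
new = configure(NewModules())
print(getattr(old, "spkcache_update_period", None))  # None: guard skipped it
print(new.spkcache_update_period)  # 300
```

By the same reasoning, if `spkcache_len` also only exists on newer versions, it would need the same guard; if it predates the change, the guard is unnecessary.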
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
/ok to test c5564bf
It looks like all requests have been implemented, thanks!
Fix Sortformer tutorial issues and add InferenceSortformerStage benchmark (#1764) (#1866)

* Fix Sortformer tutorial issues and add InferenceSortformerStage benchmark
* Address PR review feedback and update model to v2.1
* Address second round of review feedback
* fixing some more issues

Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Merge origin/main into dev to pick up upstream changes (492 files, +57k/-6k):

- 26.04 staging release
- Generic ASR/TTS audio processing pipeline (#1679)
- Dynamo disaggregated serving + validators (#1813, #1820, #1833, #1834, #1861)
- ReadSpeech audio curation benchmark + tutorials (#1841, #1851, #1870)
- VideoReader path validation, audio waveform leak fixes (#1845, #1765)
- Sortformer tutorial fixes + benchmarks (#1764)
- Generic audio pipeline + qwen3 support (#1827)
- Fern docs (audio + curate-audio sections)

Conflict resolution:

- nemo_curator/stages/audio/__init__.py: kept dev's lazy __getattr__ registry, added main's new ManifestReader and ManifestWriterStage to both __all__ and _LAZY_IMPORTS (now lazy-loaded from nemo_curator.stages.audio.common).
- uv.lock: took main's version (latest dependency resolutions).

Removals propagated from main (pre-merge-base files we no longer need):

- nemo_curator/stages/audio/alm/alm_manifest_writer.py (replaced by ShardedManifestWriterStage)
- nemo_curator/stages/audio/alm/alm_manifest_reader.py
- nemo_curator/backends/experimental/* (refactored away)
- nemo_curator/core/serve.py (replaced by typed serve config)

Verified intact:

- SCOTCH pipeline: speaker_id/, hifi_pipeline/slurm_e2e/ (dev-only additions, untouched).
- Cherry-picked audio PRs (#1853, #3, #1, #1839, integration-test) all present.

Signed-off-by: George Zelenfroynd <gzelenfroind@nvidia.com>
Summary

- Switch `EnsureMonoStage` to use `ffmpeg` instead of `sox` (not installed in nightly container)
- Fix `InferenceSortformerStage.setup()` model-path resolution bug (`_resolve_model_path`)
- Limit `CallHomeReaderStage` workers to 1 via `xenna_stage_spec` (prevents crash on 64-CPU hosts)
- Add `audio_sortformer_benchmark.py` benchmarking script

Closes #1728
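The ffmpeg downmix in the summary boils down to a single command (`-ac 1 -ar 16000` per the flowchart). The helper below is an illustrative sketch, not the tutorial's actual stage code; `ffmpeg_mono_cmd` and `ensure_mono` are hypothetical names:

```python
import subprocess

def ffmpeg_mono_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build the ffmpeg invocation that downmixes to mono (-ac 1)
    and resamples (-ar), mirroring the stage's behavior."""
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]

def ensure_mono(src: str, dst: str) -> None:
    # check=True raises CalledProcessError if ffmpeg fails
    # (e.g. unreadable input or missing codec).
    subprocess.run(ffmpeg_mono_cmd(src, dst), check=True, capture_output=True)

print(ffmpeg_mono_cmd("call.wav", "call_mono.wav"))
```

Splitting command construction from execution keeps the argument list testable without ffmpeg installed, which matters given that the sox dependency problem was exactly a missing binary in the nightly container.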