VoiceChat EA STT training reproducible features #15558

Draft

ankitapasad wants to merge 2 commits into NVIDIA-NeMo:main from ankitapasad:stt_vc_ea_parity

Collaborator

ankitapasad commented Mar 27, 2026

What does this PR do ?

Adds the following features to the dataset class to support VoiceChat EA STT training and fine-tuning:

Correct agent EOS placement
Clean implementation of token IDs and update user BOS ID to match EA
MCQ system prompt
Filler responses for ASR training data
Number normalization
Corresponding tests

Collection: speechlm2

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

PR Type:

New Feature
Bugfix

If you haven't finished some of the above items you can still open "Draft" PR.

ankitapasad added 2 commits

March 27, 2026 12:39


Separate train, val datasets

0a857c4

Co-authored-by: Claude <noreply@anthropic.com>

Signed-off-by: Ankita Pasad <apasad@nvidia.com>


Correct EOS placement, MCQ prompt, ASR filler response, number normalization, clean-up token ID init, and corresponding tests

a633150

Co-authored-by: Claude <noreply@anthropic.com>

Signed-off-by: Ankita Pasad <apasad@nvidia.com>

ankitapasad requested review from kevinhu-nv and zhehuaichen

March 27, 2026 19:50

github-advanced-security bot found potential problems


tests/collections/speechlm2/test_duplex_eos_placement.py

+              import os
+              import pytest
+              import torch

Check notice

Code scanning / CodeQL

Unused import (Note, test)

Import of 'torch' is not used.

tests/collections/speechlm2/test_duplex_eos_placement.py

+                  assert (target_tokens == eos).sum().item() == 0, "skip_eos=True should not place any EOS"
+                  # Now collate source tokens, passing in the target channel for EOS placement
+                  source_tokens, source_token_lens = collate_token_channel(

Check notice

Code scanning / CodeQL

Unused local variable (Note, test)

Variable source_tokens is not used.

tests/collections/speechlm2/test_duplex_eos_placement.py

+                  assert (target_tokens == eos).sum().item() == 0, "skip_eos=True should not place any EOS"
+                  # Now collate source tokens, passing in the target channel for EOS placement
+                  source_tokens, source_token_lens = collate_token_channel(

Check notice

Code scanning / CodeQL

Unused local variable (Note, test)

Variable source_token_lens is not used.

tests/collections/speechlm2/test_duplex_eos_placement.py

+                      skip_eos=True,
+                  )
+                  source_tokens, source_token_lens = collate_token_channel(

Check notice

Code scanning / CodeQL

Unused local variable (Note, test)

Variable source_tokens is not used.

tests/collections/speechlm2/test_duplex_eos_placement.py

+                      skip_eos=True,
+                  )
+                  source_tokens, source_token_lens = collate_token_channel(

Check notice

Code scanning / CodeQL

Unused local variable (Note, test)

Variable source_token_lens is not used.

tests/collections/speechlm2/test_duplex_is_training_flag.py

+              from nemo.collections.common.tokenizers import AutoTokenizer
+              from nemo.collections.speechlm2.data.duplex_stt_dataset import DuplexSTTDataset
+              from nemo.collections.speechlm2.data.utils import get_pad_id

Check notice

Code scanning / CodeQL

Unused import (Note, test)

Import of 'get_pad_id' is not used.

tests/collections/speechlm2/test_duplex_is_training_flag.py

+                  train_batch = train_ds[cuts]
+                  val_batch = val_ds[cuts]
+                  train_targets = train_batch["audio_data"]["target_tokens"]

Check notice

Code scanning / CodeQL

Unused local variable (Note, test)

Variable train_targets is not used.

tests/collections/speechlm2/test_duplex_is_training_flag.py

+                  # Force aligner should be created but never called during validation
+                  val_ds.force_aligner = MagicMock()
+                  val_ds[cuts]

Check notice

Code scanning / CodeQL

Statement has no effect (Note, test)

This statement has no effect.

tests/collections/speechlm2/test_duplex_is_training_flag.py

+                  # Mock the force aligner to avoid loading wav2vec2
+                  train_ds.force_aligner = MagicMock()
+                  train_ds.force_aligner.batch_force_align_user_audio.side_effect = lambda cuts, **kwargs: cuts
+                  train_ds[cuts]

Check notice

Code scanning / CodeQL

Statement has no effect (Note, test)

This statement has no effect.
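
One possible way to address the two "Statement has no effect" notices is to bind the result of the subscript and then assert on the mocked aligner, so the call is visibly used. The sketch below is only an illustration: `ToyDataset` is a hypothetical stand-in, not the real `DuplexSTTDataset`, and its `is_training` switch is an assumption based on the test file name and the comments quoted above.

```python
from unittest.mock import MagicMock


class ToyDataset:
    """Hypothetical stand-in for a dataset; NOT the real DuplexSTTDataset."""

    def __init__(self, is_training: bool):
        self.is_training = is_training
        self.force_aligner = MagicMock()

    def __getitem__(self, cuts):
        # Assumed behaviour: force alignment runs only in training mode.
        if self.is_training:
            cuts = self.force_aligner.batch_force_align_user_audio(cuts)
        return {"cuts": cuts}


cuts = ["cut-0", "cut-1"]

# Validation: bind the batch instead of writing a bare `val_ds[cuts]`,
# then check that the mocked aligner was never invoked.
val_ds = ToyDataset(is_training=False)
val_batch = val_ds[cuts]
val_ds.force_aligner.batch_force_align_user_audio.assert_not_called()

# Training: the mocked aligner passes the cuts through unchanged.
train_ds = ToyDataset(is_training=True)
train_ds.force_aligner.batch_force_align_user_audio.side_effect = lambda cuts, **kwargs: cuts
train_batch = train_ds[cuts]
train_ds.force_aligner.batch_force_align_user_audio.assert_called_once()
```

Binding the result and asserting on it also gives the test something concrete to fail on, which a bare subscript expression does not.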

tests/collections/speechlm2/test_duplex_utils.py

+                - is_mcq_cut_train / is_mcq_cut_val / is_asr_cut
+              """
+              import pytest

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'pytest' is not used.

