[audio codec] Add support for Lhotse training format #15622

Merged: rfejgin merged 4 commits into NVIDIA-NeMo:main from rfejgin:codec_lhotse, Apr 29, 2026

@rfejgin rfejgin commented Apr 17, 2026

This PR adds a Lhotse data loader for audio codec training. It also introduces mechanisms to make the training process more stable and to help debug potential NaN issues.

Data Loading

The functionality is split between built-in Lhotse capabilities and a simple custom dataset class.

Lhotse

These operations are handled directly in Lhotse:

  1. Duration filtering: Keeps only cuts that are at least as long as the training chunk (n_samples).
  2. Random segment selection: Selects a random segment of length n_samples from the cut.

Configuring Lhotse happens in AudioCodecModel._get_lhotse_dataloader().
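
For readers unfamiliar with the two Lhotse-side operations, here is a minimal plain-Python sketch of the logic. It does not use the Lhotse API; the `Cut` dataclass, `keep_long_enough`, and `random_segment` are hypothetical stand-ins for illustration, and it assumes n_samples is expressed at the cut's own sample rate.

```python
import random
from dataclasses import dataclass


@dataclass
class Cut:
    """Hypothetical stand-in for a Lhotse cut: an ID plus audio length."""
    id: str
    num_samples: int
    sample_rate: int

    @property
    def duration(self) -> float:
        return self.num_samples / self.sample_rate


def keep_long_enough(cuts, n_samples):
    """Duration filtering: drop cuts shorter than one training chunk."""
    return [c for c in cuts if c.duration >= n_samples / c.sample_rate]


def random_segment(cut, n_samples, rng):
    """Random segment selection: return (start, end) sample indices."""
    max_start = cut.num_samples - n_samples
    start = rng.randint(0, max_start)  # randint is inclusive on both ends
    return start, start + n_samples


rng = random.Random(0)
cuts = [Cut("short", 8_000, 16_000), Cut("long", 48_000, 16_000)]
kept = keep_long_enough(cuts, n_samples=16_000)
segments = {c.id: random_segment(c, 16_000, rng) for c in kept}
```

Filtering before segment selection guarantees that `max_start` is never negative, so every surviving cut can yield a full-length chunk.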

AudioCodecLhotseDataset

The custom dataset class, AudioCodecLhotseDataset, receives a CutSet from Lhotse and performs the following:

  1. Loads and collates the audio.
  2. Resamples to the target sample rate, which is the codec's output_sample_rate.
  3. Performs sanity checks on the audio (details below).
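
The sanity checks in step 3 (described under Training Robustness in this description) might look roughly like the sketch below. It is written against plain Python lists rather than tensors, and `check_audio` is a hypothetical name, not the function used in AudioCodecLhotseDataset; only the 1.5 threshold comes from the PR description.

```python
import math
import warnings

SUSPICIOUS_ABS_VALUE = 1.5  # threshold quoted in the PR description


def check_audio(samples, expected_len):
    """Raise on wrong length or non-finite samples; warn on large values.

    Returns the peak absolute sample value for convenience.
    """
    if len(samples) != expected_len:
        raise ValueError(f"expected {expected_len} samples, got {len(samples)}")
    if not all(math.isfinite(s) for s in samples):
        raise ValueError("audio contains NaN or infinite values")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak > SUSPICIOUS_ABS_VALUE:
        # Suspicious but not fatal: report it and keep going.
        warnings.warn(f"suspicious sample magnitude {peak:.2f}")
    return peak
```

Hard failures are reserved for conditions that would certainly corrupt training (wrong length, NaN, infinity), while out-of-range but finite samples only produce a warning.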

Training Robustness

When training models on Lhotse datasets, we observed convergence similar to the previous training recipe. However, one of roughly four training runs hit NaNs. To help debug this issue if it recurs and to stabilize training, the following mechanisms were added:

  1. Sanity checks in the loader: Raises an error if the loaded audio has an unexpected length or contains NaN or infinite values, and warns if suspicious sample values (abs(sample) > 1.5) are encountered.
  2. Gradient norm tracking (operates separately for the discriminator and generator):
    • (Optional) Skip updates: If an infinite or NaN gradient norm is detected, it triggers a warning and skips the current parameter update.
    • (Optional) Gradient clipping: Applies gradient norm clipping.
    • Logging: Logs the gradient norm both before and after clipping.
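
The gradient norm handling above can be sketched in plain Python as follows. This is an illustration of the logic only, not the code in audio_codec.py: `grad_norm` and `process_grads` are hypothetical names, gradients are modeled as lists of floats, and a real PyTorch implementation would more likely rely on `torch.nn.utils.clip_grad_norm_`.

```python
import math


def grad_norm(grads):
    """Global L2 norm over a list of per-parameter gradient lists."""
    return math.sqrt(sum(g * g for param in grads for g in param))


def process_grads(grads, max_norm=None, skip_nonfinite=True):
    """Return (grads_to_apply_or_None, norm_before, norm_after).

    A None first element tells the caller to skip the parameter update.
    """
    norm_before = grad_norm(grads)
    finite = math.isfinite(norm_before)
    if skip_nonfinite and not finite:
        return None, norm_before, norm_before
    if max_norm is not None and finite and norm_before > max_norm:
        scale = max_norm / norm_before
        grads = [[g * scale for g in param] for param in grads]
    return grads, norm_before, grad_norm(grads)


# Generator and discriminator are handled independently.
gen_update, gen_before, gen_after = process_grads([[3.0], [4.0]], max_norm=1.0)
disc_update, disc_before, _ = process_grads([[float("nan")]], max_norm=1.0)
```

Here the generator gradients are rescaled from a norm of 5.0 down to about 1.0, while the discriminator update is skipped because its norm is NaN. Returning both norms lets the caller log the value before and after clipping.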

Additional notes

  • I have confirmed that old YAML configs still run without error after these changes.
  • A companion PR is being prepared in the internal repo that correspondingly updates the training recipe.

@github-actions github-actions Bot added the TTS label Apr 17, 2026
Comment thread nemo/collections/tts/models/audio_codec.py Fixed
Comment thread nemo/collections/tts/models/audio_codec.py Fixed

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rfejgin rfejgin marked this pull request as ready for review April 22, 2026 23:55
@rfejgin rfejgin requested review from blisc and rlangman April 22, 2026 23:55
Comment thread nemo/collections/tts/models/audio_codec.py
Comment thread nemo/collections/tts/models/audio_codec.py Outdated
Comment thread tests/collections/tts/data/test_audio_codec_dataset_lhotse.py Fixed
Comment thread nemo/collections/tts/models/audio_codec.py Outdated
Comment thread nemo/collections/tts/models/audio_codec.py Outdated
Comment thread nemo/collections/tts/data/audio_codec_dataset_lhotse.py Outdated
Comment thread nemo/collections/tts/models/audio_codec.py Outdated
Comment thread nemo/collections/tts/models/audio_codec.py Outdated
Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

rfejgin commented Apr 25, 2026

I've addressed the PR comments. Apologies, but the commit history is lost because I needed to force-push to repair some unsigned commits that were blocking CI from running.

blisc
blisc previously approved these changes Apr 28, 2026
rlangman
rlangman previously approved these changes Apr 28, 2026
@rfejgin rfejgin dismissed stale reviews from rlangman and blisc via d57e2f9 April 28, 2026 23:04
@rfejgin rfejgin enabled auto-merge (squash) April 29, 2026 00:08

[🤖]: Hi @rfejgin 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@rfejgin rfejgin merged commit 32d01f3 into NVIDIA-NeMo:main Apr 29, 2026
126 checks passed

6 participants