[audio codec] Add support for Lhotse training format by rfejgin · Pull Request #15622 · NVIDIA-NeMo/NeMo

rfejgin · 2026-04-17T23:17:10Z

This PR adds a Lhotse data loader for audio codec training. It also introduces mechanisms to make the training process more stable and to help debug potential NaN issues.

Data Loading

The functionality is split between built-in Lhotse capabilities and a simple custom dataset class.

Lhotse

These operations are handled directly in Lhotse:

- Duration filtering: keeps only cuts that are at least as long as the training chunk (`n_samples` samples).
- Random segment selection: selects a random segment of `n_samples` samples from each cut.
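
To make the two Lhotse-side steps concrete, here is a minimal pure-Python sketch of the same logic. It assumes a hypothetical `Cut` dataclass standing in for a Lhotse cut and hypothetical helper names (`chunk_seconds`, `filter_by_duration`, `random_segment`); it is not the code in `_get_lhotse_dataloader()`. In Lhotse itself, the equivalent operations are roughly `CutSet.filter(...)` with a duration predicate and `Cut.truncate(offset=..., duration=...)`. Because Lhotse expresses cut durations in seconds, `n_samples` first has to be converted to seconds.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cut:
    """Hypothetical stand-in for a Lhotse cut (only the fields used here)."""
    id: str
    start: float     # offset into the underlying recording, in seconds
    duration: float  # length of the cut, in seconds


def chunk_seconds(n_samples: int, sample_rate: int) -> float:
    """Training chunk length converted from samples to seconds."""
    return n_samples / sample_rate


def filter_by_duration(cuts: List[Cut], n_samples: int, sample_rate: int) -> List[Cut]:
    """Keep only cuts that can hold at least one full training chunk."""
    min_dur = chunk_seconds(n_samples, sample_rate)
    return [c for c in cuts if c.duration >= min_dur]


def random_segment(cut: Cut, n_samples: int, sample_rate: int,
                   rng: Optional[random.Random] = None) -> Cut:
    """Return a random window of exactly one chunk taken from inside `cut`."""
    rng = rng or random.Random()
    seg_dur = chunk_seconds(n_samples, sample_rate)
    # Latest offset that still leaves a full chunk inside the cut;
    # the duration filter guarantees this is >= 0.
    max_offset = cut.duration - seg_dur
    offset = rng.uniform(0.0, max_offset)
    return Cut(id=f"{cut.id}-seg", start=cut.start + offset, duration=seg_dur)


# Usage: a 1.0 s chunk at a hypothetical 22.05 kHz codec rate.
cuts = [Cut("a", 0.0, 0.5), Cut("b", 0.0, 3.0), Cut("c", 1.0, 1.0)]
kept = filter_by_duration(cuts, n_samples=22050, sample_rate=22050)
segments = [random_segment(c, 22050, 22050, rng=random.Random(0)) for c in kept]
```

Here cut "a" is dropped because it is shorter than the 1.0 s chunk, cut "b" yields a window starting anywhere in [0, 2.0] s, and cut "c" (exactly one chunk long) can only yield itself.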

Lhotse is configured in `AudioCodecModel._get_lhotse_dataloader()`.

`AudioCodecLhotseDataset`

The custom dataset class, `AudioCodecLhotseDataset`, receives a `CutSet` from Lhotse and:

- Loads and collates the audio.
- Resamples it to the target sample rate, i.e., the codec's `output_sample_rate`.
- Performs sanity checks on the audio (details below).
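
The per-batch work of the dataset class can be sketched with NumPy as below. This is an illustration under stated assumptions, not `AudioCodecLhotseDataset` itself: the helper names are hypothetical, audio arrives as 1-D float arrays together with their original sample rates, and resampling uses plain linear interpolation (`np.interp`) only for brevity; a production pipeline would more likely use a band-limited resampler such as `torchaudio.functional.resample`. The checks mirror the ones described under Training Robustness: a wrong length or non-finite values raise, while peaks above 1.5 only warn.

```python
import warnings
from typing import List

import numpy as np


def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Linear-interpolation resampler (illustration only: no anti-aliasing filter)."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) * target_sr / orig_sr))
    t_in = np.arange(len(audio)) / orig_sr   # input sample times, in seconds
    t_out = np.arange(n_out) / target_sr     # output sample times, in seconds
    return np.interp(t_out, t_in, audio)


def check_audio(audio: np.ndarray, expected_len: int, cut_id: str,
                warn_threshold: float = 1.5) -> None:
    """Raise on wrong length or NaN/inf; warn on suspiciously large samples."""
    if audio.shape[-1] != expected_len:
        raise ValueError(f"{cut_id}: expected {expected_len} samples, got {audio.shape[-1]}")
    if not np.all(np.isfinite(audio)):
        raise ValueError(f"{cut_id}: audio contains NaN or infinite values")
    peak = float(np.max(np.abs(audio)))
    if peak > warn_threshold:
        warnings.warn(f"{cut_id}: suspicious sample magnitude {peak:.3f}")


def collate(batch: List[np.ndarray], sample_rates: List[int], cut_ids: List[str],
            target_sr: int, n_samples: int) -> np.ndarray:
    """Resample each example, sanity-check it, and stack into (batch, n_samples)."""
    out = []
    for audio, sr, cut_id in zip(batch, sample_rates, cut_ids):
        audio = resample_linear(np.asarray(audio, dtype=np.float32), sr, target_sr)
        audio = audio.astype(np.float32)  # np.interp returns float64
        check_audio(audio, n_samples, cut_id)
        out.append(audio)
    return np.stack(out)
```

Raising on a length mismatch, rather than silently padding, is a deliberate choice in this sketch: after segment selection and resampling every example should already be exactly `n_samples` long, so a mismatch points to an upstream bug such as rounding when converting between seconds and samples.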

Training Robustness

When training models on Lhotse datasets, we observed convergence similar to that of the previous training recipe. However, one of roughly four training runs produced NaNs. To stabilize training, and to help debug the issue if it recurs, the following mechanisms were added:

- Sanity checks in the loader: raises an error if the loaded audio has an unexpected length or contains NaN or infinite values, and warns if suspicious sample values (`abs(sample) > 1.5`) are encountered.
- Gradient norm tracking (applied separately to the discriminator and the generator):
  - (Optional) Skip updates: if the gradient norm is infinite or NaN, logs a warning and skips the current parameter update.
  - (Optional) Gradient clipping: clips gradients by their global norm.
  - Logging: logs the gradient norm both before and after clipping.
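
The gradient-norm guard can be sketched framework-agnostically, with NumPy arrays standing in for parameter gradients. Everything here is hypothetical (the function names, the `log` dict, and key names such as `gen/grad_norm_before_clip`); in PyTorch the clipping step itself is usually `torch.nn.utils.clip_grad_norm_`, which returns the total norm measured before clipping. The sketch would be called once for the generator and once for the discriminator, each with its own prefix.

```python
import math
import warnings
from typing import Dict, List, Optional

import numpy as np


def global_grad_norm(grads: List[np.ndarray]) -> float:
    """L2 norm of all gradients taken together, as if flattened into one vector."""
    total = sum(float(np.sum(np.square(g, dtype=np.float64))) for g in grads)
    return math.sqrt(total)  # NaN/inf propagate, which is what we want to detect


def guarded_update(grads: List[np.ndarray], max_norm: Optional[float],
                   skip_nonfinite: bool, log: Dict[str, float], prefix: str) -> bool:
    """Log the norm, optionally skip on a non-finite norm, optionally clip in place.

    Returns True if the optimizer step should proceed, False to skip it.
    """
    norm_before = global_grad_norm(grads)
    log[f"{prefix}/grad_norm_before_clip"] = norm_before
    if skip_nonfinite and not math.isfinite(norm_before):
        warnings.warn(f"{prefix}: non-finite gradient norm ({norm_before}); skipping update")
        return False
    if max_norm is not None and norm_before > max_norm:
        # Mirrors the scaling used by torch.nn.utils.clip_grad_norm_ (1e-6 guards division).
        scale = max_norm / (norm_before + 1e-6)
        for g in grads:
            g *= scale
    log[f"{prefix}/grad_norm_after_clip"] = global_grad_norm(grads)
    return True
```

Checking finiteness before clipping matters: rescaling a NaN or infinite gradient by `max_norm / norm` does not repair it (an infinite norm gives a scale of zero, and `inf * 0` is NaN), so skipping the update is the only safe option.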

Additional notes

- I have confirmed that existing YAML configs still run without errors after these changes.
- A companion PR that updates the training recipe accordingly is being prepared in the internal repo.

copy-pr-bot · 2026-04-22T04:39:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

rfejgin · 2026-04-25T00:11:38Z

I've addressed the PR comments. Apologies: the earlier commit history was lost because I had to force-push to repair some unsigned commits that were blocking CI from running.

github-actions · 2026-04-29T06:31:51Z

[🤖]: Hi @rfejgin 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

github-actions Bot added the TTS label Apr 17, 2026

github-advanced-security AI found potential problems Apr 17, 2026

Comment thread nemo/collections/tts/models/audio_codec.py Fixed

Comment thread nemo/collections/tts/models/audio_codec.py Fixed

rfejgin marked this pull request as ready for review April 22, 2026 23:55

rfejgin requested review from blisc and rlangman April 22, 2026 23:55

rfejgin commented Apr 23, 2026

Comment thread nemo/collections/tts/models/audio_codec.py

rfejgin commented Apr 23, 2026

Comment thread nemo/collections/tts/models/audio_codec.py Outdated

github-advanced-security AI found potential problems Apr 23, 2026

Comment thread tests/collections/tts/data/test_audio_codec_dataset_lhotse.py Fixed

rlangman reviewed Apr 23, 2026

rfejgin added the Run CICD label Apr 23, 2026

rfejgin had a problem deploying to test April 23, 2026 20:16 — with GitHub Actions Error

chtruong814 added Run CICD and removed Run CICD labels Apr 24, 2026

chtruong814 temporarily deployed to test April 24, 2026 05:39 — with GitHub Actions Inactive

chtruong814 added Run CICD and removed Run CICD labels Apr 24, 2026

chtruong814 had a problem deploying to test April 24, 2026 23:06 — with GitHub Actions Error

chtruong814 added Run CICD and removed Run CICD labels Apr 24, 2026

chtruong814 had a problem deploying to test April 24, 2026 23:28 — with GitHub Actions Error

rfejgin requested a review from rlangman April 24, 2026 23:32

chtruong814 added Run CICD and removed Run CICD labels Apr 24, 2026

chtruong814 had a problem deploying to test April 24, 2026 23:50 — with GitHub Actions Error

rfejgin force-pushed the codec_lhotse branch from 30ec347 to 5e77462 April 25, 2026 00:09

chtruong814 added Run CICD and removed Run CICD labels Apr 25, 2026

[audio codec] Add support for Lhotse training

54f94fe

Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

rfejgin force-pushed the codec_lhotse branch from 5e77462 to 54f94fe April 25, 2026 00:10

chtruong814 added Run CICD and removed Run CICD labels Apr 25, 2026

chtruong814 temporarily deployed to test April 25, 2026 00:11 — with GitHub Actions Inactive

pzelasko removed the Run CICD label Apr 28, 2026

Merge branch 'main' into codec_lhotse

9aa6d60

copy-pr-bot Bot had a problem deploying to test April 28, 2026 19:01 Error

blisc previously approved these changes Apr 28, 2026

rlangman previously approved these changes Apr 28, 2026

rfejgin added 2 commits April 28, 2026 15:46

Rename "legacy" to "non-tarred" data loading

a30bcda

To be more descriptive. Signed-off-by: Fejgin, Roy <rfejgin@nvidia.com>

Merge branch 'codec_lhotse' of github.com:rfejgin/NeMo into codec_lhotse

d57e2f9

rfejgin dismissed stale reviews from rlangman and blisc via d57e2f9 April 28, 2026 23:04

copy-pr-bot Bot temporarily deployed to test April 28, 2026 23:06 Inactive

rfejgin enabled auto-merge (squash) April 29, 2026 00:08

blisc approved these changes Apr 29, 2026

rfejgin merged commit 32d01f3 into NVIDIA-NeMo:main Apr 29, 2026
126 checks passed

rfejgin merged 4 commits into NVIDIA-NeMo:main from rfejgin:codec_lhotse

6 participants
