
[MagpieTTS][bugfix] defaults to force_map_dataset=True for validation dataset to avoid duplicates by a factor of num_workers. #15387

Merged: pzelasko merged 1 commit into NVIDIA-NeMo:main from XuesongYang:xueayng/pr-bugfix-val-dataloader on Feb 12, 2026

Conversation

@XuesongYang (Collaborator)

Summary

Fix validation dataloader data duplication for Lhotse Shar datasets by adding force_map_dataset: true to the MagpieTTS validation config.

Problem

When using lhotse_shar data with force_map_dataset=False (the default), the validation dataloader uses an iterable dataset path that causes two compounding issues:

  1. No DDP data partitioning -- The Lhotse sampler is created with rank=0, world_size=1 (hardcoded for iterable datasets), so every GPU independently iterates through the entire validation dataset instead of its 1/world_size share.

  2. Worker-level data duplication -- Each DataLoader worker gets a full copy of the IterableDatasetWrapper and independently iterates all shards. With num_workers=N, data is duplicated N× per GPU.

Combined, each GPU processes num_workers × total_dataset_batches instead of the correct total_dataset_batches / world_size.
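The combined effect can be checked with a toy calculation (pure Python; the function name and structure are illustrative, not NeMo code):

```python
def batches_seen_per_gpu(total_batches: int, num_workers: int,
                         world_size: int, map_dataset: bool) -> int:
    """Sketch of how many batches one GPU iterates per validation pass."""
    if map_dataset:
        # Map-dataset path: the sampler partitions by rank, and the
        # DataLoader dispatches indices to workers without duplication.
        return total_batches // world_size
    # Iterable path: sampler built with rank=0/world_size=1, so there is
    # no DDP split, and every worker iterates the full shard list.
    return total_batches * num_workers

# Numbers from the LibriTTS dev-clean experiment below (176 batches):
print(batches_seen_per_gpu(176, num_workers=2, world_size=8, map_dataset=False))  # 352
print(batches_seen_per_gpu(176, num_workers=2, world_size=8, map_dataset=True))   # 22
```

This reproduces the 352-vs-22 gap measured empirically below.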

This design is intentional for training (infinite datasets with force_finite=False), where unique per-worker seeds and infinite repetition avoid explicit data splitting. But for finite validation (force_finite=True), it results in massive redundant computation and metrics computed on duplicated data.

Empirical Validation

Tested on LibriTTS dev-clean (5,620 records = 176 batches at batch_size=32, num_workers=2, quadratic_duration=null):

| force_map_dataset | 8 GPUs | 16 GPUs | Expected |
|---|---|---|---|
| False (before) | 352 | 352 | num_workers × total = 2 × 176 = 352 (GPU-count independent) |
| True (after) | 22 | 11 | total / world_size = 176/8 = 22, 176/16 = 11 (proper DDP scaling) |

With force_map_dataset=True:

  • Validation iterations scale inversely with GPU count (correct DDP behavior)
  • No worker duplication (map dataset dispatches work to workers without duplication)
  • Reduction factor = num_workers × world_size (e.g., 2×8 = 16× fewer iterations on 8 GPUs)

Fix

Setting force_map_dataset: true in the validation config switches from iterable dataset to map dataset, where:

  • The sampler uses the actual global_rank and world_size to partition data across GPUs
  • The DataLoader manages worker dispatch without duplication
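In config terms the fix is a one-line change in the validation dataset section. A hedged sketch (the surrounding keys are illustrative, not the exact MagpieTTS config layout):

```yaml
validation_ds:              # illustrative section name
  force_map_dataset: true   # the fix: use a map dataset instead of an iterable one
  force_finite: true        # validation is already finite, per the discussion above
  num_workers: 2
  batch_size: 32
```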

This follows the same pattern used in speechlm2/data/datamodule.py for validation/test dataloaders. The unit test at tests/collections/common/test_lhotse_dataloading.py::test_force_map_dataset validates the effectiveness.


Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
@pzelasko pzelasko merged commit 438ac8a into NVIDIA-NeMo:main Feb 12, 2026
58 checks passed
@XuesongYang XuesongYang deleted the xueayng/pr-bugfix-val-dataloader branch February 12, 2026 16:51
nemoramo pushed a commit to nemoramo/MoNeMo that referenced this pull request Feb 13, 2026

nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026