[MagpieTTS][bugfix] defaults to force_map_dataset=True for validation dataset to avoid duplicates by a factor of num_workers. #15387
Merged
pzelasko merged 1 commit into NVIDIA-NeMo:main on Feb 12, 2026
factor of num_workers. Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com>
pzelasko approved these changes on Feb 12, 2026
Summary
Fix validation dataloader data duplication for Lhotse Shar datasets by adding `force_map_dataset: true` to the MagpieTTS validation config.

Problem
When using `lhotse_shar` data with `force_map_dataset=False` (the default), the validation dataloader uses an iterable dataset path that causes two compounding issues:

1. No DDP data partitioning: the Lhotse sampler is created with `rank=0, world_size=1` (hardcoded for iterable datasets), so every GPU independently iterates through the entire validation dataset instead of its `1/world_size` share.
2. Worker-level data duplication: each DataLoader worker gets a full copy of the `IterableDatasetWrapper` and independently iterates all shards. With `num_workers=N`, data is duplicated N× per GPU.

Combined, each GPU processes `num_workers × total_dataset_batches` instead of the correct `total_dataset_batches / world_size`.

This design is intentional for training (infinite datasets with `force_finite=False`), where unique per-worker seeds and infinite repetition avoid explicit data splitting. But for finite validation (`force_finite=True`), it results in massive redundant computation and metrics computed on duplicated data.

Empirical Validation
Tested on LibriTTS dev-clean (5,620 records = 176 batches at `batch_size=32`, `num_workers=2`, `quadratic_duration=null`):

| `force_map_dataset` | Batches per GPU |
| --- | --- |
| `False` (before) | `num_workers × total = 2 × 176 = 352` (GPU-count independent) |
| `True` (after) | `total / world_size = 176/8=22, 176/16=11` (proper DDP scaling) |

With `force_map_dataset=True`, each GPU runs fewer validation iterations by a factor of `num_workers × world_size` (e.g., 2×8 = 16× fewer iterations on 8 GPUs).

Fix
Setting `force_map_dataset: true` in the validation config switches from an iterable dataset to a map dataset, where the sampler uses `global_rank` and `world_size` to partition data across GPUs.

This follows the same pattern used in `speechlm2/data/datamodule.py` for validation/test dataloaders. The unit test at `tests/collections/common/test_lhotse_dataloading.py::test_force_map_dataset` validates the effectiveness of the fix.
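As a rough illustration of the arithmetic in this description, here is a minimal pure-Python sketch. It does not call NeMo or Lhotse. The function names (`expected_batches_per_gpu`, `iterable_path`, `map_path`), the rank-strided partition, and the round-up on uneven splits are assumptions made for illustration only. The input figures (5,620 records, batch size 32, 2 workers, 8 or 16 GPUs) are the ones reported for LibriTTS dev-clean.

```python
import math


def expected_batches_per_gpu(num_records, batch_size, num_workers,
                             world_size, force_map_dataset):
    """Toy model of per-GPU validation batches (not NeMo/Lhotse code)."""
    total = math.ceil(num_records / batch_size)
    if force_map_dataset:
        # Map-style path: data is partitioned across ranks.
        # Rounding up on uneven splits is an assumption of this sketch.
        return math.ceil(total / world_size)
    # Iterable path: every worker on every GPU walks the full dataset.
    return num_workers * total


def iterable_path(batches, num_workers):
    """Each worker independently yields every batch (hypothetical model)."""
    return [b for _ in range(num_workers) for b in batches]


def map_path(batches, rank, world_size):
    """Rank-strided partition; the striding scheme is an assumption."""
    return batches[rank::world_size]


batches = list(range(math.ceil(5620 / 32)))  # 176 batch indices
print(expected_batches_per_gpu(5620, 32, 2, 8, False))   # 352
print(expected_batches_per_gpu(5620, 32, 2, 8, True))    # 22
print(expected_batches_per_gpu(5620, 32, 2, 16, True))   # 11
print(len(iterable_path(batches, 2)))                    # 352
shards = [map_path(batches, r, 8) for r in range(8)]
print(sorted(b for s in shards for b in s) == batches)   # True
```

In this toy model the iterable path sees each batch once per worker on every GPU, while the map path gives each rank a disjoint slice whose union covers the dataset exactly once.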