
Regarding ideal settings of n_shard and n_bucket to generate shard data when converting wavs to tarred dataset #8218

Closed · Answered by titu1994
eesungkim asked this question in Q&A

The number of tarfiles and buckets depends somewhat on your compute cluster. We generally train on 128-256 GPUs, so we usually need enough tarfiles to allocate at least one tarfile per GPU. That sets a lower bound on the product of tarfiles and buckets.
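
That lower bound is simple arithmetic; a minimal sketch is below. The helper function is hypothetical (not part of NeMo) and only encodes the one-tarfile-per-GPU rule stated above.

```python
import math

# Illustrative helper: smallest per-bucket shard count such that the
# total number of tarfiles (shards per bucket * buckets) provides at
# least one tarfile per GPU.
def min_shards_per_bucket(num_gpus: int, num_buckets: int) -> int:
    return math.ceil(num_gpus / num_buckets)

# Example: 256 GPUs with 8 buckets -> at least 32 tarfiles per bucket,
# i.e. 256 tarfiles in total.
print(min_shards_per_bucket(num_gpus=256, num_buckets=8))  # 32
```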

Generally we use 8 buckets and around 512-8192 tarfiles, depending on the dataset size. @nithinraok can give more accurate numbers.
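
One way to turn "depending on the dataset size" into a concrete number is sketched below. The target of ~1000 utterances per tarfile is an assumption for illustration, not a NeMo recommendation; only the 512-8192 range and the one-tarfile-per-GPU constraint come from the advice above.

```python
import math

# Hypothetical heuristic for picking a shard count from dataset size.
def suggest_num_shards(num_utterances: int,
                       num_gpus: int = 128,
                       utts_per_shard: int = 1000) -> int:
    raw = max(1, num_utterances // utts_per_shard)
    # Round up to a multiple of the GPU count so shards divide evenly.
    rounded = math.ceil(raw / num_gpus) * num_gpus
    # Clamp into the 512-8192 range quoted above.
    return min(max(rounded, 512), 8192)

# Example: a 2M-utterance dataset on 128 GPUs -> 2048 tarfiles.
print(suggest_num_shards(num_utterances=2_000_000))  # 2048
```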

For data loading, make sure you use the sharded manifest option in the config: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#sharded-manifests
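
A minimal sketch of the relevant `train_ds` fragment is below, based on the linked docs (which describe the `shard_manifests` flag, the accompanying `defer_setup` flag, and the `_OP_`/`_CL_` shard-range notation). Paths and shard counts are placeholders; field placement can vary across NeMo versions, so check the docs for your release.

```python
from omegaconf import OmegaConf

# Illustrative train_ds fragment enabling sharded manifests.
cfg = OmegaConf.create({
    "model": {
        "train_ds": {
            # _OP_ / _CL_ expand to { / } for shard-range globbing.
            "manifest_filepath": "/data/tarred/sharded_manifests/manifest__OP_0..511_CL_.json",
            "tarred_audio_filepaths": "/data/tarred/audio__OP_0..511_CL_.tar",
            "shard_manifests": True,  # enable sharded-manifest loading
            "defer_setup": True,      # required alongside shard_manifests
        }
    }
})
print(OmegaConf.to_yaml(cfg))
```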

Another speedup is to convert all audio to FLAC, an option added recently: https://github.com/NVIDIA/…

Answer selected by eesungkim