
[QUESTION] [TTS] 'num_elements_batch_sampler' loses the randomness of the samples #5690

Open
dbkest opened this issue Mar 1, 2024 · 1 comment
Labels
Question



dbkest commented Mar 1, 2024

Hello, could you help me with a question, please? The script espnet2/samplers/num_elements_batch_sampler.py implements bin-based batch division, in which each batch loses the randomness of its samples. Does that significantly affect training (even though there is still randomness between batches)? For a TTS task, the data are first sorted globally by length and then divided into batches according to that order and the bin size. This happens once, before training starts (espnet2/tasks/abs_task.py: build_batch_sampler); during training, only the order of the batches is shuffled.
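Here is a minimal sketch of the behavior I mean (my own simplification with made-up lengths and bin size, not the actual ESPnet code):

```python
import random

def build_batches_by_elements(lengths, batch_bins):
    """Simplified sketch of length-sorted, bin-based batching: sort
    utterances globally by length, then greedily pack each batch until
    its element count reaches batch_bins."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, current_bins = [], [], 0
    for idx in order:
        current.append(idx)
        current_bins += lengths[idx]
        if current_bins >= batch_bins:
            batches.append(current)
            current, current_bins = [], 0
    if current:
        batches.append(current)
    return batches

# Built once before training; each epoch only the batch order is shuffled,
# so the membership of every batch stays fixed across epochs.
lengths = [120, 80, 300, 95, 210, 150, 60, 240]  # made-up frame counts
batches = build_batches_by_elements(lengths, batch_bins=400)
random.shuffle(batches)  # batch-level shuffling only
```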

I look forward to your reply.

sw005320 (Contributor) commented Mar 1, 2024

Good question.
The reason for this implementation is to balance random shuffling against GPU memory usage: batching length-sorted utterances keeps per-batch padding, and therefore peak memory, predictable.

We actually ran an experiment (about seven years ago) for ASR comparing utterance-level shuffling with batch-level shuffling, and the difference was marginal (although the two setups produced different effective batch sizes, so the comparison could have been better).
Some people even sort all utterances from short to long and report that it works better (a curriculum-learning effect).
So entirely random shuffling may not be needed.

However, this experience is old.
Many technologies have changed since then, and we might reach different conclusions today; it is worth revisiting.
Also, in some projects we have started to use fixed-length utterances (with padding), which lets us shuffle randomly at the utterance level.
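As an illustration of that fixed-length case, here is a minimal sketch (an assumed simplification, not actual ESPnet code):

```python
import random

def fixed_length_batches(num_samples, batch_size, seed=0):
    """Sketch under the fixed-length assumption: when every utterance is
    cropped or padded to the same length, each sample costs the same number
    of elements, so batches can be drawn from a fully shuffled utterance
    order without a long-utterance batch exhausting GPU memory."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)  # utterance-level shuffling
    return [order[i:i + batch_size] for i in range(0, num_samples, batch_size)]

print(fixed_length_batches(num_samples=10, batch_size=4))
```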

It would be great if you could investigate this.
