
Model Training (dropout, batchsize, STFT?) #9

Closed
leolya opened this issue May 3, 2022 · 5 comments

leolya commented May 3, 2022

Thanks for sharing the code. I have some questions about model training.

  1. What is the batch size during training? Is it 1, with gradients accumulated over every 4 samples?
  2. Is dropout deactivated during training? "Investigation of Practical Aspects of Single Channel Speech Separation for ASR" suggests that dropout is not used.
  3. How long does it take to train the model?
  4. What are the STFT configurations? I think the pre-trained model uses a 512-point STFT with half overlap, which differs slightly from the setup quoted below, and "log" does not seem to be applied to the spectrogram?

"The 25 ms frame size with the frame shift of 10 ms is used for feature generation. A 512-point FFT size and hamming window are used in (i)STFT, forming the 257-dimensional masks and spectrum. The log spectrogram with utterance-wise mean variance normalization is extracted as the input feature for all the separation models."
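For reference, the quoted feature pipeline can be sketched as below. This is a hedged illustration, not the repository's actual code; the 16 kHz sample rate and the `extract_features` name are assumptions not stated in the quote.

```python
import numpy as np
from scipy.signal import stft

def extract_features(wav, sr=16000):
    # Quoted recipe: 25 ms frames, 10 ms shift, 512-point FFT,
    # hamming window -> 257 frequency bins per frame.
    # (sr=16000 is an assumption; the quote does not state it.)
    nperseg = int(0.025 * sr)            # 400 samples at 16 kHz
    hop = int(0.010 * sr)                # 160 samples at 16 kHz
    _, _, Z = stft(wav, fs=sr, window="hamming",
                   nperseg=nperseg, noverlap=nperseg - hop, nfft=512)
    log_spec = np.log(np.abs(Z) + 1e-8)  # log magnitude spectrogram
    # utterance-wise mean-variance normalization, per frequency bin
    mean = log_spec.mean(axis=1, keepdims=True)
    std = log_spec.std(axis=1, keepdims=True) + 1e-8
    return (log_spec - mean) / std       # shape: (257, n_frames)
```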

@leolya changed the title from "Model Training (dropout and batchsize?)" to "Model Training (dropout, batchsize, STFT?)" on May 3, 2022
Sanyuan-Chen (Owner) commented

Hi @leolya ,

  1. Each training batch consists of 96 audio chunks, with chunk_size=4s and chunk_hop=2s. If GPU count or memory is limited, the batch size of 96 can be simulated with gradient_accumulation=n and batch_size=96/n.
  2. In the "Continuous Speech Separation with Conformer" paper, we kept dropout active during training. However, we later found that deactivating dropout leads to better performance. We recommend the training configuration in "Investigation of Practical Aspects of Single Channel Speech Separation for ASR" when training your own conformer separation model; performance can be further improved by using the L_{FA} training loss rather than the L_{SA} loss.
  3. In the "Continuous Speech Separation with Conformer" paper, we use a 512-point STFT with a 256-point shift. In the "Investigation of Practical Aspects of Single Channel Speech Separation for ASR" paper, we found that similar performance can be achieved with a 25 ms frame size and a 10 ms frame shift. I would recommend the setting from the latter paper, since that STFT feature can be easily combined with pretrained speech models such as WavLM.
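The chunking in point 1 (chunk_size=4s, chunk_hop=2s) can be sketched as follows; the `make_chunks` helper and the 16 kHz sample rate are illustrative assumptions, not the repository's actual code.

```python
import numpy as np

def make_chunks(wav, sr=16000, chunk_size=4.0, chunk_hop=2.0):
    """Cut a long waveform into overlapping fixed-size training chunks
    (chunk_size=4 s, chunk_hop=2 s, as described in the reply above)."""
    win = int(chunk_size * sr)   # 64000 samples at 16 kHz
    hop = int(chunk_hop * sr)    # 32000 samples at 16 kHz
    starts = range(0, len(wav) - win + 1, hop)
    return np.stack([wav[s:s + win] for s in starts])

# A 10 s utterance yields chunks starting at 0, 2, 4, and 6 s.
chunks = make_chunks(np.zeros(10 * 16000))
```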


leolya commented May 5, 2022

Thanks for your reply! It is really helpful! @Sanyuan-Chen

Another small question: does a "step" in the paper mean one update with a full batch? If I am using gradient accumulation, do I need to train for more steps?

Sanyuan-Chen (Owner) commented

Yes, one step means one parameter update, i.e. one call to optimizer.step().
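In other words, with gradient accumulation the n backward passes still count as a single step. A minimal PyTorch sketch of this counting; the toy linear model, learning rate, and random data are placeholders, not the actual training recipe:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum, micro_bs = 4, 24          # 4 * 24 = effective batch of 96

opt.zero_grad()
for _ in range(accum):
    x, y = torch.randn(micro_bs, 10), torch.randn(micro_bs, 1)
    # divide by accum so the summed gradients match the
    # average over the full effective batch of 96
    loss = nn.functional.mse_loss(model(x), y) / accum
    loss.backward()              # gradients accumulate in .grad
opt.step()                       # this is ONE training "step"
opt.zero_grad()
```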


leolya commented May 6, 2022

Thanks for replying!

@leolya leolya closed this as completed May 6, 2022
whisper0055 commented
Did you train the model successfully? I tried to train the model, but I found that the loss does not go down.
