
Model Training (dropout, batchsize, STFT?) #9

Closed
leolya opened this issue May 3, 2022 · 5 comments

leolya commented May 3, 2022

Thanks for sharing the code. I have some questions about model training.

  1. What is the batch size during training? Is it 1, with gradients accumulated over every 4 samples?
  2. Is dropout deactivated during training? "Investigation of Practical Aspects of Single Channel Speech Separation for ASR" suggests that dropout is not used.
  3. How long does it take to train the model?
  4. What are the STFT configurations? I think the pre-trained model uses a 512-point STFT with half overlap, which differs slightly from the setup quoted below, and "log" does not seem to be applied to the spectrogram?

"The 25 ms frame size with the frame shift of 10 ms is used for feature generation. A 512-point FFT size and hamming window are used in (i)STFT, forming the 257-dimensional masks and spectrum. The log spectrogram with utterance-wise mean variance normalization is extracted as the input feature for all the separation models."
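For reference, the quoted feature pipeline can be sketched as below. This is a hedged illustration, not the repository's actual code; the 16 kHz sample rate and the `extract_features` name are assumptions not stated in the quote.

```python
import numpy as np
from scipy.signal import stft

def extract_features(wav, sr=16000):
    # Quoted recipe: 25 ms frames, 10 ms shift, 512-point FFT,
    # hamming window -> 257 frequency bins per frame.
    # (sr=16000 is an assumption; the quote does not state it.)
    nperseg = int(0.025 * sr)            # 400 samples at 16 kHz
    hop = int(0.010 * sr)                # 160 samples at 16 kHz
    _, _, Z = stft(wav, fs=sr, window="hamming",
                   nperseg=nperseg, noverlap=nperseg - hop, nfft=512)
    log_spec = np.log(np.abs(Z) + 1e-8)  # log magnitude spectrogram
    # utterance-wise mean-variance normalization, per frequency bin
    mean = log_spec.mean(axis=1, keepdims=True)
    std = log_spec.std(axis=1, keepdims=True) + 1e-8
    return (log_spec - mean) / std       # shape: (257, n_frames)
```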

@leolya changed the title from "Model Training (dropout and batchsize?)" to "Model Training (dropout, batchsize, STFT?)" on May 3, 2022
Sanyuan-Chen (Owner) commented

Hi @leolya ,

  1. Each training batch consists of 96 audio chunks, with chunk_size=4s and chunk_hop=2s. If GPU count or memory is limited, the batch size of 96 can be simulated with gradient_accumulation=n and batch_size=96/n.
  2. In the "Continuous Speech Separation with Conformer" paper, we kept dropout active during training. However, we later found that deactivating dropout leads to better performance. We recommend the training configuration in "Investigation of Practical Aspects of Single Channel Speech Separation for ASR" when training your own conformer separation model; performance can be further improved by using the L_{FA} training loss rather than the L_{SA} loss.
  3. In the "Continuous Speech Separation with Conformer" paper, we use a 512-point STFT with a 256-point shift. In the "Investigation of Practical Aspects of Single Channel Speech Separation for ASR" paper, we found that similar performance can be achieved with a 25 ms frame size and a 10 ms frame shift. I would recommend the setting from the latter paper, since that STFT feature can be easily combined with pretrained speech models such as WavLM.
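The chunking in point 1 (chunk_size=4s, chunk_hop=2s) can be sketched as follows; the `make_chunks` helper and the 16 kHz sample rate are illustrative assumptions, not the repository's actual code.

```python
import numpy as np

def make_chunks(wav, sr=16000, chunk_size=4.0, chunk_hop=2.0):
    """Cut a long waveform into overlapping fixed-size training chunks
    (chunk_size=4 s, chunk_hop=2 s, as described in the reply above)."""
    win = int(chunk_size * sr)   # 64000 samples at 16 kHz
    hop = int(chunk_hop * sr)    # 32000 samples at 16 kHz
    starts = range(0, len(wav) - win + 1, hop)
    return np.stack([wav[s:s + win] for s in starts])

# A 10 s utterance yields chunks starting at 0, 2, 4, and 6 s.
chunks = make_chunks(np.zeros(10 * 16000))
```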


leolya commented May 5, 2022

Thanks for your reply! It is really helpful! @Sanyuan-Chen

Another small question: does a "step" in the paper mean one update with a full batch? If I am using gradient accumulation, do I need to train for more steps?

Sanyuan-Chen (Owner) commented

Yes, one step means one parameter update, i.e. one call to optimizer.step().
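In other words, with gradient accumulation the n backward passes still count as a single step. A minimal PyTorch sketch of this counting; the toy linear model, learning rate, and random data are placeholders, not the actual training recipe:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
accum, micro_bs = 4, 24          # 4 * 24 = effective batch of 96

opt.zero_grad()
for _ in range(accum):
    x, y = torch.randn(micro_bs, 10), torch.randn(micro_bs, 1)
    # divide by accum so the summed gradients match the
    # average over the full effective batch of 96
    loss = nn.functional.mse_loss(model(x), y) / accum
    loss.backward()              # gradients accumulate in .grad
opt.step()                       # this is ONE training "step"
opt.zero_grad()
```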


leolya commented May 6, 2022

Thanks for replying!

@leolya leolya closed this as completed May 6, 2022
whisper0055 commented
Did you train the model successfully? I tried to train the model, but I found that the loss does not go down.
