How to determine batch_size, learning rate, and accumulate_grad_batches for CitriNet models #2055
-
Thank you for trying out Citrinet and the other models! There is a preprint currently under review that explains how these Citrinet models were trained: Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition. The Citrinet model that is available as a pretrained checkpoint was trained on roughly 7,000 hours of English speech. In most cases, fine-tuning these models yields much better scores than training from scratch (see the fine-tuning scores for various other datasets). What I would suggest is to use a moderate vocab size (1K is sufficient for most tasks) and fine-tune for a long duration with the NovoGrad configuration from the paper/config file. These models generally converge to much better numbers with longer training.
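Not an official recipe, but here is a rough sketch of what that fine-tuning setup can look like with NeMo's `EncDecCTCModelBPE` API. The checkpoint name, tokenizer directory, manifest paths, batch size, and the NovoGrad/scheduler numbers below are assumptions loosely based on the published Citrinet configs, so check them against your NeMo and PyTorch Lightning versions:

```python
# Hedged sketch of fine-tuning a pretrained Citrinet with NovoGrad in NeMo.
# All paths, the checkpoint name, and the optimizer values are assumptions.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Trainer for 8 GPUs; argument names may differ across Lightning versions.
trainer = pl.Trainer(gpus=8, accelerator="ddp", precision=16,
                     max_epochs=100, accumulate_grad_batches=1)

# Load the pretrained English Citrinet instead of training from scratch.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")
model.set_trainer(trainer)

# Optionally swap in your own ~1K-subword BPE tokenizer built on your corpus.
model.change_vocabulary(new_tokenizer_dir="my_tokenizer_dir",
                        new_tokenizer_type="bpe")

# Point the model at your manifests (hypothetical paths and batch size).
model.setup_training_data({"manifest_filepath": "train_manifest.json",
                           "sample_rate": 16000, "batch_size": 32,
                           "shuffle": True})
model.setup_validation_data({"manifest_filepath": "dev_manifest.json",
                             "sample_rate": 16000, "batch_size": 32,
                             "shuffle": False})

# NovoGrad + cosine schedule, roughly mirroring values from the Citrinet config.
model.setup_optimization(optim_config={
    "name": "novograd",
    "lr": 0.05,
    "betas": [0.8, 0.25],
    "weight_decay": 0.001,
    "sched": {"name": "CosineAnnealing", "warmup_steps": 1000, "min_lr": 1e-5},
})

trainer.fit(model)
```

For fine-tuning rather than scratch training, a lower peak learning rate than the scratch value is usually a safer starting point; treat the numbers above as a template to tune, not a prescription.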
-
Thanks for this great toolkit!
I'm currently exploring various toolkits and model architectures. So far, I have managed to get interesting results with the Conformer and RNN-T approaches, but I struggle to train a good Citrinet model from scratch. On our internal test suite, the published stt_citrinet_en_1024 yields 25% WER, and if I run one epoch of fine-tuning on our 10k-hour training set, I can get the WER down to 19%.
Now, if I use those same 10k hours with a 3k BPE tokenizer and train a similar Citrinet model from scratch, I get something closer to 45+% WER. Clearly something is wrong with my learning rate (the same dataset produces good models in other toolkits). What I find really interesting about Citrinet models is the speed: this type of model is really fast.
Let's say I have 8 V100 GPUs or 8 A100 GPUs: what would be a good combination of batch_size, learning rate, and accumulate_grad_batches to use?
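For reference, here is the arithmetic I am reasoning about; the per-GPU batch size below is just an illustration of what might fit in memory, not a measured number:

```python
# Illustration only: how per-GPU batch size, GPU count, and gradient
# accumulation combine into the effective batch size per optimizer step.
per_gpu_batch_size = 32        # assumed value for a V100/A100; depends on memory
num_gpus = 8
accumulate_grad_batches = 4

effective_batch_size = per_gpu_batch_size * num_gpus * accumulate_grad_batches
print(effective_batch_size)    # 1024 samples per optimizer step
```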