How to determine batch_size, learning rate, and accumulate_grad_batches for CitriNet models #2055
-
Thank you for trying out Citrinet and the other models! There is a preprint currently under review that explains how these Citrinet models were trained: Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition. The Citrinet model that is available as a pretrained checkpoint was trained on roughly 7,000 hours of English speech. In most cases, fine-tuning these models yields much better scores than training from scratch (see the fine-tuning scores for various other datasets). What I would suggest is to use a moderate vocab size (1K is sufficient for most tasks) and fine-tune for a long duration with the NovoGrad configuration from the paper/config file. These models generally converge to much better numbers with longer training.
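Not an official recipe, but here is a rough sketch of what that fine-tuning setup can look like with NeMo's `EncDecCTCModelBPE` API. The checkpoint name, tokenizer directory, manifest paths, batch size, and the NovoGrad/scheduler numbers below are assumptions loosely based on the published Citrinet configs, so check them against your NeMo and PyTorch Lightning versions:

```python
# Hedged sketch of fine-tuning a pretrained Citrinet with NovoGrad in NeMo.
# All paths, the checkpoint name, and the optimizer values are assumptions.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Trainer for 8 GPUs; argument names may differ across Lightning versions.
trainer = pl.Trainer(gpus=8, accelerator="ddp", precision=16,
                     max_epochs=100, accumulate_grad_batches=1)

# Load the pretrained English Citrinet instead of training from scratch.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_citrinet_1024")
model.set_trainer(trainer)

# Optionally swap in your own ~1K-subword BPE tokenizer built on your corpus.
model.change_vocabulary(new_tokenizer_dir="my_tokenizer_dir",
                        new_tokenizer_type="bpe")

# Point the model at your manifests (hypothetical paths and batch size).
model.setup_training_data({"manifest_filepath": "train_manifest.json",
                           "sample_rate": 16000, "batch_size": 32,
                           "shuffle": True})
model.setup_validation_data({"manifest_filepath": "dev_manifest.json",
                             "sample_rate": 16000, "batch_size": 32,
                             "shuffle": False})

# NovoGrad + cosine schedule, roughly mirroring values from the Citrinet config.
model.setup_optimization(optim_config={
    "name": "novograd",
    "lr": 0.05,
    "betas": [0.8, 0.25],
    "weight_decay": 0.001,
    "sched": {"name": "CosineAnnealing", "warmup_steps": 1000, "min_lr": 1e-5},
})

trainer.fit(model)
```

For fine-tuning rather than scratch training, a lower peak learning rate than the scratch value is usually a safer starting point; treat the numbers above as a template to tune, not a prescription.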
-
Thanks for this great toolkit!
I'm currently exploring various toolkits and model architectures. So far, I have managed to get interesting results with the Conformer and RNN-T approaches, but I struggle to train a good Citrinet model from scratch. On our internal test suite, the published stt_citrinet_en_1024 yields 25% WER, and if I run one epoch of fine-tuning on our 10k-hour training set, I can get the WER down to 19%.
Now, if I use those same 10k hours with a 3k BPE tokenizer and train a similar Citrinet model from scratch, I get something closer to 45+% WER. Clearly something is wrong with my learning rate (the same dataset produces good models in other toolkits). What I find really interesting about Citrinet models is the speed: this type of model is really fast.
Let's say I have 8 V100 GPUs or 8 A100 GPUs: what would be a good combination of batch_size, learning rate, and accumulate_grad_batches to use?
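For reference, here is the arithmetic I am reasoning about; the per-GPU batch size below is just an illustration of what might fit in memory, not a measured number:

```python
# Illustration only: how per-GPU batch size, GPU count, and gradient
# accumulation combine into the effective batch size per optimizer step.
per_gpu_batch_size = 32        # assumed value for a V100/A100; depends on memory
num_gpus = 8
accumulate_grad_batches = 4

effective_batch_size = per_gpu_batch_size * num_gpus * accumulate_grad_batches
print(effective_batch_size)    # 1024 samples per optimizer step
```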