For a detailed explanation, read my articles: How Transformer Works; Build a Transformer from Scratch and Train a Transformer Model.
I chose a translation task (English to Italian) to train my Transformer model on, using the opus_books dataset from Hugging Face. Training was done on Kaggle with an NVIDIA Tesla P100 (16 GB) GPU and took 5 hours and 11 minutes for 20 epochs, with 3,638 batches per epoch.
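As a quick sketch, the dataset can be pulled with the Hugging Face datasets library roughly as shown below. The "en-it" config name and the "translation" field layout follow the library's standard conventions for translation datasets; my actual training script may structure this differently.

```python
# Minimal sketch: load the English-Italian split of opus_books from
# Hugging Face. Config and field names follow the datasets library's
# standard translation format (an assumption, not the exact training code).
from datasets import load_dataset

# opus_books exposes language pairs as configs, e.g. "en-it"
raw_dataset = load_dataset("opus_books", "en-it")

# Each example holds a "translation" dict keyed by language code
sample = raw_dataset["train"][0]
print(sample["translation"]["en"])  # English source sentence
print(sample["translation"]["it"])  # Italian target sentence
```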