The Training Environment: Athena
Training Data: A randomly selected subset of 1000 samples from the HKUST training dataset.
Network: LAS
Primary Network Configuration: NUM_EPOCHS = 1, BATCH_SIZE = 10
We measured how the training time changes with the number of servers and GPUs when training with Horovod + TensorFlow, while keeping the training data, network structure, and all other settings the same; each run trained for one epoch (a minimal sketch of the Horovod setup is given after the results). The experimental results are shown below:
Servers and GPUs | 1S - 1GPU | 1S - 2GPU | 1S - 3GPU | 1S - 4GPU | 2S - 2GPU | 2S - 4GPU | 2S - 6GPU | 2S - 8GPU |
---|---|---|---|---|---|---|---|---|
Training time (s / epoch) | 121.409 | 83.111 | 61.607 | 54.507 | 82.486 | 49.888 | 33.333 | 28.101 |
- As shown in the table above, the training time decreases as more GPUs are used. The speedup with four GPUs is about 2.2× that of a single GPU (121.409 s vs. 54.507 s per epoch).
- The communication overhead between different servers is very small when using Horovod. We trained a model with the same structure using one server with 2 GPUs and using two servers with 1 GPU each; the total training time was almost the same (83.111 s vs. 82.486 s per epoch).
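
For reference, the sketch below shows how this kind of data-parallel run is typically set up with Horovod + TensorFlow 2. This is not the Athena training script itself: the model, optimizer, and dataset here are placeholders, and only the Horovod calls (`hvd.init`, GPU pinning via `hvd.local_rank`, `hvd.DistributedGradientTape`, `hvd.broadcast_variables`) reflect the actual API.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod and pin each worker process to one local GPU.
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model and loss; the real experiment trains the Athena LAS model.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
# Scale the learning rate by the worker count, as Horovod recommends.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size())


@tf.function
def training_step(features, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True))
    # Average gradients across all workers before applying them.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    # After the first step, broadcast initial state from rank 0
    # so that every worker starts from identical weights.
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss


# Dummy data standing in for the 1000-sample HKUST subset (BATCH_SIZE = 10).
# Sharding gives each worker a distinct slice, which is where the
# per-epoch speedup with more GPUs comes from.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1000, 20]),
     tf.random.uniform([1000], maxval=10, dtype=tf.int64))
).shard(hvd.size(), hvd.rank()).batch(10)

for step, (features, labels) in enumerate(dataset):
    loss = training_step(features, labels, step == 0)
    if hvd.rank() == 0 and step % 10 == 0:
        print(f'step {step}: loss = {float(loss):.4f}')
```

The two 2-GPU topologies compared above would then be launched with commands such as the following (hostnames are placeholders):

```
# 1 server, 2 GPUs
horovodrun -np 2 -H localhost:2 python train.py

# 2 servers, 1 GPU each
horovodrun -np 2 -H server1:1,server2:1 python train.py
```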