This Repo contains code for Distributed ResNet Training and scripts to submit distributed tasks in slurm system, specific to multiple machine each having one GPU card. I use the official resnet model provided by google and wrap it with my distributed code using SyncReplicaOptimzor. Some modifications from official model in r1.4 are made to fit r1.3 version Tensorflow.
I met the same problem with SyncReplicaOptimzor as mentioned in
Results with this code:
- Cifar-10 global batch size = 128, evaluation results with test data are as following. A. One CPU with 4 Titan Xp GPU
CIFAR-10 Model | Horovod Best Precision | #node | steps | speed (stp/sec) |
---|---|---|---|---|
50 layer | 93.3% | 4 | ~90k | 21.82 |
Each node is a P100 GPU.
CIFAR-10 Model | TF Best Precision | PS-WK | Steps | Speed (stp/sec) | Horovod Best Prec. | #node | speed |
---|---|---|---|---|---|---|---|
50 layer | 93.6% | local | ~80k | 13.94 | |||
50 layer | 85.2% | 1ps-1wk | ~80k | 10.19 | |||
50 layer | 86.4% | 2ps-4wk | ~80k | 20.3 | |||
50 layer | 87.3% | 4ps-8wk | ~60k | 19.19 | - | 8 | 28.66 |
The eval best precisions are illustrated in the following picture. Jumps in curves are due to restart evaluation from checkpoint, which will loss previous best precision values and shows sudden drop of curves in picture.
Distributed Versions get lower eval accuracy results as provided in Tensorflow Model Research
- ImageNet We set global batch size as 128*8 = 1024. Follows the Hyperparameter settting in Intel-Caffe, i.e. sub-batch-size is 128 for each node. Runing out of memory warning will occure for 128 sub-batch-size.
Model Layer | Batch Size | TF Best Precision | PS-WK | Steps | Speed (stp/sec) | Horovod Best Prec. | #node | speed |
---|---|---|---|---|---|---|---|---|
50 | 128 | 62.6% | 8-ps-8wk | ~76k | 0.93 | |||
50 | 128 | 64.4% | 4-ps-8wk | ~75k | 0.90 | |||
50 | 64 | - | 1-ps-1wk | - | 1.56 | |||
50 | 32 | - | 1-ps-1wk | - | 2.20 | |||
50 | 128 | - | 1-ps-1wk | - | 0.96 | |||
50 | 128 | - | 8-ps-128wk | - | 0.285 | |||
50 | 32 | - | 8-ps-128wk | - | 0.292 |
Also get lower eval accuracy values.
Prerequists
-
Install TensorFlow
-
Download ImageNet Dataset To avoid the error raised from unrecognition of the relative directory path, the following modification should made in download_and_preprocess_imagenet.sh. replace
WORK_DIR="$0.runfiles/inception/inception"
with
WORK_DIR="$(realpath -s "$0").runfiles/inception/inception"
After few days, you will see the following data in your data path. Due to the file system of Daint dose not support storage of millions of files, you have to deleted raw-data directory.
- Download CIFAR-10/CIFAR-100 dataset.
curl -o cifar-10-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
curl -o cifar-100-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz
<b>Related papers:</b>
Identity Mappings in Deep Residual Networks
https://arxiv.org/pdf/1603.05027v2.pdf
Deep Residual Learning for Image Recognition
https://arxiv.org/pdf/1512.03385v1.pdf
Wide Residual Networks
https://arxiv.org/pdf/1605.07146v1.pdf