Distributed ResNet on Cifar and Imagenet Dataset.

This Repo contains code for Distributed ResNet Training and scripts to submit distributed tasks in slurm system, specific to multiple machine each having one GPU card. I use the official resnet model provided by google and wrap it with my distributed code using SyncReplicaOptimzor. Some modifications from official model in r1.4 are made to fit r1.3 version Tensorflow.

Problem occured:

I met the same problem with SyncReplicaOptimzor as mentioned in

github issue

stackoverflow

Results with this code:

Cifar-10 global batch size = 128, evaluation results with test data are as following. A. One CPU with 4 Titan Xp GPU

CIFAR-10 Model	Horovod Best Precision	#node	steps	speed (stp/sec)
50 layer	93.3%	4	~90k	21.82

Each node is a P100 GPU.

CIFAR-10 Model	TF Best Precision	PS-WK	Steps	Speed (stp/sec)	Horovod Best Prec.	#node	speed
50 layer	93.6%	local	~80k	13.94
50 layer	85.2%	1ps-1wk	~80k	10.19
50 layer	86.4%	2ps-4wk	~80k	20.3
50 layer	87.3%	4ps-8wk	~60k	19.19	-	8	28.66

The eval best precisions are illustrated in the following picture. Jumps in curves are due to restart evaluation from checkpoint, which will loss previous best precision values and shows sudden drop of curves in picture.

Distributed Versions get lower eval accuracy results as provided in Tensorflow Model Research

ImageNet We set global batch size as 128*8 = 1024. Follows the Hyperparameter settting in Intel-Caffe, i.e. sub-batch-size is 128 for each node. Runing out of memory warning will occure for 128 sub-batch-size.

Model Layer	Batch Size	TF Best Precision	PS-WK	Steps	Speed (stp/sec)
50	128	62.6%	8-ps-8wk	~76k	0.93
50	128	64.4%	4-ps-8wk	~75k	0.90
50	64	-	1-ps-1wk	-	1.56
50	32	-	1-ps-1wk	-	2.20
50	128	-	1-ps-1wk	-	0.96
50	128	-	8-ps-128wk	-	0.285
50	32	-	8-ps-128wk	-	0.292

Also get lower eval accuracy values.

Usage

Prerequists

Install TensorFlow
Download ImageNet Dataset To avoid the error raised from unrecognition of the relative directory path, the following modification should made in download_and_preprocess_imagenet.sh. replace

WORK_DIR="$0.runfiles/inception/inception"

with

WORK_DIR="$(realpath -s "$0").runfiles/inception/inception"

After few days, you will see the following data in your data path. Due to the file system of Daint dose not support storage of millions of files, you have to deleted raw-data directory.

Download CIFAR-10/CIFAR-100 dataset.

curl -o cifar-10-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
curl -o cifar-100-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz


<b>Related papers:</b>

Identity Mappings in Deep Residual Networks

https://arxiv.org/pdf/1603.05027v2.pdf

Deep Residual Learning for Image Recognition

https://arxiv.org/pdf/1512.03385v1.pdf

Wide Residual Networks

https://arxiv.org/pdf/1605.07146v1.pdf

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
__pycache__		__pycache__
data		data
mkl-scripts		mkl-scripts
models		models
results		results
scripts/backup		scripts/backup
test		test
utils		utils
README.md		README.md
cifar_input.py		cifar_input.py
logist_model.py		logist_model.py
ps1workers1.csv		ps1workers1.csv
resnet_cifar_eval.py		resnet_cifar_eval.py
resnet_cifar_frozen_model.py		resnet_cifar_frozen_model.py
resnet_cifar_main.py		resnet_cifar_main.py
resnet_cifar_predict.ipynb		resnet_cifar_predict.ipynb
resnet_cifar_predict.py		resnet_cifar_predict.py
resnet_cifar_predict_from_pd.py		resnet_cifar_predict_from_pd.py
resnet_cifar_train.py		resnet_cifar_train.py
resnet_cifar_train_horovod.py		resnet_cifar_train_horovod.py
resnet_imagenet_eval.py		resnet_imagenet_eval.py
resnet_imagenet_horovod_train.py		resnet_imagenet_horovod_train.py
resnet_imagenet_main.py		resnet_imagenet_main.py
resnet_imagenet_predict.ipynb		resnet_imagenet_predict.ipynb
resnet_imagenet_train.py		resnet_imagenet_train.py
resnet_model.py		resnet_model.py
resnet_model_official.py		resnet_model_official.py
resnet_single.py		resnet_single.py
start-macvlan-2host-cmd.sh		start-macvlan-2host-cmd.sh
start-macvlan-2host.sh		start-macvlan-2host.sh
start-resnet-cifar-eval.sh		start-resnet-cifar-eval.sh
start-resnet-cifar-horovod-train.sh		start-resnet-cifar-horovod-train.sh
start-resnet-cifar-main.sh		start-resnet-cifar-main.sh
start-resnet-cifar-predict.sh		start-resnet-cifar-predict.sh
start-resnet-cifar-train.sh		start-resnet-cifar-train.sh
start-resnet-imagenet-eval.sh		start-resnet-imagenet-eval.sh
start-resnet-imagenet-horovod-train.sh		start-resnet-imagenet-horovod-train.sh
start-resnet-imagenet-main.sh		start-resnet-imagenet-main.sh
start-resnet-imagenet-train-2.sh		start-resnet-imagenet-train-2.sh
stop-2.sh		stop-2.sh
stop.sh		stop.sh
tf_saver.py		tf_saver.py
vgg_preprocessing.py		vgg_preprocessing.py

michaelwfc/distributed-tensorflow-resnet

Folders and files

Latest commit

History

Repository files navigation

Distributed ResNet on Cifar and Imagenet Dataset.

Problem occured:

Usage

About

Resources

Stars

Watchers

Forks

Languages