[resnet50v1.5/tensorflow] Training performance cannot reproduce #837

@Haijunlv

Related to Model/Framework(s)
resnet50v1.5/tensorflow

Describe the bug
Mixed precision + XLA + batch size 256 on 1 GPU reaches only 786 img/s.
Mixed precision + XLA + batch size 256 on 4 GPUs reaches 3068 img/s.
Both are much slower than the benchmark figure of 1270 img/s (1 GPU).
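
To put the gap in numbers, here is a small sketch that only does arithmetic on the three throughput figures quoted above. The function name `scaling_efficiency` is made up for illustration and is not part of the repository.

```python
# Throughput figures quoted in this report (img/s).
measured_1gpu = 786.0
measured_4gpu = 3068.0
reference_1gpu = 1270.0  # benchmark figure cited above for 1 GPU


def scaling_efficiency(multi_gpu_rate, single_gpu_rate, num_gpus):
    """Ratio of measured multi-GPU throughput to ideal linear scaling."""
    return multi_gpu_rate / (single_gpu_rate * num_gpus)


# How much of the reference single-GPU throughput we actually reach.
fraction_of_reference = measured_1gpu / reference_1gpu

# How close the 4-GPU run is to 4x our own single-GPU run.
efficiency_4gpu = scaling_efficiency(measured_4gpu, measured_1gpu, 4)

print(f"1 GPU reaches {fraction_of_reference:.1%} of the reference")
print(f"4-GPU scaling efficiency: {efficiency_4gpu:.1%}")
```

The single-GPU run reaches roughly 62% of the reference, while the 4-GPU run scales at roughly 98% of linear. So multi-GPU scaling looks fine, and the shortfall appears to be in per-GPU throughput.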

To Reproduce
Steps to reproduce the behavior:
We did not change the code; we start the job with the command below.

mpiexec --allow-run-as-root --bind-to none --map-by slot -np 2 python main.py \
    --mode=training_benchmark \
    --use_xla \
    --warmup_steps 200 \
    --num_iter 500 \
    --iter_unit batch \
    --batch_size 256 \
    --data_dir=/ssd2/imagenet/tfrecord/train \
    --results_dir=${work_dirs} \
    --use_tf_amp \
    --use_static_loss_scaling \
    --loss_scale=128

We start the job with mpiexec --allow-run-as-root --bind-to none --map-by slot -np xxx because --bind-to socket causes a "failed to bind memory" warning and reduces speed.
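
For reference, the launch line can be parameterized by process count as in the sketch below. It only assembles the argument list from the flags shown above and does not run anything. The helper name `build_launch_command` and the `/tmp/results` directory are placeholders, not part of the repository or of our setup.

```python
import shlex


def build_launch_command(num_procs, data_dir, results_dir, batch_size=256):
    """Assemble the mpiexec + main.py argument list used in this report."""
    mpi_prefix = [
        "mpiexec", "--allow-run-as-root",
        "--bind-to", "none",      # "socket" gave the memory-binding warning
        "--map-by", "slot",
        "-np", str(num_procs),    # one process per GPU
    ]
    training_args = [
        "python", "main.py",
        "--mode=training_benchmark",
        "--use_xla",
        "--warmup_steps", "200",
        "--num_iter", "500",
        "--iter_unit", "batch",
        "--batch_size", str(batch_size),
        f"--data_dir={data_dir}",
        f"--results_dir={results_dir}",
        "--use_tf_amp",
        "--use_static_loss_scaling",
        "--loss_scale=128",
    ]
    return mpi_prefix + training_args


# Placeholder results directory; substitute your own.
cmd = build_launch_command(4, "/ssd2/imagenet/tfrecord/train", "/tmp/results")
print(shlex.join(cmd))
```

Building the command as a list keeps the 1-GPU and 4-GPU runs identical apart from the `-np` value, which makes the two throughput figures easier to compare.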

Starting the job with our command also produces one warning, which suggests that our environment does not have openib.
I do not know whether this warning reduces speed.
[screenshot of the warning]

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): nvcr.io/nvidia/tensorflow:20.06-tf1-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): 4x Tesla V100-SXM2-32GB
  • CUDA driver version (e.g. 418.67): 450.80.02, CUDA 11
  • CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz * 16
  • Code: NGC-20.06.5 official code

Log: [log.txt](url)
