Performance comparison: AMD with ROCm vs NVIDIA with cuDNN? #173

Open
NIIAS3050 opened this issue Sep 20, 2018 · 115 comments

@NIIAS3050
commented Sep 20, 2018

It would be very useful to compare real training performance on AMD and NVIDIA cards.
For NVIDIA cards we have a lot of graphs and tests, for example:
https://github.com/u39kun/deep-learning-benchmark
But for AMD cards there are no performance metrics.
It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.

@pricebenjamin

commented Nov 8, 2018

If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the benchmarks/scripts/tf_cnn_benchmarks directory provides some example usage.

Those scripts were used for the benchmarks shown on TensorFlow's website.

I've run the following on a Vega FE (tensorflow-rocm==1.11.0 and rocm-dkms==1.9.211).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

This yields the following.

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 182.2 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 182.3 +/- 0.1 (jitter = 0.2)	8.170
20	images/sec: 182.3 +/- 0.1 (jitter = 0.3)	8.247
30	images/sec: 182.1 +/- 0.1 (jitter = 0.3)	8.369
40	images/sec: 182.0 +/- 0.1 (jitter = 0.4)	8.401
50	images/sec: 181.9 +/- 0.1 (jitter = 0.5)	8.147
60	images/sec: 181.8 +/- 0.1 (jitter = 0.6)	8.340
70	images/sec: 181.6 +/- 0.1 (jitter = 0.7)	8.120
80	images/sec: 181.3 +/- 0.2 (jitter = 0.9)	8.415
90	images/sec: 180.5 +/- 0.3 (jitter = 1.1)	8.278
100	images/sec: 179.5 +/- 0.4 (jitter = 1.4)	8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------

For comparison, here is the same command run on a Tesla P100-PCIE-16GB (CUDA==9.2, cuDNN==7.1.4, and tf.__version__ == '1.11.0'):

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 248.6 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 248.6 +/- 0.2 (jitter = 0.6)	8.164
20	images/sec: 248.5 +/- 0.1 (jitter = 0.8)	8.251
30	images/sec: 248.4 +/- 0.1 (jitter = 0.7)	8.355
40	images/sec: 248.3 +/- 0.1 (jitter = 0.6)	8.417
50	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.152
60	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.353
70	images/sec: 248.1 +/- 0.1 (jitter = 0.7)	8.109
80	images/sec: 247.7 +/- 0.1 (jitter = 0.8)	8.405
90	images/sec: 247.5 +/- 0.1 (jitter = 0.9)	8.266
100	images/sec: 247.2 +/- 0.2 (jitter = 1.2)	8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------

Bear in mind, I haven't done anything to try and optimize performance on the Vega FE. These are essentially "out-of-the-box" results.
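
To compare runs like the two above without eyeballing the logs, a small parser can pull out the final summary line. This is only a sketch: it assumes the log ends with a "total images/sec: N" line exactly as printed above, and the helper name total_images_per_sec is made up for illustration (it is not part of tf_cnn_benchmarks).

```python
import re

# Matches the summary line printed at the end of a run, as seen in the logs above.
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9.]+)\s*$", re.MULTILINE)


def total_images_per_sec(log_text):
    """Return the 'total images/sec' value from a benchmark log, or None if absent."""
    match = TOTAL_RE.search(log_text)
    return float(match.group(1)) if match else None


# Last step line and summary line copied from the two runs above.
vega_fe_log = "100\timages/sec: 179.5 +/- 0.4 (jitter = 1.4)\t8.328\ntotal images/sec: 179.44\n"
p100_log = "100\timages/sec: 247.2 +/- 0.2 (jitter = 1.2)\t8.344\ntotal images/sec: 247.13\n"

vega_fe = total_images_per_sec(vega_fe_log)
p100 = total_images_per_sec(p100_log)
ratio = vega_fe / p100
print(f"Vega FE reaches {ratio:.1%} of the P100 throughput")
```

With these two logs the Vega FE lands at roughly 73% of the P100's throughput for this one model and batch size.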

@Mandrewoid

commented Nov 17, 2018

@pricebenjamin when I try to run that same script (I cloned the repo), I get an import error:

ImportError: No module named 'tensorflow.python.data.experimental'

@pricebenjamin

commented Nov 17, 2018

@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.

cd /path/to/benchmarks
git checkout cnn_tf_v1.11_compatible
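
If you switch TensorFlow versions often, the branch name can be derived from the version string. A rough sketch follows; it assumes other branches follow the same naming pattern as cnn_tf_v1.11_compatible, which is extrapolated from that single example, so check that the branch actually exists before relying on it. The function name benchmarks_branch is hypothetical.

```python
import re


def benchmarks_branch(tf_version):
    """Guess the benchmarks branch for a TensorFlow version string such as '1.11.0'."""
    match = re.match(r"^(\d+)\.(\d+)", tf_version)
    if match is None:
        raise ValueError(f"unrecognized TensorFlow version: {tf_version!r}")
    major, minor = match.groups()
    # Assumption: every branch is named like cnn_tf_v1.11_compatible.
    return f"cnn_tf_v{major}.{minor}_compatible"


# In practice the argument would be tf.__version__ from the installed package.
print(benchmarks_branch("1.11.0"))
```

The printed name can then be passed to git checkout as in the commands above.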
@Mandrewoid

commented Nov 17, 2018

Nice, that seems to have done it. I did not realize mainline TF had already advanced to 1.12. Rookie mistake.

@kazulittlefox

commented Nov 23, 2018

I have tried running the benchmarks in my environment (kernel 4.15, ROCm 1.9.2, TF 1.12, RX 580).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=(32|64)  \ 
--model=(alexnet|inceptionv3|vgg16|googlenet|resnet50)

The results are as follows (images/sec):

Model        batch_size=32  batch_size=64
AlexNet             397.27         518.03
InceptionV3          47.78          50.66
GoogLeNet           239.28         256.05
ResNet50             86.81          98.57

In my environment, VGG16 did not run well.
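
The (32|64) and (alexnet|...) notation in the command above is shorthand for a sweep, not something the shell understands. One way to expand it is sketched below. The model names are copied as written in the comment (note that another comment in this thread passes inception3 rather than inceptionv3, so the exact spelling the script accepts should be checked), and sweep_commands is a made-up helper name.

```python
import itertools
import shlex

# Values taken from the shorthand command above.
BATCH_SIZES = [32, 64]
MODELS = ["alexnet", "inceptionv3", "vgg16", "googlenet", "resnet50"]


def sweep_commands(batch_sizes=BATCH_SIZES, models=MODELS):
    """Yield one argv list per (model, batch size) combination."""
    for model, batch_size in itertools.product(models, batch_sizes):
        yield [
            "python", "tf_cnn_benchmarks.py",
            "--num_gpus=1",
            f"--batch_size={batch_size}",
            f"--model={model}",
        ]


# Print the commands; each argv list could also be handed to subprocess.run.
for argv in sweep_commands():
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Building argv lists rather than one big shell string keeps each run independent, so a crash in one model (as with VGG16 here) does not stop the rest of the sweep if you run them one at a time.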

@fshi98

commented Nov 30, 2018

I have tested with a Vega 64, Ubuntu 18.04, ROCm 1.9.2, and TF 1.12:

1. resnet50: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
1080ti: 212 images/sec (278 fp16)
vega64: 191 images/sec (190 fp16)
2. resnet101: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101
1080ti: 121.14 images/sec (168 fp16)
vega64: 101.15 images/sec (93 fp16). With fp16, --batch_size can be 64, whereas with fp32 a batch size of 64 crashes.
3. inception3: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3
1080ti: 140.08 images/sec (166 fp16)
vega64: 99.02 images/sec (50 fp16)
4. mobilenet: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet
1080ti: 2865 images/sec
vega64: 462 images/sec

The NVIDIA GTX 1080 Ti was tested on another machine with CUDA 10 and Ubuntu 18.04.

Two things don't add up:

  1. For mobilenet, the 1080ti result doesn't make sense.
  2. I also tested with --use_fp16, which gives a fair amount of speedup on the 1080ti. On the vega64, however, every test ends up slower with --use_fp16. This is especially true for inception3.

The Vega 64 supports native half precision, and fp16 should be a good selling point for AMD Vega, so why is it slower with fp16? I guess this is due to software support, especially ROCm. Can anyone please test with --use_fp16 and see whether they get similar results?
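
To put numbers on the fp16 behaviour described above, the fp16/fp32 ratio can be computed per card and model. The sketch below uses only the figures from this comment. One caveat: the comment does not say which batch size the resnet101 fp16 number on the vega64 was measured at, so that ratio in particular should be read loosely.

```python
# (fp32, fp16) images/sec, copied from the results in this comment.
RESULTS = {
    "resnet50": {"1080ti": (212.0, 278.0), "vega64": (191.0, 190.0)},
    "resnet101": {"1080ti": (121.14, 168.0), "vega64": (101.15, 93.0)},
    "inception3": {"1080ti": (140.08, 166.0), "vega64": (99.02, 50.0)},
}


def fp16_speedup(fp32, fp16):
    """Ratio above 1.0 means fp16 was faster than fp32."""
    return fp16 / fp32


for model, cards in RESULTS.items():
    for card, (fp32, fp16) in cards.items():
        print(f"{model:10s} {card:7s} fp16/fp32 = {fp16_speedup(fp32, fp16):.2f}x")
```

Every 1080ti ratio comes out above 1.0 and every vega64 ratio below it, with inception3 on the vega64 at about half its fp32 throughput.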

@kazulittlefox my Vega runs smoothly with vgg16 at 105 images/sec.

@Mandrewoid

commented Dec 1, 2018

@fshi98 that might be because of
#143 (comment)

@fshi98

commented Dec 1, 2018

@Mandrewoid Thanks. That may be the reason. However, my rocBLAS version is 0.14.3.0,
and I ran //tensorflow/python/kernel_tests:batch_matmul_op_test, which passed all 47 tests in 10.653s, as in #143.
I also tested and passed ROCmSoftwarePlatform/rocBLAS#340.

This may not be the same bug as #143, but there may be a performance issue.

@pricebenjamin

commented Feb 16, 2019

@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.

@sebpuetz

commented Feb 16, 2019

#288
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 190.3 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 195.7 +/- 0.9 (jitter = 3.1)	8.123
20	images/sec: 196.4 +/- 0.5 (jitter = 1.8)	8.231
30	images/sec: 196.8 +/- 0.4 (jitter = 1.1)	8.268
40	images/sec: 197.1 +/- 0.3 (jitter = 0.9)	8.355
50	images/sec: 197.2 +/- 0.2 (jitter = 0.8)	8.013
60	images/sec: 197.3 +/- 0.2 (jitter = 0.7)	8.263
70	images/sec: 196.8 +/- 0.3 (jitter = 1.1)	8.304
80	images/sec: 196.9 +/- 0.2 (jitter = 1.1)	8.228
90	images/sec: 196.9 +/- 0.2 (jitter = 0.9)	8.283
100	images/sec: 197.0 +/- 0.2 (jitter = 0.8)	8.271
----------------------------------------------------------------
total images/sec: 196.98
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Step	Img/sec	total_loss
1	images/sec: 262.9 +/- 0.0 (jitter = 0.0)	8.162
10	images/sec: 261.9 +/- 0.6 (jitter = 0.7)	8.211
20	images/sec: 260.4 +/- 0.6 (jitter = 2.6)	8.375
30	images/sec: 260.6 +/- 0.5 (jitter = 2.6)	8.264
40	images/sec: 259.6 +/- 0.6 (jitter = 3.1)	8.116
50	images/sec: 259.6 +/- 0.5 (jitter = 3.1)	8.169
60	images/sec: 259.9 +/- 0.5 (jitter = 2.6)	8.325
70	images/sec: 259.3 +/- 0.5 (jitter = 3.5)	8.374
80	images/sec: 259.4 +/- 0.4 (jitter = 3.4)	8.041
90	images/sec: 259.3 +/- 0.4 (jitter = 3.6)	8.298
100	images/sec: 259.4 +/- 0.3 (jitter = 3.5)	8.376
----------------------------------------------------------------
total images/sec: 259.29
----------------------------------------------------------------

This one made the GPU sound like a jet engine:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 216.3 +/- 0.0 (jitter = 0.0)	8.219
10	images/sec: 215.9 +/- 0.3 (jitter = 0.3)	8.289
20	images/sec: 216.0 +/- 0.2 (jitter = 0.3)	8.064
30	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.310
40	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.197
50	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.277
60	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.162
70	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.159
80	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.139
90	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.196
100	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.163
----------------------------------------------------------------
total images/sec: 215.72
----------------------------------------------------------------

FP 16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 288.2 +/- 0.0 (jitter = 0.0)	8.209
10	images/sec: 283.8 +/- 1.1 (jitter = 2.7)	8.189
20	images/sec: 284.0 +/- 0.9 (jitter = 4.6)	8.316
30	images/sec: 284.9 +/- 0.7 (jitter = 4.5)	8.195
40	images/sec: 284.5 +/- 0.6 (jitter = 4.0)	8.180
50	images/sec: 284.3 +/- 0.5 (jitter = 3.7)	8.402
60	images/sec: 285.0 +/- 0.5 (jitter = 4.8)	8.271
70	images/sec: 285.4 +/- 0.4 (jitter = 3.7)	8.134
80	images/sec: 285.7 +/- 0.4 (jitter = 2.7)	8.299
90	images/sec: 286.0 +/- 0.4 (jitter = 1.5)	8.349
100	images/sec: 286.2 +/- 0.3 (jitter = 1.4)	8.213
----------------------------------------------------------------
total images/sec: 286.17
----------------------------------------------------------------
@sunway513

commented Feb 18, 2019

@sebpuetz

commented Feb 18, 2019

Improvements across the board with TF_ROCM_FUSION_ENABLE=1. The temperature displayed in rocm-smi went above 90°C on all tests. The rocm-smi output didn't include clocks, so I can't tell whether any thermal throttling was happening.

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 208.4 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 207.6 +/- 0.5 (jitter = 0.5)	8.124
20	images/sec: 207.7 +/- 0.3 (jitter = 0.5)	8.235
30	images/sec: 207.3 +/- 0.4 (jitter = 0.4)	8.268
40	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.357
50	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.012
60	images/sec: 207.2 +/- 0.3 (jitter = 0.4)	8.248
70	images/sec: 207.1 +/- 0.3 (jitter = 0.4)	8.305
80	images/sec: 207.0 +/- 0.3 (jitter = 0.5)	8.223
90	images/sec: 205.7 +/- 0.9 (jitter = 0.5)	8.322
100	images/sec: 205.7 +/- 0.8 (jitter = 0.5)	8.268
----------------------------------------------------------------
total images/sec: 205.65
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 273.0 +/- 0.0 (jitter = 0.0)	8.171
10	images/sec: 272.6 +/- 0.9 (jitter = 1.0)	8.223
20	images/sec: 271.5 +/- 1.1 (jitter = 0.9)	8.375
30	images/sec: 272.0 +/- 0.8 (jitter = 0.9)	8.282
40	images/sec: 272.1 +/- 0.6 (jitter = 0.9)	8.122
50	images/sec: 272.1 +/- 0.6 (jitter = 0.8)	8.144
60	images/sec: 272.0 +/- 0.5 (jitter = 0.8)	8.333
70	images/sec: 271.5 +/- 0.5 (jitter = 1.0)	8.357
80	images/sec: 271.2 +/- 0.5 (jitter = 1.3)	8.034
90	images/sec: 271.2 +/- 0.4 (jitter = 1.3)	8.289
100	images/sec: 270.9 +/- 0.4 (jitter = 1.5)	8.361
----------------------------------------------------------------
total images/sec: 270.81
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 227.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.6 +/- 0.5 (jitter = 2.2)	8.289
20	images/sec: 225.5 +/- 0.4 (jitter = 1.9)	8.068
30	images/sec: 225.7 +/- 0.3 (jitter = 1.8)	8.304
40	images/sec: 225.4 +/- 0.5 (jitter = 1.2)	8.183
50	images/sec: 225.5 +/- 0.4 (jitter = 1.0)	8.261
60	images/sec: 225.6 +/- 0.4 (jitter = 1.1)	8.203
70	images/sec: 225.6 +/- 0.3 (jitter = 1.1)	8.165
80	images/sec: 225.6 +/- 0.3 (jitter = 1.0)	8.168
90	images/sec: 225.7 +/- 0.3 (jitter = 1.0)	8.196
100	images/sec: 225.6 +/- 0.2 (jitter = 1.1)	8.138
----------------------------------------------------------------
total images/sec: 225.62
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 302.0 +/- 0.0 (jitter = 0.0)	8.213
10	images/sec: 300.2 +/- 0.5 (jitter = 1.5)	8.181
20	images/sec: 298.7 +/- 0.8 (jitter = 2.5)	8.324
30	images/sec: 297.7 +/- 0.8 (jitter = 2.2)	8.197
40	images/sec: 297.7 +/- 0.6 (jitter = 3.0)	8.173
50	images/sec: 297.9 +/- 0.6 (jitter = 3.0)	8.400
60	images/sec: 297.9 +/- 0.5 (jitter = 3.0)	8.267
70	images/sec: 298.4 +/- 0.5 (jitter = 2.8)	8.140
80	images/sec: 298.6 +/- 0.4 (jitter = 2.7)	8.283
90	images/sec: 298.6 +/- 0.4 (jitter = 2.8)	8.337
100	images/sec: 298.7 +/- 0.4 (jitter = 2.6)	8.208
----------------------------------------------------------------
total images/sec: 298.60
----------------------------------------------------------------
@sunway513

commented Feb 18, 2019

Hi @sebpuetz, thanks for the update!
However, the performance numbers don't seem right.
Can you give me the VBIOS version of your board? The following command should do:
/opt/rocm/bin/rocm-smi -v

@sebpuetz

commented Feb 18, 2019

/opt/rocm/bin/rocm-smi -v 
GPU[0] 		: VBIOS version: 113-D3600200-105
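
For anyone collecting VBIOS versions from several machines, the line above is easy to parse. This is a sketch based only on the two output shapes that appear in this thread (with and without the "GPU[n] :" prefix); other rocm-smi versions may print something different, and parse_vbios is a hypothetical helper name.

```python
import re

# Handles "GPU[0] : VBIOS version: X" as well as a bare "VBIOS version: X".
VBIOS_RE = re.compile(
    r"^(?:GPU\[(\d+)\]\s*:\s*)?VBIOS version:\s*(\S+)\s*$", re.MULTILINE
)


def parse_vbios(output):
    """Map GPU index (None when the prefix is missing) to the VBIOS version string."""
    versions = {}
    for index, version in VBIOS_RE.findall(output):
        # findall yields an empty string for the optional group when it is absent.
        versions[int(index) if index else None] = version
    return versions


print(parse_vbios("GPU[0] \t\t: VBIOS version: 113-D3600200-105\n"))
```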
@WrightChen

commented Feb 19, 2019

Radeon RX Vega 64
memoryClockRate (GHz) 1.63
Total memory: 7.98GiB
Free memory: 7.73GiB
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip

For some models, the TF_ROCM_FUSION_ENABLE=1 option doesn't change much, so I'm not giving FUSION = 1 results for them. Due to lack of memory, some models can't run with batch_size=128.

                            ResNet50  AlexNet  Inception v3   VGG16  GoogLeNet  ResNet152
batch_size=512                     /  1573.01             /       /          /          /
batch_size=256                     /  1420.65             /       /          /          /
batch_size=128                     /  1345.73             /       /     498.73          /
batch_size=64                 190.58  1151.98        103.82  101.95     474.07          /
batch_size=32                 171.70   971.85         98.50   91.80     424.32      68.71
batch_size=128; FUSION = 1         /        /             /       /          /          /
batch_size=64; FUSION = 1     208.78        /        109.66       /          /          /
batch_size=32; FUSION = 1     187.76        /        105.20       /          /      75.81
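
For the cells where both a plain and a FUSION = 1 number are given, the gain from fusion can be worked out directly. A small sketch using only values from the table above (assumed to be images/sec, as reported by tf_cnn_benchmarks); percent_gain is just an illustrative helper.

```python
# (without fusion, with TF_ROCM_FUSION_ENABLE=1), keyed by (model, batch size).
PAIRS = {
    ("ResNet50", 64): (190.58, 208.78),
    ("ResNet50", 32): (171.70, 187.76),
    ("Inception v3", 64): (103.82, 109.66),
    ("Inception v3", 32): (98.50, 105.20),
    ("ResNet152", 32): (68.71, 75.81),
}


def percent_gain(before, after):
    """Relative improvement of 'after' over 'before', in percent."""
    return (after - before) / before * 100.0


for (model, batch_size), (base, fused) in PAIRS.items():
    print(f"{model:12s} batch_size={batch_size:<3d} +{percent_gain(base, fused):.1f}%")
```

On these numbers fusion is worth roughly 5 to 10 percent on the Vega 64, depending on the model.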
@sunway513

commented Feb 21, 2019

Hi @sebpuetz, could you refresh your performance numbers using our official docker image?
If you haven't set up docker yet, the following script should do:
curl -sSL https://get.docker.com/ | sh

To run the benchmarks inside docker image:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data/imagenet/tf:/imagenet'
drun rocm/tensorflow:rocm2.1-tf1.12-python3
cd ~/benchmarks/scripts/tf_cnn_benchmarks
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Thanks for your attention, and looking forward to your updates :-)

@jimdowling

commented Feb 21, 2019

6-core Intel i7 8700 with 16GB ram, and 400GB SSD disk.
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step Img/sec total_loss
1 images/sec: 250.0 +/- 0.0 (jitter = 0.0) 8.348
10 images/sec: 248.0 +/- 1.4 (jitter = 0.7) 8.144
20 images/sec: 248.7 +/- 0.8 (jitter = 0.4) 8.440
30 images/sec: 248.8 +/- 0.6 (jitter = 0.4) 8.140
40 images/sec: 248.7 +/- 0.6 (jitter = 0.4) 8.474
50 images/sec: 248.5 +/- 0.5 (jitter = 0.4) 8.322
60 images/sec: 248.5 +/- 0.5 (jitter = 0.5) 8.317
70 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.010
80 images/sec: 248.4 +/- 0.4 (jitter = 0.6) 8.272
90 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.289
100 images/sec: 248.4 +/- 0.3 (jitter = 0.6) 8.108

total images/sec: 248.34

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 265.1 +/- 0.0 (jitter = 0.0) 8.324
10 images/sec: 264.3 +/- 0.5 (jitter = 0.3) 8.168
20 images/sec: 264.5 +/- 0.3 (jitter = 0.2) 8.261
30 images/sec: 264.4 +/- 0.3 (jitter = 0.3) 8.377
40 images/sec: 264.2 +/- 0.2 (jitter = 0.4) 8.408
50 images/sec: 264.1 +/- 0.2 (jitter = 0.5) 8.160
60 images/sec: 263.9 +/- 0.2 (jitter = 0.6) 8.341
70 images/sec: 263.8 +/- 0.2 (jitter = 0.6) 8.107
80 images/sec: 263.8 +/- 0.2 (jitter = 0.8) 8.404
90 images/sec: 263.8 +/- 0.2 (jitter = 0.7) 8.296
100 images/sec: 263.7 +/- 0.2 (jitter = 0.6) 8.348

total images/sec: 263.65

With a batch size of 256, I get out-of-memory errors.
Funnily enough, a batch size of 155 works, but it is slower.

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=155 --model=resnet50

Step Img/sec total_loss
1 images/sec: 195.3 +/- 0.0 (jitter = 0.0) 8.394
10 images/sec: 194.6 +/- 0.7 (jitter = 0.6) 8.313
20 images/sec: 194.5 +/- 0.5 (jitter = 0.6) 8.154
30 images/sec: 194.4 +/- 0.3 (jitter = 0.7) 8.249
40 images/sec: 194.5 +/- 0.3 (jitter = 0.8) 8.165
50 images/sec: 194.4 +/- 0.2 (jitter = 1.0) 8.292
60 images/sec: 194.3 +/- 0.2 (jitter = 1.0) 8.340
70 images/sec: 194.3 +/- 0.2 (jitter = 0.9) 8.268
80 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.227
90 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.257
100 images/sec: 194.1 +/- 0.2 (jitter = 0.9) 8.183

total images/sec: 194.04

@jimdowling

commented Feb 21, 2019

Leaving out TC_ROCM_FUSION_ENABLE does not make any difference.
/opt/rocm/bin/rocm-smi -v
VBIOS version: 113-D3600200-105

@jimdowling

commented Feb 21, 2019

According to this blog, https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/, the 2080Ti gets 280 images/sec and the 1080Ti gets 207 images/sec for FP32 training.

@jimdowling

commented Feb 21, 2019

One more:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 377.7 +/- 0.0 (jitter = 0.0) 8.246
10 images/sec: 375.9 +/- 2.2 (jitter = 0.7) 8.261
20 images/sec: 377.9 +/- 1.2 (jitter = 0.9) 8.279
30 images/sec: 378.3 +/- 0.9 (jitter = 0.9) 8.365
40 images/sec: 378.2 +/- 0.7 (jitter = 0.5) 8.237
50 images/sec: 378.3 +/- 0.6 (jitter = 0.4) 8.295
60 images/sec: 378.4 +/- 0.5 (jitter = 0.4) 8.203
70 images/sec: 378.4 +/- 0.5 (jitter = 0.5) 8.129
80 images/sec: 377.9 +/- 0.6 (jitter = 0.6) 8.264
90 images/sec: 378.0 +/- 0.5 (jitter = 0.8) 8.163
100 images/sec: 377.9 +/- 0.5 (jitter = 0.8) 8.239

total images/sec: 377.79

@Sumenia

commented Feb 21, 2019

@jimdowling that's some impressive perf !

@sebpuetz

commented Feb 21, 2019

@jimdowling these numbers seem substantially higher than the ones I got. What OS and kernel are you on?

@sebpuetz

commented Feb 21, 2019

Hi,
I executed the benchmarks in the docker container:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 229.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.4 +/- 0.8 (jitter = 2.7)	8.289
20	images/sec: 225.9 +/- 0.5 (jitter = 3.6)	8.054
30	images/sec: 226.6 +/- 0.4 (jitter = 2.1)	8.313
40	images/sec: 226.9 +/- 0.3 (jitter = 0.8)	8.187
50	images/sec: 227.2 +/- 0.3 (jitter = 0.7)	8.240
60	images/sec: 227.3 +/- 0.2 (jitter = 0.5)	8.192
70	images/sec: 227.4 +/- 0.2 (jitter = 0.5)	8.143
80	images/sec: 227.6 +/- 0.2 (jitter = 0.5)	8.150
90	images/sec: 227.6 +/- 0.2 (jitter = 0.5)	8.217
100	images/sec: 227.7 +/- 0.2 (jitter = 0.5)	8.163
----------------------------------------------------------------
total images/sec: 227.66
----------------------------------------------------------------

and

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 300.8 +/- 0.0 (jitter = 0.0)	8.205
10	images/sec: 300.3 +/- 0.4 (jitter = 0.2)	8.170
20	images/sec: 300.3 +/- 0.3 (jitter = 0.5)	8.317
30	images/sec: 300.5 +/- 0.2 (jitter = 0.6)	8.201
40	images/sec: 300.6 +/- 0.2 (jitter = 0.5)	8.176
50	images/sec: 300.5 +/- 0.2 (jitter = 0.5)	8.398
60	images/sec: 300.3 +/- 0.2 (jitter = 0.5)	8.268
70	images/sec: 300.3 +/- 0.2 (jitter = 0.6)	8.140
80	images/sec: 300.4 +/- 0.2 (jitter = 0.6)	8.279
90	images/sec: 300.4 +/- 0.2 (jitter = 0.6)	8.328
100	images/sec: 300.3 +/- 0.2 (jitter = 0.6)	8.214
----------------------------------------------------------------
total images/sec: 300.29
----------------------------------------------------------------

@sunway513 these numbers are still pretty far away from what @jimdowling got, do you see a reason for this to happen?

@jimdowling

commented Feb 21, 2019

Ubuntu 18.04. Python 2.7. Kernel is 4.15.
I was not running Docker - bare metal.

@sunway513

commented Feb 21, 2019

Hi @jimdowling, thanks for posting! However, there seems to be a typo in your script (TC_ROCM_FUSION_ENABLE instead of TF_ROCM_FUSION_ENABLE), so TF fusion is not actually enabled there. Could you try the following commands again?
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
If fusion is enabled, you should see the following message at run time:
2019-02-21 13:41:32.304325: I tensorflow/core/graph/gpu_fusion_pass.cc:454] ROCm Fusion is enabled.
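
A misspelled variable name fails silently, so it can help to check both the environment and the log. The sketch below assumes the variable is enabled with the value 1 (the only value used in this thread) and that the log message is worded exactly as quoted above; the helper names are hypothetical.

```python
import os

FUSION_VAR = "TF_ROCM_FUSION_ENABLE"
FUSION_LOG_MARKER = "ROCm Fusion is enabled."


def fusion_requested(environ=None):
    """True if the correctly spelled variable is set to '1'."""
    environ = os.environ if environ is None else environ
    return environ.get(FUSION_VAR) == "1"


def misspelled_fusion_vars(environ=None):
    """Return look-alike variable names such as TC_ROCM_FUSION_ENABLE."""
    environ = os.environ if environ is None else environ
    return sorted(
        name for name in environ
        if name.endswith("_ROCM_FUSION_ENABLE") and name != FUSION_VAR
    )


def fusion_confirmed(log_text):
    """True if the run log contains the message quoted above."""
    return FUSION_LOG_MARKER in log_text


env = {"TC_ROCM_FUSION_ENABLE": "1"}  # the typo from the earlier comment
print(fusion_requested(env), misspelled_fusion_vars(env))
```

Checking the log with fusion_confirmed is the more reliable of the two, since it reflects what TensorFlow actually did rather than what was requested.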

@sunway513

commented Feb 21, 2019

Hi @sebpuetz, thanks for your updated numbers with docker!
In a parallel issue, you mentioned that your system is Linux Mint 19.1. Is that the same OS you ran the benchmarks on? May I know the kernel and driver versions of your configuration? The following commands would help:
uname -a
apt list --installed | grep rock-dkms
I believe your user-space components are properly configured, since you got similar perf numbers using our official docker image. The VBIOS version is good as well. We need to look into the kernel and firmware.

@sebpuetz

commented Feb 21, 2019

Hi @sunway513 ,
I ran all benchmarks on Linux Mint 19.1

uname -a
Linux seb-desktop 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
apt list --installed | grep rock-dkms
rock-dkms/Ubuntu 16.04,now 2.1-96 all [installed]

Linux Mint 19.1 is based on Ubuntu 18.04, so this looks like a mismatch here?

@ghostplant

commented Feb 21, 2019

@sunway513

I am also using an RX Vega 64, but I get this warning:

2019-02-21 14:26:23.732074: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:27.702436: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:29.084753: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:33.818470: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:33.839322: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter

And the performance is about 10% lower than in others' benchmarks:

Step    Img/sec total_loss
1       images/sec: 182.8 +/- 0.0 (jitter = 0.0)        8.217
10      images/sec: 187.2 +/- 0.9 (jitter = 0.7)        8.122
20      images/sec: 187.3 +/- 0.5 (jitter = 0.7)        8.229
30      images/sec: 187.1 +/- 0.4 (jitter = 0.9)        8.264
40      images/sec: 187.0 +/- 0.4 (jitter = 0.9)        8.347
50      images/sec: 187.0 +/- 0.3 (jitter = 1.1)        8.014
60      images/sec: 187.0 +/- 0.3 (jitter = 1.0)        8.264
70      images/sec: 186.8 +/- 0.3 (jitter = 1.1)        8.316
80      images/sec: 186.7 +/- 0.3 (jitter = 1.1)        8.231
90      images/sec: 186.7 +/- 0.2 (jitter = 1.2)        8.305

But I would expect about 207 images/sec.
Is this caused by the warning above, and how can I fix the performance?

@sunway513

commented Apr 21, 2019

@WannaBeOCer, I was just curious about your description of the 300 W power target :-)
I believe your current software configuration is in good shape. Thanks for posting!

@robzor92

commented Apr 23, 2019

ROCm 2.2 vs 2.3 on Radeon Vega VII:

FP 16
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

ROCm 2.2 output:

Step	Img/sec	total_loss
1	images/sec: 388.3 +/- 0.0 (jitter = 0.0)	8.235
10	images/sec: 386.5 +/- 0.6 (jitter = 1.3)	8.250
20	images/sec: 386.6 +/- 0.3 (jitter = 1.3)	8.262
30	images/sec: 386.6 +/- 0.3 (jitter = 1.3)	8.371
40	images/sec: 386.4 +/- 0.2 (jitter = 1.2)	8.233
50	images/sec: 386.3 +/- 0.2 (jitter = 1.3)	8.311
60	images/sec: 386.7 +/- 0.2 (jitter = 1.6)	8.203
70	images/sec: 386.8 +/- 0.3 (jitter = 2.2)	8.111
80	images/sec: 386.7 +/- 0.3 (jitter = 2.2)	8.235
90	images/sec: 386.4 +/- 0.2 (jitter = 1.9)	8.168
100	images/sec: 386.3 +/- 0.2 (jitter = 1.8)	8.212
----------------------------------------------------------------
total images/sec: 386.14
----------------------------------------------------------------

ROCm 2.3 output:

Step	Img/sec	total_loss
1	images/sec: 410.9 +/- 0.0 (jitter = 0.0)	8.214
10	images/sec: 410.2 +/- 0.8 (jitter = 3.0)	8.175
20	images/sec: 409.4 +/- 0.5 (jitter = 2.0)	8.327
30	images/sec: 409.4 +/- 0.4 (jitter = 2.1)	8.181
40	images/sec: 409.8 +/- 0.4 (jitter = 2.4)	8.156
50	images/sec: 410.0 +/- 0.4 (jitter = 2.8)	8.397
60	images/sec: 409.9 +/- 0.3 (jitter = 3.0)	8.266
70	images/sec: 409.9 +/- 0.3 (jitter = 3.0)	8.156
80	images/sec: 410.1 +/- 0.3 (jitter = 3.1)	8.271
90	images/sec: 409.8 +/- 0.3 (jitter = 3.1)	8.321
100	images/sec: 409.9 +/- 0.3 (jitter = 3.1)	8.203
----------------------------------------------------------------
total images/sec: 409.76
----------------------------------------------------------------

FP 32
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

ROCm 2.2 output:

Step	Img/sec	total_loss
1	images/sec: 274.8 +/- 0.0 (jitter = 0.0)	8.324
10	images/sec: 274.0 +/- 0.4 (jitter = 0.7)	8.165
20	images/sec: 273.5 +/- 0.3 (jitter = 1.5)	8.253
30	images/sec: 273.3 +/- 0.2 (jitter = 1.7)	8.347
40	images/sec: 273.1 +/- 0.2 (jitter = 1.6)	8.412
50	images/sec: 272.8 +/- 0.2 (jitter = 1.4)	8.149
60	images/sec: 272.5 +/- 0.2 (jitter = 2.0)	8.326
70	images/sec: 272.5 +/- 0.2 (jitter = 1.8)	8.122
80	images/sec: 272.3 +/- 0.2 (jitter = 1.5)	8.412
90	images/sec: 272.3 +/- 0.1 (jitter = 1.5)	8.275
100	images/sec: 272.2 +/- 0.1 (jitter = 1.4)	8.329
----------------------------------------------------------------
total images/sec: 272.16
----------------------------------------------------------------

ROCm 2.3 output:

Step	Img/sec	total_loss
1	images/sec: 293.9 +/- 0.0 (jitter = 0.0)	7.972
10	images/sec: 295.5 +/- 0.4 (jitter = 0.5)	7.856
20	images/sec: 295.7 +/- 0.3 (jitter = 1.0)	7.913
30	images/sec: 295.6 +/- 0.2 (jitter = 1.1)	7.734
40	images/sec: 295.6 +/- 0.2 (jitter = 0.9)	7.968
50	images/sec: 295.4 +/- 0.1 (jitter = 1.0)	8.027
60	images/sec: 295.3 +/- 0.1 (jitter = 1.1)	7.887
70	images/sec: 295.2 +/- 0.1 (jitter = 1.1)	7.978
80	images/sec: 295.2 +/- 0.1 (jitter = 1.1)	7.811
90	images/sec: 295.1 +/- 0.1 (jitter = 1.2)	7.786
100	images/sec: 295.0 +/- 0.1 (jitter = 1.3)	7.817
----------------------------------------------------------------
total images/sec: 294.93
----------------------------------------------------------------

That gives something like a 5-10% performance increase, nice work!

It might be worth mentioning that, for compatibility reasons, I tested ROCm 2.2 with TensorFlow 1.11.0 and ROCm 2.3 with TensorFlow 1.13.1.
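
The 5-10% estimate can be checked against the totals reported above. This is plain arithmetic on the numbers in this comment. Keep in mind that the TensorFlow version also changed between the two runs, so the gain cannot be attributed to ROCm alone.

```python
# 'total images/sec' values from the four runs above (Radeon VII, batch_size=128).
TOTALS = {
    "fp16": {"rocm2.2": 386.14, "rocm2.3": 409.76},
    "fp32": {"rocm2.2": 272.16, "rocm2.3": 294.93},
}


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


for precision, totals in TOTALS.items():
    gain = percent_increase(totals["rocm2.2"], totals["rocm2.3"])
    print(f"{precision}: +{gain:.1f}% from ROCm 2.2 to ROCm 2.3")
```

Both results fall inside the quoted range: a little over 6% for fp16 and a little over 8% for fp32.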

@kinred

commented May 9, 2019

Radeon VII

Update with ROCm 2.4 and Tensorflow 1.13.3:

FP32

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size=128

70	images/sec: 311.0 +/- 0.1 (jitter = 0.6)	8.290
80	images/sec: 310.9 +/- 0.1 (jitter = 0.7)	8.306
90	images/sec: 310.8 +/- 0.1 (jitter = 0.7)	8.136
100	images/sec: 310.8 +/- 0.1 (jitter = 0.7)	8.447
----------------------------------------------------------------
total images/sec: 310.74
----------------------------------------------------------------

FP16

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size=128 --use_fp16=true

70	images/sec: 443.9 +/- 0.1 (jitter = 0.8)	8.272
80	images/sec: 443.7 +/- 0.1 (jitter = 0.8)	8.189
90	images/sec: 443.6 +/- 0.1 (jitter = 1.0)	8.293
100	images/sec: 443.5 +/- 0.1 (jitter = 1.1)	8.289
----------------------------------------------------------------
total images/sec: 443.42
----------------------------------------------------------------

Also RNN performance made a jump. Great improvements!

@sunway513

commented May 10, 2019

Hi @kinred , could you help clarify if you have changed any ROCm default settings, e.g. power target?

@kinred

commented May 13, 2019

@sunway513, I did no specific tuning.

Running Ubuntu 18.04.2 LTS (4.15.0-48-generic) bare metal with rocm-dkms 2.4.25 packages.

After running "rocm-smi -d 0 --resetprofile" I reproducibly get the same results. A log of "rocm-smi -a" is attached.

radeon_vii_rocm_smi.log

Any specific info I could look up for you?

@sunway513

commented May 13, 2019

Thanks @kinred , could you set the following option and re-collect your result?
/opt/rocm/bin/rocm-smi --setperf auto

@kinred

commented May 14, 2019

Hi @sunway513, I ran the above command and it reported:

========================ROCm System Management Interface========================
================================================================================
GPU[0] 		: Successfully set current Performance Level to auto
================================================================================
==============================End of ROCm SMI Log ==============================

I re-ran the benchmark and got similar results:

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size=128

70	images/sec: 311.3 +/- 0.1 (jitter = 0.4)	8.281
80	images/sec: 311.2 +/- 0.1 (jitter = 0.4)	8.307
90	images/sec: 311.2 +/- 0.1 (jitter = 0.4)	8.122
100	images/sec: 311.2 +/- 0.1 (jitter = 0.4)	8.447
----------------------------------------------------------------
total images/sec: 311.12
----------------------------------------------------------------

Are the results reproducible on your side?

@ffleader1

commented May 14, 2019

@kazulittlefox, or anyone else with an RX 580, could you help me by benchmarking VGG16 (TF 1.12 preferably, BS=32, FP32)? I desperately need this result for my personal work, but I have a Vega 56, so I can't produce it myself.

@NcuLz

commented Jun 17, 2019

Hi @ffleader1, I have an RX 580; here are my results for you.
Ubuntu 18.04, ROCm 2.5, TF 1.13.1
[Screenshot 2019-06-17 20-42-53]
[Screenshot 2019-06-17 20-42-40]

@Daniel451

commented Jun 26, 2019

@kinred could you (or others) add more benchmarks for the Radeon VII? How's the stability so far? Any downsides?

It would be interesting to see the performance in really deep networks that use a lot of architectural features (residual connections/concats, many different kinds of convolutions, ...). For example, could you test InceptionV3 or V4 performance?

Slightly over 300 img/s in ResNet50 sounds really good, since even the RTX 2080 Ti only reaches 326 img/s (although I only saw batch size 64 tests; the 11GB of VRAM probably does not allow for 128. One example is the Exxactcorp NVIDIA RTX 2080 Ti Benchmarks).

The 2080 Ti only has a clear lead in FP16 (over 800 img/s with batch size 128).

@sebpuetz

commented Jun 26, 2019

> @kinred could you (or others) add more benchmarks for the Radeon VII? How's the stability so far? Any downsides?

I can mostly comment on stability. I don't do image processing, so I can't give insights into the architectures/models used for those tasks.

You might want to check out #325. I opened that bug report more than 4 months ago. The last time something happened was about 6 weeks ago, but there's no fix or even an explanation in sight.

#414 describes some other (probably temperature-related) issues; I'm unsure whether that was resolved.

With the most recent ROCm update, my system becomes unresponsive on an RNN that previously worked fine. I haven't dug deeper into it, but I can get the system to respond again by killing the process.

I haven't seen many other people complaining about issues with the VII, so I guess, depending on your use-case, your mileage may vary.

@alexanderkjeldaas

commented Jul 6, 2019

Any resnet50 benchmarks for ROCm 2.5 and TF 2.0?

@WannaBeOCer

commented Jul 13, 2019

Radeon VII with ROCm 2.6 and TF 2.0

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Step	Img/sec	total_loss
1	images/sec: 400.8 +/- 0.0 (jitter = 0.0)	8.104
10	images/sec: 399.9 +/- 0.3 (jitter = 0.5)	7.757
20	images/sec: 400.0 +/- 0.2 (jitter = 0.5)	7.913
30	images/sec: 399.8 +/- 0.2 (jitter = 0.6)	7.771
40	images/sec: 399.7 +/- 0.1 (jitter = 0.5)	7.920
50	images/sec: 399.7 +/- 0.1 (jitter = 0.6)	7.886
60	images/sec: 399.7 +/- 0.1 (jitter = 0.5)	7.710
70	images/sec: 399.7 +/- 0.1 (jitter = 0.6)	8.007
80	images/sec: 399.9 +/- 0.2 (jitter = 0.6)	7.780
90	images/sec: 400.1 +/- 0.2 (jitter = 0.7)	7.798
100	images/sec: 400.1 +/- 0.2 (jitter = 0.8)	8.035
----------------------------------------------------------------
total images/sec: 399.77
----------------------------------------------------------------
@dave-fl

commented Jul 14, 2019

Is it possible to replicate the benchmarks from here and also list the hardware used? They use FP16 and FP32.

https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/

@alexanderkjeldaas

commented Jul 14, 2019

@WannaBeOCer could you run with --batch_size=128 to match the other reported numbers in this issue?

@WannaBeOCer

commented Jul 15, 2019

@alexanderkjeldaas FP16 or FP32? With or without Fusion?

@dave-fl

commented Jul 15, 2019

Shouldn’t all permutations be done?

@WannaBeOCer

commented Jul 15, 2019

@alexanderkjeldaas Here are the stock results of a Radeon VII with ROCm 2.6 and TF 2.0, running Ubuntu 18.04 with kernel 4.18.0-25.


python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 285.9 +/- 0.0 (jitter = 0.0)	7.972
10	images/sec: 285.8 +/- 0.1 (jitter = 0.1)	7.856
20	images/sec: 285.7 +/- 0.0 (jitter = 0.1)	7.913
30	images/sec: 285.7 +/- 0.0 (jitter = 0.2)	7.733
40	images/sec: 285.6 +/- 0.0 (jitter = 0.2)	7.968
50	images/sec: 285.6 +/- 0.0 (jitter = 0.2)	8.021
60	images/sec: 285.6 +/- 0.0 (jitter = 0.2)	7.896
70	images/sec: 285.6 +/- 0.0 (jitter = 0.3)	7.987
80	images/sec: 285.6 +/- 0.0 (jitter = 0.2)	7.807
90	images/sec: 285.6 +/- 0.0 (jitter = 0.2)	7.788
100	images/sec: 285.6 +/- 0.0 (jitter = 0.2)	7.823
----------------------------------------------------------------
total images/sec: 285.54
----------------------------------------------------------------



TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 304.0 +/- 0.0 (jitter = 0.0)	7.972
10	images/sec: 304.8 +/- 0.1 (jitter = 0.2)	7.856
20	images/sec: 304.8 +/- 0.1 (jitter = 0.2)	7.913
30	images/sec: 304.8 +/- 0.1 (jitter = 0.2)	7.734
40	images/sec: 304.7 +/- 0.1 (jitter = 0.2)	7.966
50	images/sec: 304.7 +/- 0.1 (jitter = 0.2)	8.029
60	images/sec: 304.7 +/- 0.0 (jitter = 0.3)	7.894
70	images/sec: 304.7 +/- 0.1 (jitter = 0.3)	7.986
80	images/sec: 304.7 +/- 0.0 (jitter = 0.3)	7.814
90	images/sec: 304.7 +/- 0.0 (jitter = 0.2)	7.791
100	images/sec: 304.7 +/- 0.0 (jitter = 0.3)	7.808
----------------------------------------------------------------
total images/sec: 304.59
----------------------------------------------------------------



python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 410.9 +/- 0.0 (jitter = 0.0)	7.876
10	images/sec: 411.0 +/- 0.2 (jitter = 0.6)	7.951
20	images/sec: 410.8 +/- 0.2 (jitter = 0.6)	7.950
30	images/sec: 410.7 +/- 0.1 (jitter = 0.6)	7.948
40	images/sec: 410.5 +/- 0.1 (jitter = 0.7)	7.954
50	images/sec: 410.4 +/- 0.1 (jitter = 0.8)	7.718
60	images/sec: 410.4 +/- 0.1 (jitter = 0.7)	7.909
70	images/sec: 410.3 +/- 0.1 (jitter = 0.6)	7.841
80	images/sec: 410.2 +/- 0.1 (jitter = 0.7)	7.965
90	images/sec: 410.2 +/- 0.1 (jitter = 0.7)	7.790
100	images/sec: 410.2 +/- 0.1 (jitter = 0.6)	7.776
----------------------------------------------------------------
total images/sec: 410.05
----------------------------------------------------------------



TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 435.4 +/- 0.0 (jitter = 0.0)	7.879
10	images/sec: 435.0 +/- 0.4 (jitter = 0.8)	7.955
20	images/sec: 435.0 +/- 0.3 (jitter = 1.1)	7.947
30	images/sec: 435.0 +/- 0.2 (jitter = 0.9)	7.948
40	images/sec: 434.8 +/- 0.2 (jitter = 0.9)	7.954
50	images/sec: 434.7 +/- 0.1 (jitter = 0.8)	7.710
60	images/sec: 434.7 +/- 0.1 (jitter = 0.8)	7.926
70	images/sec: 434.6 +/- 0.1 (jitter = 0.8)	7.841
80	images/sec: 434.5 +/- 0.1 (jitter = 0.8)	7.968
90	images/sec: 434.5 +/- 0.1 (jitter = 0.8)	7.790
100	images/sec: 434.5 +/- 0.1 (jitter = 0.8)	7.770
----------------------------------------------------------------
total images/sec: 434.31
----------------------------------------------------------------
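
A rough sketch of how the four totals above compare, using only the numbers from these runs. The helper name percent_gain is made up for illustration, and a single run per configuration says nothing about run-to-run variance.

```python
# Total images/sec copied from the four batch-size-128 runs above.
# Key: (precision, fusion enabled via TF_ROCM_FUSION_ENABLE=1)
results = {
    ("fp32", False): 285.54,
    ("fp32", True): 304.59,
    ("fp16", False): 410.05,
    ("fp16", True): 434.31,
}


def percent_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved / baseline - 1.0) * 100.0


# Gain from enabling fusion, per precision.
fusion_gain = {
    precision: percent_gain(results[(precision, False)], results[(precision, True)])
    for precision in ("fp32", "fp16")
}

# Gain from switching FP32 to FP16 with fusion left off.
fp16_gain = percent_gain(results[("fp32", False)], results[("fp16", False)])

for precision, gain in fusion_gain.items():
    print(f"{precision}: fusion adds {gain:.1f}%")
print(f"fp16 over fp32 (no fusion): {fp16_gain:.1f}%")
```

With these inputs, fusion comes out at roughly 7% for FP32 and 6% for FP16, and FP16 without fusion at roughly 44% over FP32.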
@aristeidist

commented Aug 7, 2019

Hello, here are my results with:

AMD Ryzen 2600x | X470 | 2x Radeon VII | 16GB @3200
Ubuntu | ROCm 2.6 | TensorFlow 2.0.0b1

--ResNet50-- (float32 and float16, with different batch sizes)

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=2 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu --use_fp16

1	images/sec: 649.8 +/- 0.0 (jitter = 0.0)	8.062
10	images/sec: 672.8 +/- 2.6 (jitter = 2.4)	7.859
20	images/sec: 675.6 +/- 1.6 (jitter = 4.0)	7.862
30	images/sec: 676.4 +/- 1.2 (jitter = 4.3)	7.924
40	images/sec: 676.1 +/- 1.0 (jitter = 4.3)	7.920
50	images/sec: 676.3 +/- 0.8 (jitter = 4.1)	7.910
60	images/sec: 676.4 +/- 0.7 (jitter = 3.8)	7.757
70	images/sec: 676.7 +/- 0.6 (jitter = 3.8)	7.909
80	images/sec: 676.6 +/- 0.6 (jitter = 4.0)	7.785
90	images/sec: 676.8 +/- 0.5 (jitter = 4.1)	7.934
100	images/sec: 676.3 +/- 0.5 (jitter = 4.4)	7.921
----------------------------------------------------------------
total images/sec: 676.03
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=2 --batch_size=64 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

Step	Img/sec	total_loss
1	images/sec: 505.5 +/- 0.0 (jitter = 0.0)	8.047
10	images/sec: 505.7 +/- 0.4 (jitter = 1.1)	7.920
20	images/sec: 505.0 +/- 0.4 (jitter = 1.7)	7.823
30	images/sec: 504.6 +/- 0.4 (jitter = 1.3)	8.010
40	images/sec: 504.6 +/- 0.3 (jitter = 1.3)	8.007
50	images/sec: 504.5 +/- 0.3 (jitter = 1.6)	7.822
60	images/sec: 504.2 +/- 0.3 (jitter = 1.6)	7.952
70	images/sec: 504.2 +/- 0.2 (jitter = 1.6)	7.812
80	images/sec: 503.9 +/- 0.3 (jitter = 1.7)	7.843
90	images/sec: 503.8 +/- 0.2 (jitter = 1.7)	7.957
100	images/sec: 503.7 +/- 0.2 (jitter = 1.8)	8.101
----------------------------------------------------------------
total images/sec: 503.56
----------------------------------------------------------------



TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=2 --batch_size=128 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

Step    Img/sec total_loss
1       images/sec: 531.2 +/- 0.0 (jitter = 0.0)        7.923
10      images/sec: 524.0 +/- 2.3 (jitter = 6.8)        7.900
20      images/sec: 529.1 +/- 2.0 (jitter = 8.2)        7.870
30      images/sec: 530.7 +/- 1.5 (jitter = 7.5)        7.867
40      images/sec: 530.6 +/- 1.3 (jitter = 7.4)        7.947
50      images/sec: 530.3 +/- 1.1 (jitter = 7.2)        7.789
60      images/sec: 530.8 +/- 1.0 (jitter = 7.2)        7.870
70      images/sec: 531.2 +/- 0.9 (jitter = 7.2)        7.804
80      images/sec: 531.5 +/- 0.9 (jitter = 7.2)        7.805
90      images/sec: 531.5 +/- 0.8 (jitter = 7.1)        7.804
100     images/sec: 531.6 +/- 0.8 (jitter = 6.9)        7.792
----------------------------------------------------------------
total images/sec: 531.50
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=2 --batch_size=128 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu --use_fp16

Step    Img/sec total_loss
1       images/sec: 791.5 +/- 0.0 (jitter = 0.0)        7.926
10      images/sec: 790.0 +/- 1.4 (jitter = 0.8)        7.902
20      images/sec: 789.4 +/- 1.2 (jitter = 1.2)        7.871
30      images/sec: 789.4 +/- 0.9 (jitter = 1.2)        7.867
40      images/sec: 789.0 +/- 0.8 (jitter = 1.6)        7.943
50      images/sec: 788.9 +/- 0.7 (jitter = 1.9)        7.790
60      images/sec: 788.8 +/- 0.6 (jitter = 2.2)        7.871
70      images/sec: 788.4 +/- 0.6 (jitter = 2.5)        7.806
80      images/sec: 788.4 +/- 0.5 (jitter = 2.5)        7.807
90      images/sec: 788.3 +/- 0.5 (jitter = 2.6)        7.812
100     images/sec: 787.9 +/- 0.4 (jitter = 2.5)        7.798
----------------------------------------------------------------
total images/sec: 787.76
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=2 --batch_size=256 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu --use_fp16

Step	Img/sec	total_loss
1	images/sec: 867.1 +/- 0.0 (jitter = 0.0)	7.838
10	images/sec: 868.5 +/- 0.3 (jitter = 1.1)	7.875
20	images/sec: 868.2 +/- 0.3 (jitter = 1.2)	7.929
30	images/sec: 867.9 +/- 0.3 (jitter = 1.5)	7.815
40	images/sec: 867.5 +/- 0.3 (jitter = 1.7)	7.775
50	images/sec: 866.8 +/- 0.4 (jitter = 2.0)	7.762
60	images/sec: 866.5 +/- 0.3 (jitter = 2.5)	7.755
70	images/sec: 866.0 +/- 0.3 (jitter = 2.6)	7.718
80	images/sec: 865.7 +/- 0.3 (jitter = 2.9)	7.759
90	images/sec: 865.4 +/- 0.3 (jitter = 3.0)	7.739
100	images/sec: 865.1 +/- 0.3 (jitter = 2.9)	7.695
----------------------------------------------------------------
total images/sec: 865.00
----------------------------------------------------------------

--Inception--

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=2 --batch_size=64 --model=inception3 --variable_update=parameter_server --local_parameter_device=cpu --use_fp16

Step	Img/sec	total_loss
1	images/sec: 308.5 +/- 0.0 (jitter = 0.0)	7.298
10	images/sec: 311.1 +/- 0.4 (jitter = 0.9)	7.346
20	images/sec: 311.0 +/- 0.3 (jitter = 0.6)	7.338
30	images/sec: 311.1 +/- 0.2 (jitter = 0.5)	7.325
40	images/sec: 310.9 +/- 0.2 (jitter = 0.7)	7.283
50	images/sec: 310.8 +/- 0.2 (jitter = 0.9)	7.369
60	images/sec: 310.8 +/- 0.2 (jitter = 0.9)	7.292
70	images/sec: 310.6 +/- 0.2 (jitter = 0.9)	7.315
80	images/sec: 310.5 +/- 0.1 (jitter = 1.0)	7.334
90	images/sec: 310.4 +/- 0.1 (jitter = 1.1)	7.340
100	images/sec: 310.3 +/- 0.1 (jitter = 1.1)	7.316
----------------------------------------------------------------
total images/sec: 310.26
----------------------------------------------------------------
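
For a rough idea of how well the two cards scale, the sketch below divides the 2-GPU totals above by twice a single-GPU total. The single-GPU baselines are an assumption: they are the fused batch-size-128 Radeon VII results posted earlier in this thread (ROCm 2.6, TF 2.0), measured on a different machine, so treat the output as indicative only. It also relies on --batch_size being a per-GPU value in tf_cnn_benchmarks. The function name scaling_efficiency is made up for illustration.

```python
def scaling_efficiency(multi_gpu_rate, single_gpu_rate, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return multi_gpu_rate / (single_gpu_rate * num_gpus)


# Assumed single-GPU baselines (images/sec): fused, batch size 128,
# taken from a different poster's Radeon VII earlier in the thread.
single = {"fp32": 304.59, "fp16": 434.31}

# Two-GPU totals (images/sec) from the batch-size-128 runs above.
dual = {"fp32": 531.50, "fp16": 787.76}

efficiency = {
    precision: scaling_efficiency(dual[precision], single[precision], 2)
    for precision in single
}

for precision, value in efficiency.items():
    print(f"{precision}: {value:.0%} of linear scaling")
```

Under those assumptions the two-card setup reaches somewhat under 90% of linear scaling for FP32 and about 90% for FP16; a single-GPU run on the same machine would make the comparison more reliable.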
@himanshugoel2797

commented Aug 20, 2019

Ran some benchmarks on Radeon VII, TF 1.14.1, Kernel 4.18, ROCm 2.7:

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 296.2 +/- 0.0 (jitter = 0.0)	7.972
10	images/sec: 296.4 +/- 0.1 (jitter = 0.3)	7.856
20	images/sec: 296.2 +/- 0.1 (jitter = 0.3)	7.913
30	images/sec: 296.0 +/- 0.1 (jitter = 0.3)	7.734
40	images/sec: 295.8 +/- 0.1 (jitter = 0.4)	7.971
50	images/sec: 295.7 +/- 0.1 (jitter = 0.7)	8.026
60	images/sec: 295.6 +/- 0.1 (jitter = 0.8)	7.892
70	images/sec: 295.5 +/- 0.1 (jitter = 0.7)	7.985
80	images/sec: 295.4 +/- 0.1 (jitter = 0.7)	7.804
90	images/sec: 295.3 +/- 0.1 (jitter = 0.7)	7.787
100	images/sec: 295.2 +/- 0.1 (jitter = 0.8)	7.813
----------------------------------------------------------------
total images/sec: 295.11
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 317.0 +/- 0.0 (jitter = 0.0)	7.972
10	images/sec: 317.2 +/- 0.1 (jitter = 0.3)	7.856
20	images/sec: 317.1 +/- 0.1 (jitter = 0.3)	7.913
30	images/sec: 316.9 +/- 0.1 (jitter = 0.4)	7.734
40	images/sec: 316.8 +/- 0.1 (jitter = 0.5)	7.968
50	images/sec: 316.6 +/- 0.1 (jitter = 0.5)	8.027
60	images/sec: 316.5 +/- 0.1 (jitter = 0.7)	7.896
70	images/sec: 316.4 +/- 0.1 (jitter = 0.8)	7.989
80	images/sec: 316.3 +/- 0.1 (jitter = 0.8)	7.808
90	images/sec: 316.2 +/- 0.1 (jitter = 0.8)	7.784
100	images/sec: 316.0 +/- 0.1 (jitter = 0.9)	7.808
----------------------------------------------------------------
total images/sec: 315.95
----------------------------------------------------------------


python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 432.7 +/- 0.0 (jitter = 0.0)	7.808
10	images/sec: 432.1 +/- 0.2 (jitter = 0.7)	7.884
20	images/sec: 432.0 +/- 0.1 (jitter = 0.5)	8.012
30	images/sec: 431.9 +/- 0.1 (jitter = 0.5)	7.848
40	images/sec: 431.8 +/- 0.1 (jitter = 0.5)	7.787
50	images/sec: 431.6 +/- 0.1 (jitter = 0.8)	7.866
60	images/sec: 431.4 +/- 0.1 (jitter = 0.8)	7.874
70	images/sec: 431.2 +/- 0.1 (jitter = 1.1)	7.844
80	images/sec: 431.1 +/- 0.1 (jitter = 1.2)	7.856
90	images/sec: 430.9 +/- 0.1 (jitter = 1.2)	7.857
100	images/sec: 430.7 +/- 0.1 (jitter = 1.4)	7.743
----------------------------------------------------------------
total images/sec: 430.68
----------------------------------------------------------------


TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 458.7 +/- 0.0 (jitter = 0.0)	7.806
10	images/sec: 458.0 +/- 0.2 (jitter = 0.5)	7.883
20	images/sec: 457.6 +/- 0.2 (jitter = 0.6)	8.016
30	images/sec: 457.2 +/- 0.2 (jitter = 1.0)	7.843
40	images/sec: 456.9 +/- 0.1 (jitter = 1.1)	7.790
50	images/sec: 456.7 +/- 0.1 (jitter = 1.1)	7.862
60	images/sec: 456.5 +/- 0.1 (jitter = 1.1)	7.886
70	images/sec: 456.3 +/- 0.1 (jitter = 1.0)	7.846
80	images/sec: 456.1 +/- 0.1 (jitter = 1.1)	7.847
90	images/sec: 455.9 +/- 0.1 (jitter = 1.2)	7.856
100	images/sec: 455.7 +/- 0.1 (jitter = 1.4)	7.737
----------------------------------------------------------------
total images/sec: 455.60
----------------------------------------------------------------

So it looks like there's been a bit more of a performance uplift!

@nikAizuddin

commented Aug 21, 2019

Can anyone provide benchmarks for Radeon RX5700 XT?

@himanshugoel2797

commented Aug 21, 2019

The 5700XT isn't supported in ROCm yet. I have the Anniversary Edition, but can't even really use it in Linux yet unless I use the latest kernel and its builtin driver instead of rock-dkms (which has better performance).

@ycsos

commented Sep 5, 2019

Here are my results for an RTX 5000 with TensorFlow 2.0 and CUDA 10.0.
The command line is: python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 229.6 +/- 0.0 (jitter = 0.0)	7.972
10	images/sec: 229.8 +/- 0.1 (jitter = 0.3)	7.856
20	images/sec: 229.6 +/- 0.1 (jitter = 0.3)	7.914
30	images/sec: 229.4 +/- 0.1 (jitter = 0.3)	7.733
40	images/sec: 229.3 +/- 0.1 (jitter = 0.4)	7.966
50	images/sec: 229.1 +/- 0.1 (jitter = 0.6)	8.027
60	images/sec: 229.0 +/- 0.1 (jitter = 0.7)	7.891
70	images/sec: 228.9 +/- 0.1 (jitter = 0.7)	7.991
80	images/sec: 228.7 +/- 0.1 (jitter = 0.8)	7.805
90	images/sec: 228.6 +/- 0.1 (jitter = 1.0)	7.788
100	images/sec: 228.5 +/- 0.1 (jitter = 1.0)	7.819
----------------------------------------------------------------
total images/sec: 228.49
----------------------------------------------------------------

And with FP16: python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 438.0 +/- 0.0 (jitter = 0.0)	7.873
10	images/sec: 437.2 +/- 0.2 (jitter = 0.9)	7.953
20	images/sec: 437.4 +/- 0.1 (jitter = 0.6)	7.946
30	images/sec: 437.0 +/- 0.2 (jitter = 0.9)	7.940
40	images/sec: 436.8 +/- 0.1 (jitter = 0.8)	7.960
50	images/sec: 436.6 +/- 0.1 (jitter = 0.9)	7.707
60	images/sec: 436.6 +/- 0.1 (jitter = 0.8)	7.913
70	images/sec: 436.5 +/- 0.1 (jitter = 0.8)	7.836
80	images/sec: 436.4 +/- 0.1 (jitter = 0.8)	7.960
90	images/sec: 436.3 +/- 0.1 (jitter = 0.8)	7.799
100	images/sec: 436.2 +/- 0.1 (jitter = 0.9)	7.769
----------------------------------------------------------------
total images/sec: 436.11
----------------------------------------------------------------
@jpizarrom

commented Sep 5, 2019

Hello, here are my results with:

AMD Ryzen 2600x | X370 | 16GB | 1x Radeon VII VBIOS version: 113-D3600200-106
Ubuntu 18.04 | 5.0.0-25-generic #26~18.04.1-Ubuntu
rock-dev 2.7.22 | Docker rocm/tensorflow:rocm2.6-tf1.14-python3

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Done warm up
Step	Img/sec	total_loss
1	images/sec: 273.6 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 273.3 +/- 0.2 (jitter = 0.4)	8.285
20	images/sec: 273.0 +/- 0.2 (jitter = 0.6)	8.062
30	images/sec: 272.8 +/- 0.2 (jitter = 0.7)	8.313
40	images/sec: 272.8 +/- 0.2 (jitter = 0.7)	8.162
50	images/sec: 272.7 +/- 0.1 (jitter = 0.7)	8.253
60	images/sec: 272.5 +/- 0.2 (jitter = 0.8)	8.184
70	images/sec: 272.2 +/- 0.2 (jitter = 1.0)	8.164
80	images/sec: 272.1 +/- 0.2 (jitter = 1.1)	8.144
90	images/sec: 272.0 +/- 0.2 (jitter = 1.2)	8.204
100	images/sec: 271.9 +/- 0.2 (jitter = 1.2)	8.151
----------------------------------------------------------------
total images/sec: 271.73
----------------------------------------------------------------


TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Done warm up
Step	Img/sec	total_loss
1	images/sec: 287.8 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 291.1 +/- 0.5 (jitter = 1.2)	8.282
20	images/sec: 291.1 +/- 0.4 (jitter = 1.7)	8.053
30	images/sec: 290.8 +/- 0.3 (jitter = 1.5)	8.319
40	images/sec: 290.9 +/- 0.2 (jitter = 1.4)	8.193
50	images/sec: 291.1 +/- 0.2 (jitter = 1.4)	8.249
60	images/sec: 291.0 +/- 0.2 (jitter = 1.4)	8.182
70	images/sec: 290.8 +/- 0.2 (jitter = 1.3)	8.162
80	images/sec: 290.7 +/- 0.2 (jitter = 1.2)	8.138
90	images/sec: 290.7 +/- 0.2 (jitter = 1.2)	8.203
100	images/sec: 290.7 +/- 0.1 (jitter = 1.1)	8.142
----------------------------------------------------------------
total images/sec: 290.54
----------------------------------------------------------------


python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Done warm up
Step	Img/sec	total_loss
1	images/sec: 392.8 +/- 0.0 (jitter = 0.0)	8.215
10	images/sec: 388.5 +/- 1.8 (jitter = 2.5)	8.184
20	images/sec: 388.2 +/- 1.2 (jitter = 2.6)	8.328
30	images/sec: 388.8 +/- 0.8 (jitter = 1.4)	8.179
40	images/sec: 388.7 +/- 0.7 (jitter = 1.7)	8.154
50	images/sec: 389.1 +/- 0.6 (jitter = 2.2)	8.391
60	images/sec: 389.2 +/- 0.5 (jitter = 2.3)	8.255
70	images/sec: 389.3 +/- 0.4 (jitter = 2.0)	8.133
80	images/sec: 389.0 +/- 0.4 (jitter = 2.1)	8.276
90	images/sec: 388.6 +/- 0.4 (jitter = 2.5)	8.331
100	images/sec: 388.5 +/- 0.4 (jitter = 2.6)	8.207
----------------------------------------------------------------
total images/sec: 388.38
----------------------------------------------------------------


TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Done warm up
Step	Img/sec	total_loss
1	images/sec: 403.4 +/- 0.0 (jitter = 0.0)	8.212
10	images/sec: 410.7 +/- 0.9 (jitter = 2.2)	8.182
20	images/sec: 410.4 +/- 0.7 (jitter = 2.5)	8.334
30	images/sec: 410.5 +/- 0.6 (jitter = 2.4)	8.191
40	images/sec: 410.7 +/- 0.4 (jitter = 2.2)	8.156
50	images/sec: 410.7 +/- 0.4 (jitter = 2.2)	8.384
60	images/sec: 410.4 +/- 0.4 (jitter = 2.1)	8.257
70	images/sec: 410.5 +/- 0.4 (jitter = 2.0)	8.133
80	images/sec: 410.3 +/- 0.3 (jitter = 2.1)	8.279
90	images/sec: 410.3 +/- 0.3 (jitter = 2.1)	8.329
100	images/sec: 410.2 +/- 0.3 (jitter = 2.2)	8.195
----------------------------------------------------------------
total images/sec: 410.02
----------------------------------------------------------------
@jpizarrom

commented Sep 5, 2019

Hello, here are my results with:

AMD Ryzen 2600x | X370 | 16GB | 1x Radeon VII VBIOS version: 113-D3600200-106
Ubuntu 18.04 | 5.0.0-25-generic #26~18.04.1-Ubuntu
rock-dev 2.7.22 | Docker rocm/tensorflow:rocm2.7-tf1.14-dev

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Done warm up
Step	Img/sec	total_loss
1	images/sec: 289.2 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 289.3 +/- 0.3 (jitter = 0.9)	8.285
20	images/sec: 288.9 +/- 0.2 (jitter = 0.7)	8.055
30	images/sec: 288.8 +/- 0.2 (jitter = 0.7)	8.313
40	images/sec: 288.7 +/- 0.2 (jitter = 0.7)	8.184
50	images/sec: 288.4 +/- 0.2 (jitter = 1.0)	8.269
60	images/sec: 288.3 +/- 0.2 (jitter = 0.9)	8.187
70	images/sec: 288.2 +/- 0.2 (jitter = 1.1)	8.168
80	images/sec: 288.2 +/- 0.1 (jitter = 1.0)	8.152
90	images/sec: 288.1 +/- 0.1 (jitter = 1.1)	8.195
100	images/sec: 287.9 +/- 0.1 (jitter = 1.1)	8.141
----------------------------------------------------------------
total images/sec: 287.63
----------------------------------------------------------------


TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Done warm up
Step	Img/sec	total_loss
1	images/sec: 309.0 +/- 0.0 (jitter = 0.0)	8.222
10	images/sec: 307.8 +/- 0.5 (jitter = 1.1)	8.283
20	images/sec: 308.0 +/- 0.3 (jitter = 1.1)	8.068
30	images/sec: 307.3 +/- 0.4 (jitter = 1.6)	8.324
40	images/sec: 307.2 +/- 0.3 (jitter = 1.7)	8.178
50	images/sec: 306.9 +/- 0.3 (jitter = 1.8)	8.242
60	images/sec: 306.9 +/- 0.3 (jitter = 1.7)	8.158
70	images/sec: 307.0 +/- 0.3 (jitter = 1.2)	8.176
80	images/sec: 307.0 +/- 0.2 (jitter = 1.2)	8.148
90	images/sec: 306.9 +/- 0.2 (jitter = 1.2)	8.211
100	images/sec: 306.4 +/- 0.3 (jitter = 1.5)	8.157
----------------------------------------------------------------
total images/sec: 306.35
----------------------------------------------------------------

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Done warm up
Step	Img/sec	total_loss
1	images/sec: 401.9 +/- 0.0 (jitter = 0.0)	8.212
10	images/sec: 400.0 +/- 0.9 (jitter = 2.9)	8.181
20	images/sec: 398.3 +/- 0.9 (jitter = 3.5)	8.313
30	images/sec: 396.8 +/- 0.9 (jitter = 5.3)	8.190
40	images/sec: 397.4 +/- 0.7 (jitter = 3.8)	8.155
50	images/sec: 397.5 +/- 0.6 (jitter = 3.4)	8.389
60	images/sec: 397.5 +/- 0.5 (jitter = 3.4)	8.262
70	images/sec: 397.6 +/- 0.4 (jitter = 3.0)	8.155
80	images/sec: 397.4 +/- 0.4 (jitter = 3.2)	8.279
90	images/sec: 397.3 +/- 0.4 (jitter = 3.2)	8.315
100	images/sec: 397.5 +/- 0.3 (jitter = 3.0)	8.191
----------------------------------------------------------------
total images/sec: 397.32
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Done warm up
Step	Img/sec	total_loss
1	images/sec: 419.3 +/- 0.0 (jitter = 0.0)	8.214
10	images/sec: 422.0 +/- 0.6 (jitter = 1.6)	8.178
20	images/sec: 423.5 +/- 0.5 (jitter = 2.6)	8.317
30	images/sec: 422.5 +/- 0.6 (jitter = 3.2)	8.209
40	images/sec: 422.5 +/- 0.5 (jitter = 2.9)	8.179
50	images/sec: 421.9 +/- 0.6 (jitter = 3.1)	8.395
60	images/sec: 421.9 +/- 0.5 (jitter = 3.2)	8.262
70	images/sec: 422.0 +/- 0.4 (jitter = 3.0)	8.119
80	images/sec: 422.0 +/- 0.4 (jitter = 2.6)	8.273
90	images/sec: 422.1 +/- 0.3 (jitter = 2.5)	8.328
100	images/sec: 421.7 +/- 0.4 (jitter = 2.7)	8.192
----------------------------------------------------------------
total images/sec: 421.52
----------------------------------------------------------------
@20II

commented Sep 24, 2019

When I test with a model, many warnings like this are raised:
WARNING:tensorflow:Entity <bound method BatchNormalization.call of <tensorflow.python.layers.normalization.BatchNormalization object at 0x7f828f4d65c0>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound method BatchNormalization.call of <tensorflow.python.layers.normalization.BatchNormalization object at 0x7f828f4d65c0>>: AssertionError: Bad argument number for Name: 3, expecting 4
At the end, the program crashed.
How can I fix this bug?

@sunway513

commented Sep 27, 2019

Hi @20II, the warning message you've observed is due to the upstream TensorFlow issue tensorflow#32319. Please try the following command to fix it:
pip3 install gast==0.2.2
If your program still crashes, please first try our public Docker images here:
https://hub.docker.com/r/rocm/tensorflow
Please feel free to create a dedicated issue to track this, thanks.
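
To confirm that the gast pin from the pip3 command above took effect in the Python environment that actually runs the benchmark, a small check along these lines may help. This is only a sketch: the function name gast_is_pinned and its lookup parameter are made up for illustration, and it assumes Python 3.8 or newer for importlib.metadata.

```python
from importlib import metadata


def gast_is_pinned(expected="0.2.2", lookup=metadata.version):
    """Return True if the installed gast distribution matches `expected`.

    `lookup` is a hypothetical injection point so the check can be
    exercised without touching the real environment.
    """
    try:
        installed = lookup("gast")
    except metadata.PackageNotFoundError:
        # gast is not installed at all in this environment.
        return False
    return installed == expected


if __name__ == "__main__":
    print("gast pinned to 0.2.2:", gast_is_pinned())
```

Running it with the same interpreter as the benchmark (for example python3 inside the Docker image) avoids checking a different site-packages directory by mistake.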

@mwrnd

commented Oct 16, 2019

Video Card: AMD Radeon RX 570 8GB OC (rocm-smi -v VBIOS version: 113V34122-F3)
Motherboard: MSI X570-A Pro with 32GB DDR4-3000
Processor: AMD Ryzen 5 3600X
OS: Ubuntu 18.04.2 with Kernel 4.18.0-15-generic, no apt dist-upgrade
rocm-dkms: 2.8.13 installed through apt
tensorflow-rocm: 1.14.2 installed through pip
tensorflow benchmarks: abb1aec2f2db4ba73fac2e1359227aef59b10258
tensorflow_models: 1.13.0

Command-line permutations were generated with cmds.py and log output processed with parse.py. Setup notes and a more complete summary are available. It took ~4.5h to run these benchmarks.
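
cmds.py itself is not reproduced in this comment, so the following is not that script. It is a minimal sketch of how such command-line permutations could be generated with itertools.product. The flag names are the ones used in the commands in this comment; the function name build_commands and the short model list are assumptions for illustration.

```python
import itertools

BATCH_SIZES = [16, 32, 64, 128, 256]
MODELS = ["resnet50", "vgg16", "inception3"]  # illustrative subset only


def build_commands(models, batch_sizes, data_name="imagenet", num_batches=40):
    """Return one tf_cnn_benchmarks command line per (model, batch size) pair."""
    commands = []
    for model, batch_size in itertools.product(models, batch_sizes):
        commands.append(
            "python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 "
            f"--num_batches={num_batches} --batch_size={batch_size} "
            f"--model={model} --data_name={data_name}"
        )
    return commands


if __name__ == "__main__":
    for command in build_commands(MODELS, BATCH_SIZES):
        print(command)
```

Each printed line can be run as-is and its output redirected to a per-run log file for later processing.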

imagenet dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=imagenet

na means the batch size was too large or the benchmark would not run.

model batchsize=16      32      64      128     256
trivial         3619    6778    11854   19279   26723
alexnet         193     258     311     340     355
googlenet       121     136     143     145     132
inception3      22.3    23.0    23.7    na      na
inception4      10.7    10.9    na      na      na
lenet5          3706    6212    10154   14438   17627
mobilenet       255     316     359     361     388
nasnet          7.9     8.6     8.1     8.8     na
official_ncf    1337    2657    5295    10503   20511
overfeat        56.2    74.0    87.3    94.9    98.5
resnet101       25.4    28.7    na      na      na
resnet101_v2    25.5    28.9    30.4    na      na
resnet152       17.8    20.0    na      na      na
resnet152_v2    17.9    20.1    na      na      na
resnet50        44.9    49.7    51.7    na      na
resnet50_v1.5   41.4    45.3    48.4    na      na
resnet50_v2     45.2    50.4    52.5    na      na
vgg11           33.3    37.6    40.0    41.1    na
vgg16           17.0    18.2    18.9    na      na
vgg19           13.7    14.7    15.2    na      na
ssd300/coco     13.9    14.7    14.9    na      na
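
For anyone wanting to build a similar table from their own logs, here is a minimal sketch of the log-processing step. It is not the parse.py mentioned above; it only assumes that each log contains a "total images/sec: N" line as in the outputs posted in this thread. The names total_images_per_sec and format_row are hypothetical.

```python
import re

# Matches the summary line printed at the end of a tf_cnn_benchmarks run.
TOTAL_RE = re.compile(
    r"^[ \t]*total images/sec:[ \t]*([0-9]+(?:\.[0-9]+)?)", re.MULTILINE
)


def total_images_per_sec(log_text):
    """Return the reported total as a float, or None if the run has no summary."""
    match = TOTAL_RE.search(log_text)
    return float(match.group(1)) if match else None


def format_row(model, logs_by_batch_size, batch_sizes):
    """Format one table row; runs without a total are shown as 'na'."""
    cells = []
    for batch_size in batch_sizes:
        rate = total_images_per_sec(logs_by_batch_size.get(batch_size, ""))
        cells.append("na" if rate is None else f"{rate:g}")
    return f"{model:<16}" + "".join(f"{cell:<8}" for cell in cells).rstrip()


if __name__ == "__main__":
    sample_log = "Done warm up\n----\ntotal images/sec: 51.70\n----\n"
    print(format_row("resnet50", {64: sample_log}, [64, 128]))
```

Treating a missing summary line as na covers both failure cases in the table: a run that aborted because the batch did not fit in memory and a run that never started.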

cifar10 dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=cifar10

model batchsize=16      32      64      128     256
trivial         7319    14457   27174   46782   74252
alexnet         2605    4118    5650    6875    7415
nasnet          40.3    48.0    52.9    93.6    na
resnet110       268     407     544     640     669
resnet110_v2    268     409     541     643     665
resnet20        1338    2056    2752    3278    3459
resnet20_v2     1341    2013    2684    3202    3390
resnet32        872     1343    1792    2111    2227
resnet32_v2     871     1321    1765    2083    2199
resnet44        654     994     1321    1559    1642
resnet44_v2     653     990     1311    1545    1622
resnet56        518     785     1050    1239    1296
resnet56_v2     520     780     1046    1226    1287

CPU (Ryzen 5 3600X) total images/sec:

python tf_cnn_benchmarks.py --device=CPU --batch_size={32,64,128} \
--num_batches=40 --model={model} --data_name={dataset}

model/dataset  batchsize=32     64     128
trivial/imagenet         2185   2803   4482
trivial/cifar10          32191  48349  62636
mobilenet/imagenet       188    201    207
ncf/imagenet             362    721    1453

NASNET-Large worked with a batch size of 8:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --compute_lr_on_cpu \
--batch_size=8  --num_batches=40 --model=nasnetlarge --data_name=imagenet
  [...]
  total images/sec: 0.64

DeepSpeech worked with a batch size of 16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 \
--model=deepspeech2 --data_name=librispeech
  [...]
  total images/sec: 0.50
rocm_bandwidth_test
    RocmBandwidthTest Version: 2.3.4
    Device: 0,  AMD Ryzen 5 3600X 6-Core Processor
    Device: 1,  Ellesmere [Radeon RX 470/480/570/570X/580/580X],  2d:0.0

    Unidirectional copy peak bandwidth GB/s
    D/D       0           1
    0         N/A         11.322347
    1         11.060208   99.628578

    Bdirectional copy peak bandwidth GB/s
    D/D       0           1
    0         N/A         14.792841
    1         14.792841   N/A
python all_reduce_benchmark.py --variable_update=replicated
  Average time per step: 0.000114200115204
dkms status | grep amd
  amdgpu, 2.8-13, 4.18.0-15-generic, x86_64: installed
dmesg | grep kfd
    [    3.328899] kfd kfd: Allocated 3969056 bytes on gart
    [    3.329348] kfd kfd: added device 1002:67df