Performance comparison: AMD with ROCm vs NVIDIA with cuDNN? #173

Open
NIIAS3050 opened this issue Sep 20, 2018 · 143 comments

@NIIAS3050 commented Sep 20, 2018

It would be very useful to compare real training performance on AMD and NVIDIA cards.
For NVIDIA cards we have a lot of graphs and tests, for example:
https://github.com/u39kun/deep-learning-benchmark
But for AMD cards there are no performance metrics.
It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.

@pricebenjamin commented Nov 8, 2018

If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the benchmarks/scripts/tf_cnn_benchmarks directory provides some example usage.

Those scripts were used for the benchmarks shown on TensorFlow's website.
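
For anyone starting from scratch, the setup boils down to roughly the following (a sketch; the repo URL and script path are as described above, and picking the branch that matches your TF version is covered further down this thread):

git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50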

I've run the following on a Vega FE (tensorflow-rocm==1.11.0 and rocm-dkms==1.9.211).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

This yields the following.

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 182.2 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 182.3 +/- 0.1 (jitter = 0.2)	8.170
20	images/sec: 182.3 +/- 0.1 (jitter = 0.3)	8.247
30	images/sec: 182.1 +/- 0.1 (jitter = 0.3)	8.369
40	images/sec: 182.0 +/- 0.1 (jitter = 0.4)	8.401
50	images/sec: 181.9 +/- 0.1 (jitter = 0.5)	8.147
60	images/sec: 181.8 +/- 0.1 (jitter = 0.6)	8.340
70	images/sec: 181.6 +/- 0.1 (jitter = 0.7)	8.120
80	images/sec: 181.3 +/- 0.2 (jitter = 0.9)	8.415
90	images/sec: 180.5 +/- 0.3 (jitter = 1.1)	8.278
100	images/sec: 179.5 +/- 0.4 (jitter = 1.4)	8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------

For comparison, here is the same command run on a Tesla P100-PCIE-16GB (CUDA==9.2, cuDNN==7.1.4, and tf.__version__ == '1.11.0'):

[...]
Done warm up
Step	Img/sec	total_loss
1	images/sec: 248.6 +/- 0.0 (jitter = 0.0)	8.325
10	images/sec: 248.6 +/- 0.2 (jitter = 0.6)	8.164
20	images/sec: 248.5 +/- 0.1 (jitter = 0.8)	8.251
30	images/sec: 248.4 +/- 0.1 (jitter = 0.7)	8.355
40	images/sec: 248.3 +/- 0.1 (jitter = 0.6)	8.417
50	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.152
60	images/sec: 248.2 +/- 0.1 (jitter = 0.6)	8.353
70	images/sec: 248.1 +/- 0.1 (jitter = 0.7)	8.109
80	images/sec: 247.7 +/- 0.1 (jitter = 0.8)	8.405
90	images/sec: 247.5 +/- 0.1 (jitter = 0.9)	8.266
100	images/sec: 247.2 +/- 0.2 (jitter = 1.2)	8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------

Bear in mind, I haven't done anything to try and optimize performance on the Vega FE. These are essentially "out-of-the-box" results.

@Mandrewoid commented Nov 17, 2018

@pricebenjamin when I try to run that same script (I cloned the repo) I get an import error:

ImportError: No module named 'tensorflow.python.data.experimental'

@pricebenjamin commented Nov 17, 2018

@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.

cd /path/to/benchmarks
git checkout cnn_tf_v1.11_compatible
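
If you're not sure which branch matches your install, a quick way to check your TensorFlow version first (a generic snippet, not part of the original instructions):

python -c "import tensorflow as tf; print(tf.__version__)"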
@Mandrewoid commented Nov 17, 2018

Nice, that seems to have done it. I did not realize mainline TF had already advanced to 1.12; rookie mistake.

@kazulittlefox commented Nov 23, 2018

I have tried running benchmarks on my environment (Kernel 4.15, ROCm 1.9.2, TF 1.12 with an RX 580).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=(32|64)  \ 
--model=(alexnet|inceptionv3|vgg16|googlenet|resnet50)
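
(As a sketch, those permutations can be expanded into a simple shell loop; note that the Inception flag is spelled inception3 in the commands used elsewhere in this thread:)

for model in alexnet inception3 vgg16 googlenet resnet50; do
  for bs in 32 64; do
    python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs --model=$model
  done
done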

Results are as follows:

AlexNet      batch:32  397.27/sec
             batch:64  518.03/sec
InceptionV3  batch:32   47.78/sec
             batch:64   50.66/sec
GoogLeNet    batch:32  239.28/sec
             batch:64  256.05/sec
ResNet50     batch:32   86.81/sec
             batch:64   98.57/sec

In my environment, VGG16 has not been running well.

@fshi98 commented Nov 30, 2018

I have tested with a Vega 64, Ubuntu 18.04, ROCm 1.9.2, TF 1.12:

1. resnet50: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
   1080ti: 212 images/sec (278 fp16)
   vega64: 191 images/sec (190 fp16)
2. resnet101: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101
   1080ti: 121.14 images/sec (168 fp16)
   vega64: 101.15 images/sec (93 fp16); with fp16, --batch_size can be 64, while with fp32, batch size 64 crashes
3. inception3: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3
   1080ti: 140.08 images/sec (166 fp16)
   vega64: 99.02 images/sec (50 fp16)
4. mobilenet: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet
   1080ti: 2865 images/sec
   vega64: 462 images/sec

The NVIDIA GTX 1080 Ti was tested on another machine with CUDA 10, Ubuntu 18.04.

Two values didn't add up:

  1. For mobilenet, the 1080ti result doesn't make sense.
  2. I also tested with --use_fp16, which gives a fair amount of speedup on the 1080ti. However, the vega64 ends up slower in all tests when using --use_fp16. This is especially true for inception3.

Considering that the Vega 64 supports native half precision and fp16 should be a good selling point for AMD Vega, how is it slower with fp16? I guess this is probably due to software support, especially ROCm. Can anyone else please test with --use_fp16 and see if they get similar results?

@kazulittlefox my Vega runs VGG16 smoothly at ~105 images/sec.
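
For anyone reproducing the fp16 comparison requested above, the invocation is simply the same benchmark command with the flag appended, e.g.:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16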

@Mandrewoid commented Dec 1, 2018

@fshi98 that might be because of
#143 (comment)

@fshi98 commented Dec 1, 2018

@Mandrewoid Thanks. That may be the reason. However, my rocBLAS version is 0.14.3.0,
and I ran //tensorflow/python/kernel_tests:batch_matmul_op_test, which passed all 47 tests in 10.653s as in #143.
I also tested and passed ROCmSoftwarePlatform/rocBLAS#340.

This may not be the same bug as #143, but there may be some performance issue.

@pricebenjamin commented Feb 16, 2019

@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.

@sebpuetz commented Feb 16, 2019

#288
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 190.3 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 195.7 +/- 0.9 (jitter = 3.1)	8.123
20	images/sec: 196.4 +/- 0.5 (jitter = 1.8)	8.231
30	images/sec: 196.8 +/- 0.4 (jitter = 1.1)	8.268
40	images/sec: 197.1 +/- 0.3 (jitter = 0.9)	8.355
50	images/sec: 197.2 +/- 0.2 (jitter = 0.8)	8.013
60	images/sec: 197.3 +/- 0.2 (jitter = 0.7)	8.263
70	images/sec: 196.8 +/- 0.3 (jitter = 1.1)	8.304
80	images/sec: 196.9 +/- 0.2 (jitter = 1.1)	8.228
90	images/sec: 196.9 +/- 0.2 (jitter = 0.9)	8.283
100	images/sec: 197.0 +/- 0.2 (jitter = 0.8)	8.271
----------------------------------------------------------------
total images/sec: 196.98
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Step	Img/sec	total_loss
1	images/sec: 262.9 +/- 0.0 (jitter = 0.0)	8.162
10	images/sec: 261.9 +/- 0.6 (jitter = 0.7)	8.211
20	images/sec: 260.4 +/- 0.6 (jitter = 2.6)	8.375
30	images/sec: 260.6 +/- 0.5 (jitter = 2.6)	8.264
40	images/sec: 259.6 +/- 0.6 (jitter = 3.1)	8.116
50	images/sec: 259.6 +/- 0.5 (jitter = 3.1)	8.169
60	images/sec: 259.9 +/- 0.5 (jitter = 2.6)	8.325
70	images/sec: 259.3 +/- 0.5 (jitter = 3.5)	8.374
80	images/sec: 259.4 +/- 0.4 (jitter = 3.4)	8.041
90	images/sec: 259.3 +/- 0.4 (jitter = 3.6)	8.298
100	images/sec: 259.4 +/- 0.3 (jitter = 3.5)	8.376
----------------------------------------------------------------
total images/sec: 259.29
----------------------------------------------------------------

This one made the GPU sound like a jet engine:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 216.3 +/- 0.0 (jitter = 0.0)	8.219
10	images/sec: 215.9 +/- 0.3 (jitter = 0.3)	8.289
20	images/sec: 216.0 +/- 0.2 (jitter = 0.3)	8.064
30	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.310
40	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.197
50	images/sec: 215.9 +/- 0.1 (jitter = 0.3)	8.277
60	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.162
70	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.159
80	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.139
90	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.196
100	images/sec: 215.7 +/- 0.1 (jitter = 0.4)	8.163
----------------------------------------------------------------
total images/sec: 215.72
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 288.2 +/- 0.0 (jitter = 0.0)	8.209
10	images/sec: 283.8 +/- 1.1 (jitter = 2.7)	8.189
20	images/sec: 284.0 +/- 0.9 (jitter = 4.6)	8.316
30	images/sec: 284.9 +/- 0.7 (jitter = 4.5)	8.195
40	images/sec: 284.5 +/- 0.6 (jitter = 4.0)	8.180
50	images/sec: 284.3 +/- 0.5 (jitter = 3.7)	8.402
60	images/sec: 285.0 +/- 0.5 (jitter = 4.8)	8.271
70	images/sec: 285.4 +/- 0.4 (jitter = 3.7)	8.134
80	images/sec: 285.7 +/- 0.4 (jitter = 2.7)	8.299
90	images/sec: 286.0 +/- 0.4 (jitter = 1.5)	8.349
100	images/sec: 286.2 +/- 0.3 (jitter = 1.4)	8.213
----------------------------------------------------------------
total images/sec: 286.17
----------------------------------------------------------------
@sebpuetz commented Feb 18, 2019

Improvements across the board with TF_ROCM_FUSION_ENABLE=1. The displayed temperature in rocm-smi went above 90°C on all tests; the rocm-smi output didn't include clocks, so I can't tell whether any thermal throttling was happening.
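
(One way to watch temperature and clocks during a run, assuming a rocm-smi build whose default output includes the SCLK/MCLK columns shown later in this thread, is simply to poll it:)

watch -n 1 /opt/rocm/bin/rocm-smi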

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 208.4 +/- 0.0 (jitter = 0.0)	8.217
10	images/sec: 207.6 +/- 0.5 (jitter = 0.5)	8.124
20	images/sec: 207.7 +/- 0.3 (jitter = 0.5)	8.235
30	images/sec: 207.3 +/- 0.4 (jitter = 0.4)	8.268
40	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.357
50	images/sec: 207.2 +/- 0.4 (jitter = 0.4)	8.012
60	images/sec: 207.2 +/- 0.3 (jitter = 0.4)	8.248
70	images/sec: 207.1 +/- 0.3 (jitter = 0.4)	8.305
80	images/sec: 207.0 +/- 0.3 (jitter = 0.5)	8.223
90	images/sec: 205.7 +/- 0.9 (jitter = 0.5)	8.322
100	images/sec: 205.7 +/- 0.8 (jitter = 0.5)	8.268
----------------------------------------------------------------
total images/sec: 205.65
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 273.0 +/- 0.0 (jitter = 0.0)	8.171
10	images/sec: 272.6 +/- 0.9 (jitter = 1.0)	8.223
20	images/sec: 271.5 +/- 1.1 (jitter = 0.9)	8.375
30	images/sec: 272.0 +/- 0.8 (jitter = 0.9)	8.282
40	images/sec: 272.1 +/- 0.6 (jitter = 0.9)	8.122
50	images/sec: 272.1 +/- 0.6 (jitter = 0.8)	8.144
60	images/sec: 272.0 +/- 0.5 (jitter = 0.8)	8.333
70	images/sec: 271.5 +/- 0.5 (jitter = 1.0)	8.357
80	images/sec: 271.2 +/- 0.5 (jitter = 1.3)	8.034
90	images/sec: 271.2 +/- 0.4 (jitter = 1.3)	8.289
100	images/sec: 270.9 +/- 0.4 (jitter = 1.5)	8.361
----------------------------------------------------------------
total images/sec: 270.81
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 227.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.6 +/- 0.5 (jitter = 2.2)	8.289
20	images/sec: 225.5 +/- 0.4 (jitter = 1.9)	8.068
30	images/sec: 225.7 +/- 0.3 (jitter = 1.8)	8.304
40	images/sec: 225.4 +/- 0.5 (jitter = 1.2)	8.183
50	images/sec: 225.5 +/- 0.4 (jitter = 1.0)	8.261
60	images/sec: 225.6 +/- 0.4 (jitter = 1.1)	8.203
70	images/sec: 225.6 +/- 0.3 (jitter = 1.1)	8.165
80	images/sec: 225.6 +/- 0.3 (jitter = 1.0)	8.168
90	images/sec: 225.7 +/- 0.3 (jitter = 1.0)	8.196
100	images/sec: 225.6 +/- 0.2 (jitter = 1.1)	8.138
----------------------------------------------------------------
total images/sec: 225.62
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 302.0 +/- 0.0 (jitter = 0.0)	8.213
10	images/sec: 300.2 +/- 0.5 (jitter = 1.5)	8.181
20	images/sec: 298.7 +/- 0.8 (jitter = 2.5)	8.324
30	images/sec: 297.7 +/- 0.8 (jitter = 2.2)	8.197
40	images/sec: 297.7 +/- 0.6 (jitter = 3.0)	8.173
50	images/sec: 297.9 +/- 0.6 (jitter = 3.0)	8.400
60	images/sec: 297.9 +/- 0.5 (jitter = 3.0)	8.267
70	images/sec: 298.4 +/- 0.5 (jitter = 2.8)	8.140
80	images/sec: 298.6 +/- 0.4 (jitter = 2.7)	8.283
90	images/sec: 298.6 +/- 0.4 (jitter = 2.8)	8.337
100	images/sec: 298.7 +/- 0.4 (jitter = 2.6)	8.208
----------------------------------------------------------------
total images/sec: 298.60
----------------------------------------------------------------
@sunway513 commented Feb 18, 2019

Hi @sebpuetz , thanks for the update!
However, the performance numbers don't seem right.
Can you provide the VBIOS version of your board? The following command would do:
/opt/rocm/bin/rocm-smi -v

@sebpuetz commented Feb 18, 2019

/opt/rocm/bin/rocm-smi -v 
GPU[0] 		: VBIOS version: 113-D3600200-105
@WrightChen commented Feb 19, 2019

Radeon RX Vega 64
memoryClockRate (GHz) 1.63
Total memory: 7.98GiB
Free memory: 7.73GiB
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip

For some models, the TF_ROCM_FUSION_ENABLE=1 option doesn't change much, so I'm not giving all of the FUSION = 1 results. Due to lack of memory, some models can't run at batch_size=128.

                            ResNet50  AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=512              /         1573.01  /             /       /          /
batch_size=256              /         1420.65  /             /       /          /
batch_size=128              /         1345.73  /             /       498.73     /
batch_size=64               190.58    1151.98  103.82        101.95  474.07     /
batch_size=32               171.70    971.85   98.50         91.80   424.32     68.71
batch_size=128; FUSION = 1  /         /        /             /       /          /
batch_size=64; FUSION = 1   208.78    /        109.66        /       /          /
batch_size=32; FUSION = 1   187.76    /        105.20        /       /          75.81
@sunway513 commented Feb 21, 2019

Hi @sebpuetz , could you try to refresh your performance numbers using our official docker image?
If you've not configured Docker, the following script should do:
curl -sSL https://get.docker.com/ | sh

To run the benchmarks inside docker image:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data/imagenet/tf:/imagenet'
drun rocm/tensorflow:rocm2.1-tf1.12-python3
cd ~/benchmarks/scripts/tf_cnn_benchmarks
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Thanks for your attention, and looking forward to your updates :-)

@jimdowling commented Feb 21, 2019

6-core Intel i7 8700 with 16GB RAM and a 400GB SSD.
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step Img/sec total_loss
1 images/sec: 250.0 +/- 0.0 (jitter = 0.0) 8.348
10 images/sec: 248.0 +/- 1.4 (jitter = 0.7) 8.144
20 images/sec: 248.7 +/- 0.8 (jitter = 0.4) 8.440
30 images/sec: 248.8 +/- 0.6 (jitter = 0.4) 8.140
40 images/sec: 248.7 +/- 0.6 (jitter = 0.4) 8.474
50 images/sec: 248.5 +/- 0.5 (jitter = 0.4) 8.322
60 images/sec: 248.5 +/- 0.5 (jitter = 0.5) 8.317
70 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.010
80 images/sec: 248.4 +/- 0.4 (jitter = 0.6) 8.272
90 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.289
100 images/sec: 248.4 +/- 0.3 (jitter = 0.6) 8.108

total images/sec: 248.34

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 265.1 +/- 0.0 (jitter = 0.0) 8.324
10 images/sec: 264.3 +/- 0.5 (jitter = 0.3) 8.168
20 images/sec: 264.5 +/- 0.3 (jitter = 0.2) 8.261
30 images/sec: 264.4 +/- 0.3 (jitter = 0.3) 8.377
40 images/sec: 264.2 +/- 0.2 (jitter = 0.4) 8.408
50 images/sec: 264.1 +/- 0.2 (jitter = 0.5) 8.160
60 images/sec: 263.9 +/- 0.2 (jitter = 0.6) 8.341
70 images/sec: 263.8 +/- 0.2 (jitter = 0.6) 8.107
80 images/sec: 263.8 +/- 0.2 (jitter = 0.8) 8.404
90 images/sec: 263.8 +/- 0.2 (jitter = 0.7) 8.296
100 images/sec: 263.7 +/- 0.2 (jitter = 0.6) 8.348

total images/sec: 263.65

With a batch size of 256, I get out-of-memory errors.
Funnily enough, with a batch size of 155 it works, but it is slower.

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=155 --model=resnet50

Step Img/sec total_loss
1 images/sec: 195.3 +/- 0.0 (jitter = 0.0) 8.394
10 images/sec: 194.6 +/- 0.7 (jitter = 0.6) 8.313
20 images/sec: 194.5 +/- 0.5 (jitter = 0.6) 8.154
30 images/sec: 194.4 +/- 0.3 (jitter = 0.7) 8.249
40 images/sec: 194.5 +/- 0.3 (jitter = 0.8) 8.165
50 images/sec: 194.4 +/- 0.2 (jitter = 1.0) 8.292
60 images/sec: 194.3 +/- 0.2 (jitter = 1.0) 8.340
70 images/sec: 194.3 +/- 0.2 (jitter = 0.9) 8.268
80 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.227
90 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.257
100 images/sec: 194.1 +/- 0.2 (jitter = 0.9) 8.183

total images/sec: 194.04

@jimdowling commented Feb 21, 2019

Leaving out TC_ROCM_FUSION_ENABLE does not make any difference.
/opt/rocm/bin/rocm-smi -v
VBIOS version: 113-D3600200-105

@jimdowling commented Feb 21, 2019

According to this blog, https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/, the 2080Ti gets 280 images/sec and the 1080Ti gets 207 images/sec for FP32 training.

@jimdowling commented Feb 21, 2019

One more:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 377.7 +/- 0.0 (jitter = 0.0) 8.246
10 images/sec: 375.9 +/- 2.2 (jitter = 0.7) 8.261
20 images/sec: 377.9 +/- 1.2 (jitter = 0.9) 8.279
30 images/sec: 378.3 +/- 0.9 (jitter = 0.9) 8.365
40 images/sec: 378.2 +/- 0.7 (jitter = 0.5) 8.237
50 images/sec: 378.3 +/- 0.6 (jitter = 0.4) 8.295
60 images/sec: 378.4 +/- 0.5 (jitter = 0.4) 8.203
70 images/sec: 378.4 +/- 0.5 (jitter = 0.5) 8.129
80 images/sec: 377.9 +/- 0.6 (jitter = 0.6) 8.264
90 images/sec: 378.0 +/- 0.5 (jitter = 0.8) 8.163
100 images/sec: 377.9 +/- 0.5 (jitter = 0.8) 8.239

total images/sec: 377.79

@Sumenia commented Feb 21, 2019

@jimdowling that's some impressive perf!

@sebpuetz commented Feb 21, 2019

@jimdowling these numbers are substantially higher than the ones I got. What OS and kernel are you on?

@sebpuetz commented Feb 21, 2019

Hi,
I executed the benchmarks in the docker container:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 229.7 +/- 0.0 (jitter = 0.0)	8.221
10	images/sec: 225.4 +/- 0.8 (jitter = 2.7)	8.289
20	images/sec: 225.9 +/- 0.5 (jitter = 3.6)	8.054
30	images/sec: 226.6 +/- 0.4 (jitter = 2.1)	8.313
40	images/sec: 226.9 +/- 0.3 (jitter = 0.8)	8.187
50	images/sec: 227.2 +/- 0.3 (jitter = 0.7)	8.240
60	images/sec: 227.3 +/- 0.2 (jitter = 0.5)	8.192
70	images/sec: 227.4 +/- 0.2 (jitter = 0.5)	8.143
80	images/sec: 227.6 +/- 0.2 (jitter = 0.5)	8.150
90	images/sec: 227.6 +/- 0.2 (jitter = 0.5)	8.217
100	images/sec: 227.7 +/- 0.2 (jitter = 0.5)	8.163
----------------------------------------------------------------
total images/sec: 227.66
----------------------------------------------------------------

and

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step	Img/sec	total_loss
1	images/sec: 300.8 +/- 0.0 (jitter = 0.0)	8.205
10	images/sec: 300.3 +/- 0.4 (jitter = 0.2)	8.170
20	images/sec: 300.3 +/- 0.3 (jitter = 0.5)	8.317
30	images/sec: 300.5 +/- 0.2 (jitter = 0.6)	8.201
40	images/sec: 300.6 +/- 0.2 (jitter = 0.5)	8.176
50	images/sec: 300.5 +/- 0.2 (jitter = 0.5)	8.398
60	images/sec: 300.3 +/- 0.2 (jitter = 0.5)	8.268
70	images/sec: 300.3 +/- 0.2 (jitter = 0.6)	8.140
80	images/sec: 300.4 +/- 0.2 (jitter = 0.6)	8.279
90	images/sec: 300.4 +/- 0.2 (jitter = 0.6)	8.328
100	images/sec: 300.3 +/- 0.2 (jitter = 0.6)	8.214
----------------------------------------------------------------
total images/sec: 300.29
----------------------------------------------------------------

@sunway513 these numbers are still pretty far from what @jimdowling got. Do you see a reason for this?

@jimdowling commented Feb 21, 2019

Ubuntu 18.04. Python 2.7. Kernel is 4.15.
I was not running Docker - bare metal.

@sunway513 commented Feb 21, 2019

Hi @jimdowling , thanks for posting! However, it seems there's a typo in your script (TC_ instead of TF_), so TF fusion is not really enabled there. Could you try the following commands again?
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
If fusion is enabled, you should see the following message at run time:
2019-02-21 13:41:32.304325: I tensorflow/core/graph/gpu_fusion_pass.cc:454] ROCm Fusion is enabled.

@sunway513 commented Feb 21, 2019

Hi @sebpuetz , thanks for your updated numbers with docker!
In a parallel issue, you mentioned your system is Linux Mint 19.1; is that the same OS you ran the benchmarks on? May I know the kernel and driver versions of your configuration? The following commands would help:
uname -a
apt list --installed | grep rock-dkms
I believe your user-space components are properly configured, as you got similar perf numbers using our official docker image. The VBIOS version is good as well. We need to look into the kernel and firmware.

@sebpuetz commented Feb 21, 2019

Hi @sunway513 ,
I ran all benchmarks on Linux Mint 19.1

uname -a
Linux seb-desktop 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
apt list --installed | grep rock-dkms
rock-dkms/Ubuntu 16.04,now 2.1-96 all [installed]

Linux Mint 19.1 is based on Ubuntu 18.04, so this looks like a mismatch here?

@ghostplant commented Feb 21, 2019

@sunway513

I am also using an RX Vega 64, but I get the following warning:

2019-02-21 14:26:23.732074: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:27.702436: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:29.084753: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:33.818470: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:33.839322: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter

And the performance is ~10% lower than others' benchmarks:

Step    Img/sec total_loss
1       images/sec: 182.8 +/- 0.0 (jitter = 0.0)        8.217
10      images/sec: 187.2 +/- 0.9 (jitter = 0.7)        8.122
20      images/sec: 187.3 +/- 0.5 (jitter = 0.7)        8.229
30      images/sec: 187.1 +/- 0.4 (jitter = 0.9)        8.264
40      images/sec: 187.0 +/- 0.4 (jitter = 0.9)        8.347
50      images/sec: 187.0 +/- 0.3 (jitter = 1.1)        8.014
60      images/sec: 187.0 +/- 0.3 (jitter = 1.0)        8.264
70      images/sec: 186.8 +/- 0.3 (jitter = 1.1)        8.316
80      images/sec: 186.7 +/- 0.3 (jitter = 1.1)        8.231
90      images/sec: 186.7 +/- 0.2 (jitter = 1.2)        8.305

But it should be about 207 images/sec.
Is it affected by the warning above, and how can I fix the performance?

@mwrnd commented Oct 23, 2019

Video Card: MSI Radeon RX 580 8GB OC (rocm-smi -v Cannot get VBIOS version)
Motherboard: MSI X570-A Pro with 32GB DDR4-2133 BIOS H.40
Processor: AMD Ryzen 5 3600X
OS: Ubuntu 18.04.0 no apt upgrade or apt dist-upgrade
Kernel: 4.15.0-20-generic
rocm-dkms: 1.9.3 installed through apt
tensorflow-rocm: 1.12.0 installed through pip
tensorflow benchmarks: 091ef1e4d8832e14d1f874e66bff78a2522d0947
tensorflow_models: 1.12.0

Benchmark dump and recreation of @kazulittlefox's results. My ROCm 2.8.13 results were significantly lower (~65%) than kazulittlefox's 1.9.2 results so I was concerned I may have a hardware issue. Always compare apples to apples. My 1.9.3 results are consistent with kazulittlefox's.

            |------batch size = 32------|    |------batch size = 64------|
model        @kaz   1.9.3  2.8.13  Perf%      @kaz   1.9.3  2.8.13  Perf%
alexnet      397    401    318     79%        518    511    396     78%
googlenet    239    247    155     63%        256    261    165     63%
inception3   47.8   48.9   26.6    54%        50.7   51.6   27.5    53%
resnet50     86.8   92.0   57.9    63%        98.6   100    61.0    61%

Between ROCm_1.9.3/TF1.12 and ROCm_2.8.13/TF1.14 performance gains were moved around and made more consistent at the expense of raw throughput. FP16 performance has improved. 2.8.13 is also more stable as I did not encounter a crash. ROCm 1.9.3 froze my computer twice and some benchmark attempts stalled indefinitely.

model   batchsize=   16      32      032F    032XRF  64      064XR   128
...
ROCm1.9.3/alexnet    256     401     262     263     511     529     627
ROCm2.8.13/alexnet   223     318     325     327     396     401     446
...
ROCm1.9.3/AVG Gain   0       0       -40.1%  -39.3%  0       +27.9%  0
ROCm1.9.3/MED Gain   0       0       -44.5%  -42.5%  0       +2.5%   0
...
ROCm2.8.13/AVG Gain  0       0       +4.9%   +6.2%   0       +1.5%   0
ROCm2.8.13/MED Gain  0       0       +6.4%   +7.3%   0       +0.7%   0

ROCm was installed without apt upgrade or apt dist-upgrade and used the version-specific ROCm repo:

sudo echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/1.9.3/ xenial main'\
  | sudo tee /etc/apt/sources.list.d/rocm.list
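
(The usual follow-up for that install flow, assuming the rocm-dkms package name already used above, would be something like:)

sudo apt update
sudo apt install rocm-dkms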

Command-line permutations were generated with cmds.py and log output processed with parse.py. See here for more benchmark results.

imagenet dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=imagenet

XR means XLA and ROCm Fusion were enabled
  export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
  export TF_ROCM_FUSION_ENABLE=1
F means --use_fp16 option was used
na means the batch size was too large or benchmark would not run

model batchsize=16      32      032F    032XRF  64      064XR   128     256
trivial         3869    7012    1038    1038    11879   12030   18737   26770
alexnet         256     401     262     263     511     529     627     710
googlenet       209     247     136     142     261     261     271     248
inception3      47      48.9    23.8    24.2    51.6    53.3    na      na
inception4      21.1    23      11.2    11.4    11.4    na      na      na
lenet5          3670    6238    5354    5316    10041   10273   15455   20690
mobilenet       328     434     264     482     304     1263    485     513
nasnet          8       na      8.8     7.5     8.8     na      na      na
overfeat        92.7    145     75.1    75.7    201     201     229     275
resnet101       na      54.5    30.3    31.9    32      na      na      na
resnet101_v2    47.6    55.2    31.4    32.3    33      na      na      na
resnet152       33.5    38.2    21.2    21.7    22.3    na      na      na
resnet152_v2    33.8    38.5    21.3    21.9    22.4    na      na      na
resnet50        78.6    92.0    91.9    59.4    100     112     60.7    na
resnet50_v1.5   62.3    75.7    51.5    53.7    na      86.1    56.6    na
resnet50_v2     79.6    93.6    58.2    60.3    106     114     63.6    na
vgg11           64.7    82.7    39.3    39.2    92.9    94.5    95.5    41.7
vgg16           35.5    39.9    20.2    20.2    44.6    45.7    21      na
vgg19           31.2    35.6    16.1    16.1    37.8    37.7    16.7    na
Average Gain    0       0       -40.1%  -39.3%  0       +27.9%  0       0
Median Gain     0       0       -44.5%  -42.5%  0       +2.5%   0       0

* Average and Median gains use 0 as baselines; 32 is baseline for 32F, 32XRF
cifar10 dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=cifar10

model batchsize=16      32       032F    032XRF  64      064XR   128     256
trivial         7965    15422    8041    8392    25992   22976   46789   71242
alexnet         3021    4782     248     250     6704    6707    8322    9838
nasnet          na      47.9     49.8    19.7    53.7    na      55.3    na
resnet110       462     671      407     434     852     944     852     912
resnet110_v2    465     674      409     na      848     na      847     913
resnet20        2075    2988     1993    2093    3799    4073    3947    4313
resnet20_v2     2035    2902     1990    na      3674    na      3810    4197
resnet32        1437    2062     1320    1385    2616    2837    2664    2888
resnet32_v2     1442    2048     1310    na      2566    na      2615    2839
resnet44        1090    1571     979     1031    1988    2167    2005    2170
resnet44_v2     1094    1558     978     na      1960    na      1981    2141
resnet56        872     1272     785     826     1604    1745    1612    1739
resnet56_v2     881     1253     789     na      1585    na      1598    1719
Average Gain    0       0        -38.8   -45.8%  0       +4.7%   0       0
Median Gain     0       0        -37.2%  -35.2%  0       +8.5%   0       0

* Average and Median gains use 0 as baselines; 32 is baseline for 32F, 32XRF

NASNET-Large worked with a batch size of 8:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --compute_lr_on_cpu \
--batch_size=8  --num_batches=40 --model=nasnetlarge --data_name=imagenet
  [...]
  total images/sec: 0.70

DeepSpeech failed with Unable to find suitable algorithm for ... convolution

CPU (Ryzen 5 3600X) total images/sec:

python tf_cnn_benchmarks.py --device=CPU --batch_size={32,64,128}
--num_batches=40 --model={model} --data_name={dataset} {--use_fp16}

F means --use_fp16 option was used

model/dataset  batchsize= 32     32F    64     64F    128    128F
trivial/imagenet          1227   52.6   1387   52.6   1789   53.0
trivial/cifar10           22224  1050   22525  1914   37755  2504
mobilenet/imagenet        132    7.0    140    7.1    144    7.1
rocm_bandwidth_test
    RocmBandwidthTest Version: 1.0.0

    Device: 0,  AMD Ryzen 5 3600X 6-Core Processor
    Device: 1,  Ellesmere [Radeon RX 470/480]

    Unidirectional peak bandwidth GB/s
    D/D       0           1
    0         N/A         11.240193
    1         7.645693    43.802111

    Bdirectional peak bandwidth GB/s
    D/D       0           1
    0         N/A         14.460515
    1         14.542496   N/A
python all_reduce_benchmark.py --variable_update=replicated
  Average time per step: 0.000213811397552
dkms status | grep amd
  amdgpu, 1.9-320, 4.15.0-20-generic, x86_64: installed
dmesg | grep kfd
  [    3.179217] kfd kfd: Allocated 3969056 bytes on gart
  [    3.179655] kfd kfd: added device 1002:67df
rocm-smi
  ====================    ROCm System Management Interface    ================
  ============================================================================
   GPU  Temp    AvgPwr   SCLK     MCLK     Fan      Perf    SCLK OD    MCLK OD
    0   69c     135.227W 1366Mhz  2000Mhz  33.73%   auto      0%         0%
  ============================================================================
  ====================           End of ROCm SMI Log          ================
@nikAizuddin commented Nov 23, 2019

Result for running benchmark in a GPU-passthrough QEMU/KVM virtual machine.

  • GPU: MSI RX Vega 56 Air Boost
  • Host OS: openSUSE Tumbleweed 20191007 (kernel 5.0.21-1)
  • Guest OS: Ubuntu 19.04 (kernel 5.0.0-13)
    • rocm-dkms version 2.9.6 (installed via apt)
    • tensorflow-rocm version 1.14.3 (installed via pip)

Table 1: FP32 without TF_ROCM_FUSION_ENABLE

batch_size  ResNet50  Inception v3  VGG16  GoogLeNet  ResNet152
128         x         x             x      440.08     x
64          154.63    99.34         93.62  411.59     x
32          144.38    93.64         91.78  375.67     60.08
** x denotes a benchmark that failed due to running out of memory.

Table 2: FP32 with TF_ROCM_FUSION_ENABLE

batch_size  ResNet50  Inception v3  VGG16  GoogLeNet  ResNet152
128         x         x             x      439.86     x
64          167.91    105.28        93.78  408.91     x
32          157.98    99.42         91.80  374.53     65.39

Table 3: FP16 without TF_ROCM_FUSION_ENABLE

batch_size  ResNet50  Inception v3  VGG16  GoogLeNet  ResNet152
128         159.80    60.10         55.03  372.76     x
64          153.81    58.61         54.53  342.67     58.91
32          139.40    56.85         51.68  297.95     53.42

Table 4: FP16 with TF_ROCM_FUSION_ENABLE

batch_size  ResNet50  Inception v3  VGG16  GoogLeNet  ResNet152
128         168.11    61.63         55.58  377.90     x
64          161.62    60.49         55.40  343.88     60.31
32          147.85    58.23         52.43  303.36     56.48
@extraymond commented Feb 20, 2020

Results from running the benchmark in an LXC container; host Ubuntu 18.04, guest 18.04.

GPU: Vega 56
ROCm: 3.0 with rocm-dev (without dkms, kernel 5.3)
tensorflow-rocm: 2.0.1

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step    Img/sec total_loss
1       images/sec: 130.3 +/- 0.0 (jitter = 0.0)        7.870
10      images/sec: 96.5 +/- 10.2 (jitter = 5.1)        7.957
20      images/sec: 103.9 +/- 6.3 (jitter = 10.1)       7.947
30      images/sec: 104.4 +/- 5.4 (jitter = 9.0)        7.934
40      images/sec: 108.8 +/- 4.2 (jitter = 9.0)        7.959
50      images/sec: 112.5 +/- 3.4 (jitter = 6.9)        7.703
60      images/sec: 115.3 +/- 2.9 (jitter = 5.4)        7.916
70      images/sec: 116.5 +/- 2.5 (jitter = 5.9)        7.836
80      images/sec: 117.3 +/- 2.3 (jitter = 7.4)        7.968
90      images/sec: 117.9 +/- 2.0 (jitter = 7.8)        7.789
100     images/sec: 117.6 +/- 1.9 (jitter = 11.0)       7.776
----------------------------------------------------------------
total images/sec: 117.61
----------------------------------------------------------------

@ant1s commented Mar 4, 2020

GPU: radeon vii
OS: ubuntu 18.04

ROCM: 3.0
tensorflow: 2.0

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Done warm up
Step	Img/sec	total_loss
1	images/sec: 222.7 +/- 0.0 (jitter = 0.0)	8.220
10	images/sec: 220.5 +/- 0.5 (jitter = 0.7)	7.880
20	images/sec: 220.0 +/- 0.4 (jitter = 1.3)	7.910
30	images/sec: 220.1 +/- 0.4 (jitter = 1.7)	7.821
40	images/sec: 219.9 +/- 0.3 (jitter = 2.0)	8.004
50	images/sec: 220.0 +/- 0.3 (jitter = 2.1)	7.769
60	images/sec: 219.8 +/- 0.2 (jitter = 2.2)	8.115
70	images/sec: 220.0 +/- 0.2 (jitter = 2.1)	7.816
80	images/sec: 220.0 +/- 0.2 (jitter = 2.1)	7.979
90	images/sec: 220.1 +/- 0.2 (jitter = 2.1)	8.098
100	images/sec: 220.1 +/- 0.2 (jitter = 2.2)	8.029
----------------------------------------------------------------
total images/sec: 219.98
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Done warm up
Step	Img/sec	total_loss
1	images/sec: 296.0 +/- 0.0 (jitter = 0.0)	8.105
10	images/sec: 299.9 +/- 1.2 (jitter = 2.2)	7.751
20	images/sec: 300.3 +/- 0.8 (jitter = 2.0)	7.913
30	images/sec: 299.8 +/- 0.6 (jitter = 3.1)	7.769
40	images/sec: 299.8 +/- 0.5 (jitter = 2.0)	7.918
50	images/sec: 299.6 +/- 0.5 (jitter = 2.4)	7.880
60	images/sec: 300.0 +/- 0.4 (jitter = 2.6)	7.718
70	images/sec: 300.2 +/- 0.4 (jitter = 2.5)	8.010
80	images/sec: 300.0 +/- 0.4 (jitter = 3.0)	7.772
90	images/sec: 299.9 +/- 0.4 (jitter = 3.2)	7.806
100	images/sec: 299.8 +/- 0.4 (jitter = 3.2)	8.043
----------------------------------------------------------------
total images/sec: 299.64

ROCM: 3.1
tensorflow: 2.1

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Done warm up
Step	Img/sec	total_loss
1	images/sec: 288.1 +/- 0.0 (jitter = 0.0)	8.220
10	images/sec: 282.7 +/- 1.3 (jitter = 5.0)	7.880
20	images/sec: 281.3 +/- 2.1 (jitter = 3.0)	7.910
30	images/sec: 280.4 +/- 1.8 (jitter = 2.9)	7.820
40	images/sec: 281.7 +/- 1.4 (jitter = 2.4)	8.003
50	images/sec: 282.1 +/- 1.2 (jitter = 2.5)	7.768
60	images/sec: 282.4 +/- 1.0 (jitter = 2.0)	8.113
70	images/sec: 282.7 +/- 0.8 (jitter = 1.8)	7.818
80	images/sec: 283.1 +/- 0.7 (jitter = 1.7)	7.978
90	images/sec: 283.3 +/- 0.7 (jitter = 1.6)	8.100
100	images/sec: 283.6 +/- 0.6 (jitter = 1.6)	8.035
----------------------------------------------------------------
total images/sec: 283.44
----------------------------------------------------------------
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Done warm up
Step	Img/sec	total_loss
1	images/sec: 396.7 +/- 0.0 (jitter = 0.0)	8.107
10	images/sec: 397.1 +/- 0.8 (jitter = 2.5)	7.753
20	images/sec: 397.8 +/- 0.5 (jitter = 1.6)	7.907
30	images/sec: 398.0 +/- 0.4 (jitter = 2.0)	7.773
40	images/sec: 397.2 +/- 0.4 (jitter = 2.7)	7.926
50	images/sec: 397.7 +/- 0.4 (jitter = 2.9)	7.880
60	images/sec: 397.8 +/- 0.3 (jitter = 2.8)	7.704
70	images/sec: 397.1 +/- 0.4 (jitter = 3.0)	8.002
80	images/sec: 397.1 +/- 0.4 (jitter = 3.2)	7.783
90	images/sec: 397.4 +/- 0.3 (jitter = 3.0)	7.795
100	images/sec: 397.4 +/- 0.3 (jitter = 3.1)	8.041
----------------------------------------------------------------
total images/sec: 397.21
----------------------------------------------------------------

a huge boost!!!

@qixiang109 commented Mar 25, 2020

Radeon VII, with rocm 3.1 and tensorflow-rocm 1.15:

python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128
Step	Img/sec	total_loss
1	images/sec: 295.4 +/- 0.0 (jitter = 0.0)	8.326
10	images/sec: 299.1 +/- 1.7 (jitter = 7.6)	8.174
20	images/sec: 300.1 +/- 1.1 (jitter = 5.4)	8.261
30	images/sec: 301.2 +/- 0.9 (jitter = 4.8)	8.354
40	images/sec: 301.6 +/- 0.7 (jitter = 3.8)	8.399
50	images/sec: 301.4 +/- 0.7 (jitter = 4.0)	8.140
60	images/sec: 301.7 +/- 0.6 (jitter = 3.8)	8.363
70	images/sec: 301.6 +/- 0.5 (jitter = 3.6)	8.136
80	images/sec: 301.2 +/- 0.5 (jitter = 3.1)	8.418
90	images/sec: 301.4 +/- 0.4 (jitter = 2.5)	8.279
100	images/sec: 301.5 +/- 0.4 (jitter = 2.5)	8.344
----------------------------------------------------------------
total images/sec: 301.39
----------------------------------------------------------------

Happy with the performance, but I have another problem: each time I change the model (e.g., from resnet50 to vgg16) or even the batch size from 64 to 128, it waits a very long time after libMIOpen.so is loaded and before the 1st batch begins to run, with warnings like the following being printed repeatedly:

2020-03-26 00:16:12.700464: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] 
Running warm up
Successfully opened dynamic library libMIOpen.so
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering

Usually it takes 3~5 minutes for the warm-up to finish and the 1st batch to begin running.

Done warm up
Step	Img/sec	total_loss
1	images/sec: 249.9 +/- 0.0 (jitter = 0.0)	8.345

It seems MIOpen is searching for the best conv2d implementation during that time and, once it is found, stores it in ~/.cache/miopen for future use. However, the cache can only be reused when the model structure (and even the batch size) is exactly the same, which makes it rather useless.

My question is: since this search step wastes so much time, can I skip it? Is it a must for MIOpen/ROCm? If it is, that will be a huge disappointment to me...
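
For reference, a quick way to inspect or reset that cache (assuming the default path mentioned above):

ls ~/.cache/miopen
# rm -rf ~/.cache/miopen   # optional: forces MIOpen to redo the search on the next run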

@huanzhang12 commented Mar 25, 2020

It seems MIOpen is searching for the best conv2d implementation during that time and, once it is found, stores it in ~/.cache/miopen for future use. However, the cache can only be reused when the model structure (and even the batch size) is exactly the same, which makes it rather useless.

The majority of time seems to be spent on clang invocations. See ROCmSoftwarePlatform/MIOpen#130

The optimal kernel depends on batch size, so when batch size is changed, different kernels need to be compiled.

@qixiang109 commented Mar 26, 2020

It seems MIOpen is searching for the best conv2d implementation during that time and, once it is found, stores it in ~/.cache/miopen for future use. However, the cache can only be reused when the model structure (and even the batch size) is exactly the same, which makes it rather useless.

The majority of time seems to be spent on clang invocations. See ROCmSoftwarePlatform/MIOpen#130

The optimal kernel depends on batch size, so when batch size is changed, different kernels need to be compiled.

Thanks for the reply and link. I am not familiar with this "optimal kernel compiling" step before training; TensorFlow compiled against CUDA/cuDNN does not seem to have such a step, it just starts running immediately, so why does TensorFlow on ROCm/MIOpen? The "optimal kernel compiling" time is far too long and too frequent to use tf-rocm for daily ML work... Can you provide a solution to that?

@sunway513 commented Mar 26, 2020

Hi @qixiang109 , this behavior is well understood, and we're working on improving the user experience with MIOpen and the compilation toolchain. We'll keep you posted when improvements are available in future ROCm releases.

@huanzhang12 commented Apr 2, 2020

Benchmark dump and recreation of @kazulittlefox's results. My ROCm 2.8.13 results were significantly lower (~65%) than kazulittlefox's 1.9.2 results so I was concerned I may have a hardware issue. Always compare apples to apples. My 1.9.3 results are consistent with kazulittlefox's.

@mwrnd The performance regression on gfx803 has been fixed in ROCm v3.3. The issue was that assembly kernels were all disabled on gfx803 (see ROCmSoftwarePlatform/MIOpen#134).
On my RX570, resnet fp32 performance restored from 50 images/sec (ROCm v3.1) to 95 images/sec (ROCm v3.3).
I have a script for patching miopen.db for gfx803 targets with 32 CUs (duplicating performance db from 36 CU devices). This improves performance by about 20 images/sec.

@mwrnd commented Apr 12, 2020

GPU: MSI Radeon RX 580 Armor 8GB OC
GPU BIOS: 015.050.002001 2017/11/13 21:41 according to Win10 Adrenalin 20.2.2
OS: Ubuntu 18.04.4
Kernel: 5.3.0-45-generic
rocm-dkms: 3.3.19 installed through apt
Python: 3.6.9
tensorflow-rocm: 2.1.1 installed through pip
tensorflow benchmarks: cnn_tf_v2.1_compatible
tensorflow_models: 2.1.0

Benchmark dump. Command-line permutations were generated with cmds.py and log output processed with parse.py.

Comparing ROCm 3.3.19 resnet50 performance to previous versions, 3.3.19 has improved throughput and stability. It did not crash even once for me. However, I ran into the ROCmSoftwarePlatform/MIOpen#130 issue. MIOpen pre-computations take longer than most of these benchmarks. I would not mind giving up drive space for a MIOpen database/cache but prefer the raw throughput for faster training runs on large models/datasets.

             batchsize=16     32     032F   032XRF   64     064XR   128
ROCm1.9.3/TF1.12.0     78.6   92.0   91.9   59.4     100    112     60.7
ROCm2.8.13/TF1.14.2    51.4   57.9   65.8   67.7     61.0   70.0    64.1
ROCm3.3.19/TF2.1.1     77.6   92.6   65.3   65.7     106    105     71.9

imagenet dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=imagenet

XR means XLA and ROCm Fusion were enabled
  export TF_XLA_FLAGS=--tf_xla_cpu_global_jit
  export TF_ROCM_FUSION_ENABLE=1
F means --use_fp16 option was used
na means the batch size was too large or benchmark would not run

model/batchsize=16      32      032F    032XRF  64      064XR   128     256
trivial         4016    7942    1129    1126    13648   13895   21851   30133
alexnet         317     491     318     319     669     672     764     861
googlenet       207     241     155     162     277     279     288     290
inception3      49.8    56.7    37.4    37.5    58.4    58.6    34.7    na
inception4      22.6    25.4    17.6    18.2    17.6    na      na      na
lenet5          4541    7625    7536    7617    12178   12106   17257   22254
official_ncf    1373    2694    2767    2848    5440    5490    10812   21140
overfeat        95.7    145     81.6    82.1    198     na      233     250
resnet101       44.7    55.5    35.8    36.1    37.1    na      na      na
resnet101_v2    47.8    56.2    35.9    36.2    63.3    63.3    na      na
resnet152       33.5    38.9    24.2    24.5    25.2    na      na      na
resnet152_v2    33.9    39.4    24.5    24.7    25.4    na      na      na
resnet50        77.6    92.6    65.3    65.7    106     105     71.9    na
resnet50_v1.5   70.0    83.8    61.0    61.4    94.9    94.7    66.6    na
resnet50_v2     78.9    94.2    65.7    66.5    108     108     72.5    na
vgg11           70.4    87.7    44.4    44.6    100     100     103     47.2
vgg16           38.9    48.4    21.8    22.0    50.1    50.6    22.6    na
vgg19           33.3    39.4    17.5    17.6    41.4    41.4    18.1    na

cifar10 dataset total images/sec:

python tf_cnn_benchmarks.py --device=GPU --num_gpus=1 --num_batches=40 \
--batch_size={16,32,64,128,256} --model={model} --data_name=cifar10

model/batchsize=16      32      032F    032XRF  64      064XR   128     256
trivial         8651    15968   11978   11708   27686   29923   44124   89755
alexnet         3485    5403    472     480     7210    7159    513     10455
resnet110       na      na      725     727     na      na      1023    na
resnet110_v2    503     729     495     495     902     902     840     1032
resnet20        2372    3421    2483    2490    4364    4353    4246    5217
resnet20_v2     2330    3386    2483    2448    4242    4242    4178    5068
resnet32        1584    2301    1613    1618    2891    2876    2751    3399
resnet32_v2     1579    2268    1609    1614    2841    2836    2732    3335
resnet44        1180    1723    1197    1193    2153    2154    2033    2517
resnet44_v2     1172    1717    1195    1195    2134    2134    2028    2480
resnet56        944     1379    946     945     1720    1723    1616    2004
resnet56_v2     944     1375    952     952     1715    1711    1614    1981

DeepSpeech worked with a batch size of 16:

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=16 --num_batches=40 \
--model=deepspeech2 --data_name=librispeech
  [...]
  total images/sec: 0.56

CPU (Ryzen 5 3600X) total images/sec:

python3 tf_cnn_benchmarks.py --device=CPU  {--use_fp16} --num_batches=40 \
--batch_size={32,64,128} --model={model} --data_name=imagenet

F means --use_fp16 option was used

model/dataset batchsize=32      32F     64      64F     128     128F
trivial/cifar10         35401   2701    51733   2942    64842   3134
trivial/imagenet        2249    65.9    2821    66.4    4489    67.0
ncf/imagenet            347     326     701     558     1407    863
rocm-bandwidth-test
    RocmBandwidthTest Version: 2.3.11
    Device: 0,  AMD Ryzen 5 3600X 6-Core Processor
    Device: 1,  Ellesmere [Radeon RX 470/480/570/570X/580/580X],  2d:0.0

    Unidirectional copy peak bandwidth GB/s
    D/D       0           1
    0         N/A         11.325769
    1         11.244692   24.659122

    Bdirectional copy peak bandwidth GB/s
    D/D       0           1
    0         N/A         14.674771
    1         14.674771   N/A
python3 all_reduce_benchmark.py --variable_update=replicated
  Average time per step: 0.00011957406997680663
dkms status | grep amd
  amdgpu, 3.3-19, 5.3.0-45-generic, x86_64: installed
rocm-smi
  ========================ROCm System Management Interface==================
  ==========================================================================
  GPU  Temp   AvgPwr   SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%
  0    31.0c  43.124W  1366Mhz  2000Mhz  26.67%  high  135.0W   98%   100%
  ==========================================================================
  ==============================End of ROCm SMI Log ========================
@ashaver commented Apr 19, 2020

Anyone else still fighting the AMD/ROCm drivers on a laptop? Even with the latest (Rev 20.10) and/or the latest ROCm, I have the following persistent bugs related to https://bugzilla.kernel.org/show_bug.cgi?id=203035:

  • First, this is not so much the fault of AMD as the fault of ACPI not detecting AC power in a laptop (in combination with AMD starting to drive power levels from real values e.g., torvalds/linux@600ae89).
  • I would love to fix the root problem, but have not had any success.
  • After rebooting, the laptop CPU thinks it is on battery, so it throttles each core to about 550 MHz (instead of the base 1500 MHz). This hamstrings basically everything. It doesn't matter that I have 8 cores and 16 threads; each runs at 386-era clock speeds. The workaround for the CPU is to unplug the power and plug it back in.
  • Using amdgpu-utils (https://github.com/Ricks-Lab/amdgpu-utils/) seems to allow setting higher clock frequencies. In contrast, I cannot do anything with rocm-smi (the changes don't seem to stick).
  • Stock laptop Acer Predator Helios 500 PH517-61-R0GX Gaming Laptop, AMD Ryzen 7 2700 Desktop Processor, AMD Radeon RX Vega 56

Specs and results:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Step	Img/sec	total_loss
1	images/sec: 131.4 +/- 0.0 (jitter = 0.0)	8.458
10	images/sec: 130.0 +/- 0.9 (jitter = 2.9)	7.997
20	images/sec: 129.1 +/- 0.6 (jitter = 2.2)	8.260
30	images/sec: 128.6 +/- 0.5 (jitter = 2.0)	8.338
40	images/sec: 128.4 +/- 0.4 (jitter = 2.3)	8.190
50	images/sec: 128.0 +/- 0.4 (jitter = 2.7)	7.742
60	images/sec: 128.2 +/- 0.4 (jitter = 2.4)	8.061
70	images/sec: 128.3 +/- 0.3 (jitter = 2.4)	inf
80	images/sec: 128.3 +/- 0.3 (jitter = 2.5)	inf
90	images/sec: 128.2 +/- 0.3 (jitter = 2.5)	inf
100	images/sec: 128.2 +/- 0.3 (jitter = 2.5)	inf
----------------------------------------------------------------
total images/sec: 128.13
----------------------------------------------------------------
@sunway513 commented Jun 5, 2020

@qixiang109 , MIOpen released pre-compiled kernel packages in the ROCm 3.5 release, aiming to reduce the startup overhead. For more details, you can refer to the following document:
https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package

@papadako commented Jun 6, 2020

I guess the following numbers are a bit problematic. Any ideas? Could it be the kernel?

GPU: Radeon VII
Kernel: 5.7.0
rocm-dkms: from kernel
Python: 3.8.2
rocm: 3.5
tensorflow-rocm: 2.2 compiled from source
tensorflow benchmarks: master

python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128

Step    Img/sec total_loss
1       images/sec: 95.9 +/- 0.0 (jitter = 0.0) 7.781
10      images/sec: 95.9 +/- 0.0 (jitter = 0.1) 7.740
20      images/sec: 95.9 +/- 0.0 (jitter = 0.1) 7.827
30      images/sec: 95.8 +/- 0.0 (jitter = 0.1) 7.965
40      images/sec: 95.8 +/- 0.0 (jitter = 0.1) 7.881
50      images/sec: 95.7 +/- 0.0 (jitter = 0.2) 7.795
60      images/sec: 95.7 +/- 0.0 (jitter = 0.1) 8.005
70      images/sec: 95.7 +/- 0.0 (jitter = 0.2) 7.863
80      images/sec: 95.7 +/- 0.0 (jitter = 0.2) 7.922
90      images/sec: 95.7 +/- 0.0 (jitter = 0.1) 7.740
100     images/sec: 95.7 +/- 0.0 (jitter = 0.1) 7.998
----------------------------------------------------------------
total images/sec: 95.66
----------------------------------------------------------------
@huanzhang12 commented Jun 7, 2020

@papadako Can you try to set MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 and/or MIOPEN_DEBUG_CONV_GEMM=0 and see if it can improve performance?

MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 MIOPEN_DEBUG_CONV_GEMM=0 python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128
@papadako commented Jun 8, 2020

@papadako Can you try to set MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 and/or MIOPEN_DEBUG_CONV_GEMM=0 and see if it can improve performance?

MIOPEN_DEBUG_CONV_IMPLICIT_GEMM=0 MIOPEN_DEBUG_CONV_GEMM=0 python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128

I get even worse results with the above settings

Step    Img/sec total_loss
1       images/sec: 75.8 +/- 0.0 (jitter = 0.0) 7.781
10      images/sec: 75.6 +/- 0.0 (jitter = 0.1) 7.740
20      images/sec: 75.6 +/- 0.0 (jitter = 0.1) 7.826
30      images/sec: 75.5 +/- 0.0 (jitter = 0.1) 7.964
40      images/sec: 75.5 +/- 0.0 (jitter = 0.1) 7.880
50      images/sec: 75.5 +/- 0.0 (jitter = 0.1) 7.793
60      images/sec: 75.4 +/- 0.0 (jitter = 0.1) 8.007
70      images/sec: 75.4 +/- 0.0 (jitter = 0.1) 7.865
80      images/sec: 75.3 +/- 0.0 (jitter = 0.1) 7.928
90      images/sec: 75.2 +/- 0.0 (jitter = 0.2) 7.741
100     images/sec: 75.1 +/- 0.1 (jitter = 0.2) 7.998

I will try to use a rocm-dkms supported kernel (i.e., 5.4.0) and report back.

@witeko commented Jun 8, 2020

@papadako , @huanzhang12 , I have the same (or a similar) performance issue. I use a Vega 7nm, RHEL 8.2, dkms drivers, ROCm 3.5, tensorflow 2.2.0 (it works fine on 2.1.0).

@logan-dunbar commented Jun 21, 2020

Running inside a Singularity container (v3.5.2) on host Ubuntu 18.04.

GPU: Asus Radeon RX Vega 56 ROG Strix OC 8GB
Kernel: 5.4.0-37
Driver: amdgpu-pro 20.20 (Ubuntu would freeze sporadically with rock-dkms)
Python: 3.7.7 (deadsnakes)
rocm: 3.5.1 (apt)
tensorflow-rocm: 2.2 (PyPI)
tensorflow benchmarks: master (449e900)

python3.7 tf_cnn_benchmarks.py --model=resnet50 --batch_size=64

Step	Img/sec	total_loss
1	images/sec: 132.0 +/- 0.0 (jitter = 0.0)	7.608
10	images/sec: 131.7 +/- 0.4 (jitter = 0.7)	7.849
20	images/sec: 131.4 +/- 0.3 (jitter = 0.8)	8.013
30	images/sec: 131.5 +/- 0.2 (jitter = 0.8)	7.940
40	images/sec: 131.4 +/- 0.2 (jitter = 0.8)	8.136
50	images/sec: 131.2 +/- 0.2 (jitter = 1.1)	8.052
60	images/sec: 131.2 +/- 0.1 (jitter = 1.0)	7.782
70	images/sec: 131.1 +/- 0.1 (jitter = 1.1)	7.853
80	images/sec: 131.2 +/- 0.1 (jitter = 1.1)	8.012
90	images/sec: 131.1 +/- 0.1 (jitter = 1.1)	7.843
100	images/sec: 131.0 +/- 0.1 (jitter = 1.3)	8.088
----------------------------------------------------------------
total images/sec: 130.97
----------------------------------------------------------------

@webber26232 webber26232 commented Jul 5, 2020

Radeon VII
rocm==3.5 installed through apt
tensorflow==2.2 installed through pip

python3.7 tf_cnn_benchmarks.py --model=resnet50 --batch_size=128

Step	Img/sec	total_loss
1	images/sec: 183.8 +/- 0.0 (jitter = 0.0)	7.781
10	images/sec: 183.7 +/- 0.1 (jitter = 0.3)	7.740
20	images/sec: 183.5 +/- 0.1 (jitter = 0.3)	7.827
30	images/sec: 183.4 +/- 0.1 (jitter = 0.2)	7.964
40	images/sec: 183.3 +/- 0.1 (jitter = 0.4)	7.882
50	images/sec: 183.3 +/- 0.1 (jitter = 0.3)	7.791
60	images/sec: 183.2 +/- 0.1 (jitter = 0.4)	8.016
70	images/sec: 183.2 +/- 0.1 (jitter = 0.4)	7.870
80	images/sec: 183.1 +/- 0.1 (jitter = 0.4)	7.933
90	images/sec: 183.1 +/- 0.1 (jitter = 0.4)	7.739
100	images/sec: 183.1 +/- 0.0 (jitter = 0.4)	8.008
----------------------------------------------------------------
total images/sec: 183.10

This doesn't look as good as the other Radeon VII results posted here. I'm also seeing overhead similar to what qixiang109 mentioned in their post.

@nickdon2007 nickdon2007 commented Jul 20, 2020

I have a similar issue with lower-than-expected performance. The memory bandwidth is also slow, and I don't know why.

CPU: AMD Ryzen 7 3700X
GPU: AMD Radeon RX Vega 56
OS: Ubuntu 18.04
Python: 3.6
rocm: 3 (apt)
tensorflow-rocm: 2.2 (PyPI)

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50

Done warm up
Step    Img/sec total_loss
1   images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10  images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20  images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30  images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40  images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50  images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60  images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70  images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80  images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90  images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------

clinfo

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3137.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 


  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               1
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Vega 10 XT [Radeon RX Vega 64]
  Device Topology:               PCI[ B#47, D#0, F#0 ]
  Max compute units:                 56
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1590Mhz
  Address bits:                  64
  Max memory allocation:             7287183769
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                2048
  Max image 3D height:               2048
  Max image 3D depth:                2048
  Max samplers within kernel:            26751
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                8573157376
  Constant buffer size:              7287183769
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              2992216473
  Max global variable size:          7287183769
  Max global variable preferred total size:  8573157376
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7fe56aa5fcf0
  Name:                      gfx900
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3137.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

rocminfo

ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 3700X 8-Core Processor 
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 3700X 8-Core Processor 
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   0                                  
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16436616(0xfacd88) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16436616(0xfacd88) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-02151e1bb9ee2144               
  Marketing Name:          Vega 10 XT [Radeon RX Vega 64]     
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26751(0x687f)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1590                               
  BDFID:                   12032                              
  Internal Node ID:        1                                  
  Compute Unit:            56                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx900          
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***   

rocm-bandwidth-test

          RocmBandwidthTest Version: 2.3.11

          Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)


          Device: 0,  AMD Ryzen 7 3700X 8-Core Processor
          Device: 1,  Vega 10 XT [Radeon RX Vega 64],  2f:0.0

          Inter-Device Access

          D/D       0         1         

          0         1         0         

          1         1         1         


          Inter-Device Numa Distance

          D/D       0         1         

          0         0         N/A       

          1         20        0         


          Unidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         9.295924    

          1         8.892247    72.654038   


          Bidirectional copy peak bandwidth GB/s

          D/D       0           1           

          0         N/A         17.103560   

          1         17.103560   N/A         

@sunway513 sunway513 commented Jul 20, 2020

Hi @nickdon2007 @webber26232 , thanks for reporting your observations.
We've been looking into the performance drop reported for the TF2.2 release branch. The issue has been identified, and we'll try to provide fixes in the next few weeks with the next ROCm release.
cc @ekuznetsov139 @deven-amd

@joket1999 joket1999 commented Sep 13, 2020

Ubuntu 20.04

Radeon VII
VBIOS version: 113-D3600200-106

rocm==3.7
tensorflow==2.3
benchmarks==cnn_tf_v2.1_compatible

python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Step	Img/sec	total_loss
1	images/sec: 284.8 +/- 0.0 (jitter = 0.0)	7.608
10	images/sec: 284.0 +/- 0.3 (jitter = 0.7)	7.849
20	images/sec: 284.0 +/- 0.2 (jitter = 0.6)	8.013
30	images/sec: 284.0 +/- 0.1 (jitter = 0.7)	7.939
40	images/sec: 283.9 +/- 0.1 (jitter = 0.8)	8.137
50	images/sec: 283.8 +/- 0.2 (jitter = 0.8)	8.051
60	images/sec: 283.7 +/- 0.1 (jitter = 0.8)	7.781
70	images/sec: 283.7 +/- 0.1 (jitter = 0.8)	7.856
80	images/sec: 283.7 +/- 0.1 (jitter = 0.9)	8.012
90	images/sec: 283.7 +/- 0.1 (jitter = 0.8)	7.842
100	images/sec: 283.7 +/- 0.1 (jitter = 0.7)	8.090
----------------------------------------------------------------
total images/sec: 283.60
----------------------------------------------------------------
python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 391.8 +/- 0.0 (jitter = 0.0)	7.573
10	images/sec: 394.2 +/- 0.5 (jitter = 1.9)	7.848
20	images/sec: 394.6 +/- 0.3 (jitter = 1.4)	7.966
30	images/sec: 394.7 +/- 0.3 (jitter = 1.1)	7.907
40	images/sec: 394.1 +/- 0.3 (jitter = 1.7)	8.070
50	images/sec: 394.2 +/- 0.2 (jitter = 1.6)	8.047
60	images/sec: 394.3 +/- 0.2 (jitter = 1.6)	7.769
70	images/sec: 394.4 +/- 0.2 (jitter = 1.5)	7.859
80	images/sec: 394.2 +/- 0.2 (jitter = 1.6)	7.965
90	images/sec: 394.1 +/- 0.2 (jitter = 1.7)	7.822
100	images/sec: 394.1 +/- 0.2 (jitter = 1.7)	8.058
----------------------------------------------------------------
total images/sec: 393.89
----------------------------------------------------------------

python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step	Img/sec	total_loss
1	images/sec: 292.8 +/- 0.0 (jitter = 0.0)	7.781
10	images/sec: 292.6 +/- 0.2 (jitter = 0.7)	7.740
20	images/sec: 292.3 +/- 0.1 (jitter = 0.6)	7.827
30	images/sec: 292.2 +/- 0.1 (jitter = 0.3)	7.963
40	images/sec: 292.0 +/- 0.1 (jitter = 0.4)	7.884
50	images/sec: 291.9 +/- 0.1 (jitter = 0.5)	7.792
60	images/sec: 291.8 +/- 0.1 (jitter = 0.5)	8.015
70	images/sec: 291.7 +/- 0.1 (jitter = 0.6)	7.868
80	images/sec: 291.6 +/- 0.1 (jitter = 0.6)	7.933
90	images/sec: 291.5 +/- 0.1 (jitter = 0.6)	7.746
100	images/sec: 291.4 +/- 0.1 (jitter = 0.7)	7.997
----------------------------------------------------------------
total images/sec: 291.38
----------------------------------------------------------------

python3 ./tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Done warm up
Step	Img/sec	total_loss
1	images/sec: 426.1 +/- 0.0 (jitter = 0.0)	7.794
10	images/sec: 428.1 +/- 0.3 (jitter = 0.9)	7.737
20	images/sec: 427.7 +/- 0.3 (jitter = 0.9)	7.828
30	images/sec: 427.5 +/- 0.2 (jitter = 1.0)	7.960
40	images/sec: 427.2 +/- 0.2 (jitter = 1.3)	7.889
50	images/sec: 427.0 +/- 0.2 (jitter = 1.3)	7.788
60	images/sec: 427.0 +/- 0.1 (jitter = 1.2)	8.019
70	images/sec: 426.8 +/- 0.1 (jitter = 1.2)	7.869
80	images/sec: 426.7 +/- 0.1 (jitter = 1.1)	7.931
90	images/sec: 426.6 +/- 0.1 (jitter = 1.2)	7.731
100	images/sec: 426.4 +/- 0.1 (jitter = 1.2)	7.992
----------------------------------------------------------------
total images/sec: 426.36
----------------------------------------------------------------

@dcominottim dcominottim commented Jan 15, 2021

Here are some RTX 3080 10GB results.

(Note: the low scores at some of the higher batch sizes, marked (UM), are because CUDA Unified Memory and shared memory were used due to lack of VRAM.)

Ryzen 9 5950X
32GB 3200MHz RAM
Pop_OS! 20.04.1
NVIDIA 460 driver
tensorflow-gpu 2.4.0
NVIDIA 20.12-tf2-py3 Docker image

sudo docker run --gpus all --name tf-20.12 --shm-size=10g --ulimit memlock=-1 --ulimit stack=67108864 -it --rm -v $HOME/Projects/nvidia/tensorflow-gpu/benchmarks-master/scripts/tf_cnn_benchmarks:/projects nvcr.io/nvidia/tensorflow:20.12-tf2-py3

FP32 (images/sec)   ResNet50    AlexNet    Inception v3   VGG16    GoogLeNet   ResNet152
batch_size=512      /           4715.95    /              /        /           /
batch_size=256      54.2 (UM)   4578.22    /              /        /           /
batch_size=128      62.8 (UM)   4237.48    52.8 (UM)      /        1016.12     /
batch_size=64       396.26      3373.96    278.23         245.71   906.01      /
batch_size=32       362.88      2467.48    260.47         238.11   802.6       150.18

FP16 (images/sec)   ResNet50    AlexNet    Inception v3   VGG16    GoogLeNet   ResNet152
batch_size=512      /           6504.74    /              /        /           /
batch_size=256      /           5819.6     /              /        1790.52     /
batch_size=128      947.3       4919.44    635.26         355.78   1645.71     /
batch_size=64       900.25      3797.61    578.34         326.88   1498.69     384.89
batch_size=32       736.35      2512.88    517.68         295.81   1307.13     321.85
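About the (UM) entries above: one way to check whether a batch size truly fits in VRAM is to cap TensorFlow's allocator at the card's physical memory, so the run fails with an OOM error instead of quietly spilling into Unified/shared memory. A minimal sketch using the tf.config experimental API; the 9728 MB limit is just an illustrative value for a 10 GB card:

import tensorflow as tf

# Cap the first GPU at roughly its physical VRAM so runs that would
# otherwise overflow into CUDA Unified Memory fail fast with an OOM.
# The 9728 MB figure is an arbitrary example for a 10 GB RTX 3080.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=9728)],
    )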

@EmilPi EmilPi commented Feb 2, 2021

Any 6900 XT benchmarks?

@Daniel451 Daniel451 commented Feb 15, 2021

@EmilPi 6900 XT would be very interesting indeed

@qixiang109 qixiang109 commented Mar 12, 2021

@dcominottim Here are my GTX 1080 and Radeon VII numbers, in training examples per second:

[attached chart: training examples/second, GTX 1080 vs Radeon VII]

@dcominottim dcominottim commented Mar 25, 2021

@Daniel451 @EmilPi @qixiang109 Unfortunately, without ROCm support for RDNA*, we can't test ROCm performance yet. However, I've managed to test a 6800 XT with tensorflow-directml (1.15.4, the latest version as of now) on Windows 10! That's at least a small ray of hope for RDNA owners who are interested in ML. Here are the numbers:

Ryzen 9 5950X
32GB 3200MHz RAM
6800 XT
Windows 10 20H2 19042.867
AMD Adrenalin 21.3.1
Python 3.7.10
tensorflow-directml 1.15.4

FP32 (images/sec)   ResNet50   AlexNet   Inception v3   VGG16   GoogLeNet   ResNet152
batch_size=128      63.2       590.1     52.6           29.6    244.0       /
batch_size=64       /          /         /              /       /           27.9

FP16 (images/sec)   ResNet50   AlexNet   Inception v3   VGG16   GoogLeNet   ResNet152
batch_size=128      52         528.2     41.0           23.9    174.0       23.1
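Since tensorflow-directml exposes the card as a DirectML device rather than a CUDA/ROCm GPU, a quick sanity check (TF 1.15-style API) is to list the local devices and look for a DML entry; a minimal sketch:

from tensorflow.python.client import device_lib

# List every device TensorFlow can see; with tensorflow-directml the
# Radeon card typically shows up as a "DML" device rather than "GPU".
for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)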

@plinnie plinnie commented May 20, 2021

I have an MI50 and a V100 available that I can use for benchmarking. What would be the best benchmarks to run? The original benchmarks seem outdated.
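If tf_cnn_benchmarks feels too dated, one rough (unofficial) alternative is a small tf.keras loop over synthetic data that reports images/sec in the same spirit as the numbers above; a minimal sketch, with batch size, step count, and image shape chosen as arbitrary examples:

import time
import tensorflow as tf

BATCH, WARMUP_STEPS, TIMED_STEPS = 64, 10, 100

# Synthetic ImageNet-sized batch, repeated forever.
images = tf.random.uniform((BATCH, 224, 224, 3))
labels = tf.random.uniform((BATCH,), maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensors((images, labels)).repeat()

model = tf.keras.applications.ResNet50(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# Warm up (MIOpen/cuDNN algorithm selection), then time the steady state.
model.fit(dataset, steps_per_epoch=WARMUP_STEPS, epochs=1, verbose=0)
start = time.time()
model.fit(dataset, steps_per_epoch=TIMED_STEPS, epochs=1, verbose=0)
print("total images/sec: %.2f" % (BATCH * TIMED_STEPS / (time.time() - start)))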
