
Tensorflow 2.0 AMD support #362

Closed
Cvikli opened this issue Mar 20, 2019 · 34 comments

@Cvikli Cvikli commented Mar 20, 2019

I would be curious whether Tensorflow 2.0 works with the AMD Radeon VII?

Also, if it is available, are there any benchmark comparisons against the 2080 Ti on some standard network, to see whether we should invest in Radeon VII clusters?

@sunway513 sunway513 commented Mar 21, 2019

Hi @Cvikli , we are finalizing the 2.0-alpha docker image; it will be available soon, please stay tuned.

@sunway513 sunway513 self-assigned this Mar 21, 2019
@sunway513 sunway513 commented Mar 22, 2019

Hi @Cvikli , we've pushed out the preview build docker image for TF2.0-alpha0:
rocm/tensorflow:tf2.0-alpha0-preview
Please help review it and let us know your feedback :-)
Here's the link to our dockerhub repo:
https://cloud.docker.com/u/rocm/repository/docker/rocm/tensorflow/general
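For reference, a typical way to pull and start the image looks like this (the device/group flags are the usual ROCm docker run options; adjust for your own setup):

docker pull rocm/tensorflow:tf2.0-alpha0-preview
sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video rocm/tensorflow:tf2.0-alpha0-preview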

@Cvikli Cvikli commented Mar 23, 2019

Great!
Just ordered our first card for testing. :) If the delivery and tests go well, then I will be back with results by April 2.

Thank you for the fast work! I am really excited about it!

@dagamayank dagamayank commented Mar 26, 2019

Please open a new issue if bugs are found with the 2.0 docker.

@dagamayank dagamayank closed this Mar 26, 2019
@Cvikli Cvikli commented Apr 3, 2019

Sorry for reopening the thread, but I owe you guys a lot!

The Radeon VII's performance is crazy with TensorFlow 2.0a.
In our tests we reached close to the same speed as our 2080 Ti (about 10-15% less)! But the Radeon VII has more memory, which was a bottleneck in our case. At this price, we think this video card has the best value for machine learning at our company!

We are glad we opened our eyes to AMD products; we are buying our first configuration, which is 40% cheaper and, as we measured, performs better in our scenario than our well-optimised server configuration.

Thank you for all the work!

@briansp2020 briansp2020 commented Apr 3, 2019

@Cvikli

We are glad we opened our eyes to AMD products; we are buying our first configuration, which is 40% cheaper and, as we measured, performs better in our scenario than our well-optimised server configuration.

Could you give a bit more detail? How much faster is the Radeon VII for your application? What type of model are you running (CNN/RNN/GAN/etc.)? What processor are you using?

Just curious.

@sunway513 sunway513 commented Apr 3, 2019

Thank you @Cvikli , great to hear that your experiment went well and that you are going to invest more in ROCm and AMD GPUs!

@Cvikli Cvikli commented May 12, 2019

The system is something like this:

  • 1x ASRock x399 taichi
  • 1x AMD TR4 2950X
  • 1x Samsung 970 EVO 1TB M.2 PCIe MZ-V7E1T0BW
  • 4x SAPPHIRE Radeon VII
  • 2x G.SKILL FlareX 64GB
  • 1x Thermaltake Toughpower 1500W Gold
  • 1x FRYZEN fan

The other system setup is close to the same, except it was with 4x NVidia 1080ti.

The results with RNN networks on 1 Radeon VII and 1 1080ti were close to the same.

Now, after we switched over to 4 Radeon VII cards, we face two big scaling issues on convolutional networks.

  1. One of our computers has 4 AMD Radeon VII cards, but we can't run more than one calculation on the system (without the error below) when we use two separate GPU cards. The second calculation, running on the other GPU, writes this:
2019-05-12 15:28:04.632396: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)
2019-05-12 15:28:04.632456: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 13.45G (14444931072 bytes) from device: hipError_t(1002)
2019-05-12 15:28:04.632475: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 12.11G (13000437760 bytes) from device: hipError_t(1002)
... many lines like this
2019-05-12 15:36:58.756188: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 310.35M (325421568 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756226: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 279.31M (292879616 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756252: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 251.38M (263591680 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756279: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 226.24M (237232640 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756304: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 203.62M (213509376 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756323: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 183.26M (192158464 bytes) from device: hipError_t(1002)
2019-05-12 15:36:58.756343: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 164.93M (172942848 bytes) from device: hipError_t(1002)
2019-05-12 15:37:01.337949: E tensorflow/stream_executor/rocm/rocm_driver.cc:493] failed to memset memory: HIP_ERROR_InvalidValue
Segmentation fault (core dumped)

We are pretty sure things should work, because the same setup worked with the NVidia 1080ti. However, despite writing that it failed to allocate the memory, the whole program still starts and seems to run normally, I think.

Could it be because of the docker image that we can't use separate GPUs for different runs?

  2. Comparing the convolutional performance of the 4x AMD and the 4x Nvidia setups, the difference gets really huge because of cuDNN on the Nvidia cards. We get more than 10x the performance from the 1080Ti compared to the Radeon VII card. We find this difference in speed a little too big for image recognition; I can't believe this should happen and that the hardware shouldn't be able to achieve the same.

What do you guys think about this? Is it normal that we get 10x slower speed when it comes to cuDNN? (To me cuDNN sounds like purely software with better arithmetic operations, I guess; is it possible to improve on this?)

@sunway513 sunway513 commented May 12, 2019

Hi @Cvikli , let's step back a bit and look at your system configuration:

  • 4x SAPPHIRE Radeon VII
  • 2x G.SKILL FlareX 64GB
  • 1x Thermaltake Toughpower 1500W Gold

The typical gold workstation power supply runs at about 87% efficiency at full load, so it can supposedly deliver up to ~1307W.
The TR 2950X TDP is 180W; the Radeon VII TDP is 300W, but its peak power consumption can go up to 321.8W (according to the third-party measurement here).
Considering the other components in your workstation, the current 1500W is not sufficient for your system at full load. We'd recommend going for an 1800W PSU, or dual 1000W PSUs, to provide sufficient power for 4 Radeon VII GPUs.

2019-05-12 15:28:04.632396: E tensorflow/stream_executor/rocm/rocm_driver.cc:629] failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)

The above error message indicates the target GPU's device memory has already been allocated by another process.
There are a couple of ways to expose only selected GPUs to a user process:

  1. Use the HIP_VISIBLE_DEVICES environment variable to select the target GPUs for the process at the HIP level. e.g. use the following to select the first GPU:
  • export HIP_VISIBLE_DEVICES=0
  2. Use the ROCR_VISIBLE_DEVICES environment variable to select the target GPUs at the ROCr (ROCm user-mode driver) level. e.g. use the following to select the first GPU:
  • export ROCR_VISIBLE_DEVICES=0
  3. Pass only selected GPU driver interfaces (/dev/dri/renderD#) to the Docker container. e.g. use the following docker run command option to select the first GPU:
  • sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri/renderD128 --group-add video
    Note you should see the following four interfaces on your 4x Radeon VII system:
    $ ls /dev/dri/render*
    /dev/dri/renderD128 /dev/dri/renderD129 /dev/dri/renderD130 /dev/dri/renderD131

We recommend approach #3, as that isolates the GPUs at a relatively lower level of the ROCm stack.
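For example, a quick sketch of running two independent training jobs pinned to different GPUs with the environment-variable approaches (the script names here are just placeholders):

# terminal 1: only GPU 0 is visible to this process
ROCR_VISIBLE_DEVICES=0 python3 train_job_a.py
# terminal 2: only GPU 1 is visible to this process
ROCR_VISIBLE_DEVICES=1 python3 train_job_b.py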

For your concern on mGPU performance, could you provide the exact commands to reproduce your observations?

Just FYI, we have been actively running regression tests for single-node multi-GPU performance, and there is no mGPU performance regression issue reported for TF1.13 on the ROCm2.4 release.
Once the power supply concern is resolved, for tf_cnn_benchmarks resnet50 as an example, you should be able to see near-linear FP32 scalability using the following command with 4 GPUs:
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --data_format=NCHW --batch_size=128 --model=resnet50 --optimizer=sgd --num_batches=100 --variable_update=replicated --nodistortions --gpu_thread_mode=gpu_shared --num_gpus=4 --all_reduce_spec=pscpu --print_training_accuracy=True --display_every=10

@Cvikli Cvikli commented May 13, 2019

Thank you for the 3 different ways to manage visible devices.
The second solution (with export ROCR_VISIBLE_DEVICES=0) WORKED like a charm for us!
Interestingly, the third solution didn't restrict the available GPU devices in the docker container.

Ran some tests on TF2.0 on ROCm2.4, and performance is still a lot lower than what an Nvidia 1080Ti can provide when benchmarking MobileNetV2, which still bothers us a little.
To give some direction for TF2.0 on ROCm2.4, I thought I'd share these logs.
This is printed before the calculations start for MobileNetV2:

2019-05-13 18:48:40.653042: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library librocblas.so
2019-05-13 18:48:40.683726: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libMIOpen.so
2019-05-13 18:48:44.998231: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:45.094061: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
... 2x14 lines like this with Backward-Data and Backward-Filter
2019-05-13 18:48:48.854030: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:48.945517: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-05-13 18:48:49.207930: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:49.295100: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-05-13 18:48:50.639570: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter

So I pretty much feel like we are running some operations 19 times, which leads to the 10-15x speed loss, but that is only a guess. If I can help in any other way, let me know.

PS.: on TF2.0 ROCm2.4, I couldn't run tf_cnn_benchmarks.py because tensorflow.contrib is missing.

@sunway513 sunway513 commented May 13, 2019

Hi @Cvikli , glad the ROCr env var worked for you!
For approach #3, if you run ROCr-level utils you should see the restricted access (e.g. /opt/rocm/bin/rocminfo); however, since rocm_smi uses a different approach to query GPU status, you can still see all the GPUs via rocm_smi even if you pass limited GPU device interfaces to the docker container. Adding @jlgreathouse @y2kenny for awareness.

2019-05-13 18:48:44.998231: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-05-13 18:48:45.094061: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter

The above logs indicate the time spent there was actually MIOpen compiling kernels; please refer to my previous comment here for reference.
That is a one-time effort: for later runs MIOpen will just pick up the cached kernels under ~/.cache/miopen instead of compiling them again. If you have been using docker containers for your dev work, you can consider committing the docker container with the MIOpen cache compiled, so you can reuse it later.
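For example, a rough sketch of committing a warmed-up container (the container ID and image tag below are placeholders):

# after a warm-up run inside the container has populated ~/.cache/miopen
docker ps                                   # find the running container's ID
docker commit <container_id> my-tf-rocm:miopen-cache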

@sunway513 sunway513 commented May 13, 2019

Besides, if your application is built on the TF1.x API, you might use the following TF1.13 release instead of the TF2.0 branch built with --config=v1:
rocm/tensorflow:rocm2.4-tf1.13-python3

@Cvikli Cvikli commented May 23, 2019

We ported our code from tf2.0 to tf1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on the ROCm2.4 release), and we still see NO improvement in speed.
The Nvidia 1080Ti still performs 5-10x faster. I don't know whether it is because cuDNN or CUDA is not available for Radeon cards, but this performance difference is pretty high.
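For context, our timing run is roughly along these lines (not our exact script; the batch size and random data are just illustrative):

import time
import numpy as np
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

x = np.random.rand(64, 224, 224, 3).astype("float32")
y = np.random.randint(0, 1000, size=(64,))

model.fit(x, y, batch_size=32, epochs=1, verbose=0)   # warm-up (kernel compile / auto-tune)
start = time.time()
model.fit(x, y, batch_size=32, epochs=5, verbose=0)
print("seconds per epoch:", (time.time() - start) / 5)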

@sunway513 sunway513 commented May 23, 2019

Hi @Cvikli , could you provide the exact steps to reproduce your observation?
FYI, Tensorflow-ROCm uses the ROCm MIOpen library to accelerate DL workloads; the repo is here:
https://github.com/ROCmSoftwarePlatform/MIOpen

@QuantumInformation QuantumInformation commented May 30, 2019

Has anyone tested this with the latest MacBook Pros?

@quocdat32461997 quocdat32461997 commented Jun 10, 2019

I ran into the error "failed to allocate 14.95G (16049923584 bytes) from device: hipError_t(1002)" as above.
System info:
Intel® Xeon(R) CPU E5-2630 v2 @ 2.60GHz × 12
Radeon VII
1500 W PSU
ROCm installed with Tensorflow-rocm 1.13.1 (through pip3)

I have not tried installing tensorflow-rocm through Docker.

Any help?

@sunway513 sunway513 commented Jun 11, 2019

Hi @quocdat32461997 , can you try setting the following environment variable:
export HIP_HIDDEN_FREE_MEM=500
If it still fails, please create a new issue and provide more complete logs.

@quocdat32461997 quocdat32461997 commented Jun 11, 2019

Problem solved by re-installing ROCm and Tensorflow-rocm. Probably I did not install ROCm properly. Thanks a lot.

@Cvikli Cvikli commented Jun 11, 2019

Hey there!
I would like to know whether there will be a new docker image with tensorflow==2.0.0b installed, because currently only the alpha version is available for tf2.0.
By the way, we ran the https://github.com/lambdal/lambda-tensorflow-benchmark tests, and the difference between the Nvidia and Radeon cards is less than stated above.
If you are interested, I can share the test results here.

@sunway513 sunway513 commented Jun 11, 2019

Hi @Cvikli , we are preparing the TF2.0 beta release; it's currently going through QA testing.
We'll update here after the new docker image is available.

@Cvikli Cvikli commented Jun 11, 2019

You guys, you are crazy! I love it! :) Thank you for this speed!

@satvikpendem satvikpendem commented Jun 11, 2019

Looks like the link at the beginning of the thread redirects to https://hub.docker.com; here's the link I'm using to track releases: https://hub.docker.com/r/rocm/tensorflow/tags

@sunway513 sunway513 commented Jun 20, 2019

Hi @Cvikli , we have published the docker container for TF-ROCm 2.0 Beta1. Please kindly check it and let us know if you have any questions:
rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2

@moonshine502 moonshine502 commented Jun 21, 2019

Hi everyone,
when I run the rocm/tensorflow:rocm2.5-tf2.0-beta1-config-v2 docker container, or any other container with tensorflow 2.0, trying to import tensorflow results in the following error:
>>> import tensorflow as tf
Illegal instruction (core dumped)

I am using an RX 480 with ROCm 2.5, and ROCm with tensorflow 1.13 works fine.

@sunway513 sunway513 commented Jun 21, 2019

Hi @moonshine502 , I've tried a couple of samples using the rocm2.5-tf2.0-beta1-config-v2 docker image on my GFX803 node, and they work fine.
Could you provide the steps to reproduce your issue?

@moonshine502 moonshine502 commented Jun 22, 2019

Hi @sunway513,
thank you for your response.

Hardware: Intel Celeron G3900 (Skylake), AMD Radeon RX 480 (gfx803)
Software:

Issue:
Executing python3 -c "import tensorflow as tf" inside the Docker container results in:
python3 -c "import tensorflow as tf"
Illegal instruction (core dumped)

I am guessing that this error is caused by the CPU not being compatible with the new tensorflow version. Could this be the case?

@dundir dundir commented Jun 25, 2019

@moonshine502 I'm running almost exactly the same system setup, and it's able to load and train for me.

The only difference appears to be the CPU, or possibly the card. I'm using a Ryzen 5 2400G; everything else looks nearly the same. I'm using an RX 560 14CU, which registers in Linux as an RX 480 (gfx803), on ROCm 2.5.27.

I ran through all the steps for training an MNIST model at the link below to confirm tf2.0 was actually working; the evaluation accuracy wasn't the best (~87.7% vs 98%), but it was able to compute. A rough sketch of the quickstart is below.

https://www.tensorflow.org/beta/tutorials/quickstart/beginner
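For reference, the quickstart boils down to roughly this (reproduced from memory, so details may differ slightly from the tutorial):

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)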

Edit: included more info.

@moonshine502 moonshine502 commented Jun 25, 2019

Hi @dundir, @sunway513,

I am now pretty sure that the cause of the problem is my CPU, which does not support AVX instructions. It seems that previous versions of tensorflow with ROCm were compiled without AVX, because they work on my machine. So I may try to build tensorflow 2.0 without AVX, or get a new CPU.
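A quick way to check is something like:

grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u    # empty output means the CPU has no AVX support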

Thank you for your help.

@dundir dundir commented Jun 30, 2019

@sunway513 It looks like there may be a ROCm-related issue with the accuracy of training a basic MNIST model.

Running this code: here
GPU passthru stdout: here

The docker container was set up with the same passthru options as 1.13; the resulting accuracy diverged to 87% from the baseline of 97%, and the overall training time for 5 epochs diverged to 44s from the baseline of 20s (no passthru).

No dev passthru stdout: here

@dundir dundir commented Jul 11, 2019

@sunway513 Looks like the accuracy issue I previously mentioned regarding mnist was resolved with the latest tf2.0 docker image (rocm/tensorflow:rocm2.6-tf2.0-config-v2-dev).

Thanks, and much appreciated. You guys are doing an awesome job.

@bionicles bionicles commented Oct 7, 2019

Since memory is the bottleneck, can we use bfloat16, int8, float8, or float16? Just curious.
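For what it's worth, a crude way to at least try plain float16 end-to-end in Keras is to change the default float type; I haven't verified how well MIOpen handles it, and numerics can suffer without a proper mixed-precision/loss-scaling setup:

import tensorflow as tf

tf.keras.backend.set_floatx("float16")   # Keras layers now default to float16

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
print(model.layers[0].dtype)   # float16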

@salmanulhaq salmanulhaq commented Nov 28, 2019

We ported our code from tf2.0 to tf1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on the ROCm2.4 release), and we still see NO improvement in speed.
The Nvidia 1080Ti still performs 5-10x faster. I don't know whether it is because cuDNN or CUDA is not available for Radeon cards, but this performance difference is pretty high.

cuDNN is not purely a software play; it is backed by actual silicon (dedicated tensor cores for MAD ops), which boosts half-precision performance. I'll need to check whether the Radeon VII has dedicated tensor cores as well. Also, Nvidia won't automatically optimize code to make use of tensor cores; that has to be done using the cuDNN extensions.

@michaelklachko michaelklachko commented Nov 28, 2019

@salmanulhaq 1080Ti has no tensor cores.

@raxbits raxbits commented Nov 28, 2019

We ported our code from tf2.0 to tf1.13 and ran the MobileNetV2 implementation from tf.keras.applications on the configuration you suggested (TF1.13 on the ROCm2.4 release), and we still see NO improvement in speed.
The Nvidia 1080Ti still performs 5-10x faster. I don't know whether it is because cuDNN or CUDA is not available for Radeon cards, but this performance difference is pretty high.

cuDNN is not purely a software play; it is backed by actual silicon (dedicated tensor cores for MAD ops), which boosts half-precision performance. I'll need to check whether the Radeon VII has dedicated tensor cores as well. Also, Nvidia won't automatically optimize code to make use of tensor cores; that has to be done using the cuDNN extensions.

Do you have a reference for hardware being involved in cuDNN?

cuDNN, AFAIK, is a pure software play with optimizations and whatnot; what you may be referring to is tensor cores, which were added with Volta and carried over to Turing silicon.
