
W6800 and ROCm #1595

Closed
KurtStineAI2 opened this issue Oct 22, 2021 · 17 comments

Comments

@KurtStineAI2

Is there an ETA on when the W6800 will support ROCm? We'd like to trial ROCm for our ML applications, but Vega chips are getting more and more difficult to find. Instinct chips are not an option at this point as we would be trialing the hardware, not investing yet.

@zrzrv5

zrzrv5 commented Oct 30, 2021

@aoolmay

aoolmay commented Oct 30, 2021

@zrzrv5 It's not included here https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support
I took the time to install 4.5 to verify: NAVI is still a no-go, neither in the regular nor in the PRO packaging.

@KurtStineAI2 If waiting for that support costs you money and opportunities, don't. I'm waiting over a year already, regretting every day.

Edit:
Not all bad though, further testing of 4.5 shows that at least OpenCL on NAVI works predictably now. So that's progress, but the pace is unacceptable. LOL

@ianferreira

@ROCmSupport you committed to Navi support before the end of the year. What is going on?

@zrzrv5

zrzrv5 commented Nov 1, 2021

> @zrzrv5 It's not included here https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support I took the time to install 4.5 to verify: NAVI is still a no-go, neither in the regular nor in the PRO packaging.
>
> @KurtStineAI2 If waiting for that support costs you money and opportunities, don't. I'm waiting over a year already, regretting every day.
>
> Edit: Not all bad though, further testing of 4.5 shows that at least OpenCL on NAVI works predictably now. So that's progress, but the pace is unacceptable. LOL

I just installed ROCm 4.5 on my 6800 XT. It kinda works (I guess); the latest TensorFlow Docker image works fine.

@ianferreira

> @zrzrv5 It's not included here https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support I took the time to install 4.5 to verify: NAVI is still a no-go, neither in the regular nor in the PRO packaging.
>
> @KurtStineAI2 If waiting for that support costs you money and opportunities, don't. I'm waiting over a year already, regretting every day.
>
> Edit: Not all bad though, further testing of 4.5 shows that at least OpenCL on NAVI works predictably now. So that's progress, but the pace is unacceptable. LOL
>
> I just installed ROCm 4.5 on my 6800 XT. It kinda works (I guess); the latest TensorFlow Docker image works fine.

Confirmed, the missing HIP binary error is resolved. Simple repro:

import tensorflow as tf
from tensorflow.keras.models import Sequential

model = Sequential()

This now passes, with GPU devices created.

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira Never used the Docker images. Can you point me to instructions for that?

@ianferreira

> @ianferreira Never used the Docker images. Can you point me to instructions for that?

Yes, there is a rocm/tensorflow image. The run command is a bit messy, but here is the link: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
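For reference, the "messy" run command usually looks roughly like this. This is a sketch based on the device and permission flags commonly documented for ROCm containers; the exact image tag and any extra flags in the linked README may differ:

```shell
# /dev/kfd is the ROCm compute interface; /dev/dri exposes the GPUs.
# seccomp=unconfined and the video group are commonly needed for ROCm
# memory mappings and device access inside the container.
docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --group-add video \
  rocm/tensorflow:latest
```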

I also confirmed on bare metal it works!

Many thanks @ROCmSupport! After buying two RX 6800 cards at "scalper prices" I can now finally put them to use!

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira You've put some hope into me, so I tried a clean reinstall just to be sure, but I'm getting "hipErrorNoBinaryForGpu" again on bare metal. Any hints?
I hope it's not about any differences the 6800 XT might have from the 6800 that get it excluded somehow.

Edit:
OK, I confirm that at least Docker works. I guess I have some issue with bare metal.

@ianferreira

ianferreira commented Nov 1, 2021

@aoolmay Make sure to uninstall the amdgpu pro driver. Then follow the steps to reinstall the new version.

E.g., for Ubuntu 20.04:

$ sudo apt-get update

$ wget https://repo.radeon.com/amdgpu-install/21.40/ubuntu/focal/amdgpu-install-21.40.40500-1_all.deb

$ sudo apt-get install ./amdgpu-install-21.40.40500-1_all.deb

$ sudo apt-get update

Reboot.

You have to run the drun alias for Docker. If you are running without Docker, remember to set your LD_LIBRARY_PATH to /opt/rocm-4.5.0/lib.
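On bare metal, that last step would look something like the line below (a sketch; the path assumes the default ROCm 4.5.0 install location mentioned above):

```shell
# Prepend the ROCm 4.5 libraries to the dynamic linker search path.
export LD_LIBRARY_PATH=/opt/rocm-4.5.0/lib:${LD_LIBRARY_PATH:-}
echo "$LD_LIBRARY_PATH"
```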

@KurtStineAI2
Author

@zrzrv5 Does PyTorch work yet?

@ianferreira

(tf2) ian@ian-TRX40-AORUS-PRO-WIFI:~/Documents$ /home/ian/.venvs/tf2/bin/python /home/ian/Documents/tensorflow/mnist.py
2021-11-01 10:27:39.123035: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-01 10:27:39.125747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15868 MB memory: -> device: 0, name: AMD Radeon RX 6800 XT, pci bus id: 0000:23:00.0
2021-11-01 10:27:39.422219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15868 MB memory: -> device: 1, name: AMD Radeon RX 6800 XT, pci bus id: 0000:4b:00.0
Epoch 1/6
2021-11-01 10:27:40.136548: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-01 10:27:40.188295: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.196151: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.200935: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.202924: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.207395: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
461/469 [============================>.] - ETA: 0s - loss: 0.3600 - sparse_categorical_accuracy: 0.90082021-11-01 10:27:47.310747: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.313001: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.314621: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.325016: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.327242: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.328846: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.384100: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.389266: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
469/469 [==============================] - 8s 3ms/step - loss: 0.3577 - sparse_categorical_accuracy: 0.9014 - val_loss: 0.1925 - val_sparse_categorical_accuracy: 0.9439
Epoch 2/6
469/469 [==============================] - 1s 2ms/step - loss: 0.1654 - sparse_categorical_accuracy: 0.9531 - val_loss: 0.1360 - val_sparse_categorical_accuracy: 0.9605
Epoch 3/6
469/469 [==============================] - 1s 2ms/step - loss: 0.1200 - sparse_categorical_accuracy: 0.9652 - val_loss: 0.1169 - val_sparse_categorical_accuracy: 0.9641
Epoch 4/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0926 - sparse_categorical_accuracy: 0.9730 - val_loss: 0.1007 - val_sparse_categorical_accuracy: 0.9688
Epoch 5/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0748 - sparse_categorical_accuracy: 0.9788 - val_loss: 0.0882 - val_sparse_categorical_accuracy: 0.9734
Epoch 6/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0618 - sparse_categorical_accuracy: 0.9821 - val_loss: 0.0845 - val_sparse_categorical_accuracy: 0.9741

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira Thanks for the previous comment; I'm fine with using Docker now that I've learned about it. I'll get around to finding out what's wrong with my bare metal setup in time. It's working, which is all I cared about.

About that example though, two GPUs are shown. Are those both used by the process automatically? I'm guessing it's just debug/info output, but maybe there's more progress I missed.

@ianferreira

> @zrzrv5 Does PyTorch work yet?

Have not gotten the Docker image or bare metal to work with PyTorch. Same hipErrorNoBinaryForGpu error.

@ianferreira

> @ianferreira Thanks for the previous comment; I'm fine with using Docker now that I've learned about it. I'll get around to finding out what's wrong with my bare metal setup in time. It's working, which is all I cared about.
>
> About that example though, two GPUs are shown. Are those both used by the process automatically? I'm guessing it's just debug/info output, but maybe there's more progress I missed.

The script I used did not do multi-GPU; let me try and make sure RCCL works...

@ianferreira

> @ianferreira Thanks for the previous comment; I'm fine with using Docker now that I've learned about it. I'll get around to finding out what's wrong with my bare metal setup in time. It's working, which is all I cared about.
>
> About that example though, two GPUs are shown. Are those both used by the process automatically? I'm guessing it's just debug/info output, but maybe there's more progress I missed.

Just used MirroredStrategy, and it seems both GPUs are working...

strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
2021-11-01 11:08:49.398961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15868 MB memory: -> device: 0, name: AMD Radeon RX 6800 XT, pci bus id: 0000:23:00.0
2021-11-01 11:08:49.630293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15868 MB memory: -> device: 1, name: AMD Radeon RX 6800 XT, pci bus id: 0000:4b:00.0
Number of devices: 2

Seems both are working... but not at 100%, given I don't have an "NVLINK" equivalent.
ian@ian-TRX40-AORUS-PRO-WIFI:~/Documents$ rocm-smi --showtopo

======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
GPU0 GPU1
GPU0 0 40
GPU1 40 0

============================ Hops between two GPUs =============================
GPU0 GPU1
GPU0 0 2
GPU1 2 0

========================== Link Type between two GPUs ==========================
GPU0 GPU1
GPU0 0 PCIE
GPU1 PCIE 0

================================== Numa Nodes ==================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 4294967295
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 4294967295
============================= End of ROCm SMI Log ==============================

ian@ian-TRX40-AORUS-PRO-WIFI:~/Documents$ rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 34.0c 28.0W 500Mhz 456Mhz 0% auto 255.0W 99% 75%
1 35.0c 40.0W 2445Mhz 96Mhz 0% auto 255.0W 98% 99%

============================= End of ROCm SMI Log ==============================
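For anyone who wants to reproduce the two-GPU run, the MirroredStrategy snippet above expands to roughly the following. This is a minimal sketch assuming a stock Keras MNIST-style setup; the model and data details are illustrative, not the exact script used in this thread:

```python
import tensorflow as tf

# Mirror variables across both cards; gradients are all-reduced between
# replicas each step (over PCIe here, since the topology above shows no
# NVLink-equivalent link between the GPUs).
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
print("Number of devices:", strategy.num_replicas_in_sync)

# Model creation and compilation must happen inside the strategy scope
# so that variables are created as mirrored variables.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["sparse_categorical_accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, batch_size=128, epochs=1)
```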

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira Really appreciate your input; I, and probably quite a few other people, would have missed the Navi-enabled Docker image otherwise. Thanks, man!

@ROCmSupport

Hi All,
Thanks for reaching out.
ROCm + Navi brought up a lot of critical issues in different areas/layers of the stack, so things got delayed.
We plan to enable official ROCm support on a set of Navi cards as soon as possible, most likely with an early ROCm 5.x release, such as 5.0.
Please stay tuned for updates.
Thank you.
