
W6800 and ROCm #1595

Closed
KurtStineAI2 opened this issue Oct 22, 2021 · 17 comments

Comments

@KurtStineAI2

Is there an ETA on when the W6800 will support ROCm? We'd like to trial ROCm for our ML applications, but Vega chips are getting more and more difficult to find. Instinct chips are not an option at this point as we would be trialing the hardware, not investing yet.

@zrzrv5

zrzrv5 commented Oct 30, 2021

@aoolmay

aoolmay commented Oct 30, 2021

@zrzrv5 It's not included here https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support
I took the time to install 4.5 to verify: NAVI is still a no-go, neither in the regular nor in the PRO packaging.

@KurtStineAI2 If waiting for that support costs you money and opportunities, don't. I'm waiting over a year already, regretting every day.

Edit:
Not all bad though, further testing of 4.5 shows that at least OpenCL on NAVI works predictably now. So that's progress, but the pace is unacceptable. LOL

@ianferreira

@ROCmSupport you committed to Navi support before the end of the year. What is going on?

@zrzrv5

zrzrv5 commented Nov 1, 2021

> @zrzrv5 It's not included here https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support I took the time to install 4.5 to verify: NAVI is still a no-go, neither in the regular nor in the PRO packaging.
>
> @KurtStineAI2 If waiting for that support costs you money and opportunities, don't. I'm waiting over a year already, regretting every day.
>
> Edit: Not all bad though, further testing of 4.5 shows that at least OpenCL on NAVI works predictably now. So that's progress, but the pace is unacceptable. LOL

I just installed ROCm 4.5 on my 6800 XT. It kinda works (I guess); the latest TensorFlow Docker image works fine.

@ianferreira

> @zrzrv5 It's not included here https://github.com/RadeonOpenCompute/ROCm#Hardware-and-Software-Support I took the time to install 4.5 to verify: NAVI is still a no-go, neither in the regular nor in the PRO packaging.
>
> @KurtStineAI2 If waiting for that support costs you money and opportunities, don't. I'm waiting over a year already, regretting every day.
>
> Edit: Not all bad though, further testing of 4.5 shows that at least OpenCL on NAVI works predictably now. So that's progress, but the pace is unacceptable. LOL
>
> I just installed ROCm 4.5 on my 6800 XT. It kinda works (I guess); the latest TensorFlow Docker image works fine.

Confirmed, the missing HIP binary error is resolved. Simple repro:

import tensorflow as tf
from tensorflow.keras.models import Sequential

model = Sequential()

This now passes, with GPU devices created.

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira Never used the Docker images. Can you point me to instructions for that?

@ianferreira

> @ianferreira Never used the Docker images. Can you point me to instructions for that?

Yes, there is a rocm/tensorflow image. The run command is a bit messy, but here is the link: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
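For reference, the "messy" run command usually looks roughly like this. This is a sketch based on the device and permission flags commonly documented for ROCm containers; the exact image tag and any extra flags in the linked README may differ:

```shell
# /dev/kfd is the ROCm compute interface; /dev/dri exposes the GPUs.
# seccomp=unconfined and the video group are commonly needed for ROCm
# memory mappings and device access inside the container.
docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --group-add video \
  rocm/tensorflow:latest
```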

I also confirmed on bare metal it works!

Many thanks @ROCmSupport! After buying two RX 6800 cards at "scalper prices" I can now finally put them to use!

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira You've put some hope into me, so I tried a clean reinstall just to be sure, but I'm getting "hipErrorNoBinaryForGpu" again on bare metal. Any hints?
I hope it's not about any differences the 6800 XT might have from the 6800 that get it excluded somehow.

Edit:
OK, I confirm that at least Docker works. I guess I have some issue with bare metal.

@ianferreira

ianferreira commented Nov 1, 2021

@aoolmay Make sure to uninstall the amdgpu pro driver. Then follow the steps to reinstall the new version.

E.g., for Ubuntu 20.04:

$ sudo apt-get update

$ wget https://repo.radeon.com/amdgpu-install/21.40/ubuntu/focal/amdgpu-install-21.40.40500-1_all.deb

$ sudo apt-get install ./amdgpu-install-21.40.40500-1_all.deb

$ sudo apt-get update

Reboot.

You have to run the drun alias for Docker. If you are running without Docker, remember to set your LD_LIBRARY_PATH to /opt/rocm-4.5.0/lib.
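On bare metal, that last step would look something like the line below (a sketch; the path assumes the default ROCm 4.5.0 install location mentioned above):

```shell
# Prepend the ROCm 4.5 libraries to the dynamic linker search path.
export LD_LIBRARY_PATH=/opt/rocm-4.5.0/lib:${LD_LIBRARY_PATH:-}
echo "$LD_LIBRARY_PATH"
```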

@KurtStineAI2
Author

@zrzrv5 Does PyTorch work yet?

@ianferreira

(tf2) ian@ian-TRX40-AORUS-PRO-WIFI:~/Documents$ /home/ian/.venvs/tf2/bin/python /home/ian/Documents/tensorflow/mnist.py
2021-11-01 10:27:39.123035: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-01 10:27:39.125747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15868 MB memory: -> device: 0, name: AMD Radeon RX 6800 XT, pci bus id: 0000:23:00.0
2021-11-01 10:27:39.422219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15868 MB memory: -> device: 1, name: AMD Radeon RX 6800 XT, pci bus id: 0000:4b:00.0
Epoch 1/6
2021-11-01 10:27:40.136548: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-01 10:27:40.188295: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.196151: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.200935: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.202924: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:40.207395: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
461/469 [============================>.] - ETA: 0s - loss: 0.3600 - sparse_categorical_accuracy: 0.90082021-11-01 10:27:47.310747: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.313001: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.314621: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.325016: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.327242: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.328846: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.384100: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-01 10:27:47.389266: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
469/469 [==============================] - 8s 3ms/step - loss: 0.3577 - sparse_categorical_accuracy: 0.9014 - val_loss: 0.1925 - val_sparse_categorical_accuracy: 0.9439
Epoch 2/6
469/469 [==============================] - 1s 2ms/step - loss: 0.1654 - sparse_categorical_accuracy: 0.9531 - val_loss: 0.1360 - val_sparse_categorical_accuracy: 0.9605
Epoch 3/6
469/469 [==============================] - 1s 2ms/step - loss: 0.1200 - sparse_categorical_accuracy: 0.9652 - val_loss: 0.1169 - val_sparse_categorical_accuracy: 0.9641
Epoch 4/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0926 - sparse_categorical_accuracy: 0.9730 - val_loss: 0.1007 - val_sparse_categorical_accuracy: 0.9688
Epoch 5/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0748 - sparse_categorical_accuracy: 0.9788 - val_loss: 0.0882 - val_sparse_categorical_accuracy: 0.9734
Epoch 6/6
469/469 [==============================] - 1s 2ms/step - loss: 0.0618 - sparse_categorical_accuracy: 0.9821 - val_loss: 0.0845 - val_sparse_categorical_accuracy: 0.9741

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira Thanks for the previous comment; I'm fine with using Docker now that I've learned about it. I'll get around to finding out what's wrong with my bare metal setup in time. It's working, which is all I cared about.

About that example though, two GPUs are shown. Are those both used by the process automatically? I'm guessing it's just debug/info output, but maybe there's more progress I missed.

@ianferreira

> @zrzrv5 Does PyTorch work yet?

Have not gotten the Docker image or bare metal to work with PyTorch. Same hipErrorNoBinaryForGpu error.

@ianferreira

> @ianferreira Thanks for the previous comment; I'm fine with using Docker now that I've learned about it. I'll get around to finding out what's wrong with my bare metal setup in time. It's working, which is all I cared about.
>
> About that example though, two GPUs are shown. Are those both used by the process automatically? I'm guessing it's just debug/info output, but maybe there's more progress I missed.

The script I used did not do multi-GPU; let me try and make sure RCCL works...

@ianferreira

> @ianferreira Thanks for the previous comment; I'm fine with using Docker now that I've learned about it. I'll get around to finding out what's wrong with my bare metal setup in time. It's working, which is all I cared about.
>
> About that example though, two GPUs are shown. Are those both used by the process automatically? I'm guessing it's just debug/info output, but maybe there's more progress I missed.

Just used MirroredStrategy, and it seems both GPUs are working...

strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
2021-11-01 11:08:49.398961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15868 MB memory: -> device: 0, name: AMD Radeon RX 6800 XT, pci bus id: 0000:23:00.0
2021-11-01 11:08:49.630293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 15868 MB memory: -> device: 1, name: AMD Radeon RX 6800 XT, pci bus id: 0000:4b:00.0
Number of devices: 2

Seems both are working... but not at 100%, given I don't have an "NVLINK" equivalent.
ian@ian-TRX40-AORUS-PRO-WIFI:~/Documents$ rocm-smi --showtopo

======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
GPU0 GPU1
GPU0 0 40
GPU1 40 0

============================ Hops between two GPUs =============================
GPU0 GPU1
GPU0 0 2
GPU1 2 0

========================== Link Type between two GPUs ==========================
GPU0 GPU1
GPU0 0 PCIE
GPU1 PCIE 0

================================== Numa Nodes ==================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 4294967295
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 4294967295
============================= End of ROCm SMI Log ==============================

ian@ian-TRX40-AORUS-PRO-WIFI:~/Documents$ rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 34.0c 28.0W 500Mhz 456Mhz 0% auto 255.0W 99% 75%
1 35.0c 40.0W 2445Mhz 96Mhz 0% auto 255.0W 98% 99%

============================= End of ROCm SMI Log ==============================
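For anyone who wants to reproduce the two-GPU run, the MirroredStrategy snippet above expands to roughly the following. This is a minimal sketch assuming a stock Keras MNIST-style setup; the model and data details are illustrative, not the exact script used in this thread:

```python
import tensorflow as tf

# Mirror variables across both cards; gradients are all-reduced between
# replicas each step (over PCIe here, since the topology above shows no
# NVLink-equivalent link between the GPUs).
strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
print("Number of devices:", strategy.num_replicas_in_sync)

# Model creation and compilation must happen inside the strategy scope
# so that variables are created as mirrored variables.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["sparse_categorical_accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, batch_size=128, epochs=1)
```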

@aoolmay

aoolmay commented Nov 1, 2021

@ianferreira Really appreciate your input; I, and probably quite a few other people, would have missed the Navi-enabled Docker image otherwise. Thanks, man!

@ROCmSupport

Hi All,
Thanks for reaching out.
ROCm + Navi brought up a lot of critical issues in different areas/layers of the stack, so things got delayed.
We plan to enable official ROCm support on a set of Navi cards as soon as possible, most likely with an early ROCm 5.x release, such as 5.0.
Please stay tuned for updates.
Thank you.
