
Q: What is status of ROCm on RX 5500 XT? #1306

Closed
Doev opened this issue Nov 26, 2020 · 89 comments

Comments

@Doev

Doev commented Nov 26, 2020

Hello,

I'd like to evaluate whether ROCm is suitable for deep learning. What about the RX 5500 XT? Is it possible to use that cheap GPU for this purpose?

@baryluk

baryluk commented Nov 27, 2020

Duplicate of #887

@ROCmSupport

Hi @Doev,
Thanks for reaching out.
We do not officially support the Navi series of cards with ROCm,
but you can still give it a try; some things might work.

@xuhuisheng
Contributor

@ROCmSupport
Since rocm-libs does not include gfx1010/gfx1012 in AMDGPU_TARGETS, you will always hit guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!").

If you want to try Navi 10:

  1. OpenCL may work.
  2. Recompile the rocm-libs libraries (rocBLAS, rocSPARSE, …) with AMDGPU_TARGETS=gfx1010;gfx1011;gfx1012; you may run into some compilation issues.
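A minimal sketch of what option 2 involves: the same AMDGPU_TARGETS value is passed to CMake when configuring each rocm-libs component. This is a dry run that only prints the configure commands; the component list, paths, and flag set here are illustrative assumptions, not a verified ROCm build recipe, so adjust them per release and drop the echo to actually configure from a build directory.

```shell
# Dry-run sketch (assumed layout): print the CMake configure command
# for each rocm-libs component with the Navi targets baked in.
TARGETS="gfx1010;gfx1011;gfx1012"
for lib in rocBLAS rocSPARSE rocPRIM; do
  echo cmake -DAMDGPU_TARGETS="$TARGETS" \
       -DCMAKE_INSTALL_PREFIX=/opt/rocm "../$lib"
done
```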

@ROCmSupport

Agreed, @xuhuisheng.
Some things might work and some definitely will not.

@unexploredtest

unexploredtest commented Nov 27, 2020

I don't know about the RX 5500 XT, but with my RX 5500M (Navi 14) I hit this issue on version 3.9 when trying to use TensorFlow:
#1269
I should note that OpenCL works fine.
Previous versions don't work for me at all (OpenCL doesn't install properly for some reason).
EDIT: On Ubuntu 20.04.1
EDIT 2: I could also install OpenCL on Manjaro, but hit compilation errors during the rocm-libs installation (probably because of my 5.9 kernel)

@da-phil

da-phil commented Nov 27, 2020

Agreed, @xuhuisheng.
Some things might work and some definitely will not.

And which things definitely won't work?
It would be really great to know which bits and pieces of Navi support are still missing from ROCm. Then the open-source community would at least know what is missing and what they could contribute.

@xuhuisheng
Contributor

@da-phil
Frameworks like TensorFlow/PyTorch that sit on top of rocm-libs definitely won't work, unless you recompile rocm-libs with AMDGPU_TARGETS=gfx1010 and fix the issues that come up during compilation.

OpenCL, which doesn't depend on rocm-libs, may work.

@da-phil

da-phil commented Nov 27, 2020

Frameworks like TensorFlow/PyTorch that sit on top of rocm-libs definitely won't work, unless you recompile rocm-libs with AMDGPU_TARGETS=gfx1010 and fix the issues that come up during compilation.

OpenCL, which doesn't depend on rocm-libs, may work.

Yup, OpenCL works well; that's not the issue. The issue is the deep-learning frameworks, which currently don't work.
Did you get TensorFlow or PyTorch running with a Navi-based GPU by "just" recompiling some rocm-libs?
I read in another issue that you were considering getting a Navi GPU once prices drop, but you don't own one yet, right?

@xuhuisheng
Contributor

xuhuisheng commented Nov 27, 2020

@da-phil
Yes, I don't have a Navi 10 GPU at the moment. From what I read on AMD's website, Navi 10 is no better than Vega 10 for computation, so it most likely won't get official support.

Following Rigtorp's steps, I'm confident ROCm 3.10.x can compile successfully for gfx1010; only rocSPARSE has some DPP-broadcast issues, but I think we can follow the way rocPRIM resolved them.
PyTorch was confirmed to compile successfully for gfx1010 back in September, but nobody went further, so in my opinion there may be other issues.

I can share some build scripts for gfx1010 after ROCm 3.10 is released. Anyone who is interested could give them a try.

@da-phil

da-phil commented Nov 27, 2020

Following Rigtorp's steps, I'm confident ROCm 3.10.x can compile successfully for gfx1010; only rocSPARSE has some DPP-broadcast issues, but I think we can follow the way rocPRIM resolved them.
PyTorch was confirmed to compile successfully for gfx1010 back in September, but nobody went further, so in my opinion there may be other issues.

Yeah, just compiling ROCm for gfx1010 is not enough; it would also be great if it actually worked 😅

I can share some build scripts for gfx1010 after ROCm 3.10 is released. Anyone who is interested could give them a try.

This would be awesome, then I'd give it a try too 😄

@Doev
Author

Doev commented Nov 27, 2020

Thanks for the many replies. I'll give it a try: I've ordered an RX 5500 XT, since €200 is not that much, and if everything fails I can sell the card.

I don't understand why the Navi architecture is so important; I thought the RX 5500 XT was from the Vega series. Well, I've never owned an AMD GPU before.

@livegenic-akyrychek

@Doev try PlaidML; it might not give you a huge boost, but there may be some improvement (if it works on the RX 5500 XT).
I haven't tested it yet, as I only recently found out about it myself.

@unexploredtest

PlaidML only works with Keras; you'd need nGraph (with the PlaidML backend) alongside it, but both are outdated, though they say they're working on a new release.


@unexploredtest

unexploredtest commented Dec 1, 2020

I can share some build scripts for gfx1010 after ROCm 3.10 is released. Anyone who is interested could give them a try.

When can we expect it?

@xuhuisheng
Contributor

@da-phil @aliPMPAINT
I wrote a doc for navi10.
https://github.com/xuhuisheng/rocm-build/tree/navi10/navi10/README.md

@livegenic-akyrychek

@xuhuisheng maybe duplicate it in #887?

@unexploredtest

unexploredtest commented Dec 2, 2020

@xuhuisheng Thanks. I'll test it on Friday. Just one question: with the RX 5500M (gfx1012), should I replace AMDGPU_TARGETS="gfx1010" with AMDGPU_TARGETS="gfx1012", or leave it unchanged?
Also, with only 16 GB of RAM, how much swap is recommended?

@xuhuisheng
Contributor

@xuhuisheng Thanks. I'll test it on Friday. Just one question: with the RX 5500M (gfx1012), should I replace AMDGPU_TARGETS="gfx1010" with AMDGPU_TARGETS="gfx1012", or leave it unchanged?
Also, with only 16 GB of RAM, how much swap is recommended?

You must use gfx1012, or it will throw hipErrorNoBinaryForGpu. You can get the GPU arch name from rocminfo.
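For reference, the arch string can be pulled out of the rocminfo output with a grep. The here-doc below is a stand-in sample (the exact field layout may differ slightly between ROCm versions); on a real machine, pipe rocminfo itself through the same grep.

```shell
# Extract the gfx target name. On real hardware:
#   rocminfo | grep -E -o 'gfx[0-9a-f]+' | head -n1
# Here we parse a sample snippet instead (assumed output format).
sample_rocminfo() {
cat <<'EOF'
  Name:                    gfx1012
  Marketing Name:          AMD Radeon RX 5500M
EOF
}
arch=$(sample_rocminfo | grep -E -o 'gfx[0-9a-f]+' | head -n1)
echo "$arch"   # prints: gfx1012
```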

@xuhuisheng
Contributor

@xuhuisheng maybe duplicate it in #887?

Recompiling ROCm is too complicated, and I can't be sure it works, so I don't want to show it to people who aren't really interested.

@unexploredtest

@xuhuisheng Thanks for the response; I'll report back by next Saturday. Much appreciated.
One last question: I've already installed ROCm 3.9.1. Is sudo apt autoremove rocm-opencl rocm-dkms rocm-dev rocm-utils enough for a complete uninstallation?

@ROCmSupport

Hi @aliPMPAINT
You can uninstall ROCm with the above command.
If anything is still left under /opt/rocm-xxx after that, check for remaining packages with sudo dpkg -l | grep hsa, then repeat with hsa replaced by hip, llvm, rocm, and rock.
Remove all the packages that show up, one by one, with sudo apt purge.
Hope it helps.
Thank you.
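The check-and-purge procedure above can be sketched as a loop over the package-name prefixes from the comment. This version is a dry run that only prints the purge commands; whether these prefixes cover every ROCm package on a given system is an assumption, so review the output before removing the echo.

```shell
# Dry run: list installed packages matching each ROCm-related prefix
# and print the purge command for each; remove `echo` to actually purge.
for prefix in hsa hip llvm rocm rock; do
  dpkg -l 2>/dev/null | awk '$1 == "ii" {print $2}' | grep -i -- "$prefix" |
  while read -r pkg; do
    echo sudo apt purge -y "$pkg"
  done
done
```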

@unexploredtest

unexploredtest commented Dec 4, 2020

@xuhuisheng OK, so I tried your guide. Unfortunately it didn't work out; I hit either guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!") or Segmentation fault (core dumped) when importing/using PyTorch.
I didn't encounter any compilation errors, though. Here is more info about my device and environment:
I have a Bravo 17 laptop with a Radeon RX 5500M GPU (Navi 14, gfx1012) and a Ryzen 7 4800H.
On Ubuntu 20.04.1 with the 5.6.0-1033-oem kernel (couldn't test 5.4.x because kernels <5.5 are incompatible with my hardware).
I've tried both Python 3.7 and 3.8, used AMDGPU_TARGETS="gfx1012", and tried both export PYTORCH_ROCM_ARCH=gfx1010 and export PYTORCH_ROCM_ARCH=gfx1012.
Will share more info later on.
EDIT: WAIT

@unexploredtest

unexploredtest commented Dec 4, 2020

So, I tested 5.4.0-56. Because 5.4.0-56 isn't compatible with my hardware, I had to press Ctrl+Alt+F2 to log in (I can't get through boot).
When I ran rocminfo I got:

ROCk module is loaded
Unable to open /dev/kfd read-write: Cannot allocate memory
alipmpaint is member of render group
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

And clinfo:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback 
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

But the devices do get recognized on 5.6; I'll attach the files.
clinfo.txt
rocminfo.txt

@xuhuisheng
Contributor

xuhuisheng commented Dec 5, 2020

@aliPMPAINT
There are some differences between gfx1010 and gfx1012, so I uploaded a navi14 directory for gfx1012:
https://github.com/xuhuisheng/rocm-build/tree/develop/navi14
But it shouldn't throw hipErrorNoBinaryForGpu unless some other component is missing the gfx1012 target.
I will look for tests to check each component.

And if rocminfo cannot recognize gfx1012, it means the kernel driver, thunk interface, or HSA runtime cannot support the target device. I don't know how to debug at that level yet, so I suggest using a version where rocminfo runs normally.
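A couple of generic checks for that bottom layer (kernel driver and device permissions) before suspecting user space; these are not an official ROCm diagnostic, just common prerequisites:

```shell
# Sanity-check the kernel-side prerequisites (assumption: a working
# setup exposes /dev/kfd and the user belongs to the render group).
if [ -e /dev/kfd ]; then
  echo "/dev/kfd present:"
  ls -l /dev/kfd
else
  echo "/dev/kfd missing: amdgpu/KFD kernel driver not active for this GPU"
fi

if id -nG | grep -qw render; then
  echo "user is in the render group"
else
  echo "user is NOT in the render group (add with: sudo usermod -aG render \$USER)"
fi
```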

@unexploredtest

@xuhuisheng
Thanks! I'll check this one out too.
rocminfo does recognize gfx1012, just on 5.6 rather than 5.4, so it's no big deal.

@xuhuisheng
Contributor

Uploaded some code for checking rocm-libs:
https://github.com/xuhuisheng/rocm-build/tree/develop/check

@unexploredtest

@xuhuisheng
I'll test it within a week, thank you.

@xuhuisheng
Contributor

xuhuisheng commented Dec 23, 2020

@vdrhtc
Good job! Could you try these tests? Hopefully both of them pass.

If you have time, please try compiling PyTorch for gfx1012. 😄

BTW, hipSPARSE is an abstraction layer over ROCm and CUDA, so it may not need to be compiled for gfx1012. I will try to find out where /opt/rocm-3.10. comes from. And I don't suggest using the latest develop branch, as it may contain unstable functions.

@unexploredtest

@vdrhtc Yeah, I had the same issue. hipSPARSE won't work.

@vdrhtc

vdrhtc commented Dec 25, 2020

@xuhuisheng
The tests ran OK:
run_hip.log
run_rocblas.log

I am also able to train a small feed-forward network with PyTorch on the cuda:0 device, so I guess everything is all right...

@xuhuisheng
Contributor

@vdrhtc
It sounds like ROCm 4.0 could support Navi 14.
Thank you very much for verifying this.

@unexploredtest

unexploredtest commented Dec 25, 2020

@vdrhtc Yeah?

Trying to get around the problem, I cloned the most recent version of hipSPARSE, which wouldn't compile either. However, when I finally tried the code from the latest release, downloaded as an archive, everything worked.

Could you provide the link? I also want to test it.
EDIT: this?
https://github.com/ROCmSoftwarePlatform/hipSPARSE/releases/tag/rocm-4.0.0

@vdrhtc

vdrhtc commented Dec 26, 2020

@aliPMPAINT Yes, your edit is correct!

@Spacefish

Spacefish commented Dec 27, 2020

Tried it on my 5700 XT (Navi 10). IT WORKS!!! 👍🏻 :)
Thank you AMD devs! Nice Christmas present!
It's significantly faster than the CPU as well.
I made a video: https://www.youtube.com/watch?v=-iYwbnvV2w0

Edit: On further inspection, it does not really work. If you look at the loss, it does not improve. So whatever it computes does not seem to be right / no weights are changed :(

@unexploredtest

Oh no... Well, it seems we should wait until ROCm adds official support for the Navi series, hopefully.

@xuhuisheng
Contributor

@Spacefish
Thanks for the feedback. Right now I have no idea what might cause the loss not to change.
I'll come back if I get any more information.

@qyb

qyb commented Dec 28, 2020

Tried it on my 5700 XT (Navi 10). IT WORKS!!! 👍🏻 :)
Thank you AMD devs! Nice Christmas present!
It's significantly faster than the CPU as well.
I made a video: https://www.youtube.com/watch?v=-iYwbnvV2w0

Edit: On further inspection, it does not really work. If you look at the loss, it does not improve. So whatever it computes does not seem to be right / no weights are changed :(

How did you install torchvision?
I built PyTorch as described in https://github.com/xuhuisheng/rocm-build/tree/develop/navi14
then ran check/test_pytorch.py:
"True ... Navi 14 [Radeon RX 5500/5500M / Pro 5500M]"
pip3 install --no-dependencies torchvision
In the end I got the same result as you.

@xuhuisheng
Contributor

xuhuisheng commented Dec 28, 2020

@qyb
Could you rebuild torchvision and test again?
The torchvision build is fast:

git clone https://github.com/pytorch/vision
cd vision
git checkout v0.8.2
python3 setup.py bdist_wheel
pip3 install dist/torchvision-0.8.0a0+2f40a48-cp38-cp38-linux_x86_64.whl

I remember that vision used HIP to compile some C++ sources, but it didn't report hipErrorNoBinaryForGpu, so I'm afraid that isn't the issue.

UPDATE: I tried pytorch-1.7.1 (with gfx803) and torchvision-0.8.2 (from PyPI), and the MNIST loss computes properly. So torchvision shouldn't be the problem.

@qyb

qyb commented Dec 28, 2020

If I build torchvision from source, the example main.py throws hipErrorNoBinaryForGpu and crashes, even with the --no-cuda argument.

Now I've switched back to the PyPI torchvision.
I noticed the MNIST example script reports:
MIOpen(HIP): Warning [ParseAndLoadDb] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx1012_11.HIP.fdb.txt

@xuhuisheng
Contributor

xuhuisheng commented Dec 28, 2020

@qyb
Could you help me run a test script on gfx1012?
https://github.com/xuhuisheng/rocm-build/blob/develop/check/test-pytorch-fc.py

It's a net with one fully connected layer, taken from https://d2l.ai/.
I'm trying to find out whether the problem is in PyTorch or in MIOpen.
On my gfx803 it shows the following:

work@2f7125ec29dd:~/test$ python3 test-pytorch.py 
Sequential(
  (0): Linear(in_features=2, out_features=1, bias=True)
)
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0
)
epoch 1, loss: 1.165843
epoch 2, loss: 0.022983
epoch 3, loss: 0.000715
epoch 4, loss: 0.000028
epoch 5, loss: 0.000073
epoch 6, loss: 0.000128
epoch 7, loss: 0.000118
epoch 8, loss: 0.000051
epoch 9, loss: 0.000141
epoch 10, loss: 0.000065
[2, -3.4] Parameter containing:
tensor([[ 2.0002, -3.4005]], device='cuda:0', requires_grad=True)
4.2 Parameter containing:
tensor([4.2001], device='cuda:0', requires_grad=True)

@qyb

qyb commented Dec 28, 2020

Could you help me run a test script on gfx1012?
https://github.com/xuhuisheng/rocm-build/blob/develop/check/test-pytorch-fc.py

Here is my output on gfx1012:

$ python3 test-pytorch-fc.py 
Sequential(
  (0): Linear(in_features=2, out_features=1, bias=True)
)
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0
)
epoch 1, loss: 0.985102
epoch 2, loss: 0.007260
epoch 3, loss: 0.000400
epoch 4, loss: 0.000043
epoch 5, loss: 0.000081
epoch 6, loss: 0.000161
epoch 7, loss: 0.000140
epoch 8, loss: 0.000208
epoch 9, loss: 0.000073
epoch 10, loss: 0.000093
[2, -3.4] Parameter containing:
tensor([[ 2.0006, -3.3998]], device='cuda:0', requires_grad=True)
4.2 Parameter containing:
tensor([4.2000], device='cuda:0', requires_grad=True)

@xuhuisheng
Contributor

@qyb
OK, so fully-connected-layer training seems to work properly,
but the convolution or pooling layers used by MNIST may fail.

Guess it's a MIOpen issue. 😢

@vdrhtc

vdrhtc commented Dec 28, 2020

@xuhuisheng

I ran the tests in the MIOpen repo; some of them fail, please see the attached log.
cmake_build_check.log
But I am not 100% sure I built them correctly -- I only sourced env.sh from the rocm-build directory before compiling.

@co-manifold

Any updates on 5500 XT?

@xuhuisheng
Contributor

@interpharaohmetric
Most likely, MIOpen cannot support gfx10 for convolution layers, so MNIST failed.
The good news is that AMD said gfx10 will get official support in 2021.

@nvmnghia

nvmnghia commented Jun 1, 2021

I'm too lazy to read all of this, but are you telling me that some old GPUs are (partially) supported while the new ones are not, except for the new and pricey ones?

@timgws

timgws commented Jun 21, 2021

@nvmnghia

I can not comment on exact timelines as of today, but, roughly, will be available in next 2 to 4 months.

#887 (comment)

@kvnptl

kvnptl commented Jun 6, 2022

Still waiting for RX 5600 (Navi 10, RDNA 1) support ;(

@AnriEvs

AnriEvs commented Sep 29, 2023

Any news on compatibility for this GPU?

@serhii-nakon

Here is a Docker image for the RX/W5500(M): https://hub.docker.com/r/serhiin/rocm_gfx1012_pytorch
