
ROCm 5.xx ever planning to include gfx90c GPUs? #1743

Closed
shridharkini6 opened this issue May 23, 2022 · 30 comments
@shridharkini6

Hi,
The official Docker images for PyTorch and TensorFlow are available only for gfx900 (Vega10-type GPUs: MI25, Vega 56, Vega 64), gfx906 (Vega20-type GPUs: MI50, MI60), gfx908 (MI100), gfx90a (MI200), and gfx1030 (Navi21).

When is gfx90c support expected?
Thanks

@Bengt

Bengt commented May 27, 2022

Hi, @shridharkini6!

Thanks for your request. Since I am not an employee at AMD, I have no insight into what is planned there internally. However, at least some amount of library coverage seems to be a prerequisite for extending the Docker images to this class of GPUs, which are integrated into the CPU (an "APU" in AMD's lingo), and I do not see any support for gfx90c as a TARGET in any of the public libraries. See my pull request for an attempt at a complete overview of the state of library support. PyTorch uses RCCL and MIOpen to run on ROCm, and so does TensorFlow. MIOpen in turn uses rocBLAS as its backend. For the available TARGETs, see the CMakeLists.txt of rocBLAS and the CMakeLists.txt of RCCL, respectively. As you can see, there is no support for gfx90c, and in fact none for any other APU.

This aligns with what can be gathered from public sources, namely that AMD is focusing on the products which hyperscalers and supercomputer customers are currently buying. I personally think this is fair enough, as those customers seem to be rather feature-sensitive. Starting from those high-profile customers, consider the following leaky pipe of support:

  • Enterprise
    ("Instinct"-branded products intended for hyperscalers and supercomputer customers, usually sold in servers or racks)
  • Professional
    ("Radeon PRO"-branded products intended for CAD and such use cases, usually sold in workstations)
  • Desktop
    ("Radeon"-branded products intended for demanding users like gamers and video editors, sold as dGPU components or pre-built systems)
  • APUs
    ("Ryzen with Radeon Graphics"-branded products intended for lighter workloads like office PCs and thin/light laptops)

Things might change a bit with the Ryzen 7000 line of desktop processors, which are announced to include a chiplet-ish GPU in the IO die. Such an arrangement does not currently fit into this leaky support pipe, but I would also not hold my breath for any kind of revolution. My bet would be on support gradually improving, as it has (not without setbacks) in the past.
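To make the library-coverage point concrete, here is a minimal, illustrative Python sketch. The target set below only approximates rocBLAS's AMDGPU_TARGETS around the ROCm 5.x era (the CMakeLists.txt mentioned above is the authoritative list), and the helper name is mine:

```python
# Illustrative only: this set approximates rocBLAS's AMDGPU_TARGETS around
# ROCm 5.x; consult the project's CMakeLists.txt for the authoritative list.
ROCBLAS_TARGETS = {"gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030"}

def has_library_support(gfx_target: str) -> bool:
    """Check whether a gfx target appears in the (approximate) target set."""
    return gfx_target in ROCBLAS_TARGETS

print(has_library_support("gfx900"))  # True
print(has_library_support("gfx90c"))  # False: no APU target is listed
```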

@ffleader1

I do not think supporting an APU is AMD's top priority when even Navi 22 and Navi 23 are not supported. Also, AMD pulled the plug on supporting APUs a long time ago. So quite frankly, to answer your question, it is... never.

@AGenchev

AGenchev commented May 31, 2022

@ffleader1 That is not a very clever move by AMD, because they have nothing positioned against Nvidia's Jetson line of hardware. So we buy Nvidia APUs even though they are not very FOSS-friendly.

@langyuxf

langyuxf commented Jun 8, 2022

Here is a workaround to run PyTorch on gfx90c: build PyTorch for gfx900 and override gfx90c to gfx900 at runtime.

Build PyTorch:
$ git clone https://github.com/pytorch/pytorch.git  
$ cd pytorch  
$ git submodule update --init --recursive
$ sudo pip3 install -r requirements.txt
$ sudo pip3 install enum34 numpy pyyaml setuptools typing cffi future hypothesis typing_extensions
$ sudo python3 tools/amd_build/build_amd.py
$ sudo PYTORCH_ROCM_ARCH=gfx900 USE_ROCM=1 MAX_JOBS=4 python3 setup.py install

Run an example
$ git clone https://github.com/pytorch/examples.git
$ cd examples/mnist
$ sudo pip3 install -r requirements.txt
$ sudo HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py
...
Train Epoch: 14 [51200/60000 (85%)]     Loss: 0.027863
Train Epoch: 14 [51840/60000 (86%)]     Loss: 0.017484
Train Epoch: 14 [52480/60000 (87%)]     Loss: 0.021983
Train Epoch: 14 [53120/60000 (88%)]     Loss: 0.003217
Train Epoch: 14 [53760/60000 (90%)]     Loss: 0.011038
Train Epoch: 14 [54400/60000 (91%)]     Loss: 0.007962
Train Epoch: 14 [55040/60000 (92%)]     Loss: 0.018526
Train Epoch: 14 [55680/60000 (93%)]     Loss: 0.001039
Train Epoch: 14 [56320/60000 (94%)]     Loss: 0.017513
Train Epoch: 14 [56960/60000 (95%)]     Loss: 0.028949
Train Epoch: 14 [57600/60000 (96%)]     Loss: 0.028286
Train Epoch: 14 [58240/60000 (97%)]     Loss: 0.064388
Train Epoch: 14 [58880/60000 (98%)]     Loss: 0.002042
Train Epoch: 14 [59520/60000 (99%)]     Loss: 0.002829

Test set: Average loss: 0.0280, Accuracy: 9921/10000 (99%)

Notes:
1. Disable some power features for gfx90c:
sudo modprobe amdgpu ppfeaturemask=0xfff73fff
2. ROCm: https://docs.amd.com/bundle/ROCm-Downloads-Guide-v5.0/page/ROCm_Installation.html
3. PyTorch: branch master, commit 815532d40c25e81d8c09b3c36403016bea394aee
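For context on why HSA_OVERRIDE_GFX_VERSION=9.0.0 maps to gfx900: the variable encodes a target as major.minor.stepping, and the gfx names pack the same three numbers, with the last two characters read as single hex digits. A small Python sketch under that common reading of the naming scheme (the helper names are mine):

```python
def parse_gfx_target(name: str) -> tuple:
    """Split a name like 'gfx90c' into (major, minor, stepping).

    The last two characters are single hex digits (minor and stepping);
    whatever sits between 'gfx' and them is the decimal major version.
    """
    digits = name.removeprefix("gfx")
    return int(digits[:-2]), int(digits[-2], 16), int(digits[-1], 16)

def override_string(major: int, minor: int, stepping: int) -> str:
    """Format a target triple as an HSA_OVERRIDE_GFX_VERSION value."""
    return f"{major}.{minor}.{stepping}"

print(parse_gfx_target("gfx90c"))    # (9, 0, 12)
print(override_string(9, 0, 0))      # '9.0.0'  -> report the GPU as gfx900
print(override_string(10, 3, 0))     # '10.3.0' -> report the GPU as gfx1030
```

So overriding to 9.0.0 makes the runtime treat the gfx90c APU as the supported gfx900 target, which works here because the two share an ISA.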

@langyuxf

langyuxf commented Jun 9, 2022

You can also use the PyTorch Docker image on gfx90c. Just run it like this. @shridharkini6

$ git clone https://github.com/pytorch/examples.git
$ cd examples/mnist
$ pip3 install -r requirements.txt
$ HSA_OVERRIDE_GFX_VERSION=9.0.0 python3 main.py

Note:
Your video memory should be at least 2GB.

@ffleader1

Maybe you have not tried it, but do you think your method will work with unsupported GPUs, like gfx1031 for example?

@langyuxf

langyuxf commented Jun 9, 2022

You may try; run it like this:

$ HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 main.py

@ffleader1

ffleader1 commented Jun 9, 2022

Wait, I am a bit confused. Maybe I am missing something, but your example is about running a PyTorch example, right? How do you get ROCm to install on gfx90c or gfx1031 in the first place?
Thank you.

@langyuxf

langyuxf commented Jun 9, 2022

1. Docker with PyTorch and ROCm installed: https://docs.amd.com/bundle/AMD-Deep-Learning-Guide-v5.1.3/page/Deep_Learning_Frameworks.html
2. ROCm installation guide: https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.0/page/Overview_of_ROCm_Installation_Methods.html

@ffleader1

I have not tried Docker, but for ROCm I am pretty sure the install will only succeed if your GPU is supported, i.e. the ROCm installation will not work on a gfx1031 or lower.

@shridharkini6
Author

@xfyucg I followed your method, but it looks to me like training is using only the CPU, not the GPU.

import torch
t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')

throws an error like

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Thanks

@langyuxf

langyuxf commented Jun 13, 2022

Run it like this; it works well on my Cezanne platform:

lang@lang-test:~/Videos/pytorch$ HSA_OVERRIDE_GFX_VERSION=9.0.0 python3
Python 3.8.10 (default, Mar 15 2022, 12:22:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> t = torch.tensor([5, 5, 5], dtype=torch.int64, device='cuda')
>>> 

@shridharkini6
Author

I tried this as well and ended up with the same error.

@langyuxf

@shridharkini6 Can you paste the output of $ rocminfo here?

@shridharkini6
Author

Here is the rocminfo output:

ROCk module is loaded

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents


Agent 1


Name: AMD Ryzen 7 4700U with Radeon Graphics
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 4700U with Radeon Graphics
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2000
BDFID: 0
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 7612028(0x74267c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 7612028(0x74267c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 7612028(0x74267c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:


Agent 2


Name: gfx90c
Uuid: GPU-XX
Marketing Name:
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 1024(0x400) KB
Chip ID: 5686(0x1636)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1600
BDFID: 1024
Internal Node ID: 1
Compute Unit: 7
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90c:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
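As an aside, the ISA line in output like the above is the easiest place to read the target off programmatically, e.g. to decide which HSA_OVERRIDE_GFX_VERSION value to try. A hypothetical helper (names mine), assuming the rocminfo output has been captured as text:

```python
import re

def gfx_targets(rocminfo_text: str) -> list:
    """Extract gfx targets from rocminfo ISA lines such as
    'Name: amdgcn-amd-amdhsa--gfx90c:xnack-'."""
    return re.findall(r"amdgcn-amd-amdhsa--(gfx[0-9a-f]+)", rocminfo_text)

sample = "Name: amdgcn-amd-amdhsa--gfx90c:xnack-"
print(gfx_targets(sample))  # ['gfx90c']
```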

@langyuxf

@shridharkini6 Are you using Docker? If yes, try starting your container like this:

sudo docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest

@shridharkini6
Author

I have tried the same; I used the rocm/pytorch:latest-base Docker image.

@langyuxf

langyuxf commented Jun 15, 2022

According to https://docs.amd.com/bundle/AMD-Deep-Learning-Guide-v5.1.3/page/Deep_Learning_Frameworks.html, Option 3 (Install PyTorch Using PyTorch ROCm Base Docker Image):

docker pull rocm/pytorch:latest-base

NOTE: This downloads the base container, which does not contain PyTorch.

So please use rocm/pytorch:latest:

docker pull rocm/pytorch:latest

docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest

sudo modprobe amdgpu ppfeaturemask=0xfff73fff

HSA_OVERRIDE_GFX_VERSION=9.0.0 python3

@ffleader1

ffleader1 commented Jun 15, 2022

His hardware is not supported, and neither is yours, I think. APUs in general do not work. Docker won't change unsatisfied hardware prerequisites.

@langyuxf

No, gfx90c uses the same ISA as gfx900, so for gfx90c you just override it to gfx900; that actually works.
He used rocm/pytorch:latest-base, so he had to build PyTorch for ROCm himself.

@shridharkini6
Author

@xfyucg I have followed all the procedures you suggested, i.e. used rocm/pytorch:latest-base and compiled PyTorch from source, but I get the same error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 216, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

@langyuxf

Maybe some environment issue; that is hard to debug, and building PyTorch yourself is error-prone.
Why not use rocm/pytorch:latest? It is simpler and also the recommended way.

@shridharkini6
Author

@xfyucg Yes, I tried with rocm/pytorch:latest as well; it throws similar errors. I suspect it could be an issue with the base libraries, as @Bengt mentioned.

@langyuxf

No. If you install and start the Docker container (rocm/pytorch:latest) correctly, you will get an error like the following:

root@0f962c3a9d38:/var/lib/jenkins# python3
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)
root@0f962c3a9d38:~#

After overriding gfx90c to gfx900:

root@0f962c3a9d38:/var/lib/jenkins# HSA_OVERRIDE_GFX_VERSION=9.0.0 python3
Python 3.7.13 (default, Mar 29 2022, 02:18:16)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

@langyuxf

langyuxf commented Jun 22, 2022

Make sure the amdgpu kernel-mode driver is installed. If you use a generic kernel on Ubuntu 20.04, install the amdgpu kernel-mode driver as follows:

sudo apt-get update
wget https://repo.radeon.com/amdgpu-install/22.10.3/ubuntu/focal/amdgpu-install_22.10.3.50103-1_all.deb 
sudo apt-get install ./amdgpu-install_22.10.3.50103-1_all.deb

amdgpu-install --usecase=dkms

@lucasew

lucasew commented Dec 28, 2022

Try updating your system's kernel to a version newer than 6.0 and run the commands with the following environment variable set:

HSA_OVERRIDE_GFX_VERSION=9.0.0

You can use export HSA_OVERRIDE_GFX_VERSION=9.0.0 in the shell you are running the commands in to propagate the environment variable to child processes. That is what allowed the rocm/pytorch container to not crash on import or on simple tensor operations like torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')).

I tested this on NixOS, branch 22.11, kernel 6.0.13 and latest rocm/pytorch container with a Ryzen 5600G.
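The same override can also be applied from inside a Python script instead of the shell, as long as it happens before torch is imported (the runtime appears to read the variable once at initialization, so setting it later has no effect). A minimal sketch:

```python
import os

# Equivalent to `export HSA_OVERRIDE_GFX_VERSION=9.0.0` for this one process.
# Set it before importing torch; the HSA runtime reads it during startup.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "9.0.0")

# import torch                      # only after the override is in place
# torch.tensor([[1, 2], [3, 4]]).to(torch.device('cuda'))
print(os.environ["HSA_OVERRIDE_GFX_VERSION"])
```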

@jithunnair-amd
Contributor

CC @hongxiayang

@abhimeda
Collaborator

@shridharkini6 Hi, is your issue resolved on the latest ROCm? If so, can we close this ticket?

@nixrunner

Is this still applicable to the latest ROCm?

@ppanchad-amd

@shridharkini6 Unfortunately your APU (gfx90c) is not currently supported in the latest ROCm. Thanks!

@ppanchad-amd closed this as not planned on Jun 17, 2024.