OpenCL stopped working "no devices" when upgrading to ROCm 3.0 from ROCm 2.10 #977

Open
preda opened this issue Dec 20, 2019 · 46 comments

@preda preda commented Dec 20, 2019

I upgraded from ROCm 2.10 to ROCm 3.0, and OpenCL stopped working: it now reports 0 devices.
There are no errors in dmesg.

Kernel: Linux 5.4.5 and 5.5.0-rc2 (same behavior on both). GPU: Radeon VII.
rocm-smi correctly reports all the GPUs, so the hardware appears to be detected and initialized correctly:

~/rocm-opencl$ ~/ROC-smi/rocm-smi 
GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%  
0    33.0c  27.0W   809Mhz  351Mhz  20.0%   auto  250.0W    0%   0%    
[etc]

But both /usr/bin/clinfo and /opt/rocm/opencl/bin/x86_64/clinfo report no devices:

/opt/rocm/opencl/bin/x86_64/clinfo 
Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 AMD-APP (3052.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 


  Platform Name:				 AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)

/usr/bin/clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3052.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform
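
Both clinfo variants above signal the same condition in different ways: one prints "Number of devices 0", the other prints "ERROR: clGetDeviceIDs(-1)" (-1 is CL_DEVICE_NOT_FOUND). A minimal sketch for checking a captured clinfo report in a script follows. The function name `count_opencl_devices` is hypothetical, and the exact output layout of either tool is assumed from the samples in this thread only.

```python
import re


def count_opencl_devices(clinfo_text):
    """Return the total device count found in a clinfo report, or None if no count is present."""
    total = None
    for line in clinfo_text.splitlines():
        # Matches "Number of devices   N"; a colon after "devices" is tolerated
        # (assumption: some clinfo builds print one).
        match = re.match(r"\s*Number of devices:?\s+(\d+)\s*$", line)
        if match:
            total = (total or 0) + int(match.group(1))
        elif "ERROR: clGetDeviceIDs(-1)" in line:
            # -1 is CL_DEVICE_NOT_FOUND: the platform exists but exposes no devices.
            total = total or 0
    return total
```

A result of 0 corresponds to the failure reported here; `None` means the text contained no device count at all, for example when no platform was listed.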

The ROCm packages that I have installed:

hsa-rocr-dev/Ubuntu 16.04,now 1.1.9.0-rocm-rel-3.0-6-7128d0d amd64 [installed,automatic]
hsakmt-roct/Ubuntu 16.04,now 1.0.9-298-gea01eb3 amd64 [installed,automatic]
rocm-opencl/Ubuntu 16.04,now 2.0.0-rocm-rel-3.0-6-9a4afec amd64 [installed]

@preda preda commented Dec 20, 2019

Moving back to ROCm 2.10 (from 3.0) restores working OpenCL. These packages are installed:

hsa-ext-rocr-dev hsa-rocr-dev hsakmt-roct hsakmt-roct-dev rocm-opencl
hsa-ext-rocr-dev/Ubuntu 16.04,now 1.1.9-139-g0d1ca36 amd64 [installed,automatic]
hsa-rocr-dev/Ubuntu 16.04,now 1.1.9-139-g0d1ca36 amd64 [installed,automatic]
hsakmt-roct-dev/Ubuntu 16.04,now 1.0.9-245-gc0e4b8d amd64 [installed,automatic]
hsakmt-roct/Ubuntu 16.04,now 1.0.9-245-gc0e4b8d amd64 [installed,automatic]
rocm-opencl/Ubuntu 16.04,now 1.2.0-rocm-rel-2.10-14-31325c4 amd64 [installed]

@preda preda commented Dec 20, 2019

Moving forward to ROCm 3.0 again, with the same set of packages installed, OpenCL finds no devices:

hsa-ext-rocr-dev/Ubuntu 16.04,now 1.1.9.0-rocm-rel-3.0-6-7128d0d amd64 [installed,auto-removable]
hsa-rocr-dev/Ubuntu 16.04,now 1.1.9.0-rocm-rel-3.0-6-7128d0d amd64 [installed,automatic]
hsakmt-roct-dev/Ubuntu 16.04,now 1.0.9-298-gea01eb3 amd64 [installed,auto-removable]
hsakmt-roct/Ubuntu 16.04,now 1.0.9-298-gea01eb3 amd64 [installed,automatic]
rocm-opencl/Ubuntu 16.04,now 2.0.0-rocm-rel-3.0-6-9a4afec amd64 [installed]
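
In the 3.0 listing above, two packages are flagged auto-removable, which apt uses for automatically installed packages that no installed package depends on any more. A small sketch for spotting that in a saved listing follows. The helper names are hypothetical, and the line format is assumed to match the listings in this thread.

```python
import re

# One line of the listing: name/origin,now version arch [flags]
LINE = re.compile(
    r"^(?P<name>[^/\s]+)/[^,]*,now (?P<version>\S+) \S+ \[(?P<flags>[^\]]*)\]$"
)


def parse_apt_list(text):
    """Map package name -> (version, [flags]) for every line that matches."""
    packages = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            flags = match.group("flags").split(",")
            packages[match.group("name")] = (match.group("version"), flags)
    return packages


def auto_removable(packages):
    """Return the sorted names of packages flagged auto-removable."""
    return sorted(
        name for name, (_, flags) in packages.items() if "auto-removable" in flags
    )
```

Applied to the five lines above, `auto_removable` returns `hsa-ext-rocr-dev` and `hsakmt-roct-dev`, the two packages the 3.0 packages no longer pull in.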

@btspce btspce commented Dec 20, 2019

Same on a Raven Ridge 2700U APU upgraded from 2.10 to 3.0.
Fedora 31 kernel 5.3.16-300.fc31.x86_64

$ clinfo
Number of platforms 0

$ rocminfo

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (number of timestamp)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents


Agent 1


Name: AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32KB
Chip ID: 5597
Cacheline Size: 64
Max Clock Frequency (MHz):2200
BDFID: 1024
Compute Unit: 8
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 33554048KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A


Agent 2


Name: gfx902
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128
Queue Min Size: 4096
Queue Max Size: 131072
Queue Type: MULTI
Node: 0
Device Type: GPU
Cache Info:
L1: 16KB
Chip ID: 5597
Cacheline Size: 64
Max Clock Frequency (MHz):1300
BDFID: 1024
Compute Unit: 11
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64
Workgroup Max Size: 1024
Workgroup Max Size Per Dimension:
Dim[0]: 67109888
Dim[1]: 67109888
Dim[2]: 0
Grid Max Size: 4294967295
Waves Per CU: 160
Max Work-item Per CU: 10240
Grid Max Size per Dimension:
Dim[0]: 4294967295
Dim[1]: 4294967295
Dim[2]: 4294967295
Max number Of fbarriers Per Workgroup:32
Pool Info:
Pool 1
Segment: GROUP
Size: 64KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx902+xnack
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Dimension:
Dim[0]: 67109888
Dim[1]: 1024
Dim[2]: 16777217
Workgroup Max Size: 1024
Grid Max Dimension:
x 4294967295
y 4294967295
z 4294967295
Grid Max Size: 4294967295
FBarrier Max Size: 32
*** Done ***

@btspce btspce commented Dec 20, 2019

After replacing Fedora's hsakmt package with the one provided by ROCm, I now have these packages installed:
install rocm-utils rocm-opencl-devel rocminfo hsa-ext-rocr-dev
Packages Altered:
Install hsa-ext-rocr-dev-1.1.9.0_rocm_rel_3.0_6_7128d0dc-1.x86_64 @ROCm
Install hsa-rocr-dev-1.1.9.0_rocm_rel_3.0_6_7128d0dc-1.x86_64 @ROCm
Install hsakmt-roct-1.0.9_298_gea01eb3-1.x86_64 @ROCm
Install rocm-clang-ocl-0.5.0.47_rocm_rel_3.0_6_cfddddb-1.x86_64 @ROCm
Install rocm-opencl-2.0.0-rocm_rel_3.0_6_9a4afec13.x86_64 @ROCm
Install rocm-opencl-devel-2.0.0-rocm_rel_3.0_6_9a4afec13.x86_64 @ROCm
Install rocm-utils-3.0.6-1.x86_64 @ROCm
Install rocminfo-1.0.0-1.x86_64 @ROCm

And clinfo

$ clinfo
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3052.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD

Platform Name AMD Accelerated Parallel Processing
Number of devices 0

NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform

@rkothako rkothako commented Dec 21, 2019

Thanks @preda and @btspce.
Can you please let us know the exact steps to reproduce this problem?

@btspce btspce commented Dec 21, 2019

A simple upgrade from ROCm 2.10 to 3.0 of the packages at http://repo.radeon.com/rocm/yum/rpm when they became available. clinfo shows 0 devices directly after the upgrade and after a reboot. Removing the packages and reverting to 2.10 solves the problem.

$ sudo dnf update
Packages Altered:
Upgraded hsa-ext-rocr-dev-1.1.9.0_rocm_rel_3.0_6_7128d0dc-1.x86_64 @ROCm
Upgraded hsa-rocr-dev-1.1.9.0_rocm_rel_3.0_6_7128d0dc-1.x86_64 @ROCm
Upgraded hsakmt-roct-1.0.9_298_gea01eb3-1.x86_64 @ROCm
Upgraded rocm-clang-ocl-0.5.0.47_rocm_rel_3.0_6_cfddddb-1.x86_64 @ROCm
Upgraded rocm-opencl-2.0.0-rocm_rel_3.0_6_9a4afec13.x86_64 @ROCm
Upgraded rocm-opencl-devel-2.0.0-rocm_rel_3.0_6_9a4afec13.x86_64 @ROCm
Upgraded rocm-utils-3.0.6-1.x86_64 @ROCm
Upgraded rocminfo-1.0.0-1.x86_64 @ROCm

$ clinfo
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (3052.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD

Platform Name AMD Accelerated Parallel Processing
Number of devices 0

NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No devices found in platform

@preda preda commented Dec 21, 2019

> Thanks @preda and @btspce
> Can you please let know the exact steps to reproduce this problem.

The steps are: starting with a working OpenCL-only ROCm 2.10 installation, do a

$ sudo apt upgrade

which updates the packages as already indicated. At this point OpenCL stops detecting any devices, and this persists after a reboot. rocm-smi continues to correctly detect all the GPUs at all times.

@ddobreff ddobreff commented Dec 21, 2019

You need to install comgr too. For some unknown reason, libamd_comgr.so is no longer part of the rocm-opencl package.
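
One way to confirm whether libamd_comgr.so is present after an upgrade is to search the ROCm install tree for it. The sketch below assumes the install root is /opt/rocm, as used by the paths elsewhere in this thread. The function name `find_comgr` is hypothetical, and finding the file does not prove that the OpenCL runtime can actually load it.

```python
from pathlib import Path


def find_comgr(roots=("/opt/rocm",)):
    """Return paths of libamd_comgr.so* files found under the given roots."""
    hits = []
    for root in roots:
        base = Path(root)
        if base.is_dir():
            # rglob walks the tree recursively; versioned names such as
            # libamd_comgr.so.1 are matched by the trailing wildcard.
            hits.extend(sorted(str(path) for path in base.rglob("libamd_comgr.so*")))
    return hits
```

An empty list suggests the comgr package is missing or installed somewhere other than the roots searched.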

@preda preda commented Dec 21, 2019

> You need to install comgr too. Its no longer a part of rocm-opencl package for unknown reason - libamd_comgr.so

I did try installing comgr too; it didn't fix it.

Maybe AMD could use the information from this thread to fix the 3.0 upgrade and update the instructions:

  • introduce a package dependency of rocm-opencl on comgr if needed
  • remove the old amdocl64.icd during the upgrade if required

@btspce btspce commented Dec 22, 2019

3.0 is not working. I made an error in a previous comment: I ended up with both the 2.2 repo and the 3.0 repo in my rocm.repo after switching back and forth a few times. I have deleted my previous comment that stated that this was fixed.

@btspce btspce commented Dec 22, 2019

When reinstalling 3.0 packages I noticed this:

Running scriptlet: hsa-rocr-dev-1.1.9.0_rocm_rel_3.0_6_7128d0dc-1.x86_6 8/9
/var/tmp/rpm-tmp.HOl0fW: line 1: [: missing `]'

@btspce btspce commented Dec 22, 2019

clinfo shows:

Number of P2P devices (AMD) 0
P2P devices (AMD) <printDeviceInfo:147: get number of CL_DEVICE_P2P_DEVICES_AMD : error -30>

When hsa-ext-rocr-dev-1.1.9.0-rocm-rel-3.0-6-7128d0dc-Linux.rpm is installed for image support, clinfo crashes and so does Darktable. The last version that worked was hsa-ext-rocr-dev-1.1.9-122-ge5c4efb1-Linux.rpm from ROCm 2.9.

$ clinfo
Segmentation fault (core dumped)

@OlegSmelov OlegSmelov commented Dec 23, 2019

For those wondering how to revert to a previous version on Debian-based distros:

sudo apt autoremove rocm-dkms rock-dkms
sudo vim /etc/apt/sources.list.d/rocm.list

Replace http://repo.radeon.com/rocm/apt/debian/ with http://repo.radeon.com/rocm/apt/2.10.0/

sudo apt update
sudo apt install rocm-dkms # or any other set of packages you need
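
The manual edit of rocm.list described above can also be expressed as a string substitution. This is only a sketch: the function name `pin_rocm_repo` is hypothetical, and the sample line in the usage note, including its distribution and component fields, is an assumed typical entry rather than one quoted in this thread.

```python
def pin_rocm_repo(sources_line, version="2.10.0"):
    """Point an apt source line at a versioned ROCm repo instead of the latest one."""
    latest = "http://repo.radeon.com/rocm/apt/debian/"
    pinned = "http://repo.radeon.com/rocm/apt/%s/" % version
    # Lines that do not reference the latest repo are returned unchanged.
    return sources_line.replace(latest, pinned)
```

For example, `pin_rocm_repo("deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main")` yields the same line with `apt/2.10.0/` in place of `apt/debian/`. Writing the result back to /etc/apt/sources.list.d/rocm.list still requires root, followed by the `apt update` shown above.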

@rkothako rkothako commented Dec 23, 2019

Thanks all.
clinfo works after an upgrade from 2.10 to 3.0 done as follows:

  • sudo apt install rocm-dkms [2.10]
  • sudo apt upgrade [use this to upgrade to 3.0]

clinfo fails to find devices when the upgrade is done as follows:

  • sudo apt install rock-dkms rocm-opencl-dev [2.10, OpenCL-only ROCm install]
  • sudo apt upgrade [upgrade to 3.0]

We have logged an internal issue for a proper fix and are currently working on it.

@preda preda commented Dec 28, 2019

> Thanks all.
> Clinfo works good with 3.0 upgrade from 2.10 as below
>
> * sudo apt install rocm-dkms [2.10]
> * sudo apt upgrade [ use this to upgrade to 3.0]
>
> Clinfo fail to find devices when we do upgrade as below
>
> * sudo apt install rock-dkms rocm-opencl-dev [ 2.10 - install opencl only rocm ]
> * sudo apt upgrade [upgrade to 3.0]
>
> We have logged an internal issue for proper fix.
> Currently we are working on this issue.

@rkothako could you please clarify the working steps for an OpenCL-only upgrade from 2.10 to 3.0 without dkms? I'm not using rocm-dkms, and rocm-dkms would most likely fail to compile anyway on the kernel I'm using (5.5).

Could you also please explain what problem the working upgrade steps fix (to help our understanding)? Thanks.

@acowley acowley mentioned this issue Dec 31, 2019

@csuji csuji commented Jan 1, 2020

Same here after upgrading from 2.10 to 3.0-6 with Vega Frontier card.

clinfo
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.1 AMD-APP (3052.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 1
  Device Name                                     gfx900
  Device Vendor                                   Advanced Micro Devices, Inc.
  Device Vendor ID                                0x1002
  Device Version                                  OpenCL 2.0 
  Driver Version                                  3052.0 (HSA1.1,LC)
  Device OpenCL C Version                         OpenCL C 2.0 
  Device Type                                     GPU
  Device Board Name (AMD)                         Vega 10 XTX [Radeon Vega Frontier Edition]
...
NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [AMD]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx900
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx900
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 AMD Accelerated Parallel Processing
    Device Name                                   gfx900

How do you guys do regression testing???

@wxsatman wxsatman commented Jan 2, 2020

New to ROCm stack but have used AMD OpenCL before.

I just tried to install ROCm on openSUSE from the Yum repository, and I assume I am having the same problem?

I first upgraded to kernel 5.4, so kernel support should be there.

/opt/rocm/bin/rocminfo gives reasonable results (splits ThreadRipper into 4 agents??) then shows as below for GPU.

When I run /opt/rocm/opencl/bin/x86_64/clinfo I get as below:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3052.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)

So, as I guess this thread indicates, OpenCL just does not work with ROCm 3.0?

If this is not the case, how do you fix this?

How could this happen? Was OpenCL really not tested before this was released? Yikes!

When will a fix be available?

Partial Output from rocminfo:


Agent 5


Name: gfx900
Marketing Name: Vega 10 XT [Radeon RX Vega 64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 4
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1750

@wxsatman wxsatman commented Jan 2, 2020

I will note that I looked at the RPMs that were installed from the repository (listed below) and noticed that the opencl* packages have 2.0.0 in their version, while other important ROCm packages have 3.0.0. Is that right, or should there be a later version of the opencl* packages?

linux-k3mw:/home/goesmgr # rpm -qa | grep rocm
hipcub-2.9.0.88_rocm_rel_3.0_6_6ee0aed-1.x86_64
rocrand-2.10.0.656_rocm_rel_3.0_6_b9f838b-1.x86_64
rocprim-2.9.0.950_rocm_rel_3.0_6_b85751b-1.x86_64
rocm-utils-3.0.0-1.x86_64
rocm-clang-ocl-0.5.0.47_cfddddb-1.x86_64
rocm-opencl-2.0.0-_9a4afec13.x86_64
rocalution-1.6.3.460_rocm_rel_3.0_6_2382876-1.x86_64
rocm-cmake-0.3.0.134_e6d1ef3-1.x86_64
rocm-libs-3.0.0-1.x86_64
rocblas-2.12.1.1749_rocm_rel_3.0_6_ca5535b-1.x86_64
rocfft-0.9.9.760_rocm_rel_3.0_6_aee1339-1.x86_64
rocm-debug-agent-1.0.0-1.x86_64
rocm-dev-3.0.0-1.x86_64
procmail-3.22-lp150.2.3.x86_64
rocm-profiler-5.6.7262-g93fb592.x86_64
rocm-smi-1.0.0_192_g01752f2-1.x86_64
rocm-device-libs-1.0.0.559_628eea4-1.x86_64
hipsparse-1.3.3.208_rocm_rel_3.0_6_f98f82e-1.x86_64
rocm-opencl-devel-2.0.0-_9a4afec13.x86_64
rocminfo-1.0.0-1.x86_64
rocthrust-2.9.0.413_rocm_rel_3.0_6_957b1e9-1.x86_64
rocm-smi-lib64-2.2.0.8.local_build_0_8ffe1bc-1.x86_64
rocsparse-1.5.15.691_rocm_rel_3.0_6_aee785e-1.x86_64
hipblas-0.18.0.281_rocm_rel_3.0_6_da8f5a2-1.x86_64

@preda preda commented Jan 4, 2020

@rkothako : is there a way to upgrade from ROCm 2.10 to ROCm 3.0, OpenCL only, without dkms? Please let me know how I can do this upgrade.

@rkothako rkothako commented Jan 6, 2020

Hi @preda and all,
We have found the root cause of the problem; the workaround is given below.
After upgrading OpenCL-only ROCm from 2.10 to 3.0, install these packages on top of it: comgr rocm-smi-lib64
(sudo apt install comgr rocm-smi-lib64)
Then clinfo will start working.

@csuji csuji commented Jan 7, 2020

Does not work for me with Ubuntu 18.04.3 LTS and kernel 5.0.0-37-generic, Vega Frontier card. Can you please post a step by step guide and test the next release with some common distributions? Thanks!

@rkothako rkothako commented Jan 7, 2020

Steps to follow:

  1. Install OpenCL only ROCm for 2.10
    sudo apt install rock-dkms rocm-opencl-dev
  2. Upgrade to ROCm 3.0
    sudo apt upgrade
  3. Run clinfo
    /opt/rocm/opencl/bin/x86_64/clinfo --> Clinfo fails to run
  4. Install comgr rocm-smi-lib64
    sudo apt install comgr rocm-smi-lib64
  5. Run clinfo
    /opt/rocm/opencl/bin/x86_64/clinfo --> Clinfo runs well

@preda preda commented Jan 7, 2020

> Steps to follow:
>
> 1. Install OpenCL only ROCm for 2.10
>    sudo apt install rock-dkms rocm-opencl-dev
> 2. Upgrade to ROCm 3.0
>    sudo apt upgrade
> 3. Run clinfo
>    /opt/rocm/opencl/bin/x86_64/clinfo --> Clinfo fails to run
> 4. Install comgr rocm-smi-lib64
>    sudo apt install comgr rocm-smi-lib64
> 5. Run clinfo
>    /opt/rocm/opencl/bin/x86_64/clinfo --> Clinfo runs well

@rkothako thank you, but I am talking about an install without rock-dkms, as is required when using a recent kernel that is not supported by rock-dkms. Did you try your instructions on a system with Linux kernel 5.4 or 5.5?

@csuji csuji commented Jan 7, 2020

@rkothako Thanks, this works for the /opt/rocm clinfo. The problem now (or still, since I thought that was the fix) is that Leela Zero (https://github.com/leela-zero/leela-zero, a Go engine with OpenCL) fails to compile all 290 kernels it tries during tuning (this worked with 2.10):

./leelaz --benchmark -t6 -w somenet_downloaded_network.gz
Failed to compile: 290 kernels.
Failed to find a working configuration.
Check your OpenCL drivers.
Minimum error: 100.000000. Error bound: 0.000100
terminate called after throwing an instance of 'std::runtime_error'
what(): Tuner failed to find working configuration.
Aborted (core dumped)

KataGo (https://github.com/lightvector/KataGo) also fails to tune with 3.0 AND 2.10 (I did not check older versions).
I think these are good test cases for your OpenCL implementation, as they try a lot of kernels during tuning. Perhaps you could include them in some regression tests!

@wxsatman wxsatman commented Jan 8, 2020

@rkothako & all

I checked, and I have the comgr and rocm-smi packages installed, but it is still not working.

I am using openSUSE with a 5.4 kernel, so rock-dkms is not used/needed.

I have a "clean" install of just the 3.0 version from the latest zypper repository:

zypper ar http://repo.radeon.com/rocm/zyp/zypper/ rocm-repo

In my post earlier it shows the rocm related packages that were installed.

To the developers here: Why don't you try to just do a clean OpenSuse Linux install and then add your repository and the packages. Then run clinfo and straighten out that issue and then run some real testing on common OpenCL Kernels and applications. Then post updates with a readme of tested configurations.

I do not see anyone but the most experienced users wanting to try/use the ROCm stuff at this point. Little point in spending all this effort on ROCm if no one can actually use it!

@jcdutton jcdutton commented Jan 14, 2020

I have a similar problem. Kernel 5.4.6 from mainline. ROCm from git: Version 3.0
rocminfo shows the GPU; clinfo shows no GPUs.
I turned on some debugging in clinfo, and it outputs this:
clinfo-debug.txt

It looks to me like the following error messages point to what is wrong:
:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:1714: Trying to allocate host memory
hsa_amd_memory_pool_allocate stat: 1008
:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:1718: Fail allocation host memory
:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:354: Couldn't allocate a transfer buffer!
:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:798: Couldn't allocate transfer buffer objects for read
:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:568: Error creating new instance of Device.

Unfortunately, the failure is happening within the binary-only AMD library, so further analysis is difficult.

@jcdutton jcdutton commented Jan 14, 2020

Doing an strace gives some more info:
This call is failing:
[pid 12036] mbind(0x7f2650200000, 1052672, 0x8001 /* MPOL_??? */, [0x0000000000000002], 3, 0) = -1 EINVAL (Invalid argument)
As this is from within a binary only lib, I do not have the associated source code.
This is rocdevice.cpp:1714:
hsa_status_t stat = hsa_amd_memory_pool_allocate(segment, size, 0, &ptr);
the function "hsa_amd_memory_pool_allocate" is in the binary only lib.

More of the strace:
[pid 12036] write(2, ":1:/usr/src/rocm/ROCm-OpenCL-Run"..., 115:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:1714: Trying to allocate host memory
) = 115
[pid 12036] mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f26501fe000
[pid 12036] munmap(0x7f26501fe000, 8192) = 0
[pid 12036] munmap(0x7f2650301000, 1044480) = 0
[pid 12036] mmap(0x7f2650200000, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f2650200000
[pid 12036] get_mempolicy(NULL, NULL, 0, NULL, 0) = 0
[pid 12036] mbind(0x7f2650200000, 1052672, 0x8001 /* MPOL_??? */, [0x0000000000000002], 3, 0) = -1 EINVAL (Invalid argument)
[pid 12036] mbind(0x7f2650200000, 1052672, MPOL_DEFAULT, NULL, 0, 0) = 0
[pid 12036] munmap(0x7f2650200000, 1052672) = 0
[pid 12036] mmap(NULL, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7f26501fe000
[pid 12036] munmap(0x7f26501fe000, 8192) = 0
[pid 12036] munmap(0x7f2650301000, 1044480) = 0
[pid 12036] mmap(0x7f2650200000, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f2650200000
[pid 12036] get_mempolicy(NULL, NULL, 0, NULL, 0) = 0
[pid 12036] mbind(0x7f2650200000, 1052672, 0x8001 /* MPOL_??? */, [0x0000000000000002], 3, 0) = -1 EINVAL (Invalid argument)
[pid 12036] mbind(0x7f2650200000, 1052672, MPOL_DEFAULT, NULL, 0, 0) = 0
[pid 12036] munmap(0x7f2650200000, 1052672) = 0
[pid 12036] fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
[pid 12036] write(1, "hsa_amd_memory_pool_allocate sta"..., 40) = 40
[pid 12036] write(2, ":1:/usr/src/rocm/ROCm-OpenCL-Run"..., 112:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:1718: Fail allocation host memory
) = 112
[pid 12036] write(2, ":1:/usr/src/rocm/ROCm-OpenCL-Run"..., 120:1:/usr/src/rocm/ROCm-OpenCL-Runtime/opencl/runtime/device/rocm/rocdevice.cpp:354: Couldn't allocate a transfer buffer!

@jcdutton jcdutton commented Jan 14, 2020

I think I have worked out why clinfo does not find the Vega GPU:
[pid 4130] mbind(0x7f6ed7266000, 4096, 0x8001 /* MPOL_??? */, [0x0000000000000001], 3, 0) = 0
[pid 4130] mbind(0x7f6dc0400000, 1052672, 0x8001 /* MPOL_??? */, [0x0000000000000002], 3, 0) = -1 EINVAL (Invalid argument)

This is trying to mbind to node0 (success) and then tries to mbind to node1 (fails).
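
The strace arguments can be decoded to check this reading. The sketch below uses the mode and flag values from the Linux mempolicy header (MPOL_PREFERRED is 1, MPOL_F_STATIC_NODES is 1 << 15). The helper names are hypothetical, and only a single-word nodemask, as printed by strace here, is handled.

```python
# Values from the Linux UAPI header linux/mempolicy.h.
MPOL_MODES = {
    0: "MPOL_DEFAULT",
    1: "MPOL_PREFERRED",
    2: "MPOL_BIND",
    3: "MPOL_INTERLEAVE",
    4: "MPOL_LOCAL",
}
MPOL_FLAGS = {1 << 15: "MPOL_F_STATIC_NODES", 1 << 14: "MPOL_F_RELATIVE_NODES"}


def decode_mode(mode):
    """Split an mbind mode argument into its base mode and optional flags."""
    names = [name for bit, name in sorted(MPOL_FLAGS.items(), reverse=True) if mode & bit]
    base = mode & ~sum(MPOL_FLAGS)  # clear the flag bits to get the base mode
    names.insert(0, MPOL_MODES.get(base, "unknown(%d)" % base))
    return "|".join(names)


def decode_nodemask(mask):
    """Return the NUMA node numbers whose bits are set in a nodemask word."""
    return [node for node in range(mask.bit_length()) if mask >> node & 1]
```

With these, `decode_mode(0x8001)` gives `MPOL_PREFERRED|MPOL_F_STATIC_NODES`, `decode_nodemask(0x1)` gives `[0]`, and `decode_nodemask(0x2)` gives `[1]`, which matches the node0/node1 interpretation. The mbind(2) man page lists a nodemask whose nodes contain no memory among the causes of EINVAL, which is consistent with the second call failing.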
The rocminfo output shows the following:
Agent 1 (CPU)
...
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 33554048KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
...
Agent 2 (CPU)
Segment: GROUP
Size: 64KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE

So, the RAM is attached to node0 (Agent 1), but no RAM is attached to node1 (Agent 2).
The AMD binary blob tries both nodes and gives up if node1 has no RAM attached.

I have an AMD Threadripper 1950, which has a similar pattern of RAM installation: all attached to node0, no RAM attached to node1.

@jcdutton jcdutton commented Jan 19, 2020

@preda @btspce @wxsatman
FIX FOUND:
My PC has 2 RAM chips, 16GB per chip. Previously both chips were on node0, as per the motherboard manual for installing 2 chips.
If I move the RAM so that 1 chip is on node0 and the other chip is on node1, clinfo now detects my GPU.
So, my advice for people seeing this problem is to rearrange the RAM chips into different slots.
This bug still needs fixing though, or at least a more useful error message so the user knows they need to move RAM chips around.

@MatPoliquin MatPoliquin commented Feb 7, 2020

I have a similar issue with clinfo after installing ROCm 3.0 on a fresh install of Ubuntu 19.04 with kernel 5.0. My GPU is an RX 580.

Output of clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3052.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing

@preda preda commented Feb 10, 2020

Is AMD still investigating how to get OpenCL to keep working after an update to ROCm 3.0? Otherwise, maybe the official solution could be posted here.

@ableeker ableeker commented Feb 12, 2020

In #474, cryptomilk suggests symlinking ROCm's libamdocl64.so to /usr/lib64/libamdoclcl.so. I installed 3.0 and found libamdocl64.so in /opt/rocm/opencl/lib/x86_64, so I symlinked libamdoclcl.so to it. This didn't work, however: clinfo still claims to have found no devices, and OpenCL programs, LuxMark 3.1 for instance, still refuse to work. If I remove 3.0 and install 2.10, everything works fine again.

@cryptomilk cryptomilk commented Feb 13, 2020

Not really. libamdoclcl64.so (double cl) seems to be only part of the proprietary driver. It shouldn't be required for ROCm.

Either the kernel interface has changed in 3.0 and hasn't been updated in one of the recent kernels or there is an issue in ROCm 3.0 with upstream kernels which needs fixing. However I think with ROCm 3.1 the issue will be addressed. Just stay with 2.10 till it is released.

@jcdutton jcdutton commented Feb 13, 2020

Well, I had this problem. I now have kernel 5.4.6 working with ROCm 3.0 on a Vega 56 with an AMD 1950X CPU.
I traced the problem to a bug in the binary-blob part of ROCm (ROCm is not 100% open source).
Please see my previous posts here for the workaround, i.e. move the RAM chips.
The problem is related to NUMA, so it only affects CPUs that have more than one node, which I think, at this point, means only some of the higher-end AMD CPUs.
I have a simple C test program that probes the NUMA configuration in the same way the binary blob tries to, and demonstrates the exact case where clinfo fails.
When I move the RAM chips, my simple C test program passes and clinfo then works.
Unfortunately, the probing happens inside the ROCm binary blob, so it cannot be fixed without help from AMD.

@dmarcuse dmarcuse commented Feb 13, 2020

@jcdutton could you share your code? I'd like to test my system (Ryzen 3700X and Vega 56) and I'm also just curious about it 😛

@jcdutton jcdutton commented Feb 13, 2020

rocm3.0-test.c.txt

Compile with:
gcc -g -O0 -c -o rocm3.0-test.o rocm3.0-test.c
gcc -g -o rocm3.0-test rocm3.0-test.o -lnuma

Example of good output:

./rocm3.0-test

You have 2 CPU nodes
CPU Node 0 has RAM. OK
CPU Node 1 has RAM. OK
CPU Nodes and RAM layout checked. This should work with ROCM 3.0
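
The attachment itself is not reproduced in this thread. For anyone who only wants the gist, here is a rough Python sketch of the kind of check the output above suggests: every NUMA node that has CPUs should also have RAM. It is not the logic from the attached C program or from the ROCm runtime. It assumes the usual Linux sysfs layout (/sys/devices/system/node/nodeN/cpulist and meminfo), and the function name numa_layout is made up for illustration.

```python
import os
import re


def _read(path):
    """Return the file contents, or an empty string if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


def numa_layout(sysfs_root="/sys/devices/system/node"):
    """Return (cpu_nodes, cpu_nodes_without_ram) as sorted lists of node ids."""
    cpu_nodes, without_ram = [], []
    if not os.path.isdir(sysfs_root):
        return cpu_nodes, without_ram  # no NUMA entries exposed in sysfs
    for entry in os.listdir(sysfs_root):
        match = re.fullmatch(r"node(\d+)", entry)
        if match is None:
            continue
        node_dir = os.path.join(sysfs_root, entry)
        # cpulist is empty for memory-only nodes; those are skipped.
        if not _read(os.path.join(node_dir, "cpulist")).strip():
            continue
        node_id = int(match.group(1))
        cpu_nodes.append(node_id)
        # meminfo has a line such as "Node 0 MemTotal:  16343228 kB".
        # An unreadable meminfo is treated as "no RAM" (assumption).
        total = re.search(r"MemTotal:\s+(\d+)\s+kB",
                          _read(os.path.join(node_dir, "meminfo")))
        if total is None or int(total.group(1)) == 0:
            without_ram.append(node_id)
    return sorted(cpu_nodes), sorted(without_ram)


if __name__ == "__main__":
    nodes, bad = numa_layout()
    print(f"You have {len(nodes)} CPU nodes")
    for n in nodes:
        print(f"CPU Node {n} has {'no RAM' if n in bad else 'RAM'}")
```

If the second list is non-empty, the machine matches the RAM layout case described above. An empty list says nothing about other causes of the "no devices" error.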

@ableeker ableeker commented Feb 13, 2020

@jcdutton thanks for the code. It compiled just fine. The output for my computer:

You have 1 CPU nodes
CPU Node 0 has RAM. OK
CPU Nodes and RAM layout checked. This should work with ROCM 3.0

It's a PC with an Intel i5-4460 Haswell CPU and 16 GB of RAM (4 x 4 GB), so all 4 slots are filled. Unfortunately ROCm OpenCL 3.0 isn't working: clinfo says 0 devices, and OpenCL programs either don't work or crash.

@jcdutton jcdutton commented Feb 14, 2020

@ableeker
Ok, good, you don't have the RAM layout bug.
Please post the output from rocminfo

@dmarcuse dmarcuse commented Feb 14, 2020

@jcdutton Thanks for posting your script! Unfortunately it doesn't seem to be accurate for my machine. It reports that it should work, outputting the following on both Linux 5.5 and Linux 5.4:

You have 1 CPU nodes
CPU Node 0 has RAM. OK
CPU Nodes and RAM layout checked. This should work with ROCM 3.0

However, on Linux 5.5, clinfo and rocminfo both segfault. Both work properly in Linux 5.4 (LTS), leading me to believe that something was changed in one of the kernel modules in 5.5 that broke compatibility.

@jcdutton jcdutton commented Feb 14, 2020

@dmarcuse
Yes, 5.5 is a problem. See #1007
I have the same problem with 5.5.

The test program only tests one edge case that causes clinfo to fail to recognize a GPU. There might be other reasons ROCm 3.0 does not work, in which case running "rocminfo" helps diagnose them.

@dmarcuse dmarcuse commented Feb 14, 2020

Ah, I didn't realize there was a separate issue for 5.5. Hopefully they'll both be fixed soon.

@ableeker ableeker commented Feb 15, 2020

@jcdutton

ROCk module is loaded
ableeker is member of video group

HSA System Attributes

Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE

==========
HSA Agents


Agent 1


Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Marketing Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3400
BDFID: 0
Internal Node ID: 0
Compute Unit: 4
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16343228(0xf960bc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16343228(0xf960bc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A


Agent 2


Name: gfx900
Marketing Name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1630
BDFID: 768
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
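
For comparing dumps like the one above across machines, a small Python sketch that pulls the name and device type of each agent out of rocminfo text may help. It assumes the "Agent N" / "Name:" / "Device Type:" lines look as they do in this paste (leading whitespace and asterisks are stripped first); the function name list_agents is hypothetical and not part of any ROCm tool.

```python
import re


def list_agents(rocminfo_text):
    """Return [(name, device_type), ...], one tuple per 'Agent N' section."""
    agents = []
    current = None
    for raw in rocminfo_text.splitlines():
        line = raw.strip().strip("*").strip()
        if re.fullmatch(r"Agent \d+", line):
            current = {}
            agents.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            if key in ("Name", "Device Type"):
                # Keep the first value only: the ISA section further down
                # has its own "Name:" line that must not overwrite it.
                current.setdefault(key, value.strip())
    return [(a.get("Name"), a.get("Device Type")) for a in agents]
```

Fed the output above, this should report one CPU agent and one GPU agent (gfx900), i.e. the HSA runtime sees the card even though clinfo does not.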

@ableeker ableeker commented Feb 15, 2020

I had installed rocm-opencl, had put the user in the video group, and added the udev rule. This was working well with 2.10, but failed to work with 3.0. For 3.0 I've installed all packages mentioned here as well, except rocm-dkms/rock-dkms, because I'm using Ubuntu 19.10 with kernel 5.3. rocminfo is the only one that sees the device (gfx900) it seems.

rocm clinfo:

Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3052.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

Platform Name: AMD Accelerated Parallel Processing
ERROR: clGetDeviceIDs(-1)
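
For reference, -1 is CL_DEVICE_NOT_FOUND in the Khronos cl.h header, so the platform loads but enumerates no devices. A minimal Python sketch for spotting such lines in saved clinfo output follows; it assumes the "ERROR: function(code)" format shown above, the helper name clinfo_errors is made up, and only a few status codes are mapped.

```python
import re

# Small subset of OpenCL status codes from the Khronos cl.h header.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
}


def clinfo_errors(clinfo_text):
    """Return [(function, code, symbolic_name), ...] for each ERROR line."""
    found = []
    for m in re.finditer(r"ERROR:\s*(\w+)\((-?\d+)\)", clinfo_text):
        code = int(m.group(2))
        found.append((m.group(1), code, CL_ERRORS.get(code, "unknown")))
    return found
```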

@jcdutton jcdutton commented Feb 16, 2020

@ableeker
Check that you have the following files:
/opt/rocm/lib/libhsakmt.so.1.0.6
/opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9
/opt/rocm/opencl/lib/x86_64/libamdocl64.so
/opt/rocm/opencl/lib/x86_64/libOpenCL.so.1
/dev/kfd
/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0
/dev/shm/hsakmt_shared_mem
/dev/shm/sem.hsakmt_semaphore
/etc/OpenCL/vendors/amdocl64.icd
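
The list above can be checked in one go with a short Python sketch like this one. The paths are copied from the list as given (the version suffixes are specific to this ROCm 3.0 install and may differ elsewhere); the names REQUIRED_PATHS and missing_paths are made up for illustration.

```python
import os

# Paths copied verbatim from the list above.
REQUIRED_PATHS = [
    "/opt/rocm/lib/libhsakmt.so.1.0.6",
    "/opt/rocm/hsa/lib/libhsa-runtime64.so.1.1.9",
    "/opt/rocm/opencl/lib/x86_64/libamdocl64.so",
    "/opt/rocm/opencl/lib/x86_64/libOpenCL.so.1",
    "/dev/kfd",
    "/usr/lib/x86_64-linux-gnu/libnuma.so.1.0.0",
    "/dev/shm/hsakmt_shared_mem",
    "/dev/shm/sem.hsakmt_semaphore",
    "/etc/OpenCL/vendors/amdocl64.icd",
]


def missing_paths(paths=REQUIRED_PATHS):
    """Return the subset of paths that do not exist (broken symlinks count as missing)."""
    return [p for p in paths if not os.path.exists(p)]


if __name__ == "__main__":
    for path in missing_paths():
        print("missing:", path)
```

No output means every path exists. It does not check permissions on /dev/kfd or the contents of the .icd file.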

@ableeker ableeker commented Feb 16, 2020

@jcdutton
Not all of these files are present: /dev/shm is empty, so hsakmt_shared_mem and sem.hsakmt_semaphore are missing.

@ableeker ableeker commented Feb 16, 2020

@jcdutton
Heh... I had another look, and the files in /dev/shm are now present, so all files in the list are present. But clinfo still claims there are no devices, and OpenCL applications aren't working.
