Rocminfo Fails #38

Closed
BemusedCat opened this issue Aug 24, 2020 · 8 comments

Comments


BemusedCat commented Aug 24, 2020

I have tried everything I can think of to run rocm-tensorflow, but nothing works.
This is my rocminfo output:
ROCk module is loaded
Unable to open /dev/kfd read-write: Bad address
abhigyan is member of render group
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.


fxkamd commented Aug 24, 2020

Sounds like an installation or configuration problem. Can you post dmesg output and "ls -l /dev/kfd" for a start?

@skeelyamd
Collaborator

The problem is permission to access the device driver interface file, as indicated by "Unable to open /dev/kfd". The permissions and group membership needed depend on your specific environment. Running 'ls -l /dev/kfd' will show the owning group and its permissions. Ensure that you are a member of that group.
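The check described above can be sketched in Python. This is a minimal sketch, not part of ROCm: the function name `describe_device_access` and the returned dictionary keys are made up for illustration. It reports the mode string and owning group that `ls -l` would show, whether the current process belongs to that group, and the error text (if any) from attempting a read-write open.

```python
import grp
import os
import stat
import sys


def describe_device_access(path="/dev/kfd"):
    """Summarize ownership, mode bits, and read-write openability of a device node.

    Hypothetical helper for illustration; it is not part of any ROCm tool.
    """
    st = os.stat(path)
    try:
        group_name = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        # The gid has no entry in the group database (common in containers).
        group_name = str(st.st_gid)

    # Effective group plus supplementary groups of the current process.
    process_gids = set(os.getgroups()) | {os.getegid()}

    try:
        fd = os.open(path, os.O_RDWR)
        os.close(fd)
        open_error = None
    except OSError as exc:
        # Keep the OS error text, e.g. "Permission denied".
        open_error = exc.strerror

    return {
        "mode": stat.filemode(st.st_mode),
        "group": group_name,
        "in_group": st.st_gid in process_gids,
        "open_error": open_error,
    }


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/dev/kfd"
    try:
        print(describe_device_access(target))
    except OSError as exc:
        print(f"cannot stat {target}: {exc.strerror}")
```

The `open_error` field is the useful part when the group check passes but the open still fails: an error other than "Permission denied" suggests the cause lies somewhere other than the file mode bits or group membership.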

@BemusedCat
Author

This is the output of 'ls -l /dev/kfd'.
crw-rw-rw- 1 root render 237, 0 Aug 25 00:11 /dev/kfd

When I run tensorflow I get this error.
2020-08-25 02:39:00.789159: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libhip_hcc.so'; dlerror: libhip_hcc.so: cannot open shared object file: No such file or directory
2020-08-25 02:39:00.789209: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: Could not load dynamic library 'libhip_hcc.so'; dlerror: libhip_hcc.so: cannot open shared object file: No such file or directory
Aborted (core dumped)

@skeelyamd
Collaborator

That looks like HIP is not installed. Did you install the complete set of rocm packages? What is in /opt/rocm/lib and /opt/rocm/bin?
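The two questions here (what is in the library directory, and can the loader find the named library) can be checked separately. The following is a rough sketch under stated assumptions: `find_libraries` and `try_load` are hypothetical helper names, and the glob pattern `lib*hip*.so*` is only a guess at what counts as a HIP-related file. The directory `/opt/rocm/lib` and the library name `libhip_hcc.so` are taken from this thread.

```python
import ctypes
import fnmatch
import os


def find_libraries(directory, pattern):
    """Return sorted file names in `directory` that match a shell-style pattern."""
    try:
        names = os.listdir(directory)
    except OSError:
        # Directory missing or unreadable: report no matches.
        return []
    return sorted(name for name in names if fnmatch.fnmatch(name, pattern))


def try_load(library_name):
    """Ask the dynamic loader for a library; return None on success, else the error text."""
    try:
        ctypes.CDLL(library_name)
        return None
    except OSError as exc:
        return str(exc)


if __name__ == "__main__":
    # Path and library name as quoted in the thread; the pattern is an assumption.
    print("HIP-like files:", find_libraries("/opt/rocm/lib", "lib*hip*.so*"))
    print("libhip_hcc.so load error:", try_load("libhip_hcc.so"))
```

A file can be present in the directory and still fail to load by name if the directory is not on the loader's search path, which is why the sketch reports both results.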

@BemusedCat
Author

BemusedCat commented Aug 24, 2020

Yes, I installed them.
I even did a fresh install of Ubuntu and repeated all the steps.

These are the files in /opt/rocm/lib:

cmake libhsakmt.so.1 librocalution.so librocsolver.so oclc_isa_version_1012.amdgcn.bc hc.amdgcn.bc libhsakmt.so.1.0.30700 librocalution.so.0 librocsolver.so.0 oclc_isa_version_700.amdgcn.bc hip.amdgcn.bc libhsa-runtime64.so librocalution.so.0.1.30700 librocsolver.so.0.1.30700 oclc_isa_version_701.amdgcn.bc libamd_comgr.so libhsa-runtime64.so.1 librocblas.so librocsparse.so oclc_isa_version_702.amdgcn.bc libamd_comgr.so.1 libhsa-runtime64.so.1.2.30700 librocblas.so.0 librocsparse.so.0 oclc_isa_version_801.amdgcn.bc libamd_comgr.so.1.6.30700 libmiopengemm.so librocblas.so.0.1.30700 librocsparse.so.0.1.30700 oclc_isa_version_802.amdgcn.bc libamdhip64.so libmiopengemm.so.1 librocfft-device.so libroctracer64.so oclc_isa_version_803.amdgcn.bc libamdhip64.so.3 libmiopengemm.so.1.0.30700 librocfft-device.so.0 libroctracer64.so.1 oclc_isa_version_810.amdgcn.bc libamdhip64.so.3.7.30700 libMIOpen.so librocfft-device.so.0.1.30700 libroctracer64.so.1.0.30700 oclc_isa_version_900.amdgcn.bc libCXLActivityLogger.so libMIOpen.so.1 librocfft.so libroctx64.so oclc_isa_version_902.amdgcn.bc libhipblas.so libMIOpen.so.1.0.30700 librocfft.so.0 libroctx64.so.1 oclc_isa_version_904.amdgcn.bc libhipblas.so.0 libOpenCL.so librocfft.so.0.1.30700 libroctx64.so.1.0.30700 oclc_isa_version_906.amdgcn.bc libhipblas.so.0.1.30700 libOpenCL.so.1 librocm-dbgapi.so ockl.amdgcn.bc oclc_isa_version_908.amdgcn.bc libhiprand.so libOpenCL.so.1.2 librocm-dbgapi.so.0 oclc_correctly_rounded_sqrt_off.amdgcn.bc oclc_unsafe_math_off.amdgcn.bc libhiprand.so.1 library librocm-dbgapi.so.0.30.0 oclc_correctly_rounded_sqrt_on.amdgcn.bc oclc_unsafe_math_on.amdgcn.bc libhiprand.so.1.1.30700 librccl.so librocm_smi64.so oclc_daz_opt_off.amdgcn.bc oclc_wavefrontsize64_off.amdgcn.bc libhipsparse.so librccl.so.1 librocm_smi64.so.2 oclc_daz_opt_on.amdgcn.bc oclc_wavefrontsize64_on.amdgcn.bc libhipsparse.so.0 librccl.so.1.0.30700 librocprofiler64.so oclc_finite_only_off.amdgcn.bc ocml.amdgcn.bc libhipsparse.so.0.1.30700 
librocalution_hip.so librocrand.so oclc_finite_only_on.amdgcn.bc opencl.amdgcn.bc libhsa-amd-aqlprofile64.so librocalution_hip.so.0 librocrand.so.1 oclc_isa_version_1010.amdgcn.bc libhsakmt.so librocalution_hip.so.0.1.30700 librocrand.so.1.1.30700 oclc_isa_version_1011.amdgcn.bc

/opt/rocm/bin

ca findcode.sh hipcc_cmake_linker_helper hipconvertinplace.sh hipexamine.sh lpl rocminfo rocprof clang-ocl finduncodep.sh hipconfig hipdemangleatp hipify-cmakefile rocgdb rocm-smi extractkernel hipcc hipconvertinplace-perl.sh hipexamine-perl.sh hipify-perl rocm_agent_enumerator rocm_smi.py


zyzzyxdonta commented Oct 8, 2020

Hello,
is there any news on this issue?

I ran into the same problem using the rocm/dev-ubuntu-18.04 docker image inside a CI machine. It took me quite a bit of digging to find this because rocminfo is called by rocm_agent_enumerator which, instead of reporting this error when rocminfo fails, just gets stuck.

I noticed another thing, though I'm not sure where to report this: rocm_agent_enumerator uses lspci, which is not installed inside the docker image.
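For a CI job, one way to avoid waiting forever on a stuck tool is to run it under a timeout and to check up front that its helper programs exist. The sketch below is an illustration only and does not describe how rocm_agent_enumerator works internally; `run_with_timeout` is a hypothetical helper, and the 30-second limit and the `/opt/rocm/bin/rocminfo` path are assumptions.

```python
import shutil
import subprocess


def run_with_timeout(command, timeout_seconds=30.0):
    """Run `command` (a list of strings) and return a (status, output) pair.

    status is the integer exit code, or the string "missing" if the
    executable cannot be found, or "timeout" if it did not finish in time.
    """
    if shutil.which(command[0]) is None:
        return "missing", ""
    try:
        done = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold error messages into the output
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    return done.returncode, done.stdout


if __name__ == "__main__":
    # lspci is checked because the thread reports it missing from the image.
    print("lspci found:", shutil.which("lspci") is not None)
    status, output = run_with_timeout(["/opt/rocm/bin/rocminfo"])
    print("rocminfo status:", status)
    print(output)
```

Returning a distinct "timeout" status lets the calling script fail fast with a clear message instead of hanging the pipeline.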

$ ls -l /dev/kfd
crw-rw-rw- 1 root video 240, 0 Oct  8 13:37 /dev/kfd
$ /opt/rocm-3.8.0/bin/rocminfo &
$ ROCINFO_PID="$!"
$ sleep 30
ROCk module is loaded
Unable to open /dev/kfd read-write: Resource temporarily unavailable
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
$ ps -eF --forest
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root           1       0  0  4635  3276  14 13:37 ?        00:00:00 /bin/bash
root           7       1  0  4660  2560  29 13:37 ?        00:00:00 /bin/bash
root        1137       7  0     0     0  27 13:38 ?        00:00:00  \_ [rocminfo]
root        1139    1137  0     0     0  24 13:38 ?        00:00:00  |   \_ [sh] <defunct>
root        1142       7  0  8602  3000   7 13:39 ?        00:00:00  \_ ps -eF --forest

@ppanchad-amd

@BemusedCat @zyzzyxdonta Apologies for the lack of response. Can you please test with the latest ROCm 6.2? If the issue is resolved, please close the ticket. Thanks!

@ppanchad-amd

@BemusedCat @zyzzyxdonta Closing ticket. Please re-open the ticket if you still encounter the same issue with the latest ROCm. Thanks!

ppanchad-amd closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 10, 2024

5 participants