[Bug]: rocBLAS error: Cannot read TensileLibrary.dat: No such file or directory #1339
Comments
Hi @slipperyslipped. Your GPU uses the gfx1031 instruction set, but the binaries distributed by AMD are not built for that architecture as it is not officially supported. However, the gfx1030 instruction set is identical to the gfx1031 instruction set in all but name. For this reason, there are ways to get the existing binaries running on your GPU. As a workaround, I would recommend setting the environment variable |
Hi, I was blocked by the same problem. I am using a gfx1013 device. Can I set PyTorch not to use rocBLAS for this? |
I'm not an expert on PyTorch, but the gfx1013 ISA is a superset of the gfx1010 ISA. You can set |
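Note: the variable name and value are cut off in the two comments above. As a rough sketch only, assuming the override in question is the ROCm runtime's `HSA_OVERRIDE_GFX_VERSION` and that its value is the gfx target written as `major.minor.stepping` (both are assumptions, not stated in the thread), the conversion from a target name could look like this. The helper name `gfx_to_version` is hypothetical.

```python
import re


def gfx_to_version(target: str) -> str:
    """Convert a gfx target name such as 'gfx1030' to 'major.minor.stepping'.

    Assumption: the name is 'gfx' + major (1-2 decimal digits) + minor
    (1 decimal digit) + stepping (1 hex digit, written out in decimal).
    """
    m = re.fullmatch(r"gfx(\d{1,2})(\d)([0-9a-f])", target)
    if m is None:
        raise ValueError(f"unrecognized gfx target: {target!r}")
    major, minor, stepping = m.groups()
    return f"{int(major)}.{int(minor)}.{int(stepping, 16)}"


if __name__ == "__main__":
    # Assumed variable name; the thread does not spell it out.
    for name in ("gfx1030", "gfx1010", "gfx1100"):
        print(f"{name}: HSA_OVERRIDE_GFX_VERSION={gfx_to_version(name)}")
```

The override only makes the runtime report a different ISA; whether the code built for that ISA is actually safe to run on the real device is a separate question.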
@cgmb gfx1010 produces the same issue:
$ drun --rm rocm/dev-ubuntu-22.04:5.6-complete
root@ftl:/# ls -1 /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx*
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat
As you see there is no |
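The same check can be scripted. The following is a minimal sketch that extracts the gfx targets from `TensileLibrary_lazy_*.dat` file names, using the listing above as sample input. The function names are hypothetical, and the default directory is simply the path shown in that listing; other installs may use a different location.

```python
import re
from pathlib import Path

# Matches the lazy-loading library index files seen in the listing above.
LAZY_LIBRARY = re.compile(r"TensileLibrary_lazy_(gfx[0-9a-f]+)\.dat")


def shipped_targets(filenames):
    """Return the set of gfx targets that have a lazy Tensile library file."""
    targets = set()
    for name in filenames:
        m = LAZY_LIBRARY.fullmatch(Path(name).name)
        if m:
            targets.add(m.group(1))
    return targets


def list_library_dir(directory="/opt/rocm/lib/rocblas/library"):
    """List file names in the rocBLAS library directory, or [] if it is absent."""
    d = Path(directory)
    return [p.name for p in d.iterdir()] if d.is_dir() else []


# File names copied from the container listing in the comment above.
SAMPLE = [
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat",
]

if __name__ == "__main__":
    found = shipped_targets(list_library_dir()) or shipped_targets(SAMPLE)
    print("gfx1010 shipped:", "gfx1010" in found)
```

With the sample listing, `gfx1010` is not among the shipped targets, which matches the observation in the comment.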
Thanks @ulyssesrr. That's a great analysis of the problem. It's perhaps worth noting that the OS-provided rocBLAS package on Debian 13 (Testing/Trixie) and the upcoming Ubuntu 23.10 (Mantic Minotaur) builds Tensile with The OS-provided package for rocBLAS on Debian/Ubuntu also automatically handles loading code objects for ISAs that are known to be compatible as I'd suggested earlier in this thread. For this reason, the OS-provided package has much wider hardware compatibility than the AMD-provided package on GFX9 and GFX10 hardware. I have not tested the OS-provided packages on all hardware platforms, but the tests are also packaged in the OS package Just mentioning it, since that's probably a useful workaround for some people on hardware that is not officially supported. Even folks on other operating systems could potentially spin up a docker container with an Ubuntu or Debian image and |
@cgmb I forgot to mention that the rocBLAS build script on 5.6.0 seems to have an issue where
The rmake.py script treats the cmake flags
However, I was getting them enabled by default, so I had to actually opt out. I'm guessing it is being done here:
I didn't debug much, just rolled a patch and went my way (which I ended up not needing, as I patched the Tensile issue):
As I didn't debug much, I didn't feel confident enough to open an issue. |
FYI, I'm seeing what seems to be the same issue. The GPU is a 7800 XT. The stack from running a basic PyTorch example under GDB is shown below. I did have to override the gfx version to either
|
Did you only install the Radeon Software or did you also install ROCm? |
@YellowRoseCx Yes, ROCm was installed. But there were some errors, and perhaps there was a version mismatch. I have since reinstalled the whole machine, and here is the current state:
Same segfault and stack looks similar. Here is a basic log of what I tried this time:
This time I opted for the AMDGPU install flow option in the ROCm install guide. Running the installer from the Note PyTorch repo is |
The RX 7800 XT (Navi 32) is gfx1101. You likely were overriding the gfx version to 11.0.0. However, that is not safe. The gfx1100 ISA has more registers than the gfx1101 ISA and there are other important differences in the ABI too. With Navi 21/22/23/24, the gfx version override approach more or less worked, despite not being officially supported. Users execute code built for Navi 21 on any of those chips and I don't know of any problems encountered from doing so. The compiler handled each of those ISAs identically. Navi 31/32/33 are not like that. There are known differences between those chips that the compiler is accounting for when it generates code for each architecture. (This isn't the cause of the specific TensileLibrary.dat error you encountered, but it's a warning that you may encounter other problems even once the Tensile issue is resolved, if you're using that override.) |
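To make the distinction concrete, here is a small sketch of a conservative fallback policy: use the device's own library if one is shipped, fall back only to pairings that this thread describes as working, and otherwise refuse. The table and function names are hypothetical, and the table is an illustration built from the comments in this thread (gfx1031 and gfx1032 running gfx1030 code, gfx1013 running gfx1010 code), not an official compatibility list.

```python
# Fallbacks reported in this thread; illustrative, not an official list.
KNOWN_COMPATIBLE = {
    "gfx1031": "gfx1030",  # described above as identical in all but name
    "gfx1032": "gfx1030",  # reported working by users later in the thread
    "gfx1013": "gfx1010",  # gfx1013 described as a superset of gfx1010
}


def pick_library_target(device, shipped):
    """Return the gfx target whose code objects to load, or None if unsafe.

    device:  the GPU's real gfx target, e.g. 'gfx1031'
    shipped: set of gfx targets the installed rocBLAS provides
    """
    if device in shipped:
        return device
    fallback = KNOWN_COMPATIBLE.get(device)
    if fallback in shipped:
        return fallback
    # No known-safe match. In particular, gfx1101 must not borrow gfx1100.
    return None


if __name__ == "__main__":
    shipped = {"gfx1030", "gfx1100", "gfx1101", "gfx1102"}
    for dev in ("gfx1031", "gfx1101", "gfx1103", "gfx1013"):
        print(dev, "->", pick_library_target(dev, shipped))
```

Returning `None` instead of guessing the nearest target is deliberate: for Navi 31/32/33 the comment above explains that a neighbouring ISA is not a safe substitute.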
@cgmb Thanks for the heads-up about ISA incompatibility on Navi 31/32/33. Good to know. I had actually just started going through the RDNA 3 ISA doc, but have not noticed any chip-specific differences called out so far. Is there other documentation I should review, or will there eventually be updates to highlight the differences? Since this is off-topic for this issue, is there a better place to follow (or open) an issue about documentation? |
JFYI, I got Stable Diffusion (automatic) working with ROCm 5.7 on a Phoenix APU (7840U) by overriding the gfx version to 11.0.0
Without this override I got
|
For other architectures such as gfx1103, I think the right approach is to generate a new TensileLibrary.dat file to get optimal performance. Do we have a way to trigger this process? |
@TorreZuk can you take a look or merge it please 😢 ROCm/Tensile#1862 My code won't run without it on an RX 6600 XT |
@hiepxanh sure I will push to see if it can get reviewed sooner rather than later. |
Tried to get my 6650 XT to work with llama.cpp by installing rocm-hip-sdk and got the same error after, I think, it failed to build properly on first launch:
Launching through the gpu again just gives me the last error now. |
@NaturalHate, build for gfx1030 and run with |
If I have to build it myself then I guess I'll pass. |
No, I can send it to you if you use an RX 6600; a lot of people have already built it. Just copy and paste it, and it runs. |
I don't. I use a 6650 XT. |
@hiepxanh Hey, taking a moment to thank you :) I use an RX 6600 XT and the environment variable saved me! @NaturalHate I'm no expert on this hardware stuff, but from your error message the architecture is |
@NaturalHate LostRuins/koboldcpp#441 He gave me this file on koboldcpp and it works. You can try it, since it's the same 1032 platform. @wayneyaoo You are welcome. I dug into this a lot and figured I should save others the time; this issue is really frustrating. |
Describe the bug
Basically getting some form of this error, either
rocBLAS error: Cannot read /opt/rocm-5.4.0/lib/rocblas/library/TensileLibrary.dat: Illegal seek
or
Cannot read TensileLibrary.dat: No such file or directory
To Reproduce
rocblas-dev (= 2.46.0.50400-72~22.04)
Steps to reproduce the behavior:
Expected behavior
No error?
Log-files
Environment
Make sure that ROCm is correctly installed and to capture detailed environment information run the following command:
Getting this error: No LSB modules are available.