Tensile won't produce backend libraries for archs without optimized logic files when using --separate-architectures #1757
Comments
Although that's probably not the right place, I really needed to say thank you! I've struggled with that basically since my card has been released and finally I was able to fix it because of you. Doing compute stuff is just a nightmare with AMD, really. |
Fixes ROCm#1757.

Enables architectures that don't have optimized logic files to also produce libraries when `--separate-architectures` or `--lazy-library-loading` is turned on. Previously, one must disable both of these two flags in order for rocBLAS to run on architectures like `gfx1010`.

Test plan:
```
cmake -GNinja -B build -S . \
  -DCMAKE_C_COMPILER=hipcc \
  -DCMAKE_CXX_COMPILER=hipcc \
  -DBUILD_CLIENTS_TESTS=OFF \
  -DBUILD_CLIENTS_BENCHMARKS=OFF \
  -DBUILD_CLIENTS_SAMPLES=OFF \
  -DBUILD_TESTING=OFF \
  -DBUILD_WITH_TENSILE=ON \
  -DTensile_PRINT_DEBUG=ON \
  -DTensile_LIBRARY_FORMAT=msgpack \
  -DTensile_CPU_THREADS=14 \
  -DTensile_LAZY_LIBRARY_LOADING=ON \
  -DAMDGPU_TARGETS="..."
```
With `AMDGPU_TARGETS` being one of the following
- `AMDGPU_TARGETS=gfx1010`
- `AMDGPU_TARGETS=gfx1030;gfx1010`
- `AMDGPU_TARGETS=gfx803;gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102`

In all three cases, `$ROCM_PATH/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat` is produced and all other `*.dat` files remain unchanged.

Signed-off-by: Gavin Zhao <git@gzgz.dev>
#1862 has an updated version of this patch for ROCm >=5.5. |
Fixes #1757. Enables architectures that don't have optimized logic files to also produce libraries when `--separate-architectures` or `--lazy-library-loading` is turned on. Previously, both flags had to be disabled for rocBLAS to run on architectures like `gfx1010`.
This change triggered a failure in a rocBLAS test. |
I actually have a fix for a test failure that I just found out today, but may I get a failure log to ensure that it's the same failure I'm getting? |
command line: ./rocblas-test --gtest_output=xml --gtest_color=yes --gtest_filter=*quick*:*pre_checkin*-*known_bug*
Error:
[----------] 32 tests from _/gemm_ex_get_solutions
/var/jenkins_home/workspace/eckin_rocBLAS-internal_develop_2/z344iq86D/rocblas/clients/gtest/../include/blas_ex/testing_gemm_ex_get_solutions.hpp:152: Failure
Value of: status_match(rocblas_status_success, status_)
Actual: false (got rocblas_status_invalid_value instead of rocblas_status_success)
Expected: true
[ FAILED ] _/gemm_ex_get_solutions.blas3_tensile/pre_checkin_gemm_ex_get_solutions_f16_rf16_rf16_rf16_rf32_r_CN_250_250_250_1_250_250_1_250_250, where GetParam() = { function: "gemm_ex_get_solutions", name: "gemm_ex_get_solutions", category: "pre_checkin", known_bug_platforms: "", beta: 1.0, stride_a: 62500, stride_b: 62500, stride_c: 62500, stride_d: 62500, M: 250, N: 250, K: 250, lda: 250, ldb: 250, ldc: 250, ldd: 250, a_type: f16_r, b_type: f16_r, c_type: f16_r, d_type: f16_r, composite_compute_type: invalid, initialization: rand_int, gpu_arch: "", flush_batch_count: 1, transA: 'C', transB: 'N' } (6 ms) |
We are investigating the issue now, but have not found the cause yet. |
As far as I have tried, this failure does not happen if I revert the fallback change. |
I'm building rocBLAS locally to test. I've been working on ISA compatibility improvements in rocBLAS so my local copy has some modifications. With my current modifications my
While it's building, could you change the if statement at https://github.com/ROCm/rocBLAS/blob/5211f0dca313c56c2163b8602581242c8cb608f1/library/src/tensile_host.cpp#L991C1-L992C43 from
if(library)
    *library = host.get_library();
to
if(adapter)
    *library = host.get_library();
and see if you get a segfault (SIGSEGV)? |
Actually, I could not reproduce the failure with my local rocBLAS build, but it fails in our CI environment. |
What GPU arch does the CI environment have?
No, sometimes |
The failure above is on gfx1101. I do not have a gfx1032 environment. |
To clarify, above that line there's a comment:
If the "library" refers to the |
Also does it only fail on Level-3? Or are there also failures with Level-2 and Level-1 operations as well? |
I am not familiar with the rocBLAS side, but if get_library_and_adapter() is called from rocblas_initialize(), library seems to be NULL since library is not specified here. |
I'm confused as to why my change would affect |
If we do not specify the -a option when building rocBLAS, rocBLAS picks the Tensile library for all architectures, including gfx1010 and gfx1012, which were added by the fallback change. |
That is my guess. We still do not understand why it fails. |
rocBLAS still compiling. Will report back when I get to run the tests and reproduce the failure. |
A SIGSEGV was triggered, let me debug what went wrong. Edit: the exact failure also reproduced. |
|
Putting some investigation notes here. I will spend more time digging through this later this week, but if anyone wants to investigate, feel free to build on top of this. Through my tracing I found that solution selection doesn't seem to be affected. If you print out every single solution found in |
hello guys, first of all thanks for all the hard work you're doing to make the rx5700 work, I'm just a hobbyist and nowhere near your league of expertise.
to make my rx5700 work with llamacpp? want to ask first before I do the test and break my ubuntu lol |
I'm not sure any more if that is all that it takes, because I fiddled a LOT to make my now replaced RX 5700 XT work, but I've used these settings:
In some places I've also used
but in any case I did not use the overrides that you have used, with these low versions / numbers. That shouldn't break anything, as it is solely related to things that make use of ROCm. Worst case it won't work. Edit: |
The way that separate architectures and lazy loading were implemented was really not ideal. The complexity of building all the necessary data structures during initialization should really be pushed to build time, and there should be no meaningful logic executing during initialization. The initialization could be nothing more than

The use of an unindexed key-value pair format like msgpack is the underlying cause of these bugs, because the slow conversion of that data into the Tensile in-memory format drives the introduction of complicated logic to try to be clever about the loading. If a more appropriate data format was used, there would be no need to be clever.

This is not the most helpful comment of mine, because I presume folks here want this bug fixed in less time than it would take to rearchitect the Tensile on-disk data format. The separate-architectures and lazy-loading features just frustrate me. I was there when those features were designed and implemented (by a very close friend of mine who is no longer at AMD), and I told the author this back then too.

Redesign the on-disk data format and you will:
|
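To make the point about indexed versus unindexed formats concrete, here is a minimal sketch of an indexed blob: a sorted index of fixed-size entries sits at the front, so a single record can be located by binary search without decoding anything else. This is not Tensile's format and not a proposal for it; the layout, the 16-byte key limit, and the names `pack` and `lookup` are all assumptions made up for illustration.

```python
import struct

# Hypothetical layout (NOT Tensile's real format):
#   header: uint32 entry count
#   index : sorted entries of (16-byte padded key, payload offset, payload length)
#   data  : raw payload bytes
HEADER = struct.Struct("<I")
ENTRY = struct.Struct("<16sQQ")


def pack(records):
    """Serialize {ascii key (<= 16 bytes): payload bytes} into one indexed blob."""
    keys = sorted(records)
    index_size = HEADER.size + ENTRY.size * len(keys)
    index = bytearray(HEADER.pack(len(keys)))
    payload = bytearray()
    for key in keys:
        data = records[key]
        # Offsets are absolute, so a reader can slice the payload directly.
        index += ENTRY.pack(key.encode("ascii"), index_size + len(payload), len(data))
        payload += data
    return bytes(index + payload)


def lookup(blob, key):
    """Binary-search the index; only the requested payload is touched."""
    (count,) = HEADER.unpack_from(blob, 0)
    target = key.encode("ascii").ljust(16, b"\0")
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        entry_key, offset, length = ENTRY.unpack_from(blob, HEADER.size + mid * ENTRY.size)
        if entry_key == target:
            return blob[offset:offset + length]
        if entry_key < target:
            lo = mid + 1
        else:
            hi = mid
    return None


blob = pack({"gemm_f16": b"A", "gemm_f32": b"BB"})
result = lookup(blob, "gemm_f32")
```

With a layout like this, the same `lookup` would work on a memory-mapped file, which is roughly why no clever loading logic would be needed at initialization.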
Use librocblas-dev and libhipblas-dev from Ubuntu 23.10 or later. Here's an example of how to build and run llama-cpp for any discrete Vega, RDNA 1, RDNA 2, CDNA 1 or CDNA 2 GPU in a docker container: https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea33 It might also work on Polaris, but it might not (since the software for that architecture has a lot of bugs). |
@smirgol one of the rocBLAS contributors says that we can compile llamacpp with hipBLAS and mix old and new GPUs |
lol i just notice its cgmb, sup bro i just dm you in the other repo chat lol |
Fix for gemm_ex_get_solutions issue has been merged into Tensile and rocBLAS develop branch. |
@nakajee I think at the current stage we don't have to test on
|
If you can help test on your environment that'd be great. |
I will implement a workaround for this failure.
|
Fix for asm cap error. |
The |
Patch taken from ROCm/Tensile#1757
Issue
Tensile won't produce backend libraries for archs without optimized logic files when using --separate-architectures.
Description
According to #1165 (comment), "gfx1010 has been enabled by default in rocBLAS builds since ROCm 4.3.0." However, since rocBLAS does not have optimized logic files for navi10, no library is produced for gfx1010.
$ drun --rm rocm/dev-ubuntu-22.04:5.6-complete
root@ftl:/# ls -1 /opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx*
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat
/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat
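The same check can be scripted. The sketch below takes a list of library file names and reports which requested targets have no `TensileLibrary_lazy_<arch>.dat` file. The helper name `missing_archs` is made up for illustration, and it assumes that target features such as `:xnack-` are not part of the file name, which matches the listing above but is not confirmed for every ROCm release.

```python
import re

# File name pattern taken from the listing in this issue.
LIB_PATTERN = re.compile(r"TensileLibrary_lazy_(gfx[0-9a-f]+)\.dat$")


def missing_archs(requested, filenames):
    """Return the requested targets that have no lazy library file."""
    present = set()
    for name in filenames:
        match = LIB_PATTERN.search(name)
        if match:
            present.add(match.group(1))
    # Assumption: features like ":xnack-" do not appear in the file name.
    return [arch for arch in requested if arch.split(":", 1)[0] not in present]


listing = [
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx803.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat",
    "/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat",
]

missing = missing_archs(["gfx1010", "gfx1030", "gfx90a:xnack+"], listing)
print(missing)  # ['gfx1010']
```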
Expected
Tensile should produce libraries for all requested architectures, using the fallback logic files for archs missing optimized logic files.
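A rough sketch of the expected behavior, not Tensile's actual code: every requested architecture gets a logic source, and architectures without optimized logic fall back to a shared one. The function name `select_logic_sources`, the directory strings, and the contents of `optimized_logic` are hypothetical and only illustrate the selection rule.

```python
def select_logic_sources(requested_archs, optimized, fallback="logic/fallback"):
    """Map each requested arch to the logic directory its library is built from.

    `optimized` maps a base arch name to its optimized logic directory.
    Archs that are missing from it use `fallback` instead of being skipped.
    """
    plan = {}
    for arch in requested_archs:
        base = arch.split(":", 1)[0]  # drop features such as ":xnack-"
        plan[arch] = optimized.get(base, fallback)
    return plan


# Hypothetical inventory; the real set of optimized logic files may differ.
optimized_logic = {"gfx1030": "logic/gfx1030", "gfx90a": "logic/gfx90a"}

plan = select_logic_sources(["gfx1030", "gfx1010", "gfx90a:xnack-"], optimized_logic)
```

The bug described here corresponds to the case where `gfx1010` would be dropped from the plan entirely rather than mapped to the fallback.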
Workaround
Building rocBLAS with
--merge-architectures --no-lazy-library-loading
seems to avoid the issue.
Patch
https://github.com/ulyssesrr/docker-rocm-xtra/blob/3be41a9d79ff4f4324f3f34383b2282529c0c4b7/rocm-xtra-builder-rocblas/patches/Tensile-fix-fallback-arch-build.patch