
[Issue]: Can not generate TensileLibrary.dat for gfx1100 #831

Closed
lhl opened this issue Jun 14, 2024 · 3 comments

Comments


lhl commented Jun 14, 2024

Problem Description

I am trying to use https://pytorch.org/torchtune/ on gfx1100 (W7900 and 7900 XTX) with ROCm 6.1.2. This fails because the latest ROCm 6.1.2 distro's hipblaslt-dev6.1.2_0.7.0.60102-119~22.04_amd64 package does not include gfx1100 kernels. I have successfully compiled and installed my own version from source (0.8.0-56aab12f~dirty); however, I can't get it to generate a TensileLibrary.dat (the closest it produces is TensileLibrary_gfx1100.dat), and when I symlink that I get this error:

rocblaslt error: Could not initialize Tensile host:
std::bad_cast
Segmentation fault (core dumped)

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 5 5600G with Radeon Graphics

GPU

AMD Radeon RX 7900 XTX, AMD Radeon Pro W7900

Other

No response

ROCm Version

ROCm 6.1.2

ROCm Component

hipBLASLt

Steps to Reproduce

It's not available as a selection, but I am using the official ROCm 6.1.2 Ubuntu packages.

I am using the current torchtune 0.1.1 package: https://pypi.org/project/torchtune/

The hipblaslt-dev6.1.2 package does not have gfx1100 kernels:

extop_gfx90a.co
extop_gfx940.co
extop_gfx941.co
extop_gfx942.co
hipblasltExtOpLibrary.dat
Kernels.so-000-gfx90a-xnack+.hsaco
Kernels.so-000-gfx90a-xnack-.hsaco
Kernels.so-000-gfx940.hsaco
Kernels.so-000-gfx941.hsaco
Kernels.so-000-gfx942.hsaco
TensileLibrary.dat
TensileLibrary_gfx90a.co
TensileLibrary_gfx940.co
TensileLibrary_gfx941.co
TensileLibrary_gfx942.co
TensileManifest.txt

and so torchtune dies with a HIPBLAS_STATUS_NOT_SUPPORTED error.

I've compiled my new version roughly as the docs suggest:

# modify CMakeLists.txt to turn off lazy loading, in case that was causing issues
option(Tensile_LAZY_LIBRARY_LOADING "Tensile to load kernels on demand?" OFF)

# building for all architectures dies, so I build only for mine
time ./install.sh -idc -a gfx1100

Once installed, I get:

extop_gfx1100.co
hipblasltExtOpLibrary.dat
Kernels.so-000-gfx1100.hsaco
TensileLibrary_gfx1100.co
TensileLibrary_gfx1100.dat
TensileManifest.txt

Running torchtune with this installed gives me this error:

CUDA_VISIBLE_DEVICES=1 tune run lora_finetune_single_device --config recipes/configs/llama3/8B_lora_single_device.yaml
...
rocblaslt error: Cannot read /opt/rocm/lib/hipblaslt/library/TensileLibrary.dat: No such file or directory

rocblaslt error: Could not load /opt/rocm/lib/hipblaslt/library/TensileLibrary.dat
Segmentation fault (core dumped)

Since the location is hard-coded, I symlink TensileLibrary_gfx1100.dat to TensileLibrary.dat, which gives me this error:

rocblaslt error: Could not initialize Tensile host:
std::bad_cast
Segmentation fault (core dumped)
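For reference, the symlink workaround described above can be sketched as follows. This is a minimal sketch, assuming the default /opt/rocm layout; the library directory is the hard-coded location from the error message, and the `if` guard just makes the snippet a no-op on machines without a ROCm install:

```shell
# Point the hard-coded TensileLibrary.dat name at the gfx1100-specific
# library that the source build actually produced.
LIBDIR="${LIBDIR:-/opt/rocm/lib/hipblaslt/library}"
if [ -d "$LIBDIR" ]; then
  ln -sf "$LIBDIR/TensileLibrary_gfx1100.dat" "$LIBDIR/TensileLibrary.dat"
  # print the link target to confirm where the hard-coded name now points
  readlink "$LIBDIR/TensileLibrary.dat"
fi
```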

At this point I'm pretty stumped. torchtune is 100% PyTorch, and the 7900 XTX and W7900 should have full support, so I'm not sure whether I'm hitting a configuration error or an implementation bug.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

/opt/rocm/bin/rocminfo --support
ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.13
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 5600G with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 5600G with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4464
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65619828(0x3e94774) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65619828(0x3e94774) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65619828(0x3e94774) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-e282895b62c2b295
  Marketing Name:          AMD Radeon PRO W7900
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29768(0x7448)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1760
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 202
  SDMA engine uCode::      20
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    47169536(0x2cfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    47169536(0x2cfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1100
  Uuid:                    GPU-cc4d02090dc9c3ff
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2371
  BDFID:                   2560
  Internal Node ID:        2
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 202
  SDMA engine uCode::      20
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response

lhl commented Jun 14, 2024

btw, I am able to run hipblaslt-test and hipblaslt-bench, so I know my hipBLASLt build is working:

hipblaslt-bench --sizem 128 --sizen 128 --sizek 128 --precision f16_r --transA N --transB N --alpha 1 --beta 0 --compute_type f32_r --algo_method heuristic --iters 100 --device 1 --verify
hipBLASLt version: 800
hipBLASLt git version: 56aab12f-dirty
Query device success: there are 2 devices
-------------------------------------------------------------------------------
Device ID 0 : AMD Radeon PRO W7900 gfx1100
with 48.3 GB memory, max. SCLK 1760 MHz, max. MCLK 1124 MHz, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
Device ID 1 : Radeon RX 7900 XTX gfx1100
with 25.8 GB memory, max. SCLK 2371 MHz, max. MCLK 1249 MHz, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------

Is supported 1 / Total solutions: 1
[0]transA,transB,grouped_gemm,batch_count,m,n,k,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,d_type,compute_type,activation_type,bias_vector,hipblaslt-Gflops,us,CPU-Gflops,CPU-us,norm_error_1
    N,N,0,1,128,128,128,1,128,16384,0,128,16384,128,16384,128,16384,f16_r,f32_r,none,0, 341.556, 12.28,5.71431,734,1.60027e-05


lhl commented Jun 17, 2024

FYI, I tracked down the issue: PyTorch (I use the 6.1 nightly) ships its own libhipblaslt.so, which overrides the system copy and causes these problems. If you compile torchtune from source (which you need for the latest features like sample packing), it will also potentially pull in its own copy of hipBLASLt that you need to remove or replace.
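One way to check which libhipblaslt.so a process actually loaded is to inspect its memory maps after importing torch. This is a sketch of that approach, not necessarily how lhl tracked it down (the comment doesn't say); it assumes Linux and reads /proc/self/maps:

```python
# Sketch (assumes Linux): list shared objects mapped into the current
# process by parsing /proc/self/maps. Called after `import torch`, this
# shows whether the libhipblaslt.so under site-packages/torch/lib or the
# one under /opt/rocm was actually loaded.
def loaded_libs(substring=""):
    """Return sorted paths of mapped files whose path contains `substring`."""
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            # maps format: addr perms offset dev inode [pathname]
            parts = line.split(maxsplit=5)
            if len(parts) == 6 and parts[5].startswith("/"):
                path = parts[5].strip()
                if substring in path:
                    paths.add(path)
    return sorted(paths)

if __name__ == "__main__":
    # e.g. run `import torch` first, then: loaded_libs("hipblaslt")
    for p in loaded_libs(".so"):
        print(p)
```

`ldd` on torch's extension modules, or `LD_DEBUG=libs python -c "import torch"`, are alternative ways to see which copy the dynamic linker resolves.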

lhl closed this as completed Jun 17, 2024
@minzhezhou

FYI, I tracked down the issue: PyTorch (I use the 6.1 nightly) ships its own libhipblaslt.so, which overrides the system copy and causes these problems. If you compile torchtune from source (which you need for the latest features like sample packing), it will also potentially pull in its own copy of hipBLASLt that you need to remove or replace.

I copied the self-built .hsaco files into the Tensile library path and got a similar error, but I didn't get a core dump. How did you trace it back to the libhipblaslt.so from torch?
