
[Issue]: Can not generate TensileLibrary.dat for gfx1100 #831

Closed
lhl opened this issue Jun 14, 2024 · 3 comments

Comments


lhl commented Jun 14, 2024

Problem Description

I am trying to use https://pytorch.org/torchtune/ on gfx1100 (W7900 and 7900 XTX) with ROCm 6.1.2. This fails because the latest ROCm 6.1.2 distro's hipblaslt-dev6.1.2_0.7.0.60102-119~22.04_amd64 package does not include gfx1100 kernels. I have successfully compiled and installed my own version from source (0.8.0-56aab12f~dirty); however, I can't get it to generate a TensileLibrary.dat (the closest it produces is TensileLibrary_gfx1100.dat), and when I symlink that I get this error:

rocblaslt error: Could not initialize Tensile host:
std::bad_cast
Segmentation fault (core dumped)

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD Ryzen 5 5600G with Radeon Graphics

GPU

AMD Radeon RX 7900 XTX, AMD Radeon Pro W7900

Other

No response

ROCm Version

ROCm 6.1.2

ROCm Component

hipBLASLt

Steps to Reproduce

It's not available as a selection, but I am using the official ROCm 6.1.2 Ubuntu packages.

I am using the current torchtune 0.1.1 package: https://pypi.org/project/torchtune/

The hipblaslt-dev6.1.2 package does not have gfx1100 kernels:

extop_gfx90a.co
extop_gfx940.co
extop_gfx941.co
extop_gfx942.co
hipblasltExtOpLibrary.dat
Kernels.so-000-gfx90a-xnack+.hsaco
Kernels.so-000-gfx90a-xnack-.hsaco
Kernels.so-000-gfx940.hsaco
Kernels.so-000-gfx941.hsaco
Kernels.so-000-gfx942.hsaco
TensileLibrary.dat
TensileLibrary_gfx90a.co
TensileLibrary_gfx940.co
TensileLibrary_gfx941.co
TensileLibrary_gfx942.co
TensileManifest.txt

and so torchtune dies with a HIPBLAS_STATUS_NOT_SUPPORTED error.

I've compiled my new version roughly as the docs suggest:

# modify CMakeLists.txt to turn off lazy loading, in case that was causing issues
option(Tensile_LAZY_LIBRARY_LOADING "Tensile to load kernels on demand?" OFF)

# building for all architectures dies, so I build only for mine
time ./install.sh -idc -a gfx1100

Once installed, I get:

extop_gfx1100.co
hipblasltExtOpLibrary.dat
Kernels.so-000-gfx1100.hsaco
TensileLibrary_gfx1100.co
TensileLibrary_gfx1100.dat
TensileManifest.txt

Running torchtune with this installed gives me this error:

CUDA_VISIBLE_DEVICES=1 tune run lora_finetune_single_device --config recipes/configs/llama3/8B_lora_single_device.yaml
...
rocblaslt error: Cannot read /opt/rocm/lib/hipblaslt/library/TensileLibrary.dat: No such file or directory

rocblaslt error: Could not load /opt/rocm/lib/hipblaslt/library/TensileLibrary.dat
Segmentation fault (core dumped)

Since the location is hard-coded, I symlink TensileLibrary_gfx1100.dat to TensileLibrary.dat, which gives me this error:

rocblaslt error: Could not initialize Tensile host:
std::bad_cast
Segmentation fault (core dumped)
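For reference, the symlink workaround described above can be sketched as follows. This is a minimal sketch, assuming the default /opt/rocm layout; the library directory is the hard-coded location from the error message, and the `if` guard just makes the snippet a no-op on machines without a ROCm install:

```shell
# Point the hard-coded TensileLibrary.dat name at the gfx1100-specific
# library that the source build actually produced.
LIBDIR="${LIBDIR:-/opt/rocm/lib/hipblaslt/library}"
if [ -d "$LIBDIR" ]; then
  ln -sf "$LIBDIR/TensileLibrary_gfx1100.dat" "$LIBDIR/TensileLibrary.dat"
  # print the link target to confirm where the hard-coded name now points
  readlink "$LIBDIR/TensileLibrary.dat"
fi
```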

At this point I'm pretty stumped. torchtune is 100% PyTorch, and the 7900 XTX and W7900 should have full support, so I'm not sure whether I'm hitting a configuration error or an implementation bug.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

/opt/rocm/bin/rocminfo --support
ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.13
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 5600G with Radeon Graphics
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 5600G with Radeon Graphics
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   4464
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    65619828(0x3e94774) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65619828(0x3e94774) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    65619828(0x3e94774) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-e282895b62c2b295
  Marketing Name:          AMD Radeon PRO W7900
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29768(0x7448)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1760
  BDFID:                   768
  Internal Node ID:        1
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 202
  SDMA engine uCode::      20
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    47169536(0x2cfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    47169536(0x2cfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1100
  Uuid:                    GPU-cc4d02090dc9c3ff
  Marketing Name:          Radeon RX 7900 XTX
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      98304(0x18000) KB
  Chip ID:                 29772(0x744c)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2371
  BDFID:                   2560
  Internal Node ID:        2
  Compute Unit:            96
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 202
  SDMA engine uCode::      20
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

Additional Information

No response

lhl commented Jun 14, 2024

btw, I am able to run hipblaslt-test and hipblaslt-bench, so I know my hipBLASLt build is working:

hipblaslt-bench --sizem 128 --sizen 128 --sizek 128 --precision f16_r --transA N --transB N --alpha 1 --beta 0 --compute_type f32_r --algo_method heuristic --iters 100 --device 1 --verify
hipBLASLt version: 800
hipBLASLt git version: 56aab12f-dirty
Query device success: there are 2 devices
-------------------------------------------------------------------------------
Device ID 0 : AMD Radeon PRO W7900 gfx1100
with 48.3 GB memory, max. SCLK 1760 MHz, max. MCLK 1124 MHz, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------
Device ID 1 : Radeon RX 7900 XTX gfx1100
with 25.8 GB memory, max. SCLK 2371 MHz, max. MCLK 1249 MHz, compute capability 11.0
maxGridDimX 2147483647, sharedMemPerBlock 65.5 KB, maxThreadsPerBlock 1024, warpSize 32
-------------------------------------------------------------------------------

Is supported 1 / Total solutions: 1
[0]transA,transB,grouped_gemm,batch_count,m,n,k,alpha,lda,stride_a,beta,ldb,stride_b,ldc,stride_c,ldd,stride_d,d_type,compute_type,activation_type,bias_vector,hipblaslt-Gflops,us,CPU-Gflops,CPU-us,norm_error_1
    N,N,0,1,128,128,128,1,128,16384,0,128,16384,128,16384,128,16384,f16_r,f32_r,none,0, 341.556, 12.28,5.71431,734,1.60027e-05


lhl commented Jun 17, 2024

FYI, I tracked down the issue: PyTorch (I use the 6.1 nightly) ships its own libhipblaslt.so, which overrides the system copy and causes these problems. If you compile torchtune from source (which you need for the latest features like sample packing), it will also potentially pull in its own copy of hipBLASLt that you need to remove or replace.
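One way to check which libhipblaslt.so a process actually loaded is to inspect its memory maps after importing torch. This is a sketch of that approach, not necessarily how lhl tracked it down (the comment doesn't say); it assumes Linux and reads /proc/self/maps:

```python
# Sketch (assumes Linux): list shared objects mapped into the current
# process by parsing /proc/self/maps. Called after `import torch`, this
# shows whether the libhipblaslt.so under site-packages/torch/lib or the
# one under /opt/rocm was actually loaded.
def loaded_libs(substring=""):
    """Return sorted paths of mapped files whose path contains `substring`."""
    paths = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            # maps format: addr perms offset dev inode [pathname]
            parts = line.split(maxsplit=5)
            if len(parts) == 6 and parts[5].startswith("/"):
                path = parts[5].strip()
                if substring in path:
                    paths.add(path)
    return sorted(paths)

if __name__ == "__main__":
    # e.g. run `import torch` first, then: loaded_libs("hipblaslt")
    for p in loaded_libs(".so"):
        print(p)
```

`ldd` on torch's extension modules, or `LD_DEBUG=libs python -c "import torch"`, are alternative ways to see which copy the dynamic linker resolves.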

lhl closed this as completed Jun 17, 2024
@minzhezhou

FYI, I tracked down the issue: PyTorch (I use the 6.1 nightly) ships its own libhipblaslt.so, which overrides the system copy and causes these problems. If you compile torchtune from source (which you need for the latest features like sample packing), it will also potentially pull in its own copy of hipBLASLt that you need to remove or replace.

I copied the self-built .hsaco files into the Tensile library path and got a similar error, but I didn't get a core dump. How did you trace it back to the libhipblaslt.so from torch?
