`cblas_sbgemm` extremely slow on WSL & AMD CPU

Hello, I'm trying to run `cblas_sbgemm` on **WSL** & AMD CPU but find it extremely slow, e.g. 30x slower than `cblas_sgemm`. Anyone knows how to debug and solve this issue?

Related issue (run on **Anaconda Prompt**): https://github.com/OpenMathLib/OpenBLAS/issues/4672

### System info:
```
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 7640HS w/ Radeon 760M Graphics
    CPU family:          25
    Model:               116
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            8583.32
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse ss
                         e2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nons
                         top_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx
                          f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch o
                         svw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid
                          avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl
                         xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_
                         clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_
                         vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
Virtualization features:
  Virtualization:        AMD-V
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    6 MiB (6 instances)
  L3:                    16 MiB (1 instance)
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Mitigation; safe RET, no microcode
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS
                         Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

### Build command:
```
cmake -DBUILD_BFLOAT16=ON -DBUILD_WITHOUT_LAPACK=yes -DNOFORTRAN=1 ..
make -j16
make install
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`cblas_sbgemm` extremely slow on WSL & AMD CPU #4673

System info:

Build command:

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

cblas_sbgemm extremely slow on WSL & AMD CPU #4673

Description

System info:

Build command:

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`cblas_sbgemm` extremely slow on WSL & AMD CPU #4673