MKL FP16 GEMM crash on MTL iGPU

# Summary

I found on MTL iGPU, if I call FP16 gemm of onemkl (no matter using OneAPI 2024.0 or 2024.2), the program will crash, and if I call it many times, it will cause my machine to freeze directly.
However, on ARC, everything is fine.

# Version
oneAPI 2024.0 or oneAPI 2024.2 .

# Environment

* minimal c++ program
* Windows 11
* icx 2024.0.2 or 2024.2
* Hardware: Intel Core Ultra iGPU

# Steps to reproduce
```c++
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <iostream>
using namespace sycl;
int main() {

   queue q{gpu_selector_v};
   std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;

   const int M = 1024;
   const int N = 11008;
   const int K = 4096;

   float* A_h = new float[M * K];
   float* B_h = new float[K * N];
   float* C_h = new float[M * N];
   // random
   for (int i = 0; i < M * K; ++i) {
       A_h[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
   }
   for (int i = 0; i < K * N; ++i) {
       B_h[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
   }
   // convert input to half
   sycl::half* A_h_half = new sycl::half[M * K];
   sycl::half* B_h_half = new sycl::half[K * N];
   for (int i = 0; i < M * K; ++i) {
       A_h_half[i] = sycl::half(A_h[i]);
   }
   for (int i = 0; i < K * N; ++i) {
       B_h_half[i] = sycl::half(B_h[i]);
   }

   buffer<sycl::half> A(A_h_half, M * K);
   buffer<sycl::half> B(B_h_half, K * N);
   buffer<float> C(C_h, M * N);
   // Use OneMKL to do GEMM
   {
       q.submit([&](handler &h) {
           sycl::accessor A_acc(A, h, sycl::write_only, sycl::no_init);
           sycl::accessor B_acc(B, h, sycl::write_only, sycl::no_init);
           sycl::accessor C_acc(C, h, sycl::write_only, sycl::no_init);
           oneapi::mkl::blas::row_major::gemm(
                q,
                oneapi::mkl::transpose::nontrans,
                oneapi::mkl::transpose::trans,
                M, N, K,
                1.0f, A_acc.get_pointer(), K,
                B_acc.get_pointer(), K,
                0.0f, C_acc.get_pointer(), N);
       }).wait();
   }

   delete[] A_h;
   delete[] B_h;
   delete[] C_h;
   delete[] A_h_half;
   delete[] B_h_half;

   printf("run success!\n");
   return 0;
}
```
running above script with below command:
```bash
# for linux
source /opt/intel/oneapi/setvars.sh
icpx -std=c++17 -fsycl -fopenmp -lpthread -l mkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lmkl_sycl_blas -lmkl_intel_ilp64 -lmkl_tbb_thread -o gemm_fp16 gemm_fp16.cpp

# for windows
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" --force
icx -std=c++17 -fsycl -fopenmp mkl_sycl_blas_dll.lib mkl_intel_ilp64_dll.lib mkl_sequential_dll.lib mkl_core_dll.lib -o gemm_fp16 gemm_fp16.cpp
```

# Observed behavior
If I run above command on Linux Arc A770, it works fine:
![image](https://github.com/oneapi-src/oneMKL/assets/105281011/5a9192fd-70be-4280-bf00-02f788208e61)

If I run above command on Windows MTL iGPU, it fails and even cause a black screen:
![image](https://github.com/oneapi-src/oneMKL/assets/105281011/f1417887-91fe-43fc-a27b-abeb76688a77)

# Expected behavior

I hope above FP16 GEMM can work for MTL iGPU. Thanks!


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MKL FP16 GEMM crash on MTL iGPU #524

Summary

Version

Environment

Steps to reproduce

Observed behavior

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

MKL FP16 GEMM crash on MTL iGPU #524

Description

Summary

Version

Environment

Steps to reproduce

Observed behavior

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions