PVC support #406
Comments
oneMKL issue:
|
I have a similar issue in #411 with the routine
|
Couple of breadcrumbs on the MKL PVC issue (also serving as documentation for myself). @kballeda tried oneAPI 2024.1.0, but that gave the same error. I tried building the support library with the local compiler on IDC:
withenv("PATH"=>"$(ENV["PATH"]):/opt/intel/oneapi/2024.0/bin/",
"LD_LIBRARY_PATH"=>"/opt/intel/oneapi/2024.0/lib") do
cmake() do cmake_path
ninja() do ninja_path
run(```$cmake_path -DCMAKE_CXX_COMPILER="icpx"
-DCMAKE_CXX_FLAGS="-fsycl -isystem /opt/intel/oneapi/2024.0/include -isystem $include_dir"
-DCMAKE_SHARED_LINKER_FLAGS="-L/opt/intel/oneapi/2024.0/lib"
-DCMAKE_INSTALL_RPATH="/opt/intel/oneapi/2024.0/lib"
-DCMAKE_INSTALL_PREFIX=$install_dir
-GNinja -S $(@__DIR__) -B $build_dir```)
run(`$ninja_path -C $(build_dir) install`)
end
end
end
That resulted in the same error. So the issue is probably not with the oneAPI distribution on Conda. I then wrote a C program that links against ze_loader and oneapi_support, doing everything that oneAPI.jl does:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "level_zero/ze_api.h"
#include "deps/src/onemkl.h"
#include "deps/src/sycl.h"
int main() {
ze_result_t result;
ze_driver_handle_t driver = 0;
ze_device_handle_t device = 0;
ze_context_handle_t context = 0;
ze_command_queue_handle_t queue = 0;
// Initialize oneAPI Level Zero
result = zeInit(0);
assert(result == ZE_RESULT_SUCCESS);
// Initialize the driver
uint32_t driver_count = 0;
result = zeDriverGet(&driver_count, NULL);
assert(result == ZE_RESULT_SUCCESS && driver_count > 0);
result = zeDriverGet(&driver_count, &driver);
assert(result == ZE_RESULT_SUCCESS);
// Create a context
ze_context_desc_t context_desc = {
.stype = ZE_STRUCTURE_TYPE_CONTEXT_DESC,
.pNext = NULL,
.flags = 0
};
result = zeContextCreate(driver, &context_desc, &context);
assert(result == ZE_RESULT_SUCCESS);
// Get a device handle
uint32_t device_count = 0;
result = zeDeviceGet(driver, &device_count, NULL);
assert(result == ZE_RESULT_SUCCESS);
ze_device_handle_t* devices = (ze_device_handle_t*)malloc(device_count * sizeof(ze_device_handle_t));
result = zeDeviceGet(driver, &device_count, devices);
assert(result == ZE_RESULT_SUCCESS);
device = devices[0];
// Create a command queue
ze_command_queue_desc_t queue_desc = {
.stype = ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC,
.pNext = NULL,
.ordinal = 0,
.mode = ZE_COMMAND_QUEUE_MODE_DEFAULT,
.priority = ZE_COMMAND_QUEUE_PRIORITY_NORMAL,
.flags = 0
};
result = zeCommandQueueCreate(context, device, &queue_desc, &queue);
assert(result == ZE_RESULT_SUCCESS);
// Allocate memory
int m = 10, n = 10, k = 10;
float *A, *B, *C;
ze_device_mem_alloc_desc_t alloc_desc = {
.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC,
.pNext = NULL,
.flags = 0,
.ordinal = 0
};
result = zeMemAllocDevice(context, &alloc_desc, m * k * sizeof(float), 1, device, (void**)&A);
assert(result == ZE_RESULT_SUCCESS);
result = zeMemAllocDevice(context, &alloc_desc, k * n * sizeof(float), 1, device, (void**)&B);
assert(result == ZE_RESULT_SUCCESS);
result = zeMemAllocDevice(context, &alloc_desc, m * n * sizeof(float), 1, device, (void**)&C);
assert(result == ZE_RESULT_SUCCESS);
// Create SYCL objects
syclPlatform_t sycl_platform = 0;
syclDevice_t sycl_device = 0;
syclContext_t sycl_context = 0;
syclQueue_t sycl_queue = 0;
syclPlatformCreate(&sycl_platform, driver);
syclDeviceCreate(&sycl_device, sycl_platform, device);
syclContextCreate(&sycl_context, &sycl_device, 1, context, 1);
syclQueueCreate(&sycl_queue, sycl_context, sycl_device, queue, 1);
// Call MKL's SGEMM function
float alpha = 1.0;
float beta = 0.0;
onemklSgemm(sycl_queue, ONEMKL_TRANSPOSE_NONTRANS, ONEMKL_TRANSPOSE_NONTRANS,
m, n, k, alpha, A, m, B, k, beta, C, m);
return 0;
}
Compile and execute:
Here, I'm linking against a locally built
Interestingly, this C program works successfully with both versions of the support library. That confirms again that the toolchain isn't to blame, and also that the library as packaged on Yggdrasil seems to work correctly. One thing that's different here from regular Julia execution is the fact that
Also, I tried to verify this hypothesis by loading the C loader with our builds of NEO/ze_loader/IGC/gmmlib instead, by setting LD_LIBRARY_PATH to the artifact_dirs: oneAPI.oneL0.oneAPI_Level_Zero_Loader_jll.artifact_dir
"/home/sdp/.julia/artifacts/07d2a0b1b466f4d6fab3f80843bd68cb0036c027
oneAPI.oneL0.NEO_jll.artifact_dir
"/home/sdp/.julia/artifacts/2a52b1197a324e3df923175b2035c42899f069f2
oneAPI.oneL0.NEO_jll.libigc_jll.artifact_dir
"/home/sdp/.julia/artifacts/1fad9b4961d944e4422a7f63a4d2a65421e4e126
oneAPI.oneL0.NEO_jll.gmmlib_jll.artifact_dir
"/home/sdp/.julia/artifacts/be9d1cd776269d571d16522f35ed5c6af4309a4b
And this works just fine... A couple libraries, like
I'm running out of ideas at this point. What complicates all this is the plugin systems involved (which I don't fully understand). For all I know, they could decide not to load PVC support and have MKL fall back to the (aborting) code path instead, but without a way to reproduce that outside of Julia it's going to be hard to debug this. |
About that... Line 89 in f20d471
Okay, that's enough for today. Now that I've uncovered the root issue, it should hopefully be easier to resolve this. |
I remember that we added the OCL_ICD_VENDORS environment variable to direct libOpenCL.so to load from Julia's artifact repo instead of using the system's libigdrcl.so. It was to fix a problem with the system libigdrcl.so being incompatible with what oneMKL was built with. It is strange that the libigdrcl.so in Conda is not compatible with oneMKL. I am confused. |
In your LD_DEBUG=libs trace, the C++ execution is using the system's oneAPI 2024 libOpenCL.so where the Julia run uses the one from the Conda artifact: C++: Julia:
If we remove the OCL_ICD_VENDORS environment setting from the Julia package, Julia will always pick the system's libOpenCL.so. This will require the users to install a compatible OpenCL library on their systems. This is undesirable but if it is the only way to make it work, we will just have to do it and warn the users. |
Here is a link to how OCL_ICD_VENDORS affects the loading of OpenCL library: https://manpages.ubuntu.com/manpages/trusty/en/man7/libOpenCL.so.7.html |
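Since the discussion hinges on which values of OCL_ICD_VENDORS and LD_LIBRARY_PATH a process actually sees, a small helper in the C reproducer can print them before any SYCL or oneMKL call. This is only a sketch: the function names `print_path_list` and `report_loader_env` are hypothetical and not part of oneAPI.jl or the support library, and the choice of variables to print is taken from this thread rather than from any authoritative list.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print each entry of a colon-separated path list on its own line.
 * Returns the number of entries (0 for a NULL or empty list).
 * Pass out == NULL to only count. Hypothetical helper for illustration. */
int print_path_list(const char *list, FILE *out) {
    int count = 0;
    if (!list || !*list) return 0;
    const char *start = list;
    for (;;) {
        const char *end = strchr(start, ':');
        size_t len = end ? (size_t)(end - start) : strlen(start);
        if (out) fprintf(out, "  [%d] %.*s\n", count, (int)len, start);
        count++;
        if (!end) break;
        start = end + 1;
    }
    return count;
}

/* Report the loader-related variables mentioned in this thread.
 * Hypothetical helper; the variable selection is an assumption. */
void report_loader_env(FILE *out) {
    const char *vendors = getenv("OCL_ICD_VENDORS");
    const char *libpath = getenv("LD_LIBRARY_PATH");
    fprintf(out, "OCL_ICD_VENDORS=%s\n", vendors ? vendors : "(unset)");
    fprintf(out, "LD_LIBRARY_PATH:\n");
    if (print_path_list(libpath, out) == 0)
        fprintf(out, "  (unset or empty)\n");
}
```

Calling `report_loader_env(stderr);` at the top of `main` in the reproducer would make it easier to compare the environment of the standalone run with the one set up through `withenv` from Julia.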
Going back to our Level Zero driver build from Intel compute-runtime, have we added the cmake option -DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE? If not, it could be the problem. |
Never mind, the cmake build flag is set to TRUE according to the NEO build log:
[11:39:08] ---> CMAKE_FLAGS+=(-DNEO_ENABLE_i915_PRELIM_DETECTION=TRUE)
So the Level Zero driver should be good. |
That doesn't matter though; the C++ reproducer above also fails when using the libOpenCL.so from Conda (which is the one we redistribute as part of the oneAPI support library package).
It isn't only undesirable, it also won't work, because the problem seems to lie with |
FWIW, here's the easier way to reproduce the issue by compiling and executing the C++ MWE from above using all our libraries:
julia --project -e '
using oneAPI
run(`gcc wip.c -g -o wip -L$(oneAPI.oneL0.oneAPI_Level_Zero_Loader_jll.artifact_dir)/lib -L$(oneAPI.Support.oneAPI_Support_jll.artifact_dir)/lib -lze_loader -loneapi_support`)
withenv("LD_LIBRARY_PATH" => "$(oneAPI.oneL0.oneAPI_Level_Zero_Loader_jll.artifact_dir)/lib:$(oneAPI.Support.oneAPI_Support_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.libigc_jll.artifact_dir)/lib:$(oneAPI.oneL0.NEO_jll.gmmlib_jll.artifact_dir)/lib",
"OCL_ICD_VENDORS" => "$(oneAPI.oneL0.NEO_jll.libigdrcl)") do
run(`./wip`)
end'
Adding
Given that the PVC platform somehow isn't loaded/supported/whatever, I figured that this may be visible using
Crucially though, removing the
So it may be an issue with the Conda-provided libOpenCL.so after all? Or oneAPI_Support_jll is somehow breaking PVC OpenCL in another way. Here's the |
One of the strange things about the Conda libOpenCL we ship is that it has way more dependencies than I would expect:
vs. the system one, having none:
@pengtu Do you know why the OpenCL ICD loader we pull from the |
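To check from inside the reproducer which copy of libOpenCL.so (and which of its dependencies) actually ends up mapped into the process, one option on Linux is to scan `/proc/self/maps` after the oneMKL call. The sketch below assumes a Linux-style maps file; `count_mappings` is a hypothetical helper name, and taking the file path as a parameter is only there so the function can also be exercised against a saved copy of a maps file.

```c
#include <stdio.h>
#include <string.h>

/* Count the mappings in a /proc/<pid>/maps-style file whose backing path
 * contains `needle`. If `out` is non-NULL, each matching path is printed
 * once per run of consecutive identical paths.
 * Returns -1 if the file cannot be opened. Hypothetical helper. */
int count_mappings(const char *maps_path, const char *needle, FILE *out) {
    FILE *maps = fopen(maps_path, "r");
    if (!maps) return -1;

    char line[4096];
    char last[4096] = "";
    int count = 0;
    while (fgets(line, sizeof line, maps)) {
        line[strcspn(line, "\n")] = '\0';
        /* The backing file path is the only field that contains '/'. */
        const char *path = strchr(line, '/');
        if (!path || !strstr(path, needle)) continue;
        count++;
        if (out && strcmp(path, last) != 0) {
            fprintf(out, "%s\n", path);
            strcpy(last, path);  /* safe: path lies within line[4096] */
        }
    }
    fclose(maps);
    return count;
}
```

For example, `count_mappings("/proc/self/maps", "libOpenCL", stderr);` placed after the `onemklSgemm` call would print the path of the loaded ICD loader, which could then be compared between the run that uses the system library and the run that uses the artifact directories.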
Summary of current status:
@pengtu Can you figure out what the deal is here? As far as I can tell the generic ICD loader works fine wrt. both PVC support and the SYCL Plugin Interface, and it seems like oneMKL is doing strange things with libOpenCL.so that are hard/impossible to debug. |
This issue is to keep track of oneAPI.jl support for PVC hardware.
Remaining issues:
libze_tracing_layer
to avoid conflict on IDC