MKL upgrade breaks support library #357

Closed
kballeda opened this issue Aug 30, 2023 · 18 comments · Fixed by #386

Comments

@kballeda
Contributor

I encountered the following error when running the onemkl.jl tests. The error goes away if I set up oneAPI locally.
Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'
It started occurring with 2023.2 / Julia 1.9.3.

The same failure shows up in the CI runs for open PRs:

https://buildkite.com/julialang/oneapi-dot-jl/builds/773

@maleadt changed the title from "Runtime error related to MKL" to "MKL upgrade breaks support library" on Sep 13, 2023
@maleadt mentioned this issue on Sep 13, 2023
@maleadt
Member

maleadt commented Sep 13, 2023

This is caused by the MKL upgrade on Conda. Maybe you're in a better position to figure out what's up here?
I've blocked the upgrade for now, see #361. The binaries we redistribute are already pinned to 2023.0.0, but we also copy more libraries so maybe those binaries wouldn't suffer from this: https://github.com/JuliaPackaging/Yggdrasil/blob/45ce878a4764c4b0478386c5b48dadcfa20a3323/O/oneAPI_Support/build_tarballs.jl#L82-L85

@kballeda
Contributor Author

kballeda commented Sep 14, 2023

@maleadt: This issue occurs because the loader fails to find libOpenCL within the Conda packages for local builds.

   1413938:     calling init: /home/kali/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_opencl.so
   1413938:
   1413938:     find library=libOpenCL.so [0]; searching
   1413938:      search cache=/etc/ld.so.cache
   1413938:      search path=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/lib:/usr/lib              (system search path)
   1413938:       trying file=/lib/x86_64-linux-gnu/libOpenCL.so
   1413938:       trying file=/usr/lib/x86_64-linux-gnu/libOpenCL.so
   1413938:       trying file=/lib/libOpenCL.so
   1413938:       trying file=/usr/lib/libOpenCL.so
   1413938:
Intel MKL FATAL ERROR: Error on loading function 'clGetPlatformIDs'.

Ideally it should be looking for it here: /home/kali/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/
If we set LD_LIBRARY_PATH to that directory, the test passes.

By default libpi_opencl.so resolves libOpenCL from the system path, so the search path must be set correctly. Please take a look.

:~/Kali/oneAPI.jl$ ldd ~/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_opencl.so
        linux-vdso.so.1 (0x00007ffdb1fac000)
        libOpenCL.so.1 => /lib/x86_64-linux-gnu/libOpenCL.so.1 (0x00007f9642b65000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9642b5f000)
        libsvml.so => not found
        libirng.so => not found
        libimf.so => not found
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9642a10000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f96429e9000)
        libintlc.so.5 => not found
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f96429c6000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f96427d4000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9642e49000)
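
The LD_LIBRARY_PATH workaround described in this comment can be sketched as a pair of small shell helpers. This is only an illustration: `conda_libdir` and `with_conda_libs` are hypothetical names, the default depot of `~/.julia` is an assumption, and the UUID is simply the scratchspace directory name that appears in the paths above.

```shell
# Scratchspace directory name, copied from the paths in the log above.
ONEAPI_UUID="8f75cd03-7ff8-4ecb-9b8f-daf728133b1b"

# conda_libdir [DEPOT]: print the Conda lib directory inside the scratchspace.
# Assumption: the Julia depot is ~/.julia unless another path is passed in.
conda_libdir() {
    local depot="${1:-$HOME/.julia}"
    printf '%s\n' "$depot/scratchspaces/$ONEAPI_UUID/conda/lib"
}

# with_conda_libs DEPOT CMD...: run CMD with the Conda lib directory
# prepended to LD_LIBRARY_PATH, for that one command only.
with_conda_libs() {
    local libdir
    libdir="$(conda_libdir "$1")"
    shift
    LD_LIBRARY_PATH="$libdir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}
```

A local test run might then look something like `with_conda_libs "$HOME/.julia" julia --project -e 'using Pkg; Pkg.test()'`, assuming the tests are started through `Pkg.test()`. Scoping the variable to a single command avoids changing the search path for the rest of the shell session.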

@kballeda
Contributor Author

@maleadt: How can we get the search path to point into the Conda package under the Julia scratchspaces directory, like before? What caused this behavioral change?

@maleadt
Member

maleadt commented Sep 29, 2023

The change is in the binaries from Conda; I didn't change anything over here. Presumably the OpenCL dependency didn't exist before, or at least it didn't trigger with the functionality we're executing here.

The solution is probably to copy the OpenCL library into the scratch space. We already do this for the JLLs we build on Yggdrasil.

@kballeda
Contributor Author

kballeda commented Oct 10, 2023

@maleadt: We find that the OpenCL library is already in the scratch space in the Conda distribution, but during execution of oneAPI.jl it is loaded from the local system install. We need it to be loaded dynamically from the scratch space instead. How can we do this?

@maleadt
Member

maleadt commented Oct 10, 2023

I'd try to copy it to the location of the support library instead, like we do on Yggdrasil: https://github.com/JuliaPackaging/Yggdrasil/blob/9cfb94ebb0f33b9128f49aea81d9bc6cb9a0d9fc/O/oneAPI_Support/build_tarballs.jl#L75-L85

@kballeda
Contributor Author

@maleadt: We need to provide the OpenCL library as part of the oneAPI.jl artifacts, and it must be initialized during the instantiate process like the other libraries. When libpi_opencl is called for OpenCL initialization as part of build_local(), it must pick up the libOpenCL downloaded from the artifacts.

@maleadt
Member

maleadt commented Oct 19, 2023

The generated library has a correct RPATH, so I don't see why it shouldn't be able to find libOpenCL:

❯ ldd /home/tim/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so
	libOpenCL.so.1 => /home/tim/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libOpenCL.so.1 (0x00007f48e6aa1000)

This looks like a design flaw with the plugin infrastructure. We cannot change the system-wide dynamic library search path, nor can we set LD_LIBRARY_PATH. I guess we could pre-load libOpenCL.so though.
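
The `ldd` output above shows where a dependency ends up resolving; the search path a library itself carries can be read from its dynamic section. A minimal sketch, assuming `readelf` from binutils is installed; `show_runpath` is a hypothetical helper name, not something oneAPI.jl provides.

```shell
# show_runpath LIB: print the RPATH/RUNPATH entries recorded in LIB's
# dynamic section. Prints nothing and returns non-zero if there are none.
show_runpath() {
    readelf -d "$1" | grep -E '\((RPATH|RUNPATH)\)'
}
```

Running it on both liboneapi_support.so and libpi_opencl.so would show which of the two carries a search path. Per the ld.so man page, DT_RUNPATH directories are only searched for an object's own DT_NEEDED entries, so a path set on one library is not necessarily used for lookups made by another.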

@kballeda
Contributor Author

> The generated library has a correct RPATH, so I don't see why it shouldn't be able to find libOpenCL:
>
> ❯ ldd /home/tim/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so
> 	libOpenCL.so.1 => /home/tim/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libOpenCL.so.1 (0x00007f48e6aa1000)
>
> This looks like a design flaw with the plugin infrastructure. We cannot change the system-wide dynamic library search path, nor can we set LD_LIBRARY_PATH. I guess we could pre-load libOpenCL.so though.

How about libpi_opencl.so in your case? As in my comment above, it points to the local system directory. #357 (comment)

@maleadt
Member

maleadt commented Oct 19, 2023

In fact, our RPATH works just fine, as the process does open libOpenCL from the conda libdir:

   3562494:	calling init: /home/tim/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libOpenCL.so.1

   3562494:	calling init: /home/tim/Julia/depot/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libpi_opencl.so
   3562494:
   3562494:	find library=libOpenCL.so [0]; searching
   3562494:	 search cache=/etc/ld.so.cache
   3562494:	 search path=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/lib:/usr/lib		(system search path)
   3562494:	  trying file=/lib/x86_64-linux-gnu/libOpenCL.so
   3562494:	  trying file=/usr/lib/x86_64-linux-gnu/libOpenCL.so
   3562494:	  trying file=/lib/libOpenCL.so
   3562494:	  trying file=/usr/lib/libOpenCL.so

So switching to OpenCL_jll doesn't help, as expected:

   3564707:	calling init: /home/tim/.julia/artifacts/f7bfc0eff76d18282497107dfa410e2ee35c536b/lib/libOpenCL.so

   3564707:	find library=libOpenCL.so [0]; searching
   3564707:	 search cache=/etc/ld.so.cache
   3564707:	 search path=/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/lib:/usr/lib		(system search path)
   3564707:	  trying file=/lib/x86_64-linux-gnu/libOpenCL.so
   3564707:	  trying file=/usr/lib/x86_64-linux-gnu/libOpenCL.so
   3564707:	  trying file=/lib/libOpenCL.so
   3564707:	  trying file=/usr/lib/libOpenCL.so

The fact that libpi_opencl decides to do another look for libOpenCL, ignoring the already-loaded version and the RPATH configured on our library, seems like a bug in MKL.
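
The traces above have the shape of the output that glibc's dynamic loader prints when `LD_DEBUG=libs` is set. A minimal sketch for capturing only the lines about one library; `trace_lib_lookup` is a hypothetical helper, and the Julia invocation in the note below is an assumption about how the tests are started.

```shell
# trace_lib_lookup NAME CMD...: run CMD with glibc loader tracing enabled
# and keep only the lines that mention NAME (case-insensitive, literal match).
# The loader writes its trace to stderr, so stderr is merged into the pipe.
trace_lib_lookup() {
    local name="$1"
    shift
    LD_DEBUG=libs "$@" 2>&1 | grep -iF -- "$name"
}
```

Something like `trace_lib_lookup opencl julia --project -e 'using Pkg; Pkg.test()'` should then list the `calling init:`, `find library=` and `trying file=` lines for the OpenCL libraries, which makes it easier to see which copy is initialized and where the second lookup searches.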

@maleadt
Member

maleadt commented Oct 19, 2023

> How about libpi_opencl.so in your case? As in my comment above, it points to the local system directory. #357 (comment)

libpi_opencl.so's RPATH doesn't matter here. That's a Conda-provided library, we shouldn't have to change it. libOpenCL.so is already loaded, so AFAIU the plugin infrastructure shouldn't be trying to load another copy.

@pengtu
Contributor

pengtu commented Oct 19, 2023

@maleadt: Thank you for the analysis and explanation. We will file a bug report against the libpi_opencl.so library to get this problem debugged and fixed.

@amontoison
Member

Can we temporarily use the LD_LIBRARY_PATH trick suggested by @kballeda to test the PRs in continuous integration?

@maleadt
Member

maleadt commented Oct 20, 2023

Although we can do this for CI, we can't do this with JLLs. Previously, the bug surfaced by the MKL update didn't trigger with JLLs because we're using an older MKL in Yggdrasil. But in this PR you're also running into the OpenCL issue with the downgraded MKL, so I think we would run into this with our JLLs too (and again, we can't use the LD_LIBRARY_PATH hack there).

@amontoison
Member

OK, I see. We can only wait for an upstream fix from Intel.

@kballeda
Contributor Author

#386
This is a bug on the MKL side, and it is resolved from oneAPI 2024 onwards.
However, we still need to hard-code the version to 2024.0.0, as the latest 2024.0.2 patch is incomplete.

@maleadt
Member

maleadt commented Feb 26, 2024

> However, we still need to hard-code the version to 2024.0.0, as the latest 2024.0.2 patch is incomplete.

You mean it's broken in other ways, or not released yet?

@maleadt linked a pull request that will close this issue on Feb 26, 2024
@kballeda
Contributor Author

kballeda commented Feb 26, 2024

> > However, we still need to hard-code the version to 2024.0.0, as the latest 2024.0.2 patch is incomplete.
>
> You mean it's broken in other ways, or not released yet?

They released it, but the Conda packages for 2024.0.2 are missing some of the LLVM components.
I am following up with the Intel Conda packages team internally on that.


4 participants