-
Notifications
You must be signed in to change notification settings - Fork 170
[BLAS] SYCL-Graph integration for native-command #669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BLAS] SYCL-Graph integration for native-command #669
Conversation
Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` I see crashes from 3 operations: 1) `-o MUL_MAT`: Issue arising from recording of oneMath `ext_codeplay_enqueue_native_command`. 2) `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187, can these wait calls just be removed? 3) `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 , host work could be wrapped in a host-task? For 1) I have come up with a oneMath fix in uxlfoundation/oneMath#669, I've put a provisional git tag to pull in this PR for testing, but will update to the upstream commit once merged. For 2 & 3) we've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458) method for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.
Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` I see crashes from 3 operations: 1) `-o MUL_MAT`: Issue arising from recording of oneMath `ext_codeplay_enqueue_native_command`. 2) `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187, can these wait calls just be removed? 3) `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 , host work could be wrapped in a host-task? For 1) I have come up with a oneMath fix in uxlfoundation/oneMath#669, I've put a provisional git tag to pull in this PR for testing, but will update to the upstream commit once merged. For 2 & 3) we've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458) method for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.
Currently on a CUDA backend to SYCL when running `GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0` I see crashes from 3 operations: 1) `-o MUL_MAT`: Issue arising from recording of oneMath `ext_codeplay_enqueue_native_command`. 2) `-o CONCAT` : Use of blocking waits on a queue that's being recorded https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/concat.cpp#L185-L187, can these wait calls just be removed? 3) `-o MUL_MAT_ID`: Blocking wait on a recording queue for a copy to host memory https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-sycl/ggml-sycl.cpp#L3072-L3074 , host work could be wrapped in a host-task? For 1) I have come up with a oneMath fix in uxlfoundation/oneMath#669, I've put a provisional git tag to pull in this PR for testing, but will update to the upstream commit once merged. For 2 & 3) we've noticed that `ggml-cuda.cu` has the [check_node_graph_compatibility_and_refresh_copy_ops](https://github.com/ggml-org/llama.cpp/blob/39e73ae0d69f882d7e29cecc6dd8f5052fca6731/ggml/src/ggml-cuda/ggml-cuda.cu#L2458-L2458) method for checking if a graph can be used, even if enabled. I've taken a similar approach in this PR by adding a method to `ggml-sycl.cpp` for checking if a graph can be used for the operations even if a user has asked for it to be enabled.
671d9bc
to
3c06934
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To answer your questions:
- Yes we need to ensure this can build with the latest public oneAPI release
- Yes it would be best to have a test with SYCL-Graph if this is something we want to support. I'm thinking it could be a separate test file. I don't think we would need to test every operation, some sort of example using a single oneMath operation could be enough?
Let me know if you think you will still need this oneMath change. The llama PR using oneDNN looks promising and if it works well we could remove the dependency on oneMath.
a332a53
to
8d153ca
Compare
32d9344
to
6e3c97c
Compare
6e3c97c
to
5b9dbfc
Compare
5b9dbfc
to
7a12cf4
Compare
In order to support applications calling the library with a sycl queue recording to a SYCL-Graph, check if the `ext_codeplay_enqueue_native_command` command-group is being recorded to a graph object. If so use the native stream recording APIs to add the blas calls as nodes in the graph. In particular this fixes the llama.cpp unit test `MUL_MAT(type_a=f16,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[2,1],per=[0,1,2,3],v=0)` on CUDA with SYCL-Graph enabled. Previously this would throw an error: ```sh $ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2 UR CUDA ERROR: Value: 700 Name: CUDA_ERROR_ILLEGAL_ADDRESS Description: an illegal memory access was encountered Function: operator() Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154 Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN) Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator() SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code! in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598 $HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf ptrace: Operation not permitted. No stack. The program is not being run. ```
Create SYCL-graph extension specific tests for blas in `tests/unit_tests/blas/sycl-graph`. Currently only covering `gemm_usm` and `gemm_batch_usm` These are stubbed out for the CT tests variants, and when the SYCL compiler doesn't support the `sycl_ext_oneapi_graph` extension.
7a12cf4
to
aec2ab1
Compare
@andrewtbarker Would you be able to do the other review on this PR? Picking on you since you've reviewed my other PRs, but if there are other codeowners that could do the review then feel free to tag them instead. |
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks good, thanks especially for the nice comments. I have one question for my own understanding but it doesn't block merging.
I think the internal CI is down for some time. @EwanC would you be able to attach a log of the tests running on Nvidia HW? That's the best alternative. |
Unfortunately the results I see locally are non-deterministic and I see fails (which vary in number on each run) on the develop branch and this feature branch. I've inserted the logs below however. Tip DPC++ with CUDA 12.8 used for below testing$ git log | head
commit 4b3ed9b13632bc31eb35c34a3ef37e17c80ffe05
Author: Wu Yingcong <yingcong.wu@intel.com>
Date: Thu Jun 12 16:21:08 2025 +0800
[DeviceSanitizer] Remove not needed test fix. (#18946)
The fix is no longer need and not working at the moment since it does
not come with `%{run}`.
$ ./bin/sycl-ls
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce GT 1030 6.1 [CUDA 12.8] Logs |
Thanks for sharing. Yeah I'm not sure why you see inconsistent tests failing. It may have to do with this device that we're not using for tests usually. |
Update oneMath commit to merged PR uxlfoundation/oneMath#669 which adds SYCL-Graph support for recording CUDA BLAS commands. With this change the `MUL_MAT` tests now pass on DPC++ CUDA backends with SYCL-Graph enabled. Prior to this change, an error would be thrown. ``` $ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2 UR CUDA ERROR: Value: 700 Name: CUDA_ERROR_ILLEGAL_ADDRESS Description: an illegal memory access was encountered Function: operator() Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154 Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN) Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator() SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code! in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598 $HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf ptrace: Operation not permitted. No stack. The program is not being run. ```
In order to support applications calling the library with a sycl queue recording to a SYCL-Graph, check if the
ext_codeplay_enqueue_native_command
command-group is being recorded to a graph object. If so use the native stream recording APIs to add the blas calls as nodes in the graph.In particular this fixes the llama.cpp
MUL_MAT
unit tests on CUDA with SYCL-Graph enabled. Previously this would throw an error:With this patch we can successfully record the blas native-command command-groups used as part of the llama.cpp oneMath calls. In particular USM
gemm
andgemm_batch
which I've extended the tests to verify the correctness of.